L4, Part 9 · 🐣 Difficulty · 🕒 17 min

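Tool Use in Practice: Getting LLMs to Really Use Tools

LLMs can use tools, but getting them to do it reliably, correctly, and cheaply is an engineering craft of its own. This article covers the practice.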

阿莱

2026/8/10

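L4-04 Agent showed what becomes possible when an LLM calls tools. L4-07 MCP provided the "standard interface".

This article is about engineering practice: how to make LLM tool calls reliable, correct, and cheap, down to the details of real production environments.

Tool Use Basics

Mainstream LLMs all have tool calling built in: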

OpenAI Function Calling

import json

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Beijing today?"}],
    tools=tools
)

# The model decides whether to call a tool
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    result = get_weather(args["city"])  # you run your own get_weather implementation
    # Send the result back to the LLM
    ...
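The snippet above stops at the point where the result goes back to the model. As a minimal sketch of that step, assuming the OpenAI chat message format, the helper below appends the assistant's tool request and a matching `tool` message to the conversation. The helper name `append_tool_result`, the call id, and the weather payload are made up for illustration.

```python
import json

def append_tool_result(messages, assistant_message, tool_call_id, result):
    """Return a new message list with the tool request and our result appended."""
    return messages + [
        assistant_message,  # the assistant turn that contains tool_calls
        {
            "role": "tool",
            "tool_call_id": tool_call_id,  # must match the id of the call being answered
            "content": json.dumps(result, ensure_ascii=False),
        },
    ]

# Hypothetical data standing in for a real API response
messages = [{"role": "user", "content": "What's the weather in Beijing today?"}]
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Beijing"}'},
    }],
}
followup = append_tool_result(
    messages, assistant_message, "call_123", {"temp_c": 31, "sky": "sunny"}
)
```

`followup` would then be passed as `messages=` to a second `chat.completions.create` call so the model can turn the tool output into an answer.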

Anthropic Tool Use

import anthropic
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # alias; substitute any model ID available to you
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Get current weather for a city",
        "input_schema": {...}
    }],
    messages=[{"role": "user", "content": "..."}]
)

The formats differ slightly, but the idea is the same.
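To make the "slightly different format" concrete, here is a small sketch that converts an OpenAI-style tool definition into the Anthropic shape. The function name `openai_tool_to_anthropic` is hypothetical. It assumes the only differences that matter are the missing `function` wrapper and the `parameters` / `input_schema` key name.

```python
def openai_tool_to_anthropic(tool):
    """Convert an OpenAI-style function tool definition to Anthropic's tool format."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Same JSON Schema, different key: "parameters" becomes "input_schema"
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }

openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}
anthropic_tool = openai_tool_to_anthropic(openai_tool)
```

Keeping one canonical definition and converting at the edge means tool descriptions are maintained in a single place.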

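The Art of the Tool Description

Whether the model uses a tool correctly comes down mostly to the tool description (roughly 80%, as a rule of thumb).

A poor description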

{
    "name": "search",
    "description": "Search.",
    "parameters": {...}
}

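The model cannot tell:

What does it search? Web pages? A database? Files?
What is the input format?
When should it be used, and when not?

A good description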

{
    "name": "search_company_docs",
    "description": """Search the company's internal documentation.

    Use this tool when:
    - User asks about company policies, procedures, or processes
    - User asks "how do we do X" at the company
    - User mentions specific internal tools or systems

    Do NOT use this tool when:
    - User asks general programming questions (use code_search instead)
    - User asks about public information (the model knows it directly)

    Returns: top 5 most relevant document excerpts with source links.
    """,
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query. Be specific about what you want to find."
            },
            "category": {
                "type": "string",
                "enum": ["hr", "engineering", "sales", "general"],
                "description": "Optional category filter."
            }
        },
        "required": ["query"]
    }
}

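Key elements:

When to use it and when not to
Input format, with examples
What the return value looks like
Side-effect warnings

Anthropic has published guidance on how to get good tool calling out of Claude, and its core point is the same: describe each tool thoroughly.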
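One way to keep every description complete is to assemble it from the key elements listed above instead of writing free text. The sketch below is only an illustration: `build_tool_description` and the `update_ticket` example are hypothetical.

```python
def build_tool_description(summary, use_when, avoid_when, returns, side_effects=None):
    """Assemble a tool description that always covers the key elements."""
    lines = [summary, "", "Use this tool when:"]
    lines += [f"- {item}" for item in use_when]
    lines += ["", "Do NOT use this tool when:"]
    lines += [f"- {item}" for item in avoid_when]
    lines += ["", f"Returns: {returns}"]
    if side_effects:
        # Warn the model about anything the call changes in the outside world
        lines += ["", f"Side effects: {side_effects}"]
    return "\n".join(lines)

description = build_tool_description(
    summary="Update the status of a support ticket.",
    use_when=["User explicitly asks to close or reopen a ticket"],
    avoid_when=["User only wants to read the ticket"],
    returns="the updated ticket as JSON.",
    side_effects="modifies the ticket and notifies its owner.",
)
```

Because the required arguments have no defaults, forgetting the "when not to use" section becomes a `TypeError` rather than a silent gap.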

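Common Engineering Problems

Problem 1: The model picks the wrong tool

Symptom: with 30 tools available, the model often picks the wrong one.

Causes:

Tool descriptions are not clear enough
Tools overlap in functionality
Too many tool descriptions are stuffed into the context

Fixes:

A. Reduce the number of tools: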

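# ❌ Passing all 30 tools on every call
client.chat.completions.create(model="gpt-4o", messages=messages, tools=all_30_tools)

# ✅ Use retrieval (RAG) to pick the top-5 relevant tools first
relevant_tools = retrieve_tools_by_user_query(user_query)
client.chat.completions.create(model="gpt-4o", messages=messages, tools=relevant_tools)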
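The article does not show `retrieve_tools_by_user_query`, so here is one possible sketch. It ranks tools by keyword overlap between the query and each tool's name and description. A production system would more likely use embeddings. The extra `tools` parameter and the stopword list are assumptions of this sketch.

```python
import re

STOPWORDS = {"a", "an", "the", "in", "to", "for", "is", "of", "and", "what"}

def tokenize(text):
    """Lowercase word tokens without stopwords; underscores also split words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def retrieve_tools_by_user_query(user_query, tools, top_k=5):
    """Return up to top_k tools whose name or description overlaps with the query."""
    query_tokens = tokenize(user_query)
    scored = []
    for tool in tools:
        fn = tool["function"]
        tool_tokens = tokenize(fn["name"] + " " + fn["description"])
        scored.append((len(query_tokens & tool_tokens), tool))
    # Stable sort: tools with equal scores keep their registry order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for score, tool in scored[:top_k] if score > 0]

def make_tool(name, description):
    return {"type": "function", "function": {"name": name, "description": description}}

all_tools = [
    make_tool("get_weather", "Get current weather for a city"),
    make_tool("search_company_docs", "Search internal documentation"),
    make_tool("send_email", "Send an email to a recipient"),
]
selected = retrieve_tools_by_user_query("What is the weather in Paris?", all_tools)
```

Tools with zero overlap are dropped entirely, so an off-topic query yields an empty tool list and the model answers directly.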

B. Categorize tools and select in two stages:

# Step 1: the model first picks a tool category
category = llm.classify_tool_category(user_query)
# Step 2: pick a specific tool from within that category
tools = get_tools_in_category(category)
response = llm.call_with_tools(tools, ...)

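Problem 2: Wrong arguments

Symptom: the model calls the tool, but the arguments are malformed.

Causes:

The parameter schema is vague
No examples are given
Complex nested structures are hard to "guess"

Fixes:

A. Add examples: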

{
    "description": "...",
    "parameters": {
        "type": "object",
        "properties": {
            "filter": {
                "type": "object",
                "description": """
                Example: {"category": "books", "price_max": 50}
                Example: {"author": "Tolkien", "year_range": [1950, 1960]}
                """
            }
        }
    }
}

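B. Pydantic validation:

import json

from pydantic import BaseModel, ValidationError

class WeatherParams(BaseModel):
    city: str
    units: str = "celsius"

def safe_call(tool_call):
    try:
        # OpenAI-style tool call: the raw JSON arguments live on tool_call.function
        params = WeatherParams(**json.loads(tool_call.function.arguments))
        return get_weather(**params.model_dump())  # Pydantic v2; use .dict() on v1
    except (json.JSONDecodeError, ValidationError) as e:
        # Send the error back to the LLM so it can fix the call
        return {"error": str(e), "expected_schema": WeatherParams.model_json_schema()}

Letting the LLM see the error and correct itself is much less work than making it get everything right the first time.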
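The feedback loop itself can be sketched without any API calls. In the example below a plain function stands in for Pydantic, and `fake_model` stands in for the LLM: it gets the field name wrong on the first try and corrects it once it sees the error. All names here are hypothetical.

```python
import json

def validate_weather_args(raw_arguments):
    """Return (args, None) when valid, otherwise (None, error message)."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return None, f"arguments are not valid JSON: {exc}"
    if not isinstance(args, dict) or not isinstance(args.get("city"), str):
        return None, 'missing or non-string field "city"'
    return args, None

def call_with_repair(ask_model, max_attempts=3):
    """ask_model(feedback) returns raw JSON arguments; feedback is None on the first try."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        args, error = validate_weather_args(ask_model(feedback))
        if error is None:
            return args, attempt
        feedback = error  # handed back to the model on the next attempt
    raise ValueError(f"still invalid after {max_attempts} attempts: {feedback}")

def fake_model(feedback):
    # Stand-in for the LLM: wrong field name first, fixed after seeing feedback
    return '{"town": "Beijing"}' if feedback is None else '{"city": "Beijing"}'

args, attempts = call_with_repair(fake_model)
```

The `max_attempts` cap matters: without it, a model that never fixes its arguments turns the repair loop into the runaway loop discussed under Problem 3.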

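Problem 3: Runaway call counts

Symptom: the agent has made 50 tool calls and is still going in circles.

Causes:

Infinite loops (tool A leads to B, and B leads back to A)
The "done" condition is never met
The same failing call is retried over and over

Fixes:

A. Set max_iterations: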

def run_agent(query, max_iterations=10):
    history = []
    for i in range(max_iterations):
        response = llm.call(query, history)
        if response.is_done():
            return response.final_answer
        tool_result = execute_tool(response.tool_call)
        history.append((response, tool_result))
    return "Agent exceeded max iterations - giving up"

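B. Detect repetitive behavior:

def detect_loop(history, window=5):
    recent = [h.tool_name for h in history[-window:]]
    # Flag a loop only once a full window consists of the same tool;
    # without the length check, a single call would already count as a loop
    return len(recent) == window and len(set(recent)) == 1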
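Checking tool names alone can misfire: five searches with five different queries are not a loop, while an A, B, A, B ping-pong is. A slightly stricter sketch compares the full call signature (name plus arguments) and counts repeats inside a sliding window. The function names and thresholds are illustrative assumptions.

```python
import json
from collections import Counter

def call_signature(tool_name, arguments):
    # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal
    return tool_name, json.dumps(arguments, sort_keys=True)

def is_repeating(calls, window=6, max_repeats=3):
    """calls is a list of (tool_name, arguments) pairs.

    True if any identical call occurs at least max_repeats times
    within the last `window` calls.
    """
    recent = Counter(call_signature(name, args) for name, args in calls[-window:])
    return any(count >= max_repeats for count in recent.values())

varied = [("search", {"q": "a"}), ("search", {"q": "b"}), ("search", {"q": "c"})]
stuck = [("search", {"q": "a"})] * 3
ping_pong = [("read", {"id": 1}), ("write", {"id": 1})] * 3
```

Here `varied` is treated as healthy exploration, while `stuck` and `ping_pong` are both flagged.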

C. Set a cost ceiling:

# Inside the agent loop
total_cost = sum(c.cost for c in history)
if total_cost > 0.50:  # more than $0.50 spent on a single task
    return "Cost limit exceeded"
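The `c.cost` field above has to come from somewhere. A minimal sketch, assuming the API reports prompt and completion token counts per call, is a small budget object that converts tokens to dollars. The class name and the prices in the example are placeholders, not real pricing.

```python
class CostBudget:
    """Accumulate estimated spend for one task and report when the limit is passed."""

    def __init__(self, limit_usd, input_price_per_1m, output_price_per_1m):
        self.limit_usd = limit_usd
        self.input_price_per_1m = input_price_per_1m    # USD per 1M prompt tokens
        self.output_price_per_1m = output_price_per_1m  # USD per 1M completion tokens
        self.spent_usd = 0.0

    def add_usage(self, prompt_tokens, completion_tokens):
        self.spent_usd += (
            prompt_tokens * self.input_price_per_1m
            + completion_tokens * self.output_price_per_1m
        ) / 1_000_000

    def exceeded(self):
        return self.spent_usd > self.limit_usd

# Placeholder prices; look up your provider's current rates
budget = CostBudget(limit_usd=0.50, input_price_per_1m=2.50, output_price_per_1m=10.00)
budget.add_usage(prompt_tokens=100_000, completion_tokens=10_000)
within_limit = not budget.exceeded()
budget.add_usage(prompt_tokens=100_000, completion_tokens=10_000)
over_limit = budget.exceeded()
```

Checking `budget.exceeded()` right after each model call gives the same early exit as the snippet above, without recomputing the sum over the whole history.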

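Problem 4: Slow tool calls

Symptom: every tool call takes 20 seconds and users are furious.

Causes:

The tool itself is slow (external API)
The LLM takes a long time to think
Too many calls run sequentially

Fixes:

A. Parallel calls: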

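# OpenAI and Anthropic models can both request several tool calls in one turn.
# parallel_tool_calls is the OpenAI parameter (on by default); it is set explicitly here.
# This snippet assumes an AsyncOpenAI client and runs inside an async function.
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    parallel_tool_calls=True,  # ← the key setting
)
# Execute the requested tools concurrently
tool_calls = response.choices[0].message.tool_calls or []
results = await asyncio.gather(*[execute(tc) for tc in tool_calls])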
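Parallel execution only helps if one slow tool cannot stall the whole batch. As a sketch, each call below is wrapped in `asyncio.wait_for`, so a timeout comes back as an error result the model can read rather than as an exception. The tool functions and the 0.1 second limit are made up for the demo.

```python
import asyncio

async def run_with_timeout(name, tool_fn, timeout_s):
    """Run one async tool; report a timeout as a result, not as an exception."""
    try:
        result = await asyncio.wait_for(tool_fn(), timeout=timeout_s)
        return {"tool": name, "success": True, "result": result}
    except asyncio.TimeoutError:
        return {"tool": name, "success": False, "error": f"timed out after {timeout_s}s"}

async def run_all(tools, timeout_s=1.0):
    # gather returns results in the same order as the input list
    return await asyncio.gather(
        *[run_with_timeout(name, fn, timeout_s) for name, fn in tools]
    )

async def fast_tool():
    await asyncio.sleep(0.01)
    return "ok"

async def slow_tool():
    await asyncio.sleep(5)  # simulates a hanging external API
    return "too late"

results = asyncio.run(run_all([("fast", fast_tool), ("slow", slow_tool)], timeout_s=0.1))
```

The batch finishes in roughly the timeout, not in the five seconds the slow tool would need.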

B. Caching:

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_search(query):
    return actual_search(query)
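One caveat with `lru_cache` is that entries never expire, which is risky for tools whose answers go stale (weather, prices, search). A minimal time-to-live cache can be sketched as follows. The class name and the injectable `clock` are choices made for this example, the latter so that expiry can be tested without sleeping.

```python
import time

class TTLCache:
    """Cache values for ttl_seconds; recompute once an entry has expired."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value

# Demo with a fake clock and a counter in place of a real search backend
fake_now = [0.0]
backend_calls = []

def expensive_search():
    backend_calls.append("hit")
    return f"result #{len(backend_calls)}"

cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get_or_compute("llm tools", expensive_search)
second = cache.get_or_compute("llm tools", expensive_search)  # served from cache
fake_now[0] = 61.0
third = cache.get_or_compute("llm tools", expensive_search)   # expired, recomputed
```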

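C. Streaming and early feedback:

Let users watch the "thinking process": even if the final answer takes tens of seconds, they get feedback along the way.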

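Problem 5: Exploding costs

Symptom: the token count of a single conversation balloons.

Causes:

Full tool descriptions are resent on every call (and the context keeps accumulating)
Tool outputs are very long
Many calls are made

Fixes:

A. Trim tool descriptions:

Don't send 1,000-word descriptions of 30 tools; pass only the 5 relevant ones.

B. Limit the length of tool output: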

def search_files(query):
    results = full_search(query)
    # Return only the top 5, each truncated to its first 200 characters
    return [
        {"file": r.path, "excerpt": r.content[:200]}
        for r in results[:5]
    ]

C. Context compression:

Compress long conversation history into a summary: keep the key information while saving tokens.
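A rough sketch of that idea: keep the most recent messages verbatim and replace everything older with a single summary message. The `summarize` callable is an assumption. In practice it would be a call to a cheap model, while the stand-in below simply concatenates truncated snippets.

```python
def compress_history(messages, summarize, keep_last=4):
    """Replace all but the last keep_last messages with one summary message."""
    if len(messages) <= keep_last:
        return list(messages)
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return [summary] + recent

def naive_summarize(messages):
    # Stand-in for an LLM summarizer: first 30 characters of each message
    return " | ".join(m["content"][:30] for m in messages)

history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(10)
]
compressed = compress_history(history, naive_summarize, keep_last=4)
```

With tool-calling conversations, the cut point needs some care: an assistant message that contains tool calls should stay together with the tool results that answer it.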

An Engineering Template

A reasonably complete template for a tool-using LLM application:

import json
import asyncio
from openai import AsyncOpenAI

class ToolAgent:
    def __init__(self, tools_registry, model="gpt-4o"):
        self.client = AsyncOpenAI()
        self.tools_registry = tools_registry  # name → {schema, impl}
        self.model = model

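    def get_tool_schemas(self, relevant_tools):
        return [
            {"type": "function", "function": self.tools_registry[name]["schema"]}
            for name in relevant_tools
        ]

    def select_tools(self, user_query, top_k=5):
        # Placeholder: returns the first top_k tool names.
        # Swap in retrieval over tool descriptions here.
        return list(self.tools_registry)[:top_k]

    def estimate_cost(self, response, input_per_1m=2.50, output_per_1m=10.00):
        # Prices are illustrative assumptions (USD per 1M tokens); check current pricing.
        usage = response.usage
        return (usage.prompt_tokens * input_per_1m
                + usage.completion_tokens * output_per_1m) / 1_000_000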

    async def run(self, user_query, max_iter=10):
        history = [{"role": "user", "content": user_query}]
        cost = 0

        for iteration in range(max_iter):
            # 1. Select the tools relevant to the query (RAG)
            relevant = self.select_tools(user_query, top_k=5)

            # 2. Call the LLM
            response = await self.client.chat.completions.create(
                model=self.model,
                messages=history,
                tools=self.get_tool_schemas(relevant),
                parallel_tool_calls=True,
            )
            cost += self.estimate_cost(response)
            if cost > 0.50:
                return "Cost limit exceeded"

            msg = response.choices[0].message
            history.append(msg)

            # 3. Check whether the model is done
            if not msg.tool_calls:
                return msg.content

            # 4. Execute all tool calls (in parallel)
            tool_results = await asyncio.gather(*[
                self.execute_tool(tc) for tc in msg.tool_calls
            ])

            # 5. Append the tool results to history
            for tc, result in zip(msg.tool_calls, tool_results):
                history.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": json.dumps(result)
                })

        return "Max iterations exceeded"

    async def execute_tool(self, tool_call):
        try:
            func = self.tools_registry[tool_call.function.name]["impl"]
            args = json.loads(tool_call.function.arguments)
            result = await func(**args)
            return {"success": True, "result": result}
        except Exception as e:
            return {"success": False, "error": str(e)}

This template covers 80% of the engineering problems:

Tool selection
Cost control
Parallel calls
Error handling
Iteration limits

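A Real-World Lesson

One team built a SQL query assistant on top of an LLM agent.

v1 (naive)

Give the model a pile of SQL tools
Let it work out how to query on its own

Problem: it often generated wrong SQL and still had not got it right after 5 attempts.

v2 (constrained)

Restrict the tables and columns it may use
Validate the SQL structure with Pydantic
Feed errors back to the model immediately

Improvement: the success rate went from 60% to 85%.

v3 (enhanced)

Added a "validate the SQL with a small model first, then execute with GPT-4" step
Cache common queries
Let users see the reasoning process

Result: 5× faster responses, 60% lower cost, 95% accuracy.

The cumulative effect of engineering optimizations is huge: what matters is less how powerful the LLM is and more how carefully the engineering is done.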

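Security Considerations

Tool use gives the LLM the ability to "do things", so the risk rises significantly:

1. Prompt Injection

Instructions hidden in user input or in tool output can lead the LLM to perform malicious actions.

Defenses:

Tool allowlist plus sandboxing
Require user confirmation for dangerous operations
Monitor for abnormal call patterns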
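The first two defenses can be sketched as a thin gate in front of tool execution. Everything here is hypothetical: the tool names, the `confirm` callback (which in a real app would prompt the user), and the choice of which tools count as dangerous.

```python
ALLOWED_TOOLS = {"search_docs", "get_weather", "delete_file"}
NEEDS_CONFIRMATION = {"delete_file"}

def guarded_execute(name, args, registry, confirm):
    """Run a tool only if it is allowlisted and, when dangerous, confirmed by the user."""
    if name not in ALLOWED_TOOLS or name not in registry:
        return {"success": False, "error": f"tool '{name}' is not allowed"}
    if name in NEEDS_CONFIRMATION and not confirm(name, args):
        return {"success": False, "error": "user declined the operation"}
    return {"success": True, "result": registry[name](**args)}

deleted = []
registry = {
    "get_weather": lambda city: f"sunny in {city}",
    "delete_file": lambda path: deleted.append(path) or "deleted",
    "drop_database": lambda: "gone",  # registered but never allowlisted
}

ok = guarded_execute("get_weather", {"city": "Beijing"}, registry, confirm=lambda n, a: False)
blocked = guarded_execute("drop_database", {}, registry, confirm=lambda n, a: True)
declined = guarded_execute("delete_file", {"path": "/tmp/x"}, registry, confirm=lambda n, a: False)
approved = guarded_execute("delete_file", {"path": "/tmp/x"}, registry, confirm=lambda n, a: True)
```

The gate sits outside the model, so an injected instruction can make the model request `drop_database`, but it cannot make the call go through.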

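2. Privilege Overreach

If the LLM can see admin tools, it may misuse them.

Defenses:

Provide a different tool set for each user role
Log sensitive operations
Add human review where necessary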
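Role-based tool sets come down to filtering the tool list before it ever reaches the model. A minimal sketch, with made-up roles and tool names:

```python
ROLE_TOOLS = {
    "viewer": {"search_docs"},
    "editor": {"search_docs", "update_doc"},
    "admin": {"search_docs", "update_doc", "delete_user"},
}

def tools_for_role(role, all_tools):
    """Return only the tool definitions this role is allowed to see."""
    allowed = ROLE_TOOLS.get(role, set())  # unknown roles get no tools at all
    return [tool for tool in all_tools if tool["function"]["name"] in allowed]

all_tools = [
    {"type": "function", "function": {"name": name, "description": name}}
    for name in ("search_docs", "update_doc", "delete_user")
]
viewer_tools = tools_for_role("viewer", all_tools)
admin_tools = tools_for_role("admin", all_tools)
guest_tools = tools_for_role("guest", all_tools)
```

A tool the model never sees cannot be misused by it, which is a stronger guarantee than telling the model in the prompt not to call it. The same role check should still be enforced again at execution time.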

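3. Data Exfiltration

The LLM may send sensitive data to external APIs.

Defenses:

Run PII detection before and after tool calls
Flag suspicious data flows
Audit all external calls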
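As a very rough sketch of the PII check, the snippet below scans outgoing tool arguments with two regular expressions and redacts what it finds. The patterns are deliberately simplistic assumptions. Real deployments usually rely on a dedicated PII detection library or service.

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}

def find_pii(text):
    """Return {kind: [matches]} for every pattern that matches the text."""
    found = {}
    for kind, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[kind] = matches
    return found

def redact(text):
    """Replace every match with a placeholder such as [EMAIL]."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{kind.upper()}]", text)
    return text

outgoing = "Contact alice@example.com or +1 415-555-0100."
hits = find_pii(outgoing)
safe_text = redact(outgoing)
```

Running `find_pii` on tool arguments before an external call, and on tool output before it enters the context, gives the "before and after" coverage from the list above.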

See L6-03 Red Teaming and Jailbreaks for details.

Tool	Purpose
LangChain Tools	General-purpose tool abstraction
OpenAI Functions	Native to OpenAI
Anthropic Tools	Native to Claude
MCP (L4-07)	Cross-LLM standard
Pydantic AI	Type safety
Instructor	Structured output

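🚧 3 Common Pitfalls

⚠️ Pitfalls to avoid in practice

Pitfall 1: Giving the LLM 50 tools at once. Beyond 10-15 tools, the LLM's wrong-selection rate rises noticeably. Pre-filter by scenario, or use a router to pick a subset first.

Pitfall 2: Returning raw JSON to the LLM. Return a mix of a natural-language summary and structured data instead, because an LLM reading raw JSON tends to overlook key fields.

Pitfall 3: Ignoring tool failure modes. Networks go down, APIs rate-limit, arguments are wrong. Every tool call must be wrapped in try/except, with the error returned to the LLM in an understandable format so it can correct itself.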
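For Pitfall 2, one possible shape for the "summary plus structured data" output is sketched below. The function name, the `title` and `score` fields, and the exact wording of the summary are assumptions for illustration.

```python
import json

def format_search_result(results):
    """Lead with a one-line summary, then attach the full structured data."""
    if not results:
        summary = "No matching documents were found."
    else:
        title = results[0]["title"]
        score = results[0]["score"]
        summary = (
            f"Found {len(results)} document(s). "
            f"Best match: '{title}' (score {score:.2f})."
        )
    return summary + "\n\nData:\n" + json.dumps(results, ensure_ascii=False, indent=2)

results = [
    {"title": "Leave policy", "score": 0.91},
    {"title": "Travel policy", "score": 0.47},
]
tool_output = format_search_result(results)
empty_output = format_search_result([])
```

The summary states up front what the model is most likely to need, and the JSON stays available for anything that requires exact values.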
