AI Interview Question 2: How to Ensure Reliable Tool Calling by Large Language Models (LLMs)

How do you ensure that large language models (LLMs) call tools reliably and controllably, rather than relying solely on prompts to "convince" the model? The short answer: a systematic, multi-level constraint framework is needed.

Take a weather-query tool as a running example. Models commonly exhibit three "hallucination" behaviors in tool calling:
1. Not calling the tool and directly fabricating an answer.
2. Passing incorrectly formatted parameters when calling the tool (e.g., passing date="the day after tomorrow" even though the tool does not support relative dates).
3. Silently converting parameter formats (e.g., rewriting "the day after tomorrow" into a concrete date without being asked), even when the tool does not call for the conversion.

The root cause is that model output is inherently probabilistic: a prompt only nudges the probability distribution (a "soft constraint") and provides no mandatory mechanism to guarantee strict compliance. In complex scenarios, such soft constraints fail easily.

To address this, a multi-level engineering solution is needed:

  1. Level 1: Optimize Prompts (Soft Constraints)

    • Position prompts as the starting point of the constraint system, but by no means the end.
    • Treat the prompt as an "operation contract": clearly explain the tool's purpose, the type and boundaries of each parameter, and list examples of illegal values.
    • Include few-shot examples ("correct input → correct call") to anchor the model's behavior pattern through in-context learning (a prompt sketch follows this list).
  2. Level 2: Introduce JSON Schema (Hard Constraints)

    • This is the key step from "persuading the model" to "setting guardrails".
    • Replace natural-language parameter descriptions with a machine-readable, verifiable structured definition (JSON Schema): strictly define field types, required fields, and enumeration value ranges, and prohibit the model from outputting any undefined fields by setting additionalProperties: false (a schema sketch follows this list).
    • Major API platforms can enforce such structured-output constraints during the model's decoding phase, preventing format violations at the source.
  3. Level 3: Establish a Validation-Repair-Retry Loop (Execution Fallback)

    • Even with a schema in place, the model output must still be checked after it is received: first for JSON syntax, then for schema conformance.
    • When validation fails, run an automatic repair-and-retry mechanism (with a retry limit), feeding the error message back to the model so it can correct its output. Once the limit is exceeded, fall back to a degraded path or manual handling (a retry-loop sketch follows this list).
  4. Architecture Level: Separation of Concerns

    • Separate decision from execution, forming a three-layer architecture:
      • Model Layer: responsible only for decision-making (which tool to call, and with what parameters).
      • Framework Layer: handles execution, including schema validation, tool invocation, retry handling, and result integration. This ensures that model errors never reach the tools directly, and that tool changes do not require constant prompt adjustments.
      • Tool Layer: the concrete business capability implementation.
    • Frameworks such as LangChain and LlamaIndex embody exactly this separation (a minimal three-layer sketch follows this list).
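
For Level 1, here is a minimal sketch of an "operation contract" prompt with few-shot examples. It assumes a hypothetical get_weather(city, date) tool; the tool name, parameter rules, and examples are all illustrative:

```python
# Hypothetical "operation contract" prompt for a get_weather tool.
# The tool name, parameter rules, and few-shot examples are illustrative.
SYSTEM_PROMPT = """You may call the tool get_weather(city, date).

Contract:
- city: a full city name, e.g. "Shanghai". Abbreviations such as "Hu" are NOT allowed.
- date: an ISO date string "YYYY-MM-DD". Relative phrases such as
  "the day after tomorrow" are NOT allowed; ask the user if the date is ambiguous.
- Never answer a weather question without calling the tool.

Examples (correct input -> correct call):
User: What's the weather in Shanghai on 2024-06-01?
Call: get_weather(city="Shanghai", date="2024-06-01")

User: Will it rain in Beijing tomorrow? (today is 2024-05-31)
Call: get_weather(city="Beijing", date="2024-06-01")
"""
```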
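
For Level 2, a sketch of the same tool's hard constraint expressed as a JSON Schema. The field names and the unit enum are assumptions; the load-bearing parts are required, the date pattern, and additionalProperties: false:

```python
# JSON Schema for the hypothetical get_weather tool's parameters.
GET_WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string", "description": "Full city name, e.g. 'Shanghai'"},
        "date": {
            "type": "string",
            # ISO dates only: rejects values like "the day after tomorrow".
            "pattern": "^\\d{4}-\\d{2}-\\d{2}$",
        },
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city", "date"],
    # Reject any field the schema does not define.
    "additionalProperties": False,
}
```

Platforms with function/tool calling generally accept a definition like this as the tool's parameters block, and some can enforce it during decoding.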
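
For Level 3, a sketch of the validation-repair-retry loop using the jsonschema library. call_model is a hypothetical stub standing in for whatever client you use; the retry budget and the error-feedback message are illustrative choices:

```python
import json
import jsonschema

MAX_RETRIES = 3

def call_model(messages: list[dict]) -> str:
    """Hypothetical stub: returns the model's raw tool-argument string."""
    raise NotImplementedError

def get_tool_args(messages: list[dict], schema: dict) -> dict:
    """Validate model output; on failure, feed the error back and retry."""
    for _ in range(MAX_RETRIES):
        raw = call_model(messages)
        try:
            args = json.loads(raw)                             # syntax check
            jsonschema.validate(instance=args, schema=schema)  # schema check
            return args
        except (json.JSONDecodeError, jsonschema.ValidationError) as err:
            # Feed the error back so the model can repair its own output.
            messages = messages + [{
                "role": "user",
                "content": f"Your previous output was invalid: {err}. "
                           "Return ONLY JSON matching the schema.",
            }]
    # Retry budget exhausted: degrade or escalate to manual handling.
    raise RuntimeError("tool-argument generation failed after retries")
```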
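
Finally, a sketch of the three-layer separation, reusing GET_WEATHER_SCHEMA from the schema sketch above. decide() is a hypothetical stub for the model layer; the point is that only the framework layer performs validation and dispatch, so only validated arguments ever reach a tool:

```python
import json
import jsonschema

# Tool layer: concrete business capability; knows nothing about the model.
def get_weather(city: str, date: str) -> str:
    return f"(stub) weather for {city} on {date}"

TOOLS = {"get_weather": {"fn": get_weather, "schema": GET_WEATHER_SCHEMA}}

# Model layer: decision only (hypothetical stub).
def decide(user_query: str) -> dict:
    """Returns a decision like {'tool': 'get_weather', 'args': '<raw JSON>'}."""
    raise NotImplementedError

# Framework layer: validation, invocation, and result integration.
def run(user_query: str) -> str:
    decision = decide(user_query)
    spec = TOOLS[decision["tool"]]   # unknown tool name fails here, not at execution
    args = json.loads(decision["args"])
    jsonschema.validate(instance=args, schema=spec["schema"])
    return spec["fn"](**args)        # only validated arguments reach the tool
```

With this layout, swapping a tool or tightening a schema only touches the TOOLS registry, not the prompt.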

Limitations of the current approach: it handles parameter format issues well, but coverage of parameter semantics is still thin (e.g., recognizing that "Shanghai" and its Chinese abbreviation "沪" (Hu) refer to the same city). This remains an open engineering challenge.

Core Conclusion: making LLMs call tools reliably is essentially a software engineering problem. It requires a systematic solution spanning soft constraints, hard constraints, execution fallbacks, and architectural design, rather than prompt optimization alone.
