Function Calling & Tool Use

How LLMs decide when to call APIs, the schemas they emit, and the round-trip back to natural language.

TL;DR

Function calling is the bridge between language and action. You define a tool as a JSON schema (name, description, parameters). The LLM doesn't run the function — it decides *when* to call one and emits a structured JSON describing the call. Your code parses, validates, and executes; the result is injected back into the conversation as a tool message; the LLM responds with the grounded answer. With chaining, parallel calls, retries, and a strict permission model, this becomes the foundation of every AI assistant that actually does things.

When to use

Whenever your LLM needs to access fresh data (weather, prices, inventory, calendars), invoke an external API (send email, create ticket, schedule meeting), or perform precise computation (math, code execution, database queries). Don't use it when the answer is already implicit in the model's training data or in the conversation context — adding tools adds latency and points of failure.

The bridge from language to action

Pre-2023, an LLM was a fluent writer trapped in a box. Ask it for the current weather and it’d say “It’s around 22°C in Tokyo, but I can’t actually check live data.” Useful, but limited.

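Function calling broke the box open. Now an LLM can decide that a question needs live data and emit a structured request such as get_weather(location="Tokyo") instead of guessing.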

Then your code runs the function, returns the result, and the LLM produces a grounded answer using the actual data.

The full request lifecycle

Every function-calling round trip follows the same shape:

sequenceDiagram
    autonumber
    participant U as User
    participant App as Your code
    participant LLM as LLM API
    participant Tool as Tool / API

    U->>App: "What's the weather in Tokyo?"
    App->>LLM: messages + tool definitions
    LLM-->>App: tool_call: get_weather(location="Tokyo")
    Note over App: 1. Parse JSON<br/>2. Validate schema + business rules<br/>3. Check permissions
    App->>Tool: GET /weather?city=Tokyo
    Tool-->>App: { temp: 22, conditions: "cloudy" }
    App->>LLM: messages + tool_result
    LLM-->>App: "It's 22°C and cloudy in Tokyo."
    App-->>U: "It's 22°C and cloudy in Tokyo."

Eight messages. Two model calls. One real-world action. Done thousands of times a day in any production assistant.

1. The tool definition

A tool is a JSON schema. Three required pieces:

{
  "name": "get_weather",
  "description": "Get the current weather for a given location. Use when the user asks about temperature, conditions, or forecasts for a city or region.",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name or 'lat,lng', e.g. 'Tokyo' or '35.68,139.69'"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "default": "celsius"
      }
    },
    "required": ["location"]
  }
}
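
Providers wrap this schema slightly differently. As one example, OpenAI's Chat Completions API expects each entry in `tools` to be an envelope with `"type": "function"` around the definition. A minimal sketch of that wrapping, where `as_openai_tool` is a hypothetical helper name and not part of any SDK:

```python
import json

# Bare definition, same shape as the schema above (descriptions abridged).
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a given location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}


def as_openai_tool(definition: dict) -> dict:
    """Wrap a bare definition in the Chat Completions `tools` envelope (hypothetical helper)."""
    missing = {"name", "description", "parameters"} - set(definition)
    if missing:
        raise ValueError(f"tool definition is missing: {sorted(missing)}")
    return {"type": "function", "function": definition}


print(json.dumps(as_openai_tool(get_weather_tool), indent=2))
```

Keeping the bare definition separate from the envelope makes it easier to reuse the same `parameters` schema for validation later.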

2. The request

You include the tool definitions in your API call alongside the user message:

import openai  # module-level client; reads OPENAI_API_KEY from the environment

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the weather in Tokyo?"},
    ],
    tools=[get_weather_tool, get_calendar_tool],  # plus any other tool definitions
    tool_choice="auto",  # model decides
)

tool_choice options:

"auto" — model decides (default)
"none" — never call tools
"required" — must call a tool
{"type": "function", "function": {"name": "X"}} — must call exactly tool X

3. The model’s response

For a tool-using turn, the response contains a tool_calls field instead of text content:

{
  "role": "assistant",
  "content": null,
  "tool_calls": [{
    "id": "call_abc123",
    "type": "function",
    "function": {
      "name": "get_weather",
      "arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}"
    }
  }]
}
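
Note that `arguments` is a JSON-encoded string, not a nested object, so it needs its own decoding step. A small sketch of pulling the calls out of a message shaped like the one above, working on plain dicts; `extract_tool_calls` is an illustrative name:

```python
import json


def extract_tool_calls(message: dict) -> list[dict]:
    """Return one {'id', 'name', 'args'} dict per tool call in an assistant message."""
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append({
            "id": call["id"],
            "name": function["name"],
            # `arguments` arrives as a string and must be decoded separately.
            "args": json.loads(function["arguments"]),
        })
    return calls


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}",
        },
    }],
}

print(extract_tool_calls(assistant_message))
```

A message with no `tool_calls` yields an empty list, which is the signal that the model answered directly.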

4. Your code parses, validates, executes

This is the trust boundary. Three things must happen in this order:

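import json
import jsonschema  # third-party: pip install jsonschema

# 1. Parse: arguments arrive as a JSON string, not a dict
args = json.loads(tool_call.function.arguments)

# 2. Validate: schema first, then business rules
jsonschema.validate(instance=args, schema=get_weather_tool["parameters"])
if args["location"] not in allowed_locations:  # business rule
    # raise instead of assert: asserts are stripped when Python runs with -O
    raise ValueError(f"location not allowed: {args['location']!r}")

# 3. Execute
result = weather_api.fetch(args["location"], unit=args.get("unit", "celsius"))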
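
With more than one tool, the parse, validate, and execute steps are usually routed through a registry keyed by tool name, so an unknown or hallucinated tool name is rejected before anything runs. A minimal standard-library sketch, assuming a hypothetical `TOOLS` registry and a stubbed `get_weather` in place of a real weather API. The validation here is a hand-rolled check of required keys and enum values, not a full JSON Schema validator:

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stub standing in for a real weather API call."""
    return {"location": location, "temp": 22, "unit": unit, "conditions": "cloudy"}


# Hypothetical registry: tool name -> (handler, required args, allowed enum values)
TOOLS = {
    "get_weather": (get_weather, {"location"}, {"unit": {"celsius", "fahrenheit"}}),
}


def dispatch(name: str, raw_arguments: str) -> dict:
    """Parse, validate, then execute one tool call; raise ValueError on bad input."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    handler, required, enums = TOOLS[name]

    try:
        args = json.loads(raw_arguments)  # 1. parse
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")

    missing = required - set(args)  # 2. validate
    if missing:
        raise ValueError(f"missing required arguments: {sorted(missing)}")
    unknown = set(args) - required - set(enums)
    if unknown:
        raise ValueError(f"unexpected arguments: {sorted(unknown)}")
    for key, allowed in enums.items():
        if key in args and args[key] not in allowed:
            raise ValueError(f"invalid value for {key}: {args[key]!r}")

    return handler(**args)  # 3. execute


print(dispatch("get_weather", '{"location": "Tokyo"}'))
```

Raising a single exception type for every rejection keeps the caller simple: it can catch `ValueError` and turn the message into an error result for the model.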

5. The tool result goes back into the conversation

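# The assistant message that requested the call must precede its result
messages.append(response.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(result),
})
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
)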

Now the model has the actual data. It produces a natural-language answer:

“It’s currently 22°C and cloudy in Tokyo.”

The full conversation now has five messages (system, user, assistant-with-tool-call, tool-result, assistant-final). Save all of them for the next turn so the model has full context.

6. Tool chaining — multi-step reasoning

Real tasks often need multiple tools in sequence. “Find papers on retrieval-augmented generation, summarize the top 3, email me the summary.”

The flow:

LLM emits web_search tool call
Your code searches, returns results
LLM emits summarize tool call (or just summarizes in-prompt)
Your code summarizes
LLM emits send_email tool call
Your code sends, returns confirmation
LLM produces final response: “Done. Email sent.”
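
The flow above is a loop: call the model, run whatever tools it asks for, append the results, and repeat until it answers without a tool call. The sketch below shows that loop with a scripted stand-in for the model so it runs offline. `run_agent`, `fake_model`, `fake_tool`, and the `MAX_STEPS` cap are all illustrative assumptions, not part of any provider SDK:

```python
import json
from typing import Callable

MAX_STEPS = 8  # assumed safety cap so a confused model cannot loop forever


def run_agent(call_model: Callable[[list], dict],
              execute_tool: Callable[[str, dict], str],
              messages: list) -> str:
    """Ask the model, run any requested tools, feed results back, repeat."""
    for _ in range(MAX_STEPS):
        reply = call_model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"]  # no tool requested: this is the final answer
        for call in tool_calls:
            args = json.loads(call["function"]["arguments"])
            result = execute_tool(call["function"]["name"], args)
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    raise RuntimeError(f"no final answer after {MAX_STEPS} steps")


def fake_model(messages: list) -> dict:
    """Scripted stand-in for an LLM: search, then email, then a final answer."""
    plan = ["web_search", "send_email"]
    done = sum(1 for m in messages if m["role"] == "tool")
    if done < len(plan):
        call = {"id": f"call_{done}", "type": "function",
                "function": {"name": plan[done], "arguments": "{}"}}
        return {"role": "assistant", "content": None, "tool_calls": [call]}
    return {"role": "assistant", "content": "Done. Email sent."}


def fake_tool(name: str, args: dict) -> str:
    """Stub executor that reports success for any tool."""
    return json.dumps({"tool": name, "status": "ok"})


history = [{"role": "user", "content": "Find RAG papers and email me a summary."}]
print(run_agent(fake_model, fake_tool, history))  # prints "Done. Email sent."
```

The step cap matters in practice: without it, a model that keeps requesting tools would run up cost and latency indefinitely.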

7. Parallel tool calls

When tools are independent, the model can request them in one turn:

"tool_calls": [
  {"function": {"name": "get_weather", "arguments": "..."}},
  {"function": {"name": "get_calendar", "arguments": "..."}},
  {"function": {"name": "get_traffic", "arguments": "..."}}
]

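# Inside an async function. Record the assistant turn, then run the calls concurrently.
message = response.choices[0].message
messages.append(message)
results = await asyncio.gather(*[
    execute_tool(call) for call in message.tool_calls
])
# gather() preserves input order, so results line up with tool_calls
for call, result in zip(message.tool_calls, results):
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})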
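
One failing call should not discard the results of the others, and every `tool_call_id` still needs a matching tool message. A self-contained sketch of that idea using `asyncio.gather(..., return_exceptions=True)`; the stubbed `execute_tool` and the deliberately failing `get_traffic` are assumptions made for the demo:

```python
import asyncio
import json


async def execute_tool(name: str) -> str:
    """Stub tool runner: sleeps to simulate I/O; 'get_traffic' fails on purpose."""
    await asyncio.sleep(0.05)
    if name == "get_traffic":
        raise TimeoutError("traffic service timed out")
    return json.dumps({"tool": name, "status": "ok"})


async def run_parallel(calls: list[dict]) -> list[dict]:
    """Run all calls concurrently; return one tool message per call, in order."""
    outcomes = await asyncio.gather(
        *(execute_tool(c["name"]) for c in calls),
        return_exceptions=True,  # collect failures instead of raising the first one
    )
    messages = []
    for call, outcome in zip(calls, outcomes):
        if isinstance(outcome, Exception):
            content = json.dumps({"error": str(outcome)})
        else:
            content = outcome
        messages.append({"role": "tool", "tool_call_id": call["id"], "content": content})
    return messages


calls = [
    {"id": "call_1", "name": "get_weather"},
    {"id": "call_2", "name": "get_calendar"},
    {"id": "call_3", "name": "get_traffic"},
]
tool_messages = asyncio.run(run_parallel(calls))
for message in tool_messages:
    print(message["tool_call_id"], message["content"])
```

The model then sees two successful results and one error payload, and can decide whether to answer with partial data or retry the failed tool.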

8. Failure handling

Tools fail. Network blips, rate limits, bad arguments, downstream outages. The system needs:

Retries with exponential backoff (1s, 2s, 4s, give up)
Fallbacks — if get_weather_v2 fails, try get_weather_v1
Error messages back to the model — don’t just throw; tell the LLM what went wrong so it can recover

A failed tool result might be:

{
  "role": "tool",
  "tool_call_id": "call_abc",
  "content": "{\"error\": \"location not found\", \"suggestion\": \"try a more specific name\"}"
}
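
The three requirements above can be combined in one small wrapper: retry with backoff, fall through to a fallback tool, and if everything fails return an error payload rather than raising. This is a sketch under stated assumptions: `call_with_retries` and `tool_result_content` are hypothetical helpers, the two weather functions are stubs, and a production version would retry only on errors known to be transient:

```python
import json
import time
from typing import Callable


def call_with_retries(tool: Callable[[], dict], delays=(1, 2, 4),
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Try once, then retry after each delay; re-raise the last error when the budget is spent."""
    for delay in (*delays, None):
        try:
            return tool()
        except Exception:  # broad on purpose for the sketch
            if delay is None:
                raise
            sleep(delay)


def tool_result_content(primary: Callable[[], dict], fallback: Callable[[], dict],
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Run primary with retries, then fallback; always return a string for the tool message."""
    last_error = None
    for tool in (primary, fallback):
        try:
            return json.dumps(call_with_retries(tool, sleep=sleep))
        except Exception as exc:
            last_error = exc
    return json.dumps({"error": str(last_error)})


def get_weather_v2() -> dict:
    """Stub for a primary service that is down."""
    raise ConnectionError("v2 unavailable")


def get_weather_v1() -> dict:
    """Stub for an older service that still works."""
    return {"temp": 22, "conditions": "cloudy", "source": "v1"}


# A no-op sleep keeps the demo from actually waiting 7 seconds.
print(tool_result_content(get_weather_v2, get_weather_v1, sleep=lambda s: None))
```

Injecting `sleep` as a parameter is a design choice for testability: the backoff schedule can be checked without slowing the test suite.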

9. Security — the layered defense

Function calling is the most dangerous LLM feature, because it’s the one that takes real-world action.

flowchart TD
    LLM[LLM emits tool_call] --> L1[Layer 1: Permission scoping<br/><i>Tool not in user's allowlist?<br/>Model never sees it.</i>]
    L1 -->|allowed| L2[Layer 2: Argument validation<br/><i>JSON schema + business rules<br/>SQL/XSS/path-traversal patterns blocked</i>]
    L2 -->|valid| L3[Layer 3: Sandboxing<br/><i>Code execution in isolated container<br/>no net, no FS, no privileges</i>]
    L3 --> L4[Layer 4: Audit log<br/><i>Every call recorded:<br/>who, when, args, result</i>]
    L4 --> L5{Layer 5:<br/>High-risk tool?}
    L5 -->|yes| HA[Human approval<br/><i>send money, delete data,<br/>external email</i>]
    L5 -->|no| EX[Execute]
    HA -->|approved| EX
    HA -->|denied| BLK[Block + log]
    EX --> R[Result]

    L1 -.blocked.-> BLK
    L2 -.invalid.-> BLK

    style LLM fill:#7e1d1d,stroke:#ef4444,color:#fff
    style L1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style L2 fill:#0e7490,stroke:#06b6d4,color:#fff
    style L3 fill:#581c87,stroke:#a855f7,color:#fff
    style L4 fill:#365314,stroke:#84cc16,color:#fff
    style L5 fill:#9a3412,stroke:#f97316,color:#fff
    style HA fill:#9a3412,stroke:#f97316,color:#fff
    style EX fill:#365314,stroke:#84cc16,color:#fff
    style R fill:#1c2333,stroke:#475569,color:#e7eaf1
    style BLK fill:#7f1d1d,stroke:#f43f5e,color:#fff
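
Most of the flowchart can be expressed as a single gate that every tool call passes through before execution. The sketch below covers layers 1, 2, 4, and 5; sandboxing (layer 3) is an infrastructure concern and is not shown. `ToolGate`, the `HIGH_RISK` names, and the `approve` hook are illustrative assumptions:

```python
import time
from dataclasses import dataclass, field
from typing import Callable

HIGH_RISK = {"send_money", "delete_data", "send_external_email"}  # assumed tool names


@dataclass
class ToolGate:
    allowlist: set[str]
    validators: dict[str, Callable[[dict], None]]  # each raises ValueError on bad args
    approve: Callable[[str, dict], bool]           # human-approval hook
    audit_log: list[dict] = field(default_factory=list)

    def check(self, user: str, name: str, args: dict) -> str:
        """Return 'execute' or 'blocked'; every decision is appended to the audit log."""
        decision = self._decide(name, args)
        self.audit_log.append(                     # Layer 4: audit log
            {"user": user, "when": time.time(), "tool": name, "args": args, "decision": decision}
        )
        return decision

    def _decide(self, name: str, args: dict) -> str:
        if name not in self.allowlist:             # Layer 1: permission scoping
            return "blocked"
        try:
            self.validators[name](args)            # Layer 2: argument validation
        except (KeyError, ValueError):             # a missing validator also fails closed
            return "blocked"
        if name in HIGH_RISK and not self.approve(name, args):  # Layer 5: human approval
            return "blocked"
        return "execute"


def require_location(args: dict) -> None:
    if not isinstance(args.get("location"), str):
        raise ValueError("location must be a string")


gate = ToolGate(
    allowlist={"get_weather", "send_external_email"},
    validators={"get_weather": require_location, "send_external_email": lambda args: None},
    approve=lambda name, args: False,  # stand-in reviewer that denies everything
)
print(gate.check("alice", "get_weather", {"location": "Tokyo"}))           # execute
print(gate.check("alice", "delete_data", {"table": "users"}))              # blocked
print(gate.check("alice", "send_external_email", {"to": "bob@corp.test"}))  # blocked
```

The gate fails closed: anything not explicitly allowed, validated, and (where required) approved is blocked, and blocked calls are logged just like executed ones.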

10. The shape of every modern AI assistant

Function calling is the architectural primitive behind LLM assistants that actually do things: ChatGPT plugins, Claude with computer use, Cursor, GitHub Copilot Workspace, and many customer support bots built since 2023. Voice assistants such as Siri and Alexa follow the same shape of mapping a request to a structured action.

The pattern is always:

Define the actions as tools
Let the model decide
Validate and execute
Feed results back
Loop

🎯 Common interview questions

Q1. How does the LLM know when to call a function vs. answer directly?

It's trained on examples of "user message + tool definitions → either tool_call OR direct response." The tool's description in the schema is the most important signal. A description like "Get the current weather for a city" implicitly tells the model "use this when the user asks about weather." Vague descriptions ("Returns information") cause the model to either over-call or under-call. Clear descriptions, plus 2–3 example use cases in the system prompt, are typically the biggest lever on tool-calling accuracy.

Q2. What happens if the model produces malformed JSON?

Many current providers offer a constrained decoding mode that guarantees output matching your schema, barring truncation at the token limit. On OpenAI, for example, that is `strict: true` on the function definition for tool calls, or `response_format` with a `json_schema` for ordinary responses. If you're not using a constrained mode, wrap parsing in error handling, optionally attempt a repair pass, and keep a retry budget in which the parse error is sent back to the model. If retries fail, treat it as a non-tool turn and let the LLM respond directly.

Q3. Tool chaining vs. parallel tool calls: when do you use which?

Chaining — when later tools depend on earlier ones (search → summarize → email). Parallel — when calls are independent (weather AND calendar AND traffic). Most modern APIs let the model emit multiple tool_calls in one response; your code dispatches them concurrently and feeds back all results. Saves significant latency on multi-fact queries.

Q4. How do you keep the LLM from calling dangerous tools?

Layered defense, as in section 9. The first three layers stop most problems. (1) Permission scoping: declare allowed tools per user, per session, per trust level; the LLM never sees tools it can't use. (2) Argument validation: every tool argument is checked against the JSON schema *plus* business rules (no SQL `DROP`, no email to external domains for support agents, etc.) before execution. (3) Sandboxing: code execution tools run in a process- and network-isolated container with no privileges. On top of these, log every call for audit and require human approval for high-risk actions such as sending money or deleting data. Never let the model directly run shell commands.

Q5. What's the cost model: does function calling add tokens?

Yes, three times. (1) Tool definitions are sent in every request (typically a few hundred tokens). (2) The tool_call output adds tokens. (3) The tool result is sent back as input on the next turn. A common production tactic — only include tools that the message routing has determined are likely relevant (e.g., a "billing" agent only sees billing tools), instead of dumping all 50 tools every time.
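
The routing tactic can be sketched as a filter over a tagged tool catalog. Everything here is hypothetical: the `CATALOG` entries, the `agent` tag, and the use of serialized character count as a crude stand-in for token overhead (real counts depend on the provider's tokenizer):

```python
import json

# Hypothetical catalog: each tool definition is tagged with the agent that owns it.
CATALOG = [
    {"agent": "billing", "name": "get_invoice", "description": "Fetch an invoice by ID."},
    {"agent": "billing", "name": "issue_refund", "description": "Refund a charge."},
    {"agent": "support", "name": "create_ticket", "description": "Open a support ticket."},
    {"agent": "calendar", "name": "schedule_meeting", "description": "Book a meeting."},
]


def strip_tag(tool: dict) -> dict:
    """Drop the internal routing tag before the definition is sent to the model."""
    return {key: value for key, value in tool.items() if key != "agent"}


def tools_for(agent: str) -> list[dict]:
    """Return only the definitions the routed agent should see."""
    return [strip_tag(tool) for tool in CATALOG if tool["agent"] == agent]


def prompt_chars(tools: list[dict]) -> int:
    """Serialized size in characters: a rough proxy for per-request overhead."""
    return len(json.dumps(tools))


everything = [strip_tag(tool) for tool in CATALOG]
billing_only = tools_for("billing")
print([tool["name"] for tool in billing_only])
print(prompt_chars(billing_only), "of", prompt_chars(everything), "characters sent")
```

Because definitions are resent on every request, the saving applies to each turn of the conversation, not just the first.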
