Agentic Workflows

Multi-agent orchestration: planner, executor, critic — and how they coordinate without falling over.

TL;DR

An agentic system is the architectural step beyond "prompt → response." An Orchestrator LLM receives a task, plans subtasks, dispatches them through a message queue to specialist agents (Researcher, Coder, Reviewer), each with tools and shared memory (Vector DB + Redis). A guardrail layer validates every output, retries on low confidence, and escalates to humans when stuck. The result — a self-healing team of AI agents that plans, executes, recovers, and ships work autonomously.

When to use

When a task requires more than one LLM call to complete, has steps that depend on each other, needs tools (web search, code execution, API calls), and benefits from recovery loops (retry, escalate). Examples — research reports, code generation pipelines, customer support workflows that escalate, content production lines, autonomous testing, data pipelines that ask for human input on edge cases. Don't use it for one-shot Q&A or simple summarization — that's overkill and adds latency.

What changed when LLMs got tools

The original LLM API was stateless: text in, text out, one shot. Anything beyond a single response had to be hand-coded by you. For multi-step work, that model is giving way to the agent: an LLM in a loop, with memory and tools, deciding what to do next based on what just happened.

This sounds more impressive than it is. An agent is just:

while not done:
    plan = llm("here is the task and what's happened so far. what should we do next?")
    result = execute(plan)  # could be a tool call, sub-LLM call, or terminal answer
    update_state(plan, result)

The architecture around that loop is what makes it production-grade.
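A minimal runnable sketch of that loop, assuming a stubbed `fake_llm` in place of a real model call and a hard step cap in place of the open-ended `while`. Every name here (`AgentState`, `fake_llm`, `execute`, `run_agent`) is hypothetical and only illustrates the control flow.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentState:
    task: str
    history: list = field(default_factory=list)  # (plan, result) pairs so far
    answer: Optional[str] = None                  # set once the loop terminates


def fake_llm(state: AgentState) -> dict:
    """Stand-in for a real model call: use a tool once, then answer."""
    if not state.history:
        return {"action": "tool", "name": "lookup", "arg": state.task}
    return {"action": "final", "text": f"answer based on {state.history[-1][1]}"}


def execute(plan: dict) -> str:
    """Run a (pretend) tool call, or pass a terminal answer straight through."""
    if plan["action"] == "tool":
        return f"lookup({plan['arg']!r})"
    return plan["text"]


def run_agent(task: str, max_steps: int = 5) -> AgentState:
    state = AgentState(task)
    for _ in range(max_steps):  # hard cap so the loop cannot run forever
        plan = fake_llm(state)
        result = execute(plan)
        state.history.append((plan, result))  # the update_state step
        if plan["action"] == "final":
            state.answer = result
            break
    return state
```

Calling `run_agent("capital of France")` takes two iterations here: one tool call, then a final answer built from the tool result. If the cap is reached first, `answer` stays `None`, which is the signal to retry or escalate.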

1. The Orchestrator is the brain

The Orchestrator is one LLM (usually a high-quality model — GPT-4, Claude Opus) with three responsibilities:

Decompose the user task into subtasks
Dispatch subtasks to specialist agents
Monitor progress, decide when a subtask is done, when to retry, when to escalate

A canonical orchestrator prompt:

You are coordinating a team of specialist agents. The user wants:
{task}

Team:
- Researcher: gathers facts via web search and document retrieval
- Coder: writes and runs code
- Reviewer: validates outputs against requirements

Decompose the task. Output a plan as JSON:
{ "steps": [{ "agent": "Researcher", "subtask": "..." }, ...] }
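Because the plan comes back as model-generated text, it is worth validating before anything is dispatched. The sketch below assumes the JSON shape shown in the prompt above and the three-agent team; `parse_plan` and `KNOWN_AGENTS` are hypothetical names.

```python
import json

KNOWN_AGENTS = {"Researcher", "Coder", "Reviewer"}


def parse_plan(raw: str) -> list[dict]:
    """Parse the orchestrator's JSON plan; raise ValueError if it is malformed."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc

    steps = plan.get("steps") if isinstance(plan, dict) else None
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must contain a non-empty 'steps' list")

    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            raise ValueError(f"step {i} is not an object")
        if step.get("agent") not in KNOWN_AGENTS:
            raise ValueError(f"step {i} names an unknown agent: {step.get('agent')!r}")
        if not isinstance(step.get("subtask"), str) or not step["subtask"].strip():
            raise ValueError(f"step {i} has no subtask text")
    return steps
```

A `ValueError` here is a natural trigger for re-prompting the orchestrator with the error message attached, rather than letting a bad plan reach the queue.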

2. Specialist agents — separation of concerns

Each specialist agent is a smaller LLM (often a cheap model — GPT-4o-mini, Haiku) prompted with one focused job:

Agent	Tools	When to invoke
Researcher	Web search, vector DB query, file read	“Find facts on X”
Coder	Code interpreter, file system, git	“Write code that does Y”
Reviewer	Diff, test runner, lint	“Validate that the output meets Z”
Planner	Sub-orchestration	Recursive breakdown of complex steps
Critic	Self-critique, fact-checking	Pre-exit quality gate

3. Message queue — the connective tissue

Specialist agents don’t call each other directly. They communicate through a message queue — Kafka, RabbitMQ, or even a simple Redis stream.

With a queue:

Tasks are durable — survive crashes
Work is parallelized — multiple agents can run simultaneously
Retries are free — re-publish on failure
Audit trail comes free — every message is logged

A typical message:

{
  "task_id": "task-abc123",
  "from": "orchestrator",
  "to": "researcher",
  "subtask": "Find the top 3 patents related to vector databases",
  "context_refs": ["vec://memory/task-abc123"],
  "deadline": "2026-05-05T16:00:00Z",
  "retries_remaining": 3
}
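To illustrate how `retries_remaining` can drive re-publishing, here is a small in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for Kafka, RabbitMQ, or Redis Streams, so it shows the retry logic only, not durability. `drain`, `flaky_researcher`, and the dead-letter list are assumptions made for the example.

```python
import queue


def drain(q: queue.Queue, handler) -> tuple[list, list]:
    """Process messages until the queue is empty.

    Returns (results, dead_letters). A failed message is re-published with
    retries_remaining decremented; at zero it goes to the dead-letter list.
    """
    results, dead_letters = [], []
    while not q.empty():
        msg = q.get()
        try:
            results.append(handler(msg))
        except Exception:
            if msg["retries_remaining"] > 0:
                q.put({**msg, "retries_remaining": msg["retries_remaining"] - 1})
            else:
                dead_letters.append(msg)
    return results, dead_letters


def flaky_researcher(msg: dict) -> str:
    """Hypothetical handler that fails until the message has been retried twice."""
    if msg["retries_remaining"] > 1:
        raise RuntimeError("simulated transient failure")
    return f"{msg['to']} finished {msg['task_id']}"


q = queue.Queue()
q.put({
    "task_id": "task-abc123",
    "from": "orchestrator",
    "to": "researcher",
    "subtask": "Find the top 3 patents related to vector databases",
    "retries_remaining": 3,
})
results, dead = drain(q, flaky_researcher)
```

In this run the message fails twice, succeeds on the third delivery, and nothing is dead-lettered. A message that keeps failing ends up in `dead`, which is where a human escalation would hook in.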

4. Shared memory

Agents are stateful by design. Memory is the most important architectural decision in an agentic system.

Two stores, two purposes:

Vector DB (long-lived semantic memory). Research findings, prior decisions, code snippets, summaries: anything an agent might want to recall later via similarity search. Every agent reads; only the orchestrator (or a dedicated “Memorizer” agent) writes.

Redis (short-lived task state). The current state of each in-flight task — current step, partial results, lock holders. Faster than the vector DB, structured, transactional. Cleared when the task completes.
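The split between the two stores can be sketched with in-memory stand-ins. This is an illustration under loud assumptions: `SemanticMemory` replaces a real vector DB and uses word counts with cosine similarity instead of model embeddings, and `TaskState` replaces Redis with a plain dict. Neither class reflects any real client API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticMemory:
    """Stand-in for the vector DB: long-lived, queried by similarity."""

    def __init__(self):
        self._items = []  # list of (embedding, text)

    def write(self, text: str) -> None:
        self._items.append((embed(text), text))

    def recall(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


class TaskState:
    """Stand-in for Redis: short-lived, keyed by task, dropped on completion."""

    def __init__(self):
        self._state = {}

    def update(self, task_id: str, **fields) -> None:
        self._state.setdefault(task_id, {}).update(fields)

    def get(self, task_id: str) -> dict:
        return dict(self._state.get(task_id, {}))

    def complete(self, task_id: str) -> None:
        self._state.pop(task_id, None)
```

The point of the sketch is the difference in access pattern: semantic memory is appended to and searched by meaning, while task state is read and overwritten by key and discarded when the task finishes.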

5. Tools — how agents act on the world

A tool is a function the agent can call to read or change external state. Tools are how you cross the boundary from “language” to “action.” See the Function Calling article for the full mechanism — in agentic systems, every tool call is the agent’s choice, not the developer’s.

Common tool kits:

Read — web_search, vector_query, http_get, file_read, db_query
Reason — code_interpreter (sandboxed Python)
Act — send_email, create_record, post_message, run_test
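One way to wire tools to an agent is a registry plus a per-agent allow-list, sketched below. The request shape `{"tool": ..., "args": {...}}`, the `tool` decorator, `call_tool`, and the `add` tool are all hypothetical; only `file_read` borrows its name from the list above.

```python
from typing import Callable

TOOLS: dict[str, Callable[..., object]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("file_read")
def file_read(path: str) -> str:
    with open(path, encoding="utf-8") as fh:
        return fh.read()


@tool("add")
def add(a: float, b: float) -> float:
    return a + b


def call_tool(request: dict, allowed: set[str]) -> dict:
    """Dispatch a model-issued request such as {"tool": "add", "args": {...}}.

    Errors are returned as data so the agent can react to them.
    """
    name = request.get("tool")
    if name not in allowed or name not in TOOLS:
        return {"ok": False, "error": f"tool not permitted: {name!r}"}
    try:
        return {"ok": True, "result": TOOLS[name](**request.get("args", {}))}
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

The allow-list keeps the choice of tool with the agent while keeping the set of possible choices with the developer: a Researcher can be given only the read tools, for instance.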

6. The hallucination guard

This is the most underrated piece of the system. Every output that crosses an agent boundary — especially outputs that leave the system to a user or external API — runs through a critic.

Three checks:

Self-critique: a second LLM call asks, “Here is the proposed answer. Does it actually answer the question? Is it grounded in the retrieved context? Are there factual claims that need verification? Score 1–10.”
Faithfulness: for RAG-grounded outputs, verify that each claim traces to a retrieved chunk, scored on the same 1–10 scale
Confidence threshold: if either check scores below 7/10, retry with a stricter prompt or escalate
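The way the three checks combine can be sketched as follows. This assumes the self-critique score arrives as an integer from a separate LLM call, and it uses plain substring matching as a crude stand-in for a real claim-to-chunk entailment check. `Verdict`, `faithfulness`, and `guard` are hypothetical names.

```python
from dataclasses import dataclass

THRESHOLD = 7  # scores below this trigger a retry or escalation


@dataclass
class Verdict:
    critique_score: int      # 1-10, from the self-critique call
    faithfulness_score: int  # 1-10, derived from claim coverage
    passed: bool


def faithfulness(claims: list[str], chunks: list[str]) -> int:
    """Score 1-10 by the share of claims found verbatim in some retrieved chunk."""
    if not claims:
        return 10
    supported = sum(any(c.lower() in chunk.lower() for chunk in chunks) for c in claims)
    return max(1, round(10 * supported / len(claims)))


def guard(critique_score: int, claims: list[str], chunks: list[str]) -> Verdict:
    """Pass only if both the self-critique and the faithfulness score clear the bar."""
    f = faithfulness(claims, chunks)
    return Verdict(critique_score, f, passed=min(critique_score, f) >= THRESHOLD)
```

Taking the minimum of the two scores means one weak check is enough to block an output, which matches the "if either check" wording of the threshold rule.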

7. Failure recovery — graceful degradation

Things go wrong constantly:

An LLM call returns garbage
A tool times out
An agent gets stuck in a loop
A subtask exceeds budget
An external API rate-limits

The architecture survives by treating failure as data:

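class LowConfidence(Exception):
    """Raised when the critic rejects a result."""

def run_subtask(subtask):
    # agent, critic, simplify, clarify and escalate_to_human are assumed to exist elsewhere
    for attempt in range(3):
        try:
            result = agent.execute(subtask, timeout=60)
            if critic.passes(result):
                return result
            raise LowConfidence(result)  # critic rejected it: handle like any other failure
        except TimeoutError:
            subtask = simplify(subtask)  # timed out: retry with a smaller scope
        except LowConfidence:
            subtask = clarify(subtask)   # low confidence: retry with tighter instructions
    return escalate_to_human(subtask)    # out of attempts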

8. The control loop — putting it together

flowchart TD
    U[User Task] --> O[Orchestrator<br/><i>decompose into plan</i>]
    O --> Q{{Message Queue<br/>Kafka / Redis Streams}}
    Q --> R[Researcher<br/>web · vector · files]
    Q --> C[Coder<br/>sandbox · git · tests]
    Q --> RV[Reviewer<br/>diff · lint · validate]

    VDB[(Vector DB<br/>long-lived memory)]
    RD[(Redis<br/>task state)]

    VDB -->|read| R
    VDB -->|read| C
    VDB -->|read| RV
    O -->|validated writes| VDB
    R <-->|state| RD
    C <-->|state| RD
    RV <-->|state| RD

    R --> Crit[Critic<br/>self-critique · faithfulness · confidence]
    C --> Crit
    RV --> Crit
    Crit -->|low conf| Q
    Crit -->|escalate| H[Human-in-the-loop]
    Crit -->|pass| Agg[Aggregator<br/>synthesise final answer]
    Agg --> Resp[Response to user]

    style U fill:#1c2333,stroke:#475569,color:#e7eaf1
    style O fill:#0e7490,stroke:#06b6d4,color:#fff
    style Q fill:#9a3412,stroke:#f97316,color:#fff
    style R fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style C fill:#581c87,stroke:#a855f7,color:#fff
    style RV fill:#365314,stroke:#84cc16,color:#fff
    style VDB fill:#0f1320,stroke:#475569,color:#cdd3df
    style RD fill:#0f1320,stroke:#475569,color:#cdd3df
    style Crit fill:#7e1d1d,stroke:#ef4444,color:#fff
    style H fill:#9a3412,stroke:#f97316,color:#fff
    style Agg fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style Resp fill:#365314,stroke:#84cc16,color:#fff

Every arrow can fail; every arrow has a retry, a timeout, and an escalation path.

9. What “production-grade” actually means

A weekend agentic demo runs three LLM calls and prints the result. A production agentic system has:

Step budget — max iterations, kill switch
Cost budget — per-task, per-user, hard cap
Latency budget — total time before timeout
Memory bound — context windows can’t grow forever
Audit log — every LLM call, tool call, decision, recorded
Replay — you can re-run any task from the audit log
Observability — dashboards on success rate, cost per task, tool usage, hallucination rate
Human-in-the-loop — clear escalation triggers and UI for human review
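The step, cost, and latency budgets in that list can be enforced by one small object that every LLM or tool call reports to. A minimal sketch, assuming costs are tracked in US dollars and that the caller supplies the per-call cost; `Budget` and `BudgetExceeded` are hypothetical names.

```python
import time


class BudgetExceeded(Exception):
    pass


class Budget:
    """Tracks step, cost, and wall-clock limits for one task."""

    def __init__(self, max_steps: int, max_cost_usd: float, max_seconds: float,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_seconds
        self.steps = 0
        self.cost_usd = 0.0
        self._clock = clock  # injectable so tests can fake the passage of time
        self._start = clock()

    def charge(self, cost_usd: float) -> None:
        """Record one step and its cost; raise as soon as any limit is crossed."""
        self.steps += 1
        self.cost_usd += cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost budget exhausted")
        if self._clock() - self._start > self.max_seconds:
            raise BudgetExceeded("latency budget exhausted")
```

Calling `budget.charge(cost)` after each model or tool call turns all three limits into a single kill switch: the control loop catches `BudgetExceeded` and stops or escalates.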

10. Where agents are real (and where they’re hype)

Real today:

Code generation pipelines (Cursor’s agents, Devin, Aider)
Customer support escalation (Decagon, Sierra)
Research and analysis (Perplexity Pro, OpenAI Deep Research)
Software testing (autonomous QA bots)
Knowledge worker assistants (Glean, Notion AI)

Hype today:

Fully autonomous “personal agents that book your flights and manage your life” (the trust and safety surface is too big)
Agents that learn permanently from each interaction (memory updates without curation are a liability)
“AGI through agentic loops” (no, agents are software architecture, not consciousness)

🎯 Common interview questions

Q1. What's the difference between a chain and an agent?

A chain is a fixed sequence — step 1 → step 2 → step 3, hard-coded. An agent is a control loop where the LLM **decides** what step comes next based on current state. Chains are reliable but inflexible; agents are flexible but harder to make reliable. Production systems usually have agentic *control* with chained *substeps* — the LLM decides which sub-pipeline to invoke, but each pipeline is deterministic.

Q2. How do you stop an agent from looping forever?

Three layers. (1) Hard step budget — kill after N iterations regardless. (2) Cost budget — kill after $X spent. (3) Progress detection — if the last 3 outputs are semantically similar, the agent is stuck; escalate or stop. Without these, an agent can spin in self-talk forever and eat your budget.
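The progress-detection layer can be sketched with the standard library. Note the substitution: this uses character-level similarity from `difflib.SequenceMatcher` as a cheap proxy, whereas true semantic similarity would need embeddings. `is_stuck` and its default threshold of 0.9 are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def is_stuck(outputs: list[str], window: int = 3, threshold: float = 0.9) -> bool:
    """True if the last `window` outputs are all near-duplicates of each other."""
    if len(outputs) < window:
        return False
    recent = outputs[-window:]
    return all(
        SequenceMatcher(None, recent[i], recent[j]).ratio() >= threshold
        for i in range(window)
        for j in range(i + 1, window)
    )
```

Checked once per iteration, this catches the common failure where an agent keeps restating the same plan; the step and cost budgets remain the backstop for loops it misses.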

Q3. How do agents share state without stepping on each other?

Two stores. **Vector DB** for long-lived knowledge (research findings, past decisions, code snippets). **Redis** (or similar fast KV) for short-lived task state, locks, and per-agent scratchpads. Agents read freely; writes go through an orchestrator-validated channel so two agents can't simultaneously update the same task to conflicting states.
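One common way to prevent conflicting updates is a versioned compare-and-set: a write succeeds only if the task has not changed since the writer last read it. The sketch below is an in-memory illustration of that idea, not the orchestrator-validated channel itself and not a Redis client. `TaskStore` is a hypothetical class.

```python
import threading


class TaskStore:
    """In-memory stand-in for a KV store with optimistic, versioned writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # task_id -> (version, state)

    def read(self, task_id: str):
        """Return (version, state); an unknown task reads as (0, None)."""
        with self._lock:
            return self._data.get(task_id, (0, None))

    def write(self, task_id: str, expected_version: int, new_state: str) -> bool:
        """Apply the write only if nobody has written since expected_version."""
        with self._lock:
            version, _ = self._data.get(task_id, (0, None))
            if version != expected_version:
                return False  # stale writer loses and must re-read
            self._data[task_id] = (version + 1, new_state)
            return True
```

If two agents both read version 0 and then try to write, the first write succeeds and the second returns `False`, so the losing agent re-reads the current state before deciding what to do.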

Q4. When should an agent use a tool vs. continue reasoning in-prompt?

Use a tool when (1) the answer requires fresh information (web, DB, files), (2) the answer requires precise computation (math, code execution), (3) the answer requires external action (sending email, creating a record). Stay in-prompt for (a) reasoning over information already in context, (b) creative generation, (c) summarization. The agent's system prompt should include explicit examples of when to call each tool.

Q5. How do you guard against hallucination in an agentic system?

Triple-check at exit. (1) Self-critique: a second LLM call with a critic prompt scores the proposed output. (2) Tool-grounded verification: claims that should be fact-checkable get re-asked through a search tool. (3) Confidence threshold: if either check comes back low-confidence, retry with a different prompt or escalate to a human. To control cost, the expensive checks run only on outputs that *would* leave the system; intermediate steps skip them.
