🌐

AI Systems

AI Gateway

The traffic-cop in front of your LLM: routing, caching, fallbacks, rate limits, observability.

TL;DR

An AI Gateway is the traffic-cop layer between your application and every LLM provider. It gives you one unified API across GPT, Claude, Gemini, and Llama; rate-limits per user/team/tier; semantic-caches identical queries to avoid re-paying; routes by complexity (cheap model for simple, big model for hard); falls back automatically when a provider is down; runs guardrails for PII and harmful content; and logs everything for cost, latency, and quality observability. Without one, every call to every LLM is a tiny ungoverned spend leak.

When to use

Always — once you have any production AI workload that calls more than one LLM, more than one model, or has more than ~1000 users. The gateway pays for itself the moment a provider has an outage, the moment a developer accidentally calls GPT-4 in a tight loop, or the moment your finance team asks "what is the AI bill per feature this month?" If you're pre-launch with one model and a tiny user base, skip it; you'll add it within six months.

The pattern that turns LLMs into infrastructure

Your app calls GPT-4. Six months later you also need Claude for some routes (better at long context). Then Gemini for cheap classification. Then Llama for an air-gapped enterprise deployment.

Suddenly your codebase has four LLM clients, four sets of error handling, four rate limit tracking systems, four cost dashboards, four sets of API keys to rotate. And on the day OpenAI has a regional outage, your support team is the one finding out.

Where the gateway lives in your stack

flowchart TD
    Apps[Your Apps]
    Auth[Auth / API Gateway<br/><i>your existing front door</i>]
    GW[AI Gateway<br/><b>THIS layer</b><br/>routing · caching · fallbacks ·<br/>rate-limiting · guardrails · audit]
    OAI[OpenAI<br/>GPT-4 / 4o-mini]
    Anth[Anthropic<br/>Claude 3.5 / Haiku]
    GG[Google<br/>Gemini Pro / Flash]
    Self[Self-hosted<br/>Llama / Mistral]
    Bed[AWS Bedrock]

    Apps --> Auth
    Auth --> GW
    GW --> OAI
    GW --> Anth
    GW --> GG
    GW --> Self
    GW --> Bed

    GW -.metrics.-> Obs[(Observability:<br/>cost, latency,<br/>quality, audit)]
    GW -.cache.-> Cache[(Semantic cache:<br/>vector store of<br/>recent Q→A pairs)]

    style Apps fill:#1c2333,stroke:#475569,color:#e7eaf1
    style Auth fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style GW fill:#0e7490,stroke:#06b6d4,color:#fff
    style OAI fill:#581c87,stroke:#a855f7,color:#fff
    style Anth fill:#581c87,stroke:#a855f7,color:#fff
    style GG fill:#581c87,stroke:#a855f7,color:#fff
    style Self fill:#581c87,stroke:#a855f7,color:#fff
    style Bed fill:#581c87,stroke:#a855f7,color:#fff
    style Obs fill:#365314,stroke:#84cc16,color:#fff
    style Cache fill:#9a3412,stroke:#f97316,color:#fff

It’s a separate microservice, deployed alongside your other backend services. Latency overhead is single-digit milliseconds — invisible against the 500–2000ms of an LLM call itself.

1. The unified API

Different providers have different schemas. OpenAI’s chat.completions, Anthropic’s messages, Google’s generateContent, all subtly different. The gateway exposes one shape — typically OpenAI-compatible because it’s the de facto standard:

gateway.chat.completions.create(
    model="best-cheap-chat",  # logical model name, not a real one
    messages=[...],
    tools=[...],
)

Behind the scenes, the gateway:

Translates your request into the picked provider’s schema
Calls the provider
Translates the response back to the unified schema
Returns

2. Rate limiting — the budget guardrail

Without a central rate limiter, any service in your fleet can blow your budget. The gateway tracks usage on multiple keys:

Key	Limit	Why
`user_id`	1000 req/hr	Per-user fairness
`team_id`	100k tokens/hr	Per-team budget
`feature_id`	10k req/hr	Feature-level cost containment
`model_id`	varies	Don’t exceed provider quotas
`total`	global cap	Last-line-of-defense

The mechanics are the same as any rate limiter — Redis token buckets, atomic Lua scripts, 429 responses with Retry-After.

When a tier hits the limit, the gateway can either reject (429), queue (with a deadline), or downgrade (use a cheaper model). The choice is per-route configuration.

3. Semantic caching — the cost killer

Lots of LLM workloads have repeated queries. FAQ bots, product Q&A, classification, embedding pipelines. Naively each one re-pays the LLM for an answer it has already generated.

Semantic caching:

Embed the incoming query
Search a vector cache of (query_vec, response) pairs
If similarity > 0.97 with a cached entry, return the cached response
Else call the LLM, store the result

4. Model routing — the smart cost optimizer

Not every query needs a 70B model. Most don’t.

A model router scores each query and picks the right tier:

Simple classification, retrieval-augmented Q&A, summarization of short text → cheap model (GPT-4o-mini, Haiku, Gemini Flash)
Multi-step reasoning, code generation, complex analysis → expensive model (GPT-4o, Claude Opus, Gemini Pro)

The scoring is itself an ML model — a small classifier that predicts task difficulty. Or a heuristic — query length, presence of code blocks, number of conversation turns. Or just user-tier — free users get the cheap model, paid users get the expensive one.

5. Fallback chains — resilience without code changes

Every provider has outages. OpenAI had multi-hour outages in 2023, 2024, 2025. Anthropic too. Google too. If your app depends on one, your app is down with them.

Fallback chains:

routes:
  chat:
    primary: openai/gpt-4o
    fallbacks:
      - anthropic/claude-3.5-sonnet
      - google/gemini-1.5-pro
      - openai/gpt-4o-mini   # last resort

The gateway tracks each provider’s health. After N consecutive failures (or P95 latency above threshold), the circuit breaker opens — that provider is skipped for the next M minutes. Requests flow to the next provider in the chain.

flowchart LR
    Req[Incoming<br/>request] --> P{Primary<br/>healthy?}
    P -->|yes| OAI[OpenAI<br/>GPT-4o]
    P -->|breaker open| F1{Fallback 1<br/>healthy?}
    OAI -->|fail| F1
    F1 -->|yes| Anth[Anthropic<br/>Claude 3.5]
    F1 -->|no| F2{Fallback 2<br/>healthy?}
    Anth -->|fail| F2
    F2 -->|yes| GG[Google<br/>Gemini 1.5]
    F2 -->|no| Last[GPT-4o-mini<br/>last resort]
    GG -->|fail| Last

    OAI --> R[Response<br/>to user]
    Anth --> R
    GG --> R
    Last --> R

    style Req fill:#1c2333,stroke:#475569,color:#e7eaf1
    style P fill:#9a3412,stroke:#f97316,color:#fff
    style F1 fill:#9a3412,stroke:#f97316,color:#fff
    style F2 fill:#9a3412,stroke:#f97316,color:#fff
    style OAI fill:#0e7490,stroke:#06b6d4,color:#fff
    style Anth fill:#581c87,stroke:#a855f7,color:#fff
    style GG fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style Last fill:#7e1d1d,stroke:#ef4444,color:#fff
    style R fill:#365314,stroke:#84cc16,color:#fff

6. Guardrails — the safety layer

Every LLM call passes through input and output guardrails:

Input guardrails:

PII detection — strip credit cards, SSNs, emails before sending to a 3rd-party LLM
Prompt injection detection — flag attempts to break system prompts
Banned patterns — keyword filter for things you never want sent to a public LLM (internal API keys, passwords)

Output guardrails:

Content filters — block harmful, NSFW, or off-policy outputs
Format validation — if the route promised JSON, reject non-JSON responses
Faithfulness check (for RAG routes) — does the answer actually trace to retrieved chunks?

Guardrails add ~10–50ms of latency. Worth it for any user-facing route.

7. Observability — the dashboard you live in

The single biggest win of having a gateway is knowing what’s happening. Without one, every team has its own dashboard (or none at all). With one, every request is logged with:

Tenant, user, feature, route
Model used, fallback chain history
Input tokens, output tokens, total cost
Latency (p50, p95, p99)
Cache hit / miss
Guardrail block (if any)
User feedback signal (if collected)

This feeds three dashboards:

Finance — cost per feature, per team, per user, per model
Reliability — error rates, latency percentiles, fallback frequency
Quality — feedback ratings, guardrail block rate, faithfulness score

8. Cost controls — preventing surprise bills

LLM costs run away easily. A bug that retries on every error. A new feature that uses 10× more tokens than expected. A demo to a customer that calls GPT-4 in a tight loop.

The gateway enforces caps:

Per-user daily budget
Per-team monthly budget
Per-feature spending alerts
Auto-downgrade when budgets are exceeded
Hard kill switch when 100% exhausted

9. Build vs. buy

Open source: LiteLLM, Portkey, OpenRouter (hosted), MLflow AI Gateway. Cover most features out of the box. Good if you have an infrastructure team.

Hosted: Cloudflare AI Gateway, Vercel AI SDK gateway, Helicone. Less to operate but data goes through them; suitable for non-regulated workloads.

Build: only when you have enterprise-specific requirements (regional data residency, custom audit needs, on-prem-only providers). Even then, you’re better off forking LiteLLM than starting from zero.

10. The shape of mature AI infrastructure

This is the most important piece of production AI infrastructure most teams haven’t built yet — and the one that turns LLMs from a cool feature into a manageable service.

🧪 Simulator soon

An interactive simulator for this concept is on the way — tweak the knobs, watch behaviour change in real time.

🎨 Visualization soon

An interactive diagram you can hover, click, and explore.

💻 Code Phase 4 soon

A 30-line build challenge with starter code, hints, and a reference implementation.

🎯 Common interview questions

Q1. Why not just call OpenAI directly from each service? ▾

Three reasons it falls apart. (1) Vendor lock-in — switching providers means changing every service. (2) No central rate limiting, so any service can blow your budget. (3) No fallback — if OpenAI is down, your whole product is down. The gateway is the abstraction that gives you optionality, governance, and resilience in one place. The cost is one extra hop (~5–15ms) which is dwarfed by LLM call latency anyway.

Q2. How does semantic caching work and when does it help? ▾

Each incoming query is embedded into a vector. The gateway searches a vector cache of recent (query_vector, response) pairs. A hit (cosine similarity above ~0.97) returns the cached response without calling the LLM. Helps when many users ask the same question — FAQ bots, product Q&A, classification. Doesn't help — and is dangerous — when answers depend on user-specific context. Always scope the cache by tenant; never cache across users.

Q3. How does the gateway pick which model to call? ▾

Three policies, often combined. (1) Tier-based — paid users get GPT-4, free users get GPT-4o-mini. (2) Complexity-based — a small classifier estimates how hard the query is and routes accordingly. (3) Cost-budget-based — once a user hits 80% of their budget, downgrade to a cheaper model. Most production gateways start with tier-based and graduate to complexity routing once they have eval data showing it works.

Q4. What's the right fallback policy when a provider fails? ▾

Define fallbacks per route. For most chat routes, the chain is `primary → secondary same-class → smaller model same provider`. Each provider has a circuit breaker — after N consecutive failures, mark unhealthy and skip for the next M minutes. Crucial — the fallback model must be capable of producing a comparable answer; falling back from GPT-4 to GPT-3.5 may break structured outputs that GPT-3.5 fails at. Test the fallback path regularly.

Q5. Where does the gateway live in the request flow? ▾

As a separate service, deployed behind your API gateway / authentication layer. Apps call `gateway.example.com/chat`; the gateway authenticates (your existing auth tokens), routes to the appropriate provider, applies the policies, and returns. Some teams deploy it as a sidecar; most run it as a standalone service for visibility. The most-used open-source options are LiteLLM, Portkey, OpenRouter; commercial ones include Cloudflare AI Gateway and Vercel AI SDK gateway.

↗ Related concepts

Comments 0

Discuss this page. Markdown supported. Be kind.

Loading…

Loading comments…