An AI Gateway is the traffic-cop layer between your application and every LLM provider. It gives you one unified API across GPT, Claude, Gemini, and Llama; rate-limits per user/team/tier; semantic-caches identical queries to avoid re-paying; routes by complexity (cheap model for simple, big model for hard); falls back automatically when a provider is down; runs guardrails for PII and harmful content; and logs everything for cost, latency, and quality observability. Without one, every call to every LLM is a tiny ungoverned spend leak.
Always — once you have any production AI workload that calls more than one LLM, more than one model, or has more than ~1000 users. The gateway pays for itself the moment a provider has an outage, the moment a developer accidentally calls GPT-4 in a tight loop, or the moment your finance team asks "what is the AI bill per feature this month?" If you're pre-launch with one model and a tiny user base, skip it; you'll add it within six months.
The pattern that turns LLMs into infrastructure
Your app calls GPT-4. Six months later you also need Claude for some routes (better at long context). Then Gemini for cheap classification. Then Llama for an air-gapped enterprise deployment.
Suddenly your codebase has four LLM clients, four sets of error handling, four rate limit tracking systems, four cost dashboards, four sets of API keys to rotate. And on the day OpenAI has a regional outage, your support team is the one finding out.
Where the gateway lives in your stack
flowchart TD
Apps[Your Apps]
Auth[Auth / API Gateway<br/><i>your existing front door</i>]
GW[AI Gateway<br/><b>THIS layer</b><br/>routing · caching · fallbacks ·<br/>rate-limiting · guardrails · audit]
OAI[OpenAI<br/>GPT-4 / 4o-mini]
Anth[Anthropic<br/>Claude 3.5 / Haiku]
GG[Google<br/>Gemini Pro / Flash]
Self[Self-hosted<br/>Llama / Mistral]
Bed[AWS Bedrock]
Apps --> Auth
Auth --> GW
GW --> OAI
GW --> Anth
GW --> GG
GW --> Self
GW --> Bed
GW -.metrics.-> Obs[(Observability:<br/>cost, latency,<br/>quality, audit)]
GW -.cache.-> Cache[(Semantic cache:<br/>vector store of<br/>recent Q→A pairs)]
style Apps fill:#1c2333,stroke:#475569,color:#e7eaf1
style Auth fill:#1e3a8a,stroke:#3b82f6,color:#fff
style GW fill:#0e7490,stroke:#06b6d4,color:#fff
style OAI fill:#581c87,stroke:#a855f7,color:#fff
style Anth fill:#581c87,stroke:#a855f7,color:#fff
style GG fill:#581c87,stroke:#a855f7,color:#fff
style Self fill:#581c87,stroke:#a855f7,color:#fff
style Bed fill:#581c87,stroke:#a855f7,color:#fff
style Obs fill:#365314,stroke:#84cc16,color:#fff
style Cache fill:#9a3412,stroke:#f97316,color:#fff
It’s a separate microservice, deployed alongside your other backend services. Latency overhead is single-digit milliseconds — invisible against the 500–2000ms of an LLM call itself.
1. The unified API
Different providers have different schemas. OpenAI’s chat.completions, Anthropic’s messages, Google’s generateContent, all subtly different. The gateway exposes one shape — typically OpenAI-compatible because it’s the de facto standard:
gateway.chat.completions.create(
model="best-cheap-chat", # logical model name, not a real one
messages=[...],
tools=[...],
)
Behind the scenes, the gateway:
- Translates your request into the picked provider’s schema
- Calls the provider
- Translates the response back to the unified schema
- Returns
2. Rate limiting — the budget guardrail
Without a central rate limiter, any service in your fleet can blow your budget. The gateway tracks usage on multiple keys:
| Key | Limit | Why |
|---|---|---|
user_id | 1000 req/hr | Per-user fairness |
team_id | 100k tokens/hr | Per-team budget |
feature_id | 10k req/hr | Feature-level cost containment |
model_id | varies | Don’t exceed provider quotas |
total | global cap | Last-line-of-defense |
The mechanics are the same as any rate limiter — Redis token buckets, atomic Lua scripts, 429 responses with Retry-After.
When a tier hits the limit, the gateway can either reject (429), queue (with a deadline), or downgrade (use a cheaper model). The choice is per-route configuration.
3. Semantic caching — the cost killer
Lots of LLM workloads have repeated queries. FAQ bots, product Q&A, classification, embedding pipelines. Naively each one re-pays the LLM for an answer it has already generated.
Semantic caching:
- Embed the incoming query
- Search a vector cache of
(query_vec, response)pairs - If similarity > 0.97 with a cached entry, return the cached response
- Else call the LLM, store the result
4. Model routing — the smart cost optimizer
Not every query needs a 70B model. Most don’t.
A model router scores each query and picks the right tier:
- Simple classification, retrieval-augmented Q&A, summarization of short text → cheap model (GPT-4o-mini, Haiku, Gemini Flash)
- Multi-step reasoning, code generation, complex analysis → expensive model (GPT-4o, Claude Opus, Gemini Pro)
The scoring is itself an ML model — a small classifier that predicts task difficulty. Or a heuristic — query length, presence of code blocks, number of conversation turns. Or just user-tier — free users get the cheap model, paid users get the expensive one.
5. Fallback chains — resilience without code changes
Every provider has outages. OpenAI had multi-hour outages in 2023, 2024, 2025. Anthropic too. Google too. If your app depends on one, your app is down with them.
Fallback chains:
routes:
chat:
primary: openai/gpt-4o
fallbacks:
- anthropic/claude-3.5-sonnet
- google/gemini-1.5-pro
- openai/gpt-4o-mini # last resort
The gateway tracks each provider’s health. After N consecutive failures (or P95 latency above threshold), the circuit breaker opens — that provider is skipped for the next M minutes. Requests flow to the next provider in the chain.
flowchart LR
Req[Incoming<br/>request] --> P{Primary<br/>healthy?}
P -->|yes| OAI[OpenAI<br/>GPT-4o]
P -->|breaker open| F1{Fallback 1<br/>healthy?}
OAI -->|fail| F1
F1 -->|yes| Anth[Anthropic<br/>Claude 3.5]
F1 -->|no| F2{Fallback 2<br/>healthy?}
Anth -->|fail| F2
F2 -->|yes| GG[Google<br/>Gemini 1.5]
F2 -->|no| Last[GPT-4o-mini<br/>last resort]
GG -->|fail| Last
OAI --> R[Response<br/>to user]
Anth --> R
GG --> R
Last --> R
style Req fill:#1c2333,stroke:#475569,color:#e7eaf1
style P fill:#9a3412,stroke:#f97316,color:#fff
style F1 fill:#9a3412,stroke:#f97316,color:#fff
style F2 fill:#9a3412,stroke:#f97316,color:#fff
style OAI fill:#0e7490,stroke:#06b6d4,color:#fff
style Anth fill:#581c87,stroke:#a855f7,color:#fff
style GG fill:#1e3a8a,stroke:#3b82f6,color:#fff
style Last fill:#7e1d1d,stroke:#ef4444,color:#fff
style R fill:#365314,stroke:#84cc16,color:#fff
6. Guardrails — the safety layer
Every LLM call passes through input and output guardrails:
Input guardrails:
- PII detection — strip credit cards, SSNs, emails before sending to a 3rd-party LLM
- Prompt injection detection — flag attempts to break system prompts
- Banned patterns — keyword filter for things you never want sent to a public LLM (internal API keys, passwords)
Output guardrails:
- Content filters — block harmful, NSFW, or off-policy outputs
- Format validation — if the route promised JSON, reject non-JSON responses
- Faithfulness check (for RAG routes) — does the answer actually trace to retrieved chunks?
Guardrails add ~10–50ms of latency. Worth it for any user-facing route.
7. Observability — the dashboard you live in
The single biggest win of having a gateway is knowing what’s happening. Without one, every team has its own dashboard (or none at all). With one, every request is logged with:
- Tenant, user, feature, route
- Model used, fallback chain history
- Input tokens, output tokens, total cost
- Latency (p50, p95, p99)
- Cache hit / miss
- Guardrail block (if any)
- User feedback signal (if collected)
This feeds three dashboards:
- Finance — cost per feature, per team, per user, per model
- Reliability — error rates, latency percentiles, fallback frequency
- Quality — feedback ratings, guardrail block rate, faithfulness score
8. Cost controls — preventing surprise bills
LLM costs run away easily. A bug that retries on every error. A new feature that uses 10× more tokens than expected. A demo to a customer that calls GPT-4 in a tight loop.
The gateway enforces caps:
- Per-user daily budget
- Per-team monthly budget
- Per-feature spending alerts
- Auto-downgrade when budgets are exceeded
- Hard kill switch when 100% exhausted
9. Build vs. buy
Open source: LiteLLM, Portkey, OpenRouter (hosted), MLflow AI Gateway. Cover most features out of the box. Good if you have an infrastructure team.
Hosted: Cloudflare AI Gateway, Vercel AI SDK gateway, Helicone. Less to operate but data goes through them; suitable for non-regulated workloads.
Build: only when you have enterprise-specific requirements (regional data residency, custom audit needs, on-prem-only providers). Even then, you’re better off forking LiteLLM than starting from zero.
10. The shape of mature AI infrastructure
This is the most important piece of production AI infrastructure most teams haven’t built yet — and the one that turns LLMs from a cool feature into a manageable service.
Comments 0
Discuss this page. Markdown supported. Be kind.