A rate limiter is a fast token-or-counter check at the edge of your system that says "yes, no, or wait" before any expensive work happens. The four canonical algorithms — fixed window, sliding window, token bucket, leaky bucket — each trade burst tolerance against fairness, and the right one depends on what you're protecting and from whom.
Put one anywhere you have an external API surface: public APIs, login endpoints, signup flows, password reset, search, AI inference, and anything that calls a paid third party. Use it for fairness (multi-tenant SaaS), abuse prevention (DDoS, scraping, brute-force), cost control (LLM token quotas), and SLA protection (so one runaway client can't starve the rest).
Why this is more than “just a counter”
Every rate limiter is a counter under the hood. What makes it interesting is that it sits on the hottest path in your architecture — every request goes through it before doing any work. Get it wrong and you either let abuse through, or you accidentally rate-limit your own customers during a traffic spike.
The job of a rate limiter is to answer one question, in under a millisecond, billions of times a day: should this request be allowed right now?
Three inputs determine the answer: who, what, and when.
1. The four canonical algorithms
Four algorithm families cover nearly everything used in production; sliding window comes in two flavors (log and weighted counter). Everything else is a variant.
Fixed window
Keep a counter per (user, minute). Reset it every minute. Cheap, simple, and broken at boundaries — a user can fire 100 requests in second 59 and another 100 in second 0 of the next minute, getting 200 in 2 seconds despite a 100/min limit.
key: "rl:user:42:2026-05-05T15:42"
INCR → 73
EXPIRE 60
Use this when you don’t care about boundary bursts (most non-abusive workloads).
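The boundary problem is easy to reproduce. Below is a minimal in-memory sketch, not a production implementation: the class name `FixedWindowLimiter` and the explicit `now` argument are assumptions made for illustration, and a plain dict stands in for the Redis INCR/EXPIRE pair.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed-window counter (hypothetical stand-in for INCR + EXPIRE)."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        # (user, window index) -> count. Old windows are never evicted here;
        # in Redis the EXPIRE call handles that.
        self.counts = defaultdict(int)

    def allow(self, user, now):
        # Every timestamp inside the same window shares one counter.
        window = int(now // self.window_seconds)
        key = (user, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100)
# 100 requests at second 59 of one minute, then 100 more at second 0 of the next.
late = sum(limiter.allow("user:42", now=59.0) for _ in range(100))
early = sum(limiter.allow("user:42", now=60.0) for _ in range(100))
print(late + early)  # 200 accepted about one second apart, despite a 100/min limit
```

Passing `now` explicitly keeps the sketch deterministic; a real limiter would read the clock itself.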
Sliding window log
Store every request's timestamp in a sorted set. To check, count entries in the last 60s. Perfectly accurate, but memory is O(n) per user, where n is the number of requests allowed in the window, which is bad at scale.
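As a rough illustration, here is the same idea in memory, with a `deque` per user standing in for the Redis sorted set. The class name `SlidingWindowLog` is hypothetical, and the sketch assumes timestamps arrive in non-decreasing order for each user.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Sliding window log; one stored timestamp per accepted request."""

    def __init__(self, limit, window_seconds=60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(deque)  # user -> timestamps, oldest first

    def allow(self, user, now):
        log = self.logs[user]
        # Drop timestamps that have slid out of the window.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


log_limiter = SlidingWindowLog(limit=3)
times = [0.0, 10.0, 20.0, 30.0, 60.0, 61.0]
decisions = [log_limiter.allow("user:42", t) for t in times]
print(decisions)  # [True, True, True, False, True, False]
```

The memory cost is visible in the data structure: each user holds up to `limit` timestamps at any moment.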
Sliding window counter (weighted)
The pragmatic compromise. Keep counts for the current and previous window, weight the previous one by how much it overlaps:
effective = current + previous × ((window_size - elapsed_in_current) / window_size)
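A small sketch of the weighted formula, under the same assumptions as before (in-memory dict instead of Redis, explicit `now`, hypothetical class name `SlidingWindowCounter`). It replays the boundary burst that the fixed window lets through.

```python
class SlidingWindowCounter:
    """Weighted sliding window: two counters per user instead of a full log."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # (user, window index) -> count

    def allow(self, user, now):
        window = int(now // self.window_seconds)
        elapsed = now - window * self.window_seconds
        current = self.counts.get((user, window), 0)
        previous = self.counts.get((user, window - 1), 0)
        # Fraction of the previous window still inside the sliding window.
        weight = (self.window_seconds - elapsed) / self.window_seconds
        effective = current + previous * weight
        if effective >= self.limit:
            return False
        self.counts[(user, window)] = current + 1
        return True


swc = SlidingWindowCounter(limit=100)
at_59 = sum(swc.allow("user:42", now=59.0) for _ in range(100))
at_60 = sum(swc.allow("user:42", now=60.0) for _ in range(100))
at_90 = sum(swc.allow("user:42", now=90.0) for _ in range(100))
print(at_59, at_60, at_90)  # 100 0 50
```

At second 60 the previous window still counts in full, so the second burst is rejected; halfway through the new window its weight has dropped to 0.5, which frees up 50 requests.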
Token bucket
Conceptually different. The user has a bucket of tokens that refills at a steady rate (R tokens/sec) up to a max (B tokens). Each request consumes one token. If the bucket is empty, the request is rejected.
tokens = min(B, tokens + (now - last_refill) × R)
last_refill = now
if tokens >= 1: tokens -= 1, allow
else: reject
The genius: the bucket capacity B controls burst tolerance and the refill rate R controls the average rate, so you can tune them independently. AWS API Gateway and Stripe both document token-bucket throttling.
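The pseudocode above translates almost line for line. This is a single-process sketch with a hypothetical `TokenBucket` class and an explicit `now` argument, meant only to show how B and R behave separately.

```python
class TokenBucket:
    """Token bucket with capacity B and refill rate R (tokens per second)."""

    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity        # B: maximum burst size
        self.refill_rate = refill_rate  # R: sustained average rate
        self.tokens = float(capacity)   # a new bucket starts full
        self.last_refill = now

    def allow(self, now):
        # Refill lazily, based on the time since the last call.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=5, refill_rate=1.0)
burst = sum(bucket.allow(now=0.0) for _ in range(7))          # full bucket: B requests pass
after_2s = sum(bucket.allow(now=2.0) for _ in range(3))       # 2 seconds refilled 2 tokens
after_idle = sum(bucket.allow(now=100.0) for _ in range(10))  # refill is capped at B
print(burst, after_2s, after_idle)  # 5 2 5
```

Raising `capacity` changes only the first and last numbers (burst size); raising `refill_rate` changes only how quickly tokens come back.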
Leaky bucket
Inverse of token bucket. Requests enter a queue (“the bucket”); the queue drains at a fixed rate. If the queue is full, reject. Output is strictly smoothed — perfect when downstream is fragile.
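One way to picture the queue-based version is the sketch below. The `LeakyBucket` class, its `offer`/`drain` split, and the idea that a worker calls `drain` periodically are all assumptions made for illustration; real implementations vary.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue that releases requests at a fixed rate."""

    def __init__(self, capacity, drain_rate, now=0.0):
        self.capacity = capacity      # maximum number of queued requests
        self.drain_rate = drain_rate  # requests released per second
        self.queue = deque()
        self.last_drain = now

    def offer(self, request, now):
        """Queue a request, or return False if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # Idle time earns no credit: restart the drain clock.
            self.last_drain = now
        self.queue.append(request)
        return True

    def drain(self, now):
        """Release the requests that are due by `now`, oldest first."""
        due = int((now - self.last_drain) * self.drain_rate)
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        if due > 0:
            # Advance by whole drain intervals so fractions carry over.
            self.last_drain += due / self.drain_rate
        return released


lb = LeakyBucket(capacity=3, drain_rate=1.0)
accepted = [lb.offer(r, now=0.0) for r in ["a", "b", "c", "d"]]
first = lb.drain(now=2.5)
also_accepted = lb.offer("e", now=2.5)
second = lb.drain(now=3.0)
print(accepted, first, also_accepted, second)
```

The fourth request is rejected because the queue is full, and downstream never sees more than one request per second no matter how bursty the input is.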
2. The state lives in Redis (almost always)
A rate limiter without distributed state is just per-server limits, which means a determined client can hit the next pod and get a fresh quota. So the counter has to be shared.
Redis is the answer for one reason: you can run INCR, EXPIRE, and conditional logic atomically in a Lua script, and it typically responds in under a millisecond.
A real production token-bucket Lua script:
-- KEYS[1] = bucket key
-- ARGV[1] = max tokens, ARGV[2] = refill_rate (tokens/sec), ARGV[3] = now (seconds)
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1]) or capacity  -- a new bucket starts full
local ts = tonumber(data[2]) or now
local elapsed = math.max(0, now - ts)         -- guard against clocks moving backwards
tokens = math.min(capacity, tokens + elapsed * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
-- Persist state on both paths; HSET with several fields needs Redis 4.0+ (HMSET is deprecated)
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], 3600)
return allowed -- 1 = allow, 0 = reject
One round-trip. Atomic. The state lives in Redis, so it survives app-server crashes and restarts. This is the canonical pattern.
3. What to limit on
The naive answer is “the user.” The real answer is multiple keys at once:
| Key | Limit | Why |
|---|---|---|
| user_id | 1000/min | Per-user fairness |
| ip | 5000/min | Anonymous abuse defense |
| api_key | 50000/min | Tier-based pricing |
| endpoint:user_id | 10/min on /login | Brute-force defense |
| org_id | 10000/min | Tenant fairness |
A single request can be checked against all five — they’re all hash lookups in Redis. The most restrictive limit wins.
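A sketch of how "most restrictive wins" might look in code. The function `check_all`, the key strings, and the plain dict of counters are hypothetical; in Redis the check and the increments would need to live in one Lua script to stay atomic.

```python
def check_all(counters, keys_and_limits):
    """Allow only if every key is under its limit.

    counters: dict mapping key -> requests seen in the current window.
    keys_and_limits: list of (key, limit) pairs that apply to this request.
    Returns (allowed, blocking_key); blocking_key is None when allowed.
    """
    # First pass: find any exhausted limit before touching a counter,
    # so a rejected request does not eat into the other quotas.
    for key, limit in keys_and_limits:
        if counters.get(key, 0) >= limit:
            return False, key
    # Second pass: the request is allowed, so count it against every key.
    for key, _ in keys_and_limits:
        counters[key] = counters.get(key, 0) + 1
    return True, None


counters = {}
login_request = [
    ("user:42", 1000),
    ("ip:203.0.113.7", 5000),
    ("login:user:42", 10),
]
results = [check_all(counters, login_request) for _ in range(12)]
print(results[9], results[10])  # (True, None) (False, 'login:user:42')
print(counters["user:42"])      # 10: the two rejected attempts were not counted
```

Returning the blocking key is a design choice that makes it easy to log which limit fired and to compute the right reset time for the response.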
4. The 429 response
When you reject a request, the response matters as much as the rejection. Four headers are non-negotiable: Retry-After, plus the X-RateLimit-* trio (a de facto convention rather than a formal standard):
HTTP/1.1 429 Too Many Requests
Retry-After: 14
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1730820000
For browser-facing endpoints, also include the limit info on successful responses, so frontends can preemptively slow down.
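Building these headers is a few lines. The helper below is a sketch with a hypothetical name (`rate_limit_headers`); it assumes the limiter already knows the limit, the remaining quota, and the reset time as a Unix timestamp.

```python
import math


def rate_limit_headers(limit, remaining, reset_epoch, now):
    """Return rate-limit headers; Retry-After is added only when quota is exhausted."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if remaining <= 0:
        # Round up so the client never retries a fraction of a second early.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now)))
    return headers


rejected = rate_limit_headers(1000, 0, reset_epoch=1730820000, now=1730819986)
ok = rate_limit_headers(1000, 412, reset_epoch=1730820000, now=1730819986)
print(rejected["Retry-After"])  # 14
print("Retry-After" in ok)      # False
```

The same function serves both the 429 path and successful responses, which keeps the two sets of headers consistent.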
5. Where it sits in the architecture
Place the rate limiter as early as possible — ideally at the edge (CDN/Cloudflare/AWS WAF) for IP-based limits, then again at the API gateway for user-based limits. You want to reject abusive traffic before it reaches any compute that costs money.
flowchart TD
Internet([Internet])
Edge[Edge: IP limits<br/><i>CDN / Cloudflare / WAF</i>]
Gateway[API Gateway:<br/>API key + user limits<br/><i>shared Redis</i>]
Service[Service:<br/>per-endpoint limits<br/><i>optional, fine-grained</i>]
DB[(Database /<br/>External APIs)]
Internet --> Edge
Edge -->|✓ allowed| Gateway
Edge -->|✗ 429| Reject1[Drop early]
Gateway -->|✓ allowed| Service
Gateway -->|✗ 429| Reject2[Reject with headers]
Service --> DB
style Internet fill:#1c2333,stroke:#475569,color:#e7eaf1
style Edge fill:#0e7490,stroke:#06b6d4,color:#fff
style Gateway fill:#1e3a8a,stroke:#3b82f6,color:#fff
style Service fill:#581c87,stroke:#a855f7,color:#fff
style DB fill:#0f1320,stroke:#475569,color:#cdd3df
style Reject1 fill:#7f1d1d,stroke:#f43f5e,color:#fff
style Reject2 fill:#7f1d1d,stroke:#f43f5e,color:#fff
The further in a request gets before being rejected, the more it costs you to reject it.
6. Common pitfalls
7. The math you should commit to memory
For sizing capacity at limit N requests/minute:
- Steady-state QPS: N / 60
- Allowed burst: typically N / 6 (10 seconds' worth of capacity)
- Token bucket: B = N/6, R = N/60
- Memory per key: ~100 bytes in Redis
- Throughput per Redis node: ~100k ops/sec
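These rules of thumb are simple enough to wrap in a helper. The function names and the example workload (1M active keys, 250k checks/sec) are illustrative assumptions; the per-key and per-node figures are the rough estimates from the list above, not measurements.

```python
import math


def size_limit(n_per_minute, burst_seconds=10):
    """Derive token-bucket parameters from a limit of N requests/minute."""
    steady_qps = n_per_minute / 60
    return {
        "steady_qps": steady_qps,
        "B": steady_qps * burst_seconds,  # equals N/6 when burst_seconds is 10
        "R": steady_qps,                  # N/60 tokens per second
    }


def redis_memory_bytes(active_keys, bytes_per_key=100):
    """Rough memory estimate at ~100 bytes per key."""
    return active_keys * bytes_per_key


def redis_nodes(checks_per_second, ops_per_node=100_000):
    """Rough node count at ~100k ops/sec per node."""
    return math.ceil(checks_per_second / ops_per_node)


params = size_limit(600)
print(params)                          # {'steady_qps': 10.0, 'B': 100.0, 'R': 10.0}
print(redis_memory_bytes(1_000_000))   # 100000000 bytes, about 100 MB
print(redis_nodes(250_000))            # 3
```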
8. Why every system needs one
Without a rate limiter, three things happen, in this order:
- A buggy client retries in a tight loop and DDoSes you accidentally
- A scraper finds your API and pulls 100GB of data overnight
- An LLM-powered competitor calls your search endpoint to train on your results
A rate limiter is not optional infrastructure for any externally-reachable system. It’s as fundamental as TLS.
The good news: it’s also one of the smallest, most composable pieces of infra you can build. A few hundred lines of code, one Redis instance, and you’ve eliminated an entire class of operational pain.