← System Design

🚦

System Design

Rate Limiter

Token bucket, leaky bucket, sliding window — the four classic algorithms behind every API quota.

TL;DR

A rate limiter is a fast token-or-counter check at the edge of your system that says "yes, no, or wait" before any expensive work happens. The four canonical algorithms — fixed window, sliding window, token bucket, leaky bucket — each trade burst tolerance against fairness, and the right one depends on what you're protecting and from whom.

When to use

Anywhere you have an external API surface (public APIs, login endpoints, signup flows, password reset, search, AI inference, anything that calls a paid third-party). Use it for fairness (multi-tenant SaaS), abuse prevention (DDoS, scraping, brute-force), cost control (LLM token quotas), and SLA protection (so one runaway client can't starve the rest).

Why this is more than “just a counter”

Every rate limiter is a counter under the hood. What makes it interesting is that it sits on the hottest path in your architecture — every request goes through it before doing any work. Get it wrong and you either let abuse through, or you accidentally rate-limit your own customers during a traffic spike.

The job of a rate limiter is to answer one question, in under a millisecond, billions of times a day:

Three numbers determine the answer: who, what, and when.

1. The four canonical algorithms

There are exactly four algorithms anyone uses in production. Everything else is a variant.

Fixed window

Keep a counter per (user, minute). Reset it every minute. Cheap, simple, and broken at boundaries — a user can fire 100 requests in second 59 and another 100 in second 0 of the next minute, getting 200 in 2 seconds despite a 100/min limit.

key: "rl:user:42:2026-05-05T15:42"
INCR  → 73
EXPIRE 60

Use this when you don’t care about boundary bursts (most non-abusive workloads).

Sliding window log

Store every request’s timestamp in a sorted set. To check, count entries in the last 60s. Perfectly accurate, but O(n) memory per user — bad at scale.

Sliding window counter (weighted)

The pragmatic compromise. Keep counts for the current and previous window, weight the previous one by how much it overlaps:

effective = current + previous × ((window_size - elapsed_in_current) / window_size)

Token bucket

Conceptually different. The user has a bucket of tokens that refills at a steady rate (R tokens/sec) up to a max (B tokens). Each request consumes one token. If the bucket is empty, the request is rejected.

Token bucket — live

✓ 0✗ 0

Token (1 unit of capacity) Allowed Rejected (bucket empty)

tokens = min(B, tokens + (now - last_refill) × R)
last_refill = now
if tokens >= 1: tokens -= 1, allow
else: reject

The genius: the bucket capacity B controls burst tolerance, the refill rate R controls average rate. Tune them independently. AWS, Stripe, Cloudflare — all token bucket.

Leaky bucket

Inverse of token bucket. Requests enter a queue (“the bucket”); the queue drains at a fixed rate. If the queue is full, reject. Output is strictly smoothed — perfect when downstream is fragile.

2. The state lives in Redis (almost always)

A rate limiter without distributed state is just per-server limits, which means a determined client can hit the next pod and get a fresh quota. So the counter has to be shared.

Redis is the answer for one reason: you can do INCR, EXPIRE, and conditional logic atomically with Lua, and it returns in sub-millisecond.

A real production token-bucket Lua script:

-- KEYS[1] = bucket key, ARGV[1] = max, ARGV[2] = refill_rate, ARGV[3] = now
local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1]) or tonumber(ARGV[1])
local ts = tonumber(data[2]) or tonumber(ARGV[3])
local elapsed = math.max(0, tonumber(ARGV[3]) - ts)
tokens = math.min(tonumber(ARGV[1]), tokens + elapsed * tonumber(ARGV[2]))
if tokens >= 1 then
  tokens = tokens - 1
  redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[3])
  redis.call('EXPIRE', KEYS[1], 3600)
  return 1  -- allow
else
  redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[3])
  return 0  -- reject
end

One round-trip. Atomic. Survives crashes. This is the canonical pattern.

3. What to limit on

The naive answer is “the user.” The real answer is multiple keys at once:

Key	Limit	Why
`user_id`	1000/min	Per-user fairness
`ip`	5000/min	Anonymous abuse defense
`api_key`	50000/min	Tier-based pricing
`endpoint:user_id`	10/min on `/login`	Brute-force defense
`org_id`	10000/min	Tenant fairness

A single request can be checked against all five — they’re all hash lookups in Redis. The most restrictive limit wins.

4. The 429 response

When you reject a request, the response matters as much as the rejection. Three headers are non-negotiable:

HTTP/1.1 429 Too Many Requests
Retry-After: 14
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1730820000

For browser-facing endpoints, also include the limit info on successful responses, so frontends can preemptively slow down.

5. Where it sits in the architecture

Place the rate limiter as early as possible — ideally at the edge (CDN/Cloudflare/AWS WAF) for IP-based limits, then again at the API gateway for user-based limits. You want to reject abusive traffic before it reaches any compute that costs money.

flowchart TD
    Internet([Internet])
    Edge[Edge: IP limits<br/><i>CDN / Cloudflare / WAF</i>]
    Gateway[API Gateway:<br/>API key + user limits<br/><i>shared Redis</i>]
    Service[Service:<br/>per-endpoint limits<br/><i>optional, fine-grained</i>]
    DB[(Database /<br/>External APIs)]

    Internet --> Edge
    Edge -->|✓ allowed| Gateway
    Edge -->|✗ 429| Reject1[Drop early]
    Gateway -->|✓ allowed| Service
    Gateway -->|✗ 429| Reject2[Reject with headers]
    Service --> DB

    style Internet fill:#1c2333,stroke:#475569,color:#e7eaf1
    style Edge fill:#0e7490,stroke:#06b6d4,color:#fff
    style Gateway fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style Service fill:#581c87,stroke:#a855f7,color:#fff
    style DB fill:#0f1320,stroke:#475569,color:#cdd3df
    style Reject1 fill:#7f1d1d,stroke:#f43f5e,color:#fff
    style Reject2 fill:#7f1d1d,stroke:#f43f5e,color:#fff

The further in a request gets before being rejected, the more it costs you to reject it.

6. Common pitfalls

7. The math you should commit to memory

For sizing capacity at limit N requests/minute:

Steady-state QPS: N / 60
Allowed burst: typically N / 6 (10-second worth of capacity)
Token bucket: B = N/6, R = N/60
Memory per key: ~100 bytes in Redis
Throughput per Redis node: ~100k ops/sec

8. Why every system needs one

Without a rate limiter, three things happen, in this order:

A buggy client retries in a tight loop and DDoSes you accidentally
A scraper finds your API and pulls 100GB of data overnight
An LLM-powered competitor calls your search endpoint to train on your results

A rate limiter is not optional infrastructure for any externally-reachable system. It’s as fundamental as TLS.

The good news: it’s also one of the smallest, most composable pieces of infra you can build. A few hundred lines of code, one Redis instance, and you’ve eliminated an entire class of operational pain.

🎨 Visualization soon

An interactive diagram you can hover, click, and explore.

💻 Code Phase 4 soon

A 30-line build challenge with starter code, hints, and a reference implementation.

🎯 Common interview questions

Q1. Token bucket vs leaky bucket — when do you pick which? ▾

Token bucket allows controlled bursts (refills tokens at a steady rate, but the bucket holds a buffer). It's right when bursts are normal and OK — most public APIs. Leaky bucket smooths to a strict constant rate (requests drip out at fixed cadence). Pick it when downstream is fragile and can't absorb bursts — e.g., feeding into a slow database write or an external SMS provider.

Q2. How does sliding-window beat fixed-window without exploding memory? ▾

Use the "weighted" sliding window. You keep two counters — current minute and previous minute. The effective count is `current + previous × overlap_fraction`. So at second 30 of a one-minute window, you weight the previous minute at 50%. That's O(1) memory per key, and it eliminates the burst-at-boundary problem of fixed windows.

Q3. Where do you store the counter — Redis or in-memory? ▾

Redis if you have more than one server (you almost always do), in-memory if you really really don't. The trade-off is one network hop (~0.5ms) for global accuracy vs zero-latency-but-each-pod-tracks-its-own. For 99% of cases, Redis with `INCR` + `EXPIRE` (or atomic Lua) is right. In-process is only for sidecar caches, not user-facing limits.

Q4. How do you handle distributed clocks when limiters run on many nodes? ▾

Don't trust local clocks. Either centralize the counter (Redis) or use a leader-elected counter authority. If you must distribute, use a token bucket with timestamp-based refill — the math works even if clocks drift slightly because each node still draws from a shared store. Avoid sliding-window log algorithms in distributed setups; they're the most clock-sensitive.

Q5. What happens when a user hits the limit — what do you return? ▾

HTTP 429 Too Many Requests with three headers — `Retry-After`, `X-RateLimit-Limit`, `X-RateLimit-Remaining`. Don't drop the connection; clients need to back off intelligently. For programmatic clients, also expose a `X-RateLimit-Reset` Unix timestamp so they can schedule retries precisely.

↗ Related concepts

Comments 0

Discuss this page. Markdown supported. Be kind.

Loading…

Loading comments…