LLM Inference at Scale

KV cache, continuous batching, speculative decoding — what actually makes ChatGPT fast.

TL;DR

When you send a prompt to ChatGPT, a distributed system tokenizes it, sends it to a model that is already sharded across many GPUs (tensor + pipeline parallelism), processes the prompt in a "prefill" stage, then generates output tokens one at a time using a KV cache to avoid recomputation. Continuous batching keeps the GPUs busy by mixing in new requests as old ones finish; quantization shrinks the model into less GPU memory; speculative decoding often gets you 2–3× faster generation. The architecture is an orchestra, not a single instrument.

When to use

Anyone serving LLMs in production needs to understand the inference stack — even if you're using a managed API. Knowing why prefill is fast and decode is slow lets you design prompts that don't waste tokens. Knowing how continuous batching works lets you understand why your P99 latency spikes when traffic is bursty. Knowing what quantization costs in quality lets you pick the right model for your use case. This is also the foundation if you're ever going to self-host (vLLM, TensorRT-LLM, TGI).

Why LLM inference is a system, not a function call

You think of chat.completions.create() as one call. Behind it, a multi-tenant cluster orchestrates dozens of GPUs, half a dozen distinct stages, and several optimizations that didn’t exist three years ago.

Understanding the stack matters for two reasons:

  • It shapes the cost and latency you’ll see at any provider
  • If you ever self-host (vLLM, TGI, TensorRT-LLM), the configuration knobs are these stages

Let’s go from “your text” to “the response”, step by step.

1. Tokenization

Your text isn’t characters to the model. It’s tokens: subword pieces from a vocabulary that typically holds tens of thousands to a few hundred thousand entries (around 100k is common). The string “I love programming” might be [40, 1842, 13380], which is three tokens.

Tokenization is fast (microseconds) but the cost model lives in tokens, not characters.
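
To make the token-versus-character distinction concrete, here is a minimal sketch of greedy longest-match subword tokenization. The vocabulary `TOY_VOCAB` and its ids are invented for illustration; real tokenizers learn their vocabularies from data and use different merge rules, so treat this as a picture of the idea rather than how any production tokenizer works.

```python
# Toy vocabulary: hypothetical pieces and ids, not taken from any real tokenizer.
TOY_VOCAB = {"I": 0, " love": 1, " program": 2, "ming": 3, " ": 4,
             "l": 5, "o": 6, "v": 7, "e": 8}

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match: at each position take the longest known piece."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return ids

text = "I love programming"
ids = tokenize(text, TOY_VOCAB)
print(len(text), "characters ->", len(ids), "tokens:", ids)
```

Billing and context limits count the length of `ids`, not the length of `text`.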

2. Prefill — processing the prompt

Once tokenized, the prompt enters the model. Prefill is the stage where the entire prompt is processed in one big parallel forward pass.

For a 1024-token prompt through an 80-layer transformer:

  • Each layer does Q = x · W_q, K = x · W_k, V = x · W_v, attention, FFN
  • All 1024 token positions get processed in parallel — the GPU is firing on all cylinders
  • At the end, 1024 (Key, Value) pairs per layer are written to the KV cache

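Prefill is compute-bound: the work is dominated by large matrix multiplications over all prompt positions at once, so it uses the GPU efficiently. Well-tuned kernels can approach ~80% of peak FLOPS on these matmuls, although sustained end-to-end utilization is often lower. Cost grows roughly linearly with prompt length for the projections and FFN, and quadratically for attention.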
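
A back-of-envelope estimate helps here. The sketch below uses the common approximation of about 2 FLOPs per parameter per token for a forward pass, which ignores the attention-score computation. The 70B parameter count, the aggregate peak of 1e15 FLOP/s, and the 50% utilization are all assumptions chosen for illustration.

```python
def prefill_flops(n_params: float, prompt_tokens: int) -> float:
    """Approximate forward-pass FLOPs: about 2 per parameter per token."""
    return 2.0 * n_params * prompt_tokens

def prefill_seconds(n_params: float, prompt_tokens: int,
                    peak_flops: float, utilization: float) -> float:
    """Time to prefill if the hardware sustains `utilization` of its peak."""
    return prefill_flops(n_params, prompt_tokens) / (peak_flops * utilization)

N_PARAMS = 70e9      # assumed model size
PEAK_FLOPS = 1e15    # assumed aggregate peak across the GPUs, in FLOP/s

flops = prefill_flops(N_PARAMS, 1024)
seconds = prefill_seconds(N_PARAMS, 1024, PEAK_FLOPS, utilization=0.5)
print(f"{flops:.3e} FLOPs, roughly {seconds * 1e3:.0f} ms at 50% utilization")
```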

3. The KV cache — the secret weapon

Attention compares the current token’s query vector to every previous token’s key vector. Naively, for token N you’d recompute K and V for positions 0..N-1 at every step, which means rerunning the model over the whole sequence for each new token.

The KV cache stores K and V for every position once they’re computed. Generating token N+1 only requires:

  • One new query, one new key, one new value (for token N+1)
  • Attention against the cached K, V from positions 0..N

Without cache:  O(N²) attention ops per generated token (the whole sequence is reprocessed)
With cache:     O(N) attention ops per generated token
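
Here is a minimal single-head sketch of that idea in plain Python. The 2-dimensional weights and token embeddings are made up, and a real implementation would use batched tensor operations across many heads and layers. The point is only that the cached path projects each token once and still produces the same attention output as full recomputation.

```python
import math

def project(x, w):
    """Vector-matrix product: x has length d, w is a d x d nested list."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Softmax attention of one query over all keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(wt * v[j] for wt, v in zip(weights, values)) / total for j in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of K/V projections performed so far

    def step(self, x, w_q, w_k, w_v):
        """Process one new token: project it once, then attend over the cache."""
        self.keys.append(project(x, w_k))
        self.values.append(project(x, w_v))
        self.projections += 1
        return attend(project(x, w_q), self.keys, self.values)

def step_without_cache(xs, w_q, w_k, w_v):
    """Recompute K and V for every position, then attend with the last query."""
    keys = [project(x, w_k) for x in xs]
    values = [project(x, w_v) for x in xs]
    return attend(project(xs[-1], w_q), keys, values), len(xs)

# Hypothetical weights and token embeddings.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.3]]
W_V = [[1.0, 2.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

cache = KVCache()
naive_projections = 0
for n in range(1, len(tokens) + 1):
    cached_out = cache.step(tokens[n - 1], W_Q, W_K, W_V)
    naive_out, cost = step_without_cache(tokens[:n], W_Q, W_K, W_V)
    naive_projections += cost
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, naive_out))

print("K/V projections with cache:", cache.projections)
print("K/V projections without cache:", naive_projections)
```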

4. Decode — generating one token at a time

Decode is the boring-feeling part. Each iteration:

  1. Take the most recently generated token
  2. One forward pass through the model
  3. Append the new K, V to the cache
  4. Sample the next token from the output distribution
  5. Repeat
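
The loop above can be sketched in a few lines. `toy_forward` is a hypothetical stand-in for a real model (it just prefers the next integer), `EOS` is an invented end-of-sequence id, and the sampling rule is greedy argmax, the simplest possible choice; real servers usually sample with temperature or top-p.

```python
EOS = 0          # hypothetical end-of-sequence token id
VOCAB_SIZE = 5

def toy_forward(token: int, kv_cache: list[int]) -> list[float]:
    """Stand-in for a real model: records the token and returns fake logits.

    This toy 'model' always prefers (token + 1) mod VOCAB_SIZE.
    """
    kv_cache.append(token)  # step 3: a real model appends new K and V here
    logits = [0.0] * VOCAB_SIZE
    logits[(token + 1) % VOCAB_SIZE] = 1.0
    return logits

def decode(first_token: int, max_new_tokens: int) -> list[int]:
    kv_cache: list[int] = []
    output = []
    token = first_token                                          # step 1
    for _ in range(max_new_tokens):
        logits = toy_forward(token, kv_cache)                    # steps 2 and 3
        token = max(range(VOCAB_SIZE), key=logits.__getitem__)   # step 4 (greedy)
        output.append(token)
        if token == EOS:
            break                                                # else step 5: repeat
    return output

generated = decode(first_token=2, max_new_tokens=10)
print(generated)
```

Note that each iteration handles exactly one token per request, which is why decode cannot exploit the GPU the way prefill does.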

5. Model parallelism — when one GPU isn’t enough

A 70B model in FP16 needs about 140 GB for weights alone, which doesn’t fit on one GPU (most have 40–80 GB). You need to split it.

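Tensor parallelism: split each layer’s matrices across GPUs. The matmul x · W where W is 16k × 16k can be split across 4 GPUs, each holding a 16k × 4k column slice. Each GPU then produces a slice of the output, and a collective combines the results (an all-gather for a column split, an all-reduce when a row split leaves partial sums). Communication-heavy, requires fast interconnect (NVLink), best inside a single node.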
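
A small sketch of the two ways to shard a matmul, with the "GPUs" simulated as list slices. Splitting W by columns gives each shard a slice of the output, which is concatenated; splitting by rows gives each shard a partial sum, which is added. The 4 × 4 matrix and 2 shards are toy sizes picked for readability.

```python
def matmul(x, w):
    """x: list of length d_in; w: d_in x d_out nested list."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, n_shards):
    """Give each simulated GPU a contiguous block of output columns."""
    d_out = len(w[0])
    assert d_out % n_shards == 0
    width = d_out // n_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(n_shards)]

def split_rows(w, n_shards):
    """Give each simulated GPU a contiguous block of input rows."""
    assert len(w) % n_shards == 0
    height = len(w) // n_shards
    return [w[s * height:(s + 1) * height] for s in range(n_shards)]

x = [1.0, 2.0, 3.0, 4.0]
W = [[float(4 * i + j) for j in range(4)] for i in range(4)]  # toy 4x4 weights
reference = matmul(x, W)

# Column-parallel: every shard sees all of x; concatenation plays the all-gather.
col_out = [y for shard in split_columns(W, 2) for y in matmul(x, shard)]

# Row-parallel: every shard sees a slice of x; summation plays the all-reduce.
partials = [matmul(x[s * 2:(s + 1) * 2], shard)
            for s, shard in enumerate(split_rows(W, 2))]
row_out = [sum(vals) for vals in zip(*partials)]

print(reference, col_out, row_out)
```

Both shardings reproduce the unsplit result; what differs is which collective operation runs afterwards.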

Pipeline parallelism — split the model into sequential stages. GPU 0 holds layers 0–19, GPU 1 holds 20–39, etc. Request flows through. While request A is in stage 2, request B can be in stage 1. Less communication, scales across nodes.
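
To see the overlap, here is a sketch that builds the occupancy timeline for a pipeline. It assumes every stage takes exactly one time step per request, which real pipelines only approximate, and `pipeline_schedule` is a name made up for this example.

```python
from typing import Optional

def pipeline_schedule(n_stages: int, n_requests: int) -> list[list[Optional[int]]]:
    """timeline[t][s] is the request occupying stage s at time step t, or None."""
    n_steps = n_stages + n_requests - 1
    timeline = []
    for t in range(n_steps):
        row = []
        for s in range(n_stages):
            r = t - s  # request r enters stage 0 at time step r
            row.append(r if 0 <= r < n_requests else None)
        timeline.append(row)
    return timeline

timeline = pipeline_schedule(n_stages=4, n_requests=3)
for t, row in enumerate(timeline):
    print(t, ["-" if r is None else f"req{r}" for r in row])
```

With 4 stages and 3 requests the pipeline finishes in 6 steps instead of the 12 it would take to run each request through all stages before starting the next; the empty cells at the start and end are the pipeline "bubble".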

Expert parallelism (MoE models) — only some “experts” activate per token, so different GPUs hold different experts and tokens are routed.
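
A minimal sketch of the routing step, assuming top-1 routing (each token goes to its single highest-scoring expert) and a fixed number of experts per GPU. The gate scores are invented; production MoE models typically route each token to several experts and add load-balancing on top.

```python
def route_tokens(gate_scores: list[list[float]],
                 experts_per_gpu: int) -> dict[int, list[int]]:
    """Top-1 routing: send each token to the GPU holding its best expert.

    gate_scores[t][e] is the (hypothetical) router score of token t for expert e.
    Returns {gpu_index: [token indices]}.
    """
    placement: dict[int, list[int]] = {}
    for t, scores in enumerate(gate_scores):
        expert = max(range(len(scores)), key=scores.__getitem__)
        gpu = expert // experts_per_gpu
        placement.setdefault(gpu, []).append(t)
    return placement

scores = [
    [0.1, 0.7, 0.1, 0.1],     # token 0 prefers expert 1
    [0.2, 0.1, 0.1, 0.6],     # token 1 prefers expert 3
    [0.9, 0.0, 0.05, 0.05],   # token 2 prefers expert 0
]
placement = route_tokens(scores, experts_per_gpu=2)
print(placement)
```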

6. Continuous batching — keeping the GPUs busy

Static batching: collect N requests, run them together until all finish, then start the next batch. Problem: requests vary in output length, so the slots of requests that finish early sit idle until the longest one is done.

Continuous batching (introduced by Orca and popularized by vLLM): treat batching at the iteration level, not the request level. After every token-generation iteration:

  • Requests that finished release their slot
  • New incoming requests fill the empty slots
  • As long as requests are waiting, the batch stays full
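
A small simulation makes the difference visible. It counts decode iterations for the same workload under both policies, assuming a fixed batch size, one token per request per iteration, and all requests already queued. The output lengths are made up.

```python
from collections import deque

def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """After every iteration, finished requests leave and queued ones join."""
    queue = deque(lengths)
    running: list[int] = []  # remaining tokens for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())       # fill empty slots
        running = [r - 1 for r in running]        # one decode iteration
        running = [r for r in running if r > 0]   # finished requests release slots
        steps += 1
    return steps

output_lengths = [2, 8, 3, 1, 8, 2, 4, 4]  # hypothetical tokens per request
static = static_batching_steps(output_lengths, batch_size=4)
continuous = continuous_batching_steps(output_lengths, batch_size=4)
print("static:", static, "iterations; continuous:", continuous, "iterations")
```

The workload needs at least 8 iterations (32 tokens across 4 slots), so in this toy case the continuous scheduler lands close to the floor while the static one takes nearly twice as long.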

7. Quantization — fitting bigger models in less memory

A 70B model in FP16 is 140 GB. In INT8 (8-bit integers): 70 GB. In INT4: 35 GB. For a quality loss that is often in the 1–3% range on benchmarks (it depends on the model and the quantization method), you can serve the same model on half or a quarter of the hardware.

Two flavors:

  • Weight-only quantization: weights are INT4/INT8, activations stay FP16. Easy, good quality, and it cuts the weight traffic that dominates decode.
  • Weight-and-activation quantization: both weights and activations are quantized, so the matmuls themselves run in low precision. Bigger speedup, but more quality loss without careful calibration.
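
The memory arithmetic and the basic mechanism fit in a short sketch. The quantizer below is plain symmetric "absmax" INT8 over a single list of weights, chosen because it is the simplest scheme to read; production methods work per channel or per group and use calibration data. The weight values are invented, and the function assumes at least one weight is nonzero.

```python
def weight_gigabytes(n_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization: map the largest |w| to 127."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_gigabytes(70e9, bits):.0f} GB")

weights = [0.02, -1.27, 0.5, 0.731]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max round-trip error {max_error:.4f}")
```

The round-trip error is bounded by half the scale, which is why a single outlier weight (it sets the scale for everyone) hurts quality and why finer-grained scales help.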

8. Speculative decoding — multiple tokens per pass

The cleverest recent trick. A small “draft” model (e.g., 1B params) is fast at guessing what the big model would say. So:

  1. Draft model generates K tokens (e.g., 5)
  2. Target model verifies all K in a single forward pass (cheap because it’s prefill-style)
  3. The target’s output distribution at each position tells us which draft tokens to accept
  4. First mismatch → reject everything from there onward, and use the target’s own token at that position, so every pass yields at least one token
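
The steps above can be sketched for the greedy case, where "accept" simply means the draft token equals the target's argmax. This is a simplification: the sampling version compares probabilities and uses a rejection-sampling rule instead. `draft_next` and `target_next` are hypothetical callables standing in for the two models, and the verification loop stands in for what would be one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns the new tokens.

    draft_next and target_next map a token sequence to that model's next
    token (greedy argmax). Both are stand-ins for real models.
    """
    # 1. The draft model proposes k tokens autoregressively.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2-3. The target checks every position. A real system does this in a
    #      single batched forward pass; a loop is used here for clarity.
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # 4. first mismatch: keep the target's token
            return accepted
        accepted.append(token)

    # All k accepted: the same pass also yields one extra target token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy models over integers: the target counts up by 1; the draft agrees
# except that it wrongly skips ahead after seeing a 3.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 2 if seq[-1] == 3 else seq[-1] + 1

new_tokens = speculative_step([0], draft, target, k=5)
print(new_tokens)
```

Three draft tokens are accepted and the fourth is corrected, so one target pass yields four tokens, all of them identical to what the target would have produced on its own.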

9. The full request lifecycle

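Putting it all together, a request to a ChatGPT-style service might flow like this (a plausible sketch; providers do not publish their exact architecture):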

sequenceDiagram
    autonumber
    participant U as You
    participant LB as Load balancer
    participant R as Router
    participant P as Prefill cluster<br/>(compute-bound)
    participant D as Decode cluster<br/>(memory-bound)
    participant T as Telemetry

    U->>LB: POST /chat/completions
    LB->>R: route (region, shed if overloaded)
    R->>R: pick model variant<br/>(mini vs full)
    R->>P: prompt
    P->>P: tokenize + prefill (one pass)
    P->>D: KV cache transfer
    loop for each output token
        D->>D: continuous-batched decode<br/>(mixed with other users)
        D-->>U: stream token (SSE)
    end
    D->>T: usage, latency, quality
    T-->>U: completion event

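The split between prefill and decode is increasingly important. Disaggregated serving designs run dedicated prefill clusters and decode clusters because the two phases have different optimal hardware (compute-bound vs memory-bound). Published systems such as DistServe and Splitwise describe this approach, and large providers are widely assumed to do something similar, though they do not document their internals.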

10. What this means for you

Even if you never self-host, knowing this stack changes how you build:

  • Cheap models exist because of all of these tricks. Small models such as GPT-4o-mini are generally believed to be distilled or otherwise derived from larger ones rather than trained in isolation (OpenAI has not published the details), and they are served through the same kind of continuous-batching, quantized, speculative-decoded stack. Use the cheapest model that meets quality.
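  • Self-host when scale justifies it. As a rough rule of thumb (actual numbers depend heavily on model, hardware pricing, and traffic), vLLM on H100s can get within about 2× of OpenAI’s serving cost at high utilization. Below roughly 100k requests/day, hosted APIs usually win on operational cost.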

🎯 Common interview questions

Q1. What is the KV cache and why is it the most important optimization in LLM inference?

A transformer's attention layer projects every token into Key and Value vectors. To generate token N, attention compares query(N) to key(0..N-1). Without a cache, you'd recompute key(0..N-1) and value(0..N-1) for every new token, rerunning the model over the whole sequence each time. The KV cache stores these K and V vectors in GPU memory once, so generating token N only requires a forward pass over the one new token. The memory cost is large (it can reach gigabytes per request at long context lengths), but it cuts the attention work per generated token from O(N²) to O(N).
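
To put a number on the memory cost, here is a sketch of the usual sizing formula: two tensors (K and V) per layer, per KV head, per token. The configuration values (80 layers, head dimension 128, 64 or 8 KV heads, FP16) are assumptions picked to resemble a 70B-class model, not the specification of any particular one.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """K and V (the factor of 2) for every layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# One request with a 4096-token context, FP16 values.
full = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, n_tokens=4096)
# Same model shape but with only 8 KV heads shared across query heads.
shared = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=4096)

print(f"64 KV heads: {full / 2**30:.2f} GiB per request")
print(f" 8 KV heads: {shared / 2**30:.2f} GiB per request")
```

Because this cost is per request, the KV cache rather than the weights is often what limits how many requests fit in a batch.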

Q2. Why is prefill fast and decode slow?

Prefill processes the entire prompt in one parallel forward pass — the GPU is fully utilized. Decode generates one token at a time, each requiring its own forward pass — most of the GPU's compute sits idle waiting for memory bandwidth. Decode is **memory-bandwidth bound**, not compute bound. This is why long prompts feel "fast to start" but generation feels slow per token.
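
A rough lower bound shows why. Every decode step has to stream all the weights from GPU memory once, so the step time cannot be less than weight bytes divided by memory bandwidth. The 8 GPUs at 3 TB/s each are hypothetical round numbers, and the estimate ignores KV cache reads and communication.

```python
def decode_step_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound: each decode step streams all weights from memory once."""
    return weight_bytes / bandwidth_bytes_per_s

WEIGHT_BYTES = 140e9      # 70B parameters in FP16
BANDWIDTH = 8 * 3e12      # assumed: 8 GPUs at 3 TB/s each (hypothetical)

step = decode_step_seconds(WEIGHT_BYTES, BANDWIDTH)
print(f"at least {step * 1e3:.2f} ms per decode step")
for batch in (1, 8, 64):
    # The same weight read serves every request in the batch.
    print(f"batch {batch:>2}: up to ~{batch / step:,.0f} tokens/s in total")
```

The step time barely changes with batch size until compute becomes the limit, which is the reason batching raises decode throughput so much.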

Q3. What is continuous batching and how is it different from static batching?

Static batching groups N requests, processes them all together until the slowest one finishes, then starts the next batch. This wastes GPU cycles whenever a request finishes early, because its slot stays empty. Continuous batching (the vLLM / "iteration-level" approach) replaces finished requests within a running batch: when request A finishes at iteration T, a waiting request E can take A's slot at iteration T+1. Throughput rises substantially because the GPU is never waiting for the slowest request.

Q4. Tensor parallelism vs pipeline parallelism — when do you use each?

Tensor parallelism splits each layer across GPUs (e.g., the matmul of layer 5 spans 4 GPUs). Communication-heavy (all-reduce after every layer), so it works best within a single node where GPUs have NVLink. Pipeline parallelism splits the model into stages, each stage on a different node — request 1 enters stage 1, while request 0 is in stage 2. Less communication-intensive, scales across nodes. Big models combine both — tensor parallel within a node, pipeline parallel across nodes.

Q5. What is speculative decoding?

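A small fast "draft" model (e.g., a 1B-parameter model) generates K candidate tokens (typically K=4–8). The big target model (e.g., 70B) then verifies all K tokens in a single forward pass. Draft tokens that pass the acceptance test against the target's distribution are kept; the first rejected token and everything after it are discarded, and the target supplies the token at that position. Net effect: multiple tokens generated per big-model forward pass. Reported speedups are typically around 2–3×, depending on how often the draft agrees with the target, and with the standard acceptance rule the output distribution is the same as the target's.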
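
How many tokens you get per target pass depends on the acceptance rate. Under the simplifying assumption that each draft token is accepted independently with the same probability a, the expected count is the geometric sum 1 + a + a² + … + a^K. The function name and the example rate of 0.8 are illustrative only.

```python
def expected_tokens_per_pass(acceptance_rate: float, k: int) -> float:
    """Expected tokens produced per target pass with k drafted tokens.

    Assumes each draft token is accepted independently with the same
    probability, which real workloads only approximate.
    """
    a = acceptance_rate
    if a == 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)

for rate in (0.5, 0.8, 0.9):
    print(f"acceptance {rate}: {expected_tokens_per_pass(rate, 5):.2f} tokens per pass")
```

This counts tokens per target pass, not wall-clock speedup: the draft model's own time and the slightly more expensive verification pass both eat into the gain.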
