When you send a prompt to ChatGPT, a distributed system tokenizes it, processes the prompt in a "prefill" stage on a model that is already split across many GPUs (tensor + pipeline parallelism), then generates output tokens one at a time using a KV cache to avoid recomputation. Continuous batching keeps the GPUs busy by mixing in new requests as old ones finish; quantization shrinks the model into less GPU memory; speculative decoding can make generation 2–3× faster. The architecture is an orchestra, not a single instrument.
If you serve LLMs in production, you need to understand the inference stack, even if you're using a managed API. Knowing why prefill is fast and decode is slow lets you design prompts that don't waste tokens. Knowing how continuous batching works lets you understand why your P99 latency spikes when traffic is bursty. Knowing what quantization costs in quality lets you pick the right model for your use case. This is also the foundation if you're ever going to self-host (vLLM, TensorRT-LLM, TGI).
Why LLM inference is a system, not a function call
You think of chat.completions.create() as one call. Behind it, a multi-tenant cluster orchestrates dozens of GPUs, half a dozen distinct stages, and several optimizations that didn’t exist three years ago.
Understanding the stack matters for two reasons:
- It shapes the cost and latency you’ll see at any provider
- If you ever self-host (vLLM, TGI, TensorRT-LLM), the configuration knobs are these stages
Let’s go from “your text” to “the response”, step by step.
1. Tokenization
Your text isn’t characters to the model. It’s tokens: subword pieces from a vocabulary that typically holds on the order of 100k entries. The string “I love programming” might be [40, 1842, 13380], which is three tokens.
Tokenization is fast (microseconds), but the cost model lives in tokens, not characters: pricing, context limits, and latency are all counted per token.
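To make the characters-versus-tokens gap concrete, here is a minimal sketch of a subword tokenizer. It uses greedy longest-match over a tiny hypothetical vocabulary (`TOY_VOCAB` and its IDs are made up for illustration); production tokenizers use learned BPE merges instead, so real IDs and token counts will differ.

```python
# Toy greedy longest-match subword tokenizer.
# TOY_VOCAB and its IDs are hypothetical; real vocabularies are learned and far larger.
TOY_VOCAB = {
    "I": 0, " love": 1, " program": 2, "ming": 3, " ": 4,
    "l": 5, "o": 6, "v": 7, "e": 8,
}

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Repeatedly take the longest vocabulary entry that prefixes the remaining text."""
    ids = []
    pos = 0
    max_len = max(len(piece) for piece in vocab)
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += length
                break
        else:
            # No entry matched, not even a single character.
            raise ValueError(f"no vocabulary entry matches at position {pos}")
    return ids

text = "I love programming"
token_ids = tokenize(text, TOY_VOCAB)
print(len(text), "characters ->", len(token_ids), "tokens:", token_ids)
```

With this toy vocabulary the 18-character string becomes 4 tokens. The count you are billed for is the second number, not the first.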
2. Prefill — processing the prompt
Once tokenized, the prompt enters the model. Prefill is the stage where the entire prompt is processed in one big parallel forward pass.
For a 1024-token prompt through an 80-layer transformer:
- Each layer does Q = x · W_q, K = x · W_k, V = x · W_v, attention, FFN
- All 1024 token positions get processed in parallel, so the GPU is firing on all cylinders
- At the end, 1024 (Key, Value) pairs per layer are written to the KV cache
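Prefill is compute-bound: the work is large matrix multiplications over every prompt position at once, so well-tuned kernels on modern GPUs can run at a high fraction of peak FLOPS (the ~80% figure is a best case that depends on hardware and kernels). Cost grows roughly linearly with prompt length for the projections and FFN, and quadratically for the attention scores.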
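A back-of-the-envelope estimate shows why prefill is about raw compute. The sketch below uses the common approximation of about 2 FLOPs per parameter per token for a forward pass. The 70B parameter count, the aggregate peak throughput, and the utilization figure are all assumptions chosen for illustration, not measurements.

```python
def prefill_flops(n_params: float, prompt_tokens: int) -> float:
    """Approximate forward-pass FLOPs: about 2 FLOPs per parameter per token.

    Ignores the attention-score term, which grows with the square of prompt length.
    """
    return 2.0 * n_params * prompt_tokens

def prefill_seconds(n_params: float, prompt_tokens: int,
                    peak_flops: float, utilization: float) -> float:
    """Time to prefill if the hardware sustains utilization * peak_flops."""
    return prefill_flops(n_params, prompt_tokens) / (peak_flops * utilization)

flops = prefill_flops(70e9, 1024)
# Hypothetical GPU group: 1e15 FLOP/s aggregate peak at 50% utilization (both assumed).
seconds = prefill_seconds(70e9, 1024, peak_flops=1e15, utilization=0.5)
print(f"{flops:.3e} FLOPs, about {seconds * 1000:.0f} ms")
```

Doubling the prompt length roughly doubles this estimate, which is why long system prompts show up directly in time-to-first-token.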
3. The KV cache — the secret weapon
Attention compares the current token’s query vector to every previous token’s key vector. Naively, for token N you’d recompute K(0..N-1) every time — quadratic cost.
The KV cache stores K and V for every position once they’re computed. Generating token N+1 only requires:
- One new query, one new key, one new value (for token N+1)
- Attention against the cached K, V from positions 0..N
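Without cache: O(N²) K/V projections per request (every step re-projects all earlier positions)
With cache: O(N) K/V projections per request (one new K, V per step; attention still reads all cached positions)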
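The saving is easy to see in a toy single-head attention written in plain Python. This is a sketch under simplifying assumptions: a head dimension of 2, made-up projection matrices, and the input vectors standing in for one layer’s hidden states. It counts how many K/V projections each strategy performs and checks that both produce the same outputs.

```python
import math

D = 2  # toy head dimension (assumed)
# Hypothetical fixed projection matrices, D x D each.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.4]]
W_V = [[1.0, 2.0], [3.0, 4.0]]

def project(x, w):
    """Row vector x (length D) times matrix w (D x D)."""
    return [sum(x[i] * w[i][j] for i in range(D)) for j in range(D)]

def attend(q, keys, values):
    """Softmax-weighted sum of values, scored by q . k / sqrt(D)."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(D)]

def decode_without_cache(xs):
    """Re-project K and V for every earlier position at every step."""
    outputs, projections = [], 0
    for n in range(1, len(xs) + 1):
        keys = [project(x, W_K) for x in xs[:n]]
        values = [project(x, W_V) for x in xs[:n]]
        projections += 2 * n
        outputs.append(attend(project(xs[n - 1], W_Q), keys, values))
    return outputs, projections

def decode_with_cache(xs):
    """Project K and V once per position and append them to the cache."""
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in xs:
        k_cache.append(project(x, W_K))
        v_cache.append(project(x, W_V))
        projections += 2
        outputs.append(attend(project(x, W_Q), k_cache, v_cache))
    return outputs, projections

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_slow, proj_slow = decode_without_cache(xs)
out_fast, proj_fast = decode_with_cache(xs)
print("K/V projections without cache:", proj_slow, "with cache:", proj_fast)
```

For four positions that is 20 projections versus 8, and the gap widens quadratically as the sequence grows. The price is memory: the cache has to hold K and V for every position in every layer.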
4. Decode — generating one token at a time
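Decode is the boring-feeling part, and the slow one. Every iteration has to read all of the model weights from GPU memory to produce just one new token per request, which makes decode memory-bandwidth-bound rather than compute-bound. Each iteration: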
- Take the most recently generated token
- One forward pass through the model
- Append the new K, V to the cache
- Sample the next token from the output distribution
- Repeat
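The loop above can be sketched in a few lines. Everything model-related here is a stand-in: `toy_forward` is a hypothetical function that returns logits from a fixed rule instead of running a transformer, the cache stores token IDs instead of K and V tensors, and `EOS` is an assumed end-of-sequence ID.

```python
import math
import random

VOCAB_SIZE = 8
EOS = 0  # hypothetical end-of-sequence token id

def toy_forward(last_token: int, cache: list[int]) -> list[float]:
    """Stand-in for one forward pass: returns logits over the vocabulary.

    Hypothetical rule: strongly favor (last_token + 1) % VOCAB_SIZE.
    A real model would attend over the K, V entries held in the cache.
    """
    logits = [0.0] * VOCAB_SIZE
    logits[(last_token + 1) % VOCAB_SIZE] = 5.0
    return logits

def sample(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Greedy pick at temperature 0, otherwise sample from the softmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    top = max(logits)
    weights = [math.exp((logit - top) / temperature) for logit in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def decode(first_token: int, max_new_tokens: int,
           temperature: float = 0.0, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    cache: list[int] = []
    generated: list[int] = []
    token = first_token
    for _ in range(max_new_tokens):
        logits = toy_forward(token, cache)        # one forward pass
        cache.append(token)                       # stand-in for appending K, V
        token = sample(logits, temperature, rng)  # pick the next token
        generated.append(token)
        if token == EOS:
            break
    return generated

print(decode(first_token=4, max_new_tokens=10))
```

Note that the loop is strictly sequential: token N+1 cannot start until token N has been sampled, which is why output length drives latency far more than prompt length does.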
5. Model parallelism — when one GPU isn’t enough
A 70B model in FP16 needs about 140 GB for the weights alone, which doesn’t fit on one GPU (most have 40–80 GB). You need to split it.
Tensor parallelism: split each layer’s matrices across GPUs. The matmul x · W where W is 16k × 16k can be split across 4 GPUs, each holding a 16k × 4k column slice. After the matmul, a collective operation combines the per-GPU results: an all-gather when each GPU holds a slice of the output, an all-reduce when each GPU holds a partial sum (as after a row-split matrix). Communication-heavy, requires fast interconnect (NVLink), best inside a single node.
Pipeline parallelism: split the model into sequential stages. GPU 0 holds layers 0–19, GPU 1 holds 20–39, etc. Each request flows through the stages in order. While request A is in stage 2, request B can be in stage 1. Less communication, scales across nodes.
Expert parallelism (MoE models): only some “experts” activate per token, so different GPUs hold different experts and each token is routed to the GPUs holding its experts.
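Tensor parallelism is easiest to trust once you see that a sharded matmul gives the same answer as the full one. The sketch below simulates the shards as list slices on one machine, with no real GPUs or collectives involved. The matrix values are arbitrary. Splitting by columns yields output slices that are concatenated (what an all-gather does), and splitting by rows yields partial sums that are added (what an all-reduce does).

```python
def matmul(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]

def column_parallel(x, w, n_gpus):
    """Each shard holds a slice of W's columns; outputs are concatenated."""
    cols = len(w[0]) // n_gpus
    out = []
    for g in range(n_gpus):
        shard = [row[g * cols:(g + 1) * cols] for row in w]
        out.extend(matmul(x, shard))  # stands in for an all-gather
    return out

def row_parallel(x, w, n_gpus):
    """Each shard holds a slice of W's rows and the matching slice of x;
    partial outputs are summed elementwise."""
    rows = len(w) // n_gpus
    total = [0] * len(w[0])
    for g in range(n_gpus):
        partial = matmul(x[g * rows:(g + 1) * rows], w[g * rows:(g + 1) * rows])
        total = [a + b for a, b in zip(total, partial)]  # stands in for an all-reduce
    return total

x = [1, 2, 3, 4]
w = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
full = matmul(x, w)
print(full, column_parallel(x, w, 2), row_parallel(x, w, 2))
```

All three results match. On real hardware the concatenate and sum steps are network traffic after every sharded layer, which is why tensor parallelism wants NVLink-class bandwidth.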
6. Continuous batching — keeping the GPUs busy
Static batching: collect N requests, run them together until all finish, then start the next batch. Problem: requests vary in output length, so the slots freed by short requests sit idle until the longest one finishes.
Continuous batching (introduced by Orca and popularized by vLLM): treat batching at the iteration level, not the request level. After every token-generation iteration:
- Requests that finished release their slot
- New incoming requests fill the empty slots
- The batch stays full for as long as requests are waiting in the queue
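A small simulation makes the difference measurable. This is a sketch, not a scheduler: it counts decode iterations only, assumes every request is already queued at time zero, ignores prefill, and uses made-up output lengths.

```python
from collections import deque

def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps

def continuous_batching_steps(output_lengths, batch_size):
    """After every iteration, finished requests free their slot and queued ones take it."""
    queue = deque(output_lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while queue or running:
        # Fill any empty slots from the queue before the next iteration.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decode iteration for the whole batch
        # Requests that just produced their last token leave the batch.
        running = [left - 1 for left in running if left > 1]
    return steps

lengths = [2, 10, 3, 9, 1, 8, 2, 7]  # hypothetical output lengths in tokens
static_steps = static_batching_steps(lengths, 4)
continuous_steps = continuous_batching_steps(lengths, 4)
print("static:", static_steps, "continuous:", continuous_steps)
```

With these lengths and a batch size of 4, static batching needs 18 iterations and continuous batching needs 12. The flip side is the P99 effect: when a burst arrives, your request shares every iteration with more neighbors.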
7. Quantization — fitting bigger models in less memory
A 70B model in FP16 is 140 GB. In INT8 (8-bit integers): 70 GB. In INT4: 35 GB. For a small quality loss (often quoted as 1–3%, depending on the method and the benchmark), you can serve the same model on half or a quarter of the hardware.
Two flavors:
- Weight-only quantization: weights are INT4/INT8, activations stay FP16. Easy, good quality, and it already cuts the memory traffic that dominates decode.
- Weight-and-activation quantization: both weights and activations are quantized. Bigger speedup, because the matmuls themselves can run in low precision, but more quality loss without careful calibration.
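The memory arithmetic and the basic idea of weight quantization both fit in a short sketch. The scheme shown is symmetric per-tensor INT8, one of the simplest options; production methods typically quantize per channel or per group and calibrate on data. The sample weights are made up.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127].

    Assumes at least one weight is non-zero.
    """
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate floating-point weights from the integers."""
    return [q * scale for q in q_weights]

for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")

weights = [0.503, -1.27, 0.034, 0.9]  # hypothetical FP16 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max round-trip error {max_error:.4f}")
```

The round-trip error is bounded by half the scale, so one outlier weight stretches the scale and costs precision for everything else. That is the main reason finer-grained schemes exist.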
8. Speculative decoding — multiple tokens per pass
The cleverest recent trick. A small “draft” model (e.g., 1B params) is fast at guessing what the big model would say. So:
- Draft model generates K tokens (e.g., 5)
- Target model verifies all K in a single forward pass (cheap because it’s prefill-style)
- The target’s output distribution at each position tells us which draft tokens to accept
- First mismatch → reject everything from there onward
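The accept/reject step can be sketched for the greedy case, where a draft token is accepted only if it equals the token the target model would have picked itself. The function name and token IDs below are hypothetical. Implementations that sample rather than decode greedily use a probabilistic acceptance rule instead of the equality check shown here.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of one speculative round.

    draft_tokens: the K tokens proposed by the draft model.
    target_tokens: the target model's own greedy choice at each of the K
    positions, plus one extra token for the position after the last draft
    token (K + 1 in total), all obtained from a single forward pass.
    Returns the tokens to emit this round.
    """
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First mismatch: drop the rest of the draft, keep the target's token.
            accepted.append(wanted)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the target's extra token comes for free.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted

# Draft diverges at the third position: two accepted, then the target's correction.
print(verify_draft([11, 42, 7, 99, 5], [11, 42, 8, 13, 5, 60]))
# Draft fully accepted: three draft tokens plus one bonus token.
print(verify_draft([11, 42, 7], [11, 42, 7, 99]))
```

Each round emits at least one token, so in this greedy setup the output is the same as decoding with the target model alone. The speedup depends entirely on how often the draft agrees with the target.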
9. The full request lifecycle
Putting it all together, a plausible lifecycle for a ChatGPT-style request (providers don’t publish their exact architecture):
sequenceDiagram
autonumber
participant U as You
participant LB as Load balancer
participant R as Router
participant P as Prefill cluster<br/>(compute-bound)
participant D as Decode cluster<br/>(memory-bound)
participant T as Telemetry
U->>LB: POST /chat/completions
LB->>R: route (region, shed if overloaded)
R->>R: pick model variant<br/>(mini vs full)
R->>P: prompt tokens
P->>P: tokenize + prefill (one pass)
P->>D: KV cache transfer
loop for each output token
D->>D: continuous-batched decode<br/>(mixed with other users)
D-->>U: stream token (SSE)
end
D->>T: usage, latency, quality
T-->>U: completion event
The split between prefill and decode is increasingly important. Modern serving systems, reportedly including those at large providers such as DeepMind, OpenAI, and Anthropic, run dedicated prefill clusters and decode clusters because the two stages have different optimal hardware (compute-bound vs memory-bound).
10. What this means for you
Even if you never self-host, knowing this stack changes how you build:
- Cheap models exist because of all of these tricks. GPT-4o-mini is probably not just a small model trained from scratch: OpenAI hasn’t published the recipe, but models in this class are commonly distilled or fine-tuned from a larger one, then served through the same continuous-batching, quantized, speculative-decoded stack. Use the cheapest model that meets your quality bar.
- Self-host when scale justifies it. As a rough rule of thumb, vLLM on off-the-shelf H100s can get within about 2× of OpenAI’s serving cost at high utilization. Below roughly 100k requests/day, hosted APIs usually win on operational cost.