LLM inference is two different jobs. Prefill (process the prompt) is compute-bound and parallel. Decode (generate tokens one at a time) is memory-bandwidth-bound — every token re-reads all the weights and the entire KV cache. Almost every serving trick exists to make decode less wasteful or to pack more requests onto the GPU. This sheet goes deep: the arithmetic, the KV cache, batching, quantization, decoding tricks, parallelism, the stack, the tuning knobs, and the metrics that prove it works.
1. Prefill vs decode
| | Prefill | Decode |
|---|---|---|
| What | Encode the whole prompt; build KV cache for all prompt tokens at once | Generate one token per forward, autoregressively |
| Parallelism | All prompt positions in parallel → big GEMM | One position at a time → tall-skinny matmul |
| Bound by | Compute (FLOPs) | Memory bandwidth (read weights + KV every step) |
| Metric | TTFT (time to first token) | TPOT / inter-token latency |
| Cost driver | Prompt length | Output length × model size |
Total user-visible latency = TTFT + (output_tokens × TPOT). Throughput = total tokens/sec across all concurrent requests. You optimize them with different levers, so always measure the split first.
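As a rough illustration of the latency formula, the sketch below plugs in made-up numbers for two workloads. The function name and all values are assumptions chosen for the example, not measurements.

```python
def end_to_end_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """User-visible latency for one request: TTFT + output_tokens * TPOT."""
    return ttft_s + output_tokens * tpot_s


# Hypothetical workloads (illustrative values only).
chat = end_to_end_latency_s(ttft_s=0.4, tpot_s=0.03, output_tokens=200)
summarize = end_to_end_latency_s(ttft_s=2.5, tpot_s=0.03, output_tokens=50)

print(f"chat:      {chat:.2f} s (decode dominates)")
print(f"summarize: {summarize:.2f} s (prefill dominates)")
```

The same TPOT gives very different totals depending on which phase dominates, which is why the split is worth measuring before tuning.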
why decode is bandwidth-bound
Decode does very little math per token (a thin matmul per weight matrix) but must read every weight
and the whole KV cache from HBM each step. Arithmetic intensity is tiny → you're waiting on memory,
not the ALUs. That's why shrinking bytes moved (quantization, smaller KV) speeds decode, while adding
FLOPs alone barely helps. Batching raises intensity because one weight read serves many sequences.
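A back-of-envelope way to see this is to divide bytes read per step by memory bandwidth, which gives a floor on the decode step time. The sketch below assumes a 7B model (14 GB at FP16, 3.5 GB at 4-bit), 2 GB of KV, and a hypothetical 2 TB/s of HBM bandwidth; it ignores compute, kernel launch overhead, and quantization scales.

```python
def decode_step_floor_ms(weight_bytes: float, kv_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every weight and KV byte is read once."""
    return (weight_bytes + kv_bytes) / hbm_bytes_per_s * 1e3


HBM_BW = 2e12  # assumed 2 TB/s; substitute your GPU's real figure

fp16_ms = decode_step_floor_ms(weight_bytes=14e9, kv_bytes=2e9, hbm_bytes_per_s=HBM_BW)
int4_ms = decode_step_floor_ms(weight_bytes=3.5e9, kv_bytes=2e9, hbm_bytes_per_s=HBM_BW)

print(f"FP16 weights: >= {fp16_ms:.2f} ms per step")
print(f"INT4 weights: >= {int4_ms:.2f} ms per step")
```

Real step times sit above this floor, but the ratio shows why cutting bytes moved helps decode directly.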
2. The metrics that matter
- TTFT — time to first token. Dominated by prompt length (prefill compute) + queue wait. The "feels responsive" number for chat.
- TPOT / ITL — time per output token (inter-token latency). Dominated by model size, batch size, bandwidth, quantization.
- Throughput — total output (and input) tokens/sec; what your $/token is based on. Rises with batch but so does TPOT.
- Goodput — throughput of requests that met their SLO (e.g. TTFT<1s, TPOT<50ms). The honest capacity number; raw throughput can hide SLO violations.
- Latency percentiles — p50/p95/p99. Tail latency is where users feel pain; a good mean with bad p99 is a scheduling problem.
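To make the difference between throughput and goodput concrete, here is a minimal sketch over a handful of made-up request records. The SLO thresholds match the example above; the nearest-rank percentile and counting goodput in tokens/sec (rather than requests/sec) are assumptions of this sketch.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# (ttft_s, tpot_s, output_tokens) per completed request; invented sample data.
requests = [
    (0.3, 0.030, 120),
    (0.5, 0.040, 200),
    (0.4, 0.035, 80),
    (1.8, 0.045, 150),  # violates the TTFT SLO
    (0.6, 0.070, 300),  # violates the TPOT SLO
]
WINDOW_S = 10.0
TTFT_SLO_S, TPOT_SLO_S = 1.0, 0.050

total_tokens = sum(n for _, _, n in requests)
good_tokens = sum(n for ttft, tpot, n in requests
                  if ttft < TTFT_SLO_S and tpot < TPOT_SLO_S)

throughput = total_tokens / WINDOW_S  # all output tokens per second
goodput = good_tokens / WINDOW_S      # only tokens from SLO-compliant requests
p50_ttft = percentile([r[0] for r in requests], 50)
p99_ttft = percentile([r[0] for r in requests], 99)

print(throughput, goodput, p50_ttft, p99_ttft)
```

In this sample less than half of the raw throughput counts as goodput, and the p99 TTFT is far from the median, which is the pattern the bullets above warn about.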
3. The KV cache (the real constraint)
- Each generated token attends to all previous tokens, so you cache their Key/Value tensors to avoid recomputing them every step.
- Size formula: KV_bytes = 2 (K and V) × n_layers × n_kv_heads × head_dim × seq_len × batch × dtype_bytes.
- It grows linearly with context length and concurrency, and it competes with weights for VRAM. At long context / high concurrency, the KV cache, not the weights, is what OOMs you.
- GQA/MQA shrink it: fewer n_kv_heads → proportionally smaller KV → more concurrency, longer context, faster decode.
    # Example: 7B, 32 layers, 32 KV heads, head_dim 128, FP16, 4096 ctx, 1 seq
    KV = 2 × 32 × 32 × 128 × 4096 × 1 × 2 bytes ≈ 2.1 GB per sequence
    # 64 concurrent sequences → ~137 GB of KV alone. GQA (8 KV heads) → ÷4.
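The size formula is easy to turn into a small calculator. The sketch below reproduces the example numbers; the function name is illustrative, and the result is the worst case in which every sequence fills the full context.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes):
    """KV cache size: 2 (K and V) x layers x KV heads x head_dim x tokens x batch x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes


# 7B-class model, FP16 (2 bytes), 4096-token context.
mha_one = kv_cache_bytes(32, 32, 128, 4096, batch=1, dtype_bytes=2)
mha_64 = kv_cache_bytes(32, 32, 128, 4096, batch=64, dtype_bytes=2)
gqa_64 = kv_cache_bytes(32, 8, 128, 4096, batch=64, dtype_bytes=2)  # 8 KV heads

for label, size in [("MHA, 1 seq", mha_one), ("MHA, 64 seqs", mha_64), ("GQA, 64 seqs", gqa_64)]:
    print(f"{label}: {size / 1e9:.1f} GB")
```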
PagedAttention is why vLLM wins
Naive serving pre-allocates a contiguous KV block for the max possible length → huge internal
fragmentation and waste. PagedAttention pages the KV cache like OS virtual memory: fixed-size blocks,
a block table mapping logical→physical, allocated on demand, non-contiguous. Result: near-zero waste,
far higher batch sizes, and shareable blocks (common prefixes, beam search) — the core of vLLM's
throughput.
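The block-table idea can be sketched in a few lines. This is a toy allocator written for illustration, not vLLM's implementation: the class and method names are hypothetical, and it omits reference counting, so it does not show block sharing.

```python
class PagedKVAllocator:
    """Toy block allocator: logical token positions map to fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def locate(self, seq_id, pos):
        """Translate a logical position into (physical block id, offset in block)."""
        return self.block_tables[seq_id][pos // self.block_size], pos % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens -> 1 block
print(alloc.block_tables, len(alloc.free_blocks))
```

Waste is bounded by the unused tail of each sequence's last block, rather than by the gap between actual and maximum length.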
4. Continuous batching
- Static batching waits for the whole batch to finish — short requests block behind the longest one (head-of-line blocking), and a finished slot can't be reused mid-batch.
- Continuous (in-flight) batching schedules at the token level: every decode step, admit queued requests and evict finished ones. A completed sequence frees its slot immediately. Keeps the GPU densely packed under length variance.
- This is the single biggest throughput lever after the KV cache. Default in vLLM / TGI / TensorRT-LLM.
- Chunked prefill interleaves prefill chunks with ongoing decode so a big prompt doesn't stall everyone's token generation — keeps TPOT stable under load.
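A tiny simulation shows the head-of-line effect. It is a sketch under simplifying assumptions: prefill is ignored, every decode step costs the same regardless of batch size, and each running sequence emits one token per step. Function names are illustrative.

```python
from collections import deque


def static_batching_steps(output_lens, slots):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), slots):
        steps += max(output_lens[i:i + slots])
    return steps


def continuous_batching_steps(output_lens, slots):
    """Every step: admit queued requests into free slots, decode, evict finished."""
    queue = deque(output_lens)
    running = []  # remaining tokens per running request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        running = [r - 1 for r in running]       # one decode step for all
        running = [r for r in running if r > 0]  # finished requests free their slot
        steps += 1
    return steps


lens = [100, 5, 5, 5, 100, 5, 5, 5]  # mixed short and long outputs
static_steps = static_batching_steps(lens, slots=4)
continuous_steps = continuous_batching_steps(lens, slots=4)
print(static_steps, continuous_steps)
```

With this mix the continuous scheduler finishes in roughly half the steps, because short requests no longer hold slots hostage until the long one completes.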
5. Quantization (smaller, faster)
| Method | What it quantizes | Notes |
|---|---|---|
| GPTQ | Weights → INT4/3 (post-training, layer-wise) | Calibration-based; widely supported. |
| AWQ | Weights → INT4 (protects salient channels) | Strong quality at 4-bit; popular default. |
| SmoothQuant / W8A8 | Weights + activations → INT8 | Uses INT8 Tensor Cores; needs activation handling. |
| FP8 (E4M3) | Weights + activations → 8-bit float | Hopper/Ada/Blackwell native; good balance. |
| KV-cache quant | Stored K/V → INT8/FP8 | More concurrency / longer context per GB. |
| GGUF (llama.cpp) | Weights → various k-quants | CPU/edge/local serving. |
Weight-only vs weight+activation: decode is bandwidth-bound, so weight-only 4-bit (AWQ/GPTQ) is the easiest big win — fewer weight bytes per token, minimal quality hit. W8A8/FP8 also speed compute-bound prefill by using low-precision Tensor Cores. Always validate on an eval set; quality loss is model- and task-dependent.
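For intuition about what weight-only quantization stores, here is a minimal round-to-nearest sketch with one scale per group. It is deliberately simpler than GPTQ or AWQ, which choose scales and rounding using calibration data; the function names and sample weights are made up.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate weights at matmul time."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.02, -0.78, 0.44]
q, scale = quantize_group(weights, bits=4)
restored = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("ints:", q)
print(f"scale: {scale:.4f}, max abs error: {max_err:.4f}")
```

Each weight now costs 4 bits plus a shared scale per group, and the rounding error is bounded by half the scale, which is why quality has to be checked on an eval set rather than assumed.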
6. Faster decoding
- Speculative decoding — a small draft model proposes K tokens; the target verifies them in one parallel forward and accepts the longest prefix consistent with its own distribution (modified rejection sampling). Same output distribution, lower latency. Variants: Medusa heads, EAGLE, n-gram/prompt lookup, self-speculation.
- Prefix caching — reuse KV for shared prompt prefixes (system prompts, few-shot examples, RAG headers). Cuts prefill cost and TTFT for repeated heads. SGLang's RadixAttention generalizes this with a radix tree.
- Chunked prefill — bound TTFT under load by splitting long prompts.
- Multi-LoRA serving — many adapters on one base model in VRAM (S-LoRA), batching across adapters.
- CUDA graphs — capture the decode step to remove per-launch CPU overhead at small batch.
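The verification step of speculative decoding can be sketched as follows. This is a toy version over explicit probability tables, assuming the draft and target distributions are already computed; the function names are hypothetical, and a real server does this on tensors for a whole batch.

```python
import random


def sample(weights, rng):
    """Draw one token from an unnormalized {token: weight} mapping."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the tokens emitted by one speculative step.

    draft_probs[i] and target_probs[i] map token -> probability at position i.
    target_probs has one extra entry: the target's distribution after all K
    drafted tokens, used for the bonus token when every draft is accepted.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i].get(tok, 0.0)
        q = draft_probs[i][tok]  # > 0, because the draft sampled tok
        if rng.random() < min(1.0, p / q):
            out.append(tok)  # accepted, move on to the next drafted token
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = {t: max(0.0, pt - draft_probs[i].get(t, 0.0))
                    for t, pt in target_probs[i].items()}
        out.append(sample(residual, rng))
        return out
    out.append(sample(target_probs[len(draft_tokens)], rng))  # bonus token
    return out


rng = random.Random(0)
same = {"a": 0.7, "b": 0.3}
agree = verify_draft(["a", "b"], [same, same], [same, same, same], rng)
disagree = verify_draft(["a"], [{"a": 1.0}], [{"a": 0.0, "b": 1.0}, same], rng)
print(agree, disagree)
```

When draft and target agree, all K tokens plus a bonus token come out of one target pass; when they disagree, the step still emits one token drawn so that the overall distribution matches the target.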
7. Attention variants that change serving
| Variant | Effect on serving |
|---|---|
| MHA (multi-head) | Full K/V per head → largest KV cache. |
| MQA (multi-query) | One K/V head shared by all query heads → smallest KV, fastest decode, small quality cost. |
| GQA (grouped-query) | K/V shared per group → middle ground; the modern default (Llama-2/3 70B, Mistral). |
| MLA (multi-head latent) | DeepSeek's low-rank KV compression → very small KV cache. |
| Sliding-window | Bounded attention window → bounded KV (Mistral). Trades long-range recall. |
8. Fitting big models across GPUs
- Tensor parallel (TP) — split each layer's matmuls across GPUs, all-reduce activations per layer. Low latency, needs NVLink, intra-node. Best for latency-sensitive serving on one node.
- Pipeline parallel (PP) — split layers into stages across GPUs/nodes; activations passed stage→stage. Crosses nodes well, adds pipeline latency, needs concurrency to fill stages.
- Expert parallel — route MoE experts across GPUs.
- Pick TP for single-node latency; add PP to span nodes for very large models. Replicate the whole setup behind a load balancer for horizontal scale.
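The core of tensor parallelism is that a matmul can be split across devices and stitched back with a sum. The sketch below simulates that on plain Python lists by sharding the input dimension; the "devices" are loop iterations and the final sum stands in for the all-reduce. Names are illustrative.

```python
def matvec(W, x):
    """Dense matrix-vector product; W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def input_sharded_matvec(W, x, tp):
    """Split the input dimension across tp shards, then sum partial outputs."""
    assert len(x) % tp == 0, "input dim must divide evenly across shards"
    shard = len(x) // tp
    partials = []
    for rank in range(tp):  # each iteration plays the role of one GPU
        lo, hi = rank * shard, (rank + 1) * shard
        W_shard = [row[lo:hi] for row in W]  # this rank's slice of the weights
        partials.append(matvec(W_shard, x[lo:hi]))
    return [sum(vals) for vals in zip(*partials)]  # stands in for the all-reduce


W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 0, -1, 2]

full = matvec(W, x)
sharded = input_sharded_matvec(W, x, tp=2)
print(full, sharded)
```

Each shard holds only a fraction of the weights, but every layer pays for a collective to combine partial results, which is why TP wants a fast interconnect.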
9. The serving stack (2025-era)
| Tool | Use |
|---|---|
| vLLM | De-facto OSS server: PagedAttention, continuous batching, OpenAI-compatible API, broad model + quant support. Default. |
| TensorRT-LLM | NVIDIA, compiled engines, max performance on NVIDIA HW. More ops effort; pair with Triton Inference Server. |
| TGI | Hugging Face text-generation-inference; solid production server. |
| SGLang | Fast serving + structured/programmatic generation; RadixAttention prefix sharing. |
| llama.cpp / Ollama | Local / CPU / Apple Silicon / GGUF; great for dev + edge. |
| LMDeploy / DeepSpeed-MII | Other high-throughput servers worth benchmarking. |
10. Tuning knobs & serving math
    # vLLM: the levers you actually touch
    --max-model-len 8192            # cap context = cap KV per request
    --gpu-memory-utilization 0.90   # fraction of VRAM for weights + KV pool
    --max-num-seqs 256              # concurrency cap (running requests)
    --max-num-batched-tokens 8192   # token budget per scheduler step
    --tensor-parallel-size 2        # split model across 2 GPUs
    --quantization awq              # 4-bit weights
    --kv-cache-dtype fp8            # quantize KV cache
    --enable-prefix-caching         # reuse shared prefixes
    --enable-chunked-prefill        # bound TTFT under load
- VRAM budget ≈ weights + KV pool + activation overhead. Raising concurrency or context needs a bigger KV pool → shrink weights (quantize) or lower the others, or OOM.
- Bigger batch → more throughput but higher TPOT. Set max-num-seqs at the knee where p99 latency still meets the SLO.
- Leave headroom: gpu-memory-utilization too high (0.97) leaves no room for activation spikes → intermittent OOM.
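The VRAM budget can be turned into a rough capacity estimate. The sketch below is a simplification with made-up numbers: it treats activation overhead as a fixed figure inside the budget and sizes every sequence at its worst-case KV footprint, whereas a paged KV pool fits more when sequences are shorter than the cap.

```python
def max_concurrent_seqs(vram_gb, gpu_mem_util, weight_gb, overhead_gb, kv_gb_per_seq):
    """Worst-case number of full-length sequences that fit in the KV pool."""
    kv_pool_gb = vram_gb * gpu_mem_util - weight_gb - overhead_gb
    if kv_pool_gb <= 0:
        return 0  # weights plus overhead already exceed the budget
    return int(kv_pool_gb // kv_gb_per_seq)


# Hypothetical 80 GB GPU, 7B model, 0.5 GB of KV per full-length sequence.
fp16_fit = max_concurrent_seqs(80, 0.90, weight_gb=14.0, overhead_gb=2.0, kv_gb_per_seq=0.5)
awq_fit = max_concurrent_seqs(80, 0.90, weight_gb=3.5, overhead_gb=2.0, kv_gb_per_seq=0.5)

print(f"FP16 weights: {fp16_fit} sequences")
print(f"4-bit weights: {awq_fit} sequences")
```

Quantizing weights frees pool space for more sequences, and halving the per-sequence KV (shorter context cap or KV quantization) would roughly double either figure.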
11. Sampling & output control
- temperature — randomness; 0 = greedy/deterministic (use for eval, code, extraction).
- top-p (nucleus) / top-k — truncate the tail to the smallest set with cumulative prob p / top k tokens.
- repetition / frequency / presence penalty — discourage loops and over-repetition.
- stop sequences / EOS: must be configured correctly or generation runs to max_tokens every time (cost + latency + KV blowup).
- structured output: JSON-schema / grammar-constrained decoding (Outlines, XGrammar) guarantees parseable output, provided generation is not cut off by max_tokens.
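How temperature, top-k, and top-p combine can be sketched on a plain list of logits. This is an illustrative implementation, not any particular server's sampler; the order of operations (temperature, then top-k, then top-p) is an assumption and varies between libraries.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Return a token index chosen with temperature, top-k and top-p filtering."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep only the k most likely tokens
    kept, cum = [], 0.0
    for i in order:  # smallest prefix whose cumulative probability reaches top_p
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # choices() renormalizes the surviving probabilities implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
greedy = sample_token(logits, temperature=0)
nucleus = [sample_token(logits, top_p=0.8, rng=rng) for _ in range(50)]
print(greedy, sorted(set(nucleus)))
```

With these logits a top-p of 0.8 keeps only the two most likely tokens, so the low-probability tail can never be sampled.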
wrong chat template = garbage output
Each instruct model has an exact prompt format (special tokens, role markers). Use the model's own
chat template / tokenizer; a mismatched template produces rambling or ignored-instruction output and
is the most common "the model is broken" bug. Same for missing stop tokens → runaway generations.
12. Quick reference
Latency = TTFT + tokens × TPOT ; Throughput vs latency trade at batch
Prefill = compute-bound (prompt) → chunked prefill + prefix cache
Decode = bandwidth-bound (per token) → 4-bit weights, KV quant, speculate
KV cache grows with context × concurrency → the usual OOM ; GQA/MQA shrink it
Levers: continuous batching, PagedAttention, AWQ/GPTQ, prefix cache,
speculative decoding, TP/PP, chunked prefill
Knobs: max-model-len, max-num-seqs, gpu-memory-utilization, kv-cache-dtype
Metrics: TTFT, TPOT, throughput, GOODPUT, p99 ; serve with vLLM/TRT-LLM/SGLang
13. Interview Q&A
- Why is decode memory-bound but prefill compute-bound? Prefill processes many prompt tokens in parallel as one big GEMM (compute). Decode does one token at a time, re-reading all weights + KV per step with little math → bandwidth-bound.
- What does the KV cache cost and why does it matter? 2 × layers × kv_heads × head_dim × seq_len × batch × dtype_bytes. Grows linearly with context × concurrency; at scale it dominates VRAM and is what OOMs you, not the weights. Drives PagedAttention, KV quant, GQA.
- What is continuous batching? Token-level scheduling: admit new and evict finished requests every step instead of waiting for a static batch. Removes head-of-line blocking; biggest throughput win.
- How does speculative decoding stay correct? A draft model proposes K tokens; the target verifies in one pass and accepts via modified rejection sampling calibrated to its own distribution. Output is distributed exactly as the target alone: same quality, lower latency.
- Which quantization for latency-sensitive decode? Weight-only 4-bit (AWQ/GPTQ): decode is bandwidth-bound, so fewer weight bytes per token cuts TPOT with minimal quality loss. Add KV-cache quant for long context.
- How do GQA/MQA change serving? Fewer KV heads → smaller KV cache → more concurrency, longer context, faster decode (less KV to read), small quality cost. This is why modern serving models use GQA.
- TTFT is fine, throughput low: what do you check? Batching strategy (static vs continuous), max-num-seqs, KV pool limiting concurrency, queueing. Raise concurrency to the knee where p99 still meets SLO.
- Server OOMs only under load: why? KV cache scales with context × concurrency. Lower max-model-len/max-num-seqs, quantize KV, cap output tokens, back off gpu-memory-utilization.
- Tensor vs pipeline parallel for inference? TP splits each layer (low latency, NVLink, single node); PP splits layers into stages (crosses nodes, pipeline latency, needs concurrency). TP for latency, +PP to span nodes.
- How do you serve many fine-tunes cheaply? Multi-LoRA: one base model in VRAM + small swappable adapters (S-LoRA), batched across adapters. Avoids a full model per variant.
- What is goodput and why prefer it? Throughput of requests meeting their SLO. Raw throughput can be high while many requests violate latency targets; goodput reflects real user-facing capacity.
- Same prompt, different answer at temp 0: is it a bug? Not necessarily. Greedy removes sampling randomness, but batching changes floating-point reduction order, so tiny numerical differences are expected. Genuine divergence points to nondeterministic kernels or state.
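The last answer rests on floating-point addition not being associative. The sketch below shows the effect in plain Python and how a near-tie between two logits can flip a greedy choice; the token names and values are contrived for the demonstration.

```python
# The same three numbers summed in two different orders.
order_a = (0.1 + 0.2) + 0.3
order_b = 0.1 + (0.2 + 0.3)
print(order_a == order_b, order_a, order_b)


def greedy(tokens, logits):
    """Pick the highest-logit token; ties go to the earliest token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]


tokens = ["B", "A"]
# Token A's logit comes from a reduction whose order depends on the batch shape.
pick_a = greedy(tokens, [0.6, order_a])
pick_b = greedy(tokens, [0.6, order_b])
print(pick_a, pick_b)
```

Once one token differs, the rest of the generation can diverge, so small run-to-run differences at temperature 0 are not by themselves evidence of a bug.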