LLM Inference Cheatsheet

LLM inference is two different jobs. Prefill (process the prompt) is compute-bound and parallel. Decode (generate tokens one at a time) is memory-bandwidth-bound — every token re-reads all the weights and the entire KV cache. Almost every serving trick exists to make decode less wasteful or to pack more requests onto the GPU. This sheet goes deep: the arithmetic, the KV cache, batching, quantization, decoding tricks, parallelism, the stack, the tuning knobs, and the metrics that prove it works.

1. Prefill vs decode

Aspect	Prefill	Decode
What	Encode the whole prompt; build KV cache for all prompt tokens at once	Generate one token per forward, autoregressively
Parallelism	All prompt positions in parallel → big GEMM	One position at a time → tall-skinny matmul
Bound by	Compute (FLOPs)	Memory bandwidth (read weights + KV every step)
Metric	TTFT (time to first token)	TPOT / inter-token latency
Cost driver	Prompt length	Output length × model size

Total user-visible latency = TTFT + (output_tokens × TPOT). Throughput = total tokens/sec across all concurrent requests. You optimize them with different levers, so always measure the split first.
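A small sketch of this arithmetic with made-up numbers. The function names and sample values are illustrative assumptions, not measurements from any real server.

```python
def request_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency using the formula above: TTFT + output_tokens * TPOT."""
    return ttft_s + output_tokens * tpot_s


def aggregate_throughput_tok_s(total_tokens: int, wall_clock_s: float) -> float:
    """Tokens per second summed over all concurrent requests."""
    return total_tokens / wall_clock_s


# Hypothetical request: 0.4 s to first token, 30 ms per token, 200 output tokens.
latency = request_latency_s(ttft_s=0.4, tpot_s=0.030, output_tokens=200)

# Hypothetical server: 6400 tokens produced across all requests in 10 s.
throughput = aggregate_throughput_tok_s(total_tokens=6400, wall_clock_s=10.0)
```

With these numbers the decode term (6.0 s) dwarfs TTFT (0.4 s), which is typical for long generations; for short replies the balance flips toward TTFT.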

Why decode is bandwidth-bound: decode does very little math per token (a few tall-skinny matmuls per layer) but must read every weight and the whole KV cache from HBM each step. Arithmetic intensity is tiny, so at small batch sizes you are waiting on memory, not the ALUs. That is why shrinking bytes moved (quantization, smaller KV) speeds up decode, while adding FLOPs barely helps.
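To make the bandwidth argument concrete, here is a back-of-the-envelope floor on decode step time: bytes read divided by memory bandwidth. The 2 TB/s bandwidth, the 2 GB KV cache, and the function name are assumptions chosen for illustration; real steps are slower than this floor.

```python
def decode_step_floor_ms(weight_bytes: float, kv_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every weight and KV byte is read once."""
    return (weight_bytes + kv_bytes) / hbm_bytes_per_s * 1e3


HBM_BANDWIDTH = 2e12   # assumed 2 TB/s accelerator
KV_BYTES = 2e9         # assumed 2 GB of KV cache in flight

# 7B parameters at 2 bytes each (FP16) vs 0.5 bytes each (4-bit, ignoring scales).
floor_fp16_ms = decode_step_floor_ms(7e9 * 2.0, KV_BYTES, HBM_BANDWIDTH)
floor_int4_ms = decode_step_floor_ms(7e9 * 0.5, KV_BYTES, HBM_BANDWIDTH)
```

Under these assumptions the floor drops from about 8 ms to about 2.75 ms per step when the weights go from FP16 to 4-bit, without changing the amount of math at all.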

2. The metrics that matter

TTFT — time to first token. Dominated by prompt length (prefill compute) + queue wait. The "feels responsive" number for chat.
TPOT / ITL — time per output token (inter-token latency). Dominated by model size, batch size, bandwidth, quantization.
Throughput — total output (and input) tokens/sec; what your $/token is based on. Rises with batch but so does TPOT.
Goodput — throughput of requests that met their SLO (e.g. TTFT<1s, TPOT<50ms). The honest capacity number; raw throughput can hide SLO violations.
Latency percentiles — p50/p95/p99. Tail latency is where users feel pain; a good mean with bad p99 is a scheduling problem.
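A minimal sketch of how goodput and tail percentiles could be computed from per-request records. The `RequestStats` type, the nearest-rank percentile choice, and the sample numbers are assumptions for illustration, not the output format of any particular benchmark tool.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestStats:
    ttft_s: float
    tpot_s: float
    output_tokens: int


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput_tok_s(requests, wall_clock_s):
    """Output tokens per second over all requests."""
    return sum(r.output_tokens for r in requests) / wall_clock_s


def goodput_tok_s(requests, wall_clock_s, ttft_slo_s, tpot_slo_s):
    """Output tokens per second counting only requests that met both SLOs."""
    good = sum(
        r.output_tokens
        for r in requests
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good / wall_clock_s


# Hypothetical 10-second window with four finished requests.
stats = [
    RequestStats(ttft_s=0.3, tpot_s=0.03, output_tokens=100),
    RequestStats(ttft_s=0.5, tpot_s=0.04, output_tokens=200),
    RequestStats(ttft_s=2.0, tpot_s=0.04, output_tokens=300),  # misses the TTFT SLO
    RequestStats(ttft_s=0.4, tpot_s=0.08, output_tokens=400),  # misses the TPOT SLO
]
raw = throughput_tok_s(stats, wall_clock_s=10.0)
good = goodput_tok_s(stats, wall_clock_s=10.0, ttft_slo_s=1.0, tpot_slo_s=0.05)
p50_ttft = percentile([r.ttft_s for r in stats], 50)
p99_ttft = percentile([r.ttft_s for r in stats], 99)
```

In this toy window raw throughput is 100 tok/s but goodput is only 30 tok/s, and the p99 TTFT (2.0 s) is five times the p50 (0.4 s).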

3. The KV cache (the real constraint)

Each generated token attends to all previous tokens, so you cache their Key/Value tensors to avoid recomputing them every step.
Size formula: KV_bytes = 2 (K and V) × n_layers × n_kv_heads × head_dim × seq_len × batch × dtype_bytes.
It grows linearly with context length and concurrency, and it competes with weights for VRAM. At long context / high concurrency, KV cache — not weights — is what OOMs you.
GQA/MQA shrink it: fewer n_kv_heads → proportionally smaller KV → more concurrency, longer context, faster decode.

# Example: 7B, 32 layers, 32 KV heads, head_dim 128, FP16, 4096 ctx, 1 seq
KV = 2 × 32 × 32 × 128 × 4096 × 1 × 2 bytes ≈ 2.1 GB  per sequence
# 64 concurrent sequences → ~137 GB of KV alone. GQA (8 KV heads) → ÷4.
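The same arithmetic as a runnable function, so other shapes can be plugged in. The function name is an assumption; the numbers are the ones from the example above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes):
    """KV cache size: 2 (K and V) x layers x KV heads x head_dim x tokens x batch x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes


# 32 layers, 32 KV heads, head_dim 128, 4096 tokens, FP16 (2 bytes).
one_seq = kv_cache_bytes(32, 32, 128, 4096, batch=1, dtype_bytes=2)
sixty_four = kv_cache_bytes(32, 32, 128, 4096, batch=64, dtype_bytes=2)

# Same shape with 8 KV heads (GQA): a quarter of the bytes.
sixty_four_gqa = kv_cache_bytes(32, 8, 128, 4096, batch=64, dtype_bytes=2)

print(f"{one_seq / 1e9:.2f} GB, {sixty_four / 1e9:.1f} GB, {sixty_four_gqa / 1e9:.1f} GB")
```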

PagedAttention is why vLLM wins: naive serving pre-allocates a contiguous KV region sized for the maximum possible length, so most of it sits unused (internal fragmentation). PagedAttention pages the KV cache like OS virtual memory: fixed-size blocks, a block table mapping logical→physical, allocated on demand, non-contiguous. Result: near-zero waste, far higher batch sizes, and shareable blocks (common prefixes, beam search); this is the core of vLLM's throughput.
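The block-table idea can be sketched as a toy allocator. This is only bookkeeping under simplifying assumptions (no block sharing, no real tensors); the class and method names are hypothetical and this is not vLLM's implementation.

```python
class PagedKVAllocator:
    """Toy block-table allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve a slot for one more token, grabbing a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id: str, token_idx: int):
        """Translate a logical token index into (physical block, offset)."""
        block = self.block_tables[seq_id][token_idx // self.block_size]
        return block, token_idx % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(16):
    alloc.append_token("a")   # fills a's first block
alloc.append_token("b")       # b takes the next free block
alloc.append_token("a")       # a spills into a second, non-adjacent block
table_a = list(alloc.block_tables["a"])
slot = alloc.physical_slot("a", 16)
free_before = len(alloc.free_blocks)
alloc.free("a")
free_after = len(alloc.free_blocks)
```

Sequence "a" ends up in two physical blocks that are not adjacent, and freeing it returns both to the pool immediately, which is what lets the scheduler pack many variable-length sequences without reserving worst-case space.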

4. Continuous batching

Static batching waits for the whole batch to finish — short requests block behind the longest one (head-of-line blocking), and a finished slot can't be reused mid-batch.
Continuous (in-flight) batching schedules at the token level: every decode step, admit queued requests and evict finished ones. A completed sequence frees its slot immediately. Keeps the GPU densely packed under length variance.
This is the single biggest throughput lever after the KV cache. Default in vLLM / TGI / TensorRT-LLM.
Chunked prefill interleaves prefill chunks with ongoing decode so a big prompt doesn't stall everyone's token generation — keeps TPOT stable under load.
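The head-of-line effect can be seen in a toy step-count simulation. It assumes one token per running request per step, ignores prefill, and treats every step as equally expensive regardless of batch occupancy; the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, slots):
    """Each batch of `slots` requests runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(output_lens), slots):
        steps += max(output_lens[i:i + slots])
    return steps


def continuous_batching_steps(output_lens, slots):
    """Every step: admit queued requests into free slots, decode, drop finished ones."""
    queue = deque(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Two long requests mixed with short ones, four slots.
lens = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lens, slots=4)
continuous_steps = continuous_batching_steps(lens, slots=4)
```

With this workload static batching needs 200 steps while continuous batching needs 105, because the second long request starts as soon as a short one frees its slot instead of waiting for the first batch to drain.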

5. Quantization (smaller, faster)

Method	What it quantizes	Notes
GPTQ	Weights → INT4/3 (post-training, layer-wise)	Calibration-based; widely supported.
AWQ	Weights → INT4 (protects salient channels)	Strong quality at 4-bit; popular default.
SmoothQuant / W8A8	Weights + activations → INT8	Uses INT8 Tensor Cores; needs activation handling.
FP8 (E4M3)	Weights + activations → 8-bit float	Hopper/Ada/Blackwell native; good balance.
KV-cache quant	Stored K/V → INT8/FP8	More concurrency / longer context per GB.
GGUF (llama.cpp)	Weights → various k-quants	CPU/edge/local serving.

Weight-only vs weight+activation: decode is bandwidth-bound, so weight-only 4-bit (AWQ/GPTQ) is the easiest big win — fewer weight bytes per token, minimal quality hit. W8A8/FP8 also speed compute-bound prefill by using low-precision Tensor Cores. Always validate on an eval set; quality loss is model- and task-dependent.
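To show what weight-only quantization does mechanically, here is plain symmetric round-to-nearest quantization of one weight group with a shared scale. This is deliberately the simplest scheme, not GPTQ or AWQ (which add calibration and salient-channel handling on top); the function names and sample weights are assumptions.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest: map floats to signed ints sharing one scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_group(quantized, scale):
    """Recover approximate float weights for use in the matmul."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_group(weights, bits=4)
restored = dequantize_group(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now stored in 4 bits plus a shared scale per group, and the worst-case rounding error is bounded by half the scale. That error is what the eval set has to confirm is tolerable for a given model and task.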

6. Faster decoding

Speculative decoding — a small draft model proposes K tokens; the target verifies them in one parallel forward and accepts the longest prefix consistent with its own distribution (modified rejection sampling). Same output distribution, lower latency. Variants: Medusa heads, EAGLE, n-gram/prompt lookup, self-speculation.
Prefix caching — reuse KV for shared prompt prefixes (system prompts, few-shot examples, RAG headers). Cuts prefill cost and TTFT for repeated heads. SGLang's RadixAttention generalizes this with a radix tree.
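Chunked prefill — split long prompts into chunks scheduled alongside ongoing decode, so one large prefill does not stall in-flight generations (keeps TPOT stable under load; see section 4).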
Multi-LoRA serving — many adapters on one base model in VRAM (S-LoRA), batching across adapters.
CUDA graphs — capture the decode step to remove per-launch CPU overhead at small batch.
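The "same output distribution" claim for speculative decoding can be checked numerically for a single draft position. The sketch below assumes the standard modified rejection sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p − q); the function names and the toy distributions are illustrative.

```python
def acceptance_prob(p_target, q_draft, token):
    """Probability of keeping a drafted token (q_draft[token] > 0 since it was sampled)."""
    return min(1.0, p_target[token] / q_draft[token])


def residual_distribution(p_target, q_draft):
    """Distribution to resample from after a rejection: normalize(max(0, p - q))."""
    raw = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else list(p_target)


def output_distribution(p_target, q_draft):
    """Exact marginal over the emitted token for one draft position."""
    # P(draft token t and accepted) = q(t) * min(1, p(t)/q(t)) = min(p(t), q(t)).
    accepted = [min(p, q) for p, q in zip(p_target, q_draft)]
    reject_mass = 1.0 - sum(accepted)
    residual = residual_distribution(p_target, q_draft)
    return [a + reject_mass * r for a, r in zip(accepted, residual)]


p = [0.5, 0.3, 0.2]   # target model distribution (toy)
q = [0.2, 0.6, 0.2]   # draft model distribution (toy)
emitted = output_distribution(p, q)
keep_token_1 = acceptance_prob(p, q, token=1)
```

Here the draft over-proposes token 1, so it is kept only half the time, and the rejected mass is redirected to token 0; the resulting marginal equals the target distribution `p`.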

7. Attention variants that change serving

Variant	Effect on serving
MHA (multi-head)	Full K/V per head → largest KV cache.
MQA (multi-query)	One K/V head shared by all query heads → smallest KV, fastest decode, small quality cost.
GQA (grouped-query)	K/V shared per group → middle ground; the modern default (Llama-2/3 70B, Mistral).
MLA (multi-head latent)	DeepSeek's low-rank KV compression → very small KV cache.
Sliding-window	Bounded attention window → bounded KV (Mistral). Trades long-range recall.
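A rough comparison of how much KV each variant stores per token, plus the sliding-window cap. The model shape (80 layers, 64 query heads, head_dim 128, FP16) is an assumed example and the function names are hypothetical; MLA is left out because its size depends on model-specific latent dimensions.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """K and V bytes stored for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def cached_tokens(seq_len, window=None):
    """Sliding-window attention keeps at most `window` tokens per sequence."""
    return seq_len if window is None else min(seq_len, window)


LAYERS, QUERY_HEADS, HEAD_DIM = 80, 64, 128  # assumed shape

per_token = {
    "MHA": kv_bytes_per_token(LAYERS, QUERY_HEADS, HEAD_DIM),  # one KV head per query head
    "GQA-8": kv_bytes_per_token(LAYERS, 8, HEAD_DIM),          # 8 shared KV heads
    "MQA": kv_bytes_per_token(LAYERS, 1, HEAD_DIM),            # a single KV head
}

# A 32k-token sequence with a 4096-token window only ever caches 4096 tokens.
windowed = cached_tokens(32_768, window=4096)
```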

8. Fitting big models across GPUs

Tensor parallel (TP) — split each layer's matmuls across GPUs, all-reduce activations per layer. Low latency, needs NVLink, intra-node. Best for latency-sensitive serving on one node.
Pipeline parallel (PP) — split layers into stages across GPUs/nodes; activations passed stage→stage. Crosses nodes well, adds pipeline latency, needs concurrency to fill stages.
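Expert parallel — place different MoE experts on different GPUs and route each token to the GPUs holding its selected experts (all-to-all communication).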
Pick TP for single-node latency; add PP to span nodes for very large models. Replicate the whole setup behind a load balancer for horizontal scale.
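The "split the matmul, then all-reduce" step of tensor parallelism can be illustrated in plain Python. This sketch splits the inner dimension across simulated shards and sums the partial results; real systems do this on GPUs with a collective all-reduce, and the function names here are hypothetical.

```python
def matmul(a, b):
    """Plain-Python product of a (m x k) and b (k x n), given as nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def row_parallel_matmul(x, w, num_shards):
    """Each shard holds a slice of W's rows; partial outputs are summed (all-reduce)."""
    shard = len(w) // num_shards  # assumes the inner dimension divides evenly
    partials = []
    for s in range(num_shards):
        lo, hi = s * shard, (s + 1) * shard
        x_part = [row[lo:hi] for row in x]   # matching slice of the activations
        partials.append(matmul(x_part, w[lo:hi]))
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


x = [[1, 2, 3, 4]]
w = [[1, 0], [0, 1], [2, 0], [0, 2]]
full = matmul(x, w)
sharded = row_parallel_matmul(x, w, num_shards=2)
```

Each shard stores only half of `w`, which is the memory win; the price is the sum across shards on every layer, which is why TP wants a fast interconnect.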

9. The serving stack (2025-era)

Tool	Use
vLLM	De-facto OSS server: PagedAttention, continuous batching, OpenAI-compatible API, broad model + quant support. Default.
TensorRT-LLM	NVIDIA, compiled engines, max performance on NVIDIA HW. More ops effort; pair with Triton Inference Server.
TGI	Hugging Face text-generation-inference; solid production server.
SGLang	Fast serving + structured/programmatic generation; RadixAttention prefix sharing.
llama.cpp / Ollama	Local / CPU / Apple Silicon / GGUF; great for dev + edge.
LMDeploy / DeepSpeed-MII	Other high-throughput servers worth benchmarking.

10. Tuning knobs & serving math

# vLLM — the levers you actually touch
--max-model-len 8192               # cap context = cap KV per request
--gpu-memory-utilization 0.90      # fraction of VRAM for weights + KV pool
--max-num-seqs 256                 # concurrency cap (running requests)
--max-num-batched-tokens 8192      # token budget per scheduler step
--tensor-parallel-size 2           # split model across 2 GPUs
--quantization awq                 # 4-bit weights
--kv-cache-dtype fp8               # quantize KV cache
--enable-prefix-caching            # reuse shared prefixes
--enable-chunked-prefill           # keep TPOT stable when long prompts arrive

VRAM budget ≈ weights + KV pool + activation overhead. Raising concurrency or context needs a bigger KV pool, so either shrink the weights (quantize), lower another term, or expect an OOM.
Bigger batch → more throughput but higher TPOT. Set max-num-seqs at the knee where p99 latency still meets the SLO.
Leave headroom: setting gpu-memory-utilization too high (e.g. 0.97) leaves little room for activation spikes and other allocations on the GPU → intermittent OOM.
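A worst-case capacity estimate following the budget above. It assumes every sequence uses its full context, an 80 GB GPU, and a flat 2 GB of activation and runtime overhead; those numbers and the function name are illustrative, and paged allocation typically fits more sequences than this bound when requests are shorter than the cap.

```python
def max_concurrent_sequences(vram_bytes, gpu_memory_utilization, weight_bytes,
                             overhead_bytes, kv_bytes_per_seq):
    """How many full-length sequences fit in the KV pool left after weights and overhead."""
    kv_pool = vram_bytes * gpu_memory_utilization - weight_bytes - overhead_bytes
    return max(0, int(kv_pool // kv_bytes_per_seq))


VRAM = 80e9                       # assumed 80 GB GPU
OVERHEAD = 2e9                    # assumed activations + runtime overhead
KV_PER_SEQ_MHA = 2_147_483_648    # 32 layers, 32 KV heads, head_dim 128, 4096 ctx, FP16
KV_PER_SEQ_GQA = 536_870_912      # same shape with 8 KV heads

fp16_mha = max_concurrent_sequences(VRAM, 0.90, 14e9, OVERHEAD, KV_PER_SEQ_MHA)
int4_mha = max_concurrent_sequences(VRAM, 0.90, 3.5e9, OVERHEAD, KV_PER_SEQ_MHA)
fp16_gqa = max_concurrent_sequences(VRAM, 0.90, 14e9, OVERHEAD, KV_PER_SEQ_GQA)
```

Under these assumptions, quantizing the weights raises the bound from 26 to 30 sequences, while cutting KV heads from 32 to 8 raises it to 104, which illustrates why the KV cache is usually the term worth attacking first.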

11. Sampling & output control

temperature — randomness; 0 = greedy/deterministic (use for eval, code, extraction).
top-p (nucleus) / top-k — truncate the tail of the distribution: top-p keeps the smallest set of tokens whose cumulative probability reaches p; top-k keeps the k most likely tokens.
repetition / frequency / presence penalty — discourage loops and over-repetition.
stop sequences / EOS — must be configured correctly or generation runs to max_tokens every time (cost + latency + KV blowup).
structured output — JSON-schema / grammar-constrained decoding (Outlines, XGrammar) restricts sampling to tokens the grammar allows, so the output parses as long as generation is not cut off by max_tokens.
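A minimal sketch of temperature scaling followed by top-k and top-p truncation, in plain Python. The function names are hypothetical, the order (top-k, then top-p) is one common convention rather than a universal rule, and temperature must be greater than 0 here (servers usually special-case 0 as greedy argmax).

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return renormalized {token_id: prob} after top-k, then top-p truncation."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for token_id, prob in ranked:
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break
    total = sum(prob for _, prob in kept)
    return {token_id: prob / total for token_id, prob in kept}


probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_top_p_filter(probs, top_p=0.75)   # keeps tokens 0 and 1
top3 = top_k_top_p_filter(probs, top_k=3)          # keeps tokens 0, 1, 2

sharp = softmax([2.0, 1.0, 0.0], temperature=0.5)
flat = softmax([2.0, 1.0, 0.0], temperature=1.0)
```

Sampling then draws from the filtered dictionary, for example with `random.choices(list(filtered), weights=list(filtered.values()))`.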

Wrong chat template = garbage output: each instruct model has an exact prompt format (special tokens, role markers). Use the model's own chat template / tokenizer; a mismatched template produces rambling output or ignored instructions and is the most common "the model is broken" bug. The same goes for missing stop tokens, which cause runaway generations.

12. Quick reference

Latency = TTFT + tokens × TPOT ; throughput and latency trade off via batch size
Prefill = compute-bound (prompt) → chunked prefill + prefix cache
Decode  = bandwidth-bound (per token) → 4-bit weights, KV quant, speculate
KV cache grows with context × concurrency → the usual OOM ; GQA/MQA shrink it
Levers: continuous batching, PagedAttention, AWQ/GPTQ, prefix cache,
        speculative decoding, TP/PP, chunked prefill
Knobs: max-model-len, max-num-seqs, gpu-memory-utilization, kv-cache-dtype
Metrics: TTFT, TPOT, throughput, GOODPUT, p99 ; serve with vLLM/TRT-LLM/SGLang

13. Interview Q&A

Why is decode memory-bound but prefill compute-bound? Prefill processes many prompt tokens in parallel: a big GEMM (compute). Decode does one token at a time, re-reading all weights + KV per step with little math → bandwidth-bound.
What does the KV cache cost and why does it matter? 2 × layers × kv_heads × head_dim × seq_len × batch × dtype_bytes. Grows linearly with context × concurrency; at scale it dominates VRAM and is what OOMs you, not the weights. Drives PagedAttention, KV quant, GQA.
What is continuous batching? Token-level scheduling: admit new and evict finished requests every step instead of waiting for a static batch. Removes head-of-line blocking; biggest throughput win.
How does speculative decoding stay correct? A draft model proposes K tokens; the target verifies in one pass and accepts via modified rejection sampling calibrated to its own distribution. Output is distributed exactly as the target alone: same quality, lower latency.
Which quantization for latency-sensitive decode? Weight-only 4-bit (AWQ/GPTQ): decode is bandwidth-bound, so fewer weight bytes per token cuts TPOT with minimal quality loss. Add KV-cache quant for long context.
How do GQA/MQA change serving? Fewer KV heads → smaller KV cache → more concurrency, longer context, faster decode (less KV to read), small quality cost. This is why modern serving models use GQA.
TTFT is fine, throughput low: what do you check? Batching strategy (static vs continuous), max-num-seqs, KV pool limiting concurrency, queueing. Raise concurrency to the knee where p99 still meets SLO.
Server OOMs only under load: why? KV cache scales with context × concurrency. Lower max-model-len/max-num-seqs, quantize KV, cap output tokens, back off gpu-memory-utilization.
Tensor vs pipeline parallel for inference? TP splits each layer (low latency, NVLink, single node); PP splits layers into stages (crosses nodes, pipeline latency, needs concurrency). TP for latency, +PP to span nodes.
How do you serve many fine-tunes cheaply? Multi-LoRA: one base model in VRAM + small swappable adapters (S-LoRA), batched across adapters. Avoids a full model per variant.
What is goodput and why prefer it? Throughput of requests meeting their SLO. Raw throughput can be high while many requests violate latency targets; goodput reflects real user-facing capacity.
Same prompt, different answer at temp 0: is it a bug? Not necessarily. Greedy removes sampling randomness, but batching changes floating-point reduction order, so tiny numerical differences are expected. Genuine divergence points to nondeterministic kernels or state.
