Debugging LLM Inference

Serving complaints almost always fall into one of five buckets: slow first token (TTFT, time to first token), slow per-token generation (TPOT, time per output token), low throughput under load, OOM (usually the KV cache), or bad output (low quality or garbled). Each has a different cause, so measure TTFT and TPOT separately before changing anything.

Triage: which number is bad?

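# measure the split, not just total latency
# time_starttransfer = time to first byte, a rough proxy for TTFT on a streamed response
curl -N -s -H 'Content-Type: application/json' -w '\nfirst byte: %{time_starttransfer}s  total: %{time_total}s\n' .../v1/chat/completions -d '{...,"stream":true}'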
# vLLM exposes Prometheus metrics
curl -s localhost:8000/metrics | grep -E 'time_to_first_token|time_per_output_token|num_requests|gpu_cache_usage'

High TTFT, fine TPOT → prefill / queue / long prompt problem.
Fine TTFT, high TPOT → decode bottleneck (model size, batch, bandwidth).
Both fine single-user, bad under load → batching / concurrency / KV pool.
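
The triage rules above can be sketched as a small helper that turns token arrival timestamps into a TTFT/TPOT pair and picks a bucket. This is a minimal sketch, not a benchmark tool: `split_latency`, `triage`, and the SLO thresholds are hypothetical names and values chosen for illustration, and the timestamps are assumed to come from your own streaming client.

```python
from typing import Sequence


def split_latency(sent_at: float, token_times: Sequence[float]) -> tuple[float, float]:
    """Return (ttft, tpot) in seconds.

    sent_at     -- timestamp when the request was sent
    token_times -- arrival timestamp of each streamed token, in order
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - sent_at
    if len(token_times) == 1:
        return ttft, 0.0  # a single token has no inter-token gap to average
    # Mean gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def triage(ttft: float, tpot: float, ttft_slo: float, tpot_slo: float) -> str:
    """Map a (ttft, tpot) pair onto the triage buckets. SLO values are hypothetical."""
    slow_first = ttft > ttft_slo
    slow_decode = tpot > tpot_slo
    if slow_first and slow_decode:
        return "both: check load, batching, KV pool"
    if slow_first:
        return "prefill / queue / long prompt"
    if slow_decode:
        return "decode bottleneck"
    return "within SLO"


# Example: first token after 0.8 s, then one token every ~20 ms.
ttft, tpot = split_latency(0.0, [0.80, 0.82, 0.84, 0.86])
print(triage(ttft, tpot, ttft_slo=0.5, tpot_slo=0.05))
```

Run the same measurement single-user and under load; if both numbers are fine alone and only degrade with concurrency, the third bucket (batching / concurrency / KV pool) applies.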

Slow first token (TTFT)

Long prompts → prefill is compute-heavy. Enable chunked prefill and prefix caching for shared system prompts / RAG context.
Queue wait → requests piling up before scheduling; check queue depth / num_requests_waiting. Add capacity or cap concurrency.
Cold start → first request after load compiles/warms kernels (CUDA graphs, Triton autotune). Warm up on boot.

Slow per token (TPOT)

Decode is bandwidth-bound → quantize weights (AWQ/GPTQ 4-bit) to cut per-token reads.
Model too big for the GPU count → add tensor parallelism (keep it within a node, over NVLink) so each GPU reads only its shard of the weights per token.
Batch too large pushing TPOT up → there's a throughput/latency knee; lower max-num-seqs to meet the SLO.
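Try speculative decoding (draft model) to cut latency without quality loss. The gain depends on how often the target model accepts the draft's tokens, so measure it on your own traffic.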
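
To see why weight quantization helps a bandwidth-bound decode, a back-of-the-envelope floor is useful: at batch size 1, every weight is read roughly once per generated token, so TPOT cannot be lower than weight bytes divided by memory bandwidth. The sketch below assumes exactly that simplified model; `decode_floor_ms` is a hypothetical helper, and the 7B / 2000 GB/s figures are illustrative, not measurements.

```python
def decode_floor_ms(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode time at batch size 1, in milliseconds.

    Assumes every weight is read once per token and nothing else costs time
    (no KV reads, no kernel launch overhead), so real TPOT will be higher.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    return weight_gb / bandwidth_gb_s * 1000.0


# Illustrative numbers only: a 7B-parameter model, 2000 GB/s of memory bandwidth.
fp16 = decode_floor_ms(7, 2.0, 2000)  # 16-bit weights
int4 = decode_floor_ms(7, 0.5, 2000)  # 4-bit weights (ignores scales and zero-points)
print(f"fp16 floor: {fp16:.2f} ms/token, 4-bit floor: {int4:.2f} ms/token")
```

If measured TPOT is far above this floor at low batch size, the bottleneck is probably somewhere other than weight reads.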

Low throughput under concurrency

Static batching → switch to a server with continuous batching (vLLM/TGI/TRT-LLM). Biggest single win.
KV pool too small → can't hold enough concurrent sequences; raise gpu-memory-utilization or shrink max-model-len.
Head-of-line blocking from a few huge requests → enforce max output tokens; isolate long jobs.
Confirm the GPU is the bottleneck (high util) vs the API/tokenizer/network around it.

Throughput collapses as context grows: KV cache size scales with context length × concurrency. Long prompts shrink how many requests fit in the KV pool, so effective concurrency (and throughput) drops even though the GPU isn't "full". Cap max-model-len to what you actually need.
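
The context × concurrency trade-off can be made concrete with a little arithmetic. This is a rough sketch under stated assumptions: the model shape (32 layers, 32 KV heads, head dimension 128, fp16 cache) and the 20 GiB pool are illustrative values, the helper names are hypothetical, and the count is a worst case where every sequence grows to the full context length.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(pool_bytes: int, context_len: int, per_token: int) -> int:
    """How many full-length sequences fit in the KV pool (worst case, no prefix sharing)."""
    return pool_bytes // (context_len * per_token)


GIB = 1024 ** 3
per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)  # fp16 cache
pool = 20 * GIB  # illustrative KV pool size

for ctx in (2048, 4096, 16384):
    print(ctx, max_concurrent_sequences(pool, ctx, per_token))
```

With these numbers each token costs 0.5 MiB of cache, so the pool holds 20 sequences at 2048 tokens, 10 at 4096, and only 2 at 16384. Models with grouped-query attention have fewer KV heads and fit proportionally more.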

OOM during serving

Symptom. Crash on a long prompt or at high concurrency; CUDA out of memory after running fine at low load.

It's usually the KV cache, not the weights: weights are a fixed cost, while KV grows with context length and concurrency. Lower max-model-len, lower max-num-seqs, enable KV-cache quantization.
gpu-memory-utilization too high leaves no room for activation spikes → back it off (e.g. 0.85–0.90).
Set a hard max output tokens so a runaway generation can't grow KV unbounded.
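
The effect of backing off gpu-memory-utilization can be sketched with a simplified memory budget: the server may use a fraction of the GPU, weights and peak activations come out of that fraction, and the KV pool gets the rest. Everything here is an assumption for illustration (the function names, the 80 GiB GPU, the 14 GiB of weights, the 4 GiB activation peak); real servers profile these values at startup.

```python
def kv_pool_gib(gpu_gib: float, utilization: float, weights_gib: float, activation_gib: float) -> float:
    """KV pool left inside the server's memory budget, in GiB (simplified model)."""
    budget = gpu_gib * utilization
    pool = budget - weights_gib - activation_gib
    if pool <= 0:
        raise ValueError("weights and activations alone exceed the budget")
    return pool


def headroom_gib(gpu_gib: float, utilization: float) -> float:
    """Memory left outside the budget for spikes, fragmentation, and other processes."""
    return gpu_gib * (1.0 - utilization)


# Illustrative numbers only: 80 GiB GPU, 14 GiB weights, 4 GiB peak activations.
for util in (0.95, 0.88):
    pool = kv_pool_gib(80, util, weights_gib=14, activation_gib=4)
    spare = headroom_gib(80, util)
    print(f"util={util:.2f}  KV pool={pool:.1f} GiB  headroom={spare:.1f} GiB")
```

In this example, lowering utilization from 0.95 to 0.88 gives up about 5.6 GiB of KV pool in exchange for the same amount of extra headroom against spikes, which is the trade being made when you back the setting off.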

Bad / garbled output

Wrong chat template → garbage or ignored instructions. Match the model's exact prompt format / special tokens. Most "model is broken" reports are template bugs.
Tokenizer mismatch → use the tokenizer that ships with the checkpoint.
Quantization too aggressive → quality drop; try a higher-bit scheme or different method (AWQ vs GPTQ), compare on an eval set.
Repetition / rambling → check sampling params (temperature, top-p, repetition/frequency penalty, stop tokens). A missing stop token = never-ending output.
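Truncation → prompt + max_tokens exceeds the context window. Depending on the server, the request is either rejected or the start of the prompt is silently dropped (often taking the system prompt with it). Budget tokens.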
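
Token budgeting for the truncation case can be done client-side before the request is sent. A minimal sketch, assuming you already know the prompt's token count from the checkpoint's own tokenizer; `clamp_max_tokens` is a hypothetical helper, not part of any server API.

```python
def clamp_max_tokens(prompt_tokens: int, requested_max_tokens: int, context_window: int) -> int:
    """Return the largest max_tokens that keeps prompt + output inside the context window."""
    available = context_window - prompt_tokens
    if available <= 0:
        raise ValueError(
            f"prompt ({prompt_tokens} tokens) already fills the {context_window}-token window"
        )
    return min(requested_max_tokens, available)


# A 3000-token prompt in a 4096-token window leaves room for 1096 output tokens.
print(clamp_max_tokens(prompt_tokens=3000, requested_max_tokens=2048, context_window=4096))
```

Failing loudly when the prompt alone exceeds the window is a deliberate choice here: it surfaces the problem at the caller instead of letting the server drop context.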

Wrong stop token = runaway cost: if the stop sequence / EOS isn't configured, generation runs to max_tokens every time, and latency, cost, and KV usage all balloon. Verify stop tokens against the model card.

"Same prompt, different answer"

Sampling is stochastic → set temperature=0 (greedy) for reproducibility tests.
Even at temp 0, batching can shift results — floating-point reduction order depends on batch composition. Small numerical differences are expected, not a bug.
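
The reduction-order effect is easy to reproduce without a GPU, because floating-point addition is not associative. The snippet below is only an illustration in plain Python floats; on a GPU the same thing happens inside matmul and attention reductions when batch composition changes how the work is split.

```python
# The same three numbers, summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)      # False
print(abs(left - right))  # tiny, on the order of 1e-16
```

A difference this small usually leaves the argmax token unchanged, but when two candidate tokens have nearly equal logits it can flip the choice, and every later token then diverges.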

Quick reference

curl -s localhost:8000/metrics | grep -E 'first_token|per_output_token|cache_usage|waiting'
# TTFT bad → chunked prefill + prefix cache + warmup + capacity
# TPOT bad → 4-bit weights + tensor-parallel + speculative + smaller batch
# throughput → continuous batching + bigger KV pool, cap output tokens
# OOM → lower max-model-len / max-num-seqs, KV quant, gpu-mem-util 0.85
# bad output → chat template + tokenizer + sampling params + stop tokens
nvidia-smi dmon -s u            # is the GPU actually the bottleneck?
