DEBUG GUIDE · GENAI · SERVING PLAYBOOK

Debugging LLM Inference & Serving.

Serving complaints almost always fall into one of five buckets: slow first token (TTFT, time to first token), slow per-token generation (TPOT, time per output token), low throughput under load, OOM (usually the KV cache), or bad output (low quality or garbled text). Each has a different cause, so measure TTFT and TPOT separately before changing anything.

Triage: which number is bad?

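# measure the split, not just total latency
# time_starttransfer = first byte received, a rough TTFT proxy for a streamed response
curl -N -s -o /dev/null -w 'first byte: %{time_starttransfer}s  total: %{time_total}s\n' .../v1/chat/completions -d '{...,"stream":true}'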
# vLLM exposes Prometheus metrics
curl -s localhost:8000/metrics | grep -E 'time_to_first_token|time_per_output_token|num_requests|gpu_cache_usage'
  • High TTFT, fine TPOT → prefill / queue / long prompt problem.
  • Fine TTFT, high TPOT → decode bottleneck (model size, batch, bandwidth).
  • Both fine single-user, bad under load → batching / concurrency / KV pool.
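If the client records a timestamp for each streamed token, the split above can be computed directly. The following is a minimal sketch, not part of any server API: `split_latency`, `classify`, and the SLO thresholds are hypothetical names and values chosen for illustration.

```python
from typing import Sequence, Tuple


def split_latency(request_start: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (ttft, tpot) in seconds from the request start and token arrival times."""
    if not token_times:
        raise ValueError("no tokens received")
    # TTFT: wait until the first token arrives (queue + prefill).
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT: average gap between tokens after the first one (decode).
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def classify(ttft: float, tpot: float, ttft_slo: float, tpot_slo: float) -> str:
    """Map the two numbers onto the triage buckets (thresholds are assumptions)."""
    if ttft > ttft_slo and tpot <= tpot_slo:
        return "high TTFT: prefill / queue / long prompt"
    if ttft <= ttft_slo and tpot > tpot_slo:
        return "high TPOT: decode bottleneck"
    if ttft > ttft_slo and tpot > tpot_slo:
        return "both high: check batching / concurrency / KV pool"
    return "within SLO"


if __name__ == "__main__":
    ttft, tpot = split_latency(0.0, [0.8, 0.82, 0.84, 0.86])
    print(classify(ttft, tpot, ttft_slo=0.5, tpot_slo=0.05))
```

Run the same measurement single-user and under load; the third triage case only shows up in the second run.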

Slow first token (TTFT)

  • Long prompts → prefill is compute-heavy. Enable chunked prefill and prefix caching for shared system prompts / RAG context.
  • Queue wait → requests piling up before scheduling; check queue depth / num_requests_waiting. Add capacity or cap concurrency.
  • Cold start → first request after load compiles/warms kernels (CUDA graphs, Triton autotune). Warm up on boot.
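To judge whether prefix caching is worth enabling, it can help to measure how many leading tokens your prompts actually share. This is a rough sketch that works on already-tokenized prompts (lists of token ids); the function names are hypothetical, and the block size of 16 is an assumption about how the cache is partitioned, so check your server's setting.

```python
from typing import Sequence


def shared_prefix_len(prompts: Sequence[Sequence[int]]) -> int:
    """Number of leading token ids that are identical across all prompts."""
    if not prompts:
        return 0
    n = 0
    # zip stops at the shortest prompt, so the prefix can't overrun any of them.
    for column in zip(*prompts):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def cacheable_tokens(prompts: Sequence[Sequence[int]], block_size: int = 16) -> int:
    """Shared prefix rounded down to whole cache blocks (block_size is an assumption)."""
    return (shared_prefix_len(prompts) // block_size) * block_size


if __name__ == "__main__":
    system_prompt = list(range(40))  # stand-in for a shared system prompt
    prompts = [system_prompt + [100, 101], system_prompt + [200], system_prompt + [300, 301, 302]]
    print(shared_prefix_len(prompts), cacheable_tokens(prompts))  # 40 32
```

If the shared prefix is a large fraction of the typical prompt, most of the prefill work for repeat requests can be skipped; if it is near zero, prefix caching will not help TTFT.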

Slow per token (TPOT)

  • Decode is bandwidth-bound → quantize weights (AWQ/GPTQ 4-bit) to cut per-token reads.
  • Model too big for the GPU count → add tensor parallelism (within node, NVLink).
  • Batch too large pushing TPOT up → there's a throughput/latency knee; lower max-num-seqs to meet the SLO.
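  • Try speculative decoding (draft model): it cuts latency without changing the output distribution when the target model verifies the drafts, but the gain depends on the draft acceptance rate and shrinks at large batch sizes.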
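The bandwidth-bound claim can be checked with back-of-envelope arithmetic: at small batch sizes every decoded token has to read all the weights once, so weight bytes divided by memory bandwidth gives a floor on TPOT. The sketch below assumes a 7B-parameter model and 2 TB/s of memory bandwidth purely as illustrative numbers, and it ignores KV-cache reads, quantization scale overhead, and kernel efficiency.

```python
def min_tpot_seconds(n_params: float, bits_per_weight: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per output token when decode is limited by weight reads."""
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    n_params = 7e9      # assumed model size
    bandwidth = 2e12    # assumed memory bandwidth in bytes/s
    for bits in (16, 4):
        tpot = min_tpot_seconds(n_params, bits, bandwidth)
        print(f"{bits}-bit weights: >= {tpot * 1e3:.2f} ms/token")
    # 16-bit weights: >= 7.00 ms/token
    # 4-bit weights: >= 1.75 ms/token
```

If measured TPOT is already close to this floor, quantization or more aggregate bandwidth (tensor parallelism) are the remaining levers; if it is far above, look at batch size and scheduling first.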

Low throughput under concurrency

  • Static batching → switch to a server with continuous batching (vLLM/TGI/TRT-LLM). Biggest single win.
  • KV pool too small → can't hold enough concurrent sequences; raise gpu-memory-utilization or shrink max-model-len.
  • Head-of-line blocking from a few huge requests → enforce max output tokens; isolate long jobs.
  • Confirm the GPU is the bottleneck (high util) vs the API/tokenizer/network around it.
Throughput collapses as context grows. KV cache scales with context × concurrency. Long prompts shrink how many requests fit in the KV pool, so effective concurrency (and throughput) drops even though the GPU isn't "full". Cap max-model-len to what you actually need.
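The context × concurrency trade-off is easy to put numbers on. The sketch below uses the standard per-token KV size (keys plus values, for every layer and KV head); the model dimensions (32 layers, 32 KV heads, head dim 128, fp16) and the 40 GiB pool are assumptions for illustration. It models the worst case where every sequence fills its whole context.

```python
GIB = 2**30


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_concurrent_seqs(pool_bytes: int, context_len: int, per_token_bytes: int) -> int:
    """How many full-length sequences fit in the KV pool at once (worst case)."""
    return pool_bytes // (context_len * per_token_bytes)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)  # assumed dims
    pool = 40 * GIB                                                           # assumed KV pool
    print(per_token)                                    # 524288 (0.5 MiB per token)
    print(max_concurrent_seqs(pool, 4096, per_token))   # 20
    print(max_concurrent_seqs(pool, 32768, per_token))  # 2
```

Going from a 4k to a 32k context cuts worst-case concurrency from 20 to 2 on the same pool, which is the collapse described above. Models with grouped-query attention have fewer KV heads and fit proportionally more.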

OOM during serving

Symptom. Crash on a long prompt or at high concurrency; CUDA out of memory after running fine at low load.

  • It's usually the KV cache, not the weights: weights are a fixed cost paid at load, while KV grows with prompt length and concurrency. Lower max-model-len, lower max-num-seqs, enable KV-cache quantization.
  • gpu-memory-utilization too high leaves no room for activation spikes → back it off (e.g. 0.85–0.90).
  • Set a hard max output tokens so a runaway generation can't grow KV unbounded.
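The gpu-memory-utilization trade-off can be sketched as a simple budget: the server claims a fraction of the GPU, weights and peak activations come out of that, and the rest is the KV pool. Everything outside the fraction is headroom for spikes. The numbers below (80 GiB GPU, 14 GiB of weights, 4 GiB activation peak) are assumptions for illustration, and the function names are hypothetical.

```python
GIB = 2**30


def kv_pool_bytes(total_bytes: int, gpu_memory_utilization: float,
                  weight_bytes: int, activation_peak_bytes: int) -> int:
    """KV pool left inside the server's memory budget after weights and activations."""
    budget = total_bytes * gpu_memory_utilization
    return round(budget - weight_bytes - activation_peak_bytes)


def headroom_bytes(total_bytes: int, gpu_memory_utilization: float) -> int:
    """Memory left outside the budget to absorb unexpected spikes."""
    return round(total_bytes * (1 - gpu_memory_utilization))


if __name__ == "__main__":
    total, weights, activations = 80 * GIB, 14 * GIB, 4 * GIB  # assumed sizes
    for util in (0.95, 0.85):
        pool = kv_pool_bytes(total, util, weights, activations) / GIB
        spare = headroom_bytes(total, util) / GIB
        print(f"util={util}: KV pool {pool:.1f} GiB, headroom {spare:.1f} GiB")
    # util=0.95: KV pool 58.0 GiB, headroom 4.0 GiB
    # util=0.85: KV pool 50.0 GiB, headroom 12.0 GiB
```

Backing off utilization trades KV pool (and therefore concurrency) for headroom, so pair it with a lower max-num-seqs or max-model-len rather than treating it as free.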

Bad / garbled output

  • Wrong chat template → garbage or ignored instructions. Match the model's exact prompt format / special tokens. Most "model is broken" reports are template bugs.
  • Tokenizer mismatch → use the tokenizer that ships with the checkpoint.
  • Quantization too aggressive → quality drop; try a higher-bit scheme or different method (AWQ vs GPTQ), compare on an eval set.
  • Repetition / rambling → check sampling params (temperature, top-p, repetition/frequency penalty, stop tokens). A missing stop token = never-ending output.
  • Truncation → prompt + max_tokens exceeds the context window; depending on the stack, the request is rejected or the start of the prompt is silently dropped. Budget tokens.
Wrong stop token = runaway cost. If the stop sequence / EOS isn't configured, generation runs to max_tokens every time, so latency, cost, and KV usage all balloon. Verify stop tokens against the model card.
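Both the truncation and runaway-generation problems come down to budgeting tokens before the request is sent. A minimal client-side sketch, assuming you already know the prompt's token count and the model's context window (`clamp_max_tokens` is a hypothetical helper, not a server option):

```python
def clamp_max_tokens(prompt_tokens: int, requested_max_tokens: int, context_window: int) -> int:
    """Largest max_tokens that keeps prompt + output inside the context window."""
    room = context_window - prompt_tokens
    if room <= 0:
        # Fail loudly instead of letting the prompt be truncated downstream.
        raise ValueError(
            f"prompt ({prompt_tokens} tokens) leaves no room in a {context_window}-token window"
        )
    return min(requested_max_tokens, room)


if __name__ == "__main__":
    print(clamp_max_tokens(prompt_tokens=3000, requested_max_tokens=2000, context_window=4096))  # 1096
    print(clamp_max_tokens(prompt_tokens=100, requested_max_tokens=256, context_window=4096))    # 256
```

The clamp bounds the worst case when a stop token is missing; it does not replace configuring the correct stop / EOS tokens.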

"Same prompt, different answer"

  • Sampling is stochastic → set temperature=0 (greedy) for reproducibility tests.
  • Even at temp 0, batching can shift results — floating-point reduction order depends on batch composition. Small numerical differences are expected, not a bug.
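The reduction-order effect is plain floating-point behavior and can be reproduced without a GPU. This small sketch sums the same three numbers in two orders; it stands in for a kernel choosing a different accumulation order when the batch changes.

```python
import math
from typing import Iterable


def sum_in_order(values: Iterable[float]) -> float:
    """Accumulate left to right, the way a naive reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    values = [0.1, 0.2, 0.3]
    forward = sum_in_order(values)
    backward = sum_in_order(reversed(values))
    print(forward)                         # 0.6000000000000001
    print(backward)                        # 0.6
    print(forward == backward)             # False
    print(math.isclose(forward, backward)) # True
```

The two results differ only in the last bit, but when two candidate tokens have nearly equal logits, a difference that small can be enough to flip the greedy choice, after which the outputs diverge.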

Quick reference

curl -s localhost:8000/metrics | grep -E 'first_token|per_output_token|cache_usage|waiting'
# TTFT bad → chunked prefill + prefix cache + warmup + capacity
# TPOT bad → 4-bit weights + tensor-parallel + speculative + smaller batch
# throughput → continuous batching + bigger KV pool, cap output tokens
# OOM → lower max-model-len / max-num-seqs, KV quant, gpu-mem-util 0.85
# bad output → chat template + tokenizer + sampling params + stop tokens
nvidia-smi dmon -s u            # is the GPU actually the bottleneck?
© cvam — written in plaintext, served warm