Serving complaints are almost always one of: slow first token (TTFT), slow per token (TPOT), low throughput under load, OOM (usually KV cache), or bad output (quality/garbled). Each has a different cause. Measure TTFT and TPOT separately first.
Triage: which number is bad?
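# measure the split, not just total latency
# time_starttransfer (time to first byte) is a rough TTFT proxy; it undercounts if the server flushes headers before the first token
curl -N -s -w '\nfirst byte: %{time_starttransfer}s  total: %{time_total}s\n' .../v1/chat/completions -H 'Content-Type: application/json' -d '{...,"stream":true}'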
# vLLM exposes Prometheus metrics
curl -s localhost:8000/metrics | grep -E 'time_to_first_token|time_per_output_token|num_requests|gpu_cache_usage'
- High TTFT, fine TPOT → prefill / queue / long prompt problem.
- Fine TTFT, high TPOT → decode bottleneck (model size, batch, bandwidth).
- Both fine single-user, bad under load → batching / concurrency / KV pool.
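The same split can be computed client-side from token arrival times. A minimal sketch, assuming you already record a monotonic timestamp when the request is sent and one per streamed token; `split_latency` is a hypothetical helper, not part of any server or client library.

```python
from typing import Sequence, Tuple


def split_latency(request_start: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, mean TPOT) in the same unit as the timestamps.

    request_start: time the request was sent.
    token_times:   arrival time of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0  # TPOT is undefined for a single token; report 0
    # Mean gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Made-up timestamps in seconds: request sent at 0.0, four tokens streamed.
ttft, tpot = split_latency(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT={ttft:.2f}s TPOT={tpot:.2f}s")  # TTFT=0.50s TPOT=0.25s
```

Streamed chunks can carry more than one token, so treat the second number as time per chunk unless you count tokens explicitly.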
Slow first token (TTFT)
- Long prompts → prefill is compute-heavy. Enable chunked prefill and prefix caching for shared system prompts / RAG context.
- Queue wait → requests piling up before scheduling; check queue depth / num_requests_waiting. Add capacity or cap concurrency.
- Cold start → first request after load compiles/warms kernels (CUDA graphs, Triton autotune). Warm up on boot.
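One way to warm up on boot is to send a few throwaway requests before the server takes real traffic. The sketch below is an illustration only: the URL, the model name `my-model`, the OpenAI-style `/v1/completions` request body, and the prompt lengths are all assumptions to adapt to your deployment.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, List


def http_send(prompt: str,
              url: str = "http://localhost:8000/v1/completions",  # assumed endpoint
              model: str = "my-model") -> None:                   # placeholder name
    """POST one tiny completion request and discard the response."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 8}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()


def warm_up(send: Callable[[str], None],
            prompt_words: Iterable[int] = (8, 512, 4096)) -> List[float]:
    """Send one dummy prompt per length; return how long each call took (seconds)."""
    durations = []
    for n in prompt_words:
        start = time.perf_counter()
        send("hello " * n)  # dummy text; only the length matters here
        durations.append(time.perf_counter() - start)
    return durations


# On boot, before the instance is marked ready:
# print(warm_up(http_send))
```

If the first duration is much larger than a repeat run of the same call, that gap is roughly the cold-start cost real users would otherwise pay.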
Slow per token (TPOT)
- Decode is bandwidth-bound → quantize weights (AWQ/GPTQ 4-bit) to cut per-token reads.
- Model too big for the GPU count → add tensor parallelism (within node, NVLink).
- Batch too large pushing TPOT up → there's a throughput/latency knee; lower max-num-seqs to meet the SLO.
- Try speculative decoding (draft model) to cut latency without quality loss.
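Why 4-bit weights help a bandwidth-bound decode can be seen with back-of-envelope arithmetic. This is a rough sketch, assuming batch size 1 and that every weight is read from GPU memory once per generated token. The model size and bandwidth figures are made up for illustration.

```python
def min_tpot_ms(n_params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Rough lower bound on per-token decode time at batch size 1, in milliseconds.

    Assumes all weights are read once per token; ignores KV-cache reads,
    kernel launch overhead, and dequantization cost.
    """
    weight_bytes = n_params * bits_per_weight / 8
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical 8B-parameter model on a GPU with 1000 GB/s of memory bandwidth.
fp16 = min_tpot_ms(8e9, 16, 1000)
int4 = min_tpot_ms(8e9, 4, 1000)
print(f"16-bit: {fp16:.1f} ms/token, 4-bit: {int4:.1f} ms/token")
```

Measured TPOT will sit above this floor. If it is far above it, look at batch size and overheads before blaming bandwidth.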
Low throughput under concurrency
- Static batching → switch to a server with continuous batching (vLLM/TGI/TRT-LLM). Biggest single win.
- KV pool too small → can't hold enough concurrent sequences; raise gpu-memory-utilization or shrink max-model-len.
- Head-of-line blocking from a few huge requests → enforce max output tokens; isolate long jobs.
- Confirm the GPU is the bottleneck (high util) vs the API/tokenizer/network around it.
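To tell a queueing problem from a full KV pool, it can help to read the two gauges together. A minimal sketch that parses the text returned by the metrics endpoint; the helper names and the 0.95 threshold are assumptions, and the sample text is made up.

```python
from typing import Dict, Tuple


def read_gauges(metrics_text: str, wanted: Tuple[str, ...]) -> Dict[str, float]:
    """Sum the samples whose metric name contains each wanted substring.

    Meant for gauges on a single-model server. Assumes lines look like
    `name{labels} value` with no trailing timestamp.
    """
    totals: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        for key in wanted:
            if key in name:
                totals[key] = totals.get(key, 0.0) + float(value)
    return totals


def backlog_hint(gauges: Dict[str, float], cache_full_at: float = 0.95) -> str:
    """Very rough reading of the gauges; the threshold is illustrative."""
    waiting = gauges.get("num_requests_waiting", 0.0)
    cache = gauges.get("cache_usage", 0.0)
    if waiting > 0 and cache >= cache_full_at:
        return "requests wait while the KV pool is nearly full"
    if waiting > 0:
        return "requests wait with KV headroom left"
    return "no backlog"


# Made-up sample in Prometheus text format.
sample = """# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="m"} 12.0
vllm:gpu_cache_usage_perc{model_name="m"} 0.98
"""
print(backlog_hint(read_gauges(sample, ("num_requests_waiting", "cache_usage"))))
```

The first outcome points at the KV pool bullet above; the second suggests a concurrency cap or something outside the GPU.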
throughput collapses as context grows
KV cache scales with context × concurrency. Long prompts shrink how many requests fit in the KV
pool, so effective concurrency (and throughput) drops even though the GPU isn't "full". Cap
max-model-len to what you actually need.
OOM during serving
Symptom. Crash on a long prompt or at high concurrency; CUDA out of memory
after running fine at low load.
- It's the KV cache, not the weights. Lower max-model-len, lower max-num-seqs, enable KV-cache quantization.
- gpu-memory-utilization too high leaves no room for activation spikes → back it off (e.g. 0.85–0.90).
- Set a hard max output tokens so a runaway generation can't grow KV unbounded.
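The effect of these knobs is easy to estimate. The sketch below uses the usual per-token KV size (one key and one value vector per layer) with a hypothetical model shape and an assumed 16 GiB KV pool; substitute the numbers from your model config.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_full_length_seqs(pool_bytes: int, max_model_len: int, per_token: int) -> int:
    """How many sequences fit if every one grows to max_model_len tokens."""
    return pool_bytes // (max_model_len * per_token)


GIB = 1024 ** 3
# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128.
fp16 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)  # 131072 bytes = 128 KiB
fp8 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)   # half of that

pool = 16 * GIB  # assumed memory left for KV after weights and activations
for max_len in (8192, 32768):
    print(max_len,
          max_full_length_seqs(pool, max_len, fp16),
          max_full_length_seqs(pool, max_len, fp8))
# 8192 16 32
# 32768 4 8
```

This is the worst case where every sequence reaches the maximum length. Servers with paged KV allocation fit more when sequences are shorter, but the scaling with context length and element size is the same.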
Bad / garbled output
- Wrong chat template → garbage or ignored instructions. Match the model's exact prompt format / special tokens. Most "model is broken" reports are template bugs.
- Tokenizer mismatch → use the tokenizer that ships with the checkpoint.
- Quantization too aggressive → quality drop; try a higher-bit scheme or different method (AWQ vs GPTQ), compare on an eval set.
- Repetition / rambling → check sampling params (temperature, top-p, repetition/frequency penalty, stop tokens). A missing stop token = never-ending output.
- Truncation → prompt + max_tokens exceeds the context window; depending on the server, the request is rejected or the start of the prompt is silently dropped. Budget tokens.
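The last two bullets can be checked mechanically. A small sketch, assuming you can count prompt tokens with the checkpoint's tokenizer and that responses carry an OpenAI-style `finish_reason` of `"stop"` or `"length"`; both helper names are hypothetical.

```python
from typing import Sequence


def clamp_max_tokens(prompt_tokens: int, requested_max_tokens: int, context_window: int) -> int:
    """Largest output budget that still fits; raises if the prompt alone does not fit."""
    room = context_window - prompt_tokens
    if room <= 0:
        raise ValueError(
            f"prompt ({prompt_tokens} tokens) fills the {context_window}-token window"
        )
    return min(requested_max_tokens, room)


def length_stop_share(finish_reasons: Sequence[str]) -> float:
    """Fraction of responses that ended by hitting max_tokens instead of a stop token."""
    if not finish_reasons:
        return 0.0
    return sum(r == "length" for r in finish_reasons) / len(finish_reasons)


print(clamp_max_tokens(7000, 2048, 8192))                        # 1192
print(length_stop_share(["stop", "length", "length", "stop"]))   # 0.5
```

A share close to 1.0 across ordinary prompts suggests the stop token or EOS is not being honored.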
wrong stop token = runaway cost
If the stop sequence / EOS isn't configured, generation runs to max_tokens every time:
latency, cost, and KV all balloon. Verify stop tokens against the model card.
"Same prompt, different answer"
- Sampling is stochastic → set temperature=0 (greedy) for reproducibility tests.
- Even at temp 0, batching can shift results: floating-point reduction order depends on batch composition. Small numerical differences are expected, not a bug.
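The reduction-order effect is easy to reproduce on a CPU. A tiny illustration with plain Python floats; it shows the arithmetic only, not what any particular GPU kernel does.

```python
# Same three numbers, two reduction orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # tiny, on the order of 1e-16
```

When two candidate tokens have nearly equal logits, a difference this small can be enough to flip the greedy choice, and everything generated after that token then diverges.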
Quick reference
curl -s localhost:8000/metrics | grep -E 'first_token|per_output_token|cache_usage|waiting'
# TTFT bad    → chunked prefill + prefix cache + warmup + capacity
# TPOT bad    → 4-bit weights + tensor-parallel + speculative + smaller batch
# throughput  → continuous batching + bigger KV pool, cap output tokens
# OOM         → lower max-model-len / max-num-seqs, KV quant, gpu-mem-util 0.85
# bad output  → chat template + tokenizer + sampling params + stop tokens
nvidia-smi dmon -s u   # is the GPU actually the bottleneck?