Questions for LLM-serving, inference-optimization, and ML-platform roles, graded easy → hard with full answers. Companion to the LLM Inference cheatsheet.
Easy — fundamentals
What are the two phases of LLM inference? easy
Prefill processes the entire prompt in parallel and builds the KV cache — it's compute-bound (a big matmul). Decode then generates output tokens one at a time, autoregressively; each step reads all the weights and the growing KV cache to produce a single token, so it's memory-bandwidth-bound. They have different bottlenecks and different metrics (TTFT for prefill, per-token latency for decode).
What is the KV cache and why does it exist? easy
In attention, each new token attends to all previous tokens via their Key and Value vectors. Without caching you'd recompute K and V for the whole sequence at every step — O(N²) wasted work. The KV cache stores the K and V tensors for past tokens so each decode step only computes K/V for the one new token and reuses the rest. The cost is memory: it grows linearly with sequence length and batch size, and at scale it's usually what runs you out of VRAM.
Define TTFT and TPOT. easy
TTFT = time to first token: how long until the user sees the first output. Dominated by prompt length (prefill) and queue wait. TPOT (a.k.a. inter-token latency) = average time per output token after the first. Total latency a user feels ≈ TTFT + (output_tokens × TPOT). They're tuned separately — prefill tricks fix TTFT, decode tricks fix TPOT.
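A quick way to internalize the latency formula is to plug numbers into it. The sketch below is illustrative only: the function name and the timings are made up, and it uses the same approximation as the answer above (TTFT plus output tokens times TPOT).

```python
def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate user-perceived latency: TTFT + output_tokens * TPOT."""
    return ttft_s + output_tokens * tpot_s


# Hypothetical request: 0.4 s to first token, 25 ms per token, 200 output tokens.
short_prompt = total_latency_s(ttft_s=0.4, tpot_s=0.025, output_tokens=200)
# Same decode speed, but a long prompt pushes TTFT to 2.0 s.
long_prompt = total_latency_s(ttft_s=2.0, tpot_s=0.025, output_tokens=200)
print(f"short prompt: {short_prompt:.1f} s, long prompt: {long_prompt:.1f} s")
```

With these made-up numbers decode dominates total time, while the prompt length only moves the TTFT term, which is why the two are tuned separately.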
Why do we quantize models for inference? easy
To shrink the model's memory footprint and bandwidth needs. Decode is bandwidth-bound — every token re-reads all the weights — so storing weights in INT8/INT4 instead of FP16 means fewer bytes to read per token, which directly speeds up decode and lets bigger models fit on smaller GPUs. Methods like GPTQ and AWQ do this with minimal quality loss. The trade-off is some accuracy degradation and calibration effort.
What does a server like vLLM give you over a plain model.generate() loop? easy
Production serving features a naive loop lacks: continuous batching (pack many concurrent requests efficiently), PagedAttention (memory-efficient KV cache), prefix caching, an OpenAI-compatible API, streaming, and tensor parallelism. The result is far higher throughput and GPU utilization at scale, plus the operational glue you'd otherwise build yourself.
What are the two phases of LLM inference and how do they differ? easy
Inference splits into prefill and decode. Prefill processes the entire prompt in parallel in one forward pass, which is compute-bound and builds the KV cache; decode then generates output tokens one at a time autoregressively, which is memory-bandwidth-bound because each step streams the weights to produce a single token. This asymmetry is why prefill drives time-to-first-token and decode drives per-token latency.
What is the KV cache and why does it exist? easy
During decode, each new token attends to all previous tokens, which would mean recomputing their key/value projections every step — O(n²) work. The KV cache stores those keys and values once so each step only computes the new token's attention against the cache. It trades memory (which grows with sequence length and batch) for avoiding massive recomputation, making autoregressive decoding practical.
What are TTFT and TPOT? easy
TTFT (time to first token) is how long until the first output token appears, dominated by the prefill phase and prompt length. TPOT (time per output token) is the steady-state per-token latency during decode. Together they define the user-perceived latency: TTFT for responsiveness and TPOT for streaming speed, and you tune them separately.
What is batching in inference and why does it raise throughput? easy
Batching processes multiple requests together in one forward pass so the cost of loading the model weights from HBM is amortized across many tokens instead of one. Since decode is memory-bandwidth-bound, serving requests one at a time wastes the GPU; batching dramatically increases tokens/second. The trade-off is that larger batches can raise individual request latency.
What is quantization in serving and what does it buy you? easy
Quantization stores and/or computes weights (and sometimes activations and the KV cache) in lower precision such as INT8, FP8, or INT4 instead of FP16. This shrinks memory footprint — letting bigger models or more concurrent requests fit — and increases throughput on hardware with low-precision units. The risk is some accuracy loss, which good methods (AWQ/GPTQ) and validation keep small.
What is the difference between greedy and sampling decoding? easy
Greedy decoding picks the single highest-probability token at each step, giving deterministic but often repetitive output. Sampling instead draws from the probability distribution — shaped by temperature, top-k, and top-p — producing more diverse and creative text. You choose greedy (or low temperature) for factual/deterministic tasks and sampling for open-ended generation.
What is temperature in sampling? easy
Temperature is a scalar that divides the logits before the softmax: values below 1 sharpen the distribution toward the most likely tokens (more deterministic), while values above 1 flatten it (more random and diverse). Dividing by 0 is undefined, so implementations treat temperature 0 as greedy decoding (argmax). It's the primary knob for trading off coherence against creativity.
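To see the sharpening and flattening effect, here is a minimal pure-Python sketch of a temperature-scaled softmax. The function name and the example logits are illustrative, not taken from any particular library.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy (argmax) instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up logits for a 3-token vocabulary
cold = softmax_with_temperature(logits, 0.5)  # sharper: mass moves to the top token
hot = softmax_with_temperature(logits, 2.0)   # flatter: mass spreads out
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```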
What are top-k and top-p (nucleus) sampling? easy
Top-k restricts sampling to the k most probable tokens, cutting off the long unlikely tail. Top-p (nucleus) instead keeps the smallest set of tokens whose cumulative probability reaches p, so the candidate set adapts to how peaked or flat the distribution is. Both reduce incoherent outputs from the tail while preserving useful diversity, and are often combined with temperature.
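The difference between the two filters is easiest to see on a toy distribution. This is a sketch under simplifying assumptions (probabilities as a plain list, helper names invented here); real servers apply the same idea to logits on the GPU.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token distribution
print(sorted(top_k_filter(probs, k=2)))    # fixed-size candidate set
print(sorted(top_p_filter(probs, p=0.9)))  # size adapts to the distribution
```

On a flatter distribution the top-p set would grow while the top-k set stays at k, which is the adaptivity the answer describes.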
Why is the decode phase memory-bandwidth-bound? easy
In decode you generate one token at a time, so (especially at low batch) you read the entire set of model weights from HBM just to produce a single token's worth of compute. The arithmetic per byte loaded is tiny, so the limiter is memory bandwidth, not FLOPs. This is why batching, quantization, and KV-cache efficiency — all about reducing/amortizing memory traffic — are the main throughput levers.
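A back-of-envelope bound makes the bandwidth argument concrete. The sketch assumes batch size 1, ignores KV-cache reads and kernel overheads, and uses an assumed round figure of 2 TB/s for HBM bandwidth; the function name is hypothetical.

```python
def decode_tokens_per_s_bound(n_params, bytes_per_param, hbm_bytes_per_s):
    """Upper bound on single-stream decode speed if every token re-reads all weights."""
    weight_bytes = n_params * bytes_per_param
    return hbm_bytes_per_s / weight_bytes


HBM_BYTES_PER_S = 2e12  # assumption: roughly 2 TB/s of memory bandwidth

fp16 = decode_tokens_per_s_bound(7e9, 2.0, HBM_BYTES_PER_S)  # 7B params, 2 bytes each
int4 = decode_tokens_per_s_bound(7e9, 0.5, HBM_BYTES_PER_S)  # same model, 4-bit weights
print(f"FP16 bound: {fp16:.0f} tok/s, 4-bit bound: {int4:.0f} tok/s")
```

Real throughput lands below this bound, but the ratio shows why fewer bytes per weight translates almost directly into faster memory-bound decode.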
What is a context window? easy
The context window is the maximum number of tokens (prompt plus generated output) the model can attend to at once, set by its training and positional scheme. Exceeding it forces truncation, a sliding window, summarization, or switching to a longer-context model. It also bounds KV-cache size, so larger windows cost proportionally more memory.
Medium — applied serving
Explain continuous (in-flight) batching and why it beats static batching. medium
Static batching groups requests and runs them together until the whole batch finishes — so a batch of mixed lengths is held hostage by its longest sequence (head-of-line blocking), and a finished short request can't free its slot. Continuous batching schedules at the token level: every decode step the scheduler can admit new requests and evict completed ones, so a finished sequence immediately frees its slot for a queued request. This keeps the GPU densely packed regardless of length variance, dramatically improving throughput and utilization under concurrency. It's the default in vLLM/TGI/TensorRT-LLM.
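A toy simulation shows the head-of-line effect. It is a deliberately simplified model, not how any real scheduler is implemented: every request is just a count of output tokens, one decode step produces one token per running sequence, and prefill is ignored.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lens), batch_size):
        steps += max(output_lens[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Finished sequences free their slot immediately for queued requests."""
    queue = deque(output_lens)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())  # admit new work into free slots
        steps += 1  # one decode step: every running sequence emits one token
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lens = [2, 10, 2, 2, 2, 2]  # one long request mixed with short ones
print(static_batching_steps(lens, batch_size=2))      # 14
print(continuous_batching_steps(lens, batch_size=2))  # 10
```

In the static case the short requests wait behind the 10-token one; in the continuous case they cycle through the second slot while it runs.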
What is PagedAttention and what problem does it solve? medium
Naive KV-cache allocation reserves one contiguous block per sequence sized for the max possible length. That causes massive internal fragmentation and wasted VRAM, capping how many sequences you can run. PagedAttention (from vLLM) borrows OS virtual memory: it splits the KV cache into fixed-size blocks and uses a block table to map a sequence's logical positions to non-contiguous physical blocks. No need for contiguity, near-zero waste, blocks allocated on demand — so you fit far more concurrent sequences in the same VRAM, and you can even share blocks across sequences (e.g. common prefixes, beam search).
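The block-table idea can be sketched in a few lines. This is an illustrative data structure only (class and method names are invented here, and no tensors are involved); it is not vLLM's actual implementation.

```python
class BlockAllocator:
    """Toy paged KV allocator: logical token positions map to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}               # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:    # last block is full (or none yet)
            if not self.free:
                raise MemoryError("KV pool exhausted")
            table.append(self.free.pop())    # allocate on demand, any free block
        self.lengths[seq_id] = length + 1

    def physical_slot(self, seq_id, position):
        """Translate a logical position to (physical block id, offset in block)."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def release(self, seq_id):
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")  # 3 tokens need 2 blocks, not a max-length reservation
print(alloc.block_tables["a"], alloc.physical_slot("a", 2))
```

The only waste is the unused tail of the last block, and released blocks go straight back to the shared pool.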
TTFT is high but per-token latency is fine. How do you diagnose and fix it? medium
High TTFT points at prefill or queueing, not decode. Check: (1) prompt length — long prompts mean heavy prefill; enable chunked prefill and prefix caching so shared system prompts / RAG context aren't recomputed. (2) queue wait — look at requests-waiting / queue depth in metrics; if requests pile up before scheduling, add capacity or cap concurrency. (3) cold start — the first request after load triggers kernel autotune / CUDA graph capture; warm up on boot. Measure the split (stream the response, read vLLM's Prometheus metrics) before guessing.
Your server OOMs only under high concurrency or long prompts, but is fine at low load. Why? medium
It's the KV cache, not the weights. KV memory scales with context length × number of concurrent sequences. At low load it fits; add concurrency or longer prompts and the KV pool overflows. Fixes: lower max-model-len to what you actually need, cap max-num-seqs, enable KV-cache quantization (INT8/FP8), set a hard max-output-tokens so a runaway generation can't grow KV unbounded, and back off gpu-memory-utilization (e.g. 0.85–0.90) to leave room for activation spikes.
How does weight-only 4-bit quantization speed up decode specifically? medium
Decode is memory-bandwidth-bound: each generated token requires reading the entire weight matrix from HBM, and there's very little math per byte. If weights are stored 4-bit instead of 16-bit, you move ~4× fewer bytes per token, so the bandwidth-limited step finishes faster — a near-linear win on per-token latency for memory-bound decode. Activations stay higher precision (weight-only), and methods like AWQ/GPTQ keep quality high by protecting salient weights / calibrating. Prefill (compute-bound) benefits less.
What is continuous (in-flight) batching? medium
Traditional static batching waits for a whole batch to finish before starting the next, wasting the GPU when requests have different lengths. Continuous batching lets the scheduler add new requests to and remove finished ones from the running batch at every decode step, so the GPU stays full under mixed-length traffic. It's the single biggest throughput win in modern serving stacks like vLLM and TGI.
What is PagedAttention? medium
PagedAttention manages the KV cache in fixed-size pages, like operating-system virtual memory, instead of one contiguous reservation per sequence. This removes external fragmentation and limits internal fragmentation to each sequence's last partially filled page, lets sequences grow without pre-allocating for max length, and enables sharing pages (e.g. for a common prefix). The result is much higher concurrency and memory efficiency, which is core to vLLM's throughput.
Give the KV-cache size formula and explain the terms. medium
Per request it's roughly 2 × n_layers × n_kv_heads × head_dim × seq_len × dtype_bytes, and you multiply by batch size for the whole serving load. The factor 2 covers keys and values; n_kv_heads (not query heads) reflects GQA/MQA sharing; and it scales linearly with sequence length. This formula is how you compute how much memory concurrency and context length will cost.
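The formula translates directly into code. The model shape below (32 layers, head_dim 128, 8 vs 32 KV heads) is an illustrative assumption chosen to show the GQA effect, not a claim about a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes, batch=1):
    """2 (K and V) x layers x KV heads x head_dim x tokens x bytes per element x batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch


GIB = 1024 ** 3

# Hypothetical 32-layer model, 4096-token context, FP16 cache (2 bytes per element).
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096, dtype_bytes=2)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, dtype_bytes=2)
print(f"GQA (8 KV heads):  {gqa / GIB:.2f} GiB per request")
print(f"MHA (32 KV heads): {mha / GIB:.2f} GiB per request")
```

Multiplying either figure by the number of concurrent requests gives the serving-level cost the answer refers to.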
What is the difference between MHA, MQA, and GQA? medium
In Multi-Head Attention every query head has its own key/value head, giving the largest KV cache. Multi-Query Attention shares a single key/value head across all query heads, shrinking the KV cache dramatically at some quality cost. Grouped-Query Attention is the middle ground: groups of query heads share key/value heads, capturing most of MQA's memory savings with little quality loss, which is why Llama-2 70B, Llama-3, Mistral, and others use it.
What is speculative decoding? medium
A small, fast draft model proposes several future tokens, and the large target model verifies them all in a single forward pass, accepting the longest correct prefix and falling back where they disagree. Because verification is one parallel pass rather than many sequential decode steps, you get fewer expensive big-model steps while provably preserving the target model's output distribution. It speeds up decode most when the draft model agrees often.
How does tensor parallelism help inference? medium
Tensor parallelism splits each layer's weight matrices across multiple GPUs, so a model too large for one GPU's memory fits and the matmuls run in parallel. It requires an all-reduce within every layer to combine partial results, so it depends on fast interconnect (NVLink) and is normally kept within a node. It's the standard way to serve very large models with acceptable latency.
What is chunked prefill? medium
Chunked prefill splits a long prompt's prefill into smaller pieces processed across several scheduler steps, interleaving them with ongoing decode work. Without it, a single long prompt monopolizes a step and spikes everyone else's latency (head-of-line blocking). It smooths tail latency under mixed traffic at the cost of slightly more scheduling complexity.
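One way to picture the scheduling is a per-step token budget shared between decode and prefill. The sketch assumes a simple policy (each running sequence gets its one decode token first, the remainder of the budget goes to the next prompt chunk); the function and parameter names are hypothetical, and real schedulers are more involved.

```python
def plan_prefill_chunks(prompt_len, num_decoding_seqs, token_budget):
    """Split a prompt into per-step chunks that fit beside ongoing decode work."""
    room = token_budget - num_decoding_seqs  # tokens left after one per decoding seq
    if room <= 0:
        raise ValueError("token budget leaves no room for prefill")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, room)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# Hypothetical: a 5000-token prompt arrives while 48 sequences are decoding.
print(plan_prefill_chunks(prompt_len=5000, num_decoding_seqs=48, token_budget=2048))
# [2000, 2000, 1000]: decode keeps stepping while the prompt is absorbed in 3 steps
```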
How do you size the KV cache for a deployment? medium
Compute per-token KV bytes from the formula, multiply by your target maximum context length and desired concurrency, then check it against (GPU memory − model weights). That difference, divided by per-request KV size, gives your concurrency ceiling. You stretch it with PagedAttention, KV quantization (FP8/INT8), or a GQA model, and you admission-control load so you don't oversubscribe and trigger eviction.
What is prefix/prompt caching? medium
Many requests share a long common prefix — a system prompt, few-shot examples, or a document. Prefix caching computes and stores that prefix's KV once and reuses it across requests instead of re-running prefill each time. This sharply reduces prefill cost and TTFT for workloads with repeated context, and PagedAttention makes the page sharing efficient.
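A common way to detect a shared prefix is to hash the prompt block by block, chaining each hash to the previous one so a block only matches when everything before it matches too. The sketch below illustrates that idea with invented helper names; it tracks hashes only, not the KV tensors themselves.

```python
import hashlib


def block_hashes(token_ids, block_size):
    """Chained hashes over full blocks; a partial trailing block is not cached."""
    hashes, previous = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(previous + repr(block).encode()).digest()
        hashes.append(digest)
        previous = digest
    return hashes


def cached_prefix_tokens(token_ids, cache, block_size):
    """Number of leading prompt tokens whose KV could be reused from the cache."""
    hit = 0
    for digest in block_hashes(token_ids, block_size):
        if digest not in cache:
            break
        hit += block_size
    return hit


system_prompt = list(range(8))               # stand-in token ids for a shared prefix
first = system_prompt + [100, 101, 102, 103]
second = system_prompt + [200, 201, 202, 203]

cache = set(block_hashes(first, block_size=4))            # populated by the first request
print(cached_prefix_tokens(second, cache, block_size=4))  # 8: only the suffix needs prefill
```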
How do batch size and the throughput/latency trade-off interact? medium
Increasing batch size raises tokens/second (throughput) because weight loads are amortized, but it also increases each request's latency and can worsen TTFT as requests wait to be scheduled. So you pick the largest batch that still meets your latency SLA, using continuous batching to keep utilization high without forcing requests to wait for a full batch. It's a deliberate throughput-vs-latency tuning exercise driven by your SLA.
Hard — senior & systems
Explain speculative decoding and why it doesn't change the output distribution. hard
Decode is latency-bound by sequential token generation. Speculative decoding uses a small, cheap draft model to propose K future tokens, then the large target model verifies all K in a single forward pass (it can score K positions in parallel because the candidate tokens are known). Using a modified rejection-sampling scheme, it accepts the longest prefix that's consistent with the target's own probabilities and resamples at the first rejection. Because acceptance/rejection is calibrated to the target's distribution, the final output is provably distributed exactly as if the target had generated it alone — same quality, lower latency when the draft is often right. Variants: self-speculation, Medusa heads, EAGLE, n-gram/prompt lookup.
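The acceptance rule for a single proposed token can be sketched as follows: accept with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q), where p and q are the target and draft probabilities. The toy distributions and function names are assumptions for illustration; a real system applies this per position over model logits.

```python
import random


def verify_draft_token(token, target_probs, draft_probs, rng):
    """Return (accepted, output_token) for one token proposed by the draft."""
    p, q = target_probs[token], draft_probs[token]
    if rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: resample from the part of the target the draft under-covers.
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    total = sum(residual)
    weights = [r / total for r in residual]
    return False, rng.choices(range(len(weights)), weights=weights, k=1)[0]


def sample_one(target_probs, draft_probs, rng):
    """Draw a proposal from the draft, then verify it against the target."""
    proposal = rng.choices(range(len(draft_probs)), weights=draft_probs, k=1)[0]
    _, token = verify_draft_token(proposal, target_probs, draft_probs, rng)
    return token


rng = random.Random(0)
target, draft = [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]  # made-up distributions
n = 20_000
counts = [0, 0, 0]
for _ in range(n):
    counts[sample_one(target, draft, rng)] += 1
print([round(c / n, 2) for c in counts])  # empirical frequencies track `target`
```

Even though the draft distribution is quite different from the target here, the output frequencies follow the target; a worse draft only lowers the acceptance rate, not the quality.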
Walk through the VRAM budget of a serving deployment and how you'd size it. hard
VRAM ≈ weights + KV-cache pool + activation/overhead. Weights = params × bytes/param (e.g. 7B × 2 = 14 GB in FP16, ~4 GB in 4-bit). The KV pool is whatever's left after weights and a safety margin — and it determines your max concurrent tokens: KV per token ≈ 2 × layers × kv_heads × head_dim × dtype_bytes, so total live tokens = pool ÷ that. To size: pick the model and precision (sets weights), set gpu-memory-utilization ~0.9 (leave headroom for activation spikes), and the remaining pool ÷ per-token KV gives your concurrency × context budget. If you need more concurrency or longer context, quantize weights (free up pool), quantize KV, or add GPUs with tensor parallelism. GQA/MQA models have far smaller KV (fewer kv-heads) → much higher concurrency.
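The sizing walk-through above can be turned into a small calculator. It is a simplification: activation and runtime overhead are folded into the utilization margin, and the model shape (32 layers, 8 KV heads, head_dim 128) and 80 GB card are illustrative assumptions.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    """KV bytes added per token: 2 (K and V) x layers x KV heads x head_dim x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_live_tokens(gpu_bytes, gpu_memory_utilization, weight_bytes, per_token_kv_bytes):
    """Tokens that fit in the KV pool left after weights and the safety margin."""
    pool_bytes = gpu_bytes * gpu_memory_utilization - weight_bytes
    return max(0, int(pool_bytes // per_token_kv_bytes))


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2)
weights_fp16 = 7e9 * 2  # 7B params x 2 bytes = 14 GB, as in the answer above
tokens = max_live_tokens(80e9, 0.9, weights_fp16, per_token)

context = 4096
print(f"{per_token} KV bytes/token, {tokens} live tokens, "
      f"about {tokens // context} concurrent requests at {context} tokens each")
```

Rerunning it with smaller weight_bytes (quantized weights) or smaller dtype_bytes (quantized KV) shows how each lever moves the concurrency × context budget.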
How do GQA/MQA change serving economics versus full multi-head attention? hard
Multi-Query Attention (MQA) shares a single K/V head across all query heads; Grouped-Query Attention (GQA) shares K/V across groups of query heads — a middle ground. Since the KV cache size is proportional to the number of KV heads, MQA/GQA shrink the KV cache by the head-reduction factor (e.g. 8× or more). That means: much higher concurrency and longer context per GB, lower decode bandwidth (fewer KV bytes to read per token → faster TPOT), at a small quality cost versus full MHA. This is why nearly all modern serving-oriented models (Llama-2/3 70B, Mistral, etc.) use GQA — it directly improves the memory-bound decode path.
You need to serve hundreds of fine-tuned variants of one base model cost-effectively. How? hard
Don't load hundreds of full models. Use multi-LoRA serving (e.g. S-LoRA / vLLM LoRA): keep one copy of the frozen base model in VRAM and load only the small LoRA adapters (a few MB each) per variant. At request time you apply the right adapter's low-rank delta on top of the shared base, batching requests for different adapters together with specialized kernels. This amortizes the expensive base weights across all tenants and makes per-variant cost tiny. If a few variants are extremely hot, you can also merge their LoRA into a dedicated copy to remove adapter overhead.
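The "low-rank delta on top of the shared base" is just y = W·x + scale · B·(A·x). Below is a tiny pure-Python sketch with made-up 2×2 weights and rank-1 adapters; the names and the adapter registry are hypothetical, and production systems batch this with specialized GPU kernels.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Shared base output plus the adapter's low-rank correction."""
    base = matvec(W, x)               # frozen base weights, shared by all tenants
    delta = matvec(B, matvec(A, x))   # rank-r path: project down with A, up with B
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]          # one resident copy of the base layer
adapters = {                          # per-variant (A, B) pairs, tiny compared to W
    "tenant-a": ([[1.0, 1.0]], [[2.0], [0.0]]),
    "tenant-b": ([[0.0, 1.0]], [[0.0], [1.0]]),
}

x = [1.0, 2.0]
for name, (A, B) in adapters.items():
    print(name, lora_forward(W, A, B, x))
```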
Single-GPU throughput is fine but tail latency (p99) is bad under load. What's happening and what do you tune? hard
High p99 with OK throughput usually means scheduling contention: a few very long generations or huge prompts hog the batch and KV pool, delaying others (head-of-line effects), or the batch is so large that per-step (TPOT) time stretches the tail. Tune: enforce a max output tokens and consider separate queues/instances for long vs short jobs; enable chunked prefill so big prompts don't block decode of others; lower max-num-seqs to the knee where p99 still meets SLO (bigger batch raises throughput but worsens per-request latency); use prefix caching to cut repeated prefill. Track goodput (requests meeting SLO), not raw throughput — that's the number that reflects user experience.
Tensor parallel vs pipeline parallel for inference — trade-offs? hard
Tensor parallel (TP) splits each layer's matmuls across GPUs and all-reduces activations every layer — low added latency but chatty, so it needs fast intra-node interconnect (NVLink); ideal for latency-sensitive serving on one node. Pipeline parallel (PP) splits layers into sequential stages across GPUs/nodes — communication is just activations handed between stages (less bandwidth), so it crosses nodes well, but it adds pipeline latency and needs request/micro-batch concurrency to keep all stages busy (otherwise bubbles). For inference: prefer TP within a node for low latency; add PP to span nodes when a model is too big for one node's GPUs. Expert parallel handles MoE routing on top.
What is disaggregated prefill/decode serving and why do it? hard
Prefill is compute-bound while decode is memory-bandwidth-bound, so running them on the same GPUs means one phase stalls the other (a long prefill blocks others' decode). Disaggregation runs prefill and decode on separate GPU pools, each tuned and scaled for its phase, and transfers the KV cache between them. This improves tail latency and lets you scale the two phases independently, at the cost of KV-transfer overhead and added system complexity.
Compare AWQ, GPTQ, and SmoothQuant. hard
GPTQ is a layer-wise weight quantization that minimizes output error using second-order information, good for low-bit weight-only quantization. AWQ is activation-aware — it identifies and preserves the salient weight channels that most affect outputs, giving robust low-bit weight quantization. SmoothQuant shifts activation outliers into the weights so that both activations and weights can be quantized to INT8 together (W8A8). You pick based on whether you need weight-only vs activation quantization and your target bit-width.
What are the trade-offs of an FP8/INT8 KV cache? hard
Quantizing the KV cache to 8 bits (FP8 or INT8) halves its memory relative to FP16, and lower bit-widths reduce it further, directly increasing the concurrency and context length you can serve for the same GPU memory. The downside is precision loss that can degrade long-context recall and accuracy, since attention is sensitive to KV fidelity. You mitigate with per-channel/per-token scaling and by keeping sensitive layers higher precision, and you must validate on real long-context tasks rather than assuming it's free.
How does multi-LoRA serving work? hard
Instead of loading N separately fine-tuned full models, you keep a single base model resident and apply each request's lightweight LoRA adapter on the fly. Systems batch requests with different adapters together using segmented/grouped GEMMs so the base-model compute is still shared. This serves many fine-tunes cheaply on one GPU, with only the small adapter weights swapped per request.
How do you engineer for a strict tail-latency (p99) SLA? hard
Bound the worst case at every level: cap maximum batch size so per-step time is predictable, use chunked prefill so a huge prompt can't monopolize a step, and add admission control/queueing so the system sheds or delays load instead of degrading. Separate prefill-heavy from decode traffic, prioritize/schedule requests, and autoscale replicas on queue depth. The principle is to make the longest possible request unable to dominate, then provision for the SLA.
What is constrained / structured decoding and what does it cost? hard
Constrained decoding masks the model's logits each step to only allow tokens valid under a grammar or JSON schema — often implemented with a finite-state machine or regex compiled over the vocabulary — guaranteeing parseable output. The cost is per-step mask computation/lookup and reduced sampling freedom, plus engineering to compile the grammar. It's the reliable way to get machine-consumable structured output instead of hoping the prompt is followed.
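A toy example shows the mechanism: a finite-state machine decides which token ids are legal at each step, and everything else is masked to negative infinity before picking. The six-token vocabulary, the FSM, and the stand-in "model" that always prefers an invalid token are all invented for illustration.

```python
import json
import math

VOCAB = ["{", "}", '"k"', ":", "1", "hello"]
# state -> {allowed token id: next state}; this FSM accepts exactly {"k":1}
FSM = {0: {0: 1}, 1: {2: 2}, 2: {3: 3}, 3: {4: 4}, 4: {1: 5}}
FINAL_STATE = 5


def mask_logits(logits, allowed_token_ids):
    """Set every disallowed token's logit to -inf so it can never be chosen."""
    allowed = set(allowed_token_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def constrained_greedy_decode(logits_fn):
    state, pieces = 0, []
    while state != FINAL_STATE:
        masked = mask_logits(logits_fn(pieces), FSM[state].keys())
        token = max(range(len(masked)), key=masked.__getitem__)
        pieces.append(VOCAB[token])
        state = FSM[state][token]
    return "".join(pieces)


def stubborn_model(_pieces):
    """Stand-in for a model that always wants to say 'hello'."""
    return [0.0, 0.0, 0.0, 0.0, 0.0, 5.0]


text = constrained_greedy_decode(stubborn_model)
print(text, json.loads(text))
```

The per-step cost is the mask lookup; the reduced sampling freedom is visible in that the model's preferred token never appears.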
How does request scheduling and preemption work under KV-cache pressure? hard
When the KV cache is full and new or continuing requests need space, the scheduler must preempt some sequences — either swapping their KV to host memory or evicting and later recomputing it. Recompute is expensive (you redo prefill), so the policy (which sequences to preempt, swap vs recompute) materially affects latency. Good capacity planning and admission control minimize how often preemption happens at all.
How do you serve a model across multiple nodes? hard
Use tensor parallelism within each node where NVLink makes the per-layer all-reduce cheap, and pipeline parallelism across nodes to split the model into stages connected by slower inter-node links. Minimize cross-node traffic (especially KV movement), balance pipeline stages to avoid bubbles, and rely on fast interconnect like InfiniBand. The mapping of parallelism dimensions to the hardware topology is what determines whether multi-node latency is acceptable.
What is KV-cache offloading and when is it worth it? hard
KV-cache offloading spills cold KV pages to CPU RAM or NVMe to extend effective capacity, letting you serve very long contexts or more concurrency than GPU memory alone allows. It's worth it when fetching a page back is cheaper than recomputing it and when the access pattern means offloaded pages are rarely needed urgently. If hot pages get offloaded, the fetch latency hurts more than it helps, so it's a capacity-vs-latency trade-off.
How do you benchmark an inference server meaningfully? hard
Drive it with a realistic request distribution (prompt/output length mix and arrival pattern), not single-request best case, and report TTFT and TPOT percentiles (p50/p95/p99), tokens/second per GPU, and goodput — throughput that still meets the latency SLA — across increasing concurrency. Plot how latency degrades as load rises to find the usable operating point. Reporting only average single-stream latency or peak throughput hides the behavior that matters in production.
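Two of the reported numbers are easy to get wrong, so here is a minimal sketch of how they can be computed from per-request measurements. The record fields, SLO values, and sample data are hypothetical; percentiles use the nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def goodput(requests, ttft_slo_s, tpot_slo_s, window_s):
    """Requests per second that met BOTH latency SLOs during the window."""
    ok = sum(1 for r in requests
             if r["ttft_s"] <= ttft_slo_s and r["tpot_s"] <= tpot_slo_s)
    return ok / window_s


# Made-up measurements from a 10-second load-test window.
requests = [
    {"ttft_s": 0.1, "tpot_s": 0.02},
    {"ttft_s": 0.2, "tpot_s": 0.02},
    {"ttft_s": 0.3, "tpot_s": 0.03},
    {"ttft_s": 0.4, "tpot_s": 0.03},
    {"ttft_s": 1.5, "tpot_s": 0.09},  # the tail request an average would hide
]
ttfts = [r["ttft_s"] for r in requests]
print("p50 TTFT:", percentile(ttfts, 50), "p99 TTFT:", percentile(ttfts, 99))
print("goodput:", goodput(requests, ttft_slo_s=0.5, tpot_slo_s=0.05, window_s=10), "req/s")
```

Repeating this at each concurrency level gives the latency-versus-load curve the answer recommends plotting.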
Scenario-based
TTFT is high but per-token latency (TPOT) is fine. Where do you look? medium
High TTFT = time is going to prefill (processing the whole prompt before the first token) or to queueing in front of it. Causes: very long prompts, no chunked prefill (so a big prompt monopolizes a step), prefill not batched, requests waiting in the queue, or a cold prefix cache (shared prompts recomputed on every request). Fixes: chunked prefill (split prompt processing across steps so it interleaves with decode), prompt caching / prefix caching for repeated system prompts, shorter prompts, and ensure prefill is batched. TPOT being fine confirms decode/KV path is healthy.
GPU is large but serving throughput is low under load. Diagnose. medium
Almost always the batching is misconfigured. Static batching wastes the GPU between requests; you want continuous (in-flight) batching so new requests join mid-decode. Check the server (vLLM/TGI/TensorRT-LLM) batch settings: max_num_seqs/max_batch too low, or a KV cache that is not paged (or a KV pool that is too small), so fragmentation or memory caps concurrency. Raise batch limits, enable paged KV, and confirm you're not memory-capped on KV cache. Measure tokens/sec vs batch to find the knee.
You OOM on long-context requests. What's eating the memory and how do you fix it? hard
The KV cache — it grows linearly with sequence length × batch: 2 × n_layers × n_kv_heads × head_dim × seq_len × batch × dtype_bytes. Long context blows it up. Fixes: PagedAttention (no contiguous reservation, less waste), KV-cache quantization (FP8/INT8 KV), a GQA/MQA model (fewer KV heads → far smaller cache), cap max context / max concurrent long requests, and offload/evict. Decide based on whether you need the context or can summarize/truncate.
p50 latency is good but p99 is terrible. Why? hard
Head-of-line blocking: a few large requests (long prompts/outputs) stall the batch or queue, spiking tail latency for everyone behind them. Fixes: chunked prefill so a giant prompt doesn't monopolize a step, request scheduling/prioritization, separating prefill-heavy and decode-heavy traffic, and capping max tokens. Also check for occasional preemption/recompute when KV cache is full (vLLM evicts and recomputes — expensive). Tune so the worst-case request can't dominate.
Cost-per-token is too high. What levers do you pull? medium
Raise utilization first: better continuous batching = more tokens per GPU-second. Then quantize (INT8/FP8/AWQ) to fit a smaller/cheaper GPU or higher batch. Speculative decoding (small draft model) cuts decode steps for big speedups. Prompt/prefix caching avoids recomputing shared system prompts. Consider a smaller or distilled model if quality allows, and route easy queries to it. Measure tokens/sec/$ before and after each change.
Accuracy dropped after you quantized to INT4. How do you recover quality? hard
Naive round-to-nearest (RTN) INT4 loses too much. Use a calibration-based method: GPTQ or AWQ (activation-aware: it protects salient weight channels), which keep quality far better than RTN. Keep outlier-sensitive layers (e.g. attention, first/last) in higher precision (mixed-precision quant). Or step back to INT8/FP8, which are usually close to lossless. Always evaluate on a real task set, not just perplexity, and compare per-task before shipping.
GPU memory is full and requests are being preempted/recomputed. How do you fix it? hard
Preemption means the KV cache is oversubscribed, so reduce its footprint and demand. Enable PagedAttention to cut fragmentation, quantize the KV cache to FP8/INT8, and cap the number of concurrent long-context requests so they don't exhaust memory. If you still need more, add tensor parallelism or more GPUs, and apply admission control so load is shed gracefully rather than thrashing on recompute.
Streaming responses but TTFT feels slow. What do you tune? medium
TTFT is dominated by prefill, so attack prefill cost: enable chunked prefill so long prompts don't block, use prefix caching so a shared system prompt isn't recomputed each request, shorten or trim prompts, and ensure prefill is batched. Then stream the first token to the client the moment it's produced. Together these cut the time before the user sees output.
You must serve many fine-tuned variants of one base model cheaply. How? hard
Use multi-LoRA serving: keep one copy of the base model resident and hot-swap each request's small LoRA adapter, batching mixed-adapter requests with grouped GEMMs. This avoids loading N full models and their N× memory, so dozens of fine-tunes can share a single GPU. Only the lightweight adapter weights differ per request.
Outputs differ across runs and QA complains. Why, and how do you fix it? medium
Sampling with temperature > 0 is inherently nondeterministic, so identical inputs yield different outputs. For reproducibility set temperature to 0 (greedy) or fix the random seed. Note that even then, batching and floating-point non-associativity can cause minor differences, so for strict determinism also pin batch composition and kernels where possible.
Latency is fine at low load but collapses past N concurrent users. Diagnose. hard
You've hit the KV-cache concurrency ceiling: beyond N users there's no memory for more sequences, so requests queue or trigger eviction/recompute, and latency spikes. Raise effective capacity (PagedAttention, quantized KV, a GQA model, or more GPUs) and/or admission-control the load so you stay under the ceiling. Compute the ceiling from per-request KV size vs available memory to confirm.
Cost per million tokens is too high. What levers do you pull? medium
Start with utilization — better continuous batching gives more tokens per GPU-second. Then quantize to fit a smaller/cheaper GPU or higher batch, add speculative decoding to cut decode steps, and cache shared prefixes to skip repeated prefill. Finally, route easy queries to a smaller/distilled model and reserve the large model for hard ones, measuring tokens/second/$ at each step.
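To measure the effect of each lever, convert throughput into cost. The sketch assumes a single GPU billed by the hour; the price and throughput figures are made up.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_s):
    """Dollars per 1M generated tokens for one GPU at a sustained throughput."""
    tokens_per_hour = tokens_per_s * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical $2/hour GPU, before and after improving batching.
before = cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_s=1000)
after = cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_s=2500)
print(f"before: ${before:.3f}/M tokens, after: ${after:.3f}/M tokens")
```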
You must serve a 70B model on a single 80GB GPU. How? hard
Weights at FP16 are ~140GB, so you must quantize: INT8 brings them to roughly 70GB, which fits an 80GB card but leaves only about 10GB for the KV cache, activations, and runtime overhead. Then size KV for your target concurrency/context and use PagedAttention to use it efficiently. If INT8 leaves too little headroom, drop to 4-bit (AWQ/GPTQ, roughly 35GB of weights); if a single GPU still can't meet the target, the fallback is to split across two GPUs with tensor parallelism.
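The weight arithmetic behind this answer, as a small sketch. It counts raw weight bytes only (1 GB = 1e9 bytes); real quantized checkpoints carry some extra bytes for scales and zero-points, which are ignored here.

```python
def weight_gb(n_params, bits_per_param):
    """Raw weight size in GB (1e9 bytes), ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 1e9


GPU_GB = 80
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size = weight_gb(70e9, bits)
    headroom = GPU_GB - size  # what is left for KV cache, activations, overhead
    verdict = f"{headroom:.0f} GB headroom" if headroom > 0 else "does not fit"
    print(f"{name}: {size:.0f} GB weights, {verdict}")
```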
p99 TTFT spikes whenever long prompts arrive. How do you fix it? hard
Long prompts cause head-of-line blocking during prefill. Enable chunked prefill so a big prompt is processed in pieces interleaved with other work, add request scheduling/prioritization, and impose a maximum prompt-length cap to bound the worst case. This keeps one giant prompt from monopolizing a scheduler step and inflating everyone's TTFT.
Accuracy regressed after enabling an FP8 KV cache. What do you check? medium
Verify the scaling factors are correct (per-channel/per-token scaling matters a lot for KV), since bad scales cause large errors. Consider keeping sensitive layers' KV at higher precision or falling back to INT8/FP16 KV, and evaluate specifically on long-context tasks where quantized KV hurts recall most. If quality can't be recovered, the memory savings aren't worth it for that workload.
You must guarantee a strict per-token latency SLA under bursty traffic. How do you design it? hard
Bound per-step time with a capped maximum batch size, and use continuous batching with admission control so bursts are queued/shed rather than blowing the SLA. Separate prefill from decode (and use chunked prefill) so big prompts don't stall decode, and autoscale replicas on queue depth so capacity tracks load. The combination keeps tail latency predictable even when traffic spikes.
For inference/serving roles (model-serving teams, inference startups, cloud ML-platform), the hard block is the loop: KV-cache math, PagedAttention, continuous batching, GQA/MQA economics, speculative decoding, and a "size this deployment / fix this latency" scenario. For applied / product-LLM roles, expect easy/medium: prefill vs decode, TTFT/TPOT, why quantize, what vLLM does. A very common live question: "latency is too high / throughput too low — walk me through how you'd debug it." Answer by splitting TTFT vs TPOT first, then naming the lever (batching, KV pool, quantization, parallelism) for each.