Deep dive 13 of the KubeCon Mumbai 2026 series. Shrinidhi Venkataraman and Nithin Rajan (AstraZeneca) gave the practitioner's map of LLM serving on Kubernetes: the difference between inference (the engine) and serving (the production wrapper), how the runtimes (vLLM, SGLang, Triton) differ, the aggregated-vs-disaggregated serving split, deploying the vLLM Production Stack via Helm, GPU-saving tricks (sleep mode, scale-to-zero with KEDA), benchmarking with NVIDIA AIPerf (70K tokens/sec, 0% errors), and full Prometheus/Grafana observability of the fleet.
This is the GPU-positive counterpart to deep dive 12: where that talk avoided GPUs entirely, this one is about serving models that genuinely need them — efficiently. Together they bracket the GPU-economics question the whole conference kept circling.
Inference vs serving — two different jobs
The inference runtimes compared
| Runtime | Focus & innovation | Best for |
|---|---|---|
| vLLM | high-throughput text completion, broad model support; PagedAttention, VRAM-fragmentation management; standard prompt-in/token-out APIs | bulk processing, high-traffic generic APIs, diverse model needs |
| SGLang | complex workflows, agents, structured generation; RadixAttention, chunked-prefill prefix-sharing across requests; a language for controlling agent execution & JSON output | multi-turn chat, RAG, AI agents with repeated context |
| Triton | multi-model, multi-framework enterprise orchestration; dynamic batching; custom C++/Python backends | varied AI workloads (CV, NLP, audio) in one unified infrastructure |
What is an inference stack?
A runtime alone isn't enough — you need the full environment to serve a model to real users at scale, connecting requests to the engine and keeping the infrastructure from buckling under heavy traffic. Three stacks the talk surveyed:
- Baseten — a managed, hosted serving platform for fast deployment.
- vLLM Production Stack — open-source, cloud-native: router, autoscaling, KV-cache & LoRA management.
- NVIDIA Dynamo — disaggregated serving across vLLM, TensorRT-LLM, and SGLang.
Aggregated vs disaggregated serving
An LLM request has two phases: prefill (process the whole prompt in one pass; compute-bound, favours low parallelism) and decode (generate tokens one at a time; memory-bandwidth-bound, favours high parallelism). Where you run them is a key architectural choice:
Fig 1 — disaggregation gives each phase its own GPUs (KV cache passed between), trading complexity for utilization.
Aggregated: prefill and decode share the same GPUs — one pool does both, simplest to run, but the two phases compete for the same hardware. Disaggregated: dedicated GPUs per phase with the KV cache passed between them — more moving parts, but far better GPU utilization at scale, because you can scale prefill and decode independently to match their very different parallelism profiles.
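To make the independent-scaling argument concrete, here is a back-of-the-envelope sizing sketch in Python. It is not from the talk: the `Workload` type, the function names, and every throughput figure are hypothetical placeholders, and real capacity planning would use measured per-replica numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_sec: float
    prompt_tokens: int   # average tokens processed during prefill
    output_tokens: int   # average tokens generated during decode


def replicas_needed(token_rate: float, per_replica_rate: float) -> int:
    """Smallest replica count whose combined capacity covers token_rate."""
    return max(1, math.ceil(token_rate / per_replica_rate))


def size_disaggregated(w: Workload, prefill_tps: float, decode_tps: float) -> dict:
    """Size the prefill and decode pools separately, each from its own token rate."""
    prefill_rate = w.requests_per_sec * w.prompt_tokens
    decode_rate = w.requests_per_sec * w.output_tokens
    return {
        "prefill": replicas_needed(prefill_rate, prefill_tps),
        "decode": replicas_needed(decode_rate, decode_tps),
    }


# Hypothetical RAG-style workload: long prompts, short answers.
rag = Workload(requests_per_sec=50, prompt_tokens=4000, output_tokens=200)
# Hypothetical chat-style workload: short prompts, long answers.
chat = Workload(requests_per_sec=50, prompt_tokens=200, output_tokens=1000)

# Per-replica capacities (tokens/sec) are made-up illustration values.
rag_pools = size_disaggregated(rag, prefill_tps=40_000, decode_tps=5_000)
chat_pools = size_disaggregated(chat, prefill_tps=40_000, decode_tps=5_000)
print("rag :", rag_pools)
print("chat:", chat_pools)
```

With these made-up numbers the two workloads want very different pool shapes at the same request rate, which is the case where a single aggregated pool ends up over-provisioned for one phase.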
The vLLM Production Stack
The open-source stack (powered by LMCache Lab + vLLM) turns "three confusing vLLM pods" into a managed system: a cloud-native router over multiple vLLM replicas (each with KV-cache and a LoRA loader), with autoscaling, Grafana monitoring, shared KV-cache storage, and a LoRA manager. Deploying it is a Helm chart:
# Helm: vllm-project.github.io/production-stack · chart vllm-stack 0.1.11
servingEngineSpec:
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
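Once the chart is installed, the router speaks the OpenAI-compatible API that vLLM exposes, so a smoke test is a single completion request. The sketch below uses only the Python standard library. It assumes the router Service has been port-forwarded to `localhost:30080` (an assumption about your setup); the helper names are illustrative.

```python
import json
import urllib.request

# Assumption: the router Service is reachable here, e.g. via kubectl port-forward.
BASE_URL = "http://localhost:30080"


def build_completion_request(prompt: str, max_tokens: int = 16) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style /v1/completions request."""
    payload = {
        "model": "facebook/opt-125m",  # matches modelURL in the values file
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the text of the first completion."""
    req = build_completion_request(prompt)
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Building the request needs no network; calling complete() does.
req = build_completion_request("Kubernetes is")
print(req.get_method(), req.full_url)
```

Calling `complete("Kubernetes is")` against a live deployment should return a short continuation; with a model as small as opt-125m, expect the text to be of smoke-test quality only.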
Sleep and wakeup mode — reclaim GPU memory
A neat GPU-saver: enable sleep mode and you can put an idle engine to sleep, freeing most of its VRAM, then wake it on demand.
vllmConfig:
  extraArgs: ["--enable-sleep-mode"]
env:
  - name: VLLM_SERVER_DEV_MODE
    value: "1"

# Put the engine to sleep (the URL is quoted so the shell does not glob the "?"):
curl -X POST "http://localhost:30080/sleep?id=..." | jq
# → "Sleep mode freed 39.26 GiB memory, 1.20 GiB still in use.
#    It took 5.75 seconds to fall asleep."
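To put a number on "most of its VRAM", the message quoted above can be parsed and turned into a ratio. This is a small sketch that assumes the message keeps exactly the wording shown in this post. That wording is log text, not a stable API, so treat the regex as illustrative.

```python
import re

# The message exactly as quoted in the post, joined onto one line.
SLEEP_MSG = (
    "Sleep mode freed 39.26 GiB memory, 1.20 GiB still in use. "
    "It took 5.75 seconds to fall asleep."
)

_PATTERN = re.compile(
    r"freed (?P<freed>[\d.]+) GiB memory, "
    r"(?P<used>[\d.]+) GiB still in use\. "
    r"It took (?P<secs>[\d.]+) seconds"
)


def parse_sleep_message(msg: str) -> dict:
    """Extract freed/remaining GiB and sleep time; raise if the format differs."""
    m = _PATTERN.search(msg)
    if m is None:
        raise ValueError(f"unrecognised sleep message: {msg!r}")
    freed = float(m["freed"])
    used = float(m["used"])
    return {
        "freed_gib": freed,
        "still_used_gib": used,
        "seconds": float(m["secs"]),
        # Share of the memory the engine held (freed + remaining) that was released.
        "fraction_freed": freed / (freed + used),
    }


stats = parse_sleep_message(SLEEP_MSG)
print(f"{stats['fraction_freed']:.1%} of the engine's held GPU memory was released")
```

For the run quoted above that works out to roughly 97% of the memory the engine was holding, released in under six seconds.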
NVIDIA Dynamo — one topology, any engine
Dynamo's pitch: the same disaggregated topology runs on whichever engine fits the workload — swap the runtime, keep the architecture:
| Engine | When |
|---|---|
| vLLM (dynamo.vllm) | default; broad model support |
| TensorRT-LLM (dynamo.trtllm) | peak throughput on NVIDIA GPUs |
| SGLang (dynamo.sglang) | agents & structured generation; NIXL transfer |
Each is configured with a --disaggregation-mode of prefill or decode — the topology is fixed, the engine is a swappable detail.
Scale to zero with KEDA
The biggest GPU cost-saver: drop idle deployments to zero replicas and wake them on the first request.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
spec:
  minReplicaCount: 0     # scale-to-zero
  maxReplicaCount: 5
  pollingInterval: 10    # seconds
  cooldownPeriod: 360    # 6 min
- Scale up on queue depth, e.g. dynamo_frontend_queued_requests > 5.
- Hold ≥ 1 replica while work is pending: a guard metric prevents scaling to zero mid-flight (so you don't drop in-flight requests).
- Wake from zero on new traffic, e.g. an ingress keepalive metric fires on the first hit.
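The way the three rules interact is easier to see as a decision function. The Python below is a toy model only: it is not KEDA's actual algorithm (KEDA activates the workload and delegates replica math to the HPA), and the function name, its parameters, and the "one replica per 5 queued requests" rule are assumptions made for illustration.

```python
import math


def desired_replicas(queued: int, in_flight: int, new_traffic: bool,
                     queue_target: int = 5, min_replicas: int = 0,
                     max_replicas: int = 5) -> int:
    """Toy combination of the three scaling rules; not KEDA's real logic."""
    # Rule 1: scale up on queue depth (assumed: one replica per queue_target queued).
    by_queue = math.ceil(queued / queue_target) if queued > 0 else 0
    # Rule 2: guard metric, hold at least one replica while any work is pending.
    guard = 1 if (queued > 0 or in_flight > 0) else 0
    # Rule 3: wake from zero when the first request arrives.
    wake = 1 if new_traffic else 0
    wanted = max(by_queue, guard, wake, min_replicas)
    return min(wanted, max_replicas)


scenarios = {
    "idle": desired_replicas(queued=0, in_flight=0, new_traffic=False),
    "draining": desired_replicas(queued=0, in_flight=3, new_traffic=False),
    "first hit": desired_replicas(queued=0, in_flight=0, new_traffic=True),
    "backlog": desired_replicas(queued=12, in_flight=4, new_traffic=True),
    "flood": desired_replicas(queued=100, in_flight=20, new_traffic=True),
}
for name, replicas in scenarios.items():
    print(f"{name:>9}: {replicas}")
```

The "draining" case is the one the guard exists for: the queue is empty, but in-flight requests keep one replica alive until they finish.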
Benchmarking with NVIDIA AIPerf
You can't tune what you don't measure. NVIDIA AIPerf drives load against the live endpoint and reports what users actually feel. A 100-concurrency, 10,000-request run:
| Metric | Result |
|---|---|
| Output tokens/sec | 70,516 |
| Requests/sec | 661 |
| Time to first token (avg) | 12.03 ms |
| Errors (over 60,230 reqs) | 0.0% |
The key serving metrics to watch: TTFT (time to first token — responsiveness), ITL (inter-token latency — streaming smoothness), throughput (tokens/sec, req/sec), and error rate under concurrency.
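For readers who want the definitions pinned down, here is a minimal Python sketch that derives these metrics from per-token arrival timestamps. It is not how AIPerf is implemented; the `RequestTrace` type and the sample timestamps are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RequestTrace:
    sent_at: float            # seconds, when the request was sent
    token_times: List[float]  # seconds, arrival time of each output token
    ok: bool = True           # False if the request errored


def ttft(t: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return t.token_times[0] - t.sent_at


def mean_itl(t: RequestTrace) -> float:
    """Mean inter-token latency; needs at least two output tokens."""
    gaps = [b - a for a, b in zip(t.token_times, t.token_times[1:])]
    return mean(gaps)


def output_tokens_per_sec(traces: List[RequestTrace]) -> float:
    """Total output tokens divided by the wall-clock span of the run."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return sum(len(t.token_times) for t in traces) / (end - start)


def error_rate(traces: List[RequestTrace]) -> float:
    return sum(1 for t in traces if not t.ok) / len(traces)


# Two made-up, overlapping requests.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25]),
    RequestTrace(sent_at=0.25, token_times=[0.5, 1.0, 1.5, 2.0]),
]
print("TTFT       :", [ttft(t) for t in traces])
print("mean ITL   :", [mean_itl(t) for t in traces])
print("tokens/sec :", output_tokens_per_sec(traces))
print("error rate :", error_rate(traces))
```

As a sanity check on the table above, dividing 70,516 output tokens/sec by 661 requests/sec implies roughly 107 output tokens per request, assuming both figures come from the same run.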
Observability — and watching it scale to zero
Every replica reports to Prometheus; Grafana shows the whole fleet in real time — frontend requests/sec, average TTFT, inter-token latency, request duration, input/output sequence length, and DCGM GPU utilization. The satisfying part: as the deployment drains, requests, latency, and GPU utilization all fall away together and scale to zero — exactly the behaviour you want, made visible.
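The same check can be scripted against the Prometheus HTTP API instead of watched on a dashboard. The sketch below builds an instant query and reads the first sample. The Prometheus address is an assumption about your cluster, the helper names are illustrative, and the sample response is hand-written in the shape the API returns.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumption: Prometheus is reachable here, e.g. via kubectl port-forward.
PROM_URL = "http://localhost:9090"


def instant_query_url(promql: str, base: str = PROM_URL) -> str:
    """URL for a Prometheus instant query (/api/v1/query)."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def first_value(response: dict) -> Optional[float]:
    """First sample of an instant-query response, or None if the result is empty."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    result = response["data"]["result"]
    if not result:
        return None
    _timestamp, value = result[0]["value"]  # the value arrives as a string
    return float(value)


def query(promql: str) -> Optional[float]:
    """Run the query against a live Prometheus (needs network access)."""
    with urllib.request.urlopen(instant_query_url(promql), timeout=10) as resp:
        return first_value(json.load(resp))


url = instant_query_url("sum(dynamo_frontend_queued_requests)")
print(url)

# Hand-written example of a drained deployment: the queue has fallen to zero.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000.0, "0"]}],
    },
}
print(first_value(sample))
```

A `None` result means the series is absent rather than zero, which is worth treating differently in an alert: after a scale-to-zero there may be no replica left to report the metric at all.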
What's next — the AI factory
The closer looked ahead to the hardware end of the spectrum: the NVIDIA Enterprise AI Factory (16 racks, 18 nodes/rack, 72 Blackwell GPUs + 36 Grace CPUs per rack, water-cooled, unified memory, 6–10× H200 performance per GPU) and NVIDIA Run:ai (deploy any inference workload from a UI, with fractional GPU, distributed inference, and monitoring). The throughline back to the keynotes: this is the "AI factory" framing (standardized, repeatable inference infrastructure) at physical scale.
FAQ
vLLM, SGLang, or Triton — which runtime?
vLLM for general high-throughput LLM serving (broad model support, PagedAttention). SGLang when you have heavy prefix reuse — multi-turn chat, RAG, agents — where RadixAttention's prefix-sharing wins. Triton when serving many model types (CV/NLP/audio) under one orchestration layer.
When is disaggregated serving worth the complexity?
At scale, where GPU utilization matters most. Prefill and decode have opposite parallelism needs; giving each its own pool (KV cache passed between) lets you scale them independently and keep GPUs busy. For small deployments, aggregated is simpler and fine.
How do I cut GPU cost on idle models?
Two levers: sleep mode (free most VRAM on an idle engine, wake on demand) and KEDA scale-to-zero (drop to zero replicas when no traffic). Mind the guard metric so you don't scale to zero with requests in flight.
What should I benchmark?
What users feel: TTFT (responsiveness), inter-token latency (streaming smoothness), throughput (tokens/sec and req/sec), and error rate under realistic concurrency. Drive it with a tool like AIPerf against the live endpoint, not a synthetic micro-benchmark.
Takeaways
- Inference ≠ serving. The engine generates tokens fast; the serving layer makes it reliable, scalable, and cost-efficient for many users.
- Pick the runtime by workload: vLLM (general), SGLang (prefix-reuse/agents), Triton (multi-framework).
- Disaggregate at scale — separate prefill and decode GPUs for far better utilization.
- Deploy the vLLM Production Stack via Helm; use sleep mode and KEDA scale-to-zero to cut idle GPU cost (with a guard metric).
- Benchmark what users feel (TTFT, ITL, throughput, errors) with AIPerf against the live endpoint.
- Observe the whole fleet in Prometheus/Grafana — and watch utilization correctly fall to zero.
Next in the series — Deep dive 14: The Lean Observability Stack, opening the observability/edge/storage block.
References
- KubeCon Mumbai 2026 — Day 1 index · the rest of the series
- vLLM Production Stack · SGLang · the runtimes & stack
- NVIDIA Dynamo · KEDA · disaggregated serving & scale-to-zero
- LMCache · the KV-cache layer powering the production stack