Plug In and Scale — Serving LLM Models on Kubernetes — KubeCon Mumbai 2026 Deep Dive 13

Deep dive 13 of the KubeCon Mumbai 2026 series. Shrinidhi Venkataraman and Nithin Rajan (AstraZeneca) gave the practitioner's map of LLM serving on Kubernetes: the difference between inference (the engine) and serving (the production wrapper), how the runtimes (vLLM, SGLang, Triton) differ, the aggregated-vs-disaggregated serving split, deploying the vLLM Production Stack via Helm, GPU-saving tricks (sleep mode, scale-to-zero with KEDA), benchmarking with NVIDIA AIPerf (70K tokens/sec, 0% errors), and full Prometheus/Grafana observability of the fleet.

This is the GPU-positive counterpart to deep dive 12: where that talk avoided GPUs entirely, this one is about serving models that genuinely need them — efficiently. Together they bracket the GPU-economics question the whole conference kept circling.

Inference vs serving — two different jobs

The distinction that organizes everything.

Aspect	Inference	Serving
Role	the engine: raw model computation	the infrastructure: the production wrapper around the engine
Goal	generate output tokens	manage requests, scaling, efficiency
Metric	speed (tokens/sec)	throughput, reliability, cost per token
Focus	fast execution of a single request	high volume across many users, reliably

You can have a blazing-fast engine and still fall over in production without the serving layer.

The inference runtimes compared

Runtime	Focus & innovation	Best for
vLLM	high-throughput text completion, broad model support; PagedAttention, VRAM-fragmentation management; standard prompt-in/token-out APIs	bulk processing, high-traffic generic APIs, diverse model needs
SGLang	complex workflows, agents, structured generation; RadixAttention, chunked-prefill prefix-sharing across requests; a language for controlling agent execution & JSON output	multi-turn chat, RAG, AI agents with repeated context
Triton	multi-model, multi-framework enterprise orchestration; dynamic batching; custom C++/Python backends	varied AI workloads (CV, NLP, audio) in one unified infrastructure

How to choose. Default to vLLM for general LLM serving (it's the broad-support workhorse). Reach for SGLang when you have heavy prefix reuse — multi-turn chat or RAG where RadixAttention's prefix-sharing pays off. Use Triton when you're serving many kinds of models (vision, audio, NLP) and want one orchestration layer. The runtimes aren't competitors so much as fits for different workload shapes.

What is an inference stack?

A runtime alone isn't enough — you need the full environment to serve a model to real users at scale, connecting requests to the engine and keeping the infrastructure from buckling under heavy traffic. Three stacks the talk surveyed:

Baseten — a managed, hosted serving platform for fast deployment.
vLLM Production Stack — open-source, cloud-native: router, autoscaling, KV-cache & LoRA management.
NVIDIA Dynamo — disaggregated serving across vLLM, TensorRT-LLM, and SGLang.

Aggregated vs disaggregated serving

An LLM request has two phases: prefill (process the whole prompt in one pass; compute-bound, and favours low tensor parallelism) and decode (generate tokens one at a time; memory-bandwidth-bound, and favours high tensor parallelism). Where you run them is a key architectural choice:

Fig 1 — disaggregation gives each phase its own GPUs (KV cache passed between), trading complexity for utilization.

Aggregated: prefill and decode share the same GPUs — one pool does both, simplest to run, but the two phases compete for the same hardware. Disaggregated: dedicated GPUs per phase with the KV cache passed between them — more moving parts, but far better GPU utilization at scale, because you can scale prefill and decode independently to match their very different parallelism profiles.
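To make the independent-scaling argument concrete, here is a minimal capacity-planning sketch. The function name `pool_sizes`, the traffic shapes, and the per-GPU token rates are all illustrative assumptions rather than figures from the talk; the point is only that the two pools come out with very different sizes depending on the prompt-to-output ratio.

```python
import math

def pool_sizes(req_per_sec, prompt_tokens, output_tokens,
               prefill_tok_per_gpu, decode_tok_per_gpu):
    """Return (prefill_gpus, decode_gpus) for a disaggregated deployment.

    Every argument is a hypothetical planning input; measure the per-GPU
    token rates on your own hardware before trusting the result.
    """
    prefill_load = req_per_sec * prompt_tokens   # prompt tokens/sec to process
    decode_load = req_per_sec * output_tokens    # output tokens/sec to generate
    prefill_gpus = math.ceil(prefill_load / prefill_tok_per_gpu)
    decode_gpus = math.ceil(decode_load / decode_tok_per_gpu)
    return prefill_gpus, decode_gpus

# RAG-style traffic: long prompts, short answers (made-up numbers).
rag = pool_sizes(50, 4000, 200, prefill_tok_per_gpu=40_000, decode_tok_per_gpu=5_000)
# Chat-style traffic: short prompts, long answers (made-up numbers).
chat = pool_sizes(50, 300, 800, prefill_tok_per_gpu=40_000, decode_tok_per_gpu=5_000)
print(rag, chat)  # (5, 2) (1, 8)
```

With an aggregated pool you would have to size every GPU for the worse of the two phases; with separate pools the same request rate needs a prefill-heavy fleet for one workload and a decode-heavy fleet for the other.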

The vLLM Production Stack

The open-source stack (powered by LMCache Lab + vLLM) turns "three confusing vLLM pods" into a managed system: a cloud-native router over multiple vLLM replicas (each with KV-cache and a LoRA loader), with autoscaling, Grafana monitoring, shared KV-cache storage, and a LoRA manager. Deploying it is a Helm chart:

# Helm: vllm-project.github.io/production-stack · chart vllm-stack 0.1.11
servingEngineSpec:
  modelSpec:
    - name: "opt125m"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "facebook/opt-125m"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
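
Once the chart is installed, the router fronts the replicas with vLLM's OpenAI-compatible API. The sketch below builds a completion request using only the Python standard library. It assumes the router is reachable at `http://localhost:30080` (for example via a port-forward) and that the model is served under the name `facebook/opt-125m`; both are assumptions about your deployment, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30080"  # assumption: router port-forwarded locally

def build_completion_request(prompt, model="facebook/opt-125m", max_tokens=32):
    """Build (without sending) a POST to the OpenAI-compatible completions route."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt):
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(build_completion_request(prompt), timeout=30) as resp:
        return json.load(resp)["choices"][0]["text"]

req = build_completion_request("Kubernetes is")
print(req.get_method(), req.full_url)
```

Calling `complete("Kubernetes is")` needs a live endpoint; building the request separately keeps the payload easy to inspect and test offline.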

Sleep and wakeup mode — reclaim GPU memory

A neat GPU-saver: enable sleep mode and you can put an idle engine to sleep, freeing most of its VRAM, then wake it on demand.

vllmConfig:
  extraArgs: ["--enable-sleep-mode"]
  env:
    - name: VLLM_SERVER_DEV_MODE
      value: "1"
# Put the engine to sleep:
curl -X POST "http://localhost:30080/sleep?id=..." | jq
# → "Sleep mode freed 39.26 GiB memory, 1.20 GiB still in use.
#    It took 5.75 seconds to fall asleep."
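
To put a number on "most of its VRAM", the sketch below parses a message of the shape shown above and reports the share of memory released. The regular expression assumes that exact wording, which may differ between vLLM versions, and the function name `freed_fraction` is hypothetical.

```python
import re

def freed_fraction(message):
    """Parse 'freed X GiB memory, Y GiB still in use' and return X / (X + Y)."""
    match = re.search(r"freed ([\d.]+) GiB memory, ([\d.]+) GiB still in use", message)
    if match is None:
        raise ValueError("unrecognized sleep-mode message")
    freed, in_use = float(match.group(1)), float(match.group(2))
    return freed / (freed + in_use)

msg = "Sleep mode freed 39.26 GiB memory, 1.20 GiB still in use."
print(f"{freed_fraction(msg):.1%}")  # 97.0%
```

For the run quoted above, roughly 97% of the engine's GPU memory was handed back while it slept.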

NVIDIA Dynamo — one topology, any engine

Dynamo's pitch: the same disaggregated topology runs on whichever engine fits the workload — swap the runtime, keep the architecture:

Engine	When
vLLM (`dynamo.vllm`)	default — broad model support
TensorRT-LLM (`dynamo.trtllm`)	peak throughput on NVIDIA GPUs
SGLang (`dynamo.sglang`)	agents & structured generation; NIXL transfer

Each is configured with a --disaggregation-mode of prefill or decode — the topology is fixed, the engine is a swappable detail.

Scale to zero with KEDA

The biggest GPU cost-saver: drop idle deployments to zero replicas and wake them on the first request.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
spec:
  minReplicaCount: 0      # scale-to-zero
  maxReplicaCount: 5
  pollingInterval: 10     # seconds
  cooldownPeriod: 360     # 6 min

Scale up on queue depth — e.g. dynamo_frontend_queued_requests > 5.
Hold ≥ 1 replica while work is pending — a guard metric prevents scaling to zero mid-flight (so you don't drop in-flight requests).
Wake from zero on new traffic — e.g. an ingress keepalive metric fires on the first hit.

The scale-to-zero gotcha. Naively scaling GPU deployments to zero will drop requests that are still being processed. The fix is the guard metric: hold at least one replica as long as work is pending, and only collapse to zero after a cooldown with the queue truly empty. Get this wrong and "cost savings" becomes "dropped user requests." Note this is the same KEDA primitive the shared-first platforms talk used — here applied to expensive GPU pods where idle cost really bites.
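
The three rules and the guard can be read as one small decision function. The sketch below is a toy model of that decision, not KEDA's actual algorithm: the function name, its parameters, and the way in-flight work is counted are assumptions made for illustration, with defaults borrowed from the manifest above.

```python
import math

def desired_replicas(queued, in_flight, idle_seconds, current,
                     queue_target=5, cooldown=360, max_replicas=5):
    """Toy scale-to-zero decision with a guard for pending work.

    queued:       requests waiting in the frontend queue
    in_flight:    requests currently being processed (the guard metric)
    idle_seconds: time since the last request was seen
    current:      replicas running right now
    """
    if queued + in_flight > 0:
        # Work is pending: never below 1, scale up with queue depth.
        wanted = math.ceil(queued / queue_target)
        return max(1, min(max_replicas, wanted))
    if current == 0:
        return 0  # already asleep and nothing to do
    # Queue is empty: collapse to zero only once the cooldown has elapsed.
    return 0 if idle_seconds >= cooldown else 1

print(desired_replicas(queued=0, in_flight=3, idle_seconds=400, current=1))  # 1
print(desired_replicas(queued=0, in_flight=0, idle_seconds=400, current=1))  # 0
```

The first call is the gotcha in miniature: the queue is empty and the cooldown has passed, yet three requests are still being processed, so the guard keeps one replica alive.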

Benchmarking with NVIDIA AIPerf

You can't tune what you don't measure. NVIDIA AIPerf drives load against the live endpoint and reports what users actually feel. A 100-concurrency, 10,000-request run:

Metric	Result
Output tokens/sec	70,516
Requests/sec	661
Time to first token (avg)	12.03 ms
Errors (over 60,230 reqs)	0.0%

The key serving metrics to watch: TTFT (time to first token — responsiveness), ITL (inter-token latency — streaming smoothness), throughput (tokens/sec, req/sec), and error rate under concurrency.
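
For readers who want the definitions spelled out, here is a minimal sketch of how TTFT, ITL, and per-request throughput fall out of token arrival timestamps. It is not how AIPerf is implemented; the function name `stream_metrics` and the sample timestamps are assumptions chosen for illustration.

```python
from statistics import mean

def stream_metrics(request_start, token_times):
    """Compute TTFT, mean ITL and tokens/sec for one streamed response.

    request_start: time the request was sent (seconds)
    token_times:   arrival time of each output token (seconds, ascending)
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens; a single-token reply has none.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    total = token_times[-1] - request_start
    return {"ttft": ttft, "itl": itl, "tokens_per_sec": len(token_times) / total}

# Made-up timings: first token after 12 ms, then one token every 10 ms.
m = stream_metrics(0.0, [0.012, 0.022, 0.032, 0.042])
print(m)
```

The same arithmetic works as a sanity check on the table: 70,516 output tokens/sec divided by 661 requests/sec implies roughly 107 output tokens per request.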

Observability — and watching it scale to zero

Every replica reports to Prometheus; Grafana shows the whole fleet in real time — frontend requests/sec, average TTFT, inter-token latency, request duration, input/output sequence length, and DCGM GPU utilization. The satisfying part: as the deployment drains, requests, latency, and GPU utilization all fall away together and scale to zero — exactly the behaviour you want, made visible.
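
The same signals can be pulled programmatically from the Prometheus HTTP API. The sketch below only builds the query URLs. It assumes Prometheus is reachable at `http://localhost:9090` and that the cluster exposes the metric names used by vLLM and the DCGM exporter; a Dynamo frontend publishes its own names, so treat the queries as placeholders to adapt.

```python
from urllib.parse import urlencode

PROMETHEUS = "http://localhost:9090"  # assumption: Prometheus port-forwarded locally

# Assumed metric names (vLLM and dcgm-exporter); adjust to what your fleet exports.
QUERIES = {
    "p95_ttft_seconds": (
        "histogram_quantile(0.95, sum by (le) "
        "(rate(vllm:time_to_first_token_seconds_bucket[5m])))"
    ),
    "waiting_requests": "sum(vllm:num_requests_waiting)",
    "gpu_utilization": "avg(DCGM_FI_DEV_GPU_UTIL)",
}

def query_url(name):
    """Return the instant-query URL for one of the named PromQL expressions."""
    return f"{PROMETHEUS}/api/v1/query?" + urlencode({"query": QUERIES[name]})

for name in QUERIES:
    print(name, query_url(name))
```

During a drain, the waiting-requests and GPU-utilization queries should both trend to zero before the replica count does; if utilization drops while requests are still waiting, the scale-down is firing too early.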

Why this closes the loop with deep dive 11. The agent-observability talk argued token usage and traces are the new golden signals; here you see the serving-layer version: TTFT, ITL, and GPU utilization per replica. Serving observability is what proves your scale-to-zero and sleep-mode tricks are actually working rather than silently dropping traffic.

What's next — the AI factory

The closer looked ahead to the hardware end of the spectrum: the NVIDIA Enterprise AI Factory (16 racks, 18 nodes/rack, 72 Blackwell GPUs + 36 Grace CPUs per rack, water-cooled, unified memory, 6–10× H200 performance per GPU) and NVIDIA Run:ai (deploy any inference workload from a UI, with fractional GPU, distributed inference, and monitoring). The throughline back to the keynotes: this is the "AI factory" framing (standardized, repeatable inference infrastructure) at physical scale.

FAQ

vLLM, SGLang, or Triton — which runtime?

vLLM for general high-throughput LLM serving (broad model support, PagedAttention). SGLang when you have heavy prefix reuse — multi-turn chat, RAG, agents — where RadixAttention's prefix-sharing wins. Triton when serving many model types (CV/NLP/audio) under one orchestration layer.

When is disaggregated serving worth the complexity?

At scale, where GPU utilization matters most. Prefill and decode have opposite parallelism needs; giving each its own pool (KV cache passed between) lets you scale them independently and keep GPUs busy. For small deployments, aggregated is simpler and fine.

How do I cut GPU cost on idle models?

Two levers: sleep mode (free most VRAM on an idle engine, wake on demand) and KEDA scale-to-zero (drop to zero replicas when no traffic). Mind the guard metric so you don't scale to zero with requests in flight.

What should I benchmark?

What users feel: TTFT (responsiveness), inter-token latency (streaming smoothness), throughput (tokens/sec and req/sec), and error rate under realistic concurrency. Drive it with a tool like AIPerf against the live endpoint, not a synthetic micro-benchmark.

Takeaways

Inference ≠ serving. The engine generates tokens fast; the serving layer makes it reliable, scalable, and cost-efficient for many users.
Pick the runtime by workload: vLLM (general), SGLang (prefix-reuse/agents), Triton (multi-framework).
Disaggregate at scale — separate prefill and decode GPUs for far better utilization.
Deploy the vLLM Production Stack via Helm; use sleep mode and KEDA scale-to-zero to cut idle GPU cost (with a guard metric).
Benchmark what users feel (TTFT, ITL, throughput, errors) with AIPerf against the live endpoint.
Observe the whole fleet in Prometheus/Grafana — and watch utilization correctly fall to zero.

Next in the series — Deep dive 14: The Lean Observability Stack, opening the observability/edge/storage block.

Plug in and scale — serving LLM models on Kubernetes, made simple.