KubeCon India 2026 (Mumbai) — Day 1 Deep Dives

12 · Zero-GPU Autopilot — Kagent & Kgateway for Self-Healing Clusters

Deep dive 12 of 17 · AI, agents, GPU & serving

Jun 18, 2026 · conferences · 20 min read · 4600 words · advanced

Zero-GPU autopilot — an autonomous SRE that runs entirely on CPU.


Deep dive 12 of the KubeCon Mumbai 2026 series. Ashok M (DigitalOcean) and Dillibabu Sampath (Wells Fargo) built a self-healing cluster autopilot with a provocative constraint: zero dedicated GPUs and zero external AI APIs. A CPU-only local LLM (vLLM with Intel backends) powers Kagent — the "reasoning brain" that diagnoses cluster problems — and Kgateway — the Envoy-based Gateway API "guardrail" that actuates and bounds the fix. Everything runs inside the VPC: not a single log, metric, or prompt ever leaves. It's autonomous SRE for regulated, air-gapped, cost-sensitive environments.

Among the series' AI deep dives, this is the cost-and-sovereignty talk, and it echoes the morning keynotes' "sovereign AI" theme. Where DD10 built agents and DD11 observed them, this one answers "how do we run agentic ops cheaply and privately?" The answer is a closed loop with a clean separation between proposing a fix and enforcing its limits.

The problem — six pressures at once

The motivation was a stack of converging pains:

  • MTTR crisis — incidents take too long to resolve.
  • Traditional failures — manual, panic-driven response.
  • GPU cost — running LLM-based ops on accelerators is expensive.
  • AI-SRE complexity — heavyweight agents are overkill for simple, repetitive tasks.
  • Tuning AI to company best practices — generic models don't know your SOPs.
  • Data privacy & sovereignty — logs and prompts can't leave regulated boundaries.

The approach — private, deterministic, GPU-free

Principle | What it means
100% private operations | Not a single log, metric, or prompt ever leaves the secure VPC boundary.
Zero external AI APIs | Completely decoupled from costly third-party closed-source LLMs.
Zero dedicated GPUs | Optimized, CPU-only local model serving.
Deterministic policies | AI actions governed by declarative Kubernetes Gateway API rules.
Local agent + gateway | Kagent solves problems inside the cluster; Kgateway handles network traffic.

The architecture — a closed-loop self-healer

[Figure: Perceive → Plan → Actuate → Reroute, all on CPU inside the VPC. Boxes: Prometheus + K8s Metrics API; Kagent, the reasoning brain (vLLM on CPU, 7B/8B, MCP, plan generation); Sovereign RAG (SOPs, post-mortems) + vector DB; Kgateway (Envoy, Gateway API; policy enforcement, traffic management); managed workloads App V1 and App V2 (reroute/canary).]

Fig 1 — Kagent perceives metrics, retrieves local SOPs, plans a fix, and actuates it as a Gateway API change that Kgateway enforces.

The loop: Prometheus + K8s Metrics API feed health data → Kagent perceives → consults the Sovereign RAG (SOPs, compliance docs, post-mortems in a vector DB) → generates a plan on a CPU-served 7B/8B model → actuates a Gateway API change (e.g. patch an HTTPRoute or RateLimitPolicy) → Kgateway enforces it and reroutes/isolates/canaries/rolls back traffic across managed workloads. The whole control plane is "powered entirely by CPUs," inside a private cluster / air-gapped VPC.
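To make the loop concrete, here is a minimal sketch of one pass through it. Everything in it is an assumption for illustration: the `Alert` shape, the keyword-based `SOPS` lookup (standing in for the vector-DB RAG), the rule-based `plan` function (standing in for the CPU-served model), and the action names. None of it is Kagent's or Kgateway's actual API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    metric: str
    value: float
    threshold: float


# Hypothetical stand-in for the Sovereign RAG: a keyword lookup, not a vector DB.
SOPS = {
    "error_rate": "shift traffic to the last known-good version",
    "latency_p99": "apply a stricter rate limit",
}


def perceive(samples: list[Alert]) -> list[Alert]:
    """Keep only the samples that breach their threshold."""
    return [a for a in samples if a.value > a.threshold]


def retrieve_sop(alert: Alert) -> str:
    """Look up the local runbook entry for this metric."""
    return SOPS.get(alert.metric, "escalate to a human")


def plan(alert: Alert, sop: str) -> dict:
    """Stand-in for the LLM call: emit a compact, structured action."""
    if "shift traffic" in sop:
        return {"action": "patch_httproute", "service": alert.service,
                "weights": {"v1": 100, "v2": 0}}
    if "rate limit" in sop:
        return {"action": "patch_rate_limit", "service": alert.service}
    return {"action": "escalate", "service": alert.service}


def actuate(action: dict, applied: list[dict]) -> None:
    """Stand-in for submitting a Gateway API change; here it is only recorded."""
    applied.append(action)


def run_once(samples: list[Alert], applied: list[dict]) -> list[dict]:
    """One perceive -> retrieve -> plan -> actuate pass over the samples."""
    for alert in perceive(samples):
        actuate(plan(alert, retrieve_sop(alert)), applied)
    return applied
```

The point of the shape is that each stage hands the next a small, typed value, so the model only ever has to fill in the `plan` step.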

Why vLLM — making CPU inference viable

The linchpin of "zero GPU" is making CPU inference fast enough. Two reasons vLLM:

  • vLLM + Intel backends: direct integration with OpenVINO and IPEX gives native CPU execution, reported as up to 10× faster.
  • PagedAttention: solves KV-cache fragmentation (which otherwise wastes 60–80% of memory) by splitting the cache into fixed-size "pages" in RAM, much as an OS manages virtual memory. That enables high-concurrency agent sessions on CPU and prevents OOM during parallel alerts.

Why PagedAttention is the unlock. The KV cache (the model's working memory for a conversation) normally fragments badly, so you can't pack many concurrent sessions into RAM. PagedAttention manages it like virtual memory, with non-contiguous pages allocated on demand, so a single CPU box can hold many simultaneous agent sessions without running out of memory. During an incident storm (many alerts at once), that's the difference between the autopilot staying up and OOM-killing itself.
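The paging idea can be shown with a toy allocator. This is a sketch of the general technique only, not vLLM's implementation: the class name, the 16-token page size, and the session-keyed block table are all illustrative assumptions.

```python
class PagedKVCache:
    """Toy block-table allocator: sessions receive fixed-size pages on demand."""

    def __init__(self, total_pages: int, page_tokens: int = 16):
        self.page_tokens = page_tokens
        self.free_pages = list(range(total_pages))
        self.block_tables: dict[str, list[int]] = {}  # session -> page ids
        self.lengths: dict[str, int] = {}             # session -> tokens stored

    def append_tokens(self, session: str, n: int) -> bool:
        """Grow a session by n tokens; return False if no pages are left."""
        table = self.block_tables.get(session, [])
        length = self.lengths.get(session, 0)
        pages_required = -(-(length + n) // self.page_tokens)  # ceiling division
        needed = pages_required - len(table)
        if needed > len(self.free_pages):
            return False
        # Pages need not be adjacent; the block table records where they are.
        new_pages = [self.free_pages.pop() for _ in range(needed)]
        self.block_tables[session] = table + new_pages
        self.lengths[session] = length + n
        return True

    def release(self, session: str) -> None:
        """Return a finished session's pages to the free list."""
        self.free_pages.extend(self.block_tables.pop(session, []))
        self.lengths.pop(session, None)


def max_sessions_contiguous(total_tokens: int, max_len: int) -> int:
    """Sessions that fit if each reserves one contiguous max-length slab."""
    return total_tokens // max_len
```

With 64 pages of 16 tokens (1,024 tokens in total), reserving a contiguous 512-token maximum per session fits 2 sessions, while paging sessions that actually hold 40 tokens each fits 21, because each one occupies only 3 pages.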

Kagent — the sovereign reasoning brain

Kagent (a CNCF Sandbox framework) is a Kubernetes-native reasoning engine built to understand cluster state; it is explicitly "not a generic chatbot." Three design choices make it work on small CPU models:

Mechanism | What it does
MCP (Model Context Protocol) | A standardized bridge between the LLM and the Kubernetes API; replaces risky raw shell scripts with safe, structured API tool execution.
Context compaction | Filters cluster-event "noise" into compact datasets, so massive telemetry fits the small context windows of 7B/8B CPU models.
Bounded agentic flow | Multi-step deterministic plan generation as a structured pipeline: inspect metrics → match a local SOP → patch the route.

The key to running on a 7B model. Small CPU-served models have tiny context windows and weak free-form reasoning. Kagent compensates by narrowing the job: MCP gives it safe, typed tools (not a shell); context compaction shrinks the input; and a bounded flow constrains it to a deterministic inspect→match→patch pipeline. It's not asking the model to be brilliant; it's asking it to make small, structured decisions inside strong guardrails. That's how you avoid needing a frontier model on a GPU.
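Context compaction is the easiest of the three to picture in code. The sketch below is one plausible way to do it, not Kagent's actual algorithm: it drops routine events and collapses repeats into counted one-liners. The simplified event dicts (`type`, `reason`, `object`) and the `max_lines` budget are assumptions made for the example.

```python
from collections import Counter


def compact_events(events: list[dict], max_lines: int = 5) -> list[str]:
    """Collapse repeated non-routine events into counted one-liners."""
    counts = Counter(
        (e["reason"], e["object"])
        for e in events
        if e["type"] != "Normal"  # drop routine noise before counting
    )
    # Most frequent first, capped so the summary fits a small prompt budget.
    return [
        f"{n}x {reason} on {obj}"
        for (reason, obj), n in counts.most_common(max_lines)
    ]
```

Six raw events (three BackOff warnings and one Unhealthy warning on the same pod, plus two routine image pulls) shrink to two lines, which is the kind of input a small-context model can actually reason over.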

Kgateway — the policy guardrail

Kgateway is an Envoy-based Gateway API implementation (from Solo.io) that handles the actuation and bounds it:

  • Declarative actuation — operates strictly through standard resources like HTTPRoute / TCPRoute, replacing unsafe network "hacks" with structured, auditable, GitOps-friendly configs.
  • The "safety sandbox" — hard, human-defined policy boundaries (e.g. global rate-limit policies) that instantly reject any agentic action violating human-configured limits.
  • Network-layer mitigation — Envoy-native traffic shifting and canarying running entirely on host CPUs, resolving cascading errors instantly at L7 with zero costly pod restarts.
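As an illustration of declarative actuation, the sketch below builds the body of a patch that shifts traffic weights on an HTTPRoute. The `spec.rules[].backendRefs[]` fields (`name`, `port`, `weight`) come from the Gateway API; the function name, the backend names, and the port are hypothetical, and the sketch assumes the route has a single rule.

```python
def weighted_route_patch(backends: dict[str, int], port: int = 8080) -> dict:
    """Build a merge-patch body that sets backend weights on a one-rule HTTPRoute."""
    if not backends:
        raise ValueError("at least one backend is required")
    if any(weight < 0 for weight in backends.values()):
        raise ValueError("weights must be non-negative")
    return {
        "spec": {
            "rules": [
                {
                    "backendRefs": [
                        {"name": name, "port": port, "weight": weight}
                        for name, weight in backends.items()
                    ]
                }
            ]
        }
    }
```

For example, `weighted_route_patch({"app-v1": 100, "app-v2": 0})` drains the failing version without touching its pods. One caveat: a JSON merge patch replaces the whole `rules` list, so a real change would need to carry over any matches and filters already on the route.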

The closed loop — intention vs enforcement

The elegant heart of the design is a clean separation of concerns between the two components:

Aspect | Kagent (intention) | Kgateway (enforcement)
Role | proposes logical mitigation steps | evaluates & enforces hard boundaries
Example | "isolate failing v2 pods" | "deny if weight < 50%"

The "sovereign way" replaces the old panic-driven kubectl delete pod with a disciplined pipeline: Prometheus alert → Kagent RAG triage → Kgateway HTTPRoute patch. And the zero-GPU synergy is the punchline: Kagent outputs ultra-compact structured JSON tool calls (no expensive, lengthy creative text generation), and Kgateway handles high-concurrency traffic shifting natively via high-performance C++ (Envoy) on CPU. Neither component needs a GPU because neither does the thing GPUs are for.

Why separating intention from enforcement matters. An autonomous agent you can't bound is a liability — it might "fix" an incident by doing something catastrophic. Here the LLM only proposes; a deterministic, human-authored Gateway API policy layer decides whether to allow it. The agent can't exceed limits a human set (e.g. never route <50% to a version), so you get autonomy with a hard safety floor. This is the same "make the wrong thing impossible" guardrail philosophy as the shared-first platforms and Kyverno talks — applied to AI ops.
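The enforcement half is small enough to sketch. The check below is one reading of the "deny if weight < 50%" example: a human-set floor on the traffic share of a protected backend. In the design described here this decision lives in Kgateway's policy layer, not in Python; `WeightFloor`, `evaluate`, and the backend names are hypothetical and only illustrate the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightFloor:
    """Human-authored limit: a backend must keep at least min_percent of traffic."""
    backend: str
    min_percent: int


def evaluate(proposal: dict[str, int], policy: WeightFloor) -> tuple[bool, str]:
    """Decide deterministically whether an agent-proposed weight split is allowed."""
    total = sum(proposal.values())
    if total <= 0:
        return False, "deny: no traffic would be routed"
    share = 100 * proposal.get(policy.backend, 0) / total
    if share < policy.min_percent:
        return False, (
            f"deny: {policy.backend} would get {share:.0f}% "
            f"(< {policy.min_percent}%)"
        )
    return True, "allow"
```

The agent never calls anything that can bypass this step: whatever the model proposes, the same input always produces the same verdict, and the limit itself is changed only by a human editing the policy.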

FAQ

Can a 7B CPU model really run SRE autonomously?

For bounded, structured decisions — yes. The trick isn't a smarter model; it's narrowing the task with MCP tools, context compaction, and a deterministic inspect→match-SOP→patch flow. The model emits compact JSON tool calls, not essays, so a CPU-served 7B/8B is enough.

What stops the agent from doing something dangerous?

Kgateway's hard, human-defined policy boundaries. The agent only proposes; Gateway API policies (rate limits, traffic-weight rules) evaluate and can instantly reject any action that violates a human-configured limit. Autonomy with a safety floor.

Why fix incidents at the network layer instead of restarting pods?

Speed and blast radius. Envoy-native traffic shifting/canarying at L7 resolves cascading errors instantly with zero pod restarts — you reroute away from the failing version rather than waiting on slow, disruptive restarts. It's also auditable (declarative HTTPRoute changes).

Who is this for?

Regulated, air-gapped, or cost-sensitive shops (the speakers are from DigitalOcean and Wells Fargo). If logs/prompts can't leave your VPC and a GPU fleet for ops is unjustifiable, a CPU-only sovereign autopilot is the fit.

Takeaways

  • Autonomous SRE doesn't require GPUs or external APIs. A CPU-served small model in a closed loop can self-heal a cluster.
  • vLLM + Intel + PagedAttention make CPU inference fast and concurrent enough — no OOM during alert storms.
  • Kagent narrows the job with MCP tools, context compaction, and a bounded inspect→match-SOP→patch flow so a 7B model suffices.
  • Kgateway is the guardrail — declarative Gateway API actuation with hard, human-defined limits that reject unsafe agentic actions.
  • Separate intention from enforcement. The LLM proposes; deterministic policy decides. Autonomy with a safety floor.
  • 100% private & sovereign — no log, metric, or prompt leaves the VPC; the RAG knows your SOPs.

Next in the series — Deep dive 13: Serving and Scaling, the other side of GPU economics — how to serve models that genuinely need accelerators.

© cvam — written in plaintext, served warm