Deep dive 12 of the KubeCon Mumbai 2026 series. Ashok M (DigitalOcean) and Dillibabu Sampath (Wells Fargo) built a self-healing cluster autopilot with a provocative constraint: zero dedicated GPUs and zero external AI APIs. A CPU-only local LLM (vLLM with Intel backends) powers Kagent, the "reasoning brain" that diagnoses cluster problems, while Kgateway, the Envoy-based Gateway API "guardrail", actuates and bounds the fix. Everything runs inside the VPC: not a single log, metric, or prompt ever leaves. It's autonomous SRE for regulated, air-gapped, cost-sensitive environments.
This is the AI cluster's cost-and-sovereignty talk, and it rhymes hard with the morning keynotes' "sovereign AI" theme. Where DD10 built agents and DD11 observed them, this one answers "how do we run agentic ops cheaply and privately?" — and the answer is a closed loop with a clean separation between proposing a fix and enforcing its limits.
The problem — six pressures at once
The motivation was a stack of converging pains:
- MTTR crisis — incidents take too long to resolve.
- Traditional failures — manual, panic-driven response.
- GPU cost — running LLM-based ops on accelerators is expensive.
- AI-SRE complexity — heavyweight agents are overkill for simple, repetitive tasks.
- Tuning AI to company best practices — generic models don't know your SOPs.
- Data privacy & sovereignty — logs and prompts can't leave regulated boundaries.
The approach — private, deterministic, GPU-free
| Principle | What it means |
|---|---|
| 100% private operations | not a single log, metric, or prompt ever leaves the secure VPC boundary. |
| Zero external AI APIs | completely decoupled from costly third-party closed-source LLMs. |
| Zero dedicated GPUs | optimized, CPU-only local model serving. |
| Deterministic policies | AI actions governed by declarative Kubernetes Gateway API rules. |
| Local agent + gateway | Kagent solves problems inside the cluster; Kgateway handles network traffic. |
The architecture — a closed-loop self-healer
Fig 1 — Kagent perceives metrics, retrieves local SOPs, plans a fix, and actuates it as a Gateway API change that Kgateway enforces.
The loop: Prometheus + K8s Metrics API feed health data → Kagent perceives → consults the Sovereign RAG (SOPs, compliance docs, post-mortems in a vector DB) → generates a plan on a CPU-served 7B/8B model → actuates a Gateway API change (e.g. patch an HTTPRoute or RateLimitPolicy) → Kgateway enforces it and reroutes/isolates/canaries/rolls back traffic across managed workloads. The whole control plane is "powered entirely by CPUs," inside a private cluster / air-gapped VPC.
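As a rough illustration of the loop above, here is a minimal Python sketch of one perceive, retrieve, plan, actuate pass. Every name in it (`Alert`, `Plan`, `run_once`, the `demo_*` stubs, the `SOP-7` text) is hypothetical and stands in for the real Prometheus, RAG, model, and gateway components; it is not Kagent's or Kgateway's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Alert:
    service: str
    error_rate: float  # fraction of failed requests, 0.0-1.0


@dataclass
class Plan:
    action: str  # e.g. "shift_traffic"
    target: str  # resource the gateway would patch
    reason: str  # the SOP that justified the action


def run_once(
    alert: Alert,
    retrieve_sop: Callable[[Alert], Optional[str]],
    plan_fix: Callable[[Alert, str], Plan],
    actuate: Callable[[Plan], bool],
) -> str:
    """One pass of perceive -> retrieve -> plan -> actuate."""
    sop = retrieve_sop(alert)  # local lookup; nothing leaves the cluster
    if sop is None:
        return "escalate: no matching SOP"
    plan = plan_fix(alert, sop)  # stands in for the CPU-served model
    accepted = actuate(plan)  # stands in for the gateway applying or rejecting
    return f"applied: {plan.action}" if accepted else f"rejected: {plan.action}"


# Hypothetical stubs standing in for the real components.
def demo_retrieve(alert: Alert) -> Optional[str]:
    sops = {"checkout": "SOP-7: shift traffic to stable on 5xx spike"}
    return sops.get(alert.service)


def demo_plan(alert: Alert, sop: str) -> Plan:
    return Plan(action="shift_traffic", target=f"httproute/{alert.service}", reason=sop)


def demo_actuate(plan: Plan) -> bool:
    return plan.action in {"shift_traffic", "rate_limit"}


known = run_once(Alert("checkout", 0.4), demo_retrieve, demo_plan, demo_actuate)
unknown = run_once(Alert("search", 0.4), demo_retrieve, demo_plan, demo_actuate)
print(known)    # applied: shift_traffic
print(unknown)  # escalate: no matching SOP
```

Passing the three stages in as callables keeps the loop itself trivial to test, and makes the "no SOP, no action" escalation path explicit rather than leaving it to the model.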
Why vLLM — making CPU inference viable
The linchpin of "zero GPU" is making CPU inference fast enough. Two reasons for choosing vLLM:
- vLLM + Intel backends: direct integration with OpenVINO and IPEX gives native CPU execution, cited as up to 10× faster.
- PagedAttention: solves KV-cache fragmentation (which otherwise wastes 60–80% of memory) by storing the cache in fixed-size "pages" of RAM, much as an OS manages virtual memory. That enables high-concurrency agent sessions on CPU and prevents OOM during parallel alerts.
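To make the paging idea concrete, the toy allocator below hands out fixed-size KV-cache blocks on demand and compares the reserved memory with a scheme that pre-reserves a contiguous maximum-length buffer per sequence. It is a sketch of the concept only, not vLLM's implementation; `BLOCK_SIZE`, the class name, and the sequence lengths are illustrative assumptions.

```python
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per block; illustrative value


class PagedKVCache:
    """Toy allocator: each sequence owns a block table (a list of block ids)."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * BLOCK_SIZE:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def reserved_tokens(self) -> int:
        return sum(len(t) for t in self.block_tables.values()) * BLOCK_SIZE


def contiguous_reserved(num_seqs: int, max_len: int) -> int:
    """Memory reserved when every sequence gets a max-length buffer up front."""
    return num_seqs * max_len


cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token("alert-a")
for _ in range(10):
    cache.append_token("alert-b")

paged = cache.reserved_tokens()                         # 3 blocks + 1 block
contiguous = contiguous_reserved(num_seqs=2, max_len=512)
print(paged, contiguous)  # 64 1024
```

With 50 tokens actually stored, the paged scheme reserves 64 token slots while the contiguous one reserves 1024. At most one partially filled block per sequence is wasted, which is why many short agent sessions can share the same RAM.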
Kagent — the sovereign reasoning brain
Kagent (a CNCF Sandbox framework) is a native Kubernetes reasoning engine built to understand cluster state — explicitly "not a generic chatbot." Three design choices make it work on small CPU models:
| Mechanism | What it does |
|---|---|
| MCP (Model Context Protocol) | a standardized bridge between the LLM and the Kubernetes API — replaces risky raw shell scripts with safe, structured API tool execution. |
| Context compaction | filters cluster-event "noise" into compact datasets, so massive telemetry fits the small context windows of 7B/8B CPU models. |
| Bounded agentic flow | multi-step deterministic plan generation — a structured pipeline: inspect metrics → match a local SOP → patch the route. |
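Context compaction can be pictured as grouping and counting. The sketch below collapses a stream of Kubernetes-style events into a few summary rows before they reach the model. The field names mirror common event fields (`type`, `kind`, `name`, `reason`), but the function, the sample data, and the `top_n` cutoff are assumptions for illustration, not Kagent's actual filter.

```python
from collections import Counter
from typing import Dict, List


def compact_events(events: List[Dict[str, str]], top_n: int = 5) -> List[Dict[str, object]]:
    """Drop Normal events and collapse repeated Warnings into counted groups."""
    counts = Counter(
        (e["kind"], e["name"], e["reason"])
        for e in events
        if e.get("type") == "Warning"
    )
    return [
        {"kind": kind, "name": name, "reason": reason, "count": count}
        for (kind, name, reason), count in counts.most_common(top_n)
    ]


# Hypothetical raw events: six rows of telemetry.
raw_events = (
    [{"type": "Warning", "kind": "Pod", "name": "checkout-v2-abc", "reason": "BackOff"}] * 3
    + [{"type": "Warning", "kind": "Pod", "name": "checkout-v2-abc", "reason": "Unhealthy"}]
    + [{"type": "Normal", "kind": "Pod", "name": "checkout-v1-xyz", "reason": "Pulled"}] * 2
)

compacted = compact_events(raw_events)
for row in compacted:
    print(row)
```

Six raw events become two rows (three `BackOff` and one `Unhealthy` for the same pod), which is the kind of reduction that lets large telemetry fit a small context window.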
Kgateway — the policy guardrail
Kgateway is an Envoy-based Gateway API implementation (from Solo.io) that handles the actuation and bounds it:
- Declarative actuation: operates strictly through standard resources like HTTPRoute/TCPRoute, replacing unsafe network "hacks" with structured, auditable, GitOps-friendly configs.
- The "safety sandbox": hard, human-defined policy boundaries (e.g. global rate-limit policies) that instantly reject any agentic action violating human-configured limits.
- Network-layer mitigation: Envoy-native traffic shifting and canarying running entirely on host CPUs, resolving cascading errors instantly at L7 with zero costly pod restarts.
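Declarative actuation comes down to producing a small, reviewable document rather than running commands. The sketch below builds a patch body that reweights an HTTPRoute's `backendRefs`, using the `name`, `port`, and `weight` fields defined by the Gateway API. The backend names, the port, and the helper itself are hypothetical, and the code only constructs the body; it does not talk to a cluster.

```python
import json
from typing import Dict


def build_weight_patch(weights: Dict[str, int], port: int = 8080) -> dict:
    """Build a patch body that sets traffic weights on one HTTPRoute rule.

    Assumption: the route has a single rule. A merge patch replaces the whole
    `rules` list, so a real patch would also have to carry that rule's
    matches and filters.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if sum(weights.values()) == 0:
        raise ValueError("at least one backend needs a non-zero weight")
    backend_refs = [
        {"name": name, "port": port, "weight": weight}
        for name, weight in weights.items()
    ]
    return {"spec": {"rules": [{"backendRefs": backend_refs}]}}


# Hypothetical mitigation: drain the failing v2 backend entirely.
patch = build_weight_patch({"checkout-v1": 100, "checkout-v2": 0})
print(json.dumps(patch, indent=2))
```

Because the result is plain data, it can be diffed, logged, and committed to Git before it is applied, which is what makes the change auditable.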
The closed loop — intention vs enforcement
The elegant heart of the design is a clean separation of concerns between the two components:
| | Kagent (intention) | Kgateway (enforcement) |
|---|---|---|
| Role | proposes logical mitigation steps | evaluates & enforces hard boundaries |
| Example | "isolate failing v2 pods" | "deny if weight < 50%" |
The "sovereign way" replaces the old panic-driven kubectl delete pod with a disciplined pipeline: Prometheus alert → Kagent RAG triage → Kgateway HTTPRoute patch. And the zero-GPU synergy is the punchline: Kagent outputs ultra-compact structured JSON tool calls (no expensive, lengthy creative text generation), and Kgateway handles high-concurrency traffic shifting natively via high-performance C++ (Envoy) on CPU. Neither component needs a GPU because neither does the thing GPUs are for.
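The split between proposing and enforcing can be sketched as a pure function that the agent cannot bypass. The example reads the table's "deny if weight < 50%" as "reject any proposal that leaves a protected backend with less than half of the traffic"; that reading, along with the `Guardrail` type and the backend names, is an assumption. Kgateway enforces its limits through Gateway API policy resources, not through Python like this.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Guardrail:
    protected_backend: str  # backend that must keep a minimum share
    min_share: float        # fraction of traffic, 0.0-1.0


def evaluate(proposal: Dict[str, int], rail: Guardrail) -> Tuple[bool, str]:
    """Deterministically allow or deny a proposed weight split."""
    total = sum(proposal.values())
    if total <= 0:
        return False, "deny: no traffic would be routed"
    share = proposal.get(rail.protected_backend, 0) / total
    if share < rail.min_share:
        return False, (
            f"deny: {rail.protected_backend} share {share:.0%} "
            f"< {rail.min_share:.0%}"
        )
    return True, "allow"


rail = Guardrail(protected_backend="checkout-v1", min_share=0.5)

# The agent's intention: isolate the failing v2 pods.
isolate_v2 = evaluate({"checkout-v1": 100, "checkout-v2": 0}, rail)
# A bad proposal that would push most traffic to v2.
favour_v2 = evaluate({"checkout-v1": 30, "checkout-v2": 70}, rail)

print(isolate_v2)  # (True, 'allow')
print(favour_v2)   # (False, 'deny: checkout-v1 share 30% < 50%')
```

The check involves no model and no randomness, so the same proposal always gets the same verdict regardless of what the LLM generated.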
FAQ
Can a 7B CPU model really run SRE autonomously?
For bounded, structured decisions, yes. The trick isn't a smarter model; it's narrowing the task with MCP tools, context compaction, and a deterministic inspect→match-SOP→patch flow. The model emits compact JSON tool calls, not essays, so a CPU-served 7B/8B model is enough.
What stops the agent from doing something dangerous?
Kgateway's hard, human-defined policy boundaries. The agent only proposes; Gateway API policies (rate limits, traffic-weight rules) evaluate and can instantly reject any action that violates a human-configured limit. Autonomy with a safety floor.
Why fix incidents at the network layer instead of restarting pods?
Speed and blast radius. Envoy-native traffic shifting/canarying at L7 resolves cascading errors instantly with zero pod restarts — you reroute away from the failing version rather than waiting on slow, disruptive restarts. It's also auditable (declarative HTTPRoute changes).
Who is this for?
Regulated, air-gapped, or cost-sensitive shops (the speakers are from DigitalOcean and Wells Fargo). If logs/prompts can't leave your VPC and a GPU fleet for ops is unjustifiable, a CPU-only sovereign autopilot is the fit.
Takeaways
- Autonomous SRE doesn't require GPUs or external APIs. A CPU-served small model in a closed loop can self-heal a cluster.
- vLLM + Intel + PagedAttention make CPU inference fast and concurrent enough — no OOM during alert storms.
- Kagent narrows the job with MCP tools, context compaction, and a bounded inspect→match-SOP→patch flow so a 7B model suffices.
- Kgateway is the guardrail — declarative Gateway API actuation with hard, human-defined limits that reject unsafe agentic actions.
- Separate intention from enforcement. The LLM proposes; deterministic policy decides. Autonomy with a safety floor.
- 100% private & sovereign — no log, metric, or prompt leaves the VPC; the RAG knows your SOPs.
Next in the series — Deep dive 13: Serving and Scaling, the other side of GPU economics — how to serve models that genuinely need accelerators.
References
- KubeCon Mumbai 2026 — Day 1 index · the rest of the series
- Kagent · CNCF Sandbox reasoning framework
- Kgateway · Envoy-based Gateway API
- vLLM · Model Context Protocol · CPU inference & the LLM↔K8s bridge