TL;DR — Kubernetes is the operating system of the AI stack: it schedules pods onto nodes, keeps them alive, and now — since Dynamic Resource Allocation (DRA) went GA in v1.34 — allocates GPUs and TPUs as first-class resources. Every other tool in this section (Volcano, Kueue, KubeRay) builds on its scheduler. Learn the core objects and how GPU scheduling works, and the rest of the infra layer makes sense.
What it is
Kubernetes (K8s) is the open-source container orchestrator: you declare desired state (run N copies of this container, expose this port, give it this much GPU), and the control plane continuously reconciles reality toward it. It handles scheduling, self-healing, scaling, networking, and storage across a cluster of machines.
In the AI Native landscape it's the root of AI Native Infra › Orchestration and Scheduling, and effectively the foundation the whole group stands on. Per CNCF survey data, roughly 66% of organizations hosting generative-AI models use Kubernetes for some or all of their inference workloads.
Why AI runs on it
AI workloads are bursty, expensive, and hardware-hungry. Training jobs need many GPUs scheduled together; inference needs autoscaling and fast restarts; both need to pack scarce, costly accelerators efficiently. Kubernetes gives you one declarative control plane for all of it — and a portable one, so the same manifests run on any cloud or on-prem instead of a proprietary GPU scheduler per vendor.
Core objects (the AI-relevant ones)
| Object | Role |
|---|---|
| Pod | Smallest unit: one or more containers sharing network/storage. Your model server runs here. |
| Deployment | Keeps N replicas of a stateless pod running; rolling updates. Good for inference servers. |
| Job / Indexed Job | Run-to-completion work, such as a training or batch run. |
| Service / Gateway | Stable network endpoint + load balancing in front of pods. |
| scheduler | Decides which node each pod lands on; the piece AI schedulers extend. |
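To make the Deployment and Service rows concrete, here is a minimal sketch of a stateless inference server behind a stable endpoint. The names, image, and ports are placeholders chosen for illustration, not anything prescribed by Kubernetes.

```yaml
# Hypothetical inference server: name, image, and ports are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2                      # desired state: keep two pods running
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server          # must match the selector above
    spec:
      containers:
      - name: server
        image: registry.example.com/model-server:latest  # placeholder image
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: model-server
spec:
  selector:
    app: model-server              # routes to the Deployment's pods
  ports:
  - port: 80                       # port clients connect to
    targetPort: 8000               # port the container listens on
```

Applying this with `kubectl apply -f` should leave two replicas running; if one pod dies, the Deployment controller replaces it without further input.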
GPU scheduling — the hard part
Historically GPUs were exposed through the device plugin API as opaque countable resources (nvidia.com/gpu: 1) — crude: no sharing, no topology awareness, no fractional or multi-instance allocation.
Dynamic Resource Allocation (DRA) replaces that. It went GA in Kubernetes v1.34 (released August 2025) and lets workloads request devices with rich constraints (specific GPU models, memory, topology, MIG partitions) via declarative ResourceClaims. NVIDIA donated its DRA GPU driver and Google its TPU driver to the community, so accelerator scheduling now lives in the K8s control plane instead of per-cloud tooling.
Fig 1 — DRA: pods claim accelerators with constraints; the scheduler matches them to the right device.
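A claim of this kind might look roughly like the sketch below, which uses the `resource.k8s.io/v1` API that became stable in v1.34. The DeviceClass name `gpu.example.com` and the capacity key in the CEL selector are assumptions for illustration; the real values depend on which DRA driver is installed and what it publishes in its ResourceSlices.

```yaml
# Sketch only: DeviceClass name and capacity key are driver-specific assumptions.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                 # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com   # assumed; installed by your DRA driver
          allocationMode: ExactCount
          count: 1
          selectors:
          - cel:
              # assumed capacity key; only devices with at least 40Gi of memory match
              expression: device.capacity["gpu.example.com"].memory.compareTo(quantity("40Gi")) >= 0
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-demo
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      claims:
      - name: gpu                  # refers to the pod-level claim below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu   # a ResourceClaim is generated per pod
```

The pod stays Pending until the scheduler finds a node with a device that satisfies the selector, which is the matching step the figure describes.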
Where it's heading
The AI direction is now explicit in the project. Around the v1.35 release, the CNCF launched the Kubernetes AI Conformance program, a standard for what an "AI-ready" cluster must support. v1.36 added workload-aware scheduling, bringing DRA into the Workload API and driving integration with KubeRay, so jobs can declaratively request specific GPU types and topologies. The trend line: accelerator-first scheduling is becoming native, not bolted on.
Quick start
Spin up a throwaway local cluster and inspect it:

    kind create cluster   # or: minikube start
    kubectl get nodes
    kubectl create deployment web --image=nginx --replicas=2
    kubectl get pods -w

Requesting a GPU comes down to a field in the pod spec, classically a container resource limit or, under DRA, a resourceClaims entry:

    resources:
      limits:
        nvidia.com/gpu: 1  # device-plugin style (still common)

When to use, and what builds on it
Use it whenever AI workloads outgrow a single box: multiple models, shared GPUs, autoscaling, or team-level quotas. It's the default substrate of the AI Native Infra layer.
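For the team-level quota case, one possible approach is a namespaced ResourceQuota on the device-plugin resource name. This is a sketch that assumes GPUs are exposed as `nvidia.com/gpu` and that each team has its own namespace; `team-a` and the limit of 4 are made up for illustration.

```yaml
# Caps the total GPUs that pods in one namespace may request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # extended resources are quota'd via the requests. prefix
```

Once this is in place, a pod in `team-a` whose GPU request would push the namespace past four is rejected at admission time rather than left Pending.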
The default scheduler is great for services but weak for batch/gang-scheduled training. That gap is why the rest of this category exists: Volcano and YuniKorn add batch/gang scheduling, Kueue adds job queueing and quotas, and KubeRay runs Ray clusters on top.
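As a rough sketch of how Kueue plugs into this, a plain Indexed Job can be handed to a queue by adding a label and creating it suspended. This assumes Kueue is installed and that a LocalQueue named `team-a-queue` exists in the namespace; that queue name, the image, and the worker count are hypothetical.

```yaml
# Sketch: a 4-worker training Job submitted through Kueue.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-demo
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # hypothetical LocalQueue
spec:
  suspend: true                    # created suspended; Kueue unsuspends it on admission
  completionMode: Indexed          # each pod gets a stable index (0..3)
  completions: 4
  parallelism: 4                   # all four workers run at once
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1      # one GPU per worker, device-plugin style
```

The point of the design is that Kueue keeps the Job suspended until the queue has quota for all four workers, so a half-started job does not sit on GPUs while waiting for the rest.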
kind locally. And don't expect the stock scheduler to gang-schedule a distributed training job correctly; that's exactly what Volcano/Kueue are for.
References
- Official documentation — concepts, objects, tutorials.
- Dynamic Resource Allocation — the GPU/TPU scheduling model.
- v1.36: Workload-Aware Scheduling — the AI scheduling direction.
- kubernetes/kubernetes — source.
Extra reads
- Device management with DRA — Google's walkthrough.
- K8s GPU orchestration in 2026 — DRA, KAI scheduler, Grove.
- K8s GPU scheduling: DRA, KAI, MIG — the options compared.
- Solving the GPU utilization crisis — the cost case.
Verified against kubernetes.io docs and CNCF sources, May 2026. DRA is GA as of v1.34; latest covered release v1.36.