TL;DR — Making a GPU usable by pods needs a stack: kernel driver, container toolkit, device plugin, feature labels, monitoring. The NVIDIA GPU Operator installs and manages all of it automatically on every GPU node, and unlocks MIG, time-slicing, vGPU, and GPUDirect. It's the standard way to make NVIDIA GPUs work on Kubernetes.
What it is
The NVIDIA GPU Operator is a Kubernetes operator that deploys, configures, and manages every software component needed to provision NVIDIA GPUs in a cluster. Instead of hand-installing drivers and plugins on each node, you install one operator and it reconciles the full GPU stack. In the AI Native landscape it's in AI Native Infra › Accelerator and SuperPod.
Why it exists
A bare Kubernetes node can't use a GPU. You'd need to install the matching kernel driver, the NVIDIA Container Toolkit so containers see the device, a device plugin to advertise nvidia.com/gpu to the scheduler, feature labels, and monitoring — on every node, kept in sync across upgrades. The operator automates that whole chain and keeps it consistent.
The components it manages
It installs a dependency chain on each GPU node:
Fig 1 — The managed stack: feature discovery → driver → toolkit → device plugin → labels → monitoring → validation.
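Node Feature Discovery (NFD) labels nodes that carry NVIDIA PCI devices, and GPU Feature Discovery (GFD) adds labels for GPU model and capabilities; the driver and container toolkit let containers run CUDA; the device plugin advertises GPUs to the scheduler as nvidia.com/gpu; the DCGM exporter publishes metrics; the MIG Manager partitions MIG-capable GPUs; the validator confirms each layer works.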
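To make the chain concrete, here is a minimal sketch that writes a Helm values file with one toggle per managed component. The key names are assumptions based on the gpu-operator chart and can differ between chart versions, so confirm them with `helm show values nvidia/gpu-operator` before relying on them. The file name is hypothetical.

```shell
# Sketch only: one toggle per component, in dependency order.
# Key names are assumed from the gpu-operator chart; verify with
#   helm show values nvidia/gpu-operator
cat > gpu-operator-values.yaml <<'EOF'
nfd:
  enabled: true   # Node Feature Discovery: labels nodes with NVIDIA PCI devices
gfd:
  enabled: true   # GPU Feature Discovery: labels GPU model and capabilities
driver:
  enabled: true   # containerized kernel driver
toolkit:
  enabled: true   # NVIDIA Container Toolkit
devicePlugin:
  enabled: true   # advertises nvidia.com/gpu to the scheduler
dcgmExporter:
  enabled: true   # GPU metrics for Prometheus
migManager:
  enabled: true   # applies MIG partition layouts
EOF
echo "wrote gpu-operator-values.yaml"
```

Passing the file with `-f gpu-operator-values.yaml` on `helm install` keeps the component choices in version control instead of in a long list of `--set` flags.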
Advanced capabilities
- MIG (Multi-Instance GPU): partition one A100/H100 into up to seven hardware-isolated GPU instances for many small workloads.
- Time-slicing: oversubscribe a GPU by letting pods take turns on it; there is no memory or fault isolation between them, so use it only when isolation isn't required.
- vGPU: run GPU workloads on nodes that are virtual machines backed by NVIDIA vGPU software.
- GPUDirect RDMA & Storage: direct GPU-to-network and GPU-to-storage data paths for distributed training.
- DCGM monitoring: GPU utilization, memory, temperature, and error metrics exported to Prometheus.
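As an illustration of how one of these features is switched on, the sketch below writes a time-slicing ConfigMap that advertises each physical GPU as four schedulable nvidia.com/gpu resources. The ConfigMap name, the `any` data key, and the replica count are hypothetical choices, and the layout follows the device plugin's sharing config as I understand it, so check it against the time-slicing page of the operator docs for your version.

```shell
# Sketch only: a time-slicing config for the device plugin.
# "time-slicing-config", the "any" key, and replicas=4 are illustrative.
cat > time-slicing-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
EOF
echo "wrote time-slicing-config.yaml"

# On a real cluster (not run here), the usual next steps are roughly:
#   kubectl apply -f time-slicing-config.yaml
#   kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator \
#     --type merge \
#     -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'
```

With four replicas, a node with one physical GPU would report four allocatable nvidia.com/gpu, but the pods sharing it still compete for the same memory and compute.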
Quick start
Install via Helm; the operator targets nodes that NFD has labeled feature.node.kubernetes.io/pci-10de.present=true (10de is NVIDIA's PCI vendor ID) and rolls out the stack:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update # refresh the chart index before installing
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
kubectl get pods -n gpu-operator --watch # watch the stack come up and validate
After it's ready, a pod requests a GPU the normal way (limits: nvidia.com/gpu: 1) and the scheduler places it on a GPU node.
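A minimal smoke test for that request might look like the sketch below, which writes a one-shot pod manifest that asks for a single GPU. The pod name and file name are hypothetical, and the sample image tag is an assumption; substitute any CUDA image you can pull.

```shell
# Sketch only: a one-shot pod that requests one GPU through resource limits.
# The image tag is an assumption; swap in a CUDA image available to you.
cat > gpu-smoke-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
echo "wrote gpu-smoke-test.yaml"

# On a real cluster (not run here):
#   kubectl apply -f gpu-smoke-test.yaml
#   kubectl logs gpu-smoke-test
```

If the pod stays Pending, the usual first check is whether any node reports nvidia.com/gpu under its allocatable resources in `kubectl describe node`.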
When to use, when to skip
Use it on any self-managed Kubernetes cluster with NVIDIA GPUs. It's the supported, consistent way to provision the GPU stack and the gateway to MIG, time-slicing, and GPUDirect. For on-prem or self-managed GPU clusters it is the default choice.
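Skip or adjust it on managed clouds that pre-install GPU drivers: some GKE/EKS node images already handle parts of the stack, so check what is present before installing a second copy, and disable the overlapping components if you still want the operator for the rest. For fine-grained GPU sharing beyond MIG/time-slicing, pair it with HAMi; the newer scheduling path is Kubernetes Dynamic Resource Allocation (DRA).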
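One way to avoid doubling up is to build the install command from what the node image already provides. The sketch below is illustrative: the two `*_PREINSTALLED` switches are hypothetical variables you would set by hand after inspecting your nodes, and the `driver.enabled` and `toolkit.enabled` chart values are assumptions to confirm with `helm show values nvidia/gpu-operator`.

```shell
# Sketch only: skip components the node image already ships.
# Set these by hand after inspecting your node image (hypothetical switches).
DRIVER_PREINSTALLED=true
TOOLKIT_PREINSTALLED=false

HELM_ARGS="install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace"
if [ "$DRIVER_PREINSTALLED" = "true" ]; then
  # Assumed chart value: leave the host driver alone.
  HELM_ARGS="$HELM_ARGS --set driver.enabled=false"
fi
if [ "$TOOLKIT_PREINSTALLED" = "true" ]; then
  # Assumed chart value: leave the host container toolkit alone.
  HELM_ARGS="$HELM_ARGS --set toolkit.enabled=false"
fi

# Print instead of executing so the command can be reviewed first.
echo "helm $HELM_ARGS"
```

Printing the command rather than running it keeps the sketch safe to try anywhere; paste the output into a shell with cluster access once it looks right.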
vs / alongside
| Tool | Role | Note |
|---|---|---|
| GPU Operator | Install + manage the NVIDIA GPU stack | The baseline |
| HAMi | Fine-grained GPU sharing/virtualization | Layers on top |
| K8s Device Plugin | Just the GPU-advertising piece | Subset of the operator |
| DRA | Native accelerator scheduling (v1.34 GA) | The newer model |
References
- GPU Operator docs — official overview.
- Installing the operator — getting started.
- NVIDIA/gpu-operator — source.
Extra reads
- What it does under the hood — the component chain explained.
- Real-world guide — practical operation.
- GPU Operator on GKE — managed-cloud caveats.
Verified against the official NVIDIA GPU Operator docs (docs.nvidia.com), May 2026.