// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · intermediate · 10 min read · v26.x

NVIDIA GPU Operator — the whole GPU stack, installed for you.

TL;DR — Making a GPU usable by pods needs a stack: kernel driver, container toolkit, device plugin, feature labels, monitoring. The NVIDIA GPU Operator installs and manages all of it automatically on every GPU node, and unlocks MIG, time-slicing, vGPU, and GPUDirect. It's the standard way to make NVIDIA GPUs work on Kubernetes.

What it is

The NVIDIA GPU Operator is a Kubernetes operator that deploys, configures, and manages every software component needed to provision NVIDIA GPUs in a cluster. Instead of hand-installing drivers and plugins on each node, you install one operator and it reconciles the full GPU stack. In the AI Native landscape it's in AI Native Infra › Accelerator and SuperPod.

Why it exists

Out of the box, a Kubernetes node cannot schedule work onto a GPU. You would need to install the matching kernel driver, the NVIDIA Container Toolkit so containers can see the device, a device plugin to advertise nvidia.com/gpu as a schedulable resource, feature labels, and monitoring. All of that has to happen on every node and stay in sync across upgrades. The operator automates the whole chain and keeps it consistent.

The components it manages

It installs a dependency chain on each GPU node:

NFD → Driver → Container Toolkit → Device Plugin → GFD → DCGM (+ MIG Manager) → Validator

Fig 1 — The managed stack: feature discovery → driver → toolkit → device plugin → labels → monitoring → validation.

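NFD labels nodes that contain NVIDIA PCI devices, and GFD adds GPU-specific labels such as model and memory; the driver and container toolkit expose the GPU to containers so CUDA workloads can run; the device plugin registers nvidia.com/gpu with the kubelet so the scheduler can place pods; DCGM exports metrics; the MIG Manager partitions MIG-capable GPUs; the validator confirms the whole chain works.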
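One way to check that the labeling and device-plugin stages took effect is to look for their visible results on the nodes. The sketch below wraps a single kubectl query in a shell function. The function name show_gpu_nodes is hypothetical, and it assumes GFD applies the nvidia.com/gpu.present=true label and that kubectl is already pointed at the cluster.

```shell
# Hypothetical helper: list GPU nodes and the GPU count each one advertises.
# Assumes kubectl is configured and GFD has labeled the GPU nodes.
show_gpu_nodes() {
  kubectl get nodes -l nvidia.com/gpu.present=true \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Call show_gpu_nodes after the operator pods settle. An empty list suggests the labeling stages have not finished, and an empty GPUS column suggests the device plugin has not registered yet.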

Advanced capabilities

  • MIG (Multi-Instance GPU) — slice one A100/H100 into isolated GPU instances for many small workloads.
  • Time-slicing — oversubscribe a GPU across pods when isolation isn't required.
  • vGPU — virtual GPUs for partitioned/virtualized environments.
  • GPUDirect RDMA & Storage — fast GPU-to-network and GPU-to-storage paths for distributed training.
  • DCGM monitoring — GPU utilization, memory, temperature, errors exported to Prometheus.
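As an illustration of how one of these features is switched on, time-slicing is configured through a ConfigMap that the device plugin reads. The sketch below writes such a ConfigMap to a file. The ConfigMap name, the "any" key, and the replica count of 4 are illustrative choices, and the layout follows NVIDIA's time-slicing documentation as best understood here, so verify it against your operator version before applying.

```shell
# Write a time-slicing config that advertises each physical GPU as 4 schedulable replicas.
# Name, key ("any"), and replicas are illustrative, not required values.
cat > time-slicing-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF

# Against a live cluster (not run here; assumes the default ClusterPolicy name):
#   kubectl apply -f time-slicing-config.yaml
#   kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
#     -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'
```

With this in place a node with one physical GPU would report nvidia.com/gpu: 4. The replicas share the device with no memory or fault isolation between pods.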

Quick start

Install via Helm; the operator detects GPU nodes (label feature.node.kubernetes.io/pci-10de.present=true) and rolls out the stack:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update                        # refresh the local chart index
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --wait

kubectl get pods -n gpu-operator        # watch the stack come up + validate

Once the stack is ready, a pod requests a GPU through its container resource limits (resources.limits: nvidia.com/gpu: 1) and the scheduler places it on a node with a free GPU.
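A GPU request of this kind can be sketched as a one-shot pod that runs nvidia-smi and exits. The pod name, file name, and image tag below are assumptions for illustration, so substitute any CUDA base image you know exists in your registry.

```shell
# Write a minimal pod manifest that asks for one GPU and prints the device table.
cat > gpu-smoke-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag; replace as needed
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Against a live cluster (not run here):
#   kubectl apply -f gpu-smoke-test.yaml
#   kubectl logs gpu-smoke-test
```

If the pod stays Pending, the usual first check is whether any node reports nvidia.com/gpu under its allocatable resources.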

When to use, when to skip

Use it on any self-managed Kubernetes cluster with NVIDIA GPUs. It is the supported, consistent way to provision the GPU stack and the gateway to MIG, time-slicing, and GPUDirect. For on-prem or self-managed GPU clusters it is the default choice.

Skip or adjust on managed clouds that pre-install GPU drivers. Some GKE/EKS node images already provide parts of the stack, so check which components are present and disable those in the operator rather than installing them twice. For fine-grained GPU sharing beyond MIG and time-slicing, pair it with HAMi; the newer scheduling path is Kubernetes DRA.

Heads up: the driver container must match your node kernel. A kernel upgrade can break the driver pods if the driver and kernel versions are not pinned or compatible. If NFD is already running in your cluster, disable the operator's bundled NFD to avoid conflicts.
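Both adjustments (skipping components the node image already provides, and turning off the bundled NFD) can be expressed as Helm value overrides. The sketch below writes a values file using the chart's nfd.enabled and driver.enabled switches. The file name is an assumption, and the driver line only applies when your nodes really ship a preinstalled driver.

```shell
# Write Helm overrides that disable components the cluster already provides.
cat > gpu-operator-values.yaml <<'EOF'
nfd:
  enabled: false      # cluster already runs Node Feature Discovery
driver:
  enabled: false      # node image ships the NVIDIA driver; delete this block otherwise
EOF

# Against a live cluster (not run here):
#   helm install gpu-operator nvidia/gpu-operator -n gpu-operator \
#     --create-namespace -f gpu-operator-values.yaml
```

Keeping the overrides in a file rather than in --set flags makes it easier to review which parts of the stack the operator owns on a given cluster.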

vs / alongside

Tool              | Role                                           | Note
GPU Operator      | Install + manage the NVIDIA GPU stack          | The baseline
HAMi              | Fine-grained GPU sharing/virtualization        | Layers on top
K8s Device Plugin | Just the GPU-advertising piece                 | Subset of the operator
DRA               | Native accelerator scheduling (GA in K8s v1.34) | The newer model

Verified against the official NVIDIA GPU Operator docs (docs.nvidia.com), May 2026.

© cvam — written in plaintext, served warm