// AI NATIVE STACK


CRASH COURSE · AI-NATIVE · intermediate · 9 min read · kubelet API

The Kubernetes Device Plugin — how a GPU becomes schedulable.


TL;DR — Kubernetes has no built-in idea of a GPU. The device plugin framework is the kubelet API that lets a vendor advertise hardware as a schedulable resource like nvidia.com/gpu. NVIDIA's k8s-device-plugin implements it, and can oversubscribe a GPU via time-slicing or MPS — sharing without isolation. It's the foundational layer the GPU Operator and DRA build on.

What it is

The device plugin framework is a Kubernetes interface for advertising non-CPU/memory resources — GPUs, NICs, FPGAs — to the kubelet so the scheduler can allocate them. The NVIDIA k8s-device-plugin is the GPU implementation: a DaemonSet that discovers GPUs on each node and exposes them as nvidia.com/gpu. In the AI Native landscape it's the bedrock piece of AI Native Infra › Accelerator and SuperPod.

Why it exists

Natively, the scheduler understands only built-in resources such as CPU, memory, and ephemeral storage. Without a device plugin, a GPU on a node is invisible: pods can't request it and the scheduler can't place GPU work. The plugin closes that gap by registering the resource name with the kubelet and reporting which devices are available, so the node advertises a count and the hardware becomes a first-class, requestable resource.

How it works

The plugin registers with the kubelet over a gRPC Unix socket, lists the GPUs it found, and on allocation returns what the container needs for the chosen device IDs: device nodes, mounts, and environment variables. The kubelet reports the count to the API server as part of the node's capacity; the scheduler then treats nvidia.com/gpu like any other countable resource in pod requests.

[Figure: device plugin (finds GPUs) → kubelet (registers) → scheduler (nvidia.com/gpu) → pod (gets GPU mounts)]

Fig 1 — Plugin → kubelet → scheduler: hardware becomes a requestable resource.
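
The flow in Fig 1 can be modeled in a few lines. This is a toy, in-memory sketch and not the real gRPC API: the class names ToyPlugin and ToyKubelet, the device IDs, and the use of NVIDIA_VISIBLE_DEVICES as the allocation payload are illustrative assumptions. The real plugin serves calls such as ListAndWatch and Allocate over a Unix socket.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPlugin:
    """Stands in for a vendor plugin; the real one serves gRPC on a Unix socket."""
    resource_name: str
    device_ids: list

    def list_devices(self):
        # Real plugins stream this list to the kubelet via ListAndWatch.
        return list(self.device_ids)

    def allocate(self, ids):
        # Real Allocate responses can carry env vars, mounts, and device nodes.
        # The env var used here is an illustrative assumption.
        return {"envs": {"NVIDIA_VISIBLE_DEVICES": ",".join(ids)}}


@dataclass
class ToyKubelet:
    """Tracks advertised and free devices per resource name."""
    capacity: dict = field(default_factory=dict)
    free: dict = field(default_factory=dict)
    plugins: dict = field(default_factory=dict)

    def register(self, plugin):
        devices = plugin.list_devices()
        self.plugins[plugin.resource_name] = plugin
        self.capacity[plugin.resource_name] = len(devices)
        self.free[plugin.resource_name] = devices

    def advertised(self):
        # The count the node reports; the scheduler compares pod requests to it.
        return dict(self.capacity)

    def admit(self, resource_name, count):
        pool = self.free.get(resource_name, [])
        if count > len(pool):
            raise RuntimeError(f"insufficient {resource_name}")
        self.free[resource_name] = pool[count:]
        return self.plugins[resource_name].allocate(pool[:count])


kubelet = ToyKubelet()
kubelet.register(ToyPlugin("nvidia.com/gpu", ["GPU-0", "GPU-1"]))
print(kubelet.advertised())                  # {'nvidia.com/gpu': 2}
print(kubelet.admit("nvidia.com/gpu", 1))    # {'envs': {'NVIDIA_VISIBLE_DEVICES': 'GPU-0'}}
print(len(kubelet.free["nvidia.com/gpu"]))   # 1
```

The point of the model is the division of labor: the plugin knows the hardware, the kubelet keeps the books, and the scheduler only ever sees a number.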

Oversubscription: time-slicing & MPS

By default one GPU = one allocatable unit. The NVIDIA plugin can oversubscribe it by advertising replicas:

  • Time-slicing: declare N replicas of a GPU; each pod gets one replica and the GPU time-slices between their processes. Simple, but there is no isolation: the pods share memory and a fault domain.
  • MPS (Multi-Process Service): kernels from different pods run concurrently, with some partitioning of GPU resources between them. It tends to beat raw time-slicing for small workloads that cannot fill the GPU on their own.

Enabling replicas also changes the node's labels (nvidia.com/gpu.replicas carries the replica count and the product label gets a -SHARED suffix), so you can target shared versus whole GPUs with a node selector.
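
The arithmetic behind replicas is worth seeing once. The sketch below is a hypothetical model, not the plugin's code: the function names, the "<id>::<n>" replica naming, and the Tesla-T4 product string are assumptions chosen for illustration.

```python
def advertise(gpu_ids, replicas=1):
    """Return the device IDs a node would advertise as nvidia.com/gpu."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    if replicas == 1:
        return list(gpu_ids)
    # Each physical GPU appears `replicas` times; the naming is an assumption.
    return [f"{gpu}::{n}" for gpu in gpu_ids for n in range(replicas)]


def node_labels(product, replicas):
    """Model the labels that distinguish shared from whole GPUs."""
    shared = replicas > 1
    return {
        "nvidia.com/gpu.replicas": str(replicas),
        "nvidia.com/gpu.product": f"{product}-SHARED" if shared else product,
    }


ids = advertise(["GPU-a", "GPU-b"], replicas=4)
print(len(ids))                      # 8
print(ids[:2])                       # ['GPU-a::0', 'GPU-a::1']
print(node_labels("Tesla-T4", 4))
```

Two physical GPUs with four replicas each show up as eight schedulable units. Nothing about the hardware changed; only the count the scheduler sees did.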

Quick start

If you run the GPU Operator, the device plugin is already installed — you rarely deploy it alone. Standalone, it's a DaemonSet:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
kubectl describe node <gpu-node> | grep nvidia.com/gpu     # see advertised GPUs

Then a pod requests the GPU as usual, by setting nvidia.com/gpu: 1 under resources.limits in its container spec. Time-slicing is enabled via a small ConfigMap that sets the replica count.
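
As a rough sketch of both pieces, the script below builds a pod manifest and a time-slicing config as plain dictionaries and prints the pod as JSON, which kubectl accepts. The pod name, the container image tag, and the replica count of 4 are assumptions; how the config reaches the plugin (Helm value or GPU Operator setting) depends on your install and is not shown.

```python
import json


def gpu_pod(name="gpu-smoke-test", gpus=1):
    """Pod that asks for whole nvidia.com/gpu units (name is hypothetical)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "cuda",
                # Assumed image tag; any CUDA base image should do.
                "image": "nvidia/cuda:12.4.1-base-ubuntu22.04",
                "command": ["nvidia-smi"],
                # Extended resources are requested under limits.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


def time_slicing_config(replicas=4):
    """Plugin config payload that advertises each GPU `replicas` times."""
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": "nvidia.com/gpu", "replicas": replicas}],
            },
        },
    }


print(json.dumps(gpu_pod(), indent=2))
```

Saved under a name of your choice, the output can be piped straight into kubectl apply -f -. The dictionary from time_slicing_config(4) is the content that would go into the ConfigMap's data, serialized as YAML or JSON.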

When to use, when to skip

Use it. You effectively always do, because it's the mechanism that makes GPUs schedulable. The real decisions are how you run it (standalone or bundled in the GPU Operator) and whether to enable time-slicing or MPS for cheap sharing of light workloads.

Move beyond it when you need real isolation or quotas: time-slicing has neither, so reach for MIG or HAMi. The device plugin model is also being superseded by Dynamic Resource Allocation (DRA, GA in Kubernetes v1.34) for richer, constraint-based allocation.

Heads up: time-slicing gives no memory or fault isolation, so one pod can exhaust the GPU's memory and crash its neighbors. Use it only for trusted, light, bursty workloads (dev, small inference), never for strict multi-tenant isolation.
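
A back-of-the-envelope check makes the risk concrete. The helper below is a hypothetical planning aid, not something the plugin or Kubernetes runs; the 24 GiB card and the per-pod peaks are made-up numbers.

```python
def fits_in_gpu_memory(gpu_mem_gib, peak_usage_gib):
    """True if the pods' combined peak memory fits on the one physical GPU."""
    return sum(peak_usage_gib) <= gpu_mem_gib


# Four replicas of one 24 GiB GPU: the scheduler sees four free slots either way.
print(fits_in_gpu_memory(24, [4, 4, 6, 6]))     # True  (20 GiB in total)
print(fits_in_gpu_memory(24, [4, 4, 6, 14]))    # False (28 GiB in total)
```

With time-slicing nothing enforces this sum, so the second case schedules fine and fails at runtime, for every pod on that GPU.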

Where it sits

Layer         | Role                                       | Note
Device Plugin | Advertise + allocate GPUs to the scheduler | The foundation
GPU Operator  | Installs the plugin + driver + monitoring  | Bundles it
HAMi / MIG    | Isolated sharing                           | Beyond time-slicing
DRA           | Constraint-based native allocation         | The successor


Verified against kubernetes.io and the NVIDIA k8s-device-plugin docs, May 2026.

© cvam — written in plaintext, served warm