TL;DR — Kubernetes' default scheduler places pods one at a time — fine for web servers, fatal for distributed training, where workers must all start together or none should. Volcano is a CNCF batch scheduler that adds gang scheduling (all-or-nothing), queues, fair-share, bin-pack, and GPU-aware placement. It's the go-to scheduler for AI/HPC jobs on Kubernetes.
What it is
Volcano is a Kubernetes-native batch scheduling system for high-performance workloads — AI/ML training, deep learning, big data. It's a CNCF project that runs alongside (or replaces) the default scheduler and understands jobs, not just pods. In the AI Native landscape it sits in AI Native Infra › Orchestration and Scheduling.
Why it exists
Distributed training needs every worker running simultaneously to exchange gradients. The default scheduler will happily place 6 of 8 workers, leave 2 pending because the cluster is full, and let the running 6 sit idle burning GPU money while they wait for peers that may never arrive. Worse, two such jobs can deadlock: each holds GPUs the other needs to finish starting, so neither ever runs.
Gang scheduling fixes this: a job's pods are scheduled as a unit — all of them get resources, or none do. No half-started jobs, no GPU waste, no deadlock.
Fig 1 — Gang scheduling: a job's pods land together instead of stranding idle workers.
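To make the failure mode concrete, here is a toy simulation of the two placement styles. It is a sketch under loud assumptions: one GPU per worker, a single pool of free GPUs, and a round-robin loop standing in for a scheduler that sees only individual pods. The names `Job`, `schedule_pod_by_pod`, and `schedule_gang` are hypothetical and are not Volcano or Kubernetes APIs.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers: int  # pods that must all run at once, one GPU each (assumption)


def schedule_pod_by_pod(jobs, free_gpus):
    """Place pods one at a time, round-robin, ignoring job boundaries."""
    placed = {job.name: 0 for job in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for job in jobs:
            if free_gpus > 0 and placed[job.name] < job.workers:
                placed[job.name] += 1
                free_gpus -= 1
                progress = True
    return placed


def schedule_gang(jobs, free_gpus):
    """Admit a job only if all of its workers fit; otherwise place none."""
    placed = {}
    for job in jobs:
        if job.workers <= free_gpus:
            placed[job.name] = job.workers
            free_gpus -= job.workers
        else:
            placed[job.name] = 0
    return placed


def runnable(jobs, placed):
    """A job can make progress only when every worker is placed."""
    return [job.name for job in jobs if placed[job.name] == job.workers]


jobs = [Job("job-a", 8), Job("job-b", 8)]
per_pod = schedule_pod_by_pod(jobs, free_gpus=12)
gang = schedule_gang(jobs, free_gpus=12)

print("pod-by-pod:", per_pod, "runnable:", runnable(jobs, per_pod))
print("gang:      ", gang, "runnable:", runnable(jobs, gang))
```

With 12 GPUs and two 8-worker jobs, the pod-by-pod loop hands out all 12 GPUs (6 to each job) and neither job can train. The gang version starts job-a in full and leaves job-b waiting without holding anything.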
How it works
You submit a VcJob (Volcano Job) describing task groups and a minimum number of members that must be schedulable. Volcano's scheduler evaluates the whole job against queues and policies, then admits it only when the gang can run. It plugs into mainstream AI/data frameworks — Ray, PyTorch, TensorFlow, MindSpore, Spark, Flink — so their operators submit through Volcano.
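As a rough illustration of what such a submission contains, the sketch below builds a VcJob manifest as a plain Python dict and prints it as JSON. The helper name `vcjob_manifest`, the job name, the image, and the `nvidia.com/gpu` resource (which presumes the NVIDIA device plugin) are all placeholders chosen for this example. The field layout follows the `batch.volcano.sh/v1alpha1` Job kind as best understood here, so check it against the official docs before relying on it.

```python
import json


def vcjob_manifest(name, image, workers, gpus_per_worker=1, queue="default"):
    """Build a Volcano Job (VcJob) manifest as a plain dict (illustrative)."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            "queue": queue,
            # Gang size: the job is admitted only when this many pods fit.
            "minAvailable": workers,
            "tasks": [
                {
                    # One task group; a job may list several (e.g. ps + worker).
                    "name": "worker",
                    "replicas": workers,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": "trainer",
                                    "image": image,  # placeholder image
                                    "resources": {
                                        "limits": {"nvidia.com/gpu": gpus_per_worker}
                                    },
                                }
                            ],
                        }
                    },
                }
            ],
        },
    }


manifest = vcjob_manifest("train-demo", "example.com/trainer:latest", workers=4)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output could be piped into `kubectl apply -f -`. Setting `minAvailable` equal to `replicas` asks for strict all-or-nothing. A smaller value would let the job start with a partial gang.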
Scheduling strategies
- Gang — all-or-nothing job admission (the headline).
- Queues + Proportion/Capacity — multi-level queues with resource quotas and weighted fair sharing between teams.
- Fair-share & Priority — balance across jobs; let urgent jobs preempt.
- Binpack — pack pods tightly to free whole nodes (good for cost + GPU defrag).
- DeviceShare — GPU sharing across pods.
- NUMA-aware & Task Topology — place tightly-coupled tasks for fast interconnect.
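The Binpack idea from the list above can be sketched with a toy node scorer. This is a simplified, single-resource illustration and not Volcano's plugin code: the function `pick_node`, the node names, and the fill-ratio score are assumptions made for the example.

```python
def pick_node(nodes, request, binpack=True):
    """Pick a node for a pod requesting `request` GPUs.

    nodes maps node name -> (used_gpus, total_gpus).
    binpack=True prefers the node that ends up fullest;
    binpack=False prefers the node that ends up emptiest (spread).
    Returns None when the pod fits nowhere.
    """
    candidates = []
    for name, (used, capacity) in nodes.items():
        if used + request <= capacity:
            fill = (used + request) / capacity  # fill ratio after placement
            candidates.append((fill if binpack else -fill, name))
    if not candidates:
        return None
    # Highest score wins; ties fall back to the lexically larger node name.
    return max(candidates)[1]


nodes = {"node-a": (6, 8), "node-b": (2, 8), "node-c": (0, 8)}

print(pick_node(nodes, request=2))                 # binpack choice
print(pick_node(nodes, request=2, binpack=False))  # spread choice
```

Here binpack tops up node-a and keeps node-c completely free, so a later 8-GPU job still has a whole node to land on. Spreading would put the pod on node-c and fragment that capacity.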
Quick start
Install via Helm (or the release manifest), then submit a job whose minAvailable enforces the gang:

    helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
    helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
    kubectl get pods -n volcano-system   # confirm scheduler + controllers up

    # in a VcJob: require all 4 workers before any start
    spec:
      minAvailable: 4
      schedulerName: volcano

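Point a framework operator (e.g. KubeRay, the PyTorch operator) at schedulerName: volcano and its pods are placed by Volcano. Operators with built-in Volcano support typically also create the pod group once that integration is enabled, at which point your distributed jobs gang-schedule automatically.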
When to use, when to skip
Use it for distributed training and any multi-pod job that must start together, or when you need team queues and quotas over a shared GPU cluster or topology-aware placement for fast interconnect. It is among the most feature-rich AI/HPC schedulers in the ecosystem.
Skip it for plain stateless inference — the default scheduler is fine. If you mainly need job queueing and quota on top of the existing scheduler rather than a full replacement, Kueue is lighter; for a Big-Data-heritage scheduler, YuniKorn is the alternative.
Volcano is opt-in per workload through schedulerName, so it coexists with the default scheduler rather than ripping it out. Increasingly it's paired with Kueue (Kueue for admission/quota, Volcano for gang placement) rather than chosen instead of it.
Vs the alternatives
| Tool | Best for | Trade-off |
|---|---|---|
| Volcano | Gang scheduling, rich AI/HPC policies, topology | A full scheduler to operate |
| Kueue | Job queueing + quota on the default scheduler | Not a placement scheduler itself |
| YuniKorn | Unified batch + service, Big Data heritage | Different policy model |
| Default scheduler | Stateless services, simple workloads | No gang scheduling |
References
- volcano.sh — project site.
- Official documentation — concepts, VcJob, plugins.
- Unified scheduling — strategies in depth.
- volcano-sh/volcano — source (CNCF).
Extra reads
- Batch scheduling on K8s: YuniKorn vs Volcano vs Kueue — the comparison.
- Volcano with Kubeflow — wiring into training operators.
- Configuring gang scheduling for ML — hands-on guide.
Verified against the official Volcano docs (volcano.sh) and CNCF sources, May 2026.