TL;DR — RDMA (Remote Direct Memory Access) lets one machine read/write another machine's memory directly — bypassing the CPU, kernel, and TCP/IP stack on both sides. For distributed GPU training, NCCL uses RDMA (over InfiniBand or RoCEv2) for all-reduce and gradient sync at 200–400 Gbps with single-digit microsecond latency. On Kubernetes, RDMA interfaces are exposed to pods via SR-IOV + Multus.
What it is
RDMA is a networking protocol that allows direct memory-to-memory data transfer between machines without involving the operating system's kernel network stack. The NIC (called an HCA — Host Channel Adapter — in InfiniBand) reads from or writes to application memory via DMA, completely bypassing the CPU and kernel on both the sender and receiver. In the AI Native landscape it's the foundational protocol in AI Native Infra › Network — the reason GPU clusters can train at scale.
Why it exists
Distributed training sends terabytes of gradients between GPUs every minute. TCP/IP adds ~30 µs latency per hop and saturates CPUs with copy/context-switch overhead. RDMA cuts latency to ~1–2 µs and achieves line-rate throughput (200–400 Gbps) with near-zero CPU usage — because the NIC handles everything. Without RDMA, multi-node GPU training would be orders of magnitude slower.
Fig 1 — RDMA: the NIC reads GPU memory on Node A and writes it directly to GPU memory on Node B.
How it works
The application registers a memory region with the RDMA NIC. To send data, it posts a work request to a Queue Pair (QP) — the NIC reads from the registered memory, sends it over the wire, and the remote NIC writes it directly into the destination's registered memory. Neither CPU touches the data. Completion notifications happen via completion queues (CQs). NCCL uses this for all-reduce, all-gather, and reduce-scatter in distributed training.
Transport options
| Transport | Network | Note |
|---|---|---|
| InfiniBand | Dedicated IB fabric | Highest performance, purpose-built for RDMA. NDR: 400 Gbps. |
| RoCEv2 | Standard Ethernet | RDMA over Ethernet; needs lossless config (PFC/ECN). Common in cloud. |
| iWARP | Standard Ethernet | RDMA over TCP; simpler config, lower performance. |
Key concepts for AI
- GPUDirect RDMA — NIC reads/writes GPU memory directly (no CPU staging buffer). Requires NVIDIA GPU + Mellanox NIC + CUDA drivers.
- NCCL — NVIDIA's collective communication library uses RDMA for all-reduce, all-gather across GPUs on different nodes.
- Queue Pairs (QPs) — the RDMA equivalent of a TCP connection. Each pair of communicating processes gets a QP.
- Lossless Ethernet — RoCEv2 needs Priority Flow Control (PFC) and ECN configured on switches to avoid packet drops.
RDMA on Kubernetes
Exposing RDMA to pods requires:
- Multus — attach a secondary NIC to the pod.
- SR-IOV — pass through a VF with RDMA capability.
- RDMA device plugin — expose
/dev/infiniband/devices to the pod. - GPU Operator — for GPUDirect RDMA, the GPU driver must be configured with peer memory.
The NVIDIA Network Operator bundles Multus, SR-IOV, and RDMA device plugins into one operator for GPU clusters.
When to use, when to skip
Use it for multi-node distributed training (data-parallel, model-parallel, pipeline-parallel) — it's not optional at scale, it's the standard. Any serious GPU cluster uses RDMA.
Skip it for single-node training, inference-only workloads, or managed cloud environments where the provider handles the network fabric. Also unnecessary for CPU-only workloads.
vs / alongside
| Approach | Latency | Throughput | Note |
|---|---|---|---|
| RDMA (IB) | ~1 µs | 400 Gbps | Gold standard for training |
| RDMA (RoCEv2) | ~2–3 µs | 200 Gbps | Ethernet-based, needs lossless config |
| TCP/IP | ~30 µs | 100 Gbps | Fallback; usable for small-scale |
| NVLink / NVSwitch | sub-µs | 900 Gbps | Intra-node GPU-to-GPU only |
References
- NVIDIA RDMA documentation — RDMA concepts and drivers.
- NCCL user guide — collective communication over RDMA.
- RDMA shared device plugin — K8s RDMA device exposure.
Extra reads
- NVIDIA Network Operator — bundles Multus + SR-IOV + RDMA for GPU K8s.
- RDMAmojo — community RDMA knowledge base.
Verified against NVIDIA networking docs and RDMA specifications, May 2026.