// AI NATIVE STACK

CRASH COURSE · AI-NATIVE · advanced · 11 min read · concept

RDMA — zero-copy, kernel-bypass networking that makes GPU training scale.

network ai-native rdma infiniband nccl

TL;DR — RDMA (Remote Direct Memory Access) lets one machine read/write another machine's memory directly — bypassing the CPU, kernel, and TCP/IP stack on both sides. For distributed GPU training, NCCL uses RDMA (over InfiniBand or RoCEv2) for all-reduce and gradient sync at 200–400 Gbps with single-digit microsecond latency. On Kubernetes, RDMA interfaces are exposed to pods via SR-IOV + Multus.

What it is

RDMA is a networking capability, implemented by transports such as InfiniBand, RoCE, and iWARP, that moves data directly between the memory of two machines without passing through the operating system's kernel network stack. The NIC (called an HCA, or Host Channel Adapter, in InfiniBand) reads from or writes to application memory via DMA, so the data path bypasses the CPU and kernel on both the sender and the receiver; the CPU only sets up the transfer. In the AI Native landscape it's the foundational technology in AI Native Infra › Network, and the reason GPU clusters can train at scale.

Why it exists

Distributed training exchanges gradients between GPUs on every step, which for large models can add up to terabytes per minute across a cluster. Kernel TCP/IP adds roughly 30 µs of latency per message and loads CPUs with copy and context-switch overhead. RDMA cuts latency to ~1–2 µs and achieves near line-rate throughput (200–400 Gbps) with near-zero CPU usage, because the NIC handles the data path. Without RDMA, communication can come to dominate each training step, and multi-node GPU training becomes far slower.
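A back-of-envelope model makes the latency and bandwidth figures above concrete. This is a minimal sketch, assuming transfer time is simply fixed latency plus serialization time; the 100 Gbps TCP link speed and the function name `transfer_time_us` are illustrative assumptions, and the model ignores CPU overhead, congestion, and protocol headers.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_gbps):
    """Fixed latency plus serialization time, in microseconds."""
    bits = message_bytes * 8
    # 1 Gbps moves 1e3 bits per microsecond.
    return latency_us + bits / (bandwidth_gbps * 1e3)


# Latencies follow the text above; the TCP bandwidth is an assumed value.
TCP = {"latency_us": 30, "bandwidth_gbps": 100}
RDMA = {"latency_us": 2, "bandwidth_gbps": 400}

for size in (4 * 1024, 1024 ** 3):  # a 4 KiB message and a 1 GiB message
    tcp = transfer_time_us(size, **TCP)
    rdma = transfer_time_us(size, **RDMA)
    print(f"{size:>12} B  tcp={tcp:10.1f} us  rdma={rdma:10.1f} us  ratio={tcp / rdma:.1f}x")
```

Under these assumptions, small messages are dominated by the latency gap (roughly 15x), while large messages converge on the bandwidth ratio (about 4x). Collectives send many small and medium messages, which is why latency matters as much as raw link speed.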

[Figure: Node A (GPU, HCA) and Node B (GPU, HCA) linked by an RDMA direct memory transfer: no CPU, no kernel, no TCP.]

Fig 1 — RDMA: the NIC reads GPU memory on Node A and writes it directly to GPU memory on Node B.

How it works

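The application registers a memory region with the RDMA NIC, which pins the pages and gives the NIC the address translations and access keys it needs. To send data, the application posts a work request to a Queue Pair (QP): the NIC reads from the registered memory, sends it over the wire, and the remote NIC writes it directly into the destination's registered memory. For a one-sided RDMA write, the receiver shares its buffer address and remote key ahead of time. Neither CPU touches the data. Completion notifications arrive via completion queues (CQs), which the application polls. NCCL uses this for all-reduce, all-gather, and reduce-scatter in distributed training.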
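The register, post, and poll flow described above can be sketched as a toy in-process model. This is not the verbs API: in libibverbs the corresponding calls are `ibv_reg_mr`, `ibv_post_send`, and `ibv_poll_cq`, whereas `ToyNic`, `MemoryRegion`, and `WorkRequest` below are hypothetical stand-ins that only mimic the order of operations.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count

_rkeys = count(1)  # toy key generator; real keys are issued by the NIC driver


@dataclass
class MemoryRegion:
    """A registered buffer plus the key a remote peer needs to write into it."""
    buf: bytearray
    rkey: int = field(default_factory=lambda: next(_rkeys))


@dataclass
class WorkRequest:
    wr_id: int
    local: MemoryRegion
    remote_rkey: int
    remote_offset: int
    length: int


class ToyNic:
    """Stands in for one node's RDMA NIC (hypothetical, not a real API)."""

    def __init__(self):
        self.regions = {}           # rkey -> MemoryRegion registered on this node
        self.send_queue = deque()   # send half of a queue pair
        self.completions = deque()  # completion queue (CQ)
        self.peer = None            # the NIC at the other end of the "wire"

    def register(self, buf):
        mr = MemoryRegion(buf)
        self.regions[mr.rkey] = mr
        return mr

    def post_write(self, wr):
        # Posting only enqueues the request; the NIC does the transfer later.
        self.send_queue.append(wr)

    def process(self):
        # Models the NIC draining its send queue without the application's help.
        while self.send_queue:
            wr = self.send_queue.popleft()
            target = self.peer.regions.get(wr.remote_rkey)
            ok = (
                target is not None
                and wr.length <= len(wr.local.buf)
                and 0 <= wr.remote_offset
                and wr.remote_offset + wr.length <= len(target.buf)
            )
            if ok:
                end = wr.remote_offset + wr.length
                target.buf[wr.remote_offset:end] = wr.local.buf[:wr.length]
            self.completions.append((wr.wr_id, "ok" if ok else "error"))

    def poll_cq(self):
        return self.completions.popleft() if self.completions else None


nic_a, nic_b = ToyNic(), ToyNic()
nic_a.peer, nic_b.peer = nic_b, nic_a

src = nic_a.register(bytearray(b"gradients"))
dst = nic_b.register(bytearray(16))  # node B shares dst.rkey with node A out of band

nic_a.post_write(WorkRequest(wr_id=1, local=src, remote_rkey=dst.rkey,
                             remote_offset=0, length=len(src.buf)))
nic_a.process()
completion = nic_a.poll_cq()
print(completion, bytes(dst.buf[:9]))
```

Note that node B's application code never runs during the transfer: it registers a buffer, hands out the key, and later finds the data in place. That is the one-sided behavior that keeps remote CPUs out of the data path.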

Transport options

Transport  | Network             | Note
InfiniBand | Dedicated IB fabric | Highest performance, purpose-built for RDMA. NDR: 400 Gbps.
RoCEv2     | Standard Ethernet   | RDMA over Ethernet; needs lossless config (PFC/ECN). Common in cloud.
iWARP      | Standard Ethernet   | RDMA over TCP; simpler config, lower performance.

Key concepts for AI

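  • GPUDirect RDMA: the NIC reads and writes GPU memory directly, with no CPU staging buffer. Typically requires an NVIDIA GPU, an RDMA-capable NIC such as a Mellanox (now NVIDIA) ConnectX, and the NVIDIA driver with its peer-memory kernel module loaded.
  • NCCL: NVIDIA's collective communication library, which uses RDMA for all-reduce, all-gather, and reduce-scatter across GPUs on different nodes.
  • Queue Pairs (QPs): the RDMA equivalent of a TCP connection. A QP is a send queue plus a receive queue; with reliable-connected transport, each pair of communicating processes gets its own QP on each side.
  • Lossless Ethernet: RoCEv2 needs Priority Flow Control (PFC) and ECN configured on switches to avoid packet drops.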
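To show what an all-reduce actually moves over those QPs, here is a small simulation of the ring variant: a reduce-scatter phase followed by an all-gather phase. It is a sketch of one well-known algorithm, not NCCL's implementation (NCCL chooses among several algorithms, including ring and tree). The function name `ring_all_reduce` is hypothetical, and each chunk is a single number to keep the example short.

```python
def ring_all_reduce(rank_data):
    """Sum-reduce equal-length vectors across ranks arranged in a ring.

    rank_data[r] is rank r's vector, split into exactly n chunks (one number
    per chunk here). Returns each rank's vector after the all-reduce.
    """
    n = len(rank_data)
    chunks = [list(v) for v in rank_data]
    assert all(len(v) == n for v in chunks), "need one chunk per rank"

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so every rank "sends" simultaneously.
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [chunks[r][c] for r, c in sends]
        for (r, c), value in zip(sends, payloads):
            chunks[(r + 1) % n][c] += value  # neighbor accumulates

    # Phase 2, all-gather: circulate the reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [chunks[r][c] for r, c in sends]
        for (r, c), value in zip(sends, payloads):
            chunks[(r + 1) % n][c] = value  # neighbor overwrites

    return chunks


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)  # every rank ends with [111, 222, 333]
```

Each rank sends 2(n-1) chunks of size S/n, so the per-rank traffic is 2(n-1)/n × S regardless of ring length. In a multi-node job, every one of those neighbor-to-neighbor sends that crosses a node boundary is a candidate for an RDMA transfer.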

RDMA on Kubernetes

Exposing RDMA to pods requires:

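  1. Multus: attach a secondary, RDMA-capable network interface to the pod alongside the default cluster network.
  2. SR-IOV: pass through a Virtual Function (VF) with RDMA capability.
  3. RDMA device plugin: expose /dev/infiniband/ devices to the pod.
  4. GPU Operator: for GPUDirect RDMA, the GPU driver must be deployed with peer-memory support enabled so the NIC can access GPU memory.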

The NVIDIA Network Operator bundles Multus, SR-IOV, and RDMA device plugins into one operator for GPU clusters.
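As a rough illustration of how these pieces surface in a pod spec, the sketch below builds a manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The annotation key `k8s.v1.cni.cncf.io/networks` is the one Multus reads and `nvidia.com/gpu` is the NVIDIA device plugin's resource; everything else is an assumption for illustration: the network attachment name, the RDMA resource name (set by whoever configures the device plugin), the image, and the helper `rdma_pod_manifest` itself.

```python
import json


def rdma_pod_manifest(name, image, network, rdma_resource, gpus=8):
    """Build a pod manifest requesting GPUs plus one RDMA device (a sketch)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Multus attaches the named secondary network to the pod.
            "annotations": {"k8s.v1.cni.cncf.io/networks": network},
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                # Registering RDMA memory pins pages, which commonly needs IPC_LOCK.
                "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
                "resources": {"limits": {"nvidia.com/gpu": gpus, rdma_resource: 1}},
            }],
        },
    }


manifest = rdma_pod_manifest(
    name="nccl-worker-0",
    image="registry.example.com/trainer:latest",  # hypothetical image
    network="sriov-rdma-net",                     # hypothetical attachment name
    rdma_resource="example.com/rdma_vf",          # hypothetical resource name
)
print(json.dumps(manifest, indent=2))
```

The real resource and network names depend on how the SR-IOV and RDMA device plugins are configured in the cluster, so treat the values here as placeholders to be replaced with whatever the cluster advertises.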

When to use, when to skip

Use it for multi-node distributed training (data-parallel, model-parallel, pipeline-parallel). At scale it's not optional, it's the standard: large multi-node GPU clusters are built around an RDMA fabric.

Skip it for single-node training, inference-only workloads, or managed cloud environments where the provider handles the network fabric. Also unnecessary for CPU-only workloads.

Heads up: RoCEv2 on standard Ethernet requires careful switch configuration (PFC, ECN, buffer tuning). Misconfigured lossless Ethernet is worse than TCP: you get pauses and deadlocks instead of graceful retransmits. Test thoroughly or use InfiniBand.

vs / alongside

Approach          | Latency | Throughput             | Note
RDMA (IB)         | ~1 µs   | 400 Gbps               | Gold standard for training
RDMA (RoCEv2)     | ~2–3 µs | 200 Gbps               | Ethernet-based, needs lossless config
TCP/IP            | ~30 µs  | 100 Gbps               | Fallback; usable for small-scale
NVLink / NVSwitch | sub-µs  | 900 GB/s (~7.2 Tbps)   | Intra-node GPU-to-GPU only; per-GPU figure for NVLink 4

Verified against NVIDIA networking docs and RDMA specifications, May 2026.

© cvam — written in plaintext, served warm