Debugging GPU & CUDA

GPU bugs split into four buckets: OOM (out of memory), underutilization (GPU idle/slow), multi-GPU comms (NCCL hangs), and environment (driver / CUDA / version mismatch). Identify the bucket from nvidia-smi + the error string before changing code.
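
The triage step can be sketched as a small string matcher. This is an illustration only: the substrings come from the error messages quoted in this guide, while the function name, bucket labels, and rule order are assumptions.

```python
import re

# Substrings are taken from error messages quoted in this guide.
# Bucket labels and rule order are illustrative assumptions.
BUCKET_PATTERNS = [
    ("oom", r"out of memory"),
    ("comms", r"nccl|watchdog"),
    ("environment",
     r"driver version is insufficient|no kernel image is available|libcudnn|libcublas"),
    ("hardware", r"\bxid\b|fallen off the bus"),
]


def classify_gpu_error(message: str) -> str:
    """Return the first bucket whose pattern appears in the error text."""
    text = message.lower()
    for bucket, pattern in BUCKET_PATTERNS:
        if re.search(pattern, text):
            return bucket
    # Underutilization produces no error string; read GPU-util in nvidia-smi instead.
    return "unknown"


print(classify_gpu_error("torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB"))
print(classify_gpu_error("CUDA driver version is insufficient for CUDA runtime version"))
```

An "unknown" result with low GPU-util in nvidia-smi points at the underutilization bucket.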

First moves

nvidia-smi                       # mem used, util, processes, temp, throttle
nvidia-smi dmon -s pucvmet       # rolling: power/temp, util, clocks, violations (throttle), mem, ECC, PCIe
nvidia-smi topo -m               # interconnect (NVLink/PCIe) for multi-GPU
python -c "import torch;print(torch.__version__,torch.version.cuda,torch.cuda.is_available())"

Async errors lie about the line number. CUDA kernels run asynchronously, so a stack trace often points at a later, unrelated line. Re-run with CUDA_LAUNCH_BLOCKING=1 to get the true failing op. For device-side asserts, TORCH_USE_CUDA_DSA is a build-time option: it only adds detail on a PyTorch build compiled with it.
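
A minimal sketch of re-running a command with the variable set in the child's environment, which is the safe place for it because it needs to be in place before CUDA initializes. The helper name is hypothetical, and the demo child process only echoes the variable so the sketch runs without a GPU.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(argv):
    """Re-run a command with CUDA_LAUNCH_BLOCKING=1 added to its environment."""
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(argv, env=env, capture_output=True, text=True)


# Demo command: a child Python process that only prints the variable.
# In practice argv would be something like [sys.executable, "x.py"].
result = run_with_blocking_launches(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
)
print(result.stdout.strip())
```

Blocking launches slow the run considerably, so use this for diagnosis and then unset it.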

CUDA out of memory

Symptom. torch.OutOfMemoryError: CUDA out of memory. Tried to allocate …

Causes & fixes.

Batch / sequence too big → lower batch, use gradient accumulation, shorten context.
Activations → enable gradient checkpointing; use mixed precision (BF16).
Fragmentation ("X free but can't allocate Y") → PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
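Cached/leaked tensors → don't hold references to graph-attached tensors (e.g. append loss.item(), not loss). torch.cuda.empty_cache() between phases returns cached blocks to the driver, but it cannot free tensors that are still referenced.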
Eval without no_grad → wrap inference in torch.no_grad()/inference_mode() (autograd graph eats memory).
Inference → KV cache is the hog: cap context, lower concurrency, quantize weights + KV.
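
To see why the KV cache dominates inference memory, a back-of-the-envelope calculation helps. The formula is the standard one (keys plus values, per layer, per token); the model dimensions below are illustrative assumptions for a 7B-class decoder without grouped-query attention, not measurements.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes for keys and values across all layers (the leading 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed dimensions: 32 layers, 32 KV heads, head dim 128, 16-bit cache.
per_seq = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1)
print(per_seq / 2**30)       # 2.0 GiB for one 4096-token sequence
print(16 * per_seq / 2**30)  # 32.0 GiB at 16 concurrent sequences

# Halving bytes_per_elem (8-bit KV) or seq_len halves the footprint.
print(kv_cache_bytes(32, 32, 128, 4096, 1, bytes_per_elem=1) / 2**30)  # 1.0
```

The cost is linear in context length, concurrency, and element size, which is why capping context, lowering concurrency, and quantizing the cache each help.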

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
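# PyTorch: see what's allocated. Call memory_summary() inside the training process;
# a fresh interpreter like the one-liner below has nothing allocated yet.
python -c "import torch;print(torch.cuda.memory_summary())"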
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
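
For logging memory from a script rather than watching the terminal, the same query can be parsed. A sketch assuming the csv,noheader,nounits layout (one "used, total" row in MiB per GPU); the function names and sample values are made up, and GPUs that report [N/A] are not handled.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Parse 'used, total' MiB rows into (used, total, fraction) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, used / total))
    return rows


def gpu_memory():
    """Query live GPUs; returns [] when nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)


# Sample text in the assumed layout (values are made up).
sample = "40536, 81920\n1024, 81920\n"
for index, (used, total, frac) in enumerate(parse_memory_csv(sample)):
    print(f"GPU {index}: {used}/{total} MiB ({frac:.0%})")
```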

"GPU memory not freed" after a crash. A dead or orphaned process can keep VRAM held: run nvidia-smi, find the PID, and kill -9 it. In notebooks, a kernel that still holds the device after an OOM is common; restart the kernel.

GPU underutilized / training slow

Symptom. Low or spiky GPU-util; throughput far below the hardware.

Low + spiky util → input pipeline starvation. Raise DataLoader num_workers and prefetch_factor, set pin_memory=True; move preprocessing off the hot path.
Host↔device syncs → remove .item()/.cpu()/print(loss) from the loop; log every N steps. Profile with nsys to see the gaps.
100% util but slow → memory-bound or no Tensor Cores. Enable TF32/BF16, torch.compile, FlashAttention; pad dims to multiples of 8/16. Confirm with ncu.
Tiny kernels → launch overhead; fuse ops or use CUDA graphs.
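
Before reaching for a profiler, a rough split of each step into "waiting on data" and "computing" can confirm input starvation. A sketch with stand-in workloads; the function name is hypothetical, and on a real GPU the step function would need to end with torch.cuda.synchronize() so asynchronous kernels are counted as compute.

```python
import time


def split_step_time(loader, step_fn, clock=time.perf_counter):
    """Return (seconds waiting on the loader, seconds inside step_fn)."""
    data_time = compute_time = 0.0
    batches = iter(loader)
    while True:
        t0 = clock()
        try:
            batch = next(batches)  # time spent here is input-pipeline wait
        except StopIteration:
            break
        t1 = clock()
        step_fn(batch)             # on a GPU, synchronize before returning
        t2 = clock()
        data_time += t1 - t0
        compute_time += t2 - t1
    return data_time, compute_time


# Stand-ins for a slow loader and a fast training step (hypothetical workload).
def slow_loader(n):
    for i in range(n):
        time.sleep(0.02)
        yield i


data_s, compute_s = split_step_time(slow_loader(5), lambda batch: time.sleep(0.005))
print(f"data wait is {data_s / (data_s + compute_s):.0%} of step time")
```

A large data-wait share points at the DataLoader settings; a small one points at the compute-side causes in the list. The added synchronization slows training, so keep it to diagnosis runs.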

Clock throttling (slow with no code change)

nvidia-smi -q -d CLOCK,PERFORMANCE   # 'Clocks Throttle Reasons' ('Clocks Event Reasons' on newer drivers)
nvidia-smi --query-gpu=clocks.sm,temperature.gpu,power.draw --format=csv -l 1

SW/HW thermal slowdown → cooling/airflow; clean dust; check inlet temp.
SW power cap → power limit too low (nvidia-smi -pl), PSU/rack budget.
Throttling looks like "compute-bound" but it's the hardware capping clocks — check this before optimizing kernels.
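
For scripted monitoring, the throttle reasons are also exposed as a hex bitmask (assumed here to come from the clocks_throttle_reasons.active query field; check nvidia-smi --help-query-gpu on your driver). The bit table below follows NVML's nvmlClocksThrottleReason* constants as best understood here, so treat it as an assumption and verify against your nvml.h.

```python
# Assumed bit layout, modeled on NVML's nvmlClocksThrottleReason* constants.
THROTTLE_BITS = {
    0x01: "gpu_idle",
    0x02: "applications_clocks_setting",
    0x04: "sw_power_cap",
    0x08: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
}


def decode_throttle_mask(mask_text):
    """Turn a hex bitmask string into a list of active reason names."""
    mask = int(mask_text.strip(), 16)
    return [name for bit, name in sorted(THROTTLE_BITS.items()) if mask & bit]


# Made-up sample: power cap and software thermal slowdown both active.
print(decode_throttle_mask("0x0000000000000024"))
# ['sw_power_cap', 'sw_thermal_slowdown']
```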

Multi-GPU: NCCL hangs & errors

Symptom. Training stalls at start or first all-reduce; NCCL timeout, watchdog hang, or unhandled system error.

export NCCL_DEBUG=INFO          # see ring/tree setup + where it stalls
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export NCCL_DEBUG_SUBSYS=ALL

Hang at init → mismatched world size / a rank crashed / firewall on the rendezvous port. Check every rank started and MASTER_ADDR/MASTER_PORT.
Collective mismatch → ranks call different ops or shapes (e.g. a conditional that runs on some ranks only). All ranks must hit the same collectives in the same order.
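Cross-node slow/hang → wrong interface; set NCCL_SOCKET_IFNAME to the NIC that routes between nodes. InfiniBand is used by default when present (NCCL_IB_DISABLE unset or 0); confirm the chosen transport in the NCCL_DEBUG=INFO output, and set NCCL_IB_DISABLE=1 only to test the socket fallback. PCIe-only is slow.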
P2P issues → try NCCL_P2P_DISABLE=1 to isolate (slower but proves the cause).
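
The "same collectives in the same order" rule can be illustrated without a cluster. The sketch below compares per-rank call traces and reports the first step where they disagree; the trace format and function name are hypothetical, and this is a conceptual model rather than how NCCL or PyTorch check it.

```python
def first_divergence(traces):
    """traces: {rank: [(op, shape), ...]}.

    Return (step, {rank: call}) at the first mismatch, or None if all ranks agree.
    """
    longest = max(len(calls) for calls in traces.values())
    for step in range(longest):
        at_step = {rank: (calls[step] if step < len(calls) else None)
                   for rank, calls in traces.items()}
        if len(set(at_step.values())) > 1:
            return step, at_step
    return None


# Hypothetical traces: rank 0 takes a branch that issues an extra broadcast.
traces = {
    0: [("all_reduce", (1024,)), ("broadcast", (1,)), ("all_reduce", (1024,))],
    1: [("all_reduce", (1024,)), ("all_reduce", (1024,))],
}
print(first_divergence(traces))
# (1, {0: ('broadcast', (1,)), 1: ('all_reduce', (1024,))})
```

In a real job, rank 0 would block in the broadcast while rank 1 blocks in the all_reduce, and both would sit there until the timeout.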

One rank dies, the rest hang. If a single rank throws (OOM, bad batch) mid-collective, the others wait inside the collective until the NCCL timeout. The real error is in that rank's log: grep all rank logs for the first traceback, not the timeout.
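
A sketch of that log search: keep only ranks whose traceback ends in something other than a timeout. The function name, the timeout keywords, and the sample log text are all assumptions for illustration; real NCCL and PyTorch messages vary by version.

```python
import re

# Assumed keywords that mark a secondary (timeout) failure.
TIMEOUT_HINTS = re.compile(r"timeout|timed out|watchdog", re.IGNORECASE)


def root_cause_ranks(logs):
    """logs: {rank: log text}. Return {rank: exception line} for ranks whose
    first traceback does not look like a timeout."""
    suspects = {}
    for rank, text in logs.items():
        start = text.find("Traceback (most recent call last):")
        if start == -1:
            continue
        # The exception line is the first unindented line after the header.
        lines = text[start:].splitlines()[1:]
        error = next((ln for ln in lines if ln and not ln.startswith(" ")), "")
        if error and not TIMEOUT_HINTS.search(error):
            suspects[rank] = error
    return suspects


# Made-up log excerpts for two ranks.
logs = {
    0: ("step 41 ok\nTraceback (most recent call last):\n"
        '  File "train.py", line 88, in <module>\n'
        "RuntimeError: NCCL watchdog timeout\n"),
    3: ("step 41 ok\nTraceback (most recent call last):\n"
        '  File "train.py", line 88, in <module>\n'
        "torch.OutOfMemoryError: CUDA out of memory.\n"),
}
print(root_cause_ranks(logs))
# {3: 'torch.OutOfMemoryError: CUDA out of memory.'}
```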

Environment & driver mismatch

CUDA driver version is insufficient for CUDA runtime → driver older than the CUDA toolkit the build needs. Update driver or use a matching container.
no kernel image is available → the build doesn't include your GPU's compute capability (sm_XX). Use a wheel/image built for your arch.
libcudnn / libcublas not found → missing/mismatched CUDA libs; prefer the official PyTorch/CUDA container.
Multi-GPU visibility → CUDA_VISIBLE_DEVICES set wrong hides or reorders GPUs.
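
How CUDA_VISIBLE_DEVICES hides and reorders GPUs can be shown with a small mapping function. A sketch for integer indices only (UUID entries are not handled); the function name is hypothetical. Note that "physical" here means CUDA's enumeration order, which can differ from nvidia-smi's numbering unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.

```python
def visible_device_map(cuda_visible_devices):
    """Map logical index -> physical GPU index for a comma-separated integer list."""
    mapping = {}
    for entry in cuda_visible_devices.split(","):
        entry = entry.strip()
        if not entry.isdigit():
            break  # an invalid entry hides itself and everything after it
        mapping[len(mapping)] = int(entry)
    return mapping


print(visible_device_map("2,0"))     # {0: 2, 1: 0}: cuda:0 is physical GPU 2
print(visible_device_map("1,-1,3"))  # {0: 1}: only GPU 1 stays visible
print(visible_device_map(""))        # {}: an empty value hides every GPU
```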

Hardware faults (Xid errors)

dmesg -T | grep -i xid          # GPU hardware/driver faults
nvidia-smi -q | grep -i -A2 "ecc\|retired"   # ECC errors, retired pages

Xid 13/31/43/45 often = app/memory fault; Xid 48/63/64/79 = ECC / fallen-off-bus / hardware. Repeated uncorrectable ECC or "GPU has fallen off the bus" → drain the node and replace the card.
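
The rule of thumb above can be applied to kernel log text with a short parser. A sketch: the Xid groupings are the ones listed in this section, while the regex, function name, and sample lines are assumptions modeled on typical NVRM messages.

```python
import re

# Groupings follow the rule of thumb in this section.
APP_XIDS = {13, 31, 43, 45}
HARDWARE_XIDS = {48, 63, 64, 79}
XID_LINE = re.compile(r"Xid \(([^)]*)\): (\d+)")


def classify_xids(dmesg_text):
    """Return (device, xid, bucket) for each Xid line in kernel log text."""
    events = []
    for device, xid_text in XID_LINE.findall(dmesg_text):
        xid = int(xid_text)
        if xid in HARDWARE_XIDS:
            bucket = "hardware"
        elif xid in APP_XIDS:
            bucket = "app/memory"
        else:
            bucket = "look up"  # not covered by the rule of thumb
        events.append((device, xid, bucket))
    return events


# Sample lines shaped like typical NVRM messages (contents are illustrative).
sample = (
    "NVRM: Xid (PCI:0000:3b:00): 31, pid=4121, name=python, MMU Fault\n"
    "NVRM: Xid (PCI:0000:af:00): 79, GPU has fallen off the bus.\n"
)
print(classify_xids(sample))
```

Feed it the output of dmesg -T; repeated "hardware" entries on the same device are the signal to drain the node.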

Quick reference

nvidia-smi ; nvidia-smi dmon -s pucvmet ; nvidia-smi topo -m
CUDA_LAUNCH_BLOCKING=1 python x.py        # real failing line
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # frag OOM
NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL     # comms
nvidia-smi -q -d CLOCK,PERFORMANCE        # throttle reasons
dmesg -T | grep -i xid                    # hardware faults
python -c "import torch;print(torch.cuda.memory_summary())"   # empty in a fresh process; call it in-process
