GPU bugs split into four buckets: OOM (out of memory), underutilization
(GPU idle/slow), multi-GPU comms (NCCL hangs), and environment (driver /
CUDA / version mismatch). Identify the bucket from nvidia-smi + the error string before
changing code.
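As a rough sketch of that triage step, the snippet below maps an error string to a bucket by keyword. The function name and keyword lists are assumptions for illustration, taken from the error strings quoted in this guide; they are not exhaustive, and underutilization usually produces no error string at all.

```python
# Hypothetical triage helper: the keyword lists are illustrative, not exhaustive.
BUCKET_KEYWORDS = {
    "oom": ("out of memory",),
    "multi-gpu comms": ("nccl", "watchdog", "unhandled system error"),
    "environment": (
        "driver version is insufficient",
        "no kernel image is available",
        "libcudnn",
        "libcublas",
    ),
}


def classify_gpu_error(message: str) -> str:
    """Return the first bucket whose keyword appears in the error string."""
    text = message.lower()
    for bucket, keywords in BUCKET_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return bucket
    # No match: underutilization rarely raises, so look at nvidia-smi instead.
    return "unknown (check nvidia-smi for utilization/throttling)"


if __name__ == "__main__":
    print(classify_gpu_error("torch.OutOfMemoryError: CUDA out of memory."))
```

A match only narrows the search; confirm it against nvidia-smi before acting on it.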
First moves
    nvidia-smi                    # mem used, util, processes, temp, throttle
    nvidia-smi dmon -s pucvmet    # rolling: power, util, mem, clocks, throttle
    nvidia-smi topo -m            # interconnect (NVLink/PCIe) for multi-GPU
    python -c "import torch;print(torch.__version__,torch.version.cuda,torch.cuda.is_available())"
Async errors lie about the line number
CUDA kernels run asynchronously, so a stack trace often points at the wrong line. Re-run with
CUDA_LAUNCH_BLOCKING=1 to get the true failing op. For device-side asserts, TORCH_USE_CUDA_DSA=1
adds detail, but it is a build-time option: it only takes effect in a PyTorch build compiled with it.
CUDA out of memory
Symptom. torch.OutOfMemoryError: CUDA out of memory. Tried to allocate …
Causes & fixes.
- Batch / sequence too big → lower batch, use gradient accumulation, shorten context.
- Activations → enable gradient checkpointing; use mixed precision (BF16).
- Fragmentation ("X free but can't allocate Y") → PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
- Cached/leaked tensors → don't hold references (e.g. appending loss, not loss.item()); call torch.cuda.empty_cache() between phases.
- Eval without no_grad → wrap inference in torch.no_grad() / inference_mode() (autograd graph eats memory).
- Inference → KV cache is the hog: cap context, lower concurrency, quantize weights + KV.
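To see why the KV cache dominates inference memory, here is a back-of-the-envelope estimate. It assumes a plain dense cache (one K and one V tensor per layer, no paging overhead); the model shape is hypothetical and chosen only to make the arithmetic concrete.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Size of a dense KV cache: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, not taken from any specific model.
shape = dict(layers=32, kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(**shape, seq_len=8192, batch=16, bytes_per_elem=2)
int8 = kv_cache_bytes(**shape, seq_len=8192, batch=16, bytes_per_elem=1)
half_ctx = kv_cache_bytes(**shape, seq_len=4096, batch=16, bytes_per_elem=2)

for name, size in [("fp16", fp16), ("8-bit KV", int8), ("half context", half_ctx)]:
    print(f"{name}: {size / 2**30:.1f} GiB")
```

Under these assumptions the cache is 16 GiB at fp16. Halving the context, the concurrency (batch), or the bytes per element each cuts it to 8 GiB, which is why those three levers appear in the bullet above.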
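    nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
    # PyTorch: see what's allocated. memory_summary() is per-process, so call it
    # from inside the training script; the one-liner only shows the call.
    python -c "import torch;print(torch.cuda.memory_summary())"
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True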
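If you want to watch memory from a script rather than by eye, the query output is easy to parse. A minimal sketch, assuming the csv,noheader,nounits variant of the format flag so each row is just "used, total" in MiB; the function name and the 90% threshold are illustrative choices, and the sample rows are made up.

```python
import csv
import io


def parse_memory_csv(text):
    """Parse `--format=csv,noheader,nounits` rows of 'used, total' in MiB."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        used, total = int(fields[0]), int(fields[1])
        rows.append({"used_mib": used, "total_mib": total, "frac": used / total})
    return rows


# Made-up sample. On a real host, capture the text with something like:
# subprocess.run(["nvidia-smi", "--query-gpu=memory.used,memory.total",
#                 "--format=csv,noheader,nounits"],
#                capture_output=True, text=True).stdout
sample = "78000, 81920\n1024, 81920\n"

for index, row in enumerate(parse_memory_csv(sample)):
    flag = "  <-- near full" if row["frac"] > 0.9 else ""
    print(f"GPU {index}: {row['used_mib']}/{row['total_mib']} MiB{flag}")
```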
"GPU memory not freed" after a crash
A dead process can leave VRAM held: nvidia-smi → find the PID → kill -9.
A zombie holding the device after OOM is common in notebooks; restart the kernel.
GPU underutilized / training slow
Symptom. Low or spiky GPU-util; throughput far below the hardware.
- Low + spiky util → input pipeline starvation. Raise DataLoader num_workers, pin_memory=True, prefetch_factor; move preprocessing off the hot path.
- Host↔device syncs → remove .item() / .cpu() / print(loss) from the loop; log every N steps. Profile with nsys to see the gaps.
- 100% util but slow → memory-bound or no Tensor Cores. Enable TF32/BF16, torch.compile, FlashAttention; pad dims to multiples of 8/16. Confirm with ncu.
- Tiny kernels → launch overhead; fuse ops or use CUDA graphs.
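Before reaching for a profiler, a crude timer can show whether the loop is waiting on data or on compute. The sketch below splits wall time between fetching a batch and running the step. The loader and step are stand-ins (hypothetical), and with a real CUDA step you would need to synchronize the device (e.g. torch.cuda.synchronize()) inside train_step for the compute figure to mean anything, since kernels launch asynchronously.

```python
import time


def profile_steps(loader, train_step, max_steps=50):
    """Split wall time into 'waiting for the next batch' vs 'running the step'."""
    data_wait = compute = 0.0
    batches = iter(loader)
    for _ in range(max_steps):
        start = time.perf_counter()
        try:
            batch = next(batches)
        except StopIteration:
            break
        fetched = time.perf_counter()
        train_step(batch)
        done = time.perf_counter()
        data_wait += fetched - start
        compute += done - fetched
    return data_wait, compute


# Stand-ins for a real DataLoader and training step (both hypothetical).
def slow_loader(n=5):
    for i in range(n):
        time.sleep(0.02)  # pretend decoding/augmentation is slow
        yield i


def fast_step(batch):
    time.sleep(0.002)  # pretend the GPU work is quick


data_wait, compute = profile_steps(slow_loader(), fast_step)
print(f"data wait {data_wait:.3f}s vs compute {compute:.3f}s")
```

If data wait dominates, fix the input pipeline first; kernel-level tuning will not help a starved GPU.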
Clock throttling (slow with no code change)
    nvidia-smi -q -d CLOCK,PERFORMANCE    # 'Clocks Throttle Reasons'
    nvidia-smi --query-gpu=clocks.sm,temperature.gpu,power.draw --format=csv -l 1
- SW/HW thermal slowdown → cooling/airflow; clean dust; check inlet temp.
- SW power cap → power limit too low (nvidia-smi -pl), PSU/rack budget.
- Throttling looks like "compute-bound" but it's the hardware capping clocks; check this before optimizing kernels.
Multi-GPU: NCCL hangs & errors
Symptom. Training stalls at start or first all-reduce; NCCL timeout,
watchdog hang, or unhandled system error.
    export NCCL_DEBUG=INFO                  # see ring/tree setup + where it stalls
    export TORCH_DISTRIBUTED_DEBUG=DETAIL
    export NCCL_DEBUG_SUBSYS=ALL
- Hang at init → mismatched world size / a rank crashed / firewall on the rendezvous port. Check every rank started and MASTER_ADDR/MASTER_PORT.
- Collective mismatch → ranks call different ops or shapes (e.g. a conditional that runs on some ranks only). All ranks must hit the same collectives in the same order.
- Cross-node slow/hang → wrong interface; set NCCL_SOCKET_IFNAME; make sure InfiniBand is enabled (NCCL_IB_DISABLE=0, the default). PCIe-only is slow.
- P2P issues → try NCCL_P2P_DISABLE=1 to isolate (slower but proves the cause).
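For the hang-at-init case, a pre-flight check on each rank can catch the obvious problems before NCCL is involved. A minimal sketch, assuming the env-var rendezvous (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE); the function name is hypothetical, and the connection test only succeeds once rank 0 is already listening.

```python
import os
import socket


def check_rendezvous(env=os.environ, timeout=2.0):
    """Return a list of problems with the env-var rendezvous settings."""
    problems = []
    for name in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    if problems:
        return problems

    rank, world_size = int(env["RANK"]), int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        problems.append(f"RANK={rank} is outside [0, {world_size})")

    # Rank 0 hosts the store, so only the other ranks test the connection.
    if rank != 0:
        address = (env["MASTER_ADDR"], int(env["MASTER_PORT"]))
        try:
            socket.create_connection(address, timeout=timeout).close()
        except OSError as error:
            problems.append(f"cannot reach {address[0]}:{address[1]} ({error})")
    return problems


if __name__ == "__main__":
    for problem in check_rendezvous() or ["rendezvous settings look sane"]:
        print(problem)
```

A "cannot reach" result on a non-zero rank points at the firewall, the wrong address, or rank 0 not having started yet.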
One rank dies, the rest hang
If a single rank throws (OOM, bad batch) mid-collective, the others wait at the barrier until the
NCCL timeout. The real error is in that rank's log: grep all rank logs for the first
traceback, not the timeout.
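That search can be scripted. The sketch below takes each rank's log text and returns the ranks whose first failure marker is a traceback rather than a timeout. The marker patterns and the log excerpts are made up for illustration; real logs vary, so adjust the patterns to what your launcher prints.

```python
import re

# First line in a log that looks like either a Python traceback or a timeout.
MARKER = re.compile(
    r"Traceback \(most recent call last\)|timeout|timed out", re.IGNORECASE
)


def root_cause_ranks(logs):
    """Return ranks whose first failure marker is a traceback, not a timeout.

    `logs` maps a rank id to that rank's full log text.
    """
    suspects = []
    for rank, text in sorted(logs.items()):
        first = next((line for line in text.splitlines() if MARKER.search(line)), "")
        if "traceback" in first.lower():
            suspects.append(rank)
    return suspects


# Made-up excerpts; a real job would read these from per-rank log files.
logs = {
    0: "step 41 ok\nWatchdog caught collective operation timeout\n",
    1: "step 41 ok\nTraceback (most recent call last):\n"
       "torch.OutOfMemoryError: CUDA out of memory.\n",
    2: "step 41 ok\nWatchdog caught collective operation timeout\n",
}
print(root_cause_ranks(logs))  # [1]
```

Ranks that timed out usually print a traceback too, but only after the timeout line, which is why the first marker is what matters.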
Environment & driver mismatch
- CUDA driver version is insufficient for CUDA runtime → driver older than the CUDA toolkit the build needs. Update driver or use a matching container.
- no kernel image is available → the build doesn't include your GPU's compute capability (sm_XX). Use a wheel/image built for your arch.
- libcudnn / libcublas not found → missing/mismatched CUDA libs; prefer the official PyTorch/CUDA container.
- Multi-GPU visibility → CUDA_VISIBLE_DEVICES set wrong hides or reorders GPUs.
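The hide-and-reorder effect of CUDA_VISIBLE_DEVICES is easiest to see as a mapping from the in-process index (cuda:N) to the entry in the variable. A small sketch; the function name is hypothetical, and it only handles comma-separated integer indices, not UUID entries.

```python
def visible_device_map(cuda_visible_devices):
    """Map the in-process index (cuda:N) to the entry in CUDA_VISIBLE_DEVICES.

    Only handles a comma-separated list of integer indices; UUID entries and
    the CUDA_DEVICE_ORDER setting are out of scope for this sketch.
    """
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return {logical: int(physical) for logical, physical in enumerate(ids)}


print(visible_device_map("2,0"))  # {0: 2, 1: 0}: cuda:0 is device 2
print(visible_device_map("3"))    # {0: 3}: one GPU visible, exposed as cuda:0
```

The indices in the variable follow CUDA's device ordering, which can differ from nvidia-smi's PCI-bus ordering unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.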
Hardware faults (Xid errors)
    dmesg -T | grep -i xid                        # GPU hardware/driver faults
    nvidia-smi -q | grep -i -A2 "ecc\|retired"    # ECC errors, retired pages
Xid 13/31/43/45 often = app/memory fault; Xid 48/63/64/79 = ECC / fallen-off-bus / hardware. Repeated uncorrectable ECC or "GPU has fallen off the bus" → drain the node and replace the card.
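The grouping above can be applied to dmesg output mechanically. A minimal sketch, assuming the usual "NVRM: Xid (PCI:...): code, ..." line shape. The function name and verdict strings are illustrative, the sample lines are made up, and the code sets are only the rough split from this guide, not a full Xid table.

```python
import re

# Grouping taken from the text above; a rough triage, not a full Xid table.
APP_XIDS = {13, 31, 43, 45}
HARDWARE_XIDS = {48, 63, 64, 79}

XID_PATTERN = re.compile(r"Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def triage_xids(dmesg_text):
    """Return (device, code, verdict) for each Xid line found in dmesg output."""
    findings = []
    for match in XID_PATTERN.finditer(dmesg_text):
        code = int(match.group("code"))
        if code in HARDWARE_XIDS:
            verdict = "hardware/ECC suspect"
        elif code in APP_XIDS:
            verdict = "likely app/memory fault"
        else:
            verdict = "look up in NVIDIA's Xid documentation"
        findings.append((match.group("device"), code, verdict))
    return findings


# Made-up sample lines in the assumed shape.
sample = (
    "NVRM: Xid (PCI:0000:3b:00): 31, pid=4242, name=python\n"
    "NVRM: Xid (PCI:0000:af:00): 79, GPU has fallen off the bus.\n"
)
for finding in triage_xids(sample):
    print(finding)
```

A single app-side Xid after a crash is usually not worth chasing; repeated hardware-side codes on the same device are.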
Quick reference
    nvidia-smi ; nvidia-smi dmon -s pucvmet ; nvidia-smi topo -m
    CUDA_LAUNCH_BLOCKING=1 python x.py                  # real failing line
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True    # frag OOM
    NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL      # comms
    nvidia-smi -q -d CLOCK,PERFORMANCE                  # throttle reasons
    dmesg -T | grep -i xid                              # hardware faults
    python -c "import torch;print(torch.cuda.memory_summary())"