Almost every "slow GPU" reduces to one of three diagnoses: you're memory-bandwidth bound (waiting on HBM, not math), you're starving the GPU (data pipeline / CPU / host-device syncs leave it idle), or you never enabled the fast path (Tensor Cores, fused kernels, the right dtype). This sheet goes layer by layer — hardware, memory, precision, kernels, the data pipeline, multi-GPU, and the profilers that tell you which diagnosis is true. Measure first; optimize the thing the profiler points at, never your guess.
1. GPU hardware model
A GPU is a throughput machine. It hides latency with massive parallelism instead of big caches and out-of-order execution like a CPU.
- SM (Streaming Multiprocessor) — the basic compute block. A data-center GPU has 100+ SMs. Each holds many CUDA cores, Tensor Cores, register file, shared memory, warp schedulers.
- Warp — 32 threads executed in lockstep (SIMT). The scheduler issues one instruction for the whole warp. If threads in a warp take different branches (warp divergence), both paths run serially — wasted lanes.
- Thread block: group of warps that run on one SM and can share shared memory and synchronize (__syncthreads()). The grid is all blocks of a kernel launch.
- Latency hiding: when one warp stalls on a memory load, the scheduler swaps in another ready warp. Enough resident warps (high occupancy) → the memory latency is hidden and the SM stays busy.
- CUDA core vs Tensor Core — CUDA cores do scalar/vector FP32/INT ops; Tensor Cores do small matrix-multiply-accumulate tiles per instruction and deliver the bulk of DL FLOPs at low precision.
Why it matters: you optimize by keeping SMs fed (enough parallel work, enough warps, no stalls) — not by making any single thread fast.
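The warp-divergence point can be made concrete with a toy cost model. This is a minimal sketch in plain Python, not a hardware simulator: the function name `divergent_branch_cost` and the cycle costs are hypothetical, and it assumes each branch path is issued once for the whole warp whenever at least one lane takes it.

```python
WARP_SIZE = 32

def divergent_branch_cost(takes_if, if_cost, else_cost):
    """Return (issue_cycles, lane_utilization) for one warp running an if/else.

    takes_if: list of 32 booleans, one per lane.
    A path is issued for the whole warp if at least one lane takes it;
    lanes on the other path are masked off and do no useful work.
    """
    assert len(takes_if) == WARP_SIZE
    n_if = sum(takes_if)
    n_else = WARP_SIZE - n_if
    cycles = (if_cost if n_if else 0) + (else_cost if n_else else 0)
    useful_lane_cycles = n_if * if_cost + n_else * else_cost
    utilization = useful_lane_cycles / (cycles * WARP_SIZE) if cycles else 1.0
    return cycles, utilization

# All lanes agree: one path is issued, every lane is busy.
uniform = divergent_branch_cost([True] * WARP_SIZE, if_cost=10, else_cost=10)
# Even/odd lanes disagree: both paths are issued back to back.
split = divergent_branch_cost([lane % 2 == 0 for lane in range(WARP_SIZE)],
                              if_cost=10, else_cost=10)
print("uniform:", uniform)
print("split:  ", split)
```

In this model the split warp takes twice as long and keeps only half of its lanes busy, which is the "wasted lanes" effect described above.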
2. Memory hierarchy & bandwidth
| Tier | Scope | Speed / size | Notes |
|---|---|---|---|
| Registers | Per-thread | Fastest, tiny | Spilling → "local memory" (really global) tanks perf. |
| Shared / L1 | Per-SM | Very fast, 10s–100s of KB | Software-managed scratchpad. The lever for tiling/reuse. |
| L2 cache | Chip-wide | Fast, a few–tens MB | Shared by all SMs. |
| Global (HBM) | Device | High BW, high latency, 10s–100s GB | The bottleneck for most DL kernels. |
| Host (system RAM) | CPU | Crosses PCIe/NVLink-C2C | Transfers are slow; overlap or avoid. |
- Coalescing — threads in a warp should touch contiguous addresses so the hardware merges them into a few wide transactions. Strided / scattered / misaligned access multiplies the number of transactions → wasted bandwidth.
- Bank conflicts — shared memory is split into banks; if threads in a warp hit the same bank with different addresses, accesses serialize. Pad shared arrays to avoid it.
- Tiling — load a block of data into shared memory once, reuse it across many threads, write back once. This is how a good GEMM turns a memory problem into a compute problem.
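Coalescing and bank conflicts both come down to counting how many distinct hardware units one warp touches. The sketch below does that counting in plain Python under simplifying assumptions that are mine, not the sheet's: 128-byte global-memory segments, 32 shared-memory banks of 4-byte words, and 4-byte (float32) elements. The helper names are hypothetical.

```python
def memory_transactions(byte_addresses, segment_bytes=128):
    """Count the distinct aligned segments touched by one warp's loads."""
    return len({addr // segment_bytes for addr in byte_addresses})

def bank_conflict_degree(word_indices, num_banks=32):
    """Worst-case number of distinct words that map to a single bank.

    1 means conflict-free; k means the access is replayed k times.
    Lanes reading the same word are a broadcast, not a conflict.
    """
    words_per_bank = {}
    for word in set(word_indices):
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())

lanes = range(32)

# Global memory: float32 loads with stride 1 vs stride 32 elements.
contiguous = memory_transactions([4 * lane for lane in lanes])
strided = memory_transactions([4 * 32 * lane for lane in lanes])

# Shared memory: reading one column of a 32x32 float tile (row stride 32 words),
# then the same read with each row padded to 33 words.
column_unpadded = bank_conflict_degree([lane * 32 for lane in lanes])
column_padded = bank_conflict_degree([lane * 33 for lane in lanes])

print(contiguous, strided, column_unpadded, column_padded)
```

Under these assumptions the contiguous load needs one segment while the strided load needs 32, and padding the tile by one word per row turns a 32-way bank conflict into a conflict-free access.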
3. The roofline: memory-bound vs compute-bound
- Arithmetic intensity (AI) = FLOPs performed ÷ bytes moved from HBM. It's the x-axis of the roofline model.
- Low AI → memory-bound: you finish the math long before the data arrives. Examples: elementwise add, activation (GELU/ReLU), layernorm/RMSNorm, softmax, dropout, embedding lookup, attention.
- High AI → compute-bound: enough math per byte to saturate the ALUs. Examples: large GEMMs (the QKV/MLP projections), big convolutions.
- The roofline: achievable FLOP/s = min(peak FLOP/s, AI × peak bandwidth). Below the ridge point (AI = peak FLOP/s ÷ peak bandwidth) you're bandwidth-limited; adding peak compute does nothing.
Most ML ops are memory-bound
Attention, norms, activations, residual adds, elementwise scaling: all low AI, all bandwidth-bound. That's exactly why fusion (do more work per byte loaded) and FlashAttention (never write the N×N matrix to HBM) are the big wins. Adding FLOPs to a memory-bound kernel changes nothing; you must cut bytes moved.
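The roofline arithmetic is easy to check by hand. The sketch below computes arithmetic intensity for an elementwise add and for an ideal GEMM, then applies the min() formula. The peak numbers are illustrative placeholders, not the specs of any particular GPU, and the GEMM estimate assumes each matrix crosses HBM exactly once (perfect tiling).

```python
def attainable_flops(ai, peak_flops, peak_bw):
    """Roofline: min(peak FLOP/s, arithmetic intensity x peak bytes/s)."""
    return min(peak_flops, ai * peak_bw)

def elementwise_add_ai(bytes_per_elem):
    """c = a + b: 1 FLOP per element, 2 reads + 1 write."""
    return 1 / (3 * bytes_per_elem)

def gemm_ai(m, n, k, bytes_per_elem):
    """Ideal GEMM: 2*m*n*k FLOPs; A and B read once, C written once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

PEAK_FLOPS = 1.0e15  # hypothetical: 1 PFLOP/s of low-precision matmul
PEAK_BW = 3.0e12     # hypothetical: 3 TB/s of HBM bandwidth
ridge = PEAK_FLOPS / PEAK_BW  # FLOPs per byte needed to become compute-bound

add_ai = elementwise_add_ai(bytes_per_elem=2)        # BF16
mm_ai = gemm_ai(4096, 4096, 4096, bytes_per_elem=2)  # BF16

for name, ai in (("add", add_ai), ("gemm", mm_ai)):
    frac = attainable_flops(ai, PEAK_FLOPS, PEAK_BW) / PEAK_FLOPS
    bound = "compute" if ai >= ridge else "memory"
    print(f"{name}: AI={ai:.2f} FLOPs/byte, {bound}-bound, {frac:.4%} of peak")
```

With these placeholder peaks the add lands far below the ridge point and can reach only a tiny fraction of peak FLOP/s, while the large GEMM sits above it.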
4. Numeric precision — pick the right dtype
| dtype | Bits (E/M) | When / why |
|---|---|---|
| FP32 | 8/23 | Full precision, slow, 4 bytes. Master weights, reductions, loss. Rarely needed for the whole net. |
| TF32 | 8/10 | Ampere+ matmul mode: FP32 range, reduced mantissa. ~FP32 accuracy, big speedup. Opt-in. |
| FP16 | 5/10 | 2 bytes, Tensor Cores. Narrow exponent range → underflow/overflow → needs loss scaling. |
| BF16 | 8/7 | 2 bytes, FP32's exponent range (no loss scaling), fewer mantissa bits. Default for modern training. |
| FP8 (E4M3 / E5M2) | 4/3, 5/2 | 1 byte. Hopper/Ada/Blackwell. E4M3 for fwd, E5M2 for grads. Needs per-tensor/block scaling. Max throughput, most fragile. |
| INT8 / INT4 | — | Inference quantization (weights and/or activations). Memory + bandwidth win; needs calibration. |
import torch
# Turn on the fast matmul paths (Ampere+)
torch.set_float32_matmul_precision("high")  # TF32 for fp32 matmuls
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
dtype = torch.bfloat16  # switch to torch.float16 on GPUs without BF16 support
# Autocast: ops run in low precision where safe, FP32 where needed
with torch.autocast("cuda", dtype=dtype):
    logits = model(x)
    loss = loss_fn(logits, y)
# FP16 path needs a GradScaler; BF16 usually does not
# (recent PyTorch releases spell this torch.amp.GradScaler("cuda", ...))
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == torch.float16))
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
Mantissa vs exponent intuition: exponent bits = dynamic range (how big/small); mantissa bits = precision (how fine). BF16 keeps range (stable) but is coarse; FP16 is finer but can overflow — hence loss scaling. Reductions (sums of many terms) should accumulate in FP32 even when inputs are BF16.
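The range-versus-precision trade-off follows directly from the bit counts in the table. The sketch below derives the largest finite value and the spacing near 1.0 for the IEEE-style formats, then uses the standard library's half-precision packing to show a tiny gradient underflowing in FP16 and surviving once it is scaled. FP8 E4M3 is left out because it does not follow the IEEE convention for the top exponent. The helper names and the 1e-8 gradient are illustrative; 2**16 is used only as an example scale factor.

```python
import struct

def max_normal(exp_bits, man_bits):
    """Largest finite value of an IEEE-style format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias

def spacing_near_one(man_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -man_bits

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

FORMATS = {"FP32": (8, 23), "TF32": (8, 10), "FP16": (5, 10), "BF16": (8, 7)}
for name, (e, m) in FORMATS.items():
    print(f"{name}: max ~ {max_normal(e, m):.3e}, spacing near 1.0 = {spacing_near_one(m):.1e}")

# Loss scaling in miniature: a gradient too small for FP16 survives once scaled.
tiny_grad = 1e-8
lost = to_fp16(tiny_grad)                        # flushes to zero
scale = 2.0 ** 16
recovered = to_fp16(tiny_grad * scale) / scale   # unscale in higher precision
print(lost, recovered)
```

BF16 and FP32 share the same exponent width, so their maxima are of the same order, while FP16 tops out at 65504; that gap is the reason FP16 training needs the scaler and BF16 usually does not.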
5. Tensor Cores — the fast path
- Dedicated units doing a small matrix multiply-accumulate (e.g. 16×16 tiles) per instruction. They deliver an order of magnitude more throughput than CUDA cores for GEMMs.
- They engage only with (a) a supported low precision (FP16/BF16/FP8/INT8/TF32) and (b) friendly shapes. Keep matmul dims multiples of 8 (FP16/BF16) or 16 (INT8/FP8): pad vocab size, hidden dim, sequence, batch.
- Verify they're used: Nsight Compute "Tensor" pipe utilization, or torch profiler kernel names containing s16816gemm, _tensor_, hmma, imma.
Odd dimensions silently kill Tensor Cores
A vocab of 50257 or a hidden dim not divisible by 8 forces a slow fallback or wastes lanes. Pad to a multiple of 8/16, or go further to a multiple of 64 (e.g. vocab 50257 → 50304). Free speedup, no accuracy change.
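Padding a dimension is a one-line rounding operation. The helper below is a hypothetical utility (not a PyTorch API) that rounds a size up to the next multiple and shows what the GPT-2 vocab size of 50257 becomes for a few alignments.

```python
def pad_to_multiple(n, multiple):
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

vocab = 50257
padded = {m: pad_to_multiple(vocab, m) for m in (8, 16, 64)}
for m, size in padded.items():
    print(f"multiple of {m:>2}: {size} ({size - vocab} unused rows)")
```

The commonly quoted 50304 is the next multiple of 64; the extra embedding rows are never indexed by real tokens, so they cost a little memory and nothing else.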
6. Kernel fusion & launch overhead
- Each kernel launch costs a few microseconds of CPU + driver overhead and a round trip of intermediate tensors to HBM. Thousands of tiny ops = death by a thousand launches and a thousand HBM round-trips.
- torch.compile(model): TorchInductor traces the graph, fuses pointwise chains, picks better kernels (often Triton), reduces launches. First thing to try; use mode="max-autotune" for inference.
- FlashAttention: fused, IO-aware attention. Tiles Q/K/V in SRAM, computes softmax online, never materializes the N×N score matrix in HBM → memory drops from O(N²) to O(N), and it's faster. Enables long context.
- CUDA Graphs — capture a fixed sequence of launches once, replay as a single graph launch. Removes per-kernel CPU overhead; big for small-batch inference / RL rollouts where launch overhead dominates.
- Fused kernels: fused AdamW (fused=True), fused LayerNorm/RMSNorm, fused dropout+add+norm (apex / Triton / xFormers).
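model = torch.compile(model)  # fuse + autotune
# inference-only graph capture (static shapes); static_in is a preallocated CUDA tensor,
# and a few warm-up runs should precede capture
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)
static_in.copy_(x)  # refresh the captured input buffer in place
g.replay()  # replays all launches as one; the result lands in static_out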
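The "computes softmax online" step is the part of FlashAttention that lets it process scores block by block. The sketch below shows only that normalizer trick in plain Python: a running max and a running sum that are rescaled whenever a new block raises the max. It is not the FlashAttention kernel itself, which also rescales a running output accumulator and works on tiles in on-chip memory; the function names and the block size are illustrative.

```python
import math

def softmax_reference(scores):
    """Standard two-pass softmax that needs every score at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def online_softmax_stats(scores, block_size):
    """Stream over blocks, keeping only a running max and a running sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new max, then add this block's terms.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + sum(math.exp(s - new_max) for s in block))
        running_max = new_max
    return running_max, running_sum

scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, 4.0, -2.5]
m, l = online_softmax_stats(scores, block_size=3)
online = [math.exp(s - m) / l for s in scores]
reference = softmax_reference(scores)
print(max(abs(a - b) for a, b in zip(online, reference)))
```

Because the statistics are exact after the last block, the streamed result matches the two-pass softmax up to floating-point rounding, without ever holding the full score row.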
7. Occupancy & launch configuration
- Occupancy = active warps per SM ÷ max warps per SM. Limited by registers/thread, shared memory/block, and block size. Higher occupancy = more warps to hide latency — but it's not the only goal.
- More occupancy helps memory-bound kernels (more in-flight loads to hide latency). Compute-bound kernels can run great at lower occupancy if each thread does enough independent work (instruction-level parallelism).
- Register spilling drops occupancy and adds local-memory traffic — watch it in ncu.
- You rarely tune this by hand in PyTorch (the libraries do), but you read it in profiles to explain why a custom/Triton kernel is slow.
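The three occupancy limiters can be combined in a few lines. This is a simplified estimator: the per-SM limits used as defaults (64 warps, 32 blocks, 65536 registers, 164 KB of shared memory) are assumed example values, and real hardware also rounds allocations to a granularity that is ignored here. Vendor tools report the exact figure.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=64, max_blocks=32,
              regs_per_sm=65536, smem_per_sm=164 * 1024):
    """Estimate active warps / max warps per SM for one launch configuration."""
    warps_per_block = -(-threads_per_block // 32)  # ceiling division
    by_warps = max_warps // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * warps_per_block * 32)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    resident_blocks = min(by_warps, by_regs, by_smem, max_blocks)
    return resident_blocks * warps_per_block / max_warps

light = occupancy(256, regs_per_thread=32, smem_per_block=0)
register_heavy = occupancy(256, regs_per_thread=128, smem_per_block=0)
smem_heavy = occupancy(256, regs_per_thread=32, smem_per_block=48 * 1024)
print(light, register_heavy, smem_heavy)
```

Under these assumed limits, quadrupling registers per thread cuts occupancy to a quarter, and a 48 KB shared-memory tile caps the SM at three resident blocks.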
8. Fit a bigger model / batch (memory techniques)
| Technique | Trades | Use |
|---|---|---|
| Mixed precision (BF16) | ~½ memory for weights/grads/activations | Always, on Ampere+. |
| Gradient checkpointing | ~33% more compute for ~√N activation memory | Long sequences, deep nets. |
| Gradient accumulation | More steps for a big effective batch | Big batch on small VRAM. |
| FSDP / ZeRO sharding | Comms for memory (shard params/grads/opt) | Model too big to replicate. |
| 8-bit optimizer | Tiny quality risk for ½–¾ optimizer memory | bitsandbytes Adam. |
| CPU/NVMe offload | Big slowdown for huge capacity | Last resort (ZeRO-Offload). |
| Quantization (INT8/4) | Quality vs memory/bandwidth | Inference, QLoRA base. |
# reduce fragmentation OOMs ("X free but can't allocate Y")
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
from torch.utils.checkpoint import checkpoint
out = checkpoint(block, x, use_reentrant=False)  # recompute in backward
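Before choosing a technique from the table, it helps to estimate how much memory the model state alone needs. The sketch below uses the common mixed-precision Adam accounting of 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights plus the two Adam moments). That accounting is an assumption about the training setup, it excludes activations and framework overhead, and the 7-billion-parameter model is just an example.

```python
def training_state_bytes(n_params, shards=1):
    """Bytes of weights + grads + optimizer state per GPU, fully sharded over `shards`."""
    weights = 2 * n_params     # BF16 working weights
    grads = 2 * n_params       # BF16 gradients
    optimizer = 12 * n_params  # FP32 master weights + Adam first/second moments
    return (weights + grads + optimizer) / shards

GB = 1e9
n_params = 7e9  # example model size
replicated_gb = training_state_bytes(n_params) / GB
sharded_gb = training_state_bytes(n_params, shards=8) / GB
print(f"replicated: {replicated_gb:.0f} GB per GPU, sharded over 8: {sharded_gb:.0f} GB per GPU")
```

The replicated figure exceeds a single accelerator's memory for a model of this size, which is why sharding params, grads, and optimizer state is the step to reach for once the model no longer fits.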
9. Don't starve the GPU (data pipeline)
from torch.utils.data import DataLoader
loader = DataLoader(ds, batch_size=B,
                    num_workers=8,            # CPU workers feeding batches
                    pin_memory=True,          # page-locked host mem → faster H2D
                    persistent_workers=True,  # don't respawn each epoch
                    prefetch_factor=4)        # batches queued ahead per worker
x = x.to("cuda", non_blocking=True)  # async copy, overlaps with compute
- Low + spiky GPU-util → the feeder is the bottleneck, not the model. Raise workers, prefetch, move CPU preprocessing off the hot path (or onto the GPU with DALI/Kornia).
- Host↔device syncs stall the pipeline: .item(), .cpu(), .numpy(), print(loss), tensor.tolist(), a Python if on a GPU scalar, torch.cuda.synchronize(). Each forces the CPU to wait for the GPU, breaking overlap.
- Keep metrics on-GPU; accumulate and sync once per N steps.
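A rough way to reason about "raise workers" is to compare how fast the loader can produce batches with how fast the GPU consumes them. The model below is a back-of-the-envelope sketch: it assumes preprocessing scales linearly with worker count and that prefetching fully overlaps loading with compute, neither of which holds exactly in practice. The timings are made-up examples.

```python
import math

def gpu_busy_fraction(gpu_ms_per_batch, cpu_ms_per_batch, num_workers):
    """Steady-state fraction of time the GPU has a batch to work on."""
    loader_ms = cpu_ms_per_batch / num_workers  # effective time between batches
    step_ms = max(gpu_ms_per_batch, loader_ms)  # the slower side sets the pace
    return gpu_ms_per_batch / step_ms

def workers_needed(gpu_ms_per_batch, cpu_ms_per_batch):
    """Smallest worker count at which the loader keeps up with the GPU."""
    return math.ceil(cpu_ms_per_batch / gpu_ms_per_batch)

gpu_ms, cpu_ms = 50.0, 200.0  # example: 50 ms train step, 200 ms CPU preprocessing
busy = {w: gpu_busy_fraction(gpu_ms, cpu_ms, w) for w in (1, 2, 4, 8)}
print(busy, "need at least", workers_needed(gpu_ms, cpu_ms), "workers")
```

Past the break-even worker count the model shows no further gain, which matches the usual experience that extra workers only help while the feeder is the bottleneck.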
.item() every step is a silent killer
loss.item() in the loop forces a device sync: the CPU blocks until the GPU finishes, destroying compute/transfer overlap. Log every N steps, or stash losses in a GPU tensor and sync in one batch. This alone can be a 20–40% throughput swing.
10. CUDA streams & async execution
- Operations on the same stream run in order; different streams can overlap. The default stream serializes everything.
- Use a copy stream to overlap H2D/D2H transfers with compute (prefetch next batch while the current one trains).
- Events (torch.cuda.Event) time GPU work correctly. Wall-clock time.time() around a kernel is wrong because launches are async. Always synchronize() or use events when benchmarking.
start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
start.record()
out = model(x)
end.record()
torch.cuda.synchronize()  # wait until both events have completed
print(start.elapsed_time(end), "ms")  # correct GPU timing
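The payoff of a separate copy stream can be estimated with a two-stage pipeline model: while batch i computes, batch i+1 copies. The sketch below compares that with fully serialized execution. It is an idealized timeline that assumes copy and compute overlap perfectly and ignores launch overhead; the millisecond figures are examples.

```python
def serial_ms(n_batches, copy_ms, compute_ms):
    """Everything on one stream: each batch copies, then computes."""
    return n_batches * (copy_ms + compute_ms)

def overlapped_ms(n_batches, copy_ms, compute_ms):
    """Copy of batch i+1 runs on a side stream during compute of batch i."""
    return copy_ms + (n_batches - 1) * max(copy_ms, compute_ms) + compute_ms

n, copy, compute = 100, 2.0, 10.0
print("serial:    ", serial_ms(n, copy, compute), "ms")
print("overlapped:", overlapped_ms(n, copy, compute), "ms")
```

Once overlapped, only the first copy and the slower of the two stages show up in the total, so transfers that are shorter than compute are almost entirely hidden.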
11. Multi-GPU & communication
- Data parallel (DDP): each GPU holds a full model replica and processes a shard of the batch; gradients are all-reduced each step. DDP overlaps the all-reduce with backward by bucketing gradients.
- FSDP / ZeRO — shard params/grads/optimizer states; all-gather a layer's params just-in-time for compute, free after. Fits huge models, more comms.
- Tensor parallel — split each layer's matmuls across GPUs (needs NVLink, intra-node). Pipeline parallel — split layers into stages (cross-node, watch the bubble).
- NCCL runs the collectives (all-reduce, all-gather, reduce-scatter). Goal: overlap comms with compute so the network is hidden under math.
- Topology: NVLink/NVSwitch >> PCIe >> Ethernet. Cross-node needs InfiniBand/RoCE or comms dominates. Check nvidia-smi topo -m.
nvidia-smi topo -m        # NV# = NVLink hops, PIX/PXB = PCIe, SYS = cross-socket
export NCCL_DEBUG=INFO    # confirm ring/tree + which transport NCCL chose
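Why topology matters so much becomes clear from the traffic a gradient all-reduce generates. The sketch below uses the standard bandwidth-only cost of a ring all-reduce, in which each GPU sends 2(N-1)/N times the payload. It ignores latency and the fact that NCCL may pick a tree or another algorithm, and the two link speeds are illustrative placeholders rather than measured numbers.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes, n_gpus):
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

def allreduce_seconds(payload_bytes, n_gpus, link_bytes_per_s):
    """Bandwidth-only time estimate; latency terms are ignored."""
    return ring_allreduce_bytes_per_gpu(payload_bytes, n_gpus) / link_bytes_per_s

grad_bytes = 1e9  # example: 1 GB of gradients
n_gpus = 8
sent = ring_allreduce_bytes_per_gpu(grad_bytes, n_gpus)

fast_link = 300e9  # hypothetical intra-node link, bytes/s
slow_link = 25e9   # hypothetical cross-node link, bytes/s
print(f"each GPU sends {sent / 1e9:.2f} GB")
print(f"fast link: {allreduce_seconds(grad_bytes, n_gpus, fast_link) * 1e3:.1f} ms")
print(f"slow link: {allreduce_seconds(grad_bytes, n_gpus, slow_link) * 1e3:.1f} ms")
```

If the slow-link time exceeds the backward pass it is supposed to hide under, the step becomes communication-bound no matter how well the compute is tuned.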
12. Profiling — measure, never guess
nvidia-smi dmon -s pucvmet                  # live: power util mem clocks throttle
nvidia-smi -q -d CLOCK,PERFORMANCE          # current clocks + throttle reasons
nsys profile -o out python train.py         # timeline: kernels, gaps, syncs, NCCL, H2D
ncu --set full -k regex:gemm python x.py    # one kernel: occupancy, mem%, pipes, stalls
from torch.profiler import profile, ProfilerActivity, schedule
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True, with_stack=True,
             schedule=schedule(wait=1, warmup=1, active=3)) as p:
    for _ in range(5):
        model(x)
        p.step()
print(p.key_averages().table(sort_by="cuda_time_total", row_limit=20))
p.export_chrome_trace("trace.json")  # open in chrome://tracing / perfetto
| Symptom in profile | Diagnosis |
|---|---|
| Gaps between kernels on the GPU timeline | Starvation (data/CPU) or host-device sync |
| Many tiny kernels back-to-back | Launch overhead → fuse / torch.compile / CUDA graphs |
| High memory throughput, low compute | Memory-bound → fuse, cut bytes, FlashAttention |
| Low Tensor pipe utilization | Wrong dtype or bad shapes → BF16 + pad dims |
| NCCL kernels not under compute | Comms not overlapped → bucketing / topology / stage |
| SM clocks below boost, calm load | Power/thermal throttle, not your code |
13. Optimization checklist (in order)
- Profile a few steps (nsys + torch profiler). Identify: starved, memory-bound, or no fast path?
- Enable BF16/TF32 + torch.compile. Confirm Tensor Cores engage; pad odd dims.
- Fix the data pipeline: workers, pin_memory, prefetch; remove .item()/prints from the loop.
- Use FlashAttention; fuse norms/optimizer.
- If OOM: BF16 → checkpointing → accumulation → FSDP/ZeRO → offload.
- Multi-GPU: confirm comms overlaps compute; check topology; tune stage to interconnect.
- Re-profile. Repeat on the new bottleneck. Stop when you hit a hardware roofline.
14. Quick reference
nvidia-smi ; nvidia-smi dmon -s pucvmet ; nvidia-smi topo -m
nvidia-smi -q -d CLOCK,PERFORMANCE # throttle reasons
nsys profile -o out python train.py # timeline
ncu --set full -k regex:gemm python x.py # per-kernel deep dive
CUDA_LAUNCH_BLOCKING=1 # true failing line on async error
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True # frag OOM
# PyTorch knobs
torch.set_float32_matmul_precision("high")
torch.compile(model, mode="max-autotune")
torch.autocast("cuda", dtype=torch.bfloat16)
torch.utils.checkpoint.checkpoint(block, x, use_reentrant=False)
15. Interview Q&A
- Memory-bound vs compute-bound: how do you tell? Arithmetic intensity (FLOPs/byte) on the roofline. The profiler shows it: high memory throughput + low compute = memory-bound; ALUs/Tensor pipes saturated = compute-bound. Norms/activations/attention are memory-bound; big GEMMs are compute-bound.
- GPU-util is 100% but it's slow, why? Util just means a kernel is resident, not that it's efficient. Could be a memory-bound kernel saturating bandwidth, many tiny kernels with launch overhead, no Tensor Cores, or throttled clocks. ncu tells you which.
- GPU-util is low and spiky, why? Starvation: slow data pipeline, CPU-bound preprocessing, host-device syncs (.item()), or batch too small. Fix the feeder, not the model.
- What is coalesced memory access? Threads in a warp reading contiguous addresses so the hardware merges them into a few wide HBM transactions. Strided/scattered access multiplies transactions and wastes bandwidth.
- Why is occupancy not the only goal? It helps hide latency for memory-bound kernels, but a compute-bound kernel with enough instruction-level parallelism per thread can be fast at lower occupancy. Chasing 100% occupancy by shrinking per-thread work can hurt.
- BF16 vs FP16 for training? BF16 keeps FP32's exponent range → no loss scaling, far fewer overflow NaNs. FP16 is finer-grained but needs dynamic loss scaling. Default to BF16 on Ampere+.
- What does FlashAttention actually optimize? It's IO-aware: tiles Q/K/V in SRAM and computes softmax online, never writing the N×N score matrix to HBM. Turns attention's memory from O(N²) to O(N) → faster + far less memory → long context.
- Why must matmul dims be multiples of 8/16? Tensor Cores operate on fixed tile shapes; misaligned dims fall back to slower kernels or waste lanes. Pad vocab/hidden/batch to the next multiple for a free speedup.
- How do you correctly time a GPU op? Launches are async, so wall-clock around a kernel is wrong. Use CUDA events (record/elapsed_time) or torch.cuda.synchronize() before stopping the timer.
- How do you hide multi-GPU communication? Overlap NCCL all-reduce with backward (DDP gradient buckets do this), use NVLink intra-node and fast fabric cross-node, and pick a ZeRO stage matched to the interconnect. Confirm in an nsys timeline that comms sits under compute.
- Throttling vs compute-bound: both look 'maxed'. Difference? Throttling = the GPU capped its clocks for power/thermal reasons (visible in nvidia-smi throttle reasons); the silicon could go faster. Compute-bound = clocks are at boost and the ALUs are genuinely saturated. Check clocks before touching code.
- CUDA cores vs Tensor Cores? CUDA cores do general scalar/vector FP/INT ops; Tensor Cores do small matrix multiply-accumulate tiles at low precision and provide most DL FLOPs. The fast path runs GEMMs on Tensor Cores.