
CHEATSHEET · GENAI · GPU PERFORMANCE

GPU Optimization — The Deep Cheatsheet.

gpu cuda performance ml-engineering
Almost every "slow GPU" reduces to one of three diagnoses: you're memory-bandwidth bound (waiting on HBM, not math), you're starving the GPU (data pipeline / CPU / host-device syncs leave it idle), or you never enabled the fast path (Tensor Cores, fused kernels, the right dtype). This sheet goes layer by layer — hardware, memory, precision, kernels, the data pipeline, multi-GPU, and the profilers that tell you which diagnosis is true. Measure first; optimize the thing the profiler points at, never your guess.

1. GPU hardware model

A GPU is a throughput machine. It hides latency with massive parallelism, where a CPU relies on big caches and out-of-order execution.

  • SM (Streaming Multiprocessor) — the basic compute block. A data-center GPU has 100+ SMs. Each holds many CUDA cores, Tensor Cores, register file, shared memory, warp schedulers.
  • Warp — 32 threads executed in lockstep (SIMT). The scheduler issues one instruction for the whole warp. If threads in a warp take different branches (warp divergence), both paths run serially — wasted lanes.
  • Thread block — group of warps that run on one SM and can share shared memory + synchronize (__syncthreads()). The grid is all blocks of a kernel launch.
  • Latency hiding — when one warp stalls on a memory load, the scheduler swaps in another ready warp. Enough resident warps (high occupancy) → the memory latency is hidden and the SM stays busy.
  • CUDA core vs Tensor Core — CUDA cores do scalar/vector FP32/INT ops; Tensor Cores do small matrix-multiply-accumulate tiles per instruction and deliver the bulk of DL FLOPs at low precision.

Why it matters: you optimize by keeping SMs fed (enough parallel work, enough warps, no stalls) — not by making any single thread fast.
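
The warp-divergence cost described above can be sketched with a toy cost model. It assumes a 32-lane warp, that every branch taken by at least one lane is issued once for the whole warp, and that branch costs are in arbitrary cycle units. The function name and the costs are illustrative and do not come from any CUDA API.

```python
WARP_SIZE = 32

def warp_branch_cost(lane_branches, branch_cost):
    """Return (warp_cycles, lane_efficiency) for one branchy region.

    lane_branches: list of 32 branch labels, one per lane.
    branch_cost:   dict mapping branch label -> cycles (illustrative units).
    """
    assert len(lane_branches) == WARP_SIZE
    # Each distinct branch taken by any lane is executed by the whole warp,
    # with the lanes on other branches masked off.
    warp_cycles = sum(branch_cost[b] for b in set(lane_branches))
    # Useful work: each lane only needed the cycles of its own branch.
    useful = sum(branch_cost[b] for b in lane_branches)
    efficiency = useful / (WARP_SIZE * warp_cycles)
    return warp_cycles, efficiency

cost = {"a": 10, "b": 30}
uniform = warp_branch_cost(["a"] * 32, cost)
divergent = warp_branch_cost(["a"] * 16 + ["b"] * 16, cost)
print(uniform)    # (10, 1.0)
print(divergent)  # (40, 0.5)
```

In this model a half-and-half split pays for both branches and leaves half of the lane-cycles idle, which is the "wasted lanes" effect.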

2. Memory hierarchy & bandwidth

Tier | Scope | Speed / size | Notes
Registers | Per-thread | Fastest, tiny | Spilling → "local memory" (really global) tanks perf.
Shared / L1 | Per-SM | Very fast, 10s–100s KB | Software-managed scratchpad. The lever for tiling/reuse.
L2 cache | Chip-wide | Fast, a few to tens of MB | Shared by all SMs.
Global (HBM) | Device | High BW, high latency, 10s–100s GB | The bottleneck for most DL kernels.
Host (system RAM) | CPU | Crosses PCIe/NVLink-C2C | Transfers are slow; overlap or avoid.
  • Coalescing — threads in a warp should touch contiguous addresses so the hardware merges them into a few wide transactions. Strided / scattered / misaligned access multiplies the number of transactions → wasted bandwidth.
  • Bank conflicts — shared memory is split into banks; if threads in a warp hit the same bank with different addresses, accesses serialize. Pad shared arrays to avoid it.
  • Tiling — load a block of data into shared memory once, reuse it across many threads, write back once. This is how a good GEMM turns a memory problem into a compute problem.
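
The coalescing and bank-conflict rules above can be checked with a small address-counting sketch. It assumes a 32-lane warp, 32-byte global-memory sectors, and 32 shared-memory banks that are each 4 bytes wide. The helper names are illustrative, and real hardware has further details (caching, wider loads) that this ignores.

```python
WARP_SIZE = 32
SECTOR_BYTES = 32   # assumed global-memory transaction granularity
NUM_BANKS = 32      # assumed shared-memory bank count, 4-byte words

def sectors_touched(stride_elems, elem_bytes=4, base=0):
    """Distinct 32-byte sectors one warp touches when lane i reads element i * stride."""
    addrs = [base + lane * stride_elems * elem_bytes for lane in range(WARP_SIZE)]
    return len({a // SECTOR_BYTES for a in addrs})

def bank_conflict_degree(stride_words):
    """Worst-case count of distinct words mapping to one bank (1 = conflict-free)."""
    words_per_bank = {}
    for lane in range(WARP_SIZE):
        word = lane * stride_words
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())

print(sectors_touched(1))        # 4   contiguous floats: 128 bytes in 4 sectors
print(sectors_touched(8))        # 32  every lane lands in its own sector
print(bank_conflict_degree(1))   # 1   conflict-free
print(bank_conflict_degree(32))  # 32  e.g. walking a column of a 32-wide tile
print(bank_conflict_degree(33))  # 1   same walk after padding the row to 33 words
```

The last two lines show why padding a shared array by one word per row removes the conflict: the stride stops being a multiple of the bank count.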

3. The roofline: memory-bound vs compute-bound

  • Arithmetic intensity (AI) = FLOPs performed ÷ bytes moved from HBM. It's the x-axis of the roofline model.
  • Low AI → memory-bound: you finish the math long before the data arrives. Examples: elementwise add, activation (GELU/ReLU), layernorm/RMSNorm, softmax, dropout, embedding lookup, unfused attention.
  • High AI → compute-bound: enough math per byte to saturate the ALUs. Examples: large GEMMs (the QKV/MLP projections), big convolutions.
  • The roofline: attainable FLOP/s = min(peak FLOP/s, AI × peak bandwidth). Below the ridge point (AI = peak FLOP/s ÷ peak bandwidth) you're bandwidth-limited; raising peak FLOP/s does nothing.
Most ML ops are memory-bound. Attention, norms, activations, residual adds, elementwise scaling: all low AI, all bandwidth-bound. That's exactly why fusion (do more work per byte loaded) and FlashAttention (never write the N×N matrix to HBM) are the big wins. Adding FLOPs to a memory-bound kernel changes nothing; you must cut bytes moved.
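
A worked roofline calculation makes the gap concrete. The sketch below uses a hypothetical GPU with 300 TFLOP/s peak and 2 TB/s of HBM bandwidth (round numbers chosen for illustration, not any specific part) and idealized byte counts where each tensor crosses HBM exactly once.

```python
def attainable_flops(ai, peak_flops, peak_bw):
    """Roofline: min(peak FLOP/s, arithmetic intensity x peak bandwidth)."""
    return min(peak_flops, ai * peak_bw)

def ai_elementwise_add(elem_bytes=2):
    # One add per element; two inputs read and one output written.
    return 1 / (3 * elem_bytes)

def ai_gemm(m, n, k, elem_bytes=2):
    flops = 2 * m * n * k                                # multiply + add per term
    bytes_moved = elem_bytes * (m * k + k * n + m * n)   # ideal: A, B, C touched once
    return flops / bytes_moved

PEAK_FLOPS = 300e12   # hypothetical peak, FLOP/s
PEAK_BW = 2e12        # hypothetical HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW

add_ai = ai_elementwise_add()
gemm_ai = ai_gemm(4096, 4096, 4096)
print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"add : AI={add_ai:.3f}, attainable={attainable_flops(add_ai, PEAK_FLOPS, PEAK_BW) / PEAK_FLOPS:.2%} of peak")
print(f"gemm: AI={gemm_ai:.0f}, attainable={attainable_flops(gemm_ai, PEAK_FLOPS, PEAK_BW) / PEAK_FLOPS:.2%} of peak")
```

Under these assumptions the BF16 add sits at about 0.17 FLOPs/byte, far below the ridge at 150, while the 4096³ GEMM sits well above it.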

4. Numeric precision — pick the right dtype

dtype | Bits (E/M) | When / why
FP32 | 8/23 | Full precision, slow, 4 bytes. Master weights, reductions, loss. Rarely needed for the whole net.
TF32 | 8/10 | Ampere+ matmul mode: FP32 range, reduced mantissa. ~FP32 accuracy, big speedup. Opt-in.
FP16 | 5/10 | 2 bytes, Tensor Cores. Narrow exponent range → underflow/overflow → needs loss scaling.
BF16 | 8/7 | 2 bytes, FP32's exponent range (no loss scaling), fewer mantissa bits. Default for modern training.
FP8 (E4M3 / E5M2) | 4/3, 5/2 | 1 byte. Hopper/Ada/Blackwell. E4M3 for fwd, E5M2 for grads. Needs per-tensor/block scaling. Max throughput, most fragile.
INT8 / INT4 | (integer) | Inference quantization (weights and/or activations). Memory + bandwidth win; needs calibration.
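import torch
# Turn on the fast matmul paths (Ampere+)
torch.set_float32_matmul_precision("high")     # TF32 for fp32 matmuls
torch.backends.cuda.matmul.allow_tf32 = True   # older flag for the same matmul setting
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions

dtype = torch.bfloat16                         # use torch.float16 where BF16 is unavailable
# FP16 path needs a GradScaler; BF16 usually does not
# (torch.amp.GradScaler replaces the deprecated torch.cuda.amp.GradScaler)
scaler = torch.amp.GradScaler("cuda", enabled=(dtype == torch.float16))

# Autocast: ops run in low precision where safe, FP32 where needed
with torch.autocast("cuda", dtype=dtype):
    logits = model(x)
    loss = loss_fn(logits, y)
# backward and the optimizer step run outside the autocast block
scaler.scale(loss).backward(); scaler.step(opt); scaler.update()
opt.zero_grad(set_to_none=True)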

Mantissa vs exponent intuition: exponent bits = dynamic range (how big/small); mantissa bits = precision (how fine). BF16 keeps range (stable) but is coarse; FP16 is finer but can overflow — hence loss scaling. Reductions (sums of many terms) should accumulate in FP32 even when inputs are BF16.
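
The range-versus-precision trade can be computed directly from the E/M bit counts, and the need for FP16 loss scaling can be reproduced with the standard library alone. This is a sketch: fmt_limits assumes IEEE-style encoding (so it is not applied to FP8 E4M3, which departs from that convention), and the scale of 2^16 is just an illustrative choice.

```python
import struct

def fmt_limits(exp_bits, man_bits):
    """(max finite, smallest normal, epsilon) for an IEEE-style float format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    eps = 2.0 ** -man_bits          # spacing of values just above 1.0
    return max_finite, min_normal, eps

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

fp16 = fmt_limits(5, 10)   # max 65504.0, eps ~0.00098
bf16 = fmt_limits(8, 7)    # max ~3.39e38, eps ~0.0078
print("fp16:", fp16)
print("bf16:", bf16)

tiny_grad = 1e-8                        # smaller than FP16 can represent
scale = 2.0 ** 16                       # illustrative loss scale
flushed = to_fp16(tiny_grad)            # underflows to 0.0
recovered = to_fp16(tiny_grad * scale) / scale
print(flushed, recovered)               # 0.0 and a value close to 1e-08
```

Scaling the loss shifts small gradients into FP16's representable range before the cast; dividing by the same factor afterwards (in higher precision) restores their magnitude.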

5. Tensor Cores — the fast path

  • Dedicated units doing a small matrix multiply-accumulate (e.g. 16×16 tiles) per instruction. They deliver an order of magnitude more throughput than CUDA cores for GEMMs.
  • They engage only with (a) a supported low precision (FP16/BF16/FP8/INT8/TF32) and (b) friendly shapes. Keep matmul dims multiples of 8 for FP16/BF16 or 16 for INT8/FP8: pad vocab size, hidden dim, sequence, batch.
  • Verify they're used: Nsight Compute "Tensor" pipe utilization, or torch profiler kernel names containing s16816gemm, _tensor_, hmma, imma.
Odd dimensions silently kill Tensor Cores. A vocab of 50257 or a hidden dim not divisible by 8 forces a slow fallback or wastes lanes. Pad to the next multiple of 8/16 (vocab 50257 → 50264), or to the next multiple of 64 (→ 50304). Free speedup, no accuracy change.
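
The padding arithmetic is a one-liner. The helper below is a hypothetical utility, not a PyTorch function; it rounds a dimension up to the next multiple.

```python
def pad_to_multiple(n, multiple):
    """Round n up to the nearest multiple (n itself if already aligned)."""
    return ((n + multiple - 1) // multiple) * multiple

print(pad_to_multiple(50257, 8))    # 50264
print(pad_to_multiple(50257, 64))   # 50304
print(pad_to_multiple(4096, 8))     # 4096 (already aligned)
```

When padding a vocabulary, the extra rows are simply unused token ids; when padding a hidden dimension, the choice has to be made at model-definition time.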

6. Kernel fusion & launch overhead

  • Each kernel launch costs a few microseconds of CPU + driver overhead and a round trip of intermediate tensors to HBM. Thousands of tiny ops = death by a thousand launches and a thousand HBM round-trips.
  • torch.compile(model) — TorchInductor traces the graph, fuses pointwise chains, picks better kernels (often Triton), reduces launches. First thing to try; use mode="max-autotune" for inference.
  • FlashAttention — fused, IO-aware attention. Tiles Q/K/V in SRAM, computes softmax online, never materializes the N×N score matrix in HBM → memory drops from O(N²) to O(N), and it's faster. Enables long context.
  • CUDA Graphs — capture a fixed sequence of launches once, replay as a single graph launch. Removes per-kernel CPU overhead; big for small-batch inference / RL rollouts where launch overhead dominates.
  • Fused kernels — fused AdamW (fused=True), fused LayerNorm/RMSNorm, fused dropout+add+norm (apex / Triton / xFormers).
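model = torch.compile(model)                 # fuse pointwise chains + generate faster kernels
# inference-only graph capture (static shapes); static_in is a preallocated CUDA tensor
s = torch.cuda.Stream(); s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():  # warm up on a side stream before capturing
    for _ in range(3): model(static_in)
torch.cuda.current_stream().wait_stream(s)
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad(): static_out = model(static_in)
static_in.copy_(new_x); g.replay()           # new_x = next batch; refill in place, replay all launches as one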
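
The "computes softmax online" step in the FlashAttention bullet is the part that lets it skip the full score matrix. Below is a minimal pure-Python sketch of that idea for a single query row: scores arrive one tile at a time, and a running max, running sum, and running output are rescaled whenever the max changes. It illustrates the rescaling trick only; the function names and tile size are illustrative, and the real kernels do this per block in on-chip SRAM.

```python
import math

def attention_row_reference(scores, values):
    """Standard softmax(scores) . values, with every score held at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_row_online(scores, values, tile=4):
    """Same result, but only one tile of scores is live at any time."""
    running_max = float("-inf")
    running_sum = 0.0   # sum of exp(score - running_max) seen so far
    acc = 0.0           # matching weighted sum of values
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale earlier partial results to the new max (exp(-inf) == 0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc *= correction
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            acc += p * v
        running_max = new_max
    return acc / running_sum

scores = [0.5, 2.0, -1.0, 3.5, 0.0, 7.0, 1.5, -2.0, 4.0, 0.25]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0, 0.0, 1.5, -0.5]
ref = attention_row_reference(scores, values)
online = attention_row_online(scores, values, tile=4)
print(abs(ref - online) < 1e-12)   # True
```

Because the running statistics are a fixed size, the working set per query row no longer grows with sequence length.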

7. Occupancy & launch configuration

  • Occupancy = active warps per SM ÷ max warps per SM. Limited by registers/thread, shared memory/block, and block size. Higher occupancy = more warps to hide latency — but it's not the only goal.
  • More occupancy helps memory-bound kernels (more in-flight loads to hide latency). Compute-bound kernels can run great at lower occupancy if each thread does enough independent work (instruction-level parallelism).
  • Register spilling drops occupancy and adds local-memory traffic — watch it in ncu.
  • You rarely tune this by hand in PyTorch (the libraries do), but you read it in profiles to explain why a custom/Triton kernel is slow.
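
How the three limiters interact can be sketched as a simplified occupancy calculator. The per-SM limits below (64 warps, 32 blocks, 65536 registers, 100 KB shared memory) are assumptions for illustration and vary by architecture; real tools such as ncu also apply allocation granularities that this ignores.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps_per_sm=64, max_blocks_per_sm=32,
              regs_per_sm=65536, smem_per_sm=100 * 1024, warp_size=32):
    """Return (occupancy fraction, name of the limiting resource). Simplified model."""
    warps_per_block = -(-threads_per_block // warp_size)   # ceiling division
    limits = {
        "warps": max_warps_per_sm // warps_per_block,
        "blocks": max_blocks_per_sm,
        "registers": regs_per_sm // (regs_per_thread * warps_per_block * warp_size),
        "shared_memory": (smem_per_sm // smem_per_block) if smem_per_block else max_blocks_per_sm,
    }
    limiter = min(limits, key=limits.get)    # resource allowing the fewest resident blocks
    resident_blocks = limits[limiter]
    return resident_blocks * warps_per_block / max_warps_per_sm, limiter

print(occupancy(256, 32, 0))           # (1.0, 'warps')
print(occupancy(256, 128, 0))          # (0.25, 'registers')
print(occupancy(256, 32, 48 * 1024))   # (0.25, 'shared_memory')
```

The second and third cases show the usual culprits: heavy register use or a large shared-memory tile caps the number of resident blocks well below the warp limit.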

8. Fit a bigger model / batch (memory techniques)

Technique | Trades | Use
Mixed precision (BF16) | ~½ memory for weights/grads/activations | Always, on Ampere+.
Gradient checkpointing | ~33% more compute for ~√N activation memory | Long sequences, deep nets.
Gradient accumulation | More steps for a big effective batch | Big batch on small VRAM.
FSDP / ZeRO sharding | Comms for memory (shard params/grads/opt) | Model too big to replicate.
8-bit optimizer | Tiny quality risk for ½–¾ optimizer memory | bitsandbytes Adam.
CPU/NVMe offload | Big slowdown for huge capacity | Last resort (ZeRO-Offload).
Quantization (INT8/4) | Quality vs memory/bandwidth | Inference, QLoRA base.
# shell: reduce fragmentation OOMs ("X free but can't allocate Y")
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Python: gradient checkpointing for one block
from torch.utils.checkpoint import checkpoint
out = checkpoint(block, x, use_reentrant=False)   # recompute in backward
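
A back-of-envelope estimate shows when sharding becomes necessary. The sketch assumes the common mixed-precision Adam accounting of 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights plus two Adam moments), ignores activations and buffers, and treats sharding as an even split of all three across GPUs. The function name and defaults are illustrative.

```python
def training_state_bytes(n_params, weight_bytes=2, grad_bytes=2,
                         optimizer_bytes=12, shards=1):
    """Bytes per GPU for weights + grads + optimizer state (activations excluded)."""
    per_param = weight_bytes + grad_bytes + optimizer_bytes
    return n_params * per_param / shards

GB = 1e9
replicated = training_state_bytes(7_000_000_000)           # every GPU holds everything
sharded = training_state_bytes(7_000_000_000, shards=8)    # fully sharded over 8 GPUs
print(f"replicated: {replicated / GB:.0f} GB per GPU")     # replicated: 112 GB per GPU
print(f"sharded x8: {sharded / GB:.0f} GB per GPU")        # sharded x8: 14 GB per GPU
```

Under these assumptions a 7B-parameter model does not fit replicated on an 80 GB device before a single activation is stored, which is the case the FSDP / ZeRO row addresses.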

9. Don't starve the GPU (data pipeline)

from torch.utils.data import DataLoader
loader = DataLoader(ds, batch_size=B,
           num_workers=8,            # CPU workers feeding batches
           pin_memory=True,          # page-locked host mem → faster H2D
           persistent_workers=True,  # don't respawn each epoch
           prefetch_factor=4)        # batches queued ahead per worker
x = x.to("cuda", non_blocking=True) # async copy, overlaps with compute
  • Low + spiky GPU-util → the feeder is the bottleneck, not the model. Raise workers, prefetch, move CPU preprocessing off the hot path (or onto the GPU with DALI/Kornia).
  • Host↔device syncs stall the pipeline: .item(), .cpu(), .numpy(), print(loss), tensor.tolist(), a Python if on a GPU scalar, torch.cuda.synchronize() — each forces the CPU to wait for the GPU, breaking overlap.
  • Keep metrics on-GPU; accumulate and sync once per N steps.
.item() every step is a silent killer. loss.item() in the loop forces a device sync: the CPU blocks until the GPU finishes, destroying compute/transfer overlap. Log every N steps, or stash losses in a GPU tensor and sync in one batch. This alone can be a 20–40% throughput swing.
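
Whether the feeder or the model is the bottleneck can be estimated with a steady-state pipeline model. This is a deliberately simple sketch: it assumes each worker takes a fixed time per batch, workers scale perfectly, and loading either fully overlaps GPU compute or not at all. The timings are made-up inputs you would replace with measurements.

```python
import math

def step_time_ms(load_ms, gpu_ms, num_workers, overlapped=True):
    """Steady-state time per training step under the simple model above."""
    feed_ms = load_ms / num_workers          # workers prepare batches in parallel
    return max(feed_ms, gpu_ms) if overlapped else feed_ms + gpu_ms

def gpu_utilization(load_ms, gpu_ms, num_workers, overlapped=True):
    return gpu_ms / step_time_ms(load_ms, gpu_ms, num_workers, overlapped)

def workers_needed(load_ms, gpu_ms):
    """Fewest workers for which the feeder keeps up with the GPU."""
    return math.ceil(load_ms / gpu_ms)

LOAD_MS, GPU_MS = 40.0, 10.0                 # illustrative per-batch timings
for w in (1, 2, 4, 8):
    print(w, gpu_utilization(LOAD_MS, GPU_MS, w))   # 0.25, 0.5, 1.0, 1.0
print(workers_needed(LOAD_MS, GPU_MS))              # 4
print(gpu_utilization(LOAD_MS, GPU_MS, 4, overlapped=False))  # 0.5
```

The last line is the cost of losing overlap: even with enough workers, a pipeline that waits on each batch before computing leaves the GPU idle half the time in this example.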

10. CUDA streams & async execution

  • Operations on the same stream run in order; different streams can overlap. The default stream serializes everything.
  • Use a copy stream to overlap H2D/D2H transfers with compute (prefetch next batch while the current one trains).
  • Events (torch.cuda.Event) time GPU work correctly — wall-clock time.time() around a kernel is wrong because launches are async. Always synchronize() or use events when benchmarking.
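start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
model(x); torch.cuda.synchronize()          # warm-up run so one-time setup is not timed
start.record(); out = model(x); end.record()
torch.cuda.synchronize(); print(start.elapsed_time(end), "ms")  # correct GPU timing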

11. Multi-GPU & communication

  • Data parallel (DDP) — each GPU a full model + a shard of the batch; gradients all-reduced each step. DDP overlaps the all-reduce with backward by bucketing gradients.
  • FSDP / ZeRO — shard params/grads/optimizer states; all-gather a layer's params just-in-time for compute, free after. Fits huge models, more comms.
  • Tensor parallel — split each layer's matmuls across GPUs (needs NVLink, intra-node). Pipeline parallel — split layers into stages (cross-node, watch the bubble).
  • NCCL runs the collectives (all-reduce, all-gather, reduce-scatter). Goal: overlap comms with compute so the network is hidden under math.
  • Topology: NVLink/NVSwitch >> PCIe >> Ethernet. Cross-node needs InfiniBand/RoCE or comms dominates. Check nvidia-smi topo -m.
nvidia-smi topo -m         # NV# = # bonded NVLinks, PIX/PXB = one/multiple PCIe bridges, SYS = across CPU sockets/NUMA nodes
export NCCL_DEBUG=INFO     # confirm ring/tree + which transport NCCL chose
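
Whether the all-reduce can hide under backward depends on how much data each GPU must move. The sketch below uses the standard ring all-reduce volume of 2 × (N − 1) / N × payload per GPU and two hypothetical link bandwidths (400 GB/s for an NVLink-class link, 25 GB/s for a slower fabric). It ignores latency, protocol overhead, and tree algorithms, so treat the times as rough lower bounds.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes, world_size):
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (world_size - 1) / world_size * payload_bytes

def allreduce_seconds(payload_bytes, world_size, link_bytes_per_s):
    """Bandwidth-only time estimate; latency and overheads are ignored."""
    return ring_allreduce_bytes_per_gpu(payload_bytes, world_size) / link_bytes_per_s

grad_bytes = 7_000_000_000 * 2        # 7B parameters of BF16 gradients
sent = ring_allreduce_bytes_per_gpu(grad_bytes, 8)
print(f"{sent / 1e9:.1f} GB sent per GPU per step")                      # 24.5 GB
print(f"fast link: {allreduce_seconds(grad_bytes, 8, 400e9) * 1e3:.0f} ms")   # about 61 ms
print(f"slow link: {allreduce_seconds(grad_bytes, 8, 25e9) * 1e3:.0f} ms")    # 980 ms
```

If the slow-link figure exceeds the backward pass time, no amount of bucketing can hide it; the options are a faster fabric, fewer bytes (sharding or compression), or less frequent synchronization.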

12. Profiling — measure, never guess

nvidia-smi dmon -s pucvmet            # live: power/temp, util, clocks, violations (throttle), memory, ECC, PCIe throughput
nvidia-smi -q -d CLOCK,PERFORMANCE    # current clocks + throttle reasons
nsys profile -o out python train.py   # timeline: kernels, gaps, syncs, NCCL, H2D
ncu --set full -k regex:gemm python x.py  # one kernel: occupancy, mem%, pipes, stalls
from torch.profiler import profile, ProfilerActivity, schedule
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True, with_stack=True,
             schedule=schedule(wait=1, warmup=1, active=3)) as p:
    for _ in range(5): model(x); p.step()
print(p.key_averages().table(sort_by="cuda_time_total", row_limit=20))
p.export_chrome_trace("trace.json")   # open in chrome://tracing / perfetto
Symptom in profile | Diagnosis
Gaps between kernels on the GPU timeline | Starvation (data/CPU) or host-device sync
Many tiny kernels back-to-back | Launch overhead → fuse / torch.compile / CUDA graphs
High memory throughput, low compute | Memory-bound → fuse, cut bytes, FlashAttention
Low Tensor pipe utilization | Wrong dtype or bad shapes → BF16 + pad dims
NCCL kernels not under compute | Comms not overlapped → bucketing / topology / ZeRO stage
SM clocks below boost under steady load | Power/thermal throttle, not your code

13. Optimization checklist (in order)

  1. Profile a few steps (nsys + torch profiler). Identify: starved, memory-bound, or no fast path?
  2. Enable BF16/TF32 + torch.compile. Confirm Tensor Cores engage; pad odd dims.
  3. Fix the data pipeline: workers, pin_memory, prefetch; remove .item()/prints from the loop.
  4. Use FlashAttention; fuse norms/optimizer.
  5. If OOM: BF16 → checkpointing → accumulation → FSDP/ZeRO → offload.
  6. Multi-GPU: confirm comms overlaps compute; check topology; tune the ZeRO/FSDP sharding stage to the interconnect.
  7. Re-profile. Repeat on the new bottleneck. Stop when you hit a hardware roofline.

14. Quick reference

nvidia-smi ; nvidia-smi dmon -s pucvmet ; nvidia-smi topo -m
nvidia-smi -q -d CLOCK,PERFORMANCE      # throttle reasons
nsys profile -o out python train.py     # timeline
ncu --set full -k regex:gemm python x.py # per-kernel deep dive
CUDA_LAUNCH_BLOCKING=1                   # true failing line on async error
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # frag OOM
# PyTorch knobs
torch.set_float32_matmul_precision("high")
torch.compile(model, mode="max-autotune")
torch.autocast("cuda", dtype=torch.bfloat16)
torch.utils.checkpoint.checkpoint(block, x, use_reentrant=False)

15. Interview Q&A

  • Memory-bound vs compute-bound: how do you tell? Arithmetic intensity (FLOPs/byte) on the roofline. The profiler shows it: high memory throughput + low compute = memory-bound; ALUs/Tensor pipes saturated = compute-bound. Norms/activations/attention are memory-bound; big GEMMs are compute-bound.
  • GPU-util is 100% but it's slow. Why? Util just means a kernel is resident, not that it's efficient. Could be a memory-bound kernel saturating bandwidth, many tiny kernels with launch overhead, no Tensor Cores, or throttled clocks. ncu tells you which.
  • GPU-util is low and spiky. Why? Starvation: slow data pipeline, CPU-bound preprocessing, host-device syncs (.item()), or batch too small. Fix the feeder, not the model.
  • What is coalesced memory access? Threads in a warp reading contiguous addresses so the hardware merges them into a few wide HBM transactions. Strided/scattered access multiplies transactions and wastes bandwidth.
  • Why is occupancy not the only goal? It helps hide latency for memory-bound kernels, but a compute-bound kernel with enough instruction-level parallelism per thread can be fast at lower occupancy. Chasing 100% occupancy by shrinking per-thread work can hurt.
  • BF16 vs FP16 for training? BF16 keeps FP32's exponent range → no loss scaling, far fewer overflow NaNs. FP16 is finer-grained but needs dynamic loss scaling. Default to BF16 on Ampere+.
  • What does FlashAttention actually optimize? It's IO-aware: tiles Q/K/V in SRAM and computes softmax online, never writing the N×N score matrix to HBM. Turns attention's memory from O(N²) to O(N) → faster + far less memory → long context.
  • Why must matmul dims be multiples of 8/16? Tensor Cores operate on fixed tile shapes; misaligned dims fall back to slower kernels or waste lanes. Pad vocab/hidden/batch to the next multiple for a free speedup.
  • How do you correctly time a GPU op? Launches are async, so wall-clock around a kernel is wrong. Use CUDA events (record/elapsed_time) or torch.cuda.synchronize() before stopping the timer.
  • How do you hide multi-GPU communication? Overlap NCCL all-reduce with backward (DDP gradient buckets do this), use NVLink intra-node and fast fabric cross-node, and pick a ZeRO stage matched to the interconnect. Confirm in an nsys timeline that comms sits under compute.
  • Throttling vs compute-bound: both look 'maxed'. Difference? Throttling = the GPU capped its clocks for power/thermal reasons (visible in nvidia-smi throttle reasons); the silicon could go faster. Compute-bound = clocks are at boost and the ALUs are genuinely saturated. Check clocks before touching code.
  • CUDA cores vs Tensor Cores? CUDA cores do general scalar/vector FP/INT ops; Tensor Cores do small matrix multiply-accumulate tiles at low precision and provide most DL FLOPs. The fast path runs GEMMs on Tensor Cores.
© cvam — written in plaintext, served warm