GPU & CUDA Interview Questions

Real questions asked for ML-systems, ML-infra, performance, and HPC roles, graded easy → hard with full answers. Pair this with the GPU Optimization cheatsheet for the concepts behind the answers.

easy: fundamentals (screening / new-grad)
medium: applied (most ML-engineer loops)
hard: senior / systems & perf

Easy — fundamentals

Why are GPUs faster than CPUs for deep learning? easy

GPUs are throughput machines: thousands of simple cores running in parallel, optimized for doing the same operation across huge amounts of data (SIMT). Deep learning is dominated by large matrix multiplies and elementwise ops — embarrassingly parallel work. A CPU has a few powerful cores tuned for latency and branchy serial code with big caches; a GPU trades single-thread speed and caches for raw parallel throughput and very high memory bandwidth (HBM). For one matmul the GPU wins by orders of magnitude; for serial logic the CPU wins.

What is a CUDA core vs a Tensor Core? easy

A CUDA core is a general scalar ALU — it does FP32/FP64/INT arithmetic, one thread's work at a time. A Tensor Core is a specialized unit that performs a small matrix multiply-accumulate (e.g. a 16×16 tile) in a single instruction, at low precision (FP16/BF16/FP8/INT8/TF32). Tensor Cores provide the vast majority of a modern GPU's deep-learning FLOPs; the "fast path" for training/inference runs GEMMs on them.

What is VRAM and what consumes it during training? easy

VRAM is the GPU's on-board memory (HBM). During training it holds: model weights, gradients, optimizer states (Adam stores momentum + variance + often an FP32 master copy — ~12 bytes/param), and activations saved for the backward pass. Activations usually dominate and scale with batch × sequence length × layers. Inference is lighter: weights + the KV cache + a little activation scratch.

What does nvidia-smi show you? easy

GPU utilization %, memory used/total, running processes and their memory, temperature, power draw, and clock/throttle state. It's usually the first command to run when debugging a GPU. Caveat: "utilization" only means a kernel was resident in the sampling window; it does not mean the GPU is being used efficiently. You can be 100% utilized and still slow (memory-bound, tiny kernels, throttled clocks).

What is mixed-precision training? easy

Running most ops in 16-bit (BF16 or FP16) instead of FP32 to halve memory and use Tensor Cores, while keeping the parts that need range/precision (loss, reductions, master weights) in FP32. Frameworks do this automatically via autocast. BF16 is preferred on Ampere+ because it shares FP32's exponent range (no loss scaling needed); FP16 needs a gradient/loss scaler to avoid underflow.

What is the difference between a CUDA core and a Tensor Core? easy

A CUDA core is a general-purpose scalar ALU that does one FP/INT operation per clock and handles arbitrary code. A Tensor Core is a specialized unit that performs a small matrix multiply-accumulate per instruction in mixed precision (the hardware primitive is a small tile, e.g. 4×4 on Volta, which the programming model exposes as larger warp-level tiles such as 16×16), giving an order of magnitude more throughput on the GEMMs that dominate deep learning. You get peak performance only when your work maps onto Tensor Cores (right dtype and shapes); otherwise you fall back to the slower CUDA cores.

Describe the GPU memory hierarchy and why it matters. easy

From fastest/smallest to slowest/largest: registers (per-thread), shared memory / L1 (per-SM, programmer-managed), L2 (shared across SMs), and global HBM (device memory). Each level down is roughly an order of magnitude slower and larger. Performance work is largely about keeping hot data in registers/shared memory and minimizing trips to global HBM, since HBM bandwidth is the usual bottleneck for DL kernels.

What is a warp and why does it matter for performance? easy

A warp is a group of 32 threads that execute the same instruction in lockstep (SIMT). The whole warp is the real scheduling unit, so if threads in a warp take different branches the hardware executes each path serially with the others masked off — warp divergence — wasting throughput. Writing code so threads in a warp follow the same path and access memory uniformly is key to GPU efficiency.

What is memory coalescing? easy

When the 32 threads of a warp access consecutive global-memory addresses, the hardware merges them into a few wide transactions, using the full bus width. If the accesses are scattered or strided, in the worst case each thread triggers its own transaction, multiplying memory traffic and stalling the warp. Coalesced access patterns are one of the biggest levers for memory-bound kernels.

What is HBM and why is its bandwidth so important? easy

High-Bandwidth Memory is stacked DRAM on the GPU package connected by a very wide bus, giving terabytes/second of bandwidth. Many DL kernels (and LLM decode at small batch sizes) are memory-bandwidth-bound: limited by how fast weights/activations stream from HBM, not by compute. So HBM bandwidth and capacity often determine real-world throughput more than peak FLOPs.

What is the difference between FP16 and BF16? easy

Both are 16-bit floats, but they split the bits differently: FP16 has a 5-bit exponent and 10-bit mantissa (more precision, narrow dynamic range), while BF16 keeps FP32's 8-bit exponent with a 7-bit mantissa (wide range, less precision). BF16's wide range means it rarely overflows/underflows, so it's more numerically stable for training and usually needs no loss scaling — which is why modern training defaults to it.
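The bit split translates directly into range and precision. A minimal sketch, assuming IEEE-style encoding where the all-ones exponent is reserved for inf/NaN (true for FP16 and BF16, not for every FP8 variant); `float_format_stats` is a hypothetical helper name:

```python
def float_format_stats(exp_bits: int, man_bits: int) -> dict:
    """Range and precision of an IEEE-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest
    # usable unbiased exponent equals the bias.
    max_normal = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)   # smallest positive normal value
    epsilon = 2.0 ** -man_bits       # gap between 1.0 and the next value
    return {"max": max_normal, "min_normal": min_normal, "epsilon": epsilon}


fp16 = float_format_stats(exp_bits=5, man_bits=10)
bf16 = float_format_stats(exp_bits=8, man_bits=7)
print("FP16:", fp16)
print("BF16:", bf16)
```

FP16 tops out at 65504 with steps of about 0.001 near 1.0, while BF16 reaches roughly 3.4e38 with steps of about 0.008, which is the range-versus-precision trade described above.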

What is occupancy and is higher always better? easy

Occupancy is the ratio of active warps on an SM to the hardware maximum; more resident warps let the scheduler hide memory and instruction latency by switching between them. But it has diminishing returns — once latency is hidden, more occupancy doesn't help, and pushing it can increase register/shared-memory pressure or spilling. It's a means to latency hiding, not a direct performance metric to maximize blindly.

What does a CUDA stream do? easy

A stream is an ordered queue of GPU operations; work within a stream runs in issue order, but operations in different streams can overlap. That lets you, for example, copy the next batch host→device on one stream while computing the current batch on another, hiding transfer latency. Proper multi-stream scheduling is how you overlap compute with data movement.

What is pinned (page-locked) host memory? easy

Normal host memory is pageable, so the driver must stage it before a DMA transfer; pinned memory is locked in physical RAM so the GPU can DMA directly. This makes host↔device copies faster and, crucially, allows them to be truly asynchronous so they overlap with compute. The trade-off is that pinning is a limited resource and over-pinning starves the OS.

Why are GPUs faster than CPUs for machine learning? easy

CPUs devote most of their transistors to a few latency-optimized cores with large caches and branch prediction, optimized for sequential, branchy code. GPUs spend their transistors on thousands of simpler throughput-optimized cores plus very high memory bandwidth, ideal for the massively data-parallel, regular matrix math in ML where the same operation runs over huge tensors. So for dense linear algebra the GPU's throughput crushes the CPU, even though each GPU core is individually slower.

Medium — applied

My GPU utilization is low and spiky during training. What's wrong and how do you fix it? medium

That pattern means the GPU is starved — it finishes a batch and waits for the next. Causes and fixes:

Slow data pipeline — raise DataLoader num_workers, set pin_memory=True, increase prefetch_factor, use persistent_workers. Move heavy preprocessing off the critical path (precompute, or do it on-GPU with DALI/Kornia).
Host-device syncs in the loop — loss.item(), .cpu(), print(loss) force the CPU to wait for the GPU each step. Log every N steps; keep metrics on-GPU.
Batch too small — not enough parallel work per step.

Confirm with an nsys timeline: gaps between kernels = starvation. Fix the feeder, not the model.

GPU utilization is pinned at 100% but throughput is far below the hardware's peak. Why? medium

Utilization ≠ efficiency. A kernel is resident but not getting full value from the silicon. Likely causes:

Memory-bound kernels saturating HBM bandwidth (norms, activations, attention) — fix by fusion (torch.compile), FlashAttention, cutting bytes moved.
No Tensor Cores — wrong dtype (still FP32) or shapes not multiples of 8/16. Enable BF16/TF32, pad dims.
Many tiny kernels — launch overhead dominates; fuse or use CUDA graphs.
Clock throttling — power/thermal cap; check nvidia-smi -q -d CLOCK.

Use Nsight Compute on the hot kernel: it reports memory throughput, compute throughput, and Tensor pipe utilization so you know which one it is.

Explain memory-bound vs compute-bound and how you'd identify each. medium

Arithmetic intensity = FLOPs ÷ bytes moved to and from HBM. Low intensity → memory-bound (the ALUs finish the math and then wait for data); high intensity → compute-bound (ALUs saturated). On the roofline, you're under the diagonal (bandwidth-limited) or under the flat top (compute-limited).

Identify: profile the kernel. High DRAM throughput + low compute throughput = memory-bound; the opposite = compute-bound. Most DL ops (elementwise, layernorm, softmax, attention) are memory-bound — which is why fusion and FlashAttention help and adding FLOPs does not. Big GEMMs are compute-bound — which is why Tensor Cores and lower precision help.

You hit CUDA out-of-memory. Walk through how you'd reduce memory without changing the model. medium

In rough order of cost/benefit:

Mixed precision (BF16) — halves weight/grad/activation bytes.
Smaller batch + gradient accumulation — same effective batch, less peak memory.
Gradient (activation) checkpointing — recompute activations in backward instead of storing them; ~√N memory for ~33% more compute.
Fix fragmentation — PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True when "X free but can't allocate Y".
no_grad/inference_mode for eval — don't build the autograd graph.
8-bit optimizer / FSDP-ZeRO sharding — shrink/shard optimizer states.

For inference specifically, the KV cache is usually the culprit — cap context length and concurrency, quantize weights and KV.

Why does loss.item() in the training loop hurt performance? medium

GPU kernels launch asynchronously: the CPU queues work and moves on. .item() needs the actual value, so it forces a synchronization: the CPU blocks until the GPU has finished the queued work that produces that value. That stalls the pipeline and breaks the overlap between CPU dispatch, data transfer, and GPU compute. Done every step, it can cost 20–40% of throughput when CPU dispatch is close to being the limiter. Fix: accumulate the loss in a GPU tensor and call .item() once every N steps for logging.

How do you correctly benchmark the runtime of a GPU operation? medium

Because launches are async, wrapping a kernel in time.time() measures dispatch time, not execution. Correct approaches: (1) CUDA events: start.record(); op(); end.record(); torch.cuda.synchronize(); start.elapsed_time(end), which returns milliseconds; or (2) call torch.cuda.synchronize() both before starting the clock and before stopping it. Also warm up first (lazy init, autotune, cudnn benchmark, torch.compile compilation) and average over many iterations.
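The synchronize-then-time pattern can be sketched as a small harness. This is an illustration under assumptions, not a library API: `benchmark` is a hypothetical helper, and the `sync` argument is where you would pass `torch.cuda.synchronize` on a GPU. The example call uses a CPU stand-in so it runs anywhere.

```python
import time
from typing import Callable


def benchmark(fn: Callable[[], object],
              sync: Callable[[], None] = lambda: None,
              warmup: int = 3,
              iters: int = 20) -> float:
    """Return mean seconds per call of fn, with explicit syncs around the timed region."""
    for _ in range(warmup):          # absorb lazy init, autotuning, compilation
        fn()
    sync()                           # drain queued work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()                           # wait for the timed work to actually finish
    return (time.perf_counter() - start) / iters


# CPU stand-in workload; the default no-op sync is enough here.
mean_s = benchmark(lambda: sum(range(10_000)))
print(f"{mean_s * 1e6:.1f} us per call")
```

Timing many iterations between two syncs amortizes the sync cost itself, which matters when a single call takes only a few microseconds.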

How do you tell whether a kernel is memory-bound or compute-bound? medium

Use the roofline model: compute the kernel's arithmetic intensity (FLOPs performed per byte moved from memory) and compare it to the hardware's ridge point (peak FLOPs ÷ peak bandwidth). If intensity is below the ridge point you're memory-bound (limited by bandwidth), above it you're compute-bound (limited by ALU/Tensor-Core throughput). Profilers like Nsight Compute plot this directly, telling you whether to optimize data movement or arithmetic.
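A short sketch of that comparison. The peak numbers (300 TFLOP/s, 2 TB/s) are illustrative placeholders, not the specs of any particular GPU, and `roofline` is a hypothetical helper; the byte counts assume each operand is moved to or from memory exactly once.

```python
def roofline(flops: float, bytes_moved: float,
             peak_flops: float, peak_bandwidth: float) -> dict:
    """Classify a kernel and compute its attainable FLOP/s under the roofline model."""
    intensity = flops / bytes_moved              # FLOPs per byte
    ridge = peak_flops / peak_bandwidth          # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "memory" if intensity < ridge else "compute"
    return {"intensity": intensity, "ridge": ridge,
            "attainable_flops": attainable, "bound": bound}


PEAK_FLOPS = 300e12   # assumed peak, FLOP/s
PEAK_BW = 2e12        # assumed peak memory bandwidth, bytes/s

n = 4096
# Elementwise add of two FP16 vectors: 1 FLOP per element,
# two 2-byte reads plus one 2-byte write per element.
add = roofline(flops=n, bytes_moved=6 * n,
               peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BW)
# Square FP16 matmul: 2*n^3 FLOPs, three n*n matrices of 2-byte elements.
matmul = roofline(flops=2 * n**3, bytes_moved=3 * n * n * 2,
                  peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BW)

print("add:   ", add)
print("matmul:", matmul)
```

With these assumed peaks the ridge point is 150 FLOPs/byte: the add sits far below it (about 0.17), while the matmul sits far above it (n/3, about 1365).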

What is kernel fusion and why does it help? medium

Kernel fusion combines several elementwise/operator steps into a single kernel so intermediate results stay in registers/shared memory instead of being written to and re-read from HBM. This cuts both kernel-launch overhead and, more importantly, expensive memory round-trips — turning several memory-bound passes into one. It's why frameworks fuse activation+bias+dropout or use fused attention/optimizers.

Explain FlashAttention and why it's faster. medium

Standard attention materializes the full N×N scores matrix in HBM, making it memory-bound and quadratic in memory. FlashAttention is IO-aware: it tiles Q, K, and V into blocks, computes attention block-by-block in fast on-chip SRAM, and uses an online (streaming) softmax so it never writes the full matrix to HBM. The result is far less memory traffic and lower memory use, enabling longer sequences and big speedups.

How does mixed-precision training work? medium

Forward and backward passes compute in FP16/BF16 for speed and lower memory, while a master copy of the weights is kept in FP32 so small updates aren't lost to rounding. For FP16 specifically you also apply loss scaling — multiply the loss before backprop and unscale before the optimizer step — to keep tiny gradients from underflowing. BF16 usually skips loss scaling thanks to its wider exponent range.

How does NCCL all-reduce work and why is the ring variant used? medium

All-reduce sums each GPU's gradients and distributes the result to all GPUs. The ring all-reduce arranges GPUs in a ring and passes chunks around in a reduce-scatter phase followed by an all-gather phase, so each GPU sends and receives about 2(N−1)/N times the data size, which is nearly independent of the number of GPUs N and makes it bandwidth-optimal. The number of steps still grows with N, so latency dominates for small messages; NCCL picks ring or tree algorithms based on message size and topology to minimize time.
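The two phases of the ring algorithm can be simulated on plain lists. This is a teaching sketch, not how NCCL is implemented: `ring_all_reduce` is a hypothetical function, ranks are list indices, and a "send" is a list copy.

```python
def ring_all_reduce(data):
    """Simulate ring all-reduce (sum). data[r] is rank r's vector.

    Returns (per-rank results, elements sent by each rank).
    """
    n = len(data)
    length = len(data[0])
    assert length % n == 0, "vector length must split evenly into n chunks"
    chunk = length // n
    bufs = [list(d) for d in data]
    sent = [0] * n

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: each step, rank r sends chunk (r - step) to rank r+1,
    # which adds it in. Afterwards rank r owns the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        outgoing = [bufs[r][span((r - step) % n)] for r in range(n)]
        for r in range(n):
            c, dst = (r - step) % n, (r + 1) % n
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], outgoing[r])]
            sent[r] += chunk

    # Phase 2: circulate the finished chunks; receivers overwrite.
    for step in range(n - 1):
        outgoing = [bufs[r][span((r + 1 - step) % n)] for r in range(n)]
        for r in range(n):
            c, dst = (r + 1 - step) % n, (r + 1) % n
            bufs[dst][span(c)] = outgoing[r]
            sent[r] += chunk

    return bufs, sent


inputs = [[rank * 10 + i for i in range(8)] for rank in range(4)]
results, sent = ring_all_reduce(inputs)
print(results[0])
print(sent)
```

Each rank sends 2 × (n − 1) chunks of length/n elements, so with 4 ranks and 8 elements every rank sends 12 elements in total, matching the 2(N−1)/N factor.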

What is the difference between PCIe and NVLink for multi-GPU? medium

PCIe is the general expansion bus and is comparatively narrow, so GPU-to-GPU communication over it bottlenecks collective operations. NVLink is a dedicated high-bandwidth GPU interconnect (and NVSwitch fabric) giving many times the bandwidth and direct peer-to-peer access. For tensor parallelism and frequent all-reduce, NVLink vs PCIe is often the difference between near-linear and poor multi-GPU scaling.

What is a CUDA graph and when does it help? medium

A CUDA graph captures a sequence of kernels and their dependencies once, then replays the whole sequence with a single launch instead of many per-kernel launches. This removes CPU-side launch overhead, which dominates when you have lots of small kernels — common in low-batch inference. It helps most when the work is static and repeated every step.

What are shared-memory bank conflicts? medium

Shared memory is divided into banks that can each service one access per cycle; if multiple threads in a warp hit different addresses in the same bank, those accesses serialize, hurting throughput. The classic case is a stride that maps many threads to one bank. You avoid it by padding arrays or choosing access patterns/strides that spread threads across distinct banks.
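The padding trick is easy to check with a small model. A sketch assuming 32 banks that are each one 4-byte word wide, with accesses to the same word served by a broadcast; `conflict_degree` and `column_access` are hypothetical helpers.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 banks, each one 4-byte word wide


def conflict_degree(word_indices):
    """Worst-case number of distinct words a warp requests from a single bank."""
    # Threads reading the same word are served by a broadcast, so dedupe first.
    per_bank = Counter(w % NUM_BANKS for w in set(word_indices))
    return max(per_bank.values())


def column_access(row_stride, col=0, warp_size=32):
    """Word index touched by each thread when a warp reads one column of a tile."""
    return [t * row_stride + col for t in range(warp_size)]


print(conflict_degree(column_access(row_stride=32)))  # 32-wide rows: 32-way conflict
print(conflict_degree(column_access(row_stride=33)))  # rows padded by one word: 1
```

With 32-word rows every thread in the column lands in the same bank; padding each row to 33 words shifts successive rows by one bank, so the same column read becomes conflict-free.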

When and why would you use FP8? medium

On Hopper and newer GPUs, FP8 (the e4m3 and e5m2 formats) roughly doubles Tensor-Core throughput and halves memory vs FP16, which is valuable for both inference and increasingly training. Its dynamic range is tiny, so it relies on per-tensor (or finer) scaling factors to keep values representable. You use it where the speed/memory win outweighs the small accuracy risk, validated on real tasks.

Why isn't higher occupancy always faster? medium

Occupancy exists to hide latency by giving the scheduler other warps to run while some stall; once you have enough warps to cover the latency, adding more yields nothing. Worse, raising occupancy usually means fewer registers/less shared memory per thread, which can cause register spilling to local memory and actually slow the kernel. And if you're memory-bandwidth-bound, no amount of occupancy helps — the bottleneck is HBM, not warp availability.

Hard — senior & systems

What is memory coalescing, and how does access pattern affect a kernel's bandwidth? hard

The memory system serves a warp (32 threads) by combining their global-memory accesses into the fewest possible aligned transactions (e.g. 32/64/128-byte segments). If the 32 threads read contiguous, aligned addresses, the hardware coalesces them into a couple of wide transactions — full bandwidth. If they read strided or scattered addresses, each may require a separate transaction, so you fetch far more bytes than you use and effective bandwidth collapses. This is why data layout (AoS vs SoA), row- vs column-major traversal, and alignment matter, and why transposes/gathers are expensive. The shared-memory analog is bank conflicts: threads hitting different addresses in the same bank serialize.
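The effect of stride on traffic can be estimated by counting the memory sectors a warp touches. A rough sketch, assuming 32-byte sectors and aligned 4-byte elements; `warp_load_efficiency` is a hypothetical helper, and real hardware adds caching effects this model ignores.

```python
SECTOR_BYTES = 32  # assumed sector size


def warp_load_efficiency(stride_elems, elem_bytes=4, warp_size=32):
    """Return (sectors touched, useful bytes / fetched bytes) for one warp-wide load."""
    addrs = [t * stride_elems * elem_bytes for t in range(warp_size)]
    sectors = {a // SECTOR_BYTES for a in addrs}
    bytes_used = warp_size * elem_bytes
    bytes_fetched = len(sectors) * SECTOR_BYTES
    return len(sectors), bytes_used / bytes_fetched


for stride in (1, 2, 8, 32):
    sectors, eff = warp_load_efficiency(stride)
    print(f"stride {stride:>2}: {sectors:>2} sectors, {eff:.1%} of fetched bytes used")
```

A unit stride touches 4 sectors and uses every byte fetched; from a stride of 8 elements upward each thread lands in its own sector and only 12.5% of the fetched bytes are used.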

Explain FlashAttention — what problem it solves and why it's faster despite doing more compute. hard

Standard attention computes S = QKᵀ (an N×N matrix), softmax(S), then ·V. Materializing and re-reading that N×N matrix in HBM makes attention memory-bound and O(N²) in memory — the bottleneck for long sequences. FlashAttention is IO-aware: it tiles Q, K, V into blocks that fit in on-chip SRAM, computes the attention for each block, and uses the online softmax trick (running max + running sum) to combine blocks without ever writing the full score matrix to HBM. It recomputes some quantities in the backward pass (more FLOPs) but slashes HBM traffic — and since attention was bandwidth-bound, fewer bytes moved means faster wall-clock plus O(N) memory, enabling much longer context.
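The online softmax trick can be shown for a single query row. A minimal sketch with scalar values instead of value vectors, to keep it short; the function names are hypothetical and the block size is arbitrary. It only demonstrates that block-by-block processing with a running max and running sum reproduces the full softmax-weighted result.

```python
import math


def attention_row_reference(scores, values):
    """Softmax over all scores at once, then the weighted sum of values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_online(scores, values, block=2):
    """Same result, one block at a time, never holding all weights."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    acc = 0.0           # unnormalized weighted sum, relative to running_max
    for i in range(0, len(scores), block):
        s_blk, v_blk = scores[i:i + block], values[i:i + block]
        new_max = max(running_max, max(s_blk))
        rescale = math.exp(running_max - new_max)   # 0.0 on the first block
        p = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * rescale + sum(p)
        acc = acc * rescale + sum(pi * v for pi, v in zip(p, v_blk))
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attention_row_reference(scores, values))
print(attention_row_online(scores, values))
```

Whenever a block raises the running max, the previous sum and accumulator are rescaled by exp(old_max − new_max), which is what lets blocks be combined without storing the full row of scores.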

A distributed training job hangs at the first all-reduce. How do you debug it? hard

Set NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL and re-run. Common causes:

Collective mismatch — ranks call different collectives or different shapes (a conditional that runs on some ranks, logging inside a collective on rank 0 only, uneven batch sizes). All ranks must execute the same collectives in the same order. TORCH_DISTRIBUTED_DEBUG=DETAIL flags this.
A rank died silently — one rank OOM'd or threw mid-collective; the rest block at the barrier until the watchdog timeout. The real traceback is in that rank's log — grep all rank logs for the first error, not the timeout.
Networking — wrong NCCL_SOCKET_IFNAME, IB disabled, or a blocked rendezvous port; init can't form the ring. Wrong MASTER_ADDR/PORT or world size also hangs at init.

Isolate transport issues with NCCL_P2P_DISABLE=1 / NCCL_IB_DISABLE=1 (slower but proves the cause).

How do you decide between data, tensor, pipeline, and ZeRO/FSDP parallelism for a given model? hard

Start from "does the model + states fit on one GPU?"

Fits → DDP (data parallel): replicate, shard the batch, all-reduce gradients. Simplest, scales throughput.
Doesn't fit, sub-~100B → FSDP/ZeRO: shard optimizer states (stage 1), then gradients (2), then params (3). All-gather params per layer on demand. Trades comms for memory; pick the lowest stage that fits.
Huge layers / need low latency → add tensor parallel within a node (needs NVLink — it all-reduces activations every layer).
Cross-node, very deep → add pipeline parallel (stages across nodes; mind the pipeline bubble — use micro-batching).

The largest models combine all three (3D parallelism = DP × TP × PP), arranging TP within a node (fast NVLink), PP across nodes, DP on top. Always match the strategy to the interconnect — sharding on slow PCIe/Ethernet makes comms dominate and GPUs idle.
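To pick "the lowest stage that fits", it helps to estimate model-state memory per GPU. A sketch assuming mixed-precision Adam at 2 bytes/param for weights, 2 for gradients, and 12 for optimizer state (FP32 master copy, momentum, variance); activations and temporary buffers are not counted, and `zero_bytes_per_gpu` is a hypothetical helper.

```python
def zero_bytes_per_gpu(n_params: float, n_gpus: int, stage: int) -> float:
    """Model-state bytes per GPU under ZeRO stage 0 (plain DDP) through 3."""
    weights = 2 * n_params    # 16-bit weights
    grads = 2 * n_params      # 16-bit gradients
    optim = 12 * n_params     # FP32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= n_gpus       # stage 1 shards optimizer state
    if stage >= 2:
        grads /= n_gpus       # stage 2 also shards gradients
    if stage >= 3:
        weights /= n_gpus     # stage 3 also shards parameters
    return weights + grads + optim


for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, n_gpus=64, stage=stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this goes from 120 GB per GPU with plain replication to under 2 GB at stage 3, before activations are added.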

Training was fine, then got ~20% slower with no code change. The node is calm. What do you check? hard

Suspect clock throttling, not your code. Check nvidia-smi -q -d CLOCK,PERFORMANCE for "Clocks Throttle Reasons" and compare SM clocks to boost. Causes: thermal (dust, failed fan, hot inlet, neighbor load raising rack temp) → SW/HW thermal slowdown; power cap too low or PSU/rack budget → SW power cap; or a noisy-neighbor VM stealing the host. Also check dmesg -T | grep -i xid for ECC errors / a degrading card (repeated correctable ECC throttles to be safe). It looks like "compute-bound" but the silicon is capping clocks — verify hardware before optimizing kernels.

What is occupancy, and is higher always better? hard

Occupancy = resident warps per SM ÷ the SM's max warps. It's bounded by registers/thread, shared memory/block, and block size. More resident warps give the scheduler more ready work to hide memory latency, so memory-bound kernels usually benefit from higher occupancy. But it's not a universal goal: a compute-bound kernel with enough instruction-level parallelism per thread can saturate the ALUs at modest occupancy, and pushing occupancy by cutting registers/shared memory can force spills (local-memory traffic) or reduce per-thread reuse, making it slower. You target "enough occupancy to hide latency," then optimize for reuse and ILP.
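The three bounds can be turned into a rough calculator. A simplified sketch: the per-SM limits used as defaults (64 warps, 65536 registers, 100 KiB shared memory, 32 blocks) are assumptions that vary by architecture, allocation granularity is ignored, and `theoretical_occupancy` is a hypothetical helper.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, regs_per_sm=65536,
                          smem_per_sm=100 * 1024, max_blocks_per_sm=32,
                          warp_size=32):
    """Return (occupancy, limiting resource) for one kernel configuration."""
    warps_per_block = -(-threads_per_block // warp_size)   # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    # How many blocks fit on one SM according to each resource.
    limits = {
        "warps": max_warps_per_sm // warps_per_block,
        "registers": regs_per_sm // regs_per_block,
        "shared_memory": (smem_per_sm // smem_per_block
                          if smem_per_block else max_blocks_per_sm),
        "blocks": max_blocks_per_sm,
    }
    limiter = min(limits, key=limits.get)
    resident_warps = limits[limiter] * warps_per_block
    return resident_warps / max_warps_per_sm, limiter


print(theoretical_occupancy(256, regs_per_thread=64, smem_per_block=0))
print(theoretical_occupancy(256, regs_per_thread=32, smem_per_block=0))
print(theoretical_occupancy(256, regs_per_thread=32, smem_per_block=48 * 1024))
```

Under these assumed limits, 64 registers per thread caps the kernel at 50% occupancy, halving register use restores 100%, and a 48 KiB shared-memory tile per block drops it to 25% regardless of registers.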

How would you serve or train under FP8, and what breaks? hard

FP8 (E4M3 for forward/weights, E5M2 for gradients) gives the most throughput on Hopper/Ada/Blackwell Tensor Cores and halves memory vs BF16. The challenge is dynamic range: 8 bits can't represent the spread of values in a layer, so you need scaling — per-tensor or finer per-block/per-channel scales, often with delayed scaling (track recent amax to set the scale). What breaks: outliers in activations (especially in attention and certain layers) overflow/underflow and tank accuracy; some layers (norms, softmax, final logits, master weights) must stay higher precision; and naive FP8 without good scaling diverges. Libraries like Transformer Engine manage the scales. You validate on an eval set — FP8 quality is fragile and model-dependent.

How would you design a memory plan to train a model that doesn't fit on one GPU? hard

Shard the model state with ZeRO-3 / FSDP so parameters, gradients, and optimizer state are split across data-parallel ranks rather than replicated. Add tensor parallelism within a node (over NVLink) to split individual layers, and pipeline parallelism across nodes to split by layer groups. Layer in activation checkpointing to trade compute for activation memory, and optionally offload optimizer state to CPU/NVMe. The exact mix is driven by model size, interconnect topology, and the compute/communication trade-off.

Compare tensor, pipeline, and data parallelism and their trade-offs. hard

Data parallelism replicates the model and all-reduces gradients: simple and scales well, but communication and memory grow with model size. Tensor parallelism splits each layer's matrices across GPUs, requiring an all-reduce within every layer, so it needs very fast interconnect (NVLink) and is usually kept inside a node. Pipeline parallelism splits the model into stages on different GPUs and streams micro-batches through, which is communication-light but introduces a pipeline 'bubble' of idle time. More micro-batches shrink the bubble and interleaved 1F1B scheduling reduces it further, while plain 1F1B mainly lowers activation memory. Large-scale training combines all three (3D parallelism) mapped to the hardware topology.
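The size of the pipeline bubble is simple to estimate. A sketch assuming a non-interleaved schedule in which every stage takes the same time per micro-batch; `pipeline_bubble_fraction` is a hypothetical helper.

```python
def pipeline_bubble_fraction(stages: int, micro_batches: int) -> float:
    """Fraction of total step time that the pipeline spends idle.

    With p stages and m micro-batches, a step takes (m + p - 1) time slots,
    of which (p - 1) are fill/drain bubble.
    """
    return (stages - 1) / (micro_batches + stages - 1)


for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: "
          f"{pipeline_bubble_fraction(4, m):.1%} idle")
```

With 4 stages, a single micro-batch leaves the pipeline idle 75% of the time; 16 micro-batches bring that to about 16%, which is why micro-batch count is usually set to several times the stage count.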

How do you profile a slow kernel? hard

Use Nsight Compute to get the metrics that matter: achieved occupancy, memory throughput vs the roofline, the dominant warp-stall reasons (memory dependency, execution dependency, etc.), and Tensor-Core utilization. Those tell you whether you're memory-bound (optimize coalescing, reuse, fusion), latency-bound (raise occupancy/ILP), or compute-bound (improve Tensor-Core use and shapes). The method is: measure, identify the single limiting factor, fix it, then re-measure rather than guessing.

How do you reduce inter-GPU communication during training? hard

Overlap communication with computation by bucketing gradients and launching all-reduce as soon as each bucket's gradients are ready (DDP does this), so comms hide behind backprop. Increase the per-GPU batch size so the compute-to-communication ratio rises, and use topology-aware/hierarchical collectives (reduce within a node over NVLink, then across nodes). For extreme cases, gradient compression or lower-precision communication further cuts bytes on the wire.

What is arithmetic intensity and how do you increase it? hard

Arithmetic intensity is FLOPs performed per byte moved from memory; it determines where you sit on the roofline. You raise it by reusing data more before evicting it — fusing operations, tiling so blocks stay in shared memory/registers, using larger tiles/batches, and avoiding redundant HBM reads/writes of intermediates. Increasing intensity shifts a memory-bound kernel toward compute-bound, where the GPU's FLOPs can actually be utilized.

What is the difference between QAT and PTQ? hard

Post-Training Quantization (PTQ) quantizes an already-trained model, using a small calibration set to choose scales/zero-points; it's cheap and fast but can lose accuracy at very low bit-widths. Quantization-Aware Training (QAT) inserts fake-quantization into the forward pass during (re)training so the model learns weights robust to quantization, giving the best low-bit accuracy at the cost of a training run. You reach for PTQ first (often with AWQ/GPTQ) and only do QAT when PTQ accuracy is insufficient.

What is the cost of warp divergence and how do you mitigate it? hard

When threads in a warp take different branches, the hardware executes each taken path sequentially with non-participating threads masked, so heavy divergence can cut effective throughput by the number of distinct paths. Mitigations include restructuring or sorting data so threads in a warp follow the same path, replacing small branches with predication/arithmetic, and aligning work to warp boundaries. The goal is uniform control flow within each 32-thread warp.
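A toy model of why sorting helps: count how many serialized passes each warp needs. This sketch assumes a single two-way branch whose paths cost the same; `divergent_passes` is a hypothetical helper and the data are synthetic.

```python
def divergent_passes(branch_taken, warp_size=32):
    """Total passes over all warps: a warp runs once per distinct branch outcome it holds."""
    passes = 0
    for i in range(0, len(branch_taken), warp_size):
        passes += len(set(branch_taken[i:i + warp_size]))
    return passes


keys = [i % 2 for i in range(128)]                 # alternating 0, 1, 0, 1, ...
interleaved = [k == 1 for k in keys]               # every warp sees both outcomes
grouped = [k == 1 for k in sorted(keys)]           # each warp sees one outcome

print(divergent_passes(interleaved))  # 8: four warps, two passes each
print(divergent_passes(grouped))      # 4: four warps, one pass each
```

The same 128 elements need half as many passes once they are grouped so that each 32-thread warp takes a single path.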

What is Multi-Instance GPU (MIG) and when do you use it? hard

MIG partitions an A100/H100 into several fully isolated GPU instances, each with a dedicated slice of SMs, L2, and memory and hardware-enforced isolation. You use it for multi-tenant or mixed-SLA serving where you need guaranteed QoS and fault/performance isolation between workloads, rather than letting them contend for one GPU. The trade-off is fixed partition sizes and that a single large job can't use the whole GPU while partitioned.

How do you overlap data movement with computation? hard

Use multiple CUDA streams with double buffering: while the GPU computes on buffer A, asynchronously prefetch the next tile/batch into buffer B on a separate stream, then swap. This requires pinned host memory for true async DMA and careful dependency management so compute waits only on the data it needs. Done well, transfer latency disappears behind compute and the GPU stays busy.

How do you read a roofline plot to guide optimization? hard

The roofline has a sloped region (performance limited by memory bandwidth) and a flat region (limited by peak compute), meeting at the ridge point. Plot your kernel's achieved GFLOP/s at its arithmetic intensity: if it sits under the sloped roof, you're memory-bound — improve coalescing, reuse, and fusion to raise intensity; if it's under the flat roof, you're compute-bound — improve occupancy, ILP, and Tensor-Core utilization. The gap between your point and the roof shows the headroom and which roof to chase.

Scenario-based

Your training job OOMs at batch 32 on an A100 40GB but fits at 16 — you need batch-32 throughput. What do you try? medium

Get effective batch 32 without the memory: gradient accumulation (run two micro-batches of 16, accumulate grads, one optimizer step) is the zero-risk first move. Then cut memory: activation/gradient checkpointing (recompute activations in backward, trades compute for memory), mixed precision BF16, an 8-bit optimizer (bitsandbytes Adam) to shrink optimizer state, and ZeRO/FSDP sharding or CPU offload if multi-GPU. Verify the math: params + grads + optimizer state + activations. Activations usually dominate at large batch — checkpointing hits that hardest.
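Gradient accumulation gives the same gradient as the large batch as long as each micro-batch gradient is scaled correctly. A minimal sketch on a one-parameter least-squares model with synthetic data (all names and numbers here are illustrative), using pure Python so the arithmetic is visible:

```python
import math


def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [0.1 * i for i in range(32)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# Gradient from the full batch of 32.
full = grad_mse(w, xs, ys)

# Same gradient from two micro-batches of 16, accumulated before one optimizer step.
micro = 16
steps = len(xs) // micro
accum = 0.0
for i in range(0, len(xs), micro):
    # Divide by the number of accumulation steps so the sum is the full-batch mean.
    accum += grad_mse(w, xs[i:i + micro], ys[i:i + micro]) / steps

print(full, accum)
```

Forgetting the division by the number of accumulation steps is the usual bug: the accumulated gradient then comes out twice as large as the batch-32 gradient.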

nvidia-smi shows 95% GPU utilization but throughput is low. What's going on? hard

GPU "utilization" only means a kernel was running, not that the SMs or Tensor Cores were busy — it's a misleading metric. Profile with Nsight Compute / DCGM for real SM occupancy and Tensor-Core activity. Common causes: memory-bound kernels (low arithmetic intensity — see the roofline), tons of tiny kernels with launch overhead (fuse them / use CUDA graphs), not using Tensor Cores (wrong dtype or shapes not multiples of 8), or a CPU/dataloader bottleneck feeding the GPU. Fix the specific limiter — raise batch, fuse ops, use FP16/BF16 with aligned shapes.

Training scales from 1→8 GPUs but you only get ~3x speedup. Diagnose. hard

Communication overhead is eating the scaling. Check the interconnect — PCIe is far slower than NVLink for all-reduce; cross-node needs fast NIC + correct NCCL setup. Verify NCCL is actually using NVLink/IB (NCCL_DEBUG=INFO). Other causes: per-GPU batch too small (comm dominates compute — raise global batch), gradient all-reduce not overlapped with backward (use gradient bucketing / DDP's overlap), or load imbalance. Measure comm vs compute time; if comm-bound, bigger batches, gradient compression, or better topology.

Inference latency is mostly fine but spikes intermittently. What causes it? medium

Intermittent = something periodic or contention. Suspects: thermal throttling (check clocks under sustained load), memory fragmentation forcing reallocation, PCIe host↔device transfers stalling, cold caches on first request after idle, GC/allocator pauses, or batching jitter (a big request blocking a queue). For LLM serving specifically, a long prompt doing prefill blocks decode for others (head-of-line). Profile the spike window, check GPU clocks/temp, and look at queue/batch composition at spike time.

Pick hardware to serve a 70B model. How do you size it? hard

Start with memory math: 70B params × 2 bytes (FP16) = ~140 GB just for weights, plus KV cache that grows with batch×context. That won't fit on one 80GB GPU. Options: 2× H100/A100 80GB with tensor parallelism (160 GB total leaves only about 20 GB for KV cache and activations, so long contexts or high concurrency may need more GPUs), quantize to INT8 or FP8 (~70 GB, which fits one 80GB GPU with little room left for KV cache), or go to INT4 (~35 GB) for more headroom. Then size KV cache for target concurrency/context and pick batching (continuous batching + PagedAttention). Trade-offs: quantization saves memory/cost but may cost accuracy; TP adds inter-GPU comm. Decide from throughput target, latency SLA, and budget.
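The sizing arithmetic can be written down directly. A sketch in which the model shape (80 layers, 8 KV heads, head dimension 128) is an assumed 70B-class configuration with grouped-query attention, and the target of 32 concurrent sequences at 4096 tokens is a made-up workload; substitute the real values for your model.

```python
def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, tokens, bytes_per_elem=2):
    """KV cache size: keys and values (factor 2) for every layer and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * tokens * bytes_per_elem


N_PARAMS = 70e9
# Assumed model shape; not taken from the text above.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens=1)
workload_tokens = 32 * 4096   # hypothetical: 32 sequences of 4096 tokens
kv_total = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens=workload_tokens)

for name, bpp in (("FP16", 2), ("INT8/FP8", 1), ("INT4", 0.5)):
    total = weight_bytes(N_PARAMS, bpp) + kv_total
    print(f"{name:>8}: weights {weight_bytes(N_PARAMS, bpp) / 1e9:.0f} GB "
          f"+ KV {kv_total / 1e9:.1f} GB = {total / 1e9:.1f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
```

Under these assumptions the FP16 KV cache alone is about 43 GB, so FP16 weights plus cache exceed 2×80 GB, while 8-bit weights plus the same cache fit across two GPUs with room to spare.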

nvidia-smi shows the GPU at 0% but your job is clearly running. Why? easy

The GPU isn't the bottleneck or isn't being used. Likely CPU/data-loading bound — the GPU sits idle waiting for the next batch (slow dataloader, too few workers, preprocessing on CPU). Or the process is on a different device than you're watching (CUDA_VISIBLE_DEVICES mismatch), blocked on I/O (reading data/checkpoints), or doing synchronous host work. Check per-process GPU mem, increase dataloader workers / prefetch, profile the input pipeline.

A kernel only reaches 50% of peak FLOPs. How do you improve it? hard

First place it on the roofline to learn the limiter. If it's memory-bound (below the sloped roof), raise arithmetic intensity by fusing ops, improving coalescing, and reusing data in shared memory so you stop hammering HBM. If it's compute-bound but still at 50%, check Tensor-Core utilization (dtype and shapes must be Tensor-Core-friendly, dimensions multiples of 8/16), occupancy, and warp-stall reasons, then fix the dominant stall. Always measure, change one thing, and re-measure rather than optimizing blindly.

Two models must share one GPU without interfering. How do you set that up? medium

If you need hard isolation and predictable latency between them — for example different tenants or SLAs — use MIG to carve the GPU into separate instances with dedicated memory and compute. If they're cooperative and you mainly want higher utilization, use MPS to let their kernels run concurrently on the shared GPU. In both cases cap each process's memory so one can't starve the other, and monitor for contention.

Host-to-device transfer dominates your runtime. What do you do? medium

Switch host buffers to pinned memory and issue the copies asynchronously on a separate stream so they overlap with compute via double buffering. Batch many small transfers into fewer large ones to amortize overhead, and most importantly keep data resident on the GPU across steps instead of round-tripping it every iteration. If the data is reused, cache it in device memory rather than re-sending.

FP16 training diverges but FP32 is stable and slow. How do you fix it? medium

The divergence is almost certainly FP16's narrow dynamic range causing overflow/underflow. The cleanest fix is to switch to BF16, which keeps FP32's exponent range and trains stably without loss scaling. If you must stay on FP16, add proper dynamic loss scaling plus gradient clipping so small gradients don't underflow and spikes don't explode. Either way you keep the speed of 16-bit while restoring stability.
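The underflow and overflow can be reproduced without a GPU. This sketch uses Python's `struct` half-precision format to round values to FP16. The gradient value and the loss scale of 2**16 are illustrative assumptions, and `to_fp16` is a hypothetical helper, not a framework API.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value and back (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")

grad = 1e-8        # illustrative tiny gradient
scale = 65536.0    # illustrative loss scale (2**16)

print(to_fp16(grad))                  # below FP16's smallest subnormal: becomes 0.0
print(to_fp16(grad * scale) / scale)  # scaled before the cast, unscaled after: close to 1e-8
print(to_fp16(70000.0))               # above FP16's largest finite value (65504): inf
```

Loss scaling moves small gradients into FP16's representable range. Dynamic scaling lowers the scale when an inf like the third line appears and skips that step.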

Batch-1, latency-critical inference is slow. What helps? hard

At batch 1 you're dominated by kernel-launch overhead and memory latency, not compute, so cut per-step overhead: capture the model in a CUDA graph to replay all kernels in one launch, use fused kernels, and remove Python/host overhead in the hot loop. Quantize weights to shrink the per-token memory you must stream from HBM. Essentially you optimize for launch and bandwidth efficiency rather than throughput tricks like big batching.
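A back-of-envelope bound makes the bandwidth argument concrete. The sketch assumes every weight is streamed from HBM once per generated token and ignores KV-cache reads and launch overhead. The 7B parameter count and 2 TB/s bandwidth are hypothetical inputs.

```python
def max_tokens_per_s(n_params, bytes_per_param, hbm_bw):
    """Upper bound on batch-1 decode rate if all weights are read once per token."""
    return hbm_bw / (n_params * bytes_per_param)

N_PARAMS = 7e9   # hypothetical model size
HBM_BW = 2e12    # hypothetical bandwidth in bytes/s

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    bound = max_tokens_per_s(N_PARAMS, bytes_per_param, HBM_BW)
    print(f"{name}: at most {bound:.0f} tokens/s")
```

Halving the bytes per weight doubles the ceiling, which is why weight quantization helps at batch 1 even when it adds no FLOPs. Measured rates sit below this bound, and the gap is what CUDA graphs and fusion go after.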

Adding GPUs raises cost but barely improves throughput. What's wrong? medium

This is classic communication-bound scaling. Check whether the GPUs are connected by NVLink or only PCIe, since slow interconnect throttles all-reduce; verify the per-GPU batch is large enough that compute outweighs communication; and confirm gradient all-reduce overlaps the backward pass instead of running serially. If comms dominate, increase batch size, enable overlap/bucketing, and use topology-aware collectives.
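A quick estimate shows when communication dominates. The sketch uses the standard bandwidth term for a ring all-reduce, in which each GPU sends and receives 2(N-1)/N of the buffer, and ignores latency. The link bandwidths, gradient size, and compute time are illustrative assumptions, not specs for any interconnect.

```python
def ring_allreduce_seconds(grad_bytes, n_gpus, link_bw):
    """Bandwidth term of a ring all-reduce over n_gpus (latency ignored)."""
    if n_gpus < 2:
        return 0.0
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / link_bw

def step_seconds(compute_s, comm_s, overlapped):
    """Serial: compute then communicate. Overlapped: the longer of the two."""
    return max(compute_s, comm_s) if overlapped else compute_s + comm_s

GRAD_BYTES = 2e9   # hypothetical: 1B parameters with FP16 gradients
COMPUTE_S = 0.10   # hypothetical per-step compute time per GPU

for name, bw in [("slow link", 25e9), ("fast link", 300e9)]:
    comm = ring_allreduce_seconds(GRAD_BYTES, 8, bw)
    for overlapped in (False, True):
        step = step_seconds(COMPUTE_S, comm, overlapped)
        print(f"{name}, overlapped={overlapped}: efficiency {COMPUTE_S / step:.0%}")
```

With the slow link, communication exceeds compute, so even perfect overlap leaves the step communication-bound. A larger per-GPU batch raises `COMPUTE_S` without raising the gradient size, which is why it improves scaling.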

OOM appears only on long sequences. How do you fit them? medium

Activation and KV memory grow with sequence length, so target that growth: use activation checkpointing to recompute activations instead of storing them, FlashAttention so attention doesn't materialize the N×N matrix, and consider sequence/context parallelism to split the sequence across GPUs. If the long context isn't essential, cap or chunk it. Each option has its own cost: checkpointing and FlashAttention pay in recomputation, context parallelism pays in communication, and capping gives up context.
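The growth rates are easy to check by hand. This sketch computes the size of one layer's materialized attention-score matrix (quadratic in sequence length) and of the KV cache (linear). The model shape of 32 layers, 32 heads, and head dimension 128 is a hypothetical configuration.

```python
GIB = 2**30

def attn_scores_bytes(batch, heads, seq_len, bytes_per_el=2):
    """One layer's N x N attention scores if fully materialized."""
    return batch * heads * seq_len * seq_len * bytes_per_el

def kv_cache_bytes(batch, layers, kv_heads, head_dim, seq_len, bytes_per_el=2):
    """K and V for every layer and position (the leading 2 is K plus V)."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_el

for seq_len in (4096, 32768):
    scores = attn_scores_bytes(batch=1, heads=32, seq_len=seq_len)
    kv = kv_cache_bytes(batch=1, layers=32, kv_heads=32, head_dim=128, seq_len=seq_len)
    print(f"N={seq_len}: scores {scores / GIB:.0f} GiB per layer, KV cache {kv / GIB:.0f} GiB")
```

An 8x longer sequence costs 8x the KV cache but 64x the score matrix, so the score matrix is usually what triggers the OOM first.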

The profiler shows heavy memory-related warp stalls. How do you fix it? hard

The SMs are starving on HBM, so reduce and reshape memory traffic: make global accesses coalesced, stage frequently-reused data in shared memory, and fuse passes to cut HBM round-trips. Raise occupancy or instruction-level parallelism so the scheduler has other warps to run during outstanding loads, and prefetch where possible. Re-profile to confirm the stall reason shifted away from memory dependency.

You need INT4 inference but must keep accuracy. What's your approach? hard

Use a calibration-based method rather than naive round-to-nearest: AWQ uses activation statistics to protect the salient weight channels that matter most, and GPTQ quantizes layer by layer while compensating for the error each rounding step introduces. Keep especially sensitive layers (often the first/last or attention projections) at higher precision in a mixed-precision scheme. Then validate on a real downstream task set, not just perplexity, and fall back to INT8/FP8 if INT4 quality is unacceptable.
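To show why outliers are the problem, here is the naive round-to-nearest baseline in plain Python. This is not AWQ or GPTQ; it is a sketch of what they improve on. The weights, the group size of 8, and the helper names are illustrative assumptions.

```python
def quantize_int4(weights, group_size):
    """Symmetric round-to-nearest INT4, one scale per group; returns dequantized values."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # map onto integers in [-7, 7]
        for w in group:
            q = max(-7, min(7, round(w / scale)))
            out.append(q * scale)
    return out

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Illustrative weights: 63 small values plus a single outlier.
weights = [0.01 * (i % 7 - 3) for i in range(63)] + [2.0]

per_tensor = quantize_int4(weights, group_size=len(weights))
per_group = quantize_int4(weights, group_size=8)

print("per-tensor error:", mean_abs_error(weights, per_tensor))
print("per-group error: ", mean_abs_error(weights, per_group))
```

With one scale for the whole tensor, the outlier sets the step size and every small weight rounds to zero. Per-group scales confine the damage to the outlier's own group. Calibration-based methods go further by deciding which channels deserve protection.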

How do you choose between MIG and MPS for a multi-tenant inference box? hard

Choose MIG when tenants need hardware-enforced isolation and predictable QoS — different customers, strict SLAs, or untrusted workloads — accepting fixed partition sizes and some stranded capacity. Choose MPS when the workloads are cooperative and you want maximum utilization and flexible sharing, since it lets kernels from multiple processes run concurrently without rigid partitions. In short: MIG for isolation/guarantees, MPS for utilization/flexibility.

What industry actually asks

For ML-systems / infra roles (NVIDIA, Meta, OpenAI, Anthropic, scale-ups), expect the hard block: roofline reasoning, "GPU is busy but slow," FlashAttention internals, NCCL/parallelism strategy, and a live debugging scenario. For applied-ML / new-grad screens, expect the easy/medium block: mixed precision, OOM mitigation, why .item() is slow, basic profiling. Many loops give you a real symptom ("here's nvidia-smi / a profiler trace — what's wrong?") and grade your debugging method, not trivia. Always say "I'd profile first" and name the tool (nsys for timeline, ncu for one kernel).
