LLM Training Interview Questions

Questions for ML-engineer, research-engineer, and training-infra roles, graded easy → hard with full answers. Companion to the LLM Training cheatsheet.

Difficulty tags: easy (fundamentals) · medium (applied training) · hard (senior / distributed)

Easy — fundamentals

What consumes GPU memory during training, and roughly how much? easy

Four things: weights (2 bytes/param in BF16), gradients (2 bytes/param), optimizer states (Adam keeps FP32 momentum + variance + master weights ≈ 12 bytes/param), and activations (scale with batch × sequence × layers). So with Adam in mixed precision it's ~16 bytes/param just for the model states — a 7B model is ~110 GB before activations. That's the whole reason sharding (ZeRO/FSDP) and checkpointing exist.
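
The arithmetic above can be checked with a short sketch. The function name and the 1 GB = 1e9 bytes convention are assumptions made for illustration, and activations are left out.

```python
# Bytes per parameter for mixed-precision Adam, as listed in the answer above.
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "grads_bf16": 2,
    "adam_fp32_states": 12,  # momentum + variance + master weights, 4 bytes each
}


def model_state_gb(num_params: float) -> float:
    """Model-state memory in GB (1 GB = 1e9 bytes), activations excluded."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(f"7B model: {model_state_gb(7e9):.0f} GB of model states")
```

Activation memory depends on batch size, sequence length, and architecture, so it is deliberately not modeled here.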

What's the difference between an epoch, a step, and a batch? easy

A batch is the set of samples processed in one forward/backward. A step (iteration) is one optimizer update — usually one batch, or several micro-batches if you use gradient accumulation. An epoch is one full pass over the training set. Large LLM pretraining is often measured in tokens seen rather than epochs (frequently <1 epoch over a giant corpus).

Why use a learning-rate warmup? easy

At the start, weights are random and gradient estimates are noisy; a full learning rate can take a huge, destabilizing step and diverge (especially with adaptive optimizers whose variance estimates aren't warmed up). Warmup ramps the LR from ~0 to the peak over the first few hundred/thousand steps, letting the optimizer stabilize, then a schedule (usually cosine) decays it. Skipping warmup is a classic cause of early loss spikes / NaNs in transformer training.
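
A minimal sketch of linear warmup followed by cosine decay, assuming a fixed total step count. The function and parameter names are illustrative, not taken from any particular library.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int, total_steps: int,
          min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past the end of the schedule
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `peak_lr=3e-4`, `warmup_steps=1000`, and `total_steps=10000`, the rate starts near zero, reaches the peak at the end of warmup, and falls to `min_lr` at the final step.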

What is fine-tuning, and how is it different from pretraining? easy

Pretraining trains a model from scratch on a massive general corpus (next-token prediction) — hugely expensive, learns general language/knowledge. Fine-tuning takes that pretrained model and continues training on a smaller, task- or domain-specific dataset to adapt its behavior (e.g. instruction-following, a domain style, a format). It's far cheaper and is how most teams customize models. Parameter-efficient methods like LoRA make it cheaper still.

What is gradient accumulation? easy

A way to simulate a large batch when it won't fit in memory: run several smaller micro-batches, summing (accumulating) their gradients, and only call the optimizer step + zero-grad after N of them. The effective batch size is micro_batch × N, giving large-batch training dynamics without the memory cost of a large batch — at the price of more forward/backward passes per update.
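
The equivalence can be illustrated on a toy one-parameter regression in plain Python. This sketch assumes equal-sized micro-batches and divides each micro-batch gradient by N so the accumulated sum matches the mean over the large batch, which is a common convention rather than the only one.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float],
                     micro_batch: int) -> float:
    """Accumulate micro-batch gradients, each scaled by 1/N, before one update."""
    n_micro = len(xs) // micro_batch  # assumes len(xs) is divisible by micro_batch
    total = 0.0
    for i in range(n_micro):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch=2)
print(abs(full - accum) < 1e-9)
```

In a real framework the same effect usually comes from scaling the loss by 1/N before each backward pass and stepping the optimizer every N micro-batches.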

How much memory does training a model take? easy

~16 bytes/param with Adam in mixed precision: 16-bit weights (2) + gradients (2) + FP32 master weights (4) + Adam m and v (8), plus activations, which scale with batch size and sequence length.

What is gradient accumulation? easy

Summing gradients over several micro-batches before one optimizer step, simulating a large batch without its memory.

What is learning-rate warmup? easy

Ramping LR from ~0 over the first steps to avoid destabilizing the fresh model, then decaying (cosine/linear).

What is gradient clipping? easy

Capping gradient norm (e.g. 1.0) to stop exploding gradients from blowing up a step — key for stability.
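
A plain-Python sketch of clipping by global norm, where gradients are represented as nested lists for illustration. It follows the same idea as PyTorch's `torch.nn.utils.clip_grad_norm_`, but it is a simplified stand-in and not that function's implementation.

```python
import math


def clip_by_global_norm(grads: list[list[float]], max_norm: float = 1.0):
    """Scale every gradient by one shared factor so the joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Never scale up; the small epsilon guards against division by zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm
```

Returning the pre-clip norm is useful for logging, since a rising grad-norm is often the earliest sign of instability.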

What is a checkpoint? easy

A snapshot of weights + optimizer state + scheduler + step (+RNG) so training resumes exactly after a crash.

What is LoRA? easy

Low-Rank Adaptation: freeze the base weights and train a small low-rank update B·A on chosen layers, which gives few trainable params and cheap fine-tuning.

What is QLoRA? easy

LoRA on a 4-bit (NF4) frozen base with paged optimizers — fine-tune very large models on modest GPUs.

SFT vs pretraining? easy

Pretraining learns general language from huge unlabeled text (next-token); SFT teaches instruction-following on curated prompt→response pairs.

Epoch vs step? easy

A step is one optimizer update on one batch; an epoch is one full pass over the dataset.

What is MFU? easy

Model FLOPs Utilization — fraction of peak GPU FLOPs actually achieved; low MFU = starved/stalled GPUs.

Medium — applied training

How does gradient checkpointing reduce memory, and what's the cost? medium

Normally every layer's activations are kept in memory for the backward pass. Gradient (activation) checkpointing saves only a subset (checkpoints) and recomputes the rest during backward by re-running the forward of those segments. For an N-layer net this turns O(N) activation memory into roughly O(√N) at the cost of ~one extra forward pass (~33% more compute). It's the standard lever for fitting long sequences / large models, and you can tune how aggressively you checkpoint to trade memory vs speed.
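
The O(N) versus O(√N) claim can be made concrete with a simple counting model. This sketch assumes uniform layers and a checkpoint every ~√N layers; it counts layer activations held at peak, not bytes.

```python
import math


def stored_layer_activations(num_layers: int, checkpoint: bool) -> int:
    """Peak count of layer activations held during backward (simplified model)."""
    if not checkpoint:
        return num_layers  # every layer's activations are kept
    segment = max(1, round(math.sqrt(num_layers)))
    num_checkpoints = math.ceil(num_layers / segment)
    # Saved checkpoints plus the one segment currently being recomputed.
    return num_checkpoints + segment


for n in (16, 64, 100):
    print(n, stored_layer_activations(n, False), stored_layer_activations(n, True))
```

Checkpointing more sparsely than every √N layers saves less memory per segment recomputed, so the interval is the knob for trading memory against speed.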

LoRA vs full fine-tuning — how does LoRA work and when do you choose it? medium

LoRA freezes the pretrained weights and injects two small trainable low-rank matrices, A and B, alongside chosen weight matrices (typically attention projections): the update is ΔW = (α/r)·B·A with rank r ≪ dimension, so you train ~0.1–1% of the parameters. Benefits: tiny optimizer/gradient memory, fast training, adapters that are only a few MB and can be merged into the base weights for inference (zero overhead) or swapped per request (multi-LoRA serving), and less catastrophic forgetting. Choose LoRA/QLoRA when you have limited VRAM, want many cheap variants, or are adapting style/format. Choose full fine-tuning when you need maximum quality, are changing the model substantially, and have the compute.
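
A small sketch of the LoRA arithmetic for a single weight matrix, using nested lists instead of tensors. The shapes (B is d_out × r, A is r × d_in) and the function names are illustrative assumptions.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA params (A: rank x d_in, B: d_out x rank) over frozen W params."""
    return rank * (d_in + d_out) / (d_in * d_out)


def lora_delta(B: list[list[float]], A: list[list[float]],
               alpha: float, rank: int) -> list[list[float]]:
    """Delta W = (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(len(A[0]))]
            for i in range(len(B))]


# For one 4096 x 4096 projection at rank 8, under 0.4% of that matrix is trainable.
print(f"{lora_trainable_fraction(4096, 4096, 8):.4%}")
```

The fraction is per adapted matrix; across a whole model it is lower still when only some layers carry adapters.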

What is QLoRA and how does it let you fine-tune a big model on one GPU? medium

QLoRA loads the frozen base model in 4-bit (NF4 quantization) to slash the weight memory, then trains LoRA adapters in BF16 on top, with the 4-bit weights dequantized on the fly for each forward. Add tricks like double quantization (quantize the quantization constants) and paged optimizers (offload optimizer state to CPU on memory spikes). Net effect: a 65B model that would need hundreds of GB for full fine-tuning fits on a single large GPU, with quality close to 16-bit fine-tuning because only the small adapters are trained.

BF16 vs FP16 for training stability — which and why? medium

BF16 has the same 8-bit exponent as FP32, so it covers the same dynamic range and rarely overflows/underflows — you can train without loss scaling, and it's much more robust to gradient spikes. FP16 has only 5 exponent bits, so small gradients underflow to zero and large values overflow to inf; it needs a dynamic loss scaler (scale loss up before backward, unscale before the step, back off on overflow). On Ampere and newer, default to BF16 — fewer NaNs, simpler. FP16 is mainly for older hardware lacking BF16.
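
The range difference can be seen without a GPU by round-tripping values through each format with the standard library. This is a sketch: `to_bf16` emulates bfloat16 by keeping the top 16 bits of the FP32 encoding and is only meant for finite, moderate inputs.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round-trip through bfloat16 by truncating the FP32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1e-8, 70000.0, 1.001):
    print(value, to_fp16(value), to_bf16(value))
```

A tiny gradient like 1e-8 vanishes in FP16 and a value like 70000 overflows, while BF16 keeps both at reduced precision. The 1.001 case shows the flip side: FP16 is the more precise format near 1.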

Your training loss suddenly goes to NaN. How do you debug it? medium

First look at grad-norm just before the NaN. A spike there means an optimization blow-up: LR too high / no warmup → lower LR, add warmup; no gradient clipping → clip global grad-norm to ~1.0; FP16 overflow → switch to BF16 or fix the loss scaler. A NaN with flat grad-norm points elsewhere: a bad data sample (inf/NaN in inputs, empty/over-long sequence, label out of range) — inspect the exact batch at the failing step; or numerics in a custom layer (log(0), divide-by-zero, unstable softmax) — add epsilons. torch.autograd.set_detect_anomaly(True) pinpoints the offending op (slow, for a repro run).
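
For the bad-sample branch, a small validator run on the exact failing batch narrows things down quickly. This sketch assumes a hypothetical batch format (a list of dicts with `input_ids` and `labels`) and treats -100 as the ignored label, the default `ignore_index` of PyTorch's cross-entropy loss.

```python
def find_bad_samples(batch: list[dict], vocab_size: int, max_len: int,
                     ignore_index: int = -100) -> list[tuple[int, str]]:
    """Return (sample index, reason) for samples likely to break the loss."""
    problems = []
    for i, sample in enumerate(batch):
        ids, labels = sample["input_ids"], sample["labels"]
        if len(ids) == 0:
            problems.append((i, "empty sequence"))
        elif len(ids) > max_len:
            problems.append((i, "over-long sequence"))
        if any(t != ignore_index and not 0 <= t < vocab_size for t in labels):
            problems.append((i, "label out of range"))
    return problems


batch = [
    {"input_ids": [1, 2], "labels": [2, 3]},
    {"input_ids": [], "labels": []},
    {"input_ids": [1], "labels": [99]},
]
print(find_bad_samples(batch, vocab_size=50, max_len=8))
```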

Explain the ZeRO stages / FSDP. medium

ZeRO shards optimizer state (1), gradients (2), and parameters (3) across DP ranks to cut per-GPU memory; FSDP is PyTorch's native implementation of the same idea, and its full-shard mode corresponds to stage 3.

What is activation checkpointing? medium

Drop intermediate activations in forward, recompute them in backward — trades compute for big activation-memory savings.

BF16 vs FP16 for training? medium

BF16 preferred: FP32 exponent range, rarely overflows, no loss scaling. FP16 has more mantissa precision, but its narrow range causes overflow/underflow and requires loss scaling.

What is 3D parallelism? medium

Data + tensor + pipeline parallelism combined to scale to thousands of GPUs, mapped to interconnect topology (TP in-node, PP/DP across).

RLHF vs DPO? medium

RLHF trains a reward model then PPO-optimizes the policy (complex/unstable); DPO optimizes preferences directly with a simple loss, no reward model.

Why offload optimizer state? medium

Adam's FP32 moments are 8 bytes/param (12 with the FP32 master weights), the largest share of model-state memory; offloading them to CPU/NVMe (ZeRO-Offload) frees GPU memory at a bandwidth cost.

How do you debug a loss spike? medium

Check bad batch, LR too high, missing clip, FP16 overflow; resume from last good checkpoint with clip/BF16 and inspect gradient norms.

What is DoRA and why beyond LoRA? medium

Weight-Decomposed LoRA adapts magnitude and direction separately, often closing the gap to full fine-tuning better than vanilla LoRA.

How is pretraining data quality managed? medium

Dedup, quality/toxicity/PII filtering, domain mixing/weighting, and eval-set decontamination — quality often beats raw quantity.

What is a cosine LR schedule with warmup? medium

Warm up linearly then decay following a cosine curve to a small final LR — smooth decay that empirically trains well for LLMs.

Hard — senior & distributed

Explain ZeRO stages 1/2/3 and the memory vs communication trade-off. hard

Data-parallel training normally replicates weights, gradients, and optimizer states on every GPU — redundant. ZeRO partitions them across the data-parallel group: Stage 1 shards optimizer states (biggest single win for Adam); Stage 2 also shards gradients; Stage 3 also shards parameters, so each GPU stores only a slice and all-gathers a layer's full params just-in-time for its forward/backward, freeing them after. Memory drops roughly linearly with the number of GPUs, but stage 3 adds all-gather (params) + reduce-scatter (grads) traffic every layer. The trade-off: higher stages fit bigger models but need more bandwidth; on slow interconnect they idle the GPUs. Rule: pick the lowest stage that fits, and add ZeRO-Offload (CPU/NVMe) only when even stage 3 doesn't.
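
The memory side of the trade-off can be sketched with the same 2 + 2 + 12 bytes/param breakdown used for mixed-precision Adam. The function name is illustrative, and the model ignores activations, buffers, and fragmentation.

```python
def zero_bytes_per_param(stage: int, n_gpus: int) -> float:
    """Per-GPU model-state bytes per parameter under ZeRO (stage 0 = plain DP)."""
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= n_gpus    # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus    # stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus  # stage 3: also shard parameters
    return weights + grads + optim


for stage in range(4):
    gb = 7e9 * zero_bytes_per_param(stage, n_gpus=64) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU for a 7B model on 64 GPUs")
```

Only stage 3 scales all three terms with GPU count, which is why it is the stage that also pays the per-layer all-gather cost.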

Compare data, tensor, and pipeline parallelism. How do they combine in 3D parallelism? hard

Data parallel replicates the model and splits the batch; communication is a gradient all-reduce per step. Tensor parallel splits the math within each layer (e.g. shard the attention/MLP matmuls across GPUs) and all-reduces activations every layer — very chatty, needs NVLink, kept intra-node. Pipeline parallel splits the model's layers into sequential stages on different GPUs; activations pass stage→stage (low bandwidth, crosses nodes), but you must micro-batch to fill the pipeline or waste time in the bubble. 3D parallelism composes them for frontier models: TP within a node (fast links), PP across nodes, and DP (often ZeRO) replicating those groups for throughput — sized so each dimension's communication maps to the matching tier of the network topology.
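
One way to picture the composition is a rank layout where the tensor-parallel index varies fastest. This is an illustrative mapping that assumes 8 GPUs per node with contiguous ranks; real frameworks choose their own orderings.

```python
def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp_idx, pp_idx, tp_idx), with TP varying fastest."""
    assert 0 <= rank < tp * pp * dp, "world size must equal tp * pp * dp"
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


# 64 GPUs as TP=8 (one node), PP=4 stages, DP=2 replicas.
coords = [rank_to_coords(r, tp=8, pp=4, dp=2) for r in range(64)]
print(coords[0], coords[9], coords[32])
```

With TP = 8, ranks 0 to 7 share one (dp, pp) pair, so the chattiest collectives stay on the node's fast links while pipeline and data-parallel traffic crosses nodes.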

A multi-node run hangs partway through (not at start). What are the likely causes? hard

Mid-run hangs are almost always a collective desync or a dead rank. (1) Ranks diverged in control flow — e.g. a branch, uneven last batch, or logging/eval that only some ranks enter, so they call different collectives and the rest block on the barrier until the NCCL watchdog timeout. Fix with DistributedSampler/drop_last or the DDP join() context, and TORCH_DISTRIBUTED_DEBUG=DETAIL to flag mismatches. (2) One rank crashed (OOM, bad sample, hardware/Xid) and the others wait forever — the true error is in that rank's log, so grep all ranks for the first traceback, not the timeout message. (3) A flaky NIC / NCCL transport stall — NCCL_DEBUG=INFO, check the fabric. Set a sane collective timeout so it fails loudly instead of hanging.

SFT vs RLHF vs DPO — what does each do and why did DPO get popular? hard

SFT (supervised fine-tuning) trains on curated instruction→response pairs — it teaches the format and basic helpfulness, the foundation of a chat model. RLHF aligns to human preferences: train a reward model from pairwise preference data, then optimize the policy against it with PPO while a KL penalty keeps it near the SFT model. It's powerful but a complex, unstable pipeline (reward hacking, four models in memory, finicky PPO). DPO (Direct Preference Optimization) skips the reward model and RL loop: it derives a simple classification-style loss directly on preference pairs that provably optimizes the same objective, using the SFT model as a reference. It's far simpler, more stable, and cheaper — hence its popularity as the default. Newer variants (ORPO, KTO, GRPO for reasoning/RLVR) push further, but DPO is the pragmatic baseline.
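
The "classification-style loss" can be written in a few lines. This sketch assumes you already have summed sequence log-probabilities for the chosen and rejected responses under the policy and the frozen reference model; the argument names and the default `beta` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))) for one preference pair."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(dpo_loss(-10.0, -10.0, -10.0, -10.0))  # policy identical to reference
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))   # policy favors the chosen response
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference more than it raises the rejected one's, and `beta` plays the role of the KL-penalty strength.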

What is MFU (Model FLOPs Utilization) and why track it instead of GPU utilization? hard

MFU = the model's useful FLOPs per second ÷ the hardware's theoretical peak FLOPs. It answers "what fraction of the GPU's compute am I actually turning into model training?" GPU-utilization from nvidia-smi only says a kernel was resident, not that it was efficient — you can be 100% "utilized" at 15% MFU. MFU exposes the real inefficiency: low MFU means starvation, host-device syncs, comms-bound parallelism, no Tensor Cores, or bad kernels. Good large-scale training targets ~40–55% MFU; if you're far below, profile before scaling. It's also the honest number for cost/throughput planning.
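
A rough MFU estimate needs only throughput and the hardware's peak. This sketch uses the common approximation of about 6 FLOPs per parameter per token for a forward plus backward pass, which ignores attention FLOPs. The throughput figure is hypothetical, and 312 TFLOPS is the A100's dense BF16 peak.

```python
def mfu(num_params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization using the ~6 * params FLOPs-per-token estimate."""
    achieved = 6.0 * num_params * tokens_per_sec
    return achieved / (num_gpus * peak_flops_per_gpu)


# Hypothetical run: 7B params, 25,000 tokens/s total across 8 GPUs.
print(f"MFU = {mfu(7e9, 25_000, 8, 312e12):.1%}")
```

For long sequences the attention term becomes significant, so a more careful count would add it to the numerator.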

Why must you checkpoint optimizer state, not just weights, and what else does a robust resume need? hard

Adam's per-parameter momentum and variance define the optimization trajectory; resuming from weights alone resets them, so the optimizer effectively restarts cold and you get a loss bump and disrupted dynamics. A correct checkpoint saves weights + optimizer state + LR-scheduler step + RNG seeds + the data iterator/sampler position (so you don't re-see early data or skip data) + the step/token count. For large models, optimizer state is sharded too, so use distributed/sharded checkpointing (and async checkpointing to avoid stalling the run). Because long runs will hit node failures, checkpoint frequently and test the resume path before you depend on it.
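
The effect of a complete versus weights-only resume can be shown with a toy momentum-SGD loop in plain Python. The state layout here is a hypothetical stand-in for a real checkpoint, with the RNG state doubling as the data-stream position.

```python
import copy
import random


def fresh_state(seed: int = 0) -> dict:
    return {"w": 0.0, "momentum": 0.0, "step": 0,
            "rng_state": random.Random(seed).getstate()}


def train(state: dict, num_steps: int) -> dict:
    """Toy momentum-SGD on a scalar; mutates and returns the state dict."""
    rng = random.Random()
    rng.setstate(state["rng_state"])
    for _ in range(num_steps):
        x = rng.uniform(-1.0, 1.0)  # stands in for the next data sample
        grad = 2.0 * (state["w"] * x - 3.0 * x) * x
        state["momentum"] = 0.9 * state["momentum"] + grad
        state["w"] -= 0.05 * state["momentum"]
        state["step"] += 1
    state["rng_state"] = rng.getstate()
    return state


uninterrupted = train(fresh_state(), 100)

checkpoint = copy.deepcopy(train(fresh_state(), 60))  # simulated crash at step 60
resumed = train(copy.deepcopy(checkpoint), 40)

weights_only = fresh_state()
weights_only["w"] = checkpoint["w"]  # optimizer state, RNG, and step are reset
weights_only = train(weights_only, 40)

print(resumed["w"] == uninterrupted["w"], weights_only["step"])
```

The full resume reproduces the uninterrupted run bit for bit, while the weights-only restart loses its momentum, replays early data, and reports the wrong step count.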

How do you compute and improve MFU? hard

MFU = achieved/peak FLOPs; improve via bigger batch, kernel fusion (FlashAttention, torch.compile), comm/compute overlap, fixing the input pipeline, BF16/Tensor Cores.

How does pipeline parallelism avoid bubbles? hard

Split the batch into micro-batches and interleave forward/backward (GPipe/1F1B schedules) so stages stay busy, shrinking the idle pipeline bubble.
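
The size of the bubble follows from a simple count, sketched below. It assumes p equally fast stages and a non-interleaved GPipe or 1F1B schedule with m micro-batches.

```python
def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Idle fraction of the pipeline: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_micro_batches
    return (p - 1) / (m + p - 1)


for m in (1, 4, 16, 64):
    print(f"4 stages, {m} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

The bubble shrinks as m grows relative to p, which is why pipeline parallelism needs many micro-batches per step to be efficient.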

Reward hacking in RLHF: how do you prevent it? hard

KL penalty to the reference policy, improve/retrain the reward model on exploited cases, diverse evals, and consider DPO for stability.

How do scaling laws guide a training run? hard

Chinchilla-style laws balance params vs tokens for a given compute budget (~20 tokens/param); they tell you the model size and dataset size that avoid under- or over-training.
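
The rule of thumb turns into a quick sizing calculation. This sketch assumes training compute C ≈ 6·N·D FLOPs (N params, D tokens) together with the ~20 tokens/param ratio, and the function name is illustrative.

```python
import math


def compute_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for params N and tokens D."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


n, d = compute_optimal(5.88e23)
print(f"params = {n / 1e9:.0f}B, tokens = {d / 1e12:.1f}T")
```

Teams that care about inference cost often train smaller models on more tokens than this ratio suggests, so treat the output as a starting point.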

What causes NCCL hangs/deadlocks, and how do you debug them? hard

Ranks out of sync (one crashed/skipped a collective), mismatched shapes, or network issues; debug with NCCL_DEBUG=INFO, collective timeouts, ensuring all ranks call collectives identically.

Full fine-tune vs PEFT — when each? hard

Full FT for big domain shifts with ample compute/data; PEFT (LoRA/QLoRA) when compute-limited, many task variants, or to avoid catastrophic forgetting — near-FT quality at a fraction of cost.

How do you handle catastrophic forgetting? hard

Lower LR, fewer epochs, PEFT (less invasive), replay/mix general data, and eval base capabilities — SFT can erode pretrained skills.

What is sequence/context parallelism? hard

Splitting the sequence dimension across GPUs so each holds part of the tokens' activations/attention — enables very long context training beyond one GPU's activation memory.

How do you choose batch size and LR together? hard

Larger batch needs higher LR (linear/sqrt scaling) + warmup; too-large batch can hurt generalization. Tune via LR sweep at the chosen batch and watch grad-norm stability.
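
The two scaling heuristics can be written down as a starting point for the LR sweep. This is a sketch with illustrative names; neither rule is guaranteed to hold for a given model, which is why the sweep is still needed.

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int,
              rule: str = "sqrt") -> float:
    """Heuristic LR for a new batch size, given a tuned (base_lr, base_batch) pair."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")


print(scaled_lr(1e-4, 256, 1024, "linear"), scaled_lr(1e-4, 256, 1024, "sqrt"))
```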

How do you make a long multi-node run robust? hard

Frequent full checkpoints to durable storage, automatic resume (incl. dataloader position + RNG), elastic/fault-tolerant training, health checks, and straggler detection.

Scenario-based

Loss goes to NaN around step 5000. Walk through the fix. hard

NaN = numerical blowup. If FP16, it's likely overflow/underflow — switch to BF16 (wider range, the standard fix) or check the loss scaler. Add/verify gradient clipping (norm ~1.0). Check for a too-high LR or a bad warmup. Inspect the data batch around that step (a corrupt/garbage sample, divide-by-zero, bad label). Watch gradient norms — a spike right before NaN points to instability or a hot sample. Reproduce by resuming from a pre-NaN checkpoint with clipping + BF16.

Loss plateaus very early and won't drop. What do you check? medium

Order of suspicion: learning rate (too low = no progress, too high = bouncing — try an LR sweep and a proper warmup+decay schedule), data (quality, shuffling, tokenization bug, labels), gradient flow (vanishing grads, check norms per layer; init/normalization), and capacity vs task. Also confirm the loss/metric is even computed correctly. Plot LR, grad-norm, and a few sample predictions — the plateau cause usually shows up there.

A training job crashes at 80% completion. How do you resume correctly? medium

Resume needs the full state, not just weights: model params, optimizer state (Adam moments), LR scheduler step, RNG seeds, current step/epoch, and ideally dataloader position (so you don't re-see or skip data). Save these together every N steps. On restart, load all of it and continue — not from scratch, not from step 0. Test the resume path before you need it; the classic bug is restoring weights but resetting the optimizer/scheduler, which spikes the loss.

MFU is low — training is slow. How do you speed it up? hard

Low MFU means the GPUs aren't doing useful math. Profile to find the limiter: input pipeline bound (GPU starved — more dataloader workers, prefetch, faster storage), batch too small (raise it / gradient accumulation), no kernel fusion (use FlashAttention, fused optimizers, torch.compile), or communication bound in multi-GPU (overlap all-reduce, use FSDP/ZeRO, bigger batch per step). Also check mixed precision is on and shapes hit Tensor Cores. Fix the dominant one, re-measure MFU.

You must fine-tune a 70B model on a couple of GPUs. How? hard

Full fine-tuning is impossible there (16 bytes/param ≈ 1.1 TB of model states). Use QLoRA: load the base model in 4-bit (NF4), freeze it, and train small LoRA adapters in BF16, which cuts trainable params and optimizer state by orders of magnitude. Add gradient checkpointing and paged optimizers for memory spikes. You train only the adapters (typically around 1% of params or less) and then merge or serve them. Trade-off: slightly below full-FT quality, but it fits and is cheap.

Your SFT model overfits — train loss drops, eval degrades. Fix it. medium

Classic overfit on a small instruction set. Reduce epochs (SFT often needs only 1–3), lower the LR, add early stopping on a held-out eval set, and get more/diverse data. Judge progress by actual generations on that held-out set, not just loss. LoRA with a modest rank also regularizes relative to full fine-tuning. Watch for catastrophic forgetting of base capabilities; mix in some general data if needed.

Throughput low, GPUs at 40% util. Diagnose. hard

Input-pipeline or comm bound: profile, add dataloader workers/prefetch, raise batch, overlap all-reduce — then re-measure MFU.

Pretrain a model that doesn't fit one node. Plan? hard

FSDP/ZeRO-3 sharding + tensor parallelism in-node + pipeline across nodes, mapped to NVLink/IB, with activation checkpointing.

Loss fine but downstream quality poor. What's wrong? medium

Train/eval mismatch, data contamination, overfitting, wrong eval format, or wrong objective — inspect generations and held-out metrics, not just loss.

10-day run crashes at day 8. Avoid losing it? medium

Frequent full checkpoints (weights+optimizer+scheduler+RNG+dataloader pos) and tested auto-resume.

Fine-tune is good but forgot base abilities. Fix? medium

Catastrophic forgetting — lower LR, fewer epochs, use LoRA, and mix in general data.

Consumer GPUs only, must fine-tune 13B. How? hard

QLoRA: 4-bit frozen base + LoRA + gradient checkpointing + paged 8-bit optimizer; train adapters and merge.

Reward hacking during RLHF. Response? hard

KL penalty to reference model, retrain reward model on exploited cases, add diverse evals, consider DPO.

Gradients explode only on some batches. Diagnose. medium

Outlier/garbage samples or long-sequence spikes; add clipping, inspect/filter those batches, check tokenization/label bugs.

Distributed run hangs with no error. Likely cause? hard

NCCL collective deadlock — a rank crashed/desynced or shape mismatch; check NCCL logs, ensure identical collective calls, set timeouts.

Cut training cost 2x without hurting quality much. Levers? hard

Raise MFU (pipeline/comm/batch), BF16/FP8, cleaner-but-less data, spot instances with checkpointing, right-size model/tokens via scaling laws.

What industry actually asks

For research-engineer / training-infra roles (frontier labs, large model teams), the hard block dominates: ZeRO/FSDP internals, 3D parallelism mapped to topology, distributed debugging, alignment (SFT/DPO/RLHF), MFU, and checkpoint/resume correctness. For applied-ML / fine-tuning roles, expect easy/medium: memory breakdown, LoRA/QLoRA, warmup, BF16 vs FP16, and "loss went to NaN — debug it." A frequent live exercise: "this 13B model won't fit / trains too slowly on 8×A100 — what do you change?" — answer with the memory ladder (BF16 → checkpointing → accumulation → FSDP stage) and the throughput ladder (data pipeline, kill syncs, compile, overlap comms), and mention you'd confirm with MFU.
