DeepSeek Engineering Blog Series · Phase 1

LLM Foundations

Article 5 of 7 · Phase 1 of 10

May 9, 2026 · ml · 14 min read · 3000 words · intermediate

Causal Attention and Autoregressive Generation.

ml deepseek transformers phase-1 intermediate

Why can't an LLM see into the future? The causal mask is why — and it's elegantly simple. This is article 5 of 7 in Phase 1 of the DeepSeek Engineering Blog Series.

If You Read Nothing Else: During training, the model sees the full sequence at once — but it shouldn't "cheat" by looking at future tokens when predicting the next one. The causal mask sets all future positions to −∞ before softmax, forcing zero attention weight on them. During inference, this mask becomes unnecessary because future tokens simply don't exist yet. This design forces the use of a KV cache — the memory bottleneck that DeepSeek's MLA solves.

The problem: training vs inference

There's a fundamental asymmetry in how LLMs work:

  • During training: We have the complete sequence. "The cat sat on the mat" — all 6 tokens available at once. We can process them in parallel.
  • During inference: We only have the past. We've generated "The cat" and need to predict "sat" — but "on the mat" doesn't exist yet.

The problem: if during training token position 3 can attend to positions 4, 5, and 6, it would learn to "cheat" — using information from the future to predict the present. But at inference time, that future information won't be available. The model would fail catastrophically.

The causal (triangular) mask

[Figure: two 4×4 matrices, "attention scores" and "after causal mask", with every upper-triangle entry replaced by −∞; softmax(−∞) = 0, so future tokens get zero attention weight.]

Fig 1 — Causal mask: upper-triangle set to −∞ so each token can only attend to itself and past tokens.

The solution is the causal attention mask — a triangular matrix applied to the attention scores before softmax. Every position above the diagonal (future positions) gets set to −∞.

# Attention scores (4×4) BEFORE masking:
scores = [[0.5, 1.2, 0.8, 0.3],    # token 0 → all tokens
          [1.1, 0.4, 0.9, 0.6],    # token 1 → all tokens
          [0.7, 1.3, 0.2, 1.0],    # token 2 → all tokens
          [0.8, 0.5, 1.1, 0.4]]    # token 3 → all tokens

# The causal mask:
mask = [[ 0,  -∞,  -∞,  -∞],       # token 0: only see itself
        [ 0,   0,  -∞,  -∞],       # token 1: see tokens 0-1
        [ 0,   0,   0,  -∞],       # token 2: see tokens 0-2
        [ 0,   0,   0,   0]]       # token 3: see all tokens 0-3

# After masking (add mask to scores):
masked = [[0.5,  -∞,  -∞,  -∞],
          [1.1, 0.4,  -∞,  -∞],
          [0.7, 1.3, 0.2,  -∞],
          [0.8, 0.5, 1.1, 0.4]]

Effect on softmax

When softmax encounters −∞, exp(−∞) = 0. Those positions get exactly zero attention weight:

# After softmax:
weights = [[1.00, 0.00, 0.00, 0.00],   # token 0: 100% self
           [0.67, 0.33, 0.00, 0.00],   # token 1: attends to 0 and 1
           [0.29, 0.53, 0.18, 0.00],   # token 2: attends to 0, 1, 2
           [0.27, 0.20, 0.36, 0.18]]   # token 3: attends to all

Each row sums to 1.0, but only over the tokens up to and including the current position. Future tokens are invisible.
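
A quick sanity check of those numbers (a minimal snippet using the same scores as above; the printed values are rounded):

import torch
import torch.nn.functional as F

scores = torch.tensor([[0.5, 1.2, 0.8, 0.3],
                       [1.1, 0.4, 0.9, 0.6],
                       [0.7, 1.3, 0.2, 1.0],
                       [0.8, 0.5, 1.1, 0.4]])

# Future positions: strictly upper triangle
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
weights = F.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)

print(weights.round(decimals=2))   # upper triangle is exactly 0.00
print(weights.sum(dim=-1))         # every row sums to 1.0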

Key Insight

The causal mask is the only difference between encoder-style attention (BERT, which sees everything) and decoder-style attention (GPT, which can only look backward). It's a single triangular matrix — and it changes the entire model from bidirectional to autoregressive.

Implementation in code

import torch
import torch.nn.functional as F

def causal_self_attention(X, W_Q, W_K, W_V):
    """Self-attention with causal masking."""
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V

    n = Q.shape[0]
    d_k = K.shape[-1]

    scores = Q @ K.T / (d_k ** 0.5)      # (n, n)

    # Create causal mask: upper triangle = True
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))

    weights = F.softmax(scores, dim=-1)    # (n, n)
    output = weights @ V                   # (n, d_v)
    return output, weights

The key line: torch.triu(..., diagonal=1) creates a strictly upper-triangular boolean mask; the diagonal is excluded, so every token can still attend to itself. masked_fill sets the masked positions to −∞. That's the entire causal mechanism.

This is essentially how it's implemented in Karpathy's nanoGPT — see the CausalSelfAttention class, line 40: self.register_buffer("bias", torch.tril(torch.ones(block_size, block_size))). nanoGPT stores the complementary form: a lower-triangular matrix of ones marking the allowed positions, built once as a buffer and sliced to the current sequence length on every forward pass.
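
Here is a minimal sketch of that buffer-based pattern (the class name CausalMask and the block_size argument are illustrative, not nanoGPT's actual interface): the mask is built once for a maximum sequence length and sliced down to the current length each time it's applied.

import torch
import torch.nn as nn

class CausalMask(nn.Module):
    """Precompute the causal mask once instead of rebuilding it every step."""
    def __init__(self, block_size):
        super().__init__()
        # Lower-triangular ones mark the positions each token is allowed to see.
        allowed = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("allowed", allowed)

    def forward(self, scores):                    # scores: (n, n) with n <= block_size
        n = scores.shape[-1]
        return scores.masked_fill(~self.allowed[:n, :n], float('-inf'))

# Usage inside causal_self_attention above:
#   causal_mask = CausalMask(block_size)
#   scores = Q @ K.T / d_k ** 0.5
#   weights = F.softmax(causal_mask(scores), dim=-1)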

The autoregressive generation loop

At inference time, generation happens one token at a time:

# Autoregressive generation — step by step

# Step 1: Input = ["The"]
tokens = ["The"]
logits = model(["The"])           # predict next
next = sample(logits)              # → "cat"
tokens = ["The", "cat"]

# Step 2: Input = ["The", "cat"]
logits = model(["The", "cat"])    # predict next
next = sample(logits)              # → "sat"
tokens = ["The", "cat", "sat"]

# Step 3: Input = ["The", "cat", "sat"]
logits = model(["The", "cat", "sat"])
next = sample(logits)              # → "down"
tokens = ["The", "cat", "sat", "down"]

# ... and so on

Each step feeds the entire sequence so far through the model. The model outputs logits for every position, but we only use the logit at the last position to predict the next token.
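
In code, the naive loop looks roughly like the sketch below. It assumes a hypothetical model that maps a (1, t) tensor of token ids to (1, t, vocab_size) logits, and uses greedy decoding in place of sample() for simplicity.

import torch

@torch.no_grad()
def generate(model, token_ids, max_new_tokens):
    """Naive autoregressive decoding. token_ids: (1, t) tensor of token ids."""
    for _ in range(max_new_tokens):
        logits = model(token_ids)                           # (1, t, vocab_size)
        next_logits = logits[:, -1, :]                      # only the last position is used
        next_id = next_logits.argmax(dim=-1, keepdim=True)  # (1, 1); swap in sampling if desired
        token_ids = torch.cat([token_ids, next_id], dim=1)  # feed the new token back in
    return token_ids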

Teacher forcing vs free-running

During training, we use teacher forcing: the model receives the true sequence as input and predicts the next token at every position simultaneously. The causal mask ensures it can't cheat by looking ahead. The loss is computed at every position in parallel.

# Training: teacher forcing (parallel)
input  = ["The", "cat", "sat", "on",  "the"]
target = ["cat", "sat", "on",  "the", "mat"]

# Model processes all 5 positions at once
# Causal mask ensures position 0 can't see positions 1-4
# Loss computed at all positions simultaneously
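
As a concrete sketch, the teacher-forcing loss can be computed like this (assuming a hypothetical model that returns (batch, seq_len, vocab_size) logits and applies the causal mask internally):

import torch.nn.functional as F

def teacher_forcing_loss(model, token_ids):
    """token_ids: (batch, seq_len) integer ids for the true sequence."""
    inputs  = token_ids[:, :-1]     # "The cat sat on the"
    targets = token_ids[:, 1:]      # "cat sat on the mat" (shifted by one)
    logits = model(inputs)          # (batch, seq_len-1, vocab_size)
    # Cross-entropy at every position, computed in parallel
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))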

During inference, we use free-running: the model generates one token at a time, feeding its own output back as input. Any mistake propagates — if the model generates a wrong token, all subsequent tokens are conditioned on that error.

[Figure: "TRAIN" processes The cat sat on the all at once; "INFER" produces the same tokens one at a time at steps t=1 through t=5.]

Fig 2 — Training processes all positions in parallel; inference generates one token at a time, feeding output back as input.

Why this design forces the KV cache

Here's the problem that leads directly to Phase 2. At inference step t, the model needs to compute attention over all t tokens. The naive approach: recompute Q, K, V for all previous tokens every step.

For a 1000-token sequence at step 1000, you'd compute 1000 Q vectors, 1000 K vectors, and 1000 V vectors — even though the first 999 haven't changed since the last step.

The KV cache solves this: cache the K and V vectors from all previous steps. At step t, only compute Q, K, V for the new token. Look up cached K and V for all previous tokens. This reduces per-step compute from O(t²) to O(t).
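
A minimal single-head sketch of one cached decoding step (illustrative only; real implementations cache per layer and per head, preallocate the buffers, and handle batching):

import torch
import torch.nn.functional as F

def decode_step(x_new, W_Q, W_K, W_V, K_cache, V_cache):
    """One decoding step. x_new: (1, d_model) embedding of the newest token."""
    q = x_new @ W_Q                              # only the new token needs a query
    k = x_new @ W_K
    v = x_new @ W_V

    K_cache = torch.cat([K_cache, k], dim=0)     # (t, d_k): grows by one row per step
    V_cache = torch.cat([V_cache, v], dim=0)     # (t, d_v)

    d_k = K_cache.shape[-1]
    scores = q @ K_cache.T / d_k ** 0.5          # (1, t): attend over all cached tokens
    weights = F.softmax(scores, dim=-1)          # no mask needed: the future doesn't exist yet
    return weights @ V_cache, K_cache, V_cache   # output (1, d_v) plus the updated caches

# Start with empty caches, e.g.
#   K_cache = torch.empty(0, W_K.shape[1]); V_cache = torch.empty(0, W_V.shape[1])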

The Catch

The KV cache saves compute but costs memory. With a standard multi-head attention layout at DeepSeek-V3's dimensions (61 layers, 128 attention heads, head dimension 128, 128K context), the KV cache alone is 2 × 61 × 128 × 128 × 128,000 × 2 bytes ≈ 500 GB for a single full-length sequence. This is the memory bottleneck that MLA (Phase 3) compresses by 93%.
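
The arithmetic, spelled out (dimensions as quoted above; 2 bytes per element assumes FP16/BF16 storage):

layers, heads, head_dim = 61, 128, 128
context_len, bytes_per_elem = 128_000, 2         # 128K tokens, FP16/BF16

kv_bytes = 2 * layers * heads * head_dim * context_len * bytes_per_elem   # factor 2 = K and V
print(kv_bytes / 1e9)                            # ≈ 511.7 GB for one full-length sequence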

Causal attention in DeepSeek

DeepSeek-V3 uses the same causal masking principle but with two key differences:

  1. MLA replaces standard KV caching. Instead of caching full K and V per head, DeepSeek caches a compressed latent vector. The causal structure is preserved — the mask still prevents attending to future tokens — but the memory cost is dramatically lower.
  2. MTP (Multi-Token Prediction) changes what happens at the output. Instead of predicting only the next token, DeepSeek-V3 also predicts the token after it during training. But the causal constraint still holds — each prediction can only use information from tokens before the position being predicted.

5 things to remember

  1. Causal mask: Upper-triangle set to −∞. Prevents future information leakage during training.
  2. softmax(−∞) = 0: Future tokens get exactly zero attention weight. Clean and exact.
  3. Training parallel, inference sequential: Teacher forcing (parallel) during training, free-running (sequential) during generation.
  4. KV cache: Cache previous steps' K and V to avoid recomputation. Saves compute, costs memory.
  5. Memory bottleneck: The KV cache grows linearly with sequence length. This is why MLA and GQA exist — and why Phase 2 matters.

Go deeper

  • Code: nanoGPT causal mask — model.py line 40
  • Book: Raschka — "Build a Large Language Model From Scratch" (Chapter 3) — Manning
  • Blog: The Illustrated GPT-2 (autoregressive section) — Jay Alammar
  • Paper: Language Models are Unsupervised Multitask Learners (GPT-2) — OpenAI
← Self-Attention From Scratch · Multi-Head Attention Internals →
© cvam — written in plaintext, served warm