Theory is nice. Running code is better. Let's implement Multi-Head Attention in PyTorch — no nn.MultiheadAttention, just raw matrix operations. This is the final article in Phase 1 of the DeepSeek Engineering Blog Series.
If You Read Nothing Else: This article implements the complete MHA module from scratch in ~60 lines of PyTorch. You'll understand every tensor reshape, every matrix multiplication, and every design choice. The code matches what's in Karpathy's nanoGPT and Raschka's LLMs-from-scratch — adapted for clarity.
The full implementation
Fig 1 — CausalSelfAttention forward pass: every tensor operation from input to output.
Here's the complete CausalSelfAttention class. We'll break down every line below.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention from scratch.

    Based on:
    - Karpathy nanoGPT: github.com/karpathy/nanoGPT/blob/master/model.py
    - Vaswani et al. "Attention Is All You Need" (arXiv:1706.03762)
    """

    def __init__(self, d_model, n_heads, max_seq_len=2048, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.d_model = d_model

        # Single projection for Q, K, V (more efficient than 3 separate)
        self.qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
        # Output projection
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

        # Dropout on attention weights
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)

        # Causal mask — registered as buffer (not a parameter)
        mask = torch.tril(torch.ones(max_seq_len, max_seq_len))
        self.register_buffer("mask", mask.view(1, 1, max_seq_len, max_seq_len))

    def forward(self, x):
        B, T, C = x.shape  # batch, sequence length, d_model

        # 1. Project to Q, K, V in one shot
        qkv = self.qkv_proj(x)          # (B, T, 3*C)
        Q, K, V = qkv.chunk(3, dim=-1)  # each (B, T, C)

        # 2. Reshape for multi-head: (B, T, C) → (B, n_heads, T, d_head)
        Q = Q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        K = K.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        V = V.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # 3. Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)

        # 4. Apply causal mask
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))

        # 5. Softmax + dropout
        weights = F.softmax(scores, dim=-1)  # (B, nh, T, T)
        weights = self.attn_dropout(weights)

        # 6. Weighted sum of values
        out = weights @ V  # (B, nh, T, d_head)

        # 7. Reshape back: (B, nh, T, d_head) → (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)

        # 8. Output projection + dropout
        out = self.resid_dropout(self.out_proj(out))
        return out
Line-by-line breakdown
The combined QKV projection
self.qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
Instead of three separate linear layers for Q, K, and V, we use one that produces all three at once. This is a standard optimization: one large matrix multiplication is faster than three small ones on GPUs (better memory locality, fewer kernel launches).
The weight matrix W_QKV has shape (d_model, 3 × d_model). We split the output into three equal parts using .chunk(3).
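To see that the fused layer is literally the three projections stacked along the output dimension, here is a small illustrative check (toy dimensions and variable names chosen for this example, not part of the class above):

# The fused weight has shape (3*d_model, d_model); slicing it recovers W_Q, W_K, W_V.
d_model = 8
fused = nn.Linear(d_model, 3 * d_model, bias=False)
x = torch.randn(2, 4, d_model)
q, k, v = fused(x).chunk(3, dim=-1)           # each (2, 4, d_model)
W_q, W_k, W_v = fused.weight.chunk(3, dim=0)  # each (d_model, d_model)
print(torch.allclose(q, x @ W_q.T))           # True: same result as a separate Q projection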
The reshape: (B, T, C) → (B, nh, T, d_head)
Q = Q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
This is the critical tensor manipulation. Let's trace through it:
Fig 2 — The view+transpose trick puts attention heads into the batch dimension for parallel GPU computation.
- Q.view(B, T, n_heads, d_head) — reshapes (B, T, C) into (B, T, n_heads, d_head). The last dimension C = n_heads × d_head, so we're just reinterpreting the memory layout.
- .transpose(1, 2) — swaps the T and n_heads dimensions: (B, n_heads, T, d_head). Now each head is its own "batch" — PyTorch's batched matmul handles all heads in parallel.
PyTorch's @ operator does batched matrix multiplication on the last two dimensions. By putting n_heads in the batch dimension, Q @ K.T computes attention for all heads simultaneously — one CUDA kernel call, not a for-loop over heads.
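Here is a quick shape trace with concrete, illustrative numbers, mirroring what happens inside forward():

# d_model = 512, n_heads = 8  →  d_head = 64 (illustrative sizes)
x = torch.randn(2, 10, 512)        # (B=2, T=10, C=512)
q = x.view(2, 10, 8, 64)           # (B, T, n_heads, d_head): reinterpret, no data moved
q = q.transpose(1, 2)              # (B, n_heads, T, d_head)
scores = q @ q.transpose(-2, -1)   # one batched matmul covers all 8 heads
print(scores.shape)                # torch.Size([2, 8, 10, 10])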
The causal mask
mask = torch.tril(torch.ones(max_seq_len, max_seq_len))
self.register_buffer("mask", mask.view(1, 1, max_seq_len, max_seq_len))
torch.tril creates a lower-triangular matrix of 1s. We register it as a buffer (not a parameter — it's not trained). The shape (1, 1, T, T) broadcasts across batch and head dimensions.
masked_fill(mask == 0, -inf) sets the upper triangle to −∞, implementing the causal constraint from Article 1.5.
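What the lower-triangular mask looks like for a short sequence (a minimal illustration):

T = 4
print(torch.tril(torch.ones(T, T)))
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
# Row i is query position i; the zeros mark future positions, which masked_fill
# turns into -inf so softmax assigns them zero weight.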
The reshape back
out = out.transpose(1, 2).contiguous().view(B, T, C)
Reverse the reshape: (B, nh, T, d_head) → (B, T, nh, d_head) → (B, T, C). The .contiguous() call ensures the tensor is stored in contiguous memory before .view() — required because transpose() doesn't move data, it just changes the stride metadata.
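If you skip .contiguous(), view() refuses to merge the head dimensions back together; a small illustration of the failure mode (shapes chosen for the example):

t = torch.randn(2, 8, 10, 64).transpose(1, 2)  # (2, 10, 8, 64), non-contiguous after transpose
try:
    t.view(2, 10, 512)
except RuntimeError as e:
    print("view failed:", e)                   # stride layout is incompatible with view
print(t.contiguous().view(2, 10, 512).shape)   # torch.Size([2, 10, 512])
# (t.reshape(2, 10, 512) also works; reshape copies automatically when needed.)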
Testing the implementation
# Verify output shapes
def test_shapes():
    d_model = 512
    n_heads = 8
    batch_size = 2
    seq_len = 64

    mha = CausalSelfAttention(d_model, n_heads)
    x = torch.randn(batch_size, seq_len, d_model)
    out = mha(x)

    assert out.shape == (batch_size, seq_len, d_model), \
        f"Expected {(batch_size, seq_len, d_model)}, got {out.shape}"
    print(f"✓ Output shape: {out.shape}")

test_shapes()
# ✓ Output shape: torch.Size([2, 64, 512])
Verify causality
def test_causality():
    """Ensure token i cannot attend to token j > i."""
    d_model, n_heads = 64, 4
    # dropout=0.0 and eval mode keep the two forward passes deterministic;
    # otherwise random dropout masks would change even the causal positions.
    mha = CausalSelfAttention(d_model, n_heads, dropout=0.0)
    mha.eval()

    # Create two inputs that differ only at position 3
    x1 = torch.randn(1, 5, d_model)
    x2 = x1.clone()
    x2[0, 3, :] = torch.randn(d_model)  # change token 3

    out1 = mha(x1)
    out2 = mha(x2)

    # Positions 0, 1, 2 should be identical (can't see position 3)
    for pos in range(3):
        diff = (out1[0, pos] - out2[0, pos]).abs().max().item()
        assert diff < 1e-6, f"Position {pos} changed! diff={diff}"
        print(f"✓ Position {pos}: diff = {diff:.2e} (causal)")

    # Position 3+ should differ
    diff_3 = (out1[0, 3] - out2[0, 3]).abs().max().item()
    assert diff_3 > 0.01, f"Position 3 should differ! diff={diff_3}"
    print(f"✓ Position 3: diff = {diff_3:.4f} (affected, as expected)")

test_causality()
# ✓ Position 0: diff = 0.00e+00 (causal)
# ✓ Position 1: diff = 0.00e+00 (causal)
# ✓ Position 2: diff = 0.00e+00 (causal)
# ✓ Position 3: diff = 0.2847 (affected, as expected)
Performance profiling
Let's count FLOPs and memory for realistic model sizes:
def profile_mha(d_model, n_heads, seq_len, batch_size=1):
    """Estimate FLOPs and memory for one MHA layer."""
    # QKV projection: 3 × (B × T × d × d) multiplies
    qkv_flops = 3 * batch_size * seq_len * d_model * d_model
    # Attention scores: B × nh × T × T × d_h
    attn_flops = batch_size * n_heads * seq_len * seq_len * (d_model // n_heads)
    # Attention × V: same as scores
    av_flops = attn_flops
    # Output projection: B × T × d × d
    out_flops = batch_size * seq_len * d_model * d_model
    total_flops = qkv_flops + attn_flops + av_flops + out_flops

    # KV cache per layer (inference, FP16): 2 tensors (K and V) × T × d_model × 2 bytes
    kv_cache_bytes = 2 * seq_len * n_heads * (d_model // n_heads) * 2

    return {
        'total_gflops': total_flops / 1e9,
        'kv_cache_mb': kv_cache_bytes / 1e6,
        'attn_fraction': (attn_flops + av_flops) / total_flops,
    }

# GPT-3 scale
stats = profile_mha(12288, 96, 4096)
print(f"GPT-3 MHA layer: {stats['total_gflops']:.1f} GFLOPs")
print(f"  KV cache: {stats['kv_cache_mb']:.1f} MB")
print(f"  Attention fraction: {stats['attn_fraction']:.1%}")

# DeepSeek-V2 scale (if it used standard MHA)
stats = profile_mha(5120, 128, 4096)
print(f"\nDeepSeek-V2 MHA layer: {stats['total_gflops']:.1f} GFLOPs")
print(f"  KV cache: {stats['kv_cache_mb']:.1f} MB")
Fig 3 — Linear projections dominate MHA compute (~80%), not the attention matrix itself.
Comparison with PyTorch built-in
def compare_with_pytorch():
    """Sanity-check our module against PyTorch's nn.MultiheadAttention.

    The two use different internal weight layouts, so without copying weights
    the outputs won't match numerically; this checks shapes and interface,
    not exact values.
    """
    d_model, n_heads, seq_len = 256, 4, 32
    torch.manual_seed(42)

    # Our implementation
    ours = CausalSelfAttention(d_model, n_heads, dropout=0.0)
    # PyTorch built-in (batch_first=True so both take (B, T, C) inputs)
    theirs = nn.MultiheadAttention(d_model, n_heads, dropout=0.0, bias=False,
                                   batch_first=True)

    x = torch.randn(1, seq_len, d_model)
    # Boolean causal mask for the built-in: True = "may not attend"
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                             diagonal=1)

    out_ours = ours(x)
    out_theirs, _ = theirs(x, x, x, attn_mask=causal_mask)

    print(f"Our output shape:     {out_ours.shape}")
    print(f"PyTorch output shape: {out_theirs.shape}")
    print(f"Our output range: [{out_ours.min():.3f}, {out_ours.max():.3f}]")

compare_with_pytorch()
What nanoGPT does differently
Karpathy's nanoGPT uses essentially the same structure with a few practical additions:
- Flash Attention: When available, nanoGPT uses PyTorch's F.scaled_dot_product_attention, which dispatches to FlashAttention-2 (arXiv:2307.08691) — a memory-efficient kernel that avoids materializing the full n×n attention matrix (see the sketch after this list).
- Bias option: nanoGPT supports bias in the QKV projection. Most modern models (LLaMA, DeepSeek) use bias=False.
- Compile-friendly: The code is structured to work with torch.compile() for kernel fusion.
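For reference, the fused op computes the same thing as steps 3–6 of our forward(). A minimal standalone check, assuming PyTorch 2.0+ (toy shapes chosen for the example):

B, nh, T, dh = 2, 4, 8, 16
Q = torch.randn(B, nh, T, dh)
K = torch.randn(B, nh, T, dh)
V = torch.randn(B, nh, T, dh)

# Manual path: our steps 3-6 with dropout off
scores = Q @ K.transpose(-2, -1) / math.sqrt(dh)
scores = scores.masked_fill(torch.tril(torch.ones(T, T)) == 0, float('-inf'))
manual = F.softmax(scores, dim=-1) @ V

# Fused path: scaling, causal mask, softmax, and the V matmul in one kernel
fused = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
print(torch.allclose(manual, fused, atol=1e-5))  # True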
From here to DeepSeek's MLA
The MHA implementation above is what every standard Transformer uses. DeepSeek's MLA modifies steps 1 and 6:
- Step 1 (projection): Instead of projecting to full K and V, MLA first projects to a compressed latent vector c_KV of dimension d_c (512 in DeepSeek-V2, vs 5120 for full MHA). During inference, only c_KV is cached.
- Step 6 (cache): When computing attention with cached tokens, K and V are reconstructed from c_KV via learned up-projection matrices. The attention computation itself is identical — the savings are entirely in memory.
We'll implement MLA from scratch in Phase 3, Article 3.2. The code structure is remarkably similar to what we built here — just with an extra compression/decompression step around K and V.
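To make the shape of that change concrete, here is a hypothetical sketch of the compression/decompression step only (illustrative names and sizes; the real DeepSeek-V2 module, which we build in Article 3.2, also handles rotary embeddings separately):

d_model, d_c = 5120, 512                       # DeepSeek-V2-like sizes (illustrative)
kv_down = nn.Linear(d_model, d_c, bias=False)  # compression: only its output is cached
k_up = nn.Linear(d_c, d_model, bias=False)     # reconstruct K from the latent
v_up = nn.Linear(d_c, d_model, bias=False)     # reconstruct V from the latent

x = torch.randn(1, 16, d_model)                # (B, T, d_model)
c_kv = kv_down(x)                              # (B, T, d_c): the KV cache stores this
K = k_up(c_kv)                                 # (B, T, d_model), rebuilt at attention time
V = v_up(c_kv)                                 # (B, T, d_model)
# From here, attention proceeds as in CausalSelfAttention.forward().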
Phase 1 complete
You've now covered the complete foundation of how LLMs work:
- 1.1: The big picture — Transformers, DeepSeek's three innovations
- 1.2: Tokenization → embedding → forward pass → sampling
- 1.3: The attention mechanism — Q, K, V, dot-product, softmax
- 1.4: Self-attention math — every matrix multiplication by hand
- 1.5: Causal masking — why LLMs can't see the future, and the KV cache
- 1.6: Multi-head attention — why multiple heads, what they learn
- 1.7: This article — the complete PyTorch implementation
Next up — Phase 2: KV Cache & Efficient Attention. Now that you understand standard MHA and its memory cost, we'll explore the bottleneck (KV cache), and trace the evolution through MQA → GQA → MLA. This is where DeepSeek's story gets interesting.
5 things to remember
- Combined QKV: One projection matrix for all three — more efficient on GPUs.
- Reshape trick: (B, T, C) → (B, nh, T, d_h) via view + transpose. Enables batched attention.
- Causal mask: Lower-triangular buffer registered once, broadcast across batch and heads.
- Contiguous: Call .contiguous() after transpose, before view — required for memory layout.
- FlashAttention: In production, use F.scaled_dot_product_attention — same math, 2-4× faster, less memory.
Go deeper
- Code: nanoGPT model.py — GitHub (karpathy)
- Code: LLMs-from-scratch — GitHub (rasbt)
- Docs: PyTorch scaled_dot_product_attention — pytorch.org
- Paper: FlashAttention-2 — arXiv:2307.08691
- Paper: DeepSeek-V2 (MLA implementation) — arXiv:2405.04434