Stop using attention as a black box. Here's every matrix multiplication, explained. This is article 4 of 7 in Phase 1 of the DeepSeek Engineering Blog Series.
If You Read Nothing Else: Self-attention takes an input matrix X (one row per token), projects it into Q, K, V matrices via learned weights, computes pairwise similarity scores, normalizes them with softmax, and produces a weighted sum of values. This article walks through every step with actual numbers.
The input
Fig 1 — Self-attention computation flow with tensor shapes at each step.
We start with a matrix X of shape n × d, where n = number of tokens and d = embedding dimension. Each row is one token's embedding vector.
For this walkthrough, let's use a tiny example: 4 tokens, embedding dimension of 3.
# 4 tokens, each with a 3-dimensional embedding
X = [[1.0, 0.0, 1.0],  # token 0: "The"
     [0.0, 1.0, 0.0],  # token 1: "cat"
     [1.0, 1.0, 0.0],  # token 2: "sat"
     [0.0, 0.0, 1.0]]  # token 3: "down"
# Shape: (4, 3) → n=4, d=3
Step 1: Projection matrices
We need three learned weight matrices: W_Q, W_K, W_V. Each has shape d × d_k (strictly, W_V projects to d_v, but here and in most implementations d_v = d_k). In our example, let's use d_k = 2 (in real models, d_k = d/h where h is the number of attention heads — typically 64 or 128).
# Learned weight matrices (in practice, initialized randomly and trained)
W_Q = [[1, 0],
       [0, 1],
       [1, 0]]  # shape: (3, 2)
W_K = [[0, 1],
       [1, 0],
       [0, 1]]  # shape: (3, 2)
W_V = [[1, 1],
       [0, 1],
       [1, 0]]  # shape: (3, 2)
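Before moving on, a quick aside on the d_k = d/h relationship mentioned above. The numbers below (d = 4096, h = 32) are hypothetical, chosen only to show typical magnitudes:
# Hypothetical full-model sizes, for intuition only (not part of the toy example)
d = 4096      # model / embedding dimension
h = 32        # number of attention heads
d_k = d // h  # per-head query/key dimension
print(d_k)    # 128: each head works in a much smaller subspace than d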
Step 2: Computing Q, K, V
Multiply the input X by each weight matrix:
Q = X · W_Q
  = [[1·1+0·0+1·1, 1·0+0·1+1·0],    = [[2, 0],
     [0·1+1·0+0·1, 0·0+1·1+0·0],       [0, 1],
     [1·1+1·0+0·1, 1·0+1·1+0·0],       [1, 1],
     [0·1+0·0+1·1, 0·0+0·1+1·0]]       [1, 0]]
K = X · W_K
  = [[0+0+0, 1+0+1],    = [[0, 2],
     [0+1+0, 0+0+0],       [1, 0],
     [0+1+0, 1+0+0],       [1, 1],
     [0+0+0, 0+0+1]]       [0, 1]]
V = X · W_V
  = [[1+0+1, 1+0+0],    = [[2, 1],
     [0+0+0, 0+1+0],       [0, 1],
     [1+0+0, 1+1+0],       [1, 2],
     [0+0+1, 0+0+0]]       [1, 0]]
Now each token has its own Query, Key, and Value vectors of dimension d_k = 2.
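You can check the hand-computed Q, K, V with a few lines of NumPy (a minimal sketch reusing the toy matrices defined above):
import numpy as np

X   = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
W_Q = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
W_K = np.array([[0, 1], [1, 0], [0, 1]], dtype=float)
W_V = np.array([[1, 1], [0, 1], [1, 0]], dtype=float)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # same matrices as the hand calculation above
print(Q)  # rows: [2, 0], [0, 1], [1, 1], [1, 0]
print(K)  # rows: [0, 2], [1, 0], [1, 1], [0, 1]
print(V)  # rows: [2, 1], [0, 1], [1, 2], [1, 0]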
Step 3: Attention scores
Compute the dot product of every Query with every Key. This gives us an n × n matrix of raw scores:
scores = Q · K^T
K^T = [[0, 1, 1, 0],
       [2, 0, 1, 1]]
scores = [[2·0+0·2, 2·1+0·0, 2·1+0·1, 2·0+0·1],    = [[0, 2, 2, 0],
          [0·0+1·2, 0·1+1·0, 0·1+1·1, 0·0+1·1],       [2, 0, 1, 1],
          [1·0+1·2, 1·1+1·0, 1·1+1·1, 1·0+1·1],       [2, 1, 2, 1],
          [1·0+0·2, 1·1+0·0, 1·1+0·1, 1·0+0·1]]       [0, 1, 1, 0]]
Reading row 0: token "The" has the highest raw scores for "cat" (2) and "sat" (2). It doesn't attend much to itself (0) or "down" (0).
Fig 2 — Attention score heatmap. Darker = higher similarity. Row 0 shows "The" attending most to "cat" and "sat".
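Continuing the NumPy check from Step 2 (Q and K come from the earlier snippet), the full score matrix is one line:
scores = Q @ K.T   # (4, 4) raw similarity scores
# [[0. 2. 2. 0.]
#  [2. 0. 1. 1.]
#  [2. 1. 2. 1.]
#  [0. 1. 1. 0.]]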
Step 4: Scale
Divide by √d_k = √2 ≈ 1.414. Without this scaling, raw dot-product scores grow in magnitude with d_k, which would push the softmax in the next step toward near one-hot outputs with tiny gradients:
scaled = scores / √2
scaled = [[0.00, 1.41, 1.41, 0.00],
          [1.41, 0.00, 0.71, 0.71],
          [1.41, 0.71, 1.41, 0.71],
          [0.00, 0.71, 0.71, 0.00]]
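To see this effect empirically, here's a quick sketch using standard-normal vectors and arbitrary d_k values (not our toy example): raw dot-product variance grows like d_k, while the scaled variance stays near 1.
import numpy as np

rng = np.random.default_rng(0)
for d_k in [2, 64, 512]:
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=-1)
    # raw variance is roughly d_k; after dividing by sqrt(d_k) it is roughly 1
    print(d_k, dots.var().round(1), (dots / np.sqrt(d_k)).var().round(1))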
Step 5: Softmax
Apply softmax to each row independently. Each row becomes a probability distribution summing to 1:
weights = softmax(scaled, dim=-1)
weights ≈ [[0.10, 0.40, 0.40, 0.10],  # "The" attends equally to "cat" and "sat"
           [0.45, 0.11, 0.22, 0.22],  # "cat" attends most to... "The"?
           [0.33, 0.17, 0.33, 0.17],  # "sat" splits between "The" and itself
           [0.17, 0.33, 0.33, 0.17]]  # "down" attends to "cat" and "sat"
These weights are for our toy example with random-ish weight matrices. In a trained model, the weights would reflect meaningful linguistic relationships — pronouns attending to their referents, verbs attending to their subjects, etc.
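To see where row 0 comes from, here's a minimal sketch of the softmax for that single row (using the rounded scaled scores, so the result matches to two decimals):
import numpy as np

row0 = np.array([0.00, 1.41, 1.41, 0.00])  # scaled scores for token "The"
exp_row = np.exp(row0 - row0.max())        # subtract the max for numerical stability
print((exp_row / exp_row.sum()).round(2))  # [0.1 0.4 0.4 0.1]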
Step 6: Weighted sum of values
Multiply the attention weights by V to get the output:
output = weights · V
# For token 0 ("The"):
output[0] = 0.10·[2,1] + 0.40·[0,1] + 0.40·[1,2] + 0.10·[1,0]
          = [0.20,0.10] + [0,0.40] + [0.40,0.80] + [0.10,0]
          = [0.70, 1.30]
# Full output matrix:
output ≈ [[0.70, 1.30],
          [1.34, 1.00],
          [1.17, 1.17],
          [0.83, 1.17]]
Each row is now a context-aware vector. Token "The" (whose own value vector was [2, 1]) now blends in information from "cat" and "sat", the tokens it attended to most.
The matrix form
Everything above condenses into one formula from the original Transformer paper:
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
Where:
- Q = X · W_Q — shape: (n, d_k)
- K = X · W_K — shape: (n, d_k)
- V = X · W_V — shape: (n, d_v)
- Output shape: (n, d_v)
NumPy implementation
import numpy as np
def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention from scratch."""
    Q = X @ W_Q  # (n, d_k)
    K = X @ W_K  # (n, d_k)
    V = X @ W_V  # (n, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n)
    # softmax along last axis
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    output = weights @ V  # (n, d_v)
    return output, weights
# Test with our example
X = np.array([[1,0,1],[0,1,0],[1,1,0],[0,0,1]], dtype=float)
W_Q = np.array([[1,0],[0,1],[1,0]], dtype=float)
W_K = np.array([[0,1],[1,0],[0,1]], dtype=float)
W_V = np.array([[1,1],[0,1],[1,0]], dtype=float)
out, attn = self_attention(X, W_Q, W_K, W_V)
print("Output:\n", out.round(2))
print("Attention weights:\n", attn.round(2))
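Running this should reproduce, up to rounding, the attention weights from Step 5 and the output matrix from Step 6: weights ≈ [[0.10, 0.40, 0.40, 0.10], ...] and output ≈ [[0.70, 1.30], [1.34, 1.00], [1.17, 1.17], [0.83, 1.17]].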
PyTorch implementation
import torch
import torch.nn.functional as F
def self_attention_torch(X, W_Q, W_K, W_V):
    """Self-attention in PyTorch — identical logic."""
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    output = weights @ V
    return output, weights
# Usage
X = torch.tensor([[1,0,1],[0,1,0],[1,1,0],[0,0,1]], dtype=torch.float32)
W_Q = torch.tensor([[1,0],[0,1],[1,0]], dtype=torch.float32)
W_K = torch.tensor([[0,1],[1,0],[0,1]], dtype=torch.float32)
W_V = torch.tensor([[1,1],[0,1],[1,0]], dtype=torch.float32)
out, attn = self_attention_torch(X, W_Q, W_K, W_V)
print(out) # Same results as NumPy
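As an optional cross-check: PyTorch 2.x ships a fused implementation of this same computation, F.scaled_dot_product_attention. Assuming a recent PyTorch version (2.0 or later), the sketch below should match our manual result; the built-in expects batch and head dimensions, so we add two singleton dims.
# Cross-check against PyTorch's built-in fused attention, reusing the tensors above
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
ref = F.scaled_dot_product_attention(Q[None, None], K[None, None], V[None, None])
print(torch.allclose(ref[0, 0], out, atol=1e-6))  # expected: True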
What the attention weights mean
The attention weight matrix is an n × n matrix where weights[i][j] tells us how much token i attends to token j. In trained models, these matrices reveal learned linguistic structures:
- Positional heads: Attend to the previous or next token (learned bigram patterns).
- Coreference heads: Pronouns attend to their referent nouns.
- Syntactic heads: Verbs attend to their subjects and objects.
- Copy heads: Attend to tokens that should be repeated in the output.
Tools like BertViz let you visualize these attention patterns in real models.
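A minimal sketch of that workflow, assuming the transformers and bertviz packages are installed (it renders an interactive view, so run it in a Jupyter notebook; the model and sentence are arbitrary):
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat down", return_tensors="pt")
attentions = model(**inputs).attentions  # one (batch, heads, n, n) tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(attentions, tokens)  # interactive per-head attention visualization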
5 things to remember
- Projection: X → Q, K, V via three learned weight matrices. In a single head, these are the only learned parameters (full multi-head attention adds an output projection W_O).
- Scores: Q · K^T gives pairwise similarity. Shape: (n, n).
- Scale + Softmax: Divide by √d_k, then softmax → attention probability distribution.
- Output: weights · V blends value vectors by relevance. Each token becomes context-aware.
- Complexity: O(n² · d) — quadratic in sequence length. This is the bottleneck (see the quick sizing sketch below).
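To make the quadratic term concrete, here's a rough sizing sketch for the (n, n) score matrix alone, in float32, for a single head (sequence lengths are hypothetical):
# Memory for one (n, n) float32 attention-score matrix, per head
for n in [1_024, 8_192, 65_536]:
    print(f"n={n:>6}: {n * n * 4 / 1e9:.2f} GB")
# n=  1024: 0.00 GB
# n=  8192: 0.27 GB
# n= 65536: 17.18 GB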
Go deeper
- Video: Andrej Karpathy — "Let's Build GPT from Scratch" — YouTube
- Blog: Sebastian Raschka — Self-Attention from Scratch in PyTorch — sebastianraschka.com
- Code: LLMs-from-Scratch — GitHub (rasbt)
- Book: "Build a Large Language Model From Scratch" — Sebastian Raschka (Manning)