Stop using attention as a black box. Here's every matrix multiplication, explained. This is article 4 of 7 in Phase 1 of the DeepSeek Engineering Blog Series.
If You Read Nothing Else: Self-attention takes an input matrix X (one row per token), projects it into Q, K, V matrices via learned weights, computes pairwise similarity scores, normalizes them with softmax, and produces a weighted sum of values. This article walks through every step with actual numbers.
The input
Fig 1 — Self-attention computation flow with tensor shapes at each step.
We start with a matrix X of shape n × d, where n = number of tokens and d = embedding dimension. Each row is one token's embedding vector.
For this walkthrough, let's use a tiny example: 4 tokens, embedding dimension of 3.
# 4 tokens, each with a 3-dimensional embedding
X = [[1.0, 0.0, 1.0],  # token 0: "The"
     [0.0, 1.0, 0.0],  # token 1: "cat"
     [1.0, 1.0, 0.0],  # token 2: "sat"
     [0.0, 0.0, 1.0]]  # token 3: "down"
# Shape: (4, 3) → n=4, d=3
Step 1: Projection matrices
We need three learned weight matrices: W_Q, W_K, W_V. Each has shape d × d_k (strictly, W_V projects to d_v, but here and in most implementations d_v = d_k). In our example, let's use d_k = 2 (in real models, d_k = d/h where h is the number of attention heads — typically 64 or 128).
# Learned weight matrices (in practice, initialized randomly and trained)
W_Q = [[1, 0],
       [0, 1],
       [1, 0]]  # shape: (3, 2)
W_K = [[0, 1],
       [1, 0],
       [0, 1]]  # shape: (3, 2)
W_V = [[1, 1],
       [0, 1],
       [1, 0]]  # shape: (3, 2)
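Before moving on, a quick aside on the d_k = d/h relationship mentioned above. The numbers below (d = 4096, h = 32) are hypothetical, chosen only to show typical magnitudes:
# Hypothetical full-model sizes, for intuition only (not part of the toy example)
d = 4096      # model / embedding dimension
h = 32        # number of attention heads
d_k = d // h  # per-head query/key dimension
print(d_k)    # 128: each head works in a much smaller subspace than d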
Step 2: Computing Q, K, V
Multiply the input X by each weight matrix:
Q = X · W_Q
  = [[1·1+0·0+1·1, 1·0+0·1+1·0],    = [[2, 0],
     [0·1+1·0+0·1, 0·0+1·1+0·0],       [0, 1],
     [1·1+1·0+0·1, 1·0+1·1+0·0],       [1, 1],
     [0·1+0·0+1·1, 0·0+0·1+1·0]]       [1, 0]]
K = X · W_K
  = [[0+0+0, 1+0+1],    = [[0, 2],
     [0+1+0, 0+0+0],       [1, 0],
     [0+1+0, 1+0+0],       [1, 1],
     [0+0+0, 0+0+1]]       [0, 1]]
V = X · W_V
  = [[1+0+1, 1+0+0],    = [[2, 1],
     [0+0+0, 0+1+0],       [0, 1],
     [1+0+0, 1+1+0],       [1, 2],
     [0+0+1, 0+0+0]]       [1, 0]]
Now each token has its own Query, Key, and Value vectors of dimension d_k = 2.
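You can check the hand-computed Q, K, V with a few lines of NumPy (a minimal sketch reusing the toy matrices defined above):
import numpy as np

X   = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
W_Q = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
W_K = np.array([[0, 1], [1, 0], [0, 1]], dtype=float)
W_V = np.array([[1, 1], [0, 1], [1, 0]], dtype=float)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # same matrices as the hand calculation above
print(Q)  # rows: [2, 0], [0, 1], [1, 1], [1, 0]
print(K)  # rows: [0, 2], [1, 0], [1, 1], [0, 1]
print(V)  # rows: [2, 1], [0, 1], [1, 2], [1, 0]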
Step 3: Attention scores
Compute the dot product of every Query with every Key. This gives us an n × n matrix of raw scores:
scores = Q · K^T
K^T = [[0, 1, 1, 0],
       [2, 0, 1, 1]]
scores = [[2·0+0·2, 2·1+0·0, 2·1+0·1, 2·0+0·1],    = [[0, 2, 2, 0],
          [0·0+1·2, 0·1+1·0, 0·1+1·1, 0·0+1·1],       [2, 0, 1, 1],
          [1·0+1·2, 1·1+1·0, 1·1+1·1, 1·0+1·1],       [2, 1, 2, 1],
          [1·0+0·2, 1·1+0·0, 1·1+0·1, 1·0+0·1]]       [0, 1, 1, 0]]
Reading row 0: token "The" has the highest raw scores for "cat" (2) and "sat" (2). It doesn't attend much to itself (0) or "down" (0).
Fig 2 — Attention score heatmap. Darker = higher similarity. Row 0 shows "The" attending most to "cat" and "sat".
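Continuing the NumPy check from Step 2 (Q and K come from the earlier snippet), the full score matrix is one line:
scores = Q @ K.T   # (4, 4) raw similarity scores
# [[0. 2. 2. 0.]
#  [2. 0. 1. 1.]
#  [2. 1. 2. 1.]
#  [0. 1. 1. 0.]]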
Step 4: Scale
Divide by √d_k = √2 ≈ 1.414. Without this scaling, raw dot-product scores grow in magnitude with d_k, which would push the softmax in the next step toward near one-hot outputs with tiny gradients:
scaled = scores / √2
scaled = [[0.00, 1.41, 1.41, 0.00],
          [1.41, 0.00, 0.71, 0.71],
          [1.41, 0.71, 1.41, 0.71],
          [0.00, 0.71, 0.71, 0.00]]
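To see this effect empirically, here's a quick sketch using standard-normal vectors and arbitrary d_k values (not our toy example): raw dot-product variance grows like d_k, while the scaled variance stays near 1.
import numpy as np

rng = np.random.default_rng(0)
for d_k in [2, 64, 512]:
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=-1)
    # raw variance is roughly d_k; after dividing by sqrt(d_k) it is roughly 1
    print(d_k, dots.var().round(1), (dots / np.sqrt(d_k)).var().round(1))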
Step 5: Softmax
Apply softmax to each row independently. Each row becomes a probability distribution summing to 1:
weights = softmax(scaled, dim=-1)
weights ≈ [[0.10, 0.40, 0.40, 0.10],  # "The" attends equally to "cat" and "sat"
           [0.45, 0.11, 0.22, 0.22],  # "cat" attends most to... "The"?
           [0.33, 0.17, 0.33, 0.17],  # "sat" splits between "The" and itself
           [0.17, 0.33, 0.33, 0.17]]  # "down" attends to "cat" and "sat"
These weights are for our toy example with random-ish weight matrices. In a trained model, the weights would reflect meaningful linguistic relationships — pronouns attending to their referents, verbs attending to their subjects, etc.
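To see where row 0 comes from, here's a minimal sketch of the softmax for that single row (using the rounded scaled scores, so the result matches to two decimals):
import numpy as np

row0 = np.array([0.00, 1.41, 1.41, 0.00])  # scaled scores for token "The"
exp_row = np.exp(row0 - row0.max())        # subtract the max for numerical stability
print((exp_row / exp_row.sum()).round(2))  # [0.1 0.4 0.4 0.1]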
Step 6: Weighted sum of values
Multiply the attention weights by V to get the output:
output = weights · V
# For token 0 ("The"):
output[0] = 0.10·[2,1] + 0.40·[0,1] + 0.40·[1,2] + 0.10·[1,0]
          = [0.20,0.10] + [0,0.40] + [0.40,0.80] + [0.10,0]
          = [0.70, 1.30]
# Full output matrix:
output ≈ [[0.70, 1.30],
          [1.34, 1.00],
          [1.17, 1.17],
          [0.83, 1.17]]
Each row is now a context-aware vector. Token "The" (whose own value vector was [2, 1]) now blends in information from "cat" and "sat", the tokens it attended to most.
The matrix form
Everything above condenses into one formula from the original Transformer paper:
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
Where:
- Q = X · W_Q — shape: (n, d_k)
- K = X · W_K — shape: (n, d_k)
- V = X · W_V — shape: (n, d_v)
- Output shape: (n, d_v)
NumPy implementation
import numpy as np
def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention from scratch."""
    Q = X @ W_Q  # (n, d_k)
    K = X @ W_K  # (n, d_k)
    V = X @ W_V  # (n, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n)
    # softmax along last axis
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    output = weights @ V  # (n, d_v)
    return output, weights
# Test with our example
X = np.array([[1,0,1],[0,1,0],[1,1,0],[0,0,1]], dtype=float)
W_Q = np.array([[1,0],[0,1],[1,0]], dtype=float)
W_K = np.array([[0,1],[1,0],[0,1]], dtype=float)
W_V = np.array([[1,1],[0,1],[1,0]], dtype=float)
out, attn = self_attention(X, W_Q, W_K, W_V)
print("Output:\n", out.round(2))
print("Attention weights:\n", attn.round(2))
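Running this should reproduce, up to rounding, the attention weights from Step 5 and the output matrix from Step 6: weights ≈ [[0.10, 0.40, 0.40, 0.10], ...] and output ≈ [[0.70, 1.30], [1.34, 1.00], [1.17, 1.17], [0.83, 1.17]].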
PyTorch implementation
import torch
import torch.nn.functional as F
def self_attention_torch(X, W_Q, W_K, W_V):
    """Self-attention in PyTorch — identical logic."""
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    output = weights @ V
    return output, weights
# Usage
X = torch.tensor([[1,0,1],[0,1,0],[1,1,0],[0,0,1]], dtype=torch.float32)
W_Q = torch.tensor([[1,0],[0,1],[1,0]], dtype=torch.float32)
W_K = torch.tensor([[0,1],[1,0],[0,1]], dtype=torch.float32)
W_V = torch.tensor([[1,1],[0,1],[1,0]], dtype=torch.float32)
out, attn = self_attention_torch(X, W_Q, W_K, W_V)
print(out) # Same results as NumPy
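As an optional cross-check: PyTorch 2.x ships a fused implementation of this same computation, F.scaled_dot_product_attention. Assuming a recent PyTorch version (2.0 or later), the sketch below should match our manual result; the built-in expects batch and head dimensions, so we add two singleton dims.
# Cross-check against PyTorch's built-in fused attention, reusing the tensors above
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
ref = F.scaled_dot_product_attention(Q[None, None], K[None, None], V[None, None])
print(torch.allclose(ref[0, 0], out, atol=1e-6))  # expected: True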
What the attention weights mean
The attention weight matrix is an n × n matrix where weights[i][j] tells us how much token i attends to token j. In trained models, these matrices reveal learned linguistic structures:
- Positional heads: Attend to the previous or next token (learned bigram patterns).
- Coreference heads: Pronouns attend to their referent nouns.
- Syntactic heads: Verbs attend to their subjects and objects.
- Copy heads: Attend to tokens that should be repeated in the output.
Tools like BertViz let you visualize these attention patterns in real models.
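A minimal sketch of that workflow, assuming the transformers and bertviz packages are installed (it renders an interactive view, so run it in a Jupyter notebook; the model and sentence are arbitrary):
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat down", return_tensors="pt")
attentions = model(**inputs).attentions  # one (batch, heads, n, n) tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(attentions, tokens)  # interactive per-head attention visualization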
5 things to remember
- Projection: X → Q, K, V via three learned weight matrices. In a single head, these are the only learned parameters (full multi-head attention adds an output projection W_O).
- Scores: Q · K^T gives pairwise similarity. Shape: (n, n).
- Scale + Softmax: Divide by √d_k, then softmax → attention probability distribution.
- Output: weights · V blends value vectors by relevance. Each token becomes context-aware.
- Complexity: O(n² · d) — quadratic in sequence length. This is the bottleneck (see the quick sizing sketch below).
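To make the quadratic term concrete, here's a rough sizing sketch for the (n, n) score matrix alone, in float32, for a single head (sequence lengths are hypothetical):
# Memory for one (n, n) float32 attention-score matrix, per head
for n in [1_024, 8_192, 65_536]:
    print(f"n={n:>6}: {n * n * 4 / 1e9:.2f} GB")
# n=  1024: 0.00 GB
# n=  8192: 0.27 GB
# n= 65536: 17.18 GB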
Go deeper
- Video: Andrej Karpathy — "Let's Build GPT from Scratch" — YouTube
- Blog: Sebastian Raschka — Self-Attention from Scratch in PyTorch — sebastianraschka.com
- Code: LLMs-from-Scratch — GitHub (rasbt)
- Book: "Build a Large Language Model From Scratch" — Sebastian Raschka (Manning)