DeepSeek Engineering Blog Series · Phase 1

LLM Foundations

Article 3 of 7 · Phase 1 of 10

May 9, 2026 · ml · 14 min read · 2800 words · beginner

Attention Mechanism Explained.

ml deepseek transformers phase-1 beginner

The attention mechanism is the one idea that changed AI forever. Here's what it actually does — no linear algebra assumed. This is article 3 of 7 in Phase 1 of the DeepSeek Engineering Blog Series.

If You Read Nothing Else: Attention is a way for each word in a sentence to look at every other word and decide which ones are relevant. It transforms a sequence of independent token vectors into context-aware representations. The entire Transformer — and by extension every modern LLM — is built on this single mechanism.

The core problem


Fig 1 — The attention mechanism: Q and K determine relevance, V carries the information.

Consider the sentence: "The animal didn't cross the street because it was too tired."

What does "it" refer to? A human instantly knows: "it" = "the animal." But to a computer processing tokens sequentially, "it" is just another vector. Without some mechanism to connect "it" back to "animal," the model has no way to resolve this reference.

Before Transformers, models used recurrence (RNNs, LSTMs) — processing tokens one at a time, passing a hidden state forward. The problem: information from early tokens gets diluted over long sequences. By the time the model reaches "it" in a 500-word paragraph, the information about "animal" from the beginning has largely faded.

Attention solves this by letting every token directly access every other token, regardless of distance. Token 50 can directly "attend to" token 3. No information bottleneck. No fading memory.

The Query-Key-Value framework

The attention mechanism uses three concepts, borrowed from information retrieval:

  • Query (Q): "What am I looking for?" — what the current token needs to know.
  • Key (K): "What do I contain?" — what each token advertises about itself.
  • Value (V): "What information do I carry?" — the actual content to retrieve.

Think of it like a library search. Your Query is the search term you type. The Keys are the titles and descriptions of every book. The Values are the actual book contents. The search engine matches your Query against all Keys, ranks them by relevance, and returns a weighted blend of the Values.

In practice, each token's embedding vector gets multiplied by three separate learned weight matrices to produce Q, K, and V vectors:

Q = x · W_Q    # What am I looking for?
K = x · W_K    # What do I contain?
V = x · W_V    # What information do I carry?

Here, x is the input embedding (shape: d_model), and W_Q, W_K, W_V are learned matrices (shape: d_model × d_k). These projections are the only learnable parameters in single-head attention; multi-head attention (Article 1.6) adds an output projection on top.
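
As a concrete sketch, here is what those projections look like in NumPy. The sizes and random weights are purely illustrative; in a trained model the W matrices are learned:

import numpy as np

d_model, d_k, n = 512, 64, 10           # illustrative sizes, not from any specific model
x = np.random.randn(n, d_model)         # n token embeddings, one row per token
W_Q = np.random.randn(d_model, d_k)     # learned in a real model; random here
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = x @ W_Q    # (n, d_k): what each token is looking for
K = x @ W_K    # (n, d_k): what each token advertises
V = x @ W_V    # (n, d_k): the content each token carries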

Dot-product similarity

Once we have Q and K vectors, we need to measure how "relevant" each key is to our query. The simplest measure of similarity between two vectors: the dot product.

score(q, k) = q · k = Σ(q_i × k_i)

High dot product → vectors point in similar directions → high relevance. Low or negative dot product → different directions → low relevance.

For a sequence of n tokens, we compute the dot product of each query with every key, producing an n × n matrix of raw attention scores. In a five-token sequence, token 5's query gets compared against the keys of tokens 1 through 5, and the result tells us how much each token "matters" to token 5.
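
Continuing the NumPy sketch from above, the whole n × n score matrix is a single matrix multiplication:

scores = Q @ K.T    # (n, n): scores[i][j] = dot product of token i's query with token j's key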

The scaling factor

There's a subtle problem. If the dimension of the key vectors (d_k) is large, the dot products become large numbers. Large inputs to softmax push the function into regions where its gradient is nearly zero — making training unstable.

The fix from the original Transformer paper (Vaswani et al., 2017, Section 3.2.1): divide by √d_k.

score(Q, K) = (Q · Kᵀ) / √d_k

Why the square root? If each element of Q and K is drawn independently from a distribution with mean 0 and variance 1, then their dot product has variance d_k. Dividing by √d_k brings the variance back to 1, keeping softmax in a well-behaved range.

Without scaling, the authors found that attention weights become almost one-hot (all mass on one token) in higher dimensions, preventing the model from learning nuanced attention patterns.
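
A quick standalone NumPy experiment (illustrative only, not from the paper) makes the variance argument concrete:

import numpy as np

d_k = 64
q = np.random.randn(100_000, d_k)    # many unit-variance query vectors
k = np.random.randn(100_000, d_k)    # many unit-variance key vectors

raw = (q * k).sum(axis=1)            # unscaled dot products
scaled = raw / np.sqrt(d_k)

print(raw.var())      # ≈ 64, i.e. ≈ d_k
print(scaled.var())   # ≈ 1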

Softmax → attention weights

After scaling, we apply softmax to each row of the score matrix. This converts raw scores into probabilities that sum to 1:

attention_weights = softmax(scores)
# Each row sums to 1.0
# weights[i][j] = how much token i attends to token j

These weights form the attention weight matrix — an n × n heatmap showing which tokens each position "pays attention to." In practice, these heatmaps reveal fascinating patterns: attention heads that track syntax, coreference, positional relationships, and more.
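
Continuing the sketch, and applying the scaling from the previous section, a row-wise softmax looks like this (subtracting the row max is a standard numerical-stability trick, not something specific to attention):

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(scores / np.sqrt(d_k))      # (n, n): each row sums to 1.0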

Weighted sum of values

The final step: multiply each value vector by its attention weight and sum them up. For token i:

output[i] = Σ_j (weight[i][j] × V[j])

This produces a new vector for each token — a context-aware representation that blends information from all relevant tokens. Token "it" now carries information about "animal" because it attended strongly to it.
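
In the running sketch, this final step is one more matrix multiplication:

output = weights @ V    # (n, d_k): row i is the context-aware vector for token i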

The complete formula

Putting it all together, the scaled dot-product attention formula from the Attention Is All You Need paper (arXiv:1706.03762, Section 3.2.1):

Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

This single equation is the foundation of every Transformer model. GPT-4, LLaMA, Mistral, DeepSeek — they all use this exact formula (with variations in how Q, K, V are computed and how the cache is managed).
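
Here is the whole thing as one self-contained NumPy function, a didactic sketch of scaled dot-product attention that leaves out the masking, batching, and multi-head plumbing a production implementation adds:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns an (n, d_v) matrix of context-aware outputs."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (n, n) scaled relevance scores
    scores = scores - scores.max(axis=-1, keepdims=True)      # stabilize softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # rows sum to 1.0
    return weights @ V                                        # weighted blend of value vectors

Decoder-only LLMs add a causal mask on top of this, setting scores for future positions to −∞ before the softmax so each token can only attend to earlier tokens.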

The analogy that sticks

Attention is like writing a research paper. Your current sentence (Query) searches through your notes (Keys) to find the most relevant sources (Values) to pull forward. Heavily cited sources get weighted more. The result is a paragraph informed by all the best sources, not just the one that came immediately before.

What attention sees: a visualization

When processing "The animal didn't cross the street because it was too tired," a trained attention head might produce weights like:

Token "it" attending to:
  "The"      → 0.02
  "animal"   → 0.45  ← strong!
  "didn't"   → 0.01
  "cross"    → 0.03
  "the"      → 0.02
  "street"   → 0.08
  "because"  → 0.04
  "it"       → 0.15
  "was"      → 0.10
  "too"      → 0.05
  "tired"    → 0.05

Fig 2 — Attention weights show "it" strongly attending to "animal" (0.45) — coreference resolution.

The model assigns 45% of its attention to "animal" — it has learned to resolve the pronoun reference. Different attention heads learn different patterns. One head might focus on syntactic structure. Another on semantic similarity. Another on adjacent tokens. This is why Multi-Head Attention (Article 1.6) is so powerful — each head specializes.

Why attention changed everything

Before attention, the best sequence models were LSTMs and GRUs. They had three fundamental limitations:

  1. Sequential processing: Token n depends on token n-1. No parallelism during training.
  2. Information bottleneck: Everything must flow through a fixed-size hidden state.
  3. Long-range dependency: Information from 500 tokens ago has been compressed through 500 steps.

Attention solves all three. Every token directly accesses every other token (no bottleneck). The full computation is parallelizable during training. And long-range dependencies are handled with a single matrix multiplication, not 500 sequential steps.

The cost? Compute is O(n²·d) — quadratic in sequence length. For a 128K-token context, that's roughly 16 billion attention score computations per layer. This is the primary computational bottleneck in modern LLMs, and it's why innovations like FlashAttention-2 (arXiv:2307.08691) and sparse attention patterns exist.
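
A quick back-of-envelope check of that figure (taking "128K" as 128,000 tokens):

n = 128_000           # "128K" context length, as a round number
print(f"{n * n:,}")   # 16,384,000,000 ≈ 16 billion query/key score computations per layer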

5 things to remember

  1. Q, K, V: Three projections of each token. Query asks, Key advertises, Value delivers.
  2. Dot product: Measures relevance between Q and K. Higher = more similar.
  3. Scaling: Divide by √d_k to prevent softmax saturation. Critical for training stability.
  4. Weights: Softmax produces a probability distribution — how much each token matters to the current position.
  5. Output: Weighted sum of V vectors. Each token's output is a blend of all relevant tokens' information.

Go deeper

  • Paper: "Attention Is All You Need" (Section 3.2) — arXiv:1706.03762
  • Paper: "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau attention, the precursor) — arXiv:1409.0473
  • Video: 3Blue1Brown — Visualizing Attention — YouTube
  • Blog: The Illustrated Transformer — Jay Alammar
  • Blog: Lena Voita — Attention Heads Analysis — lena-voita.github.io
← Previous: How Tokens Flow Through an LLM · Next: Self-Attention From Scratch →