DeepSeek Engineering Blog Series · Phase 1

LLM Foundations

Article 2 of 7 · Phase 1 of 10

May 9, 2026 · ml · 15 min read · 3200 words · beginner

How Tokens Flow Through an LLM


LLMs don't read words. They read numbers. Here's the journey from your sentence to those numbers and back. This is article 2 of 7 in Phase 1 of the DeepSeek Engineering Blog Series.

If You Read Nothing Else: Your text gets chopped into tokens (subword pieces), each mapped to an integer ID. That ID looks up a learned vector (the embedding). These vectors flow through attention and FFN layers, accumulating meaning. The final layer produces a probability distribution over the vocabulary, the model samples a token, and the loop repeats. That's the entire generation process.

Step 1: Tokenization


Fig 1 — Complete token flow from text input to generated output, with autoregressive loop.

Before any neural network computation happens, your text must be converted to a sequence of integers. This is tokenization — and it's more subtle than splitting on spaces.

Why not just use words?

If your vocabulary is every English word, you need hundreds of thousands of entries. Rare words like "defenestration" get their own slot. Misspellings break everything. And other languages? Forget it.

Modern LLMs use subword tokenization. The dominant algorithm is Byte-Pair Encoding (BPE) — originally a 1994 compression technique, adapted for NLP by Sennrich et al. in 2015 and used (with variations) by GPT-2, GPT-3, GPT-4, LLaMA, and DeepSeek.

How BPE works

BPE starts with individual characters (or bytes) as the initial vocabulary. It then iteratively finds the most frequent pair of adjacent tokens in the training corpus and merges them into a new token. This continues for a fixed number of merges (typically 32K–128K).

# BPE example
Starting:   ['l', 'o', 'w', 'e', 'r']
After merge: ['lo', 'w', 'e', 'r']    # 'l'+'o' was most frequent
After merge: ['low', 'e', 'r']         # 'lo'+'w' was most frequent
After merge: ['low', 'er']             # 'e'+'r' was most frequent
After merge: ['lower']                 # 'low'+'er' was most frequent

Fig 2 — BPE builds tokens bottom-up by merging the most frequent adjacent pairs.

The result: common words like "the" become single tokens. Rare words get split into recognizable pieces: "tokenization" → ["token", "ization"]. Even completely new words get handled by falling back to character-level pieces.
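
The merge loop described above can be written out in a few lines of Python. This is a toy sketch of BPE training on a made-up four-word corpus, not a production tokenizer (real implementations work on bytes, track merge ranks, and handle word boundaries):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a list of words (toy illustration)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["lower", "lower", "low", "newer"], num_merges=4)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('low', 'er')]
```

On this corpus the merges come out in the same order as the worked example above: l+o, lo+w, e+r, low+er.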

Vocabulary sizes

Model         Tokenizer           Vocab size
GPT-2         BPE                 50,257
GPT-4         BPE (cl100k)        100,256
LLaMA-2       SentencePiece BPE   32,000
LLaMA-3       tiktoken BPE        128,256
DeepSeek-V3   BPE                 129,280

DeepSeek-V3 uses a vocabulary of 129,280 tokens. Larger vocabularies mean common phrases compress into fewer tokens (saving compute), but the embedding table gets bigger.
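
A quick back-of-the-envelope calculation makes the "embedding table gets bigger" point concrete. Using DeepSeek-V3's vocabulary and the d_model of 5,120 that this article's running example assumes:

```python
# Rough cost of the embedding table alone: one d_model-sized row per vocab entry.
vocab_size = 129_280   # DeepSeek-V3 tokenizer
d_model = 5_120        # hidden size used in this article's running example

params = vocab_size * d_model
print(f"{params:,} embedding parameters")                            # 661,913,600
print(f"{params * 2 / 2**30:.2f} GiB at 2 bytes/param (FP16/BF16)")  # 1.23 GiB
```

Over half a billion parameters before any transformer layer has run — and a matching output projection of the same shape sits at the other end of the model.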

Live example

The sentence "DeepSeek-V3 has 671B parameters" might tokenize as:

["Deep", "Seek", "-V", "3", " has", " 6", "71", "B", " parameters"]
→ [14388, 42545, 12, 18, 706, 220, 6028, 33, 7384]

Each string piece maps to an integer ID. These IDs are the input to the neural network. You can experiment with this yourself using OpenAI's tokenizer visualizer or Hugging Face's tokenizers library.

Step 2: Token IDs → embedding vectors

Each token ID is used to look up a row in the embedding matrix. This matrix has shape [vocab_size × d_model]. For DeepSeek-V2, that's [102400 × 5120] — each of the 102,400 tokens has a learned vector of 5,120 dimensions.

Why 5,120 dimensions? It's a design choice balancing expressiveness and cost. GPT-3 uses 12,288 dimensions. More dimensions mean more expressive power, but the attention and FFN weight matrices grow roughly quadratically with d_model, so compute rises steeply. DeepSeek compensates with more total parameters via MoE, so it can afford a smaller hidden dimension.

Key Insight

The embedding lookup is not a neural network operation — it's a table lookup. Token ID 42545 → row 42545 of the matrix. No multiplication, no activation function. Just index and retrieve. This is why it's fast.
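
A minimal sketch of the lookup, using a toy 6-token vocabulary and 4-dimensional embeddings (the values are made up; real models do exactly this, just at vocab_size × d_model scale):

```python
# Toy embedding "matrix": vocab_size=6 rows, d_model=4 columns.
embedding = [[0.1 * i + 0.01 * j for j in range(4)] for i in range(6)]

token_ids = [3, 1, 3]                        # output of the tokenizer
vectors = [embedding[t] for t in token_ids]  # pure indexing, no math

print(len(vectors), len(vectors[0]))  # 3 4  -> shape [seq_len, d_model]
print(vectors[0] == vectors[2])       # True: same ID always retrieves the same row
```

Note the consequence: before any transformer layer runs, every occurrence of a token gets the identical vector. Context only enters the picture in the attention layers.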

After the embedding lookup, the model adds positional information to each token's vector. Without this, the model can't distinguish "dog bites man" from "man bites dog" — the same tokens in different order would produce identical embeddings. DeepSeek uses RoPE (Rotary Positional Encoding), which we'll cover in Phase 4.

Step 3: The forward pass

Now we have a matrix of shape [seq_len × d_model] — one vector per token. This matrix flows through a stack of identical transformer layers. DeepSeek-V3 has 61 layers.

Each layer does two things in sequence:

3a. Self-attention (with layer norm)

The input is first normalized (DeepSeek uses RMSNorm, a simpler variant of LayerNorm). Then attention lets each token look at every other token and create a weighted combination. The output is added back to the input via a residual connection — this "skip connection" is critical for training deep networks.

# Pseudocode for one transformer layer
x_norm = RMSNorm(x)
attn_out = SelfAttention(x_norm)
x = x + attn_out              # residual connection

x_norm = RMSNorm(x)
ffn_out = FeedForward(x_norm)  # or MoE in DeepSeek
x = x + ffn_out                # residual connection
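
The pseudocode can be made runnable with a minimal single-head numpy sketch. Assumptions to note: random placeholder weights, one attention head, and a plain ReLU feedforward standing in for DeepSeek's gated/MoE variant.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize by root-mean-square; unlike LayerNorm, no mean subtraction.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_layer(x, W):
    # 3a. attention sub-block
    h = rms_norm(x)
    q, k, v = h @ W["q"], h @ W["k"], h @ W["v"]
    mask = np.triu(np.full((len(x), len(x)), -np.inf), k=1)  # causal mask
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v
    x = x + attn @ W["o"]                          # residual connection
    # 3b. feedforward sub-block
    h = rms_norm(x)
    ffn = np.maximum(h @ W["up"], 0) @ W["down"]   # ReLU FFN for simplicity
    return x + ffn                                 # residual connection

rng = np.random.default_rng(0)
seq_len, d = 3, 8
W = {name: rng.normal(scale=0.1, size=(d, d)) for name in ["q", "k", "v", "o"]}
W["up"] = rng.normal(scale=0.1, size=(d, 4 * d))
W["down"] = rng.normal(scale=0.1, size=(4 * d, d))

x = rng.normal(size=(seq_len, d))
out = transformer_layer(x, W)
print(out.shape)  # (3, 8): same shape in, same shape out
```

The shape-preserving property is what makes stacking 61 identical layers possible: each layer reads and writes the same [seq_len × d_model] residual stream.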

3b. Feedforward (or MoE)

After attention, each token independently passes through a feedforward network. In DeepSeek-V3, this is a Mixture of Experts layer: a router picks 8 of 256 specialist FFNs, plus 1 shared FFN that always runs. The expert outputs are combined in a weighted sum and added back via another residual connection.

This attention → FFN pattern repeats 61 times. The residual stream — the running sum of all outputs — is the model's main information highway. Think of it as a river that each layer adds tributaries to.
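
A bare-bones sketch of the routing step, scaled down to 8 experts with top-2 routing for readability. Each "expert" here is a single matrix rather than a full FFN, and DeepSeek's real router adds bias terms and load-balancing machinery covered later in the series:

```python
import numpy as np

def moe_layer(x, expert_weights, shared_weight, router_weight, top_k=2):
    """Route one token vector to its top_k experts, plus one always-on shared expert."""
    scores = x @ router_weight                   # one affinity score per expert
    top = np.argsort(scores)[-top_k:]            # indices of the top_k experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # normalize over chosen
    out = shared_weight @ x                      # shared expert always runs
    for gate, idx in zip(gates, top):
        out = out + gate * (expert_weights[idx] @ x)  # weighted sum of expert outputs
    return out

rng = np.random.default_rng(0)
d, num_experts = 4, 8
experts = rng.normal(scale=0.1, size=(num_experts, d, d))  # toy "FFNs": one matrix each
shared = rng.normal(scale=0.1, size=(d, d))
router = rng.normal(size=(d, num_experts))

token = rng.normal(size=d)
result = moe_layer(token, experts, shared, router)
print(result.shape)  # (4,): same dimensionality as the input token vector
```

The key property: only top_k + 1 of the experts ever run for a given token, which is how MoE buys total parameter count without paying for it on every forward pass.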


Fig 3 — One transformer layer: RMSNorm → Attention → residual → RMSNorm → FFN → residual.

Step 4: Logits → softmax → sampling

After 61 layers and a final RMSNorm, we have a final vector for each position. To predict the next token, we take the vector at the last position (the most recent token) and multiply it by the transposed embedding matrix (or a separate output projection). This produces a logits vector of size vocab_size (129,280 for DeepSeek-V3).

logits = x[-1] @ W_output.T    # shape: [129280]
probs  = softmax(logits / τ)   # temperature-scaled probabilities
next_token = sample(probs)      # pick one token

Each logit is an unnormalized score for one vocabulary token. Softmax converts these to probabilities. The temperature parameter (τ) controls how "sharp" the distribution is — we covered this in the softmax temperature article.

Sampling strategies

  • Greedy (argmax): Always pick the highest-probability token. Deterministic but often repetitive.
  • Top-k: Only consider the top k tokens, re-normalize, then sample. GPT-2 default: k=40.
  • Top-p (nucleus): Keep the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9). More adaptive than top-k.
  • Temperature: Divide logits by τ before softmax. Low τ → greedy. High τ → uniform. Composes with top-k/top-p.
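
Here is a pure-Python sketch of these strategies over a toy 5-token vocabulary (illustrative only; production samplers operate on full logit tensors and handle edge cases like ties):

```python
import math
import random

def softmax(logits, temperature=1.0):
    z = [l / temperature for l in logits]
    m = max(z)                                # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def top_k_filter(probs, k):
    # Keep the k most likely tokens, zero out the rest, renormalize.
    cutoff = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability exceeds p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [0.0] * len(probs), 0.0
    for i in order:
        kept[i] = probs[i]
        cum += probs[i]
        if cum >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]

logits = [2.0, 1.0, 0.5, 0.1, -1.0]
probs = softmax(logits, temperature=0.8)

print(max(range(5), key=lambda i: probs[i]))  # greedy: argmax -> token 0
print(top_k_filter(probs, k=2))               # only tokens 0 and 1 survive
print(random.choices(range(5), weights=top_p_filter(probs, p=0.9))[0])
```

Note how the pieces compose: temperature reshapes the distribution first, then top-k or top-p prunes it, then a random draw picks the token.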

Step 5: The autoregressive loop

Here's the thing most diagrams don't make clear: generation is sequential. The model generates one token at a time, then feeds that token back in as input to generate the next one.

# The autoregressive generation loop
tokens = tokenize("Once upon a")       # [8238, 3714, 257]

for i in range(max_length):
    logits = model.forward(tokens)[-1]  # full forward pass; keep last position
    next_id = sample(softmax(logits))   # pick next token
    tokens.append(next_id)              # add to sequence
    if next_id == EOS_TOKEN:
        break

print(detokenize(tokens))
# "Once upon a time, in a kingdom far away..."

Each iteration requires a full forward pass through all 61 layers. For a 1000-token response, that's 1000 forward passes. This is why generation is slow — and why the KV cache exists. Instead of recomputing attention for all previous tokens, the model caches the Key and Value vectors from previous steps. This is the memory bottleneck that DeepSeek's MLA solves (Phase 3).
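
The caching idea can be sketched like this: keep K and V for all previous positions in a growing cache and compute attention only for the newest token. This is a toy single-head sketch with random placeholder weights; real caches are per-layer and per-head:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_q, W_k, W_v = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []   # grows by one entry per generated token

def attend_with_cache(new_token_vec):
    """Process only the newest token; reuse cached K/V for the whole prefix."""
    k_cache.append(new_token_vec @ W_k)   # compute K, V once, then keep them
    v_cache.append(new_token_vec @ W_v)
    q = new_token_vec @ W_q               # query only for the newest position
    K, V = np.stack(k_cache), np.stack(v_cache)
    w = np.exp(q @ K.T / np.sqrt(d))
    return (w / w.sum()) @ V              # attention output for the new token only

for step in range(3):
    out = attend_with_cache(rng.normal(size=d))
    print(step, len(k_cache), out.shape)  # cache length grows: 1, 2, 3
```

Causality comes for free: the cache only ever contains past positions, so no mask is needed at decode time. The cost is memory — the cache grows linearly with sequence length, which is exactly the bottleneck MLA compresses.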

Why This Matters

The autoregressive loop is why LLMs feel "slow" — they can't parallelize generation across tokens. Each new token depends on all previous tokens. Training is parallel (you have the full sequence), but inference is sequential. This fundamental asymmetry drives most of the optimization work in Phase 2–3.

Putting it all together

The complete pipeline:

"Hello world" 
  → tokenize       → [15496, 995]
  → embed           → [[0.12, -0.34, ...], [0.56, 0.78, ...]]  (2×5120)
  → + position      → RoPE-encoded embeddings
  → 61× { attention → residual → FFN/MoE → residual }
  → final norm      → [0.23, -0.11, ...]  (1×5120, last position)
  → output proj     → [2.1, -0.5, 0.8, ...]  (1×129280 logits)
  → softmax(τ)      → [0.0001, 0.0003, ..., 0.15, ...]  (probabilities)
  → sample          → 318  (token ID)
  → detokenize      → "!"

Common misconceptions

  • "LLMs understand language." They manipulate numerical vectors. Understanding is a philosophical question — but mechanistically, it's matrix multiplications all the way down.
  • "The model sees the whole text at once." During training, yes — the full sequence is processed in parallel (with causal masking). During inference, it generates one token at a time.
  • "Bigger vocabulary = smarter model." Not directly. Larger vocab means fewer tokens per sentence (more efficient), but the embedding table gets larger. It's a tradeoff. DeepSeek-V3's 129K vocab is on the larger side.

5 things to remember

  1. Tokenize: Text → subword pieces → integer IDs using BPE. DeepSeek-V3 vocab: 129,280 tokens.
  2. Embed: Each token ID → a learned vector of d_model dimensions. It's a table lookup, not a computation.
  3. Transform: 61 layers of attention + FFN/MoE, connected by residual streams and layer norms.
  4. Predict: Final vector → logits → softmax → sample. One token at a time.
  5. Loop: Generation is autoregressive — each token feeds back in. This is why KV caching matters (Phase 2).

Go deeper

← Previous: Introduction to DeepSeek · Next: Attention Mechanism Explained →
© cvam — written in plaintext, served warm