LLMs don't read words. They read numbers. Here's the journey from your sentence to those numbers and back. This is article 2 of 7 in Phase 1 of the DeepSeek Engineering Blog Series.
If You Read Nothing Else: Your text gets chopped into tokens (subword pieces), each mapped to an integer ID. That ID looks up a learned vector (the embedding). These vectors flow through attention and FFN layers, accumulating meaning. The final layer produces a probability over the vocabulary, the model samples a token, and the loop repeats. That's the entire generation process.
Step 1: Tokenization
Fig 1 — Complete token flow from text input to generated output, with autoregressive loop.
Before any neural network computation happens, your text must be converted to a sequence of integers. This is tokenization — and it's more subtle than splitting on spaces.
Why not just use words?
If your vocabulary is every English word, you need hundreds of thousands of entries. Rare words like "defenestration" get their own slot. Misspellings break everything. And other languages? Forget it.
Modern LLMs use subword tokenization. The dominant algorithm is Byte-Pair Encoding (BPE), adapted for subword tokenization by Sennrich et al. in 2015 (it began life as a 1990s data-compression algorithm) and used, with variations, by GPT-2, GPT-3, GPT-4, LLaMA, and DeepSeek.
How BPE works
BPE starts with individual characters (or bytes) as the initial vocabulary. It then iteratively finds the most frequent pair of adjacent tokens in the training corpus and merges them into a new token. This continues for a fixed number of merges (typically 32K–128K).
# BPE example
Starting:    ['l', 'o', 'w', 'e', 'r']
After merge: ['lo', 'w', 'e', 'r']      # 'l'+'o' was most frequent
After merge: ['low', 'e', 'r']          # 'lo'+'w' was most frequent
After merge: ['low', 'er']              # 'e'+'r' was most frequent
After merge: ['lower']                  # 'low'+'er' was most frequent
Fig 2 — BPE builds tokens bottom-up by merging the most frequent adjacent pairs.
The result: common words like "the" become single tokens. Rare words get split into recognizable pieces: "tokenization" → ["token", "ization"]. Even completely new words get handled by falling back to character-level pieces.
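To make the merge loop concrete, here is a minimal toy sketch of BPE training in plain Python. It only does pair counting and merging; real tokenizers like Hugging Face's operate on bytes, record the merge table, and handle pre-tokenization, none of which is shown here.

# Toy BPE: repeatedly merge the most frequent adjacent pair
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent token pairs across every word in the corpus."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(word, pair):
    """Replace each occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

# Each word starts as a list of characters
corpus = [list("lower"), list("lowest"), list("low")]
for _ in range(4):                          # run four merges
    pair = most_frequent_pair(corpus)
    corpus = [merge_pair(w, pair) for w in corpus]
    print(pair, corpus)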
Vocabulary sizes
| Model | Tokenizer | Vocab Size |
|---|---|---|
| GPT-2 | BPE | 50,257 |
| GPT-4 | BPE (cl100k) | 100,256 |
| LLaMA-2 | SentencePiece BPE | 32,000 |
| LLaMA-3 | tiktoken BPE | 128,256 |
| DeepSeek-V3 | BPE | 129,280 |
DeepSeek-V3 uses a vocabulary of 129,280 tokens. Larger vocabularies mean common phrases compress into fewer tokens (saving compute), but the embedding table gets bigger.
Live example
The sentence "DeepSeek-V3 has 671B parameters" might tokenize as:
["Deep", "Seek", "-V", "3", " has", " 6", "71", "B", " parameters"] → [14388, 42545, 12, 18, 706, 220, 6028, 33, 7384]
Each string piece maps to an integer ID. These IDs are the input to the neural network. You can experiment with this yourself using OpenAI's tokenizer visualizer or Hugging Face's tokenizers library.
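If you would rather try it in code than in the browser, here is a quick sketch using the Hugging Face transformers library with the GPT-2 tokenizer as a stand-in; the exact pieces and IDs will differ from the DeepSeek example above.

# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # GPT-2's BPE tokenizer as a stand-in

text = "DeepSeek-V3 has 671B parameters"
ids = tok.encode(text)                        # text -> token IDs
pieces = tok.convert_ids_to_tokens(ids)       # IDs -> string pieces ('Ġ' marks a leading space)
print(pieces)
print(ids)
print(tok.decode(ids))                        # round-trips back to the original text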
Step 2: Token IDs → embedding vectors
Each token ID is used to look up a row in the embedding matrix. This matrix has shape [vocab_size × d_model]. For DeepSeek-V2, that's [102400 × 5120] — each of the 102,400 tokens has a learned vector of 5,120 dimensions. (DeepSeek-V3 scales this up to [129280 × 7168].)
Why 5,120 dimensions? It's a design choice balancing expressiveness and cost. GPT-3 uses 12,288 dimensions. More dimensions mean more expressive power, but the per-layer weight matrices (and the compute to apply them) grow roughly quadratically with d_model. DeepSeek compensates with more total parameters via MoE, so it can afford a smaller hidden dimension.
The embedding lookup is not a neural network operation — it's a table lookup. Token ID 42545 → row 42545 of the matrix. No multiplication, no activation function. Just index and retrieve. This is why it's fast.
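In code, that lookup is literally array indexing. A minimal NumPy sketch, with random weights standing in for the learned table, toy sizes, and hypothetical token IDs:

# Embedding lookup is indexing, not matrix multiplication
import numpy as np

vocab_size, d_model = 1000, 64                 # toy sizes; DeepSeek-V2's real table is [102400 x 5120]
embedding = np.random.randn(vocab_size, d_model).astype(np.float32)

token_ids = [421, 7, 903]                      # hypothetical IDs from a tokenizer
x = embedding[token_ids]                       # pure indexing: no matmul, no activation
print(x.shape)                                 # (3, 64) -> one row per token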
After the embedding lookup, the model adds positional information to each token's vector. Without this, the model can't distinguish "dog bites man" from "man bites dog" — the same tokens in different order would produce identical embeddings. DeepSeek uses RoPE (Rotary Position Embedding), which we'll cover in Phase 4.
Step 3: The forward pass
Now we have a matrix of shape [seq_len × d_model] — one vector per token. This matrix flows through a stack of identical transformer layers. DeepSeek-V3 has 61 layers.
Each layer does two things in sequence:
3a. Self-attention (with layer norm)
The input is first normalized (DeepSeek uses RMSNorm, a simpler variant of LayerNorm). Then attention lets each token look at every other token and create a weighted combination. The output is added back to the input via a residual connection — this "skip connection" is critical for training deep networks.
# Pseudocode for one transformer layer
x_norm   = RMSNorm(x)
attn_out = SelfAttention(x_norm)
x = x + attn_out                  # residual connection

x_norm  = RMSNorm(x)
ffn_out = FeedForward(x_norm)     # or MoE in DeepSeek
x = x + ffn_out                   # residual connection
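For reference, here is roughly what the SelfAttention call above does at its core: single-head scaled dot-product attention with a causal mask. This is a NumPy sketch with random projection weights; real implementations split the work across many heads and fuse these operations for speed.

# Single-head causal self-attention (toy sketch)
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v            # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # how strongly each token attends to each other token
    mask = np.triu(np.ones_like(scores), k=1)       # causal mask: no attending to future positions
    scores = np.where(mask == 1, -1e9, scores)
    return softmax(scores) @ V                      # weighted combination of value vectors

seq_len, d = 4, 8                                   # toy sizes
x = np.random.randn(seq_len, d)
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)       # (4, 8): one output vector per token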
3b. Feedforward (or MoE)
After attention, each token independently passes through a feedforward network. In DeepSeek-V3, this is a Mixture of Experts layer: a router picks 8 of 256 specialist FFNs, plus 1 shared FFN. The outputs are weighted-summed and added back via another residual connection.
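Here is a toy sketch of top-k expert routing to show the mechanics. The sizes are illustrative, and DeepSeek's real router adds a shared expert, bias terms, and load-balancing machinery that are skipped here.

# Toy top-k MoE routing for a single token
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_experts, top_k, d = 8, 2, 16                      # toy sizes (DeepSeek-V3: 256 routed experts, top-8, plus a shared expert)
experts = [np.random.randn(d, d) for _ in range(n_experts)]   # each matrix stands in for one expert FFN
router = np.random.randn(d, n_experts)

def moe_layer(x):                                   # x: one token's vector, shape (d,)
    gate = softmax(x @ router)                      # router scores over all experts
    chosen = np.argsort(gate)[-top_k:]              # indices of the top-k experts
    weights = gate[chosen] / gate[chosen].sum()     # renormalize over the chosen experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = np.random.randn(d)
print(moe_layer(x).shape)                           # (16,): same shape as the input token vector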
This attention → FFN pattern repeats 61 times. The residual stream — the running sum of all outputs — is the model's main information highway. Think of it as a river that each layer adds tributaries to.
Fig 3 — One transformer layer: RMSNorm → Attention → residual → RMSNorm → FFN → residual.
Step 4: Logits → softmax → sampling
After 61 layers, we have a final vector for each position. To predict the next token, we take the vector at the last position (the most recent token) and multiply it by the transposed embedding matrix (or a separate output projection). This produces a logits vector of size vocab_size (129,280 for DeepSeek-V3).
logits     = x[-1] @ W_output.T     # shape: [129280]
probs      = softmax(logits / τ)    # temperature-scaled probabilities
next_token = sample(probs)          # pick one token
Each logit is an unnormalized score for one vocabulary token. Softmax converts these to probabilities. The temperature parameter (τ) controls how "sharp" the distribution is — we covered this in the softmax temperature article.
Sampling strategies
- Greedy (argmax): Always pick the highest-probability token. Deterministic but often repetitive.
- Top-k: Only consider the top k tokens, re-normalize, then sample. GPT-2 default: k=40.
- Top-p (nucleus): Keep the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9). More adaptive than top-k.
- Temperature: Divide logits by τ before softmax. Low τ → greedy. High τ → uniform. Composes with top-k/top-p.
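These knobs compose, and it helps to see them in one place. Below is a simplified sampling function over a raw logits vector; production samplers add batching, repetition penalties, and careful handling of ties.

# Temperature + top-k + top-p sampling over a logits vector (simplified)
import numpy as np

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0):
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k > 0:                                   # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs < cutoff, 0.0, probs)

    if top_p < 1.0:                                 # nucleus: smallest set with cumulative prob >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        kept = np.zeros_like(probs)
        kept[keep] = probs[keep]
        probs = kept

    probs /= probs.sum()                            # renormalize, then sample one token ID
    return np.random.choice(len(probs), p=probs)

logits = [2.1, -0.5, 0.8, 0.1, 1.5]                 # toy 5-token vocabulary
print(sample_next(logits, temperature=0.7, top_k=3, top_p=0.9))

Greedy decoding is just the limiting case: take the argmax of the raw logits and skip sampling entirely.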
Step 5: The autoregressive loop
Here's the thing most diagrams don't make clear: generation is sequential. The model generates one token at a time, then feeds that token back in as input to generate the next one.
# The autoregressive generation loop
tokens = tokenize("Once upon a") # [8238, 3714, 257]
for i in range(max_length):
    logits = model.forward(tokens)[-1]   # full forward pass; keep the last position's logits
    next_id = sample(softmax(logits))    # pick the next token
    tokens.append(next_id)               # add it to the sequence
    if next_id == EOS_TOKEN:
        break
print(detokenize(tokens))
# "Once upon a time, in a kingdom far away..."
Each iteration requires a full forward pass through all 61 layers. For a 1000-token response, that's 1000 forward passes. This is why generation is slow — and why the KV cache exists. Instead of recomputing attention for all previous tokens, the model caches the Key and Value vectors from previous steps. This is the memory bottleneck that DeepSeek's MLA solves (Phase 3).
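To make the caching idea concrete: for each new token, attention only needs one new query against the stored keys and values, rather than rebuilding the full t × t score matrix. A toy single-head sketch (a real cache holds K/V for every layer and head, which is exactly the memory that MLA shrinks):

# Toy single-head KV cache: grow K/V by one row per generated token
import numpy as np

d = 8
K_cache = np.zeros((0, d))                      # keys of all previous tokens
V_cache = np.zeros((0, d))

def attend_with_cache(q_new, k_new, v_new):
    """Attention for the newest token only, reusing the cached K/V."""
    global K_cache, V_cache
    K_cache = np.vstack([K_cache, k_new])        # append this step's key and value
    V_cache = np.vstack([V_cache, v_new])
    scores = (q_new @ K_cache.T) / np.sqrt(d)    # shape 1 x t, not t x t: old scores are never recomputed
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache

for step in range(3):                            # simulate three generation steps
    q = k = v = np.random.randn(1, d)
    out = attend_with_cache(q, k, v)
    print(step, K_cache.shape, out.shape)        # cache grows: (1,8), (2,8), (3,8)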
The autoregressive loop is why LLMs feel "slow" — they can't parallelize generation across tokens. Each new token depends on all previous tokens. Training is parallel (you have the full sequence), but inference is sequential. This fundamental asymmetry drives most of the optimization work in Phase 2–3.
Putting it all together
The complete pipeline:
"Hello world"
→ tokenize → [15496, 995]
→ embed → [[0.12, -0.34, ...], [0.56, 0.78, ...]] (2×7168)
→ + position → RoPE-encoded embeddings
→ 61× { attention → residual → FFN/MoE → residual }
→ final norm → [0.23, -0.11, ...] (1×7168, last position)
→ output proj → [2.1, -0.5, 0.8, ...] (1×129280 logits)
→ softmax(τ) → [0.0001, 0.0003, ..., 0.15, ...] (probabilities)
→ sample → 318 (token ID)
→ detokenize → "!"
Common misconceptions
- "LLMs understand language." They manipulate numerical vectors. Understanding is a philosophical question — but mechanistically, it's matrix multiplications all the way down.
- "The model sees the whole text at once." During training, yes — the full sequence is processed in parallel (with causal masking). During inference, it generates one token at a time.
- "Bigger vocabulary = smarter model." Not directly. Larger vocab means fewer tokens per sentence (more efficient), but the embedding table gets larger. It's a tradeoff. DeepSeek-V3's 129K vocab is on the larger side.
5 things to remember
- Tokenize: Text → subword pieces → integer IDs using BPE. DeepSeek-V3 vocab: 129,280 tokens.
- Embed: Each token ID → a learned vector of d_model dimensions. It's a table lookup, not a computation.
- Transform: 61 layers of attention + FFN/MoE, connected by residual streams and layer norms.
- Predict: Final vector → logits → softmax → sample. One token at a time.
- Loop: Generation is autoregressive — each token feeds back in. This is why KV caching matters (Phase 2).
Go deeper
- Video: Andrej Karpathy — "Let's Build the GPT Tokenizer" — YouTube
- Blog: The Illustrated GPT-2 — Jay Alammar
- Tool: OpenAI Tokenizer Visualizer — platform.openai.com/tokenizer
- Code: Hugging Face tokenizers library — GitHub
- Paper: Neural Machine Translation of Rare Words with Subword Units (BPE) — arXiv:1508.07909