DeepSeek Engineering Blog Series · Phase 1

LLM Foundations

Article 1 of 7 · Phase 1 of 10

May 9, 2026 · ml · 18 min read · 3800 words · beginner

Introduction to DeepSeek Architecture


Before you understand why DeepSeek is different, you need to know what every Large Language Model shares. This is article 1 of 7 in Phase 1 of the DeepSeek Engineering Blog Series — a 50+ article deep dive from transformer internals to DeepSeek-V4.

If You Read Nothing Else: Every LLM — GPT-4, LLaMA, Mistral, DeepSeek — is built from the same basic architecture: the Transformer. DeepSeek's innovation is not inventing something new from scratch. It is redesigning specific components (attention, feedforward layers, training) to be dramatically more efficient. DeepSeek-V3 has 671 billion parameters but activates only 37 billion per token, and it cost about $5.58M to train versus tens to hundreds of millions of dollars for comparable models.

The bird's-eye view

Text goes in. Text comes out. What happens in between?

Every modern LLM follows the same core pipeline. Your input text gets broken into tokens (small pieces of words). Each token gets converted into a numerical vector (an embedding). These vectors flow through dozens of identical processing layers. Each layer has two main parts: an attention mechanism that lets tokens communicate with each other, and a feedforward network that processes each token independently. After all the layers, the model outputs a probability distribution over the entire vocabulary to predict the next token.

That's it. GPT-4, Claude, LLaMA-3, Mistral, and DeepSeek all follow this exact pattern. The differences are in the details — and those details matter enormously.
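
To make the pipeline concrete, here is a minimal PyTorch sketch of the skeleton every decoder-only LLM shares. All names and sizes are illustrative toys, not any real model's configuration, and the causal mask a real LLM applies is omitted for brevity:

    import torch
    import torch.nn as nn

    class Block(nn.Module):                               # one of the N identical layers
        def __init__(self, d, heads):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

        def forward(self, x):
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]   # tokens communicate
            return x + self.ffn(self.norm2(x))                  # per-token processing

    class TinyLM(nn.Module):
        def __init__(self, vocab=50257, d=256, heads=4, layers=4):
            super().__init__()
            self.embed = nn.Embedding(vocab, d)                 # token id -> vector
            self.blocks = nn.ModuleList(Block(d, heads) for _ in range(layers))
            self.head = nn.Linear(d, vocab)                     # vector -> next-token logits

        def forward(self, ids):                                 # ids: (batch, seq)
            x = self.embed(ids)
            for block in self.blocks:
                x = block(x)
            return self.head(x)

    logits = TinyLM()(torch.tensor([[15496, 995]]))             # "hello world" as token ids
    next_token_probs = logits[0, -1].softmax(dim=-1)            # distribution over the vocabulary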

[Diagram: Input "hello world" → Tokenize [15496, 995] → Embed → vectors → Attention + FFN × N layers → Output next token; annotations: RoPE (Ph. 4), MLA (Ph. 3), MoE (Ph. 5), MTP (Ph. 6)]

Fig 1 — Transformer pipeline. Orange annotations show where DeepSeek innovates.

Layers of a Transformer

The Transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need" (arXiv:1706.03762), has four major stages. Every LLM you've heard of is a variation of this design.

1. Embedding layer

The embedding layer converts each token ID (an integer) into a dense vector. In GPT-3, this vector has 12,288 dimensions. In DeepSeek-V2, it has 5,120 dimensions. These aren't arbitrary — larger dimensions give the model more expressive power but cost more compute.

Alongside token embeddings, most models add positional information so the model knows word order. The original Transformer used sinusoidal functions. Modern models like DeepSeek use Rotary Positional Encoding (RoPE), which encodes position by rotating vectors in complex space — we'll cover this in Phase 4.
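
As a small illustration, here is what the embedding lookup plus the original sinusoidal positional encoding looks like. The sizes are toys, and this is the Vaswani et al. scheme, not DeepSeek's RoPE:

    import torch
    import torch.nn as nn

    vocab_size, d_model, max_len = 50257, 512, 128        # toy sizes
    embed = nn.Embedding(vocab_size, d_model)             # a learned lookup table: token id -> vector

    # sinusoidal positional encoding from the original Transformer paper
    pos = torch.arange(max_len).unsqueeze(1)               # positions 0..max_len-1
    i = torch.arange(0, d_model, 2)                        # even dimension indices
    angles = pos / 10000 ** (i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2], pe[:, 1::2] = angles.sin(), angles.cos()

    tokens = torch.tensor([15496, 995])                    # "hello world"
    x = embed(tokens) + pe[: len(tokens)]                  # token meaning + position information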

2. Attention mechanism

This is where the magic happens. Attention lets each token look at every other token in the sequence and decide which ones are relevant. When the model processes the word "it" in "The animal didn't cross the street because it was too tired," attention allows "it" to focus on "animal" — understanding the reference.

The attention mechanism creates three vectors for each token: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). The model computes dot products between Queries and Keys to determine attention weights, then uses those weights to create a weighted sum of Values. The formula:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
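
Here is that formula written out in a few lines of PyTorch. The projection sizes are toy values, and for readability the causal mask a decoder-only LLM would apply before the softmax is left out:

    import torch
    import torch.nn as nn

    d_model, d_k = 512, 64                        # toy sizes
    W_q, W_k, W_v = (nn.Linear(d_model, d_k, bias=False) for _ in range(3))

    x = torch.randn(6, d_model)                   # six token representations
    Q, K, V = W_q(x), W_k(x), W_v(x)              # what am I looking for / what do I contain / what do I carry

    scores = Q @ K.T / d_k ** 0.5                 # dot products between Queries and Keys, scaled by sqrt(d_k)
    weights = scores.softmax(dim=-1)              # each row sums to 1: how strongly a token attends to the others
    out = weights @ V                             # weighted sum of Values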

We'll dissect this formula completely in Articles 1.3 and 1.4.

3. Feedforward network (FFN)

After attention, each token passes independently through a feedforward neural network — typically two linear layers with an activation function in between. This is where the model does its "thinking" on individual token representations. In a standard Transformer, the FFN is the same network applied to every token position independently.
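
A minimal version of such an FFN, with toy sizes (many modern models, DeepSeek included, use a gated SwiGLU variant rather than this plain two-layer form):

    import torch
    import torch.nn as nn

    d_model, d_ff = 512, 2048                     # toy sizes; d_ff is typically ~4x d_model
    ffn = nn.Sequential(
        nn.Linear(d_model, d_ff),                 # expand
        nn.GELU(),                                # nonlinearity
        nn.Linear(d_ff, d_model),                 # project back down
    )

    x = torch.randn(6, d_model)                   # six token representations
    out = ffn(x)                                  # the same network applied to every position independently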

The FFN is also where DeepSeek makes one of its biggest innovations: Mixture of Experts (MoE). Instead of one large FFN, DeepSeek-V3 uses 256 smaller expert FFNs and a router that selects only 8 of them per token. This means the model has 671 billion total parameters but only uses 37 billion per forward pass — getting the intelligence of a massive model at the compute cost of a much smaller one. We'll cover MoE in depth in Phase 5.

4. Output layer

The final layer projects the model's internal representation back to vocabulary size, producing a logit for every possible next token. Softmax converts these logits to probabilities. The model samples from this distribution (or takes the argmax) to generate the next token.
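
In code, the final step is just a projection, a softmax, and a choice of decoding strategy (the logits below are random purely for illustration):

    import torch

    logits = torch.randn(50257)                           # one score per vocabulary entry
    probs = torch.softmax(logits, dim=-1)                 # scores -> probability distribution

    greedy = probs.argmax()                               # deterministic: always the most likely token
    sampled = torch.multinomial(probs, num_samples=1)     # stochastic: sample according to the probabilities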

DeepSeek's specific modifications

DeepSeek didn't just build another Transformer. The team at DeepSeek (a Chinese AI lab, founded in 2023 as a subsidiary of High-Flyer Capital Management) made three architectural innovations that set their models apart from GPT, LLaMA, and Mistral.

Innovation 1: Multi-Head Latent Attention (MLA)

Standard Multi-Head Attention (MHA) caches a separate Key and Value vector for every token at every layer and every head. For a 70B model at 128K context, this KV cache alone can consume over 40GB of GPU memory.
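
The cache size is straightforward to estimate: layers × 2 (one Key plus one Value) × KV heads × head dimension × context length × bytes per value. The shapes below are illustrative for a 70B-class model, not any specific released checkpoint:

    def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per=2):   # bytes_per=2 for fp16/bf16
        return layers * 2 * kv_heads * head_dim * context * bytes_per / 1e9

    print(kv_cache_gb(80, 64, 128, 128_000))   # full MHA, 70B-class shape: ~336 GB
    print(kv_cache_gb(80,  8, 128, 128_000))   # GQA with 8 KV heads:       ~42 GB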

DeepSeek's MLA compresses all Key-Value information into a single low-rank latent vector per token. Instead of caching n_heads × head_dim values per token (5,120 values in DeepSeek-V2), MLA caches a compressed vector of only 512 dimensions — a 10× reduction. At inference time, the full Keys and Values are reconstructed from this latent vector on demand.
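
Schematically, MLA looks like the sketch below: only the small latent is stored, and the per-head Keys and Values are reconstructed from it when attention is computed. The sizes are illustrative, and DeepSeek-V2's real configuration also includes a decoupled RoPE component not shown here:

    import torch
    import torch.nn as nn

    d_model, n_heads, head_dim, d_latent = 5120, 40, 128, 512    # illustrative sizes

    W_down = nn.Linear(d_model, d_latent, bias=False)            # compress the hidden state into a latent
    W_up_k = nn.Linear(d_latent, n_heads * head_dim, bias=False) # reconstruct Keys on demand
    W_up_v = nn.Linear(d_latent, n_heads * head_dim, bias=False) # reconstruct Values on demand

    h = torch.randn(1, d_model)                    # hidden state of one new token
    c_kv = W_down(h)                               # only this 512-dim latent goes into the cache
    k = W_up_k(c_kv).view(n_heads, head_dim)       # full per-head Keys, recomputed at attention time
    v = W_up_v(c_kv).view(n_heads, head_dim)       # full per-head Values, recomputed at attention time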

Result: DeepSeek-V2 reduced KV cache memory by 93.3% compared to standard MHA, with negligible quality loss. This is covered in Phase 3.

Innovation 2: DeepSeekMoE

Most MoE models (like Mixtral) use a handful of large experts. DeepSeek uses many smaller, fine-grained experts. DeepSeek-V3 has 256 routed experts plus 1 shared expert that processes every token. The router selects the top 8 experts per token.

The shared expert handles common knowledge (grammar, basic reasoning) while routed experts specialize. DeepSeek-V3 also introduced auxiliary-loss-free load balancing — a technique that keeps experts evenly utilized without the quality-degrading auxiliary loss used by other MoE models. Details in Phase 5.
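
A toy version of this routing pattern, with far fewer and smaller experts than DeepSeek-V3 actually uses, and with a plain softmax top-k router rather than V3's sigmoid-gated, auxiliary-loss-free scheme:

    import torch
    import torch.nn as nn

    d, n_experts, top_k = 512, 16, 2                    # toy sizes; V3 uses 256 routed experts, top-8

    def make_expert():                                  # one small expert FFN
        return nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))

    shared_expert = make_expert()                       # processes every token
    routed_experts = nn.ModuleList(make_expert() for _ in range(n_experts))
    router = nn.Linear(d, n_experts)                    # scores each expert for each token

    x = torch.randn(4, d)                               # four token representations
    scores = router(x).softmax(dim=-1)                  # routing probabilities
    weights, idx = scores.topk(top_k, dim=-1)           # keep only the top-k experts per token

    routed_out = torch.stack([
        sum(w * routed_experts[e](x[t]) for w, e in zip(weights[t], idx[t]))
        for t in range(x.size(0))
    ])
    out = shared_expert(x) + routed_out                 # shared knowledge + specialist contributions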

Innovation 3: Multi-Token Prediction (MTP)

Standard LLMs predict only the next token. DeepSeek-V3 adds an extra prediction module during training, so each position predicts the next two tokens instead of one. This forces the model to "plan ahead" rather than being myopic. At inference, the MTP module's outputs serve as draft tokens for speculative decoding, yielding roughly a 1.8× generation speedup. Covered in Phase 6.
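
The idea in miniature (toy sizes; in DeepSeek-V3 the extra predictor is a small transformer module that shares the embedding and output layers with the main model, not a bare linear head as here):

    import torch
    import torch.nn as nn

    d, vocab = 512, 50000                        # toy sizes
    trunk_out = torch.randn(1, 8, d)             # hidden states from the shared transformer trunk

    head_next = nn.Linear(d, vocab)              # standard head: predicts token t+1
    head_ahead = nn.Linear(d, vocab)             # extra MTP head: predicts token t+2

    logits_next = head_next(trunk_out)           # trained with the usual next-token loss
    logits_ahead = head_ahead(trunk_out)         # trained with an additional loss one token further out

    # at inference the ahead prediction can be used as a draft token that the main model
    # then verifies (speculative decoding) -- the source of the ~1.8x generation speedup
    draft_token = logits_ahead[0, -1].argmax()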

What makes DeepSeek architecturally unique

Let's put this in context with a comparison table of major architectures:

Model              Total Params   Active Params   Attention   FFN
GPT-4 (est.)       ~1.8T          ~220B           MHA         MoE (est.)
LLaMA-3 70B        70B            70B             GQA         Dense
Mistral/Mixtral    46.7B          12.9B           GQA         MoE (8 experts)
DeepSeek-V2        236B           21B             MLA         MoE (160 experts)
DeepSeek-V3        671B           37B             MLA         MoE (256+1 experts)
[Bar chart: total params vs active params (billions) — GPT-4 1800B total / 220B active; LLaMA-3 70B dense; DeepSeek-V3 671B total / 37B active]

Fig 2 — DeepSeek-V3 has 671B total params but only activates 37B per token (5.5%).

The key insight: DeepSeek is the only major model family using MLA (Multi-Head Latent Attention). While Meta, Google, and Mistral moved from MHA to GQA (Grouped Query Attention), DeepSeek took a fundamentally different approach — compressing the entire KV cache into a learned latent space rather than simply reducing the number of KV heads.

The training story

DeepSeek-V3 was trained on 14.8 trillion tokens using 2,048 NVIDIA H800 GPUs. The total training cost was approximately $5.58 million — a fraction of what comparable models cost. For context, estimates place GPT-4's training cost at $63–100 million, and LLaMA-3 405B at approximately $30 million.
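
The headline number is easy to reproduce from the figures in the V3 technical report — roughly 2.79 million H800 GPU-hours priced at an assumed $2 per GPU-hour rental rate:

    gpu_hours = 2.788e6           # pre-training + context extension + post-training, per the V3 report
    usd_per_gpu_hour = 2.0        # assumed rental price used in the report
    print(f"${gpu_hours * usd_per_gpu_hour / 1e6:.2f}M")   # -> $5.58M

As the report itself notes, this figure covers the final training run's GPU time only, not prior research and ablation experiments.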

[Bar chart: estimated training cost ($M) — GPT-4 (est.) $63–100M; LLaMA-3 405B ~$30M; DeepSeek-V3 $5.58M (11–18× cheaper)]

Fig 3 — Training cost comparison. FP8 + MoE sparsity + DualPipe enabled DeepSeek's efficiency.

How did they make it so cheap? Three factors:

  • FP8 training: DeepSeek-V3 was the first model at this scale to use 8-bit floating-point precision for training, halving memory and doubling throughput compared to BF16 (Phase 7).
  • Sparse compute: Only 37B of 671B parameters activate per token. You pay compute costs proportional to active parameters, not total parameters.
  • DualPipe: A custom pipeline parallelism strategy that overlaps compute and communication, achieving near-zero pipeline bubbles (Phase 8).

DeepSeek-R1: the reasoning breakthrough

In January 2025, DeepSeek released R1, which is not a new architecture but DeepSeek-V3 trained with reinforcement learning for reasoning. The key discovery: when trained with pure RL (no supervised fine-tuning), the model spontaneously developed chain-of-thought reasoning, self-verification, and backtracking — behaviors no one explicitly taught it.

R1-Zero (the pure RL variant) went from 15.6% to 71.0% accuracy on AIME 2024 (a competition math benchmark) during training. The full R1 model, with a 4-stage training pipeline, matches or exceeds OpenAI's o1 on many reasoning benchmarks. We cover this in Phase 9.

The 10-phase roadmap

This article is the first stop on a journey through the entire DeepSeek architecture. Here's the map:

  1. Phase 1 — LLM Foundations (you are here): Transformer basics, attention, self-attention, multi-head attention
  2. Phase 2 — KV Cache & Efficient Attention: Why memory is the bottleneck; MQA, GQA explained
  3. Phase 3 — Multi-Head Latent Attention (MLA): DeepSeek's core innovation, from scratch
  4. Phase 4 — Positional Encoding: From integers to RoPE — how models understand word order
  5. Phase 5 — Mixture of Experts (MoE): Sparse activation, routing, DeepSeekMoE
  6. Phase 6 — Multi-Token Prediction: Predicting ahead, speculative decoding
  7. Phase 7 — Quantization: FP8 training, fine-grained quantization
  8. Phase 8 — V2/V3 System Design: Distributed training at 2048-GPU scale
  9. Phase 9 — R1 & Reasoning: GRPO, emergent chain-of-thought, test-time compute
  10. Phase 10 — Future Systems: V4, million-token context, agentic architectures

The analogy that sticks

An LLM is like an enormous translation office. The text to translate arrives broken into pieces (tokens); each floor of the building (a layer) refines the understanding; the final floor hands over the next word. DeepSeek's innovation is in redesigning specific floors to be faster and cheaper — some floors use specialist workers instead of generalists (MoE), some use compressed filing cabinets instead of full-size ones (MLA), and the whole building runs on more efficient power (FP8).

5 things to remember

  1. Same foundation: Every LLM is a Transformer — Embedding → Attention → FFN → Output.
  2. MLA: DeepSeek compresses the KV cache by 93% using low-rank latent vectors instead of reducing attention heads.
  3. MoE: 671B parameters, 37B active. 256 fine-grained experts + 1 shared expert per token.
  4. Cost: $5.58M training cost for a model competitive with GPT-4 — enabled by FP8, sparsity, and efficient parallelism.
  5. R1: Same architecture as V3, different training. Pure RL produced emergent reasoning without being taught how to reason.

Go deeper

How Tokens Flow Through an LLM →