Introduction to DeepSeek Architecture
Before you understand why DeepSeek is different, you need to know what every LLM shares. MLA, MoE, MTP — the three innovations that matter.
Each series is a connected sequence of articles that builds from foundations to mastery. Pick one and start reading.
Four papers. Four years. From IO-aware tiling to Blackwell asymmetric scaling.
From Transformer internals to DeepSeek-V4. A complete technical deep-dive into every architectural innovation across 10 planned phases — MLA, MoE, MTP, FP8 training, GRPO, and more. Grounded in the original papers and code.
Build the mental model of how a Large Language Model works — from raw text to generated tokens. No assumed ML background.
LLMs don't read words. They read numbers. The full journey from your sentence to token IDs to embeddings to predictions and back.
The attention mechanism is the one idea that changed AI forever. Query, Key, Value — explained without assuming you know linear algebra.
Stop using attention as a black box. Every matrix multiplication explained — Q, K, V projections, scaled dot-product, and a worked numeric example.
Why can't an LLM see into the future? The causal mask is why — and it's elegantly simple. Plus: the full autoregressive loop explained.
One attention head learns one pattern. GPT-3 has 96 of them per layer. Here's why multiple heads matter and what each one actually learns.
Theory is nice. Running code is better. Full PyTorch MHA from scratch — no nn.MultiheadAttention, just raw matrix ops.
The #1 memory bottleneck in LLM inference (the KV cache), and the evolution of attention variants: MQA → GQA → MLA. A minimal sketch of the core attention step follows this list.
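For readers who want a taste of what these attention articles build toward, here is a minimal sketch of a single causal self-attention head in PyTorch. It is illustrative only: toy sizes, random weights standing in for trained ones, no multi-head split and no KV cache, and it is not the reference code from the series itself.

```python
# Toy sketch of one causal self-attention head (illustrative, not the series code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model = 4, 8                      # 4 tokens, 8-dim embeddings (toy sizes)
x = torch.randn(seq_len, d_model)            # token embeddings

# Q, K, V projections; random weights stand in for trained parameters here
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product scores: how strongly each token attends to every other token
scores = Q @ K.T / (d_model ** 0.5)

# Causal mask: position i may only attend to positions 0..i (no looking ahead)
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)          # each row sums to 1
out = weights @ V                            # each output row is a weighted mix of values
print(out.shape)                             # torch.Size([4, 8])
```

In practice this computation runs in parallel across many heads, and the K and V tensors are cached between decoding steps. That is exactly where the multi-head and KV-cache articles pick up.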
4 articles · coming soon
The centrepiece technical innovation of DeepSeek-V2/V3 — Multi-Head Latent Attention. 93% KV cache reduction without quality loss.
5 articles · coming soon
How Transformers understand word order — from integer position to RoPE. The modern standard used by DeepSeek, LLaMA, and Mistral.
5 articles · coming soon
DeepSeek's second pillar — fine-grained expert segmentation, shared experts, and auxiliary-loss-free load balancing. A toy routing sketch appears after this list.
7 articles · coming soon
Predicting multiple future tokens simultaneously — improving training efficiency and enabling speculative decoding.
6 articles · coming soon
DeepSeek-V3 was the first model at the 671B-parameter scale to be trained in FP8. Fine-grained tile-based quantization explained.
6 articles · coming soon
Training 671B parameters on 2048 GPUs — parallelism strategies, communication topology, and the $5.58M budget.
8 articles · coming soon
The most impactful release — RL for reasoning, GRPO algorithm, emergent chain-of-thought, and test-time compute scaling.
6 articles · coming soon
V4, million-token context, agentic architectures, and the future of open-source frontier models.
6 articles · coming soon
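As a small preview of the Mixture-of-Experts phase, here is a toy sketch of top-k expert routing with one always-on shared expert. Every name and size in it is an assumption made for illustration; it is not DeepSeek's implementation, and it leaves out the fine-grained expert segmentation and auxiliary-loss-free load balancing that the series will cover.

```python
# Toy sketch of top-k MoE routing with a shared expert (illustrative only).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_model, n_experts, top_k = 8, 4, 2
x = torch.randn(d_model)                       # one token's hidden state

# Each "expert" is just a weight matrix here; real experts are small FFNs
experts = [torch.randn(d_model, d_model) for _ in range(n_experts)]
shared_expert = torch.randn(d_model, d_model)  # every token always passes through it

# Router: score the routed experts, keep the top-k, renormalize their weights
router = torch.randn(n_experts, d_model)
gate_logits = router @ x
top_vals, top_idx = gate_logits.topk(top_k)
gates = F.softmax(top_vals, dim=-1)

# Output = shared-expert path + gate-weighted sum of the selected routed experts
out = shared_expert @ x
for gate, idx in zip(gates, top_idx):
    out = out + gate * (experts[int(idx)] @ x)

print(top_idx.tolist(), out.shape)             # which experts fired for this token
```

The point of the sketch is the sparsity: only top_k of the n_experts matrices touch this token, so total parameter count and per-token compute are decoupled.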