// SERIES

Long-form, multi-part deep dives.

Each series is a connected sequence of articles that builds from foundations to mastery. Pick one and start reading.


ALL SERIES


DeepSeek Engineering Blog Series

From Transformer internals to DeepSeek-V4. 10 planned phases covering every architectural innovation.

#deepseek #ml #transformers
In Progress · 1/10 phases · 7 articles

FlashAttention — The Evolution Series

Four papers. Four years. From IO-aware tiling to Blackwell asymmetric scaling.

#flashattention #gpu #ml
Paper Juice · 4 papers · 54 min

SERIES // 10 PHASES · 50+ ARTICLES · IN PROGRESS

DeepSeek Engineering Blog Series.

From Transformer internals to DeepSeek-V4. A complete technical deep-dive into every architectural innovation — MLA, MoE, MTP, FP8 training, GRPO, and more. Grounded in original papers and code.

7 articles · 24k words · 10 phases · 1/10 completed


Phase 1

LLM Foundations

Build the mental model of how a Large Language Model works — from raw text to generated tokens. No assumed ML background.
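
As a taste of where Phase 1 ends up, here is a toy, framework-free sketch of the text-to-tokens-to-text loop. Everything in it (the six-word vocabulary, the toy_lm stand-in for a real Transformer forward pass) is made up for illustration:

import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]

def toy_lm(token_ids):
    # Stand-in for a Transformer forward pass: deterministic pseudo-random
    # logits over VOCAB, keyed on the last token in the context.
    rng = np.random.default_rng(token_ids[-1])
    return rng.normal(size=len(VOCAB))

def generate(prompt_ids, max_new_tokens=5):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_lm(ids)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()              # softmax over the vocabulary
        next_id = int(probs.argmax())     # greedy decoding: take the argmax
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":     # stop on the end-of-sequence token
            break
    return ids

print([VOCAB[i] for i in generate([1, 2])])   # prompt: "the cat"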

Phase 2

KV Cache & Efficient Attention

Explain the #1 memory bottleneck in LLM inference (the KV cache) and trace the evolution of attention variants: MQA → GQA → MLA.

4 articles · coming soon

2.1 KV Cache Internals · 2.2 MQA Explained · 2.3 GQA Deep Dive · 2.4 Why Attention Scaling Breaks
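
Until those land, a back-of-envelope sketch of the problem Phase 2 tackles: the KV cache grows as layers × KV heads × head dim × sequence length, and MQA/GQA shrink it by sharing KV heads across query heads. The config below is a hypothetical 7B-class model, not any particular DeepSeek checkpoint:

def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for storing both K and V, in 16-bit precision by default.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Hypothetical config: 32 layers, 32 query heads, head_dim 128, 16k context.
print(kv_cache_gib(32, 32, 128, 16_384))   # MHA: one KV head per query head -> 8.0 GiB
print(kv_cache_gib(32, 8, 128, 16_384))    # GQA: 8 shared KV heads          -> 2.0 GiB
print(kv_cache_gib(32, 1, 128, 16_384))    # MQA: a single shared KV head    -> 0.25 GiB
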
Phase 3

DeepSeek MLA (Major Innovation)

The centrepiece technical innovation of DeepSeek-V2/V3 — Multi-Head Latent Attention. 93% KV cache reduction without quality loss.

5 articles · coming soon

3.1 MLA Explained · 3.2 MLA From Scratch · 3.3 MLA vs MQA vs GQA · 3.4 KV Cache Memory · 3.5 MLA + RoPE
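
In the meantime, a shape-level sketch of the core idea, not DeepSeek's exact formulation (the real MLA also splits out decoupled RoPE dimensions, which 3.5 covers): cache one small latent per token and re-expand it into K and V at attention time:

import numpy as np

d_model, n_heads, head_dim, d_latent, seq = 1024, 16, 64, 128, 512
rng = np.random.default_rng(0)

W_down = rng.normal(size=(d_model, d_latent)) * 0.02              # compress hidden -> latent
W_up_k = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.02   # latent -> per-head K
W_up_v = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.02   # latent -> per-head V

h = rng.normal(size=(seq, d_model))      # hidden states for 512 cached tokens
latent = h @ W_down                      # (512, 128): this is all that gets cached
k = (latent @ W_up_k).reshape(seq, n_heads, head_dim)   # rebuilt at attention time
v = (latent @ W_up_v).reshape(seq, n_heads, head_dim)

full_kv_elems = 2 * seq * n_heads * head_dim   # what standard MHA would cache
print(latent.size / full_kv_elems)             # 0.0625: ~94% smaller than full K/V
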
Phase 4

Positional Encoding Evolution

How Transformers understand word order — from integer position to RoPE. The modern standard used by DeepSeek, LLaMA, and Mistral.

5 articles · coming soon

4.1 Integer PE · 4.2 Binary PE · 4.3 Sinusoidal PE · 4.4 RoPE Visual Guide · 4.5 Why RoPE Won
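
A minimal RoPE sketch in the meantime, using the split-half pairing convention (some implementations interleave channels instead; the math is the same): each channel pair of q and k is rotated by a position-dependent angle, so their dot product ends up depending only on relative position:

import numpy as np

def rope(x, pos, base=10000.0):
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation frequency per channel pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]       # pair channel i with channel i + half
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=8)
# Positions (3, 7) and (10, 14) have the same offset, so the scores match:
print(np.isclose(rope(q, 3) @ rope(q, 7), rope(q, 10) @ rope(q, 14)))   # True
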
Phase 5

Mixture of Experts (MoE)

DeepSeek's second pillar — fine-grained expert segmentation, shared experts, and auxiliary-loss-free load balancing.

7 articles · coming soon

5.1 Intro to MoE · 5.2 Routing · 5.3 Visualizing Experts · 5.4 Aux Loss · 5.5 Capacity Factor · 5.6 DeepSeekMoE · 5.7 Code
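
As a warm-up, a toy top-k router showing generic MoE gating rather than DeepSeekMoE's exact recipe (no shared experts or fine-grained segmentation here): each token softmax-scores the experts, keeps its top two, and mixes their outputs:

import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k, tokens = 32, 8, 2, 4

W_gate = rng.normal(size=(d, n_experts)) * 0.1                        # router weights
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]   # toy expert FFNs
x = rng.normal(size=(tokens, d))

scores = x @ W_gate
probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)                            # softmax router scores

out = np.zeros_like(x)
for t in range(tokens):
    chosen = np.argsort(probs[t])[-top_k:]                # indices of the top-k experts
    weights = probs[t, chosen] / probs[t, chosen].sum()   # renormalise the kept gates
    for w, e in zip(weights, chosen):
        out[t] += w * (x[t] @ experts[e])                 # weighted sum of expert outputs

print(out.shape)   # (4, 32): same shape as the input, but each token used only 2 of 8 experts
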
Phase 6

Multi-Token Prediction (MTP)

Predicting multiple future tokens simultaneously — improving training efficiency and enabling speculative decoding.

6 articles · coming soon

6.1 Why NTP Is Limited · 6.2 MTP Intro · 6.3 DeepSeek MTP · 6.4 Training Architecture · 6.5 Code · 6.6 Speculative Decoding
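
A toy sketch of the generic multi-token-prediction setup in the meantime (DeepSeek's actual MTP uses sequential prediction modules, which 6.3 covers): extra heads predict the tokens one and two steps ahead, so every position contributes several training signals instead of one:

import numpy as np

rng = np.random.default_rng(0)
d, vocab, seq, horizons = 64, 100, 16, 2

trunk_out = rng.normal(size=(seq, d))                         # stand-in for Transformer outputs
heads = [rng.normal(size=(d, vocab)) * 0.02 for _ in range(horizons)]
targets = rng.integers(0, vocab, size=seq + horizons)         # toy target token ids

total_nll, count = 0.0, 0
for k, head in enumerate(heads, start=1):                     # head k predicts token t + k
    logits = trunk_out @ head
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    total_nll += -logp[np.arange(seq), targets[k : seq + k]].sum()
    count += seq

print(total_nll / count)   # average cross-entropy over both prediction horizons
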
Phase 7

Quantization & Inference Optimization

DeepSeek-V3 trained a 671B-parameter model with FP8 mixed precision, the first validation of FP8 training at that scale. Fine-grained, tile-based quantization explained.

6 articles · coming soon

7.1 LLM Quantization · 7.2 Mixed Precision · 7.3 Fine-Grained Quant · 7.4 Accumulation Precision · 7.5 Online Quant · 7.6 FP8 Training
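
A crude preview of why fine-grained scaling matters, using a uniform quantization grid as a stand-in for real FP8 (E4M3 values are non-uniform, with a maximum of 448): with one scale per 128-column block instead of one per tensor, a single outlier only degrades its own block:

import numpy as np

def fake_quant(x, levels=256):
    # Quantize-then-dequantize on a uniform grid; a stand-in for FP8 casting.
    scale = np.abs(x).max() / (levels / 2 - 1) + 1e-12
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
act = rng.normal(size=(4, 512))
act[0, 7] = 80.0                         # one large activation outlier

per_tensor = fake_quant(act)             # one scale for the whole tensor
per_block = np.concatenate(              # one scale per 128-column block
    [fake_quant(act[:, i : i + 128]) for i in range(0, 512, 128)], axis=1
)
print(np.abs(per_tensor - act).mean())   # error inflated everywhere by the outlier
print(np.abs(per_block - act).mean())    # error confined to the outlier's block
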
Phase 8

DeepSeek V2/V3 System Design

Training 671B parameters on 2048 GPUs — parallelism strategies, communication topology, and the $5.58M budget.

8 articles · coming soon

8.1 V2 Breakdown · 8.2 V3 Architecture · 8.3 Loss-Free Balancing · 8.4 Sparse Economics · 8.5 Distributed Training · 8.6 Expert Parallelism · 8.7 Communication · 8.8 GPU Memory
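
Until then, some back-of-envelope arithmetic for why 671B parameters force multi-GPU parallelism. The byte counts assume a generic BF16-weights / FP32-Adam recipe and ignore MoE sparsity, FP8, and activations, so they overstate DeepSeek's actual footprint; the point is the order of magnitude:

params = 671e9
gpu_hbm = 80e9                            # one 80 GB accelerator

weights_bf16 = params * 2                 # ~1.3 TB just for BF16 weights
# BF16 weights + BF16 grads + FP32 master weights + Adam m and v (FP32):
train_state = params * (2 + 2 + 4 + 4 + 4)

print(weights_bf16 / gpu_hbm)             # ~17 GPUs just to hold the weights
print(train_state / gpu_hbm)              # ~134 GPUs of HBM before any activations
print(train_state / (2048 * gpu_hbm))     # ~7% of a 2048-GPU cluster's total HBM
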
Phase 9

DeepSeek R1 & Reasoning

The most impactful release — RL for reasoning, GRPO algorithm, emergent chain-of-thought, and test-time compute scaling.

6 articles · coming soon

9.1 R1 Architecture · 9.2 RL for Reasoning · 9.3 GRPO vs PPO · 9.4 Emergent CoT · 9.5 Self-Verification · 9.6 Test-Time Scaling
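
The heart of GRPO fits in a few lines, sketched below with made-up pass/fail rewards: sample a group of completions for the same prompt, score them, and use each reward's z-score within the group as its advantage, with no learned value network (the key contrast with PPO, covered in 9.3):

import numpy as np

# Rewards for 8 sampled completions of one prompt (e.g. 1 = correct final answer).
rewards = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])

# Group-relative advantage: z-score within the group, no critic network needed.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(advantages.round(2))   # above-average completions get +1, the rest -1;
                             # the policy gradient then upweights the winners' tokens.
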
Phase 10

Future DeepSeek Systems

V4, million-token context, agentic architectures, and the future of open-source frontier models.

6 articles · coming soon

10.1 Agentic Architecture · 10.2 V4 Architecture · 10.3 Million-Token Context · 10.4 Beyond CUDA · 10.5 DeepSeek vs GPT-5 · 10.6 Open-Source Future