Papers — cvam.sight

TRANSFORMERS & DEEPSEEK

2017

Attention Is All You Need

Vaswani et al. introduce the Transformer: a sequence model built entirely on attention, with no recurrence or convolution.

2024

DeepSeek-V2

Multi-Head Latent Attention (MLA) compresses keys and values into a low-rank latent, cutting the KV cache by 93.3% relative to DeepSeek 67B. The core innovation.

2024

DeepSeek-V3 Technical Report

671B total params, 37B activated per token. Reported training cost of $5.58M (2.788M H800 GPU-hours). MLA + MoE + MTP at scale.
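
The headline cost is simple arithmetic on the GPU-hour breakdown in the technical report. A minimal sketch follows, using the report's assumed rental price of $2 per H800 GPU-hour. The variable names are illustrative, and the figure covers only the final training run, not earlier research or ablations.

```python
# GPU-hours per stage, as reported in the DeepSeek-V3 technical report.
gpu_hours = {
    "pre_training": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}

# Assumption taken from the report: H800 rental at $2 per GPU-hour.
RENTAL_USD_PER_GPU_HOUR = 2.0

total_hours = sum(gpu_hours.values())
total_cost_usd = total_hours * RENTAL_USD_PER_GPU_HOUR

print(f"{total_hours:,} GPU-hours -> ${total_cost_usd / 1e6:.3f}M")
```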

2025

DeepSeek-R1

Reasoning behavior that emerges from large-scale RL; the R1-Zero variant is trained with RL alone, with no supervised fine-tuning. GRPO, chain-of-thought, self-verification.
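
GRPO replaces a learned value baseline with statistics from a group of sampled answers to the same prompt. The sketch below shows only that group-relative advantage step, not the full policy update. The function name, the use of population standard deviation, and the `eps` guard are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    rewards: one scalar reward per answer sampled for the same prompt.
    Returns (r_i - mean) / (std + eps) for each answer.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids 0/0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, rewarded 1.0 if correct and 0.0 if not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages. A group whose answers all score the same yields all zeros, so it contributes no gradient signal.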

FLASHATTENTION SERIES

2022

FlashAttention

IO-aware exact attention via tiling and online softmax. 2–4× speedup, with memory linear in sequence length rather than quadratic.
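
The online softmax trick is what lets attention run block by block without ever holding a full row of scores. Below is a minimal pure-Python sketch for a single query with scalar values. It illustrates only the rescaling recurrence, not the GPU kernel, and the function names and block size are hypothetical.

```python
import math


def attention_row_online(scores, values, block=2):
    """Softmax-weighted sum of `values`, visiting keys one block at a time."""
    m = float("-inf")  # running max of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        # Rescale old statistics to the new max (exp(-inf) = 0 on the first block).
        scale = math.exp(m - m_new)
        l = l * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - m_new) * v for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l


def attention_row_reference(scores, values):
    """Same quantity computed the usual way, with all scores in memory."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Both functions return the same result up to floating-point rounding, which is why the tiled computation is exact rather than approximate.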

2023

FlashAttention-2

Roughly 2× faster than FlashAttention via better work partitioning across thread blocks and warps. Up to 230 TFLOPs/s on A100. Single author: Tri Dao.

2024

FlashAttention-3

H100-optimized with async execution, warp specialization, and FP8. Up to 740 TFLOPs/s in FP16.

VISION & LANGUAGE

2022

Flamingo

Visual language model for few-shot learning. Perceiver Resampler + gated cross-attention. Beats fine-tuned state-of-the-art models on several benchmarks with only 32 examples. NeurIPS 2022.

2023

I-JEPA

Self-supervised vision by predicting abstract representations, not pixels. No augmentations, 10× cheaper than MAE. CVPR 2023.

2026

LocateAnything

Parallel Box Decoding for vision-language grounding. Emits whole boxes at once — up to 10× faster than Qwen3-VL and more accurate. NVIDIA.

YC PAPER CLUB

2026

Speculative Speculative Decoding

Parallelizes the draft↔verify loop of speculative decoding. ~30% over speculative baselines, up to 5× over plain decoding. Kumar, Dao, May.
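
For context, the draft↔verify loop being parallelized rests on a per-token accept/reject rule. The sketch below shows that standard speculative sampling rule, which is the baseline and not this paper's method. The function name and the explicit uniform sample `u` are assumptions made to keep the example deterministic.

```python
def verify_draft_token(token, p_target, q_draft, u):
    """Accept or reject one drafted token.

    p_target, q_draft: probability lists over the vocabulary from the
    target and draft models at this position.
    u: a uniform sample in [0, 1), passed in so the example is deterministic.
    Returns (accepted, resample_dist); resample_dist is None on acceptance.
    """
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if u < accept_prob:
        return True, None
    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    return False, [r / total for r in residual]
```

Accepting with probability min(1, p/q) and resampling from the residual on rejection keeps the output distribution identical to sampling from the target model alone.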

2024

Diffusion Model Predictive Control

Multi-step action proposal + dynamics model, both diffusion, for online MPC. Matches SOTA offline RL on D4RL. TMLR.

2026

LeWorldModel

Stable end-to-end JEPA from pixels. Two loss terms, one hyperparameter, 15M params, 48× faster planning. LeCun et al.

2025

Deep Learning is Not So Mysterious or Different

Benign overfitting, double descent, overparametrization — none unique to neural nets. Soft inductive biases + PAC-Bayes. Andrew Gordon Wilson.

2025

Pre-training Under Infinite Compute

Data fixed, compute free. Weight decay 30× standard, asymptote-fitting, ensemble scaling, distillation. Kim, Kotha, Liang, Hashimoto.

POST-TRAINING & DISTILLATION

2026

Rethinking On-Policy Distillation

When OPD works, when it silently fails. Two conditions, 97–99% shared top-k mass, and why long-horizon reasoning may break the recipe.

PROMPT OPTIMIZATION

2026

GEPA

Genetic-Pareto prompt optimization. Beats GRPO by 6% on average while using up to 35× fewer rollouts. ICLR 2026 Oral.

Key papers referenced across the blog.
