Original research papers that form the foundation of the articles on this site. Sorted by topic.
Vaswani et al. — the Transformer architecture that started it all.
(2024) Multi-Head Latent Attention: 93% KV cache reduction. The core innovation.
(2024) 671B params, $5.58M training cost. MLA + MoE + MTP at scale.
(2025) Emergent reasoning from pure RL. GRPO, chain-of-thought, self-verification.
IO-aware exact attention via tiling and online softmax. 2–4× speedup, memory linear in sequence length.
(2023) 2× faster via better work partitioning. Up to 230 TFLOPs/s on A100. Single author: Tri Dao.
(2024) H100-optimized with async execution, warp specialization, and FP8. Up to 740 TFLOPs/s in FP16.
Visual language model for few-shot learning. Perceiver Resampler + gated cross-attention. Beats fine-tuned models with 32 examples. NeurIPS 2022.
(2023) Self-supervised vision by predicting abstract representations, not pixels. No augmentations, 10× cheaper than MAE. CVPR 2023.
(2026) Parallel Box Decoding for vision-language grounding. Emits whole boxes at once: up to 10× faster than Qwen3-VL and more accurate. NVIDIA.
Parallelizes the draft↔verify loop of speculative decoding. ~30% faster than speculative baselines, up to 5× faster than plain decoding. Kumar, Dao, May.
(2024) Multi-step action proposal and dynamics model, both diffusion-based, for online MPC. Matches state-of-the-art offline RL methods on D4RL. TMLR.
(2026) Stable end-to-end JEPA from pixels. Two loss terms, one hyperparameter, 15M params, 48× faster planning. LeCun et al.
(2025) Benign overfitting, double descent, and overparametrization: none are unique to neural nets. Soft inductive biases + PAC-Bayes. Andrew Gordon Wilson.
(2025) Data fixed, compute unconstrained. Weight decay 30× standard, asymptote-fitting, ensemble scaling, distillation. Kim, Kotha, Liang, Hashimoto.