A curated path through reinforcement learning — courses and books to learn the theory, libraries
and environments to run experiments, and the landmark papers per family: value-based,
policy-gradient, actor-critic, model-based, offline, multi-agent, and RLHF. Opinionated and tight.
Courses & learning
| Resource | What | Link |
| David Silver — RL Course (DeepMind/UCL) | The canonical lecture series. Watch this first. | site |
| Spinning Up in Deep RL (OpenAI) | Best practical intro — clean explanations + reference code. | site |
| CS285 — Deep RL (Sergey Levine, Berkeley) | The deep graduate course, full videos. | site |
| Hugging Face Deep RL Course | Hands-on, free, train agents end-to-end. | course |
| OpenAI Spinning Up — Key Papers | A curated, ordered reading list of the field's classics. | list |
Books
| Resource | What | Link |
| Sutton & Barto — Reinforcement Learning: An Introduction | The foundational text, free PDF. Read it cover to cover. | site |
| Szepesvári — Algorithms for RL | Concise, theoretical treatment of the core algorithms. | pdf |
| Lapan — Deep RL Hands-On | Practical, code-heavy walkthrough of modern deep RL. | book |
| Graesser & Keng — Foundations of Deep RL | Clear bridge from theory to implementation. | book |
Libraries & frameworks
| Resource | What | Link |
| Stable-Baselines3 | Reliable PyTorch implementations of the standard algorithms. Best starting point. | repo |
| CleanRL | Single-file, readable implementations. Best for learning + research. | repo |
| Ray RLlib | Scalable, distributed RL for production workloads. | docs |
| Tianshou | Fast, modular PyTorch RL platform. | repo |
| TorchRL | Official PyTorch RL library — composable primitives. | repo |
| TRL (HF) | RLHF / PPO / DPO for fine-tuning language models. | repo |
Environments & benchmarks
| Resource | What | Link |
| Gymnasium | The standard RL environment API (maintained Gym fork). | site |
| PettingZoo | The Gym of multi-agent RL. | site |
| MuJoCo | The physics engine behind the standard continuous-control benchmarks. | site |
| Arcade Learning Env (Atari) | The classic deep-RL benchmark. | repo |
| Isaac Gym / Isaac Lab | Massively parallel GPU simulation for robotics RL. | repo |
| D4RL | The standard datasets for offline RL. | repo |
Value-based methods
| Paper | Why it matters | Link |
| DQN (2013/2015) | Deep Q-Networks — human-level Atari, started deep RL. | arXiv |
| Double DQN (2015) | Fixes Q-value overestimation. | arXiv |
| Prioritized Experience Replay (2015) | Sample important transitions more often. | arXiv |
| Rainbow (2017) | Combines six DQN improvements into one. | arXiv |
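The overestimation fix in Double DQN comes down to one change in the TD target, and a small sketch can make it concrete. The Q-values below are hypothetical lists for the next state (in practice they come from an online and a target network), and the function names are illustrative, not taken from any of the libraries above.

```python
# Sketch: one-step TD targets for DQN vs. Double DQN on a single transition.
# q_online_next / q_target_next are hypothetical next-state Q-values,
# indexed by action; real agents get them from two neural networks.

def dqn_target(reward, gamma, q_target_next, done):
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


q_online_next = [1.0, 2.0, 1.5]  # online network prefers action 1
q_target_next = [1.2, 1.8, 3.0]  # target network has a (possibly noisy) spike on action 2

y_dqn = dqn_target(1.0, 0.9, q_target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.9, q_online_next, q_target_next, done=False)
print(y_dqn, y_ddqn)  # the Double DQN target ignores the spike on action 2
```

Decoupling action selection from evaluation means a single inflated estimate has to fool both networks before it leaks into the target.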
Policy gradient & actor-critic
| Paper | Why it matters | Link |
| TRPO (2015) | Trust-region updates — stable policy optimization. | arXiv |
| A3C (2016) | Asynchronous advantage actor-critic. | arXiv |
| PPO (2017) | The default workhorse — simple, robust, everywhere. | arXiv |
| DDPG (2015) | Continuous-control actor-critic with replay. | arXiv |
| SAC (2018) | Maximum-entropy off-policy — sample-efficient continuous control. | arXiv |
| TD3 (2018) | Twin-critic fixes to DDPG. | arXiv |
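Much of PPO's robustness is usually attributed to its clipped surrogate objective. The following is a minimal pure-Python sketch of that objective for a batch of samples; the function name and the default `clip_eps=0.2` are assumptions for illustration, and real implementations operate on tensors and add value and entropy terms.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum keeps the pessimistic estimate of the two.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# A ratio of 1.5 with a positive advantage is capped at 1 + clip_eps.
print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective goes flat, so there is no gradient incentive to move the policy further on that sample.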
Model-based & planning
| Paper | Why it matters | Link |
| AlphaGo / AlphaZero (2016/2017) | MCTS + deep nets — superhuman Go/chess/shogi. | arXiv |
| MuZero (2019) | Planning with a learned model — no rules given. | arXiv |
| Dreamer / DreamerV3 (2020-23) | World-model RL — learns in imagination, one config across domains. | arXiv |
Offline & multi-agent
| Paper | Why it matters | Link |
| CQL (2020) | Conservative Q-learning — the offline-RL workhorse. | arXiv |
| Decision Transformer (2021) | RL as sequence modeling — condition on return. | arXiv |
| QMIX (2018) | Value factorization for cooperative multi-agent RL. | arXiv |
| MADDPG (2017) | Centralized-critic multi-agent actor-critic. | arXiv |
RLHF & RL for LLMs
| Paper | Why it matters | Link |
| InstructGPT (2022) | RLHF that made LLMs follow instructions — the ChatGPT recipe. | arXiv |
| DPO (2023) | Direct Preference Optimization: preference tuning without a separate reward model or RL loop. | arXiv |
| GRPO (DeepSeekMath) (2024) | Group Relative Policy Optimization: a critic-free PPO variant behind reasoning models. | arXiv |
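Two ideas from this table fit in a few lines each: the DPO loss on one preference pair, and the group-relative advantage that lets GRPO drop the critic. This is a sketch under stated assumptions: the function names, `beta=0.1`, the `eps` term, and the use of the population standard deviation are illustrative choices, and sequence log-probabilities are passed in as plain floats.

```python
import math
import statistics


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(margin)) for one (chosen, rejected) pair.

    The margin is beta times the difference in log-ratio (policy vs. frozen
    reference) between the chosen and the rejected response.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Assumption: population std plus a small eps; implementations differ here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


print(dpo_loss(-1.0, -3.0, -2.0, -2.0))       # policy already prefers the chosen answer
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct samples get positive advantage
```

In the GRPO sketch the group mean plays the role a learned value function plays in PPO, which is why no critic network is needed.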
More curated lists
| Resource | What | Link |
| Awesome RL (aikorea) | The long-standing community link collection. | repo |
| Papers With Code — RL | Leaderboards + code across RL tasks and benchmarks. | site |
| Lil'Log (Lilian Weng) | The best deep-dive blog posts on policy gradients, exploration, RLHF. | blog |
Where to start
New to RL? Watch David Silver's lectures, read Sutton & Barto in parallel, then implement algorithms from CleanRL or
Spinning Up. Algorithm order: DQN → PPO → SAC → MuZero. For LLM work, go InstructGPT → DPO → GRPO.
RL is finicky — always compare against a tuned PPO/SAC baseline before believing a new result.
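One way to make the baseline comparison concrete is to run both methods over several seeds and look at an interval on the difference in mean return, not at a single run. The sketch below uses a simple percentile bootstrap; the function name is hypothetical, and the return values in the usage example are placeholder numbers, not results from any experiment.

```python
import random
import statistics


def bootstrap_mean_diff_ci(new_returns, baseline_returns,
                           n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(new) - mean(baseline).

    Each list holds one final return per training seed.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample seeds with replacement, independently for each method.
        new_sample = rng.choices(new_returns, k=len(new_returns))
        base_sample = rng.choices(baseline_returns, k=len(baseline_returns))
        diffs.append(statistics.fmean(new_sample) - statistics.fmean(base_sample))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-seed returns, for illustration only.
new_method = [212.0, 198.0, 240.0, 205.0, 221.0]
tuned_ppo = [201.0, 215.0, 190.0, 208.0, 230.0]
print(bootstrap_mean_diff_ci(new_method, tuned_ppo))
```

If the interval comfortably includes zero, the new method has not yet shown an improvement over the tuned baseline, and more seeds are likely needed.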