Awesome Reinforcement Learning

A curated path through reinforcement learning: courses and books to learn the theory, libraries and environments to run experiments, and the landmark papers per family (value-based, policy-gradient, actor-critic, model-based, offline, multi-agent, and RLHF). Opinionated and tight.

Courses & learning

Resource	What	Link
David Silver — RL Course (DeepMind/UCL)	The canonical lecture series. Watch this first.	site
Spinning Up in Deep RL (OpenAI)	Best practical intro — clean explanations + reference code.	site
CS285 — Deep RL (Sergey Levine, Berkeley)	The deep graduate course, full videos.	site
Hugging Face Deep RL Course	Hands-on, free, train agents end-to-end.	course
OpenAI Spinning Up — Key Papers	A curated, ordered reading list of the field's classics.	list

Books

Resource	What	Link
Sutton & Barto — Reinforcement Learning: An Introduction	The foundational text, free PDF. Read it cover to cover.	site
Szepesvári — Algorithms for RL	Concise, theoretical treatment of the core algorithms.	pdf
Lapan — Deep RL Hands-On	Practical, code-heavy walkthrough of modern deep RL.	book
Graesser & Keng — Foundations of Deep RL	Clear bridge from theory to implementation.	book

Libraries & frameworks

Resource	What	Link
Stable-Baselines3	Reliable PyTorch implementations of the standard algorithms. Best starting point.	repo
CleanRL	Single-file, readable implementations. Best for learning + research.	repo
Ray RLlib	Scalable, distributed RL for production workloads.	docs
Tianshou	Fast, modular PyTorch RL platform.	repo
TorchRL	Official PyTorch RL library — composable primitives.	repo
TRL (HF)	RLHF / PPO / DPO for fine-tuning language models.	repo

Environments & benchmarks

Resource	What	Link
Gymnasium	The standard RL environment API (maintained Gym fork).	site
PettingZoo	The Gym of multi-agent RL.	site
MuJoCo	The continuous-control physics benchmark suite.	site
Arcade Learning Env (Atari)	The classic deep-RL benchmark.	repo
Isaac Gym / Isaac Lab	Massively-parallel GPU sim for robotics RL.	repo
D4RL	The standard datasets for offline RL.	repo
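
For orientation, the Gymnasium API comes down to `reset()` returning `(observation, info)` and `step(action)` returning `(observation, reward, terminated, truncated, info)`. The sketch below shows the shape of an episode loop against that interface. `CoinFlipEnv` is a hypothetical toy class written for this illustration; it only mimics those signatures, is not part of Gymnasium, and needs nothing installed.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment mimicking the Gymnasium reset/step signatures."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.steps = 0
        self.rng = random.Random()
        self.coin = 0

    def reset(self, seed=None):
        # Gymnasium-style reset: returns (observation, info).
        if seed is not None:
            self.rng = random.Random(seed)
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin, {}

    def step(self, action):
        # Reward 1.0 when the action matches the coin that was just observed.
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)
        terminated = False                        # no natural end state in this toy task
        truncated = self.steps >= self.max_steps  # time-limit cutoff
        return self.coin, reward, terminated, truncated, {}


env = CoinFlipEnv(max_steps=10)
obs, info = env.reset(seed=0)
episode_return = 0.0
done = False
while not done:
    action = obs  # the observation reveals the coin, so copying it is optimal here
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
```

With a real Gymnasium environment the loop body stays the same. Only the environment construction and the policy that picks `action` change.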

Value-based methods

Paper	Why it matters	Link
DQN (2013/2015)	Deep Q-Networks — human-level Atari, started deep RL.	arXiv
Double DQN (2015)	Fixes Q-value overestimation.	arXiv
Prioritized Experience Replay (2015)	Sample important transitions more often.	arXiv
Rainbow (2017)	Combines six DQN improvements into one.	arXiv
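
To make the Double DQN row concrete, here is a minimal sketch of the two bootstrap targets. It assumes Q-values are plain Python lists indexed by action. The function names and all numbers are made up for illustration and are not taken from either paper's code.

```python
def dqn_target(reward, gamma, q_target_next, done):
    """DQN target: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN target: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Illustrative next-state Q-values for three actions (assumed numbers).
q_online_next = [1.0, 2.0, 1.5]  # online network's Q(s', a)
q_target_next = [1.2, 1.8, 2.5]  # target network's Q(s', a)

y_dqn = dqn_target(1.0, 0.9, q_target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.9, q_online_next, q_target_next, done=False)
```

In this toy case the Double DQN target (about 2.62) is lower than the DQN target (about 3.25). Decoupling action selection from evaluation removes the max over a single set of noisy estimates, which is the source of the upward bias.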

Policy gradient & actor-critic

Paper	Why it matters	Link
TRPO (2015)	Trust-region updates — stable policy optimization.	arXiv
A3C (2016)	Asynchronous advantage actor-critic.	arXiv
PPO (2017)	The default workhorse — simple, robust, everywhere.	arXiv
DDPG (2015)	Continuous-control actor-critic with replay.	arXiv
SAC (2018)	Maximum-entropy off-policy — sample-efficient continuous control.	arXiv
TD3 (2018)	Twin-critic fixes to DDPG.	arXiv
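
Much of PPO's simplicity comes from its clipped surrogate objective, which can be sketched in a few lines. The version below is a minimal pure-Python illustration, not a reference implementation. The function name, the batch values, and the `clip_eps=0.2` default are assumptions chosen for the example.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to push the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Hypothetical two-sample batch: one action became more likely, one less likely.
logp_old = [math.log(0.5), math.log(0.5)]
logp_new = [math.log(0.75), math.log(0.25)]
advantages = [1.0, -1.0]

objective = ppo_clip_objective(logp_new, logp_old, advantages)
```

Here the first sample's ratio of 1.5 is clipped to 1.2. For the second sample, the ratio of 0.5 is clipped to 0.8 and the more pessimistic value of -0.8 is kept, so the batch objective is roughly 0.2.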

Model-based & planning

Paper	Why it matters	Link
AlphaGo / AlphaZero (2016/2017)	MCTS + deep nets — superhuman Go/chess/shogi.	arXiv
MuZero (2019)	Planning with a learned model — no rules given.	arXiv
Dreamer / DreamerV3 (2020-23)	World-model RL — learns in imagination, one config across domains.	arXiv

Offline & multi-agent

Paper	Why it matters	Link
CQL (2020)	Conservative Q-learning — the offline-RL workhorse.	arXiv
Decision Transformer (2021)	RL as sequence modeling — condition on return.	arXiv
QMIX (2018)	Value factorization for cooperative multi-agent RL.	arXiv
MADDPG (2017)	Centralized-critic multi-agent actor-critic.	arXiv
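
"Condition on return" in the Decision Transformer row refers to feeding the model the return-to-go at each timestep, meaning the sum of rewards from that step to the end of the trajectory. Here is a minimal sketch of that preprocessing step, assuming undiscounted returns (`gamma=1.0`) and a made-up reward sequence. The function name is illustrative.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each timestep: reward now plus (discounted) future rewards."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last timestep backwards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


rtg = returns_to_go([1.0, 0.0, 2.0])  # [3.0, 2.0, 2.0]
```

At test time the same quantity becomes a knob: the first return-to-go token is set to the return you want, and the observed reward is subtracted after each step.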

RLHF & RL for LLMs

Paper	Why it matters	Link
InstructGPT (2022)	RLHF that made LLMs follow instructions — the ChatGPT recipe.	arXiv
DPO (2023)	Direct Preference Optimization: preference tuning without an explicit reward model or RL loop.	arXiv
GRPO (DeepSeekMath) (2024)	Group Relative Policy Optimization: a critic-free PPO variant behind reasoning models.	arXiv
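
GRPO can drop the critic because it scores each sampled completion against the other completions for the same prompt. A minimal sketch of that group-relative advantage follows. The use of the population standard deviation, the `eps` term, and the example rewards are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption in this sketch
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt (1.0 = correct answer).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions end up with advantages near +1 and incorrect ones near -1. If every completion in a group gets the same reward, all advantages are zero and the group contributes no gradient.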

More curated lists

Resource	What	Link
Awesome RL (aikorea)	The long-standing community link collection.	repo
Papers With Code — RL	Leaderboards + code for every task and benchmark.	site
Lil'Log (Lilian Weng)	The best deep-dive blog posts on policy gradients, exploration, RLHF.	blog

Where to start

New to RL? Watch David Silver's lectures and read Sutton & Barto in parallel, then implement the algorithms yourself, using CleanRL or Spinning Up as a reference. Algorithm order: DQN → PPO → SAC → MuZero. For LLM work, go InstructGPT → DPO → GRPO. RL is finicky: always compare against a tuned PPO/SAC baseline before believing a new result.
