Awesome Reinforcement Learning.

A curated path through reinforcement learning: courses and books to learn the theory, libraries and environments to run experiments, and the landmark papers for each family (value-based, policy-gradient, actor-critic, model-based, offline, multi-agent, and RLHF). Opinionated and tight.

Courses & learning

Resource | What | Link
David Silver: RL Course (DeepMind/UCL) | The canonical lecture series. Watch this first. | site
Spinning Up in Deep RL (OpenAI) | Best practical intro: clean explanations plus reference code. | site
CS285: Deep RL (Sergey Levine, Berkeley) | The deep graduate course, full videos. | site
Hugging Face Deep RL Course | Hands-on and free; train agents end-to-end. | course
OpenAI Spinning Up: Key Papers | A curated, ordered reading list of the field's classics. | list

Books

Resource | What | Link
Sutton & Barto, Reinforcement Learning: An Introduction | The foundational text, free PDF. Read it cover to cover. | site
Szepesvári, Algorithms for RL | Concise, theoretical treatment of the core algorithms. | pdf
Lapan, Deep RL Hands-On | Practical, code-heavy walkthrough of modern deep RL. | book
Graesser & Keng, Foundations of Deep RL | Clear bridge from theory to implementation. | book

Libraries & frameworks

Resource | What | Link
Stable-Baselines3 | Reliable PyTorch implementations of the standard algorithms. Best starting point. | repo
CleanRL | Single-file, readable implementations. Best for learning and research. | repo
Ray RLlib | Scalable, distributed RL for production workloads. | docs
Tianshou | Fast, modular PyTorch RL platform. | repo
TorchRL | Official PyTorch RL library with composable primitives. | repo
TRL (HF) | RLHF / PPO / DPO for fine-tuning language models. | repo

Environments & benchmarks

Resource | What | Link
Gymnasium | The standard RL environment API (maintained Gym fork). | site
PettingZoo | The Gymnasium-style API for multi-agent environments. | site
MuJoCo | The physics engine behind the standard continuous-control benchmarks. | site
Arcade Learning Env (Atari) | The classic deep-RL benchmark. | repo
Isaac Gym / Isaac Lab | Massively parallel GPU simulation for robotics RL. | repo
D4RL | The standard datasets for offline RL. | repo
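
To make "the standard RL environment API" concrete, here is a minimal sketch of the interaction loop that Gymnasium-style environments expose: `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. `CorridorEnv` and `run_episode` are hypothetical names invented for this illustration so the loop runs without any dependency; they are not part of Gymnasium.

```python
class CorridorEnv:
    """Hypothetical toy environment mimicking the Gymnasium reset/step signatures."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self, seed=None):
        # seed is accepted for signature parity only; this toy env is deterministic.
        self.position = 0
        self.steps = 0
        return self.position, {}  # (observation, info)

    def step(self, action):
        # action: 0 = move left, 1 = move right; the left wall clamps at 0.
        self.position = max(0, self.position + (1 if action == 1 else -1))
        self.steps += 1
        terminated = self.position >= self.length - 1  # reached the goal cell
        truncated = not terminated and self.steps >= self.max_steps  # time limit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run_episode(env, policy, seed=0):
    """Roll out one episode and return (total_reward, number_of_steps)."""
    obs, info = env.reset(seed=seed)
    total_reward, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
        done = terminated or truncated
    return total_reward, env.steps
```

With an always-right policy, `run_episode(CorridorEnv(), lambda obs: 1)` reaches the goal in 4 steps with total reward 1.0. An always-left policy runs into the 20-step time limit instead, which is the case the separate `truncated` flag exists to report.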

Value-based methods

Paper | Why it matters | Link
DQN (2013/2015) | Deep Q-Networks: human-level Atari, started deep RL. | arXiv
Double DQN (2015) | Fixes Q-value overestimation. | arXiv
Prioritized Experience Replay (2015) | Sample important transitions more often. | arXiv
Rainbow (2017) | Combines six DQN improvements into one. | arXiv
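
The overestimation fix in Double DQN comes down to one change in the bootstrap target, sketched below in plain Python. The function names and the list-of-floats representation of Q-values are assumptions made for this illustration; real implementations work on batched tensors.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """DQN target: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: the online network selects, the target network evaluates."""
    if done:
        return reward
    # Greedy action according to the online network's Q-values for the next state.
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # That action is then scored by the target network.
    return reward + gamma * next_q_target[best_action]
```

Taking the max over noisy estimates is biased upward. Decoupling action selection from action evaluation means a single network's noise no longer inflates the target on its own.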

Policy gradient & actor-critic

Paper | Why it matters | Link
TRPO (2015) | Trust-region updates for stable policy optimization. | arXiv
A3C (2016) | Asynchronous advantage actor-critic. | arXiv
PPO (2017) | The default workhorse: simple, robust, everywhere. | arXiv
DDPG (2015) | Continuous-control actor-critic with replay. | arXiv
SAC (2018) | Maximum-entropy off-policy learning for sample-efficient continuous control. | arXiv
TD3 (2018) | Twin-critic fixes to DDPG. | arXiv
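
Much of PPO's simplicity comes from its clipped surrogate objective, which stands in for TRPO's explicit trust-region constraint. The sketch below computes that objective over a small batch in plain Python; the function name and list inputs are assumptions for illustration, and the value-loss and entropy terms that full implementations usually add are left out.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch of samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the new policy and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

For a positive advantage, the objective stops growing once the ratio exceeds `1 + clip_eps`. That removes the incentive to move far from the old policy in a single update.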

Model-based & planning

Paper | Why it matters | Link
AlphaGo / AlphaZero (2016/2017) | MCTS plus deep nets: superhuman Go, chess, and shogi. | arXiv
MuZero (2019) | Planning with a learned model; no rules given. | arXiv
Dreamer / DreamerV3 (2020-23) | World-model RL: learns in imagination, one config across domains. | arXiv

Offline & multi-agent

Paper | Why it matters | Link
CQL (2020) | Conservative Q-learning, the offline-RL workhorse. | arXiv
Decision Transformer (2021) | RL as sequence modeling, conditioned on return. | arXiv
QMIX (2018) | Value factorization for cooperative multi-agent RL. | arXiv
MADDPG (2017) | Centralized-critic multi-agent actor-critic. | arXiv
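
In Decision Transformer, "conditioning on return" means each timestep is tagged with its return-to-go, the sum of rewards from that step to the end of the trajectory. A minimal sketch of that preprocessing step follows; the function name and the optional `gamma` argument are assumptions for illustration, with `gamma=1.0` giving the undiscounted sum.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go for every timestep of one trajectory."""
    rtg, running = [], 0.0
    # Accumulate from the last reward backwards, then restore time order.
    for r in reversed(rewards):
        running = r + gamma * running
        rtg.append(running)
    rtg.reverse()
    return rtg
```

For rewards `[1.0, 0.0, 2.0]` this gives `[3.0, 2.0, 2.0]`. At test time the first value is replaced by the return you want the model to aim for.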

RLHF & RL for LLMs

Paper | Why it matters | Link
InstructGPT (2022) | RLHF that made LLMs follow instructions; the ChatGPT recipe. | arXiv
DPO (2023) | Direct Preference Optimization: preference tuning without an explicit reward model or RL loop. | arXiv
GRPO (DeepSeekMath) (2024) | Group Relative Policy Optimization: a critic-free PPO variant behind reasoning models. | arXiv
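
The two ideas that set DPO and GRPO apart from PPO-style RLHF are small enough to write down directly. The sketch below shows the DPO loss for a single preference pair and the group-relative advantage used by GRPO. The function names, scalar inputs, the `eps` term, and the choice of population standard deviation are assumptions for illustration; libraries such as TRL operate on batched sequence log-probabilities.

```python
import math
import statistics


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed sequence log-probs."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


def grpo_advantages(group_rewards, eps=1e-8):
    """Advantages for a group of completions sampled from the same prompt."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)  # assumption: population std
    # Each completion is scored relative to its own group, so no critic is needed.
    return [(r - mean) / (std + eps) for r in group_rewards]
```

When policy and reference agree, the DPO margin is zero and the loss is `log(2)`; it falls as the policy favors the chosen answer more strongly than the reference does. In GRPO the group mean acts as the baseline, which is how it avoids training a separate value network.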

More curated lists

Resource | What | Link
Awesome RL (aikorea) | The long-standing community link collection. | repo
Papers With Code, RL | Leaderboards and code for RL tasks and benchmarks. | site
Lil'Log (Lilian Weng) | The best deep-dive blog posts on policy gradients, exploration, and RLHF. | blog

Where to start

New to RL? Watch David Silver, read Sutton & Barto in parallel, then implement from CleanRL or Spinning Up. Algorithm order: DQN → PPO → SAC → MuZero. For LLM work, go InstructGPT → DPO → GRPO. RL is finicky, so always compare against a tuned PPO/SAC baseline before believing a new result.

© cvam — written in plaintext, served warm