GitHub - walkinglabs/hands-on-modern-rl: 🚀 An open-source, hands-on curriculum bridging the gap from basic RL concepts to LLM alignment, RLVR, and advanced Agentic systems.

A practice-first guide to modern RL, from classic control to LLM post-training, RLVR, and multimodal agents.

Course Preview · Overview · News · Contents · Course Outline · Experiment Code · Quick Start · Contributing

Course Preview

A clear learning map _{From the preface and foundations to frontier topics, the chapter tree and page outline help you navigate quickly.}	Line-by-line code focus _{Key PPO, DPO, and GRPO implementations include code maps that connect formulas to readable code.}
Training metric visualization _{Real curves, metric explanations, and failure signals sit together so you can debug while running experiments.}	LLM post-training pipelines _{RLHF, DPO, GRPO, RLVR, and related topics are tied together through flows, artifacts, and cases.}
Agentic RL experiment _{DeepCoder-style GRPO training curves connect tool-use agents, response length, and reward dynamics in a reproducible lab.}	Atari game experiment _{Atari Pong gameplay screenshots and DQN training notes show how pixel-based agents turn screens into decisions.}

Note

We hope this open course gives more learners the courage to climb toward the frontier of intelligence and solve more of the hard problems on the path to AGI.

The course is evolving quickly. We recommend focusing on chapters that are not marked as under construction; chapters still in progress may contain mistakes, and corrections or suggestions are welcome.

Help Wanted

Because compute resources are limited, we are seeking GPU support. If you can help with GPU access, please contact physicoada@gmail.com.

Overview

Hands-On Modern RL is an open course for learning modern reinforcement learning through practice. Instead of the usual "formula first, black-box API later" route, this course takes a practice-first path: learners begin with runnable code and observable training behavior, then use those concrete traces to understand states, value functions, policy gradients, reward modeling, credit assignment, and the rest of the mathematical structure behind RL.

The course spans classic control and connects directly to current AI frontiers, including large language model (LLM) post-training, preference alignment with DPO and GRPO, reinforcement learning with verifiable rewards (RLVR), multi-turn tool-use agents, Agentic RL, and vision-language model (VLM) reinforcement learning.

The goal is to provide a solid ladder: from solving CartPole for the first time to building modern post-training and agent systems.

Design Principles

The course is organized around these engineering and teaching principles:

Practice before formalism. Each major topic starts from experiments, metrics, failure cases, or implementation details, then introduces the mathematical abstraction.
Theory explains behavior. MDPs, Bellman equations, policy gradients, GAE, PPO clipping, DPO objectives, and GRPO-style group advantages are introduced as tools for explaining what the code does.
Modern RL goes beyond classic RL. The course covers classic control and deep RL, then moves into RLHF, preference optimization, RLVR, VLM reinforcement learning, and multi-turn agent training.
Debugging is first-class. Training collapse, reward hacking, KL drift, entropy decay, OOM failures, and evaluation blind spots are treated as core material.
Readable systems beat black boxes. Examples favor explicit implementations, inspectable metrics, and clear experiment boundaries so learners can modify and extend them.

Audience

This course is for learners who want to understand reinforcement learning by building and inspecting working systems.

It is especially useful for:

Machine learning engineers moving from supervised learning into RL.
Researchers and students preparing to read modern RL and alignment papers.
LLM practitioners interested in RLHF, DPO, GRPO, RLVR, and post-training systems.
Builders of tool-use agents, web agents, code agents, and evaluation pipelines.
Self-learners who prefer code, experiments, and visual intuition before dense derivations.

Recommended background:

Python programming experience.
Basic PyTorch familiarity.
Introductory linear algebra, probability, and calculus for machine learning.
Ability to read papers and trace open-source training scripts.

The course includes math review appendices, so full mathematical fluency is not required on day one.

Learning Goals

After completing the course, learners should be able to:

Implement and explain the core RL loop: environment interaction, trajectory collection, reward feedback, policy updates, and evaluation.
Connect MDPs, value functions, Bellman equations, TD learning, policy gradients, and advantage estimation to concrete training behavior.
Read and modify implementations of DQN, REINFORCE, Actor-Critic, PPO, DPO, GRPO, and related methods.
Reason about LLM post-training pipelines, including SFT, reward modeling, PPO-style RLHF, DPO-family methods, and RLVR training.
Understand multi-turn interaction and credit assignment, and build tool-use, trajectory-synthesis, and Agentic RL systems.
Extend reinforcement learning ideas to VLMs, embodied intelligence, multi-agent self-play, and other frontier areas.
Diagnose common RL failure modes and design reasonable algorithms, engineering evaluations, and debugging workflows for new RL problems.

Current Status

This repository is an active courseware project. Content is being expanded chapter by chapter, with emphasis on correctness, runnable examples, and a stable learning path.

Course site: walkinglabs.github.io/hands-on-modern-rl
Source content: docs/
Runnable examples: code/
Local verification: npm run verify
License: CC BY-NC-SA 4.0

Issues and pull requests are welcome for typo fixes, conceptual corrections, reproducibility improvements, references, and focused course extensions.

News

Note: This course was created with AI assistance and has not yet been fully reviewed. It may contain factual mistakes or code that does not run as expected. Issues and pull requests are very welcome.

[2026-05-15] 📖 Full English Translation & PDF Release: Complete English translation of all chapters is now available. PDF builds for both Chinese and English editions are released automatically via CI.
[2026-05-13] 🚀 Major Upgrade: LLM and Traditional RL Hands-on Labs: Added reproducible training examples for Agentic RL (Deep Research / rLLM) and Traditional RL (Actor-Critic continuous control). Includes complete code and fine-tuning analysis for building an Agentic training system from scratch, along with new VLM RL (GeoQA geometry reasoning) hands-on experiments!
[2026-05-02] Initial browsable open-source release for testing and feedback.

Roadmap

The course is under active development. Planned milestones:

2026-05-02: Initial open-source browsable release for community testing and feedback.
2026-05-10: Publish a first stable minor version, fix early typos, and stabilize Part 1 and Part 2 content and code.
Late May 2026: Improve reproducible LLM RL experiments and add a full RLVR hands-on module with evaluation.
Early June 2026: Deliver Agentic RL projects step by step, from single-tool use to complex Deep Research trajectory synthesis.
Late June 2026: Add Unity-based embodied RL environments and trainable project examples.
July 2026 and later: Expand multimodal frontier content with full VLM RL or Diffusion RL hands-on cases.

Course Outline

The course is divided into seven parts plus appendices. The README keeps only the main modules; the online site contains the full chapter tree, diagrams, code references, and detailed navigation.

Preface

Module	Description
Course Guide	Course positioning, learning path, and how to use the materials.
A Brief History of Reinforcement Learning	From trial-and-error learning to AlphaGo, RLHF, and LLM alignment.
Environment Setup	Installation and dependency setup for the course.

Part I: Fundamentals & Classical RL

Chapter	Main Topic	What It Covers
01	CartPole	States, actions, rewards, policies, value, entropy, and training curves through a first runnable control task.
02	Bandits	ε-greedy, UCB, Thompson sampling, and contextual bandits.
03	MDPs and Value Functions	MDPs, Bellman equations, DP/MC/TD, Q-learning, policy objectives, algorithm taxonomy, and reward design.

Part II: Deep Reinforcement Learning

Chapter	Main Topic	What It Covers
07	Deep Q-Networks	From tabular Q-learning to DQN, replay buffers, target networks, CNN encoders, LunarLander, and Atari.
08	Policy Gradient and REINFORCE	Direct policy optimization, sampling-based gradients, baselines, and variance reduction.
09	Actor-Critic	Actor-critic architecture, advantage functions, TD-error critic training, and Pendulum experiments.
10	PPO	Clipped objectives, trust-region intuition, GAE, reward models, and long-horizon planning.
11	Continuous Control	DDPG, TD3, SAC, model-based RL, and world-model search.

Part III: Advanced RL Methods

Chapter	Main Topic	What It Covers
12	Offline RL	Off-policy data, sequence modeling, CQL/IQL, and offline experiments.
13	Imitation & Meta-RL	Behavioral cloning, DAgger, IRL, GAIL, and meta-RL.
14	Exploration, MARL & Hierarchical RL	Curiosity-driven exploration, multi-agent RL, and hierarchical methods.

Part IV: LLM Alignment & Post-Training

Chapter	Main Topic	What It Covers
15	RLHF	SFT, reward modeling, PPO-style RLHF, evaluation, scaling, and reward hacking.
16	LLM RL Industrial Practice	Distributed sync, modern post-training stacks, and industrial pipelines.
17	DPO Family	DPO derivation, training metrics, and IPO/KTO/ORPO variants.
18	GRPO & RLVR	Group-relative advantages, DeepSeek-R1, DAPO, verifiable rewards, and sandboxed training.
19	Reasoning Models	O1/R1-style reasoning emergence, test-time scaling, hybrid and adaptive thinking.
20	PRM & Inference-Time Search	Outcome vs. process reward models, generative PRMs, and parallel reasoning.
21	CAI & RLAIF	Constitutional AI, HHH alignment, and RLAIF engineering.

Part V: Agentic Reinforcement Learning

Chapter	Main Topic	What It Covers
22	Agentic RL	Multi-turn credit assignment, tool-use trajectories, multi-agent swarms, and industrial practice.
23	RL-based SWE	SWE-bench, DeepSWE, Code World Model, and Self-play SSR.
24	Deep Research Agents	Browser RL harness and deep-research evaluation.
25	Computer Use	GUI agents, training recipes, and safety swarms.

Part VI: Multimodal Reinforcement Learning

Chapter	Main Topic	What It Covers
26	VLM RL	VLM GRPO, visual rewards, Qwen3-VL reflection, and EasyR1 GeoQA practice.
27	Audio RL	Audio reward design and future directions.
28	VLA & Embodied Intelligence	Vision-language-action models and embodied RL.
29	Visual Generation RL	Image and video generation RL.

Part VII: Safety, Evaluation & Research Frontiers

Chapter	Main Topic	What It Covers
30	Alignment Failures	Classical and modern failure modes, scaling laws, sleeper agents, and defenses.
31	AlphaEvolve	Code evolution and discovery via RL.
32	Self-Play & Scaling Outlook	Self-play, RL scaling laws, and LLM multi-agent RL outlook.

Appendices

Appendix	Main Topic	What It Covers
A	Training Debugging Guide	Common RL training failures, symptoms, root causes, and fixes.
B	RL Engineering Practice	Training infrastructure, agent sandboxes, parallelism, monitoring, evaluation benchmarks, metrics, and industrial exercises.
C	Handwritten Code Cheatsheet	Compact code notes for SFT, PPO, DPO, GRPO, sampling, attention, and DAPO.
D	Learning Resources and Reproduction Projects	Curated resources and reproduction projects for expanding course examples.
E	Math Foundations for Reinforcement Learning	Linear algebra, probability, calculus, optimization, and information theory for RL.

Experiment Code

The code/ directory contains runnable examples aligned with course chapters. Each chapter's code is intentionally compact so it can be inspected, run, and modified independently.

Area	Code Path	Representative Experiments
Classic control	`code/chapter01_cartpole/`	Train CartPole, inspect rewards and episode length, and compare PPO implementations.
Preference fine-tuning	`code/chapter02_dpo/`	Generate preference data, train with DPO, and compare model behavior before and after fine-tuning.
MDP and value learning	`code/chapter03_mdp/`	Run bandit strategies, solve GridWorld, and verify Bellman updates numerically.
Deep Q-learning	`code/chapter04_dqn/`	Implement replay buffers, target networks, and Double DQN variants.
Policy gradient	`code/chapter05_policy_gradient/`	Compare REINFORCE, baseline variants, and Actor-Critic updates.
PPO	`code/chapter07_ppo/`	Train LunarLander, inspect clipping, visualize GAE, and compare training stability.
RLHF	`code/chapter08_rlhf/`	Walk through SFT, reward model training, PPO-style alignment, and veRL/GSM8K adapter scripts.
Alignment and RLVR	`code/chapter09_alignment/`, `code/chapter09_grpo_rlvr/`	Explore DPO rewards, GRPO group advantages, and rule-based verifiable rewards.
VLM and agents	`code/chapter10_agentic_rl/`, `code/chapter11_vlm_rl/`	Build tool-use agent trajectory synthesis and implement multimodal model RL examples.
Advanced topics	`code/chapter12_future_trends/`	Study frontier directions including multi-agent RL and model-based RL.

See code/README.md for a code index and chapter-specific dependency notes.

Recommended Learning Path

A practical path through the repository:

Read the course guide and run the CartPole example.
Skim the DPO chapter early, even before finishing all theory, to anchor the motivation for LLM post-training.
Study Chapters 01-11 in order; this is the conceptual core (foundations, value-based and policy-based deep RL, PPO, continuous control).
After understanding policy gradients and PPO, return to Part IV (RLHF, DPO, GRPO, RLVR, reasoning, PRM, CAI).
Use the debugging and engineering appendices whenever a training run behaves strangely.
Treat Parts V-VII as extensions: Agentic RL, multimodal RL, and safety/frontier research.

Quick Start

Read Online

Published course site:

https://walkinglabs.github.io/hands-on-modern-rl/

Run the Documentation Site Locally

Requirements:

Node.js >= 18.0.0
npm

git clone https://github.com/walkinglabs/hands-on-modern-rl.git
cd hands-on-modern-rl
npm install
npm run dev

Then open the local VitePress URL shown in the terminal, usually:

http://localhost:5173

Verify the Site

Before submitting a pull request that changes documentation structure, theme code, navigation, build scripts, or generated assets, run:

npm run verify

This checks formatting, lints the VitePress theme, builds the site, and verifies expected build artifacts.

Run Course Code

Most code examples use Python and are organized by chapter.

cd code
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

For smaller installs, use chapter-specific requirements files:

pip install -r chapter01_cartpole/requirements.txt
python chapter01_cartpole/1-ppo_cartpole.py

Some chapters may require additional system libraries, GPU support, model downloads, or environment-specific setup. Start with Chapter 01 before running examples that involve LLMs, VLMs, or heavy simulators.

Repository Structure

hands-on-modern-rl/
|-- docs/                      # VitePress course content
|   |-- .vitepress/            # Site config, navigation, and theme overrides
|   |-- public/                # Static assets copied into the built site
|   |-- preface/               # Course introduction and history
|   |-- chapter*/              # Main course chapters
|   |-- appendix*/             # Supplementary material and references
|   `-- summaries/             # Part-level review and summary notes
|-- code/                      # Runnable examples aligned with chapters
|-- scripts/                   # Maintenance and verification scripts
|-- package.json               # Site scripts and dependencies
|-- AGENTS.md                  # Repository maintenance guide
`-- README.md                  # Main project overview

Development Commands

npm run dev           # Start the local documentation server
npm run build         # Build the static site
npm run preview       # Preview the built site locally
npm run format        # Format repository files with Prettier
npm run format:check  # Check formatting
npm run lint          # Lint VitePress theme code
npm run verify        # Run format check, lint, build, and artifact verification

Contributing

Contributions should make the course clearer, more accurate, easier to reproduce, or easier to navigate.

Good contributions include:

Fixing conceptual errors, formulas, diagrams, broken links, or typos.
Improving explanations without changing the intended learning path.
Adding small, reproducible experiments that clarify existing chapters.
Improving scripts, build reliability, navigation, or accessibility.
Adding high-quality references to papers, official documentation, or widely used open-source implementations.

Please keep pull requests focused. A good PR usually changes one chapter, one experiment, one group of diagrams, or one infrastructure issue at a time.

When adding content:

Put course material under docs/.
Use kebab-case for new directories and files.
Prefer directory-based routes with index.md.
Update docs/.vitepress/config.mjs when adding navigable pages.
Run npm run verify before requesting review if your change touches config, theme, scripts, or generated site output.
Use Conventional Commits, such as docs: clarify ppo clipping or fix: repair chapter link.

For repository-specific maintenance rules, see AGENTS.md.

Other Courses

Our team has also created other courses. Take a look:

Learn Harness Engineering — A course on Harness Engineering for AI coding agents. Through 12 lectures and 6 projects, it teaches you to build instructions, state management, verification, and control mechanisms that make model output reliable.
Modern LLM Notebook — Build modern LLMs from scratch through 23 runnable Jupyter Notebooks in PyTorch, covering Tokenizer, Transformer, training, inference, alignment, and frontier topics.

Discussion Group (WeChat)

For suggestions or feedback, scan the QR code to join the discussion group (WeChat):

Citation

If you use this course in teaching materials, study notes, or derivative non-commercial educational work, please cite the repository:

@misc{hands_on_modern_rl,
  title        = {Hands-On Modern RL: Practice-first reinforcement learning from CartPole to LLM post-training and agentic systems},
  author       = {WalkingLabs},
  year         = {2026},
  howpublished = {\url{https://github.com/walkinglabs/hands-on-modern-rl}},
  note         = {Open courseware repository}
}

License

This course is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

You may share and adapt the material for non-commercial purposes, provided that you give appropriate credit and distribute derivative works under the same license.

Star History

_{Maintained by WalkingLabs and contributors.}

Name		Name	Last commit message	Last commit date
Latest commit History 457 Commits
.github/workflows		.github/workflows
code		code
docs		docs
planning		planning
scripts		scripts
.gitignore		.gitignore
.prettierignore		.prettierignore
.prettierrc		.prettierrc
LICENSE		LICENSE
README.md		README.md
README.zh.md		README.zh.md
eslint.config.js		eslint.config.js
package-lock.json		package-lock.json
package.json		package.json
restructure.py		restructure.py
vercel.json		vercel.json

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Course Preview

Contents

Overview

Design Principles

Audience

Learning Goals

Current Status

News

Roadmap

Course Outline

Preface

Part I: Fundamentals & Classical RL

Part II: Deep Reinforcement Learning

Part III: Advanced RL Methods

Part IV: LLM Alignment & Post-Training

Part V: Agentic Reinforcement Learning

Part VI: Multimodal Reinforcement Learning

Part VII: Safety, Evaluation & Research Frontiers

Appendices

Experiment Code

Recommended Learning Path

Quick Start

Read Online

Run the Documentation Site Locally

Verify the Site

Run Course Code

Repository Structure

Development Commands

Contributing

Other Courses

Discussion Group (WeChat)

Citation

License

Star History

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages