Two layers live in this directory, mixed.
Trusted, hand-written:
- `cs294-2017/`: personal student notes from CS 294 Deep RL (Berkeley, Spring 2017: Levine, Schulman, Finn). 246 lines of working notes taken while the field was still being built. Idiosyncratic, kept as written. `status: hand-written`.
- `sutton-barto-digest/`: short distillation of the four elements of an RL system (policy, reward, value function, model) from Sutton & Barto. `status: hand-written`.

These are old (2017) and informal, but they're a real person's understanding, not AI text. Trusted as starting points.
AI-drafted, useful as scaffold (unreviewed: treat with skepticism):
- `lectures/`: a 34-lecture series taking RL from MDPs through RLHF / DPO / GRPO / RLVR / agentic / offline and on into reasoning, systems, and applications. Lectures 01–19 have had an editorial pass: broken links fixed, code bugs caught (`import gym` → `gymnasium`, missing imports, old-API `env.step` calls), citations checked or removed when they didn't resolve, fake-first-person framing stripped. Lectures 20–34 are newer drafts without that pass. In no case has a person read a lecture end to end and signed off. Cross-check the math against the cited papers; treat the code as a starting point that needs verification. Index and per-lecture review status below.
- `cheat-sheets/`: `RL-Math-Formulas.md` and `RL-Quick-Reference.md`. Audited (caught a wrong KL direction; fixed). Same caveat.
- `diagrams/`: `RL-Algorithm-Diagrams.md`. Audited (caught and fixed a wrong DPO loss diagram and a wrong GRPO advantage diagram). Same caveat.

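
The old-API `env.step` fixes noted for the lectures come from the change between `gym` and `gymnasium`: the older call returned a 4-tuple `(obs, reward, done, info)`, while `gymnasium` returns a 5-tuple `(obs, reward, terminated, truncated, info)`. When verifying lecture code, a small adapter can help spot or bridge the mismatch. This is a minimal sketch, not something from the lectures: `normalize_step` is a hypothetical helper that belongs to neither library, and reading `TimeLimit.truncated` from `info` assumes the convention used by the old `TimeLimit` wrapper.

```python
from typing import Any, Tuple

StepResult = Tuple[Any, float, bool, bool, dict]


def normalize_step(result: tuple) -> StepResult:
    """Convert an env.step(...) result to the 5-tuple gymnasium layout.

    Hypothetical helper for auditing lecture code; not part of gym or gymnasium.
    """
    if len(result) == 5:
        # Already (obs, reward, terminated, truncated, info).
        obs, reward, terminated, truncated, info = result
        return obs, float(reward), bool(terminated), bool(truncated), dict(info)
    if len(result) == 4:
        # Old layout: (obs, reward, done, info).
        obs, reward, done, info = result
        # Assumption: time-limit cutoffs were flagged in info by the old wrapper.
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = bool(done) and not truncated
        return obs, float(reward), terminated, truncated, dict(info)
    raise ValueError(f"unexpected step result of length {len(result)}")


# Example: an old-style result where the episode hit a time limit.
old_style = ([0.1, 0.2], 1.0, True, {"TimeLimit.truncated": True})
obs, reward, terminated, truncated, info = normalize_step(old_style)
```

The split matters when checking value-based code: a TD target should normally stop bootstrapping only on `terminated`, not on `truncated`.
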
`../CURRICULUM.md` is the suggested order through everything. `../AGENTS.md` explains the `<!-- status: ... -->` convention every doc carries.
| # | Lecture | Status |
|---|---|---|
| 01 | MDPs and Bellman equations, exercise: 01-mdps | unreviewed (de-slopped; a fabricated value-function output was removed) |
| 02 | Policy gradients from scratch, exercise: 02-policy-gradients | unreviewed (de-slopped; a broken link and a code bug were fixed) |
| 03 | Value functions & Q-learning, exercise: 03-q-learning | unreviewed (de-slopped; a dead Modern-RL-Research/ path and a missing import fixed) |
| 04 | Actor-critic methods, exercise: 04-actor-critic | unreviewed (de-slopped; a code bug fixed) |
| 05 | Trust regions and TRPO | unreviewed (de-slopped; fabricated training times removed) |
| 06 | PPO | unreviewed (de-slopped; import gym → gymnasium fixed) |
| 07 | Off-policy learning: SAC and TD3 | unreviewed (de-slopped; an old-API env.step call fixed) |
| 08 | Model-based RL | unreviewed (de-slopped; old-API calls + a wrong citation fixed) |
| 09 | Reward modeling for RLHF | unreviewed (de-slopped; citations checked, IDs added) |
| 10 | PPO for language models | unreviewed (de-slopped; a broken next-lecture link + unverified compute claims fixed) |
| 11 | Direct preference optimization | unreviewed (de-slopped; a fabricated paper removed) |
| 12 | Beyond DPO: GRPO, RRHF, IPO | unreviewed (de-slopped; a fabricated benchmark table + a fabricated paper removed) |
| 13 | RLHF for code generation, exercise: 15-grpo-rlvr (related) | unreviewed (de-slopped; CodeRL mis-attributed to Meta → fixed to Salesforce; fabricated benchmark numbers removed) |
| 14 | Constitutional AI, RLAIF, self-improvement | unreviewed (new draft) |
| 15 | RL with verifiable rewards & reasoning models, exercise: 15-grpo-rlvr | unreviewed (new draft) |
| 16 | Agentic RL: tool use, multi-turn | unreviewed (new draft) |
| 17 | Online & iterative preference optimization | unreviewed (new draft) |
| 18 | Distillation of reasoning models | unreviewed (new draft) |
| 19 | Offline RL | unreviewed (new draft) |
| 20 | Exploration: from ε-greedy to intrinsic motivation, exercise: 20-exploration | unreviewed (new draft) |
| 21 | Multi-agent RL and self-play | unreviewed (new draft) |
| 22 | World models | unreviewed (new draft) |
| 23 | Process reward models vs outcome reward models | unreviewed (new draft) |
| 24 | Computer use and browser agents | unreviewed (new draft) |
| 25 | Long-horizon credit assignment | unreviewed (new draft) |
| 26 | RL for mathematical reasoning | unreviewed (new draft) |
| 27 | RLAIF and synthetic preferences at scale | unreviewed (new draft) |
| 28 | Reward hacking and verifier design | unreviewed (new draft) |
| 29 | Distributed RL systems | unreviewed (new draft) |
| 30 | RL inference infrastructure for LLMs | unreviewed (new draft) |
| 31 | Hardware for RL | unreviewed (new draft) |
| 32 | Meta-RL and in-context RL | unreviewed (new draft) |
| 33 | Robotics RL | unreviewed (new draft) |
| 34 | Self-distillation and self-improvement loops | unreviewed (new draft) |
What "unreviewed" means here: nobody has read the lecture end-to-end and signed off on it. The editorial pass (de-slop, fix broken links, catch code bugs, verify citations) has happened for lectures 01–19; that's the parenthetical note next to those rows. Lectures 20–34 are newer drafts that haven't had even that pass yet, so treat them with more caution. The next step for any of them is for a person to read it and either flip it to reviewed (with today's date in `last-reviewed:`) or note what's still wrong.
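
To find which docs still need a human read, the status comments can be scanned mechanically. The following is a minimal sketch under stated assumptions: it assumes the comment looks like `<!-- status: unreviewed -->` and that `reviewed` and `hand-written` are the values that need no further read. The real field layout is defined in `../AGENTS.md`, so the regex and the names `read_status` and `needs_review` are hypothetical.

```python
import re
from typing import Optional

# Assumed shape: <!-- status: some-value ... -->; see ../AGENTS.md for the real convention.
STATUS_RE = re.compile(r"<!--\s*status:\s*([\w-]+)")


def read_status(text: str) -> Optional[str]:
    """Return the first status value found in a document, or None if absent."""
    match = STATUS_RE.search(text)
    return match.group(1) if match else None


def needs_review(text: str) -> bool:
    """Assumption: only 'reviewed' and 'hand-written' docs need no further read."""
    return read_status(text) not in {"reviewed", "hand-written"}


# Example on an in-memory document rather than a real file.
sample = "<!-- status: unreviewed -->\n# Lecture 20"
flag = needs_review(sample)
```

A doc with no status comment at all is flagged as needing review, which errs on the side of caution.
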
Planned: a curated paper layer in ../reference/papers/, built from ../tools/lit-builder/ once the LLM scoring step has been run (it needs a credential; see issue #2). Two hand-curated topic READMEs have landed (GRPO-RLVR/ and Agentic-RL/), but their auto-generated PAPERS.md files still need a collector run.
Cheat sheets and diagrams are in cheat-sheets/ and diagrams/, also unreviewed. Beyond RL-Math-Formulas.md and RL-Quick-Reference.md: RLHF-vs-DPO-vs-GRPO.md (side-by-side comparison of the alignment methods), RL-LLM-loops-2026.md (ASCII data-flow diagrams of every training loop), KL-control.md (KL penalties across TRPO/PPO/RLHF/DPO/GRPO), RL-loss-functions.md (one block per algorithm with loss, gradient, code, and tradeoff).
Starting from scratch: read the talks/books/courses linked in ../readme.md; they're the trusted external material. The hand-written CS 294 notes in cs294-2017/ give you one student's path through the same material.
Already know RL, here for the LLM part: lectures 09 → 11 → 12 → 14 → 15 → 17 cover the RLHF → DPO → GRPO → constitutional AI → RLVR → iterative preference optimization arc.
Here for code generation specifically: lectures 02 (policy-gradient intuition), 10 (PPO for LLMs), 13 (RLHF for code), and 15 (RLVR: the basis of modern reasoning-RL on code).
- Calculus (derivatives, chain rule, gradients), probability (expectations, distributions, KL divergence), basic linear algebra.
- Python at an intermediate level; PyTorch basics; NumPy.
- A few hours per lecture including coding and debugging.