notes: study material

Two layers live in this directory, mixed.

Trusted, hand-written:

  • cs294-2017/: personal student notes from CS 294 Deep RL (Berkeley, Spring 2017: Levine, Schulman, Finn). 246 lines of working notes from the field being built. Idiosyncratic, kept as written. status: hand-written.
  • sutton-barto-digest/: short distillation of the four elements of an RL system (policy, reward, value function, model) from Sutton & Barto. status: hand-written.

These are old (2017) and informal, but they're a real person's understanding, not AI text. Trusted as starting points.

AI-drafted, useful as scaffold (unreviewed: treat with skepticism):

  • lectures/: a 34-lecture series taking RL from MDPs through RLHF / DPO / GRPO / RLVR / agentic / offline and on into reasoning, systems, and applications. Lectures 01–19 have had an editorial pass: broken links fixed, code bugs caught (import gym → gymnasium, missing imports, old-API env.step calls), citations checked or removed when they didn't resolve, fake-first-person framing stripped. Lectures 20–34 are newer drafts without that pass. No lecture has been read end to end and signed off by a person. Cross-check the math against the cited papers; treat the code as a starting point that needs verification. Index and per-lecture review status below.
  • cheat-sheets/: RL-Math-Formulas.md and RL-Quick-Reference.md. Audited (caught a wrong KL direction; fixed). Same caveat.
  • diagrams/: RL-Algorithm-Diagrams.md. Audited (caught and fixed a wrong DPO loss diagram and a wrong GRPO advantage diagram). Same caveat.
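The "old-API env.step calls" bug class named above usually comes down to tuple shape: older Gym returned four values from step (obs, reward, done, info), while Gymnasium returns five (obs, reward, terminated, truncated, info). The sketch below is not taken from the lectures; normalize_step is a hypothetical helper name, and reading "TimeLimit.truncated" from info assumes the old time-limit wrapper's convention. It only shows what to check for when verifying lecture code.

```python
from typing import Any, Tuple


def normalize_step(result: tuple) -> Tuple[Any, float, bool, bool, dict]:
    """Return (obs, reward, terminated, truncated, info) for either API shape.

    Hypothetical helper for illustration; it is not part of Gym or Gymnasium.
    """
    if len(result) == 5:
        # Gymnasium-style return: already split into terminated / truncated.
        obs, reward, terminated, truncated, info = result
        return obs, float(reward), bool(terminated), bool(truncated), dict(info)
    if len(result) == 4:
        # Old Gym-style return: a single `done` flag.
        obs, reward, done, info = result
        # Assumption: a time-limit cutoff was flagged in info under this key.
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = bool(done) and not truncated
        return obs, float(reward), terminated, truncated, dict(info)
    raise ValueError(f"unexpected step() result of length {len(result)}")


# An old-style result where the episode really ended:
print(normalize_step((0, 1.0, True, {})))
# (0, 1.0, True, False, {})
```

The distinction matters for value bootstrapping: a truncated episode did not reach a terminal state, so code that treats done as "terminal" will zero out the bootstrap target when it should not.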

../CURRICULUM.md is the suggested order through everything. ../AGENTS.md explains the <!-- status: ... --> convention every doc carries.
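A minimal sketch of how the status convention could be read back out of the docs. The exact syntax is defined in ../AGENTS.md, which is not reproduced here, so the single-line comment form and the function names read_status and statuses_under are assumptions made for illustration.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Assumption: the marker is a single-line HTML comment with a `status:` key.
STATUS_RE = re.compile(r"<!--\s*status:\s*(.*?)\s*-->")


def read_status(text: str) -> Optional[str]:
    """Return the first status value found in the text, or None if absent."""
    match = STATUS_RE.search(text)
    return match.group(1) if match else None


def statuses_under(root: str) -> Dict[str, Optional[str]]:
    """Map each Markdown file under root to its status (None if unmarked)."""
    return {
        str(path): read_status(path.read_text(encoding="utf-8"))
        for path in sorted(Path(root).rglob("*.md"))
    }


print(read_status("<!-- status: hand-written -->\n# Notes"))
# hand-written
```

Files that come back as None are the ones to look at first, since an unmarked doc gives a reader no signal about how far to trust it.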

Lecture series: drafts, in order

| # | Lecture | Status |
|---|---------|--------|
| 01 | MDPs and Bellman equations, exercise: 01-mdps | unreviewed (de-slopped; a fabricated value-function output was removed) |
| 02 | Policy gradients from scratch, exercise: 02-policy-gradients | unreviewed (de-slopped; a broken link and a code bug were fixed) |
| 03 | Value functions & Q-learning, exercise: 03-q-learning | unreviewed (de-slopped; a dead Modern-RL-Research/ path and a missing import fixed) |
| 04 | Actor-critic methods, exercise: 04-actor-critic | unreviewed (de-slopped; a code bug fixed) |
| 05 | Trust regions and TRPO | unreviewed (de-slopped; fabricated training times removed) |
| 06 | PPO | unreviewed (de-slopped; import gym → gymnasium fixed) |
| 07 | Off-policy learning: SAC and TD3 | unreviewed (de-slopped; an old-API env.step call fixed) |
| 08 | Model-based RL | unreviewed (de-slopped; old-API calls + a wrong citation fixed) |
| 09 | Reward modeling for RLHF | unreviewed (de-slopped; citations checked, IDs added) |
| 10 | PPO for language models | unreviewed (de-slopped; a broken next-lecture link + unverified compute claims fixed) |
| 11 | Direct preference optimization | unreviewed (de-slopped; a fabricated paper removed) |
| 12 | Beyond DPO: GRPO, RRHF, IPO | unreviewed (de-slopped; a fabricated benchmark table + a fabricated paper removed) |
| 13 | RLHF for code generation, exercise: 15-grpo-rlvr (related) | unreviewed (de-slopped; CodeRL mis-attributed to Meta → fixed to Salesforce; fabricated benchmark numbers removed) |
| 14 | Constitutional AI, RLAIF, self-improvement | unreviewed (new draft) |
| 15 | RL with verifiable rewards & reasoning models, exercise: 15-grpo-rlvr | unreviewed (new draft) |
| 16 | Agentic RL: tool use, multi-turn | unreviewed (new draft) |
| 17 | Online & iterative preference optimization | unreviewed (new draft) |
| 18 | Distillation of reasoning models | unreviewed (new draft) |
| 19 | Offline RL | unreviewed (new draft) |
| 20 | Exploration: from ε-greedy to intrinsic motivation, exercise: 20-exploration | unreviewed (new draft) |
| 21 | Multi-agent RL and self-play | unreviewed (new draft) |
| 22 | World models | unreviewed (new draft) |
| 23 | Process reward models vs outcome reward models | unreviewed (new draft) |
| 24 | Computer use and browser agents | unreviewed (new draft) |
| 25 | Long-horizon credit assignment | unreviewed (new draft) |
| 26 | RL for mathematical reasoning | unreviewed (new draft) |
| 27 | RLAIF and synthetic preferences at scale | unreviewed (new draft) |
| 28 | Reward hacking and verifier design | unreviewed (new draft) |
| 29 | Distributed RL systems | unreviewed (new draft) |
| 30 | RL inference infrastructure for LLMs | unreviewed (new draft) |
| 31 | Hardware for RL | unreviewed (new draft) |
| 32 | Meta-RL and in-context RL | unreviewed (new draft) |
| 33 | Robotics RL | unreviewed (new draft) |
| 34 | Self-distillation and self-improvement loops | unreviewed (new draft) |

What "unreviewed" means here: nobody has read the lecture end-to-end and signed off on it. The editorial pass (de-slop, fix broken links, catch code bugs, verify citations) has happened for lectures 01–19; that's the parenthetical note next to those rows. Lectures 20–34 are newer drafts that haven't had even that pass yet, so treat them with more caution. The next step for any lecture is for a person to read it and either flip its status to reviewed (with today's date in last-reviewed:) or note what's still wrong.

Planned: a curated paper layer in ../reference/papers/, built from ../tools/lit-builder/ once the LLM scoring step has been run (it needs a credential; see issue #2). Two hand-curated topic READMEs have landed, GRPO-RLVR/ and Agentic-RL/, but their auto-generated PAPERS.md files still need a collector run.

Cheat sheets and diagrams are in cheat-sheets/ and diagrams/, also unreviewed. Beyond RL-Math-Formulas.md and RL-Quick-Reference.md: RLHF-vs-DPO-vs-GRPO.md (side-by-side comparison of the alignment methods), RL-LLM-loops-2026.md (ASCII data-flow diagrams of every training loop), KL-control.md (KL penalties across TRPO/PPO/RLHF/DPO/GRPO), RL-loss-functions.md (one block per algorithm with loss, gradient, code, and tradeoff).

How to use this

Starting from scratch: read the talks/books/courses linked in ../readme.md; they're the trusted external material. The hand-written CS294 notes at cs294-2017/ give you one student's path through the same material.

Already know RL, here for the LLM part: lectures 09 → 11 → 12 → 14 → 15 → 17 cover the RLHF → DPO → GRPO → constitutional AI → RLVR → iterative preference optimization arc.

Here for code generation specifically: lectures 02 (policy-gradient intuition), 10 (PPO for LLMs), 13 (RLHF for code), and 15 (RLVR: the basis of modern reasoning-RL on code).

Prerequisites

  • Calculus (derivatives, chain rule, gradients), probability (expectations, distributions, KL divergence), basic linear algebra.
  • Python at an intermediate level; PyTorch basics; NumPy.
  • A few hours per lecture including coding and debugging.