0bserver07/Study-Reinforcement-Learning

Study Reinforcement Learning

Notes, lectures, and exercises for learning reinforcement learning and how it's used to train language models, from MDPs and policy gradients through RLHF, DPO, and GRPO.

This is a personal study repo, not a library. It mixes notes a person wrote (some going back to a 2017 Berkeley course) with a newer lecture series that hasn't been reviewed yet. Every doc under notes/ and reference/ says at the top whether it's hand-written, reviewed, or unreviewed. See AGENTS.md for how the repo is organized and how to work in it, with a coding agent or solo.

What's here

Trusted, hand-written:

  • notes/cs294-2017/: personal student notes from CS 294 Deep RL (Berkeley, Spring 2017; Levine, Schulman, Finn). 246 lines of notes taken in real time, while the field was being built. Idiosyncratic, opinionated, with the cannon-trajectory aside. Kept as written.
  • notes/sutton-barto-digest/: short distillation of the four elements of an RL system, from Sutton & Barto.
  • Talks, books, courses: the curated external links below. The Pineau intro, Abbeel's deep RL talk, David Silver's UCL course, Sutton & Barto's book, CS285, Spinning Up. Here since 2015. Still the best place to start if you're new.
  • exercises/: five small coding exercises with pytest tests and reference solutions, verified to pass. Implement REINFORCE on CartPole, Q-learning on FrozenLake, value iteration on a gridworld, actor-critic, a tiny GRPO loop on a verifiable arithmetic task.

AI-drafted, useful as scaffold (unreviewed, treat with skepticism):

  • notes/lectures/: a 34-lecture series, MDPs through RLHF / DPO / GRPO / RLVR / agentic / offline. Editorial pass done (broken links, code bugs, made-up citations all caught and fixed), but no person has read each one end-to-end. Cross-check the math against the cited papers before relying on it. Index and per-lecture status in notes/README.md; ordered study path in CURRICULUM.md.
  • notes/cheat-sheets/, notes/diagrams/: quick reference. Same caveat. (FWIW, the audit caught and fixed two wrong loss diagrams in the diagrams file.)
  • reference/papers/: auto-collected paper lists from arXiv (~430 abstracts). Use as a search index, not a curated reading list.
  • tools/: arxiv-collector/ (fetches arXiv papers), lit-builder/ (ICLR/NeurIPS/ICML triage with keyword filter + LLM scoring), content-pipeline/ (drafts blog posts from papers; auxiliary).

AGENTS.md explains the <!-- status: hand-written | reviewed | unreviewed --> convention every doc carries.

Start here

  • New to RL? Start with the talks/books/courses below: Pineau's intro, then Sutton & Barto for foundations, then David Silver's UCL course or CS285 (Berkeley's current version of CS294). The 2017 CS294 notes (notes/cs294-2017/) give you one student's working notes through the same material if you like that genre.
  • Want hands-on? Do the exercises/. They're tested and they actually run. Five of them, a couple of hours each.
  • Curious about modern LLM RL? The 34-lecture series in notes/lectures/ covers RLHF, DPO, GRPO, RLVR, agentic, offline. Drafts; cross-check the claims against the cited papers.
  • Working in this repo with Claude Code or Codex? Read AGENTS.md first.

The landscape

Everything in the lecture series is the same underlying object: an MDP, where an agent picks actions and some signal tells it whether things are going well. What changes between sub-fields is mostly what that signal is and who provides it. Classical RL gets a reward from the environment. RLHF infers a reward from human preference labels. RLAIF replaces the human with an LLM judge or a written constitution. RLVR skips the learned reward model entirely and uses a verifier: a checker for math, a test suite for code. Agentic RL puts the model in a multi-turn loop with an environment that tells it whether the task ultimately succeeded. Offline RL works from logged data only, no fresh interaction.

The map below shows where each family fits. The lectures fill in the details; CURRICULUM.md is the suggested order.

[Figure: The landscape: six reward sources, one MDP. Where each RL family fits: classical RL, RLHF, RLAIF, RLVR, agentic RL, and offline RL.]
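The point above, that the families differ mainly in where the reward signal comes from, can be sketched in a few lines. This is a minimal illustration, not code from the repo: every name here (`RewardFn`, `verifier_reward`, `scored_reward`, `collect`) is hypothetical, the "episodes" are a single step long, and the prompt format `"2+3"` is an assumption made for the example.

```python
from typing import Callable, List, Tuple

# (prompt, response) -> scalar reward. Every name in this sketch is hypothetical.
RewardFn = Callable[[str, str], float]


def verifier_reward(prompt: str, response: str) -> float:
    """RLVR-style source: a checker, no learned reward model.

    Assumes prompts look like "2+3" and the correct response is the sum.
    """
    a, b = (int(x) for x in prompt.split("+"))
    return 1.0 if response.strip() == str(a + b) else 0.0


def scored_reward(scorer: Callable[[str, str], float]) -> RewardFn:
    """RLHF/RLAIF-style source: wrap a scorer (reward model or LLM judge)."""
    def reward(prompt: str, response: str) -> float:
        return float(scorer(prompt, response))
    return reward


def collect(
    policy: Callable[[str], str],
    prompts: List[str],
    reward_fn: RewardFn,
) -> List[Tuple[str, str, float]]:
    """Run one-step episodes: the policy acts, the reward source scores it."""
    episodes = []
    for prompt in prompts:
        response = policy(prompt)
        episodes.append((prompt, response, reward_fn(prompt, response)))
    return episodes


if __name__ == "__main__":
    always_five = lambda prompt: "5"            # stand-in policy
    constant_judge = lambda p, r: 0.5           # stand-in for a learned scorer
    print(collect(always_five, ["2+3", "1+1"], verifier_reward))
    print(collect(always_five, ["2+3", "1+1"], scored_reward(constant_judge)))
```

The data-collection loop is identical in both calls; only the `reward_fn` argument changes. A real RLHF or RLAIF setup would replace `constant_judge` with a trained reward model or an LLM judge, which this sketch does not attempt.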


Talks to start with

  • Introduction to Reinforcement Learning by Joelle Pineau, McGill University:

    • Applications of RL.

    • When to use RL?

    • RL vs supervised learning

    • What is an MDP (Markov Decision Process)?

    • Components of an RL agent:

      • states
      • actions (Probabilistic effects)
      • Reward function
      • Initial state distribution
                                      +-----------------+
               +--------------------- |                 |
               |                      |      Agent      |
               |                      |                 | +---------------------+
               |         +----------> |                 |                       |
               |         |            +-----------------+                       |
               |         |                                                      |
         state |         | reward                                               | action
         S(t)  |         | r(t)                                                 | a(t)
               |         |                                                      |
               |         | +                                                    |
               |         | |  r(t+1) +----------------------------+             |
               |         +-----------+                            |             |
               |           |         |                            | <-----------+
               |           |         |      Environment           |
               |           |  S(t+1) |                            |
               +---------------------+                            |
                           |         +----------------------------+
                           +
        
         * Sutton and Barto (1998)
        
        
    • Explanation of the Markov Property:

    • Maximizing utility in:

      • Episodic tasks
      • Continuing tasks
        • The discount factor, gamma γ
    • What is the policy & what to do with it?

      • A policy defines the action-selection strategy at every state:
    • Value functions:

      • The equations for the value of a policy are (two forms of) Bellman’s equation.
      • (This is a dynamic programming algorithm).
      • Iterative Policy Evaluation:
        • Main idea: turn Bellman equations into update rules.
    • Optimal policies and optimal value functions.

      • Finding a good policy: Policy Iteration (see the talk below by Pieter Abbeel)
      • Finding a good policy: Value iteration
        • Asynchronous value iteration: instead of updating all states on every iteration, focus on important states.
    • Key challenges in RL:

      • Designing the problem domain
        • State representation
        • Action choice
        • Cost/reward signal
      • Acquiring data for training
        • Exploration / exploitation
        • High-cost actions
        • Time-delayed cost/reward signal
      • Function approximation
      • Validation / confidence measures
    • The RL lingo.

    • In large state spaces: Need approximation:

      • Fitted Q-iteration:
        • Use supervised learning to estimate the Q-function from a batch of training data:
        • Input, Output and Loss.
          • e.g., the Arcade Learning Environment
    • Deep Q-network (DQN) and tips.

  • Deep Reinforcement Learning by Pieter Abbeel, EE & CS, UC Berkeley
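
The outline of Pineau's talk above mentions turning Bellman equations into update rules, and value iteration as a way to find a good policy. A minimal sketch of that idea follows, using the update V(s) ← max_a Σ p(s'|s,a) [r + γ V(s')] on a three-state chain. The MDP, the table layout `P[s][a]`, and all function names are made up for illustration; they are not taken from the talk or from the repo's exercises.

```python
GAMMA = 0.9  # discount factor (illustrative value)

# Toy MDP, invented for this sketch. Actions: 0 = left, 1 = right.
# P[s][a] = [(probability, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing goal state
}


def q_value(V, s, a, gamma=GAMMA):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(gamma=GAMMA, tol=1e-8):
    """Apply the Bellman optimality update until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update: later states see the new value
        if delta < tol:
            return V


def greedy_policy(V, gamma=GAMMA):
    """Pick, in each state, the action with the highest one-step lookahead."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a, gamma)) for s in P}


if __name__ == "__main__":
    V = value_iteration()
    print(V)
    print(greedy_policy(V))
```

The sweep here visits every state on each iteration. The asynchronous variant mentioned in the outline would update only a chosen subset of states per iteration, which this sketch does not implement.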

Books

Courses

Some landmark papers in RL for LLMs

(With identifiers so you can check them. The lecture series goes into these.)

  • AlphaCode: Competition-level code generation with AlphaCode, Li et al., Science 2022. arXiv:2203.07814
  • CodeRL: CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, Le et al., NeurIPS 2022. arXiv:2207.01780
  • InstructGPT: Training language models to follow instructions with human feedback, Ouyang et al., 2022. arXiv:2203.02155
  • DPO: Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Rafailov et al., 2023. arXiv:2305.18290
  • DeepSeek-R1: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, DeepSeek-AI, 2025. arXiv:2501.12948

More, organized by topic, in reference/papers/.

Communities


Contributing

Suggestions and corrections welcome via issues or pull requests. If you fix an error in an unreviewed lecture, note what was wrong. That's the most useful kind of contribution here.

License

Licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported.

About

RL study guide — foundations through RLHF, DPO, GRPO, RLVR, agentic RL, and offline RL. Hand-written CS294 notes, 19 lecture drafts, 5 tested exercises, citations that resolve.
