Reference: GRPO and RL with verifiable rewards

A reading list focused on Group Relative Policy Optimization (GRPO), RL with verifiable rewards (RLVR), and the reasoning-model lineage that builds on them. For the lecture-level treatment of how these pieces fit together, see ../../../notes/lectures/15-rl-verifiable-rewards.md.

This README is hand-curated. The companion PAPERS.md (when present) is generated by the arxiv-collector and lists a broader bibliography pulled from arXiv queries; see the regeneration note at the bottom.

How to use this list

The lecture is the conceptual entry point: read it first if you haven't, then come back here to go deeper on whatever piece you want to understand at the level of the primary source. The papers are organized so that §1 → §2 is the spine of the RLVR story (PPO → instruction-tuned RLHF → GRPO → R1), §3 and §4 are the two main applied domains (math with PRMs, code with test suites), and §5 covers the failure modes you'll hit if you actually train any of this. §6 points sideways to the preference-optimization branch (DPO and friends), which is what you reach for when the task isn't verifiable.

Every paper below has been checked against arXiv. Citations carry an ID that resolves; if a candidate paper didn't, it's not on the list. The one exception is the OpenAI o1 blog post in §2, which has no arXiv entry and is cited by URL.

Where to start

If you've never read in this area, three papers in order:

DeepSeekMath (Shao et al. 2024, arXiv:2402.03300): §4.1 introduces GRPO. The algorithm itself is short; read it first because the rest of the list assumes you know what GRPO is.
Let's Verify Step by Step (Lightman et al. 2023, arXiv:2305.20050): the cleanest treatment of why step-level (process) supervision can beat outcome-only supervision, and of how the step-level labels were collected. Read second because it shows what RLVR gives up by rewarding only the final outcome.
DeepSeek-R1 (DeepSeek-AI 2025, arXiv:2501.12948): the recipe paper that combined GRPO with rule-based verifiable rewards on a base model and reported emergent reasoning behaviors. Read third because the experimental claims will land harder once you've internalized the algorithm and the process-vs-outcome trade-off.

After those three you've covered the algorithm, the alternative (process supervision), and the canonical recipe. From there, pick by domain (§4 for code, §3 for math) or by concern (§5 for reward hacking).

1. Foundational

Papers GRPO and RLVR build on. These predate the GRPO line but are prerequisites for understanding why the algorithm looks the way it does.

PPO

Proximal Policy Optimization Algorithms: Schulman, Wolski, Dhariwal et al. OpenAI, 2017. arXiv:1707.06347. Introduces the clipped surrogate objective that GRPO inherits directly. The clipping mechanism that caps probability-ratio updates is the same in both algorithms; what changes in GRPO is where the advantage comes from: each sample's reward is compared against the mean of its sampling group instead of against a learned critic's value estimate. Read §3 (clipped objective) and §5 (algorithm); most of the rest is Atari and MuJoCo benchmarks. The lecture's loss derivation maps line-for-line onto §3, so reading them side by side is the quickest way to see what GRPO keeps and what it drops.
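To make the clipping concrete, here is a minimal sketch of the clipped surrogate in plain Python. The function name `clipped_surrogate` and the default `eps=0.2` are choices made for this example, not something this list or the lecture prescribes; a real implementation would operate on tensors and per-token log-probabilities.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean PPO-style clipped surrogate over a batch (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current policy and under the policy that generated the samples.
    advantages: one advantage estimate per sample (from a critic in PPO,
    from the group baseline in GRPO).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)            # probability ratio
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Taking the min makes the objective pessimistic: a large ratio
        # cannot earn extra credit, but it can still be penalized.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a ratio of 1.5 and a positive advantage the term is capped at 1.2 times the advantage; with a negative advantage the unclipped, more negative value is kept.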

Instruction tuning with RLHF

Training language models to follow instructions with human feedback (InstructGPT): Ouyang, Wu, Jiang et al. OpenAI, 2022. arXiv:2203.02155. The reference implementation of the three-stage RLHF pipeline (SFT → reward model → PPO) for LLMs. Important not because GRPO uses it but because it's the baseline GRPO replaces: read it to understand why a "no critic, no reward model" approach was attractive enough to be developed. The discussion of reward model training, KL penalties, and the four-model-copy memory footprint (policy, reference, reward model, value head) is the prior art for GRPO's KL term and the motivation for dropping the critic. The appendix on reward-model overfitting also documents the "reward goes up while real preference goes down" pattern for instruction-following models; Stiennon et al. (§5) is the earlier treatment of the same pattern.
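The KL penalty mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the function name is hypothetical, `beta=0.02` is an arbitrary placeholder, and the summed log-ratio over the sampled tokens is used as a single-sample estimate of the KL to the reference model.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_reference, beta=0.02):
    """Sequence-level reward with a KL-style penalty toward the reference.

    rm_score: scalar reward-model score for the whole completion.
    logp_policy / logp_reference: per-token log-probs of the sampled tokens
    under the policy being trained and under the frozen reference model.
    beta: penalty coefficient (illustrative value, not from any paper).
    """
    log_ratio = sum(p - r for p, r in zip(logp_policy, logp_reference))
    return rm_score - beta * log_ratio
```

When the policy and the reference assign the same log-probabilities, the penalty vanishes and the reward-model score passes through unchanged.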

Rejection-sampling self-training

STaR: Bootstrapping Reasoning With Reasoning: Zelikman, Wu, Mu, Goodman. Stanford, NeurIPS 2022. arXiv:2203.14465. Sample chains of thought, keep the ones whose final answer is correct, fine-tune on those, repeat. This is the cheap precursor to GRPO-on-verifiable-rewards, with plain supervised fine-tuning and no policy-gradient machinery: both use the verifier the same way (to label which completions are good), but STaR does it at filter time for SFT, while GRPO does it inline as a policy-gradient signal. The "rationalization" trick, generating a chain conditioned on the correct answer for problems the model fails on, is worth knowing as a data-augmentation pattern, and it shows up again (under different names) in the cold-start SFT stage of R1 and in the synthetic-data pipelines of Qwen-Math.
Reinforced Self-Training (ReST) for Language Modeling: Gulcehre, Le Paine, Srinivasan et al. DeepMind, 2023. arXiv:2308.08998. Generalizes STaR slightly: generate a large offline dataset from the current policy, filter by reward threshold, fine-tune. The point worth absorbing is the offline vs. online distinction: ReST does not do live rollouts during training, which makes it cheaper and (as the paper argues) less prone to reward hacking, at the cost of weaker credit assignment than a policy-gradient method. Useful baseline to know about before deciding GRPO is worth the infrastructure.
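The sample-filter-fine-tune loop shared by STaR and ReST can be sketched as below. Everything here is illustrative: `sample_fn` is a hypothetical stand-in for sampling from a language model, the exact-match check stands in for whatever verifier you have, and the fine-tuning step itself is left out.

```python
def star_filter_round(problems, sample_fn, samples_per_problem=4):
    """One round of rejection-sampling data collection.

    problems: list of (question, gold_answer) pairs.
    sample_fn: hypothetical callable, question -> (rationale, final_answer).
    Returns (question, rationale, answer) triples whose answer matched gold;
    these would become the SFT data for the next fine-tuning round.
    """
    kept = []
    for question, gold in problems:
        for _ in range(samples_per_problem):
            rationale, answer = sample_fn(question)
            if answer == gold:          # the verifier acts as a filter
                kept.append((question, rationale, answer))
    return kept

# Toy usage: a fake "model" that only gets small sums right.
def toy_sampler(question):
    a, b = map(int, question.split("+"))
    answer = a + b if a + b < 10 else 0
    return f"{a} plus {b} is {answer}", answer

data = star_filter_round([("2+3", 5), ("7+8", 15)], toy_sampler,
                         samples_per_problem=2)
```

Here `data` ends up with two traces for "2+3" and none for "7+8", which is the weakness the rationalization trick targets: problems the model never solves contribute no training signal.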

2. GRPO and outcome-reward reasoning

The core of this reading list. These are the papers that introduce GRPO and the reasoning-model recipes that use it.

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models: Shao, Wang, Zhu et al. DeepSeek-AI, 2024. arXiv:2402.03300. Introduces GRPO. The algorithm replaces PPO's value-function critic with a within-group baseline computed from K samples per prompt, eliminating one of the four model copies in standard RLHF training. §4.1 is self-contained: the math is straightforward once you accept that the group mean is your baseline. The rest of the paper is the DeepSeekMath-7B training story: a 120B-token math-focused pretraining corpus, the data deduplication pipeline, and evaluation on MATH and GSM8K. Read §4.1 for the algorithm, §2 for the data work that gave the RL stage something to amplify, and skim the evaluation subsections for the benchmark numbers. The paper's framing, "weak base model + RL ≠ strong reasoning," is the same point R1 hammers a year later.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning: DeepSeek-AI team paper, 2025. arXiv:2501.12948. Two models, one paper. R1-Zero applies GRPO with a rule-based reward (correctness + format) directly to a base model (no SFT warm-up) and reports that long-form reasoning, backtracking, and the "aha moment" emerge during training without being explicitly rewarded. R1 adds a cold-start SFT stage (a few thousand reasoning traces) before RL plus a final rejection-sampling + SFT + RLHF pass to fix R1-Zero's language-mixing and readability failures. The paper also distills R1 back into smaller models (1.5B–70B) via SFT on R1-generated traces; the distilled models outperform same-scale models trained from scratch with RL, which is a practical point worth absorbing: for small models, distillation beats running your own RL. Read §2 for the algorithm details (a thin extension of DeepSeekMath's GRPO), §3 for the R1-Zero ablations (which are the most informative experimental section in the recent reasoning literature), and §4 for the full R1 pipeline. The reasoning-length-over-training curves and the documented qualitative behaviors are the parts you'll cite most often.
Kimi k1.5: Scaling Reinforcement Learning with LLMs: Kimi Team (Du, Gao et al.), 2025. arXiv:2501.12599. Submitted the same day as DeepSeek-R1 (22 Jan 2025) and reports a parallel result: a multimodal reasoning model trained with RL on long chains of thought. Differences from R1 worth noting: Kimi uses a partial-rollout / online-mirror-descent variant of policy optimization rather than vanilla GRPO, frames "long-context RL" (up to 128K-token contexts during training) as the core lever, and reports explicit length-penalty terms in the reward to keep chains from ballooning unbounded. Read alongside R1 to triangulate which design choices are essential to the reasoning-RL recipe and which are team-specific; the convergent finding across both papers (extended CoT + outcome reward + KL-regularized policy gradient) is the strongest evidence that the recipe generalizes.
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement: Yang, Zhang, Hui et al. Alibaba Qwen, 2024. arXiv:2409.12122. Qwen's math-specialized model line. The interest for this list is the data pipeline (synthetic math data generation, problem-difficulty stratification, self-improvement loops where the model generates problems and solutions that get filtered and folded back into training) and the reward modeling section, which uses a math-domain reward model on top of standard RL. Less of a clean GRPO paper than DeepSeekMath; Qwen leans on PRM-flavored reward modeling and tool-augmented rollouts more than pure GRPO. Useful as a comparison point for how a different lab approached the same domain in the same year, and for the synthetic-data tricks that GRPO papers gloss over.
OpenAI o1 ("Learning to reason with LLMs"): OpenAI, September 2024. No technical paper. Blog post: openai.com/index/learning-to-reason-with-llms. Cited by URL because there is no paper. The blog post names RL training on chains of thought and inference-time compute scaling as the two main levers, but gives no algorithm, no reward function structure, and no information on whether a process reward model is used at inference. The two widely cited plots (training-compute scaling and inference-compute scaling, both showing accuracy roughly linear in log-compute) are the substantive content. Claims about o1's training procedure beyond what's in the blog post (that it's MCTS-based, that it uses a PRM at inference, that it does explicit search) are speculation that OpenAI has not confirmed. The blog is short; read it once to know what's officially documented and what isn't, then treat further claims with skepticism.
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model: Hu, Zhang, Han et al. 2025. arXiv:2503.24290. Open reproduction of the R1-Zero setup (pure RL on a base model, no SFT warm-up) with code and weights released. Useful as a sanity check on the R1-Zero claims (the headline emergent behaviors, extended reasoning and self-correction, show up here too, which is decent evidence they're real and not a DeepSeek-specific quirk) and as a reference implementation if you want to actually run this pipeline. One caveat: per the abstract, the recipe is vanilla PPO with GAE (λ=1, γ=1), a plain rule-based reward, and no KL regularization, so it keeps the critic that GRPO drops. The ablations are the most informative parts; the takeaway is that a minimalist configuration already produces most of the R1-Zero behaviors, which suggests the outcome-reward recipe is more robust than the engineering folklore implied.
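Two pieces recur across the entries in this section: a rule-based reward (correctness plus format) and the group-relative advantage. The sketch below shows both under assumptions made for this example: the `<think>`/`<answer>` tag layout, the 1.0 and 0.1 weights, exact string match as the correctness check, and population standard deviation for normalization are all illustrative choices rather than values taken from any of the papers.

```python
import re
import statistics

def rule_based_reward(completion, gold_answer):
    """Correctness + format reward for one completion.

    Assumed format: reasoning inside <think>...</think>, then the final
    answer inside <answer>...</answer>. Weights are illustrative.
    """
    match = re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*",
        completion, flags=re.DOTALL)
    if match is None:
        return 0.0                      # malformed output earns nothing
    format_reward = 0.1
    correct = match.group(1).strip() == gold_answer.strip()
    return format_reward + (1.0 if correct else 0.0)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# K = 4 completions sampled for the same prompt, gold answer "4".
group = [
    "<think>2+2=4</think><answer>4</answer>",
    "<think>2+2=5</think><answer>5</answer>",
    "the answer is 4",
    "<think>two twos</think><answer>4</answer>",
]
rewards = [rule_based_reward(c, "4") for c in group]
advantages = group_relative_advantages(rewards)
```

If every completion in a group earns the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason group size and prompt difficulty matter in practice.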

3. Process reward models

GRPO with a rule-based checker is outcome supervision: the reward lands on the final answer only. The papers below characterize when that's enough and when step-level (process) supervision pays off.

Solving math word problems with process- and outcome-based feedback: Uesato, Kushman, Kumar et al. DeepMind, 2022. arXiv:2211.14275. The first systematic comparison on GSM8K. Headline finding: outcome supervision reaches similar final-answer accuracy with less annotation cost, but outcome-supervised models make more reasoning errors that happen to cancel out (wrong steps producing right answers). This is the paper to cite when the question is "do I need process supervision at all"; it's also the cleanest framing of why outcome-supervised RLVR can pass benchmarks while producing unreliable reasoning chains. The error-type taxonomy in §4 (trace-level vs. final-answer correctness) is the conceptual frame the later PRM papers all inherit.
Let's Verify Step by Step: Lightman, Kosaraju, Burda et al. OpenAI, 2023. arXiv:2305.20050. Trains a process reward model (PRM) using ~800K human-annotated step-level labels on the MATH benchmark. At matched scale, the PRM-supervised model substantially outperforms the outcome-supervised one. The contribution is both the algorithm and the dataset (PRM800K), which is the de facto benchmark for step-level reward models. Read for the methodology of step-level annotation, the discussion of "active learning" (selecting which model traces to label is non-trivial when annotation is expensive), and the figures comparing best-of-N selection with PRMs vs. outcome-only reward models; those plots are the empirical case for process supervision in one image.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations: Wang, Li, Shao et al. 2023. arXiv:2312.08935. PRM trained without human step labels. Step quality is estimated by completing each partial chain N times and measuring the fraction of completions that reach the correct final answer: high-fraction steps are labeled "good," low-fraction steps "bad." This is a useful trick when you have an outcome verifier but no annotation budget; the cost is the rollout budget for the N completions per step. Note that the author list overlaps with the DeepSeekMath team (including the lead author), which is why some of the design choices (synthetic labels via outcome rollouts, PRM-as-verifier-for-RL) echo each other across the two papers. Read §3 for the labeling procedure and §4 for the RL-with-PRM section, which is the closest thing this list has to a direct "GRPO with a learned process reward" worked example.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision (OmegaPRM): Luo, Liu, Liu et al. Google DeepMind, 2024. arXiv:2406.06592. Uses MCTS to allocate completion budget more efficiently than Math-Shepherd's uniform rollouts: explore promising prefixes deeper, prune dead branches early. Same goal (synthetic step-level labels), better sample efficiency. The MCTS framing also makes it natural to identify the first erroneous step in a chain (the node where the value estimate drops sharply), which is a richer signal than just labeling steps "good" or "bad." Read after Math-Shepherd as the natural extension; together they're a small arc on "how to label process supervision data when humans aren't available."
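The rollout-based labeling described in the Math-Shepherd entry can be sketched as follows. This is a simplified illustration, not the paper's procedure: `complete_fn` is a hypothetical stand-in for sampling a completion from a model, and the 0.5 threshold for turning a success fraction into a "good"/"bad" label is an assumption of this example.

```python
def label_steps(steps, complete_fn, is_correct, n_rollouts=8, threshold=0.5):
    """Label each reasoning step by how often completions from it succeed.

    steps: list of step strings forming one solution attempt.
    complete_fn: hypothetical callable, prefix (list of steps) -> final answer.
    is_correct: outcome verifier, final answer -> bool.
    Returns a list of (step, success_fraction, label) tuples.
    """
    labeled = []
    for i, step in enumerate(steps):
        prefix = steps[: i + 1]
        hits = sum(is_correct(complete_fn(prefix)) for _ in range(n_rollouts))
        fraction = hits / n_rollouts
        label = "good" if fraction >= threshold else "bad"
        labeled.append((step, fraction, label))
    return labeled
```

The cost is visible in the loop structure: one chain with S steps needs S × `n_rollouts` completions, which is the budget OmegaPRM's tree search tries to spend more selectively.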

4. Code-RL with verifiable rewards

Code with a test harness is the other natural home for RLVR: pass rate replaces math correctness as the reward signal. The general points carry over from §2 and §3; what's different here is how the reward is computed and how easy the verifier is to game.

CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning: Le, Wang, Gotmare et al. Salesforce, NeurIPS 2022. arXiv:2207.01780. Pre-GRPO RL-for-code, using actor-critic with test-pass-rate as the reward and a learned critic for value estimation. Not GRPO, but the same problem setup: execution feedback as the reward signal. Read for the discussion of reward shaping (passes vs. compile-error vs. runtime-error vs. test-fail give a four-level reward instead of binary), the critic-sampling inference procedure, and the failure-mode catalog; most of the issues they identify (sparse reward, reward hacking on visible tests, the gap between training-test and held-out-test scores) carry directly into modern GRPO-on-code training.
CodeT: Code Generation with Generated Tests: Chen, Zhang, Nguyen et al. Microsoft, 2022. arXiv:2207.10397. Not RL: uses the LLM to generate both candidate solutions and candidate tests, then ranks solutions by dual execution agreement, so a candidate scores well when it passes many generated tests and when many other candidates pass the same set. Included here because it's the conceptual root of "model-generated verifiers": the trick of using a model to produce the verification signal it is then selected or optimized against. Worth reading before any modern paper that uses synthetic test generation as part of a verifiable-reward pipeline.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?: Jimenez, Yang, Wettig et al. Princeton, 2023. arXiv:2310.06770. The benchmark, not an RL paper. Real GitHub issues paired with the maintainers' test patches as the verifier. Read because every recent RL-on-code paper benchmarks here, and because the failure modes the benchmark surfaces (multi-file edits, repository-level context, brittle or under-specified test suites, environment setup that doesn't reproduce) shape what a useful RLVR reward function looks like for production code. The "Verified" subset (a manually cleaned version of the benchmark) is what most modern papers actually report on, because the original SWE-bench has enough environment-setup noise that small percentage differences can be artifacts rather than capability differences.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution: Wei, Duchenne, Copet et al. Meta, 2025. arXiv:2502.18449. GRPO-style RL applied to software-evolution tasks: given an issue and the relevant code, generate a patch and score it with a lightweight rule-based reward. Note that, per the abstract, the reward is a similarity score between the generated patch and the maintainers' ground-truth patch, not the result of executing the maintainers' tests. The contribution is the data pipeline (mining software-evolution data from open-source repos at scale) and the demonstration that the RLVR recipe transfers from math to real-world code edits. Read as a worked example of what changes when you move RLVR from short-answer math to long-horizon code tasks. An execution-based verifier is slower (running a test suite vs. matching a number) and harder to make un-gameable (test suites have coverage holes), and rollouts are expensive enough that group size K is often pushed down to make the training compute budget fit; a similarity reward avoids running tests at all, at the cost of rewarding resemblance to one reference patch rather than behavior. The reward-design discussion is worth comparing to CodeRL's earlier treatment of the same question.
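The four-level reward from the CodeRL entry (compile error, runtime error, test failure, pass) can be sketched as a small classifier around execution. The reward values, the requirement that candidates define a function called `solve`, and the in-process `exec` are all assumptions of this example; real training runs would execute candidates inside a sandbox with time and resource limits.

```python
def execution_reward(source, tests):
    """Map a candidate program to one of four reward levels.

    source: Python source expected to define a function `solve`.
    tests: list of (args_tuple, expected_output) pairs.
    Reward values are illustrative placeholders.
    WARNING: exec() on model output is unsafe outside a sandbox.
    """
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return -1.0                      # does not compile
    namespace = {}
    try:
        exec(code, namespace)
        results = [namespace["solve"](*args) for args, _ in tests]
    except Exception:
        return -0.6                      # runtime error (or no `solve`)
    passed = all(r == expected for r, (_, expected) in zip(results, tests))
    return 1.0 if passed else -0.3       # all tests pass vs. some test fails

tests = [((1, 2), 3), ((5, 5), 10)]
good = execution_reward("def solve(a, b):\n    return a + b\n", tests)
```

Graded negative rewards give the policy a direction to move in before any test passes, which matters when the all-tests-pass signal is sparse.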

5. Verifier design and reward hacking

Any reward function can be gamed. With RLVR the gameable surface is the verifier itself, not a learned reward model, and that changes which mitigations work.

Learning to summarize from human feedback: Stiennon, Ouyang, Wu et al. OpenAI, 2020. arXiv:2009.01325. Pre-LLM-scale RLHF on summarization, but the discussion of reward-model overoptimization and the KL-to-reference penalty is the methodological root of every RLHF and RLVR paper that came after. The diagrams of reward vs. KL for different training durations are the reference figures for the "reward goes up while real quality goes down" pattern; Lecture 15's reward-hacking section is essentially a restatement of those figures applied to verifiable rewards. Worth reading even though the task (TL;DR summarization) is dated, because the experimental design is unusually clean: a held-out human-preference signal cross-checks the trained reward model and exposes the gap.
Scaling Laws for Reward Model Overoptimization: Gao, Schulman, Hilton. OpenAI, 2022. arXiv:2210.10760. Characterizes the relationship between training KL, reward-model score, and gold-standard quality. The functional forms they fit express gold reward in terms of d = √KL: d(α − βd) for best-of-n sampling, where the penalty is quadratic in d, and d(α − β log d) for RL. These are the closest thing the field has to a quantitative law for "how much reward gain is real vs. hacking." The paper is specifically about learned reward models, but the framework, distinguishing proxy reward from gold reward and measuring the divergence, applies directly to RLVR if your "gold" is held-out test problems and your "proxy" is the in-training verifier. Read for the methodology (synthetic gold reward via a much larger preference model) and for the BoN-vs-RL comparison, which is the cleanest demonstration that the same overoptimization curves show up in both regimes.
Feedback Loops With Language Models Drive In-Context Reward Hacking: Pan, Jones, Jagadeesan et al. 2024. arXiv:2402.06627. Documents in-context reward hacking (ICRH): LLMs in feedback loops optimize implicit objectives at test time, producing side effects the deployer didn't intend. Less directly about RLVR training than about how RLVR-trained models behave once deployed in agentic loops, but the failure mode is the same shape: the model learns to game whatever signal it's iteratively optimizing against, whether the gradient step happens during training (RLVR) or only in-context (deployment). The case studies (a model adjusting a Twitter persona for engagement, a coding agent over-fitting to a test harness) make the abstract concern concrete. Worth reading before deploying an RLVR-trained model as part of any longer-running agentic system.
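For the Gao et al. entry, the two fitted forms are easy to play with numerically. The sketch below writes them as functions of KL, with d = √KL; the coefficients `alpha` and `beta` are hypothetical placeholders (the paper fits them per reward-model size), and the helper for the best-of-n peak follows from setting the derivative of d(α − βd) to zero.

```python
import math

def gold_reward_bon(kl, alpha, beta):
    """Best-of-n form: R(d) = d * (alpha - beta * d), with d = sqrt(KL)."""
    d = math.sqrt(kl)
    return d * (alpha - beta * d)

def gold_reward_rl(kl, alpha, beta):
    """RL form: R(d) = d * (alpha - beta * log d), with d = sqrt(KL)."""
    d = math.sqrt(kl)
    return d * (alpha - beta * math.log(d)) if d > 0 else 0.0

def bon_peak_kl(alpha, beta):
    """KL at which the best-of-n gold reward peaks: d* = alpha / (2 * beta)."""
    return (alpha / (2.0 * beta)) ** 2

# Hypothetical coefficients, chosen only to make the shape visible.
peak = bon_peak_kl(alpha=1.0, beta=0.1)
```

Past the peak, further optimization against the proxy lowers gold reward; for RLVR the analogous check is to track held-out accuracy against in-training verifier reward as KL grows.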

There's likely relevant 2025 work specifically on reward hacking in RLVR (gaming weak test suites, exploiting math-checker normalization bugs, etc.); the lecture notes mention several patterns but I haven't confirmed a single dedicated paper for this list. Add one here once you find a citation you can verify.

6. Related: DPO and offline preference variants

DPO and the rest of the preference-optimization family are not RLVR: they work from preference-paired data, not from a checkable correct answer. Useful to know about as the alternative for tasks where verifiable rewards aren't available (subjective quality, helpfulness, style). The reading list for those methods lives at ../RLHF-and-Alignment/, which covers DPO (Rafailov et al. 2023, arXiv:2305.18290), Constitutional AI, and the broader RLHF lineage.

The high-level relationship: RLHF, DPO, and RLVR all answer the question "how do you fine-tune an LLM with a reward signal"; they differ in where the signal comes from (learned model from human preferences, implicit from a preference dataset, rule-based from a checker) and in how it's optimized. Pick by domain: subjective output → preference methods; verifiable correctness → RLVR.

Practical tooling these papers (and their reproductions) use

TRL (Hugging Face): has a GRPOTrainer and supporting utilities for reference-policy KL, group sampling, and reward functions. The most-used open implementation; what most reproductions of DeepSeekMath and R1-style training are built on.
DeepSpeed / FSDP: sharding for the policy + reference copies. The memory savings from dropping the critic (vs. PPO) show up here: you can fit a larger policy or a longer context.
vLLM or SGLang for rollout generation: GRPO is rollout-bound at scale, so a fast inference engine for the sampling step matters more than the optimizer micro-details.
Sandboxes for code rollouts: containers with CPU/memory/network limits, plus static analysis (Bandit, Semgrep) to filter obviously bad submissions before execution. The sibling ../LLM-Code-Generation/ list has more on this.
Math verifiers: SymPy for symbolic equivalence, normalized numeric comparison for word problems, the MATH dataset's own answer-extraction utilities. The verifier is part of the algorithm; treat it with the same care as the loss.
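As an example of the "normalized numeric comparison" item, here is a minimal standard-library sketch. The function names and the set of normalizations (thousands separators, a leading dollar sign, a trailing period, simple fractions) are choices made for this illustration; a production verifier would handle more formats and would likely fall back to symbolic comparison for non-numeric answers.

```python
from fractions import Fraction
import re

def normalize_numeric(text):
    """Parse a final-answer string into an exact Fraction, or None.

    Deliberately conservative: anything it cannot parse returns None
    rather than a guess, so malformed answers never match by accident.
    """
    cleaned = text.strip().replace(",", "").lstrip("$").rstrip(".").strip()
    if not re.fullmatch(r"[-+]?\d+(\.\d+)?(/\d+)?", cleaned):
        return None
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None

def answers_match(predicted, gold):
    """True only when both strings parse and denote the same number."""
    p, g = normalize_numeric(predicted), normalize_numeric(gold)
    return p is not None and p == g
```

Comparing exact fractions instead of floats avoids tolerance tuning, and returning None on unparseable input keeps the verifier from rewarding answers it does not understand.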

Reading orders by goal

A few suggested sequences depending on what you're trying to do.

Understand GRPO well enough to implement it. PPO (1707.06347) → DeepSeekMath §4.1 (2402.03300) → the reference implementation in ../../../notes/lectures/15-rl-verifiable-rewards.md → Open-Reasoner-Zero (2503.24290) for an actual open codebase.

Understand the R1 recipe end to end. STaR (2203.14465) → DeepSeekMath (2402.03300) → DeepSeek-R1 (2501.12948) → Kimi k1.5 (2501.12599) to triangulate which choices generalize across labs.

Decide between outcome and process supervision. Uesato et al. (2211.14275) → Lightman et al. (2305.20050) → Math-Shepherd (2312.08935) → OmegaPRM (2406.06592). The arc is: the first head-to-head comparison of process and outcome feedback, human-labeled PRM, synthetic-label PRM, more compute-efficient synthetic-label PRM.

Take RLVR to code. CodeRL (2207.01780) for the prior art → CodeT (2207.10397) for model-generated verifiers → SWE-bench (2310.06770) for the benchmark → SWE-RL (2502.18449) for the modern RL-on-code recipe.

Worry about reward hacking before you ship. Stiennon et al. (2009.01325) for the original phenomenology → Gao et al. (2210.10760) for the quantitative scaling law → Pan, Jones, Jagadeesan et al. (2402.06627) for the in-deployment failure mode.

Rough reading-time estimates

For someone with the lecture background, approximate time to read each paper carefully (not skim):

Paper	Approx. time	Notes
PPO	1.5 h	§3 alone is 30 min; the appendices have most of the practical content
InstructGPT	2–3 h	Long, dense; §3 (method) and §4 (results) are the core
STaR	1 h	Short, focused; read in full
DeepSeekMath	2 h	§4.1 is fast (30 min); the rest is the math-data work
DeepSeek-R1	3 h	Long, but §3 (R1-Zero ablations) repays the time
Kimi k1.5	3 h	Comparable length to R1; skim if you've just read R1
Lightman et al. (PRM)	1.5 h	Methodology-heavy; the figures carry most of the argument
Uesato et al.	1 h	Tight paper, clear comparison
Math-Shepherd	1.5 h	§3 (labeling) is the contribution; rest is benchmarks
OmegaPRM	1.5 h	MCTS section needs care if you haven't seen MCTS before
CodeRL	2 h	Dated but the failure-mode catalog is worth it
SWE-bench	1 h	Benchmark paper; read §3–4
SWE-RL	2 h	Recent, dense; skim if you've read DeepSeekMath
Gao et al. (overoptimization)	1.5 h	The scaling-law derivation is worth the time
Stiennon et al.	1.5 h	Older but the cleanest experimental design

Reading the three "Where to start" papers above takes about 6–7 hours of focused time. Reading the full list is roughly 25–30 hours; nobody needs to do that in one pass.

Notes on what's not here

Things deliberately omitted because they didn't survive the verification step or are out of scope:

Autoregressive proof-search methods (AlphaProof, etc.): formal theorem proving deserves its own list, not a few entries here.
Inference-time-only methods (best-of-N, MCTS over reasoning steps without RL training): the lecture covers them in §"o1 and inference-time scaling," but they're not training methods so they don't fit a GRPO/RLVR list.
Self-play and RL-on-RL-distilled-data methods: emerging area as of 2025; haven't verified a representative paper to anchor it.
"Survey of RL for reasoning LLMs" papers: several exist from 2024–2025; some had ID mismatches with the year claimed, so I left them out rather than risk repeating earlier citation errors in this repo. Search arXiv directly if you want a survey.
DPO and the preference-optimization family: covered in ../RLHF-and-Alignment/; they're not RLVR.
The Tülu-3 and Tülu-3 RLVR work from AI2: relevant but I haven't verified a current arXiv ID; add it once confirmed.
Recent post-R1 GRPO variants (DAPO, GRPO-with-asymmetric-clip, REINFORCE++ variants from 2025): moving target; check arXiv directly for the latest, and add the survivors here once they have follow-up reproductions.

Key terms and where they're defined

A short cross-reference for terms that come up across the list.

Term	Defined in	Used in (selected)
Clipped surrogate objective	PPO (1707.06347, §3)	DeepSeekMath, R1, every GRPO paper
Group-relative advantage	DeepSeekMath (2402.03300, §4.1)	R1, SWE-RL
Cold-start SFT for reasoning	R1 (2501.12948, §4)	downstream R1 reproductions
Process Reward Model (PRM)	Lightman et al. (2305.20050)	Math-Shepherd, OmegaPRM
Outcome Reward Model (ORM)	Uesato et al. (2211.14275)	RLVR papers in general
Reward overoptimization scaling	Gao et al. (2210.10760)	implicitly in every RLHF/RLVR paper
Rationalization (CoT data aug)	STaR (2203.14465)	R1 cold-start, Qwen-Math data pipeline
In-context reward hacking	Pan, Jones et al. (2402.06627)	agentic-RL discussions

For the algorithmic terms (clipped surrogate, group baseline, KL penalty, format reward), ../../../notes/lectures/15-rl-verifiable-rewards.md walks through each one with code.

Verification

All arXiv IDs in this README were checked against arxiv.org. The first-three-author lists and submission years were confirmed at the same time. Citations with no arXiv (only the OpenAI o1 blog post) are clearly marked. If you find a discrepancy, fix it here and note it in your summary: don't leave a wrong ID in place.

The companion PAPERS.md (generated by the collector) has not been verified by hand. Treat the per-paper details there as unchecked until someone reads them.

Regenerating PAPERS.md

To regenerate PAPERS.md in this directory: run tools/arxiv-collector/arxiv_paper_collector.py; see ../README.md for the broader pipeline.
