Training and evaluation code for Process Reward Models (PRMs) that score intermediate reasoning steps in mathematical problem-solving. We train discriminative PRMs on Qwen2.5-Math (1.5B, 7B) using trajectory-level supervision (min, softmin, product aggregation) and active curriculum learning, then evaluate via Best-of-N selection on AIME 2024/2025, MATH-500, and AIMO benchmarks.
- Requirements
- Installation
- Project Structure
- Quick Start
- Training
- Evaluation
- Models and Datasets
- Experiment Tracking
- Known Issues
- Citation
- License
Hardware:
- NVIDIA GPU(s) with CUDA 12.8 support
- Multi-GPU recommended for 7B models (DeepSpeed ZeRO)
Software:
- Python >= 3.11
- CUDA Toolkit 12.8
- Ninja build system
- Git LFS (for downloading models/datasets)
Warning: Training and evaluation require incompatible package versions (different torch, transformers, vllm). You must use separate virtual environments. See Installation.
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or: pip install uv
git clone <repo_url>
cd active_prm
Option A: System-wide. Follow the NVIDIA CUDA Toolkit installation guide for your OS.
Option B: HPC with lmod
module load cuda/12.8.0
module load ninja
Warning: Training and evaluation use different versions of torch, transformers, and vllm. You must use separate environments.
| Package | Training env | Evaluation env |
|---|---|---|
| torch | 2.7.0 | 2.8.0 |
| transformers | 4.53.2 | 4.55.2 |
| openai | 1.99.0 | 1.99.1 |
| deepspeed | 0.16.7 | — |
| accelerate | 1.6.0 | — |
| trl | 0.17.0 | — |
| flash-attn | >=2.8.3 | — |
| liger-kernel | 0.5.8 | — |
| vllm | — | 0.11.0 |
| math-verify | — | 0.7.0 |
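Because the two environments pin different versions, it can help to confirm which one is active before launching a job. The following is a minimal sketch using only the standard library; the `PINS` mapping is copied from the table above, while `installed_version` and `find_mismatches` are hypothetical helper names that are not part of this repository. flash-attn is left out because the table gives a lower bound for it rather than an exact pin.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the table above (flash-attn omitted: it is a ">=" bound).
PINS = {
    "train": {"torch": "2.7.0", "transformers": "4.53.2", "openai": "1.99.0",
              "deepspeed": "0.16.7", "accelerate": "1.6.0", "trl": "0.17.0",
              "liger-kernel": "0.5.8"},
    "eval": {"torch": "2.8.0", "transformers": "4.55.2", "openai": "1.99.1",
             "vllm": "0.11.0", "math-verify": "0.7.0"},
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(env_name, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in PINS[env_name].items():
        found = lookup(package)
        # A local build tag such as "2.7.0+cu128" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `find_mismatches("train")` or `find_mismatches("eval")` inside the activated environment returns an empty dict when every pinned package matches.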
Training environment:
uv sync --extra train
source .venv/bin/activate
Evaluation environment (use a separate directory or venv):
uv sync --extra eval
source .venv/bin/activate
PRMBench: The PRMBench directory under the eval/ folder is a fork of the PRMBench repository, modified to support discriminative PRM scoring. See PRMBench/README.md for details and setup instructions; it also uses uv for environment management.
Warning: vLLM 0.11.0 defaults to 2-label PRM scoring. You must patch it to use 3 labels.
After installing the eval environment, edit your local vLLM installation:
# Find the file:
python -c "import vllm; print(vllm.__file__)"
# Edit: <vllm_path>/model_executor/models/qwen2_rm.py
# Change num_labels in Qwen2ForProcessRewardModel from 2 to 3
export TOKENIZERS_PARALLELISM=false
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export WANDB_API_KEY=<your_wandb_api_key>
export HF_HOME=<your_hf_cache_dir> # Optional
active_prm/
├── config/ # DeepSpeed and training configs
│ ├── 7b_config.json # Config for 7B model training
│ ├── warmup_config.json # Config for warmup phase
│ ├── ds_config.json # Standard DeepSpeed config
│ └── base.yaml # Hydra base config
├── prm_modelling/ # Model definitions
│ ├── modeling_qwen2_rm.py # Qwen2 PRM model
│ └── configuration_qwen2_rm.py # Model configuration
├── scripts/
│ ├── trainer/ # Training shell scripts
│ │ ├── tiny_train.sh # Small-scale training
│ │ ├── vanilla_train.sh # Full vanilla training
│ │ ├── actprm_run.sh # Active PRM training
│ │ └── multi_node.sh # Multi-node distributed
│ ├── eval/ # Evaluation scripts
│ │ ├── bestofn.sh # Best-of-N full pipeline
│ │ └── search.sh # Search-based evaluation
│ └── rollouts/ # Rollout generation scripts
├── offline_train.py # Main training entry point
├── dataset.py # Dataset and collator classes
├── trainer.py # Custom trainer logic
├── bestofn.py # Best-of-N scoring (inference)
├── evaluator.py # Score aggregation and metrics
├── outputs.py # Parse evaluation outputs to JSON
├── graph.py # Generate comparison plots
├── rollout.py # Rollout generation
└── utils.py # Shared utilities
# 1. Download a model
git lfs install
git lfs clone https://huggingface.co/kingsleykim/Qwen2.5-Math-1.5B-Prm-Three <your_model_dir>/Qwen2.5-Math-1.5B-Prm-Three
# 2. Train (in training env)
deepspeed active_prm/offline_train.py \
--config_path active_prm/config/ds_config.json \
--model_path <your_model_dir>/Qwen2.5-Math-1.5B-Prm-Three \
--data_path <your_data_dir>/prm_training_data \
--output_path <your_output_dir>/output_1.5b \
--vanilla --num_points 1000 --epochs 1
# 3. Evaluate (in eval env) — see EVALUATE.md for full pipeline
python active_prm/bestofn.py \
--parquet-file <your_data_dir>/rollouts.parquet \
--model-path <your_model_dir>/Qwen2.5-Math-1.5B-Prm-Three \
--intermediate-save-path results/scores.pkl
python active_prm/evaluator.py --file results/scores.pkl --score-type min
For detailed instructions see TRAINING.md and EVALUATE.md.
Three training modes are supported:
Vanilla PRM (standard supervised fine-tuning with optional trajectory-level loss):
deepspeed active_prm/offline_train.py \
--config_path active_prm/config/ds_config.json \
--model_path <model> --data_path <data> --output_path <output> \
--vanilla --traj_loss --agg min --epochs 1
Active PRM (curriculum learning with k-weighted step sampling):
deepspeed active_prm/offline_train.py \
--config_path active_prm/config/ds_config.json \
--model_path <model> --data_path <data> --output_path <output> \
--k1 0.3 --k2 0.3 --k3 0.4 --epochs 1
Combined (joint step-level and trajectory-level training):
deepspeed active_prm/offline_train.py \
--config_path active_prm/config/ds_config.json \
--model_path <model> --data_path <data> --output_path <output> \
--combined --traj_loss --agg min --epochs 1
See TRAINING.md for the full CLI reference, loss configurations, and example scripts.
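To give some intuition for what a trajectory-level loss with `--agg min` could look like, here is a minimal sketch in plain Python. It is one plausible reading, not the code in trainer.py: it assumes each step has a single probability of being correct, aggregates those probabilities into one trajectory score, and applies binary cross-entropy against the final-answer outcome. The function name `trajectory_loss` and the choice of binary cross-entropy are assumptions made for illustration.

```python
import math


def trajectory_loss(step_probs, final_correct, aggregate=min, eps=1e-7):
    """Binary cross-entropy between an aggregated step score and the outcome.

    step_probs:    per-step probabilities that the step is correct (assumed form)
    final_correct: True if the trajectory reached the right final answer
    aggregate:     reduces the step probabilities to one trajectory score
    """
    # Clamp so that log() never receives exactly 0 or 1.
    p = min(max(aggregate(step_probs), eps), 1.0 - eps)
    target = 1.0 if final_correct else 0.0
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
```

With min aggregation, a correct trajectory is penalized according to its weakest step, so a single low-scoring step is enough to raise the loss even when no per-step label is available.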
The evaluation pipeline has four stages:
1. bestofn.py: Run PRM inference to score trajectories and save raw probabilities
2. evaluator.py: Aggregate step-level probabilities using different strategies
3. outputs.py: Parse evaluation outputs into structured JSON
4. graph.py: Generate comparison plots across models and benchmarks
Supported aggregation strategies: min, sum, softmin, orm, sum_norm, norm.
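As a rough illustration of how step-level scores can be reduced to one trajectory score, the sketch below implements three of the listed names under assumed definitions. These are not taken from evaluator.py: in particular, softmin is written here as a softmax(-score / temperature)-weighted average, and the `temperature` parameter is hypothetical. The remaining strategies (orm, sum_norm, norm) are not sketched because their definitions are not given here.

```python
import math


def agg_min(scores):
    """Score a trajectory by its weakest step."""
    return min(scores)


def agg_sum(scores):
    """Score a trajectory by the total of its step scores."""
    return sum(scores)


def agg_softmin(scores, temperature=0.1):
    """Smooth stand-in for min: weight each step by softmax(-score / temperature)."""
    lowest = min(scores)
    # Shifting by the minimum keeps every exponential in (0, 1] for stability.
    weights = [math.exp(-(s - lowest) / temperature) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total
```

Under this definition, lowering `temperature` moves softmin toward the plain minimum, while raising it moves the result toward the mean of the step scores.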
See EVALUATE.md for arguments, output formats, and example scripts.
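Best-of-N selection itself is simple once each candidate has step scores: aggregate per candidate, then keep the highest-scoring one. The sketch below assumes a hypothetical in-memory layout of `(answer, step_scores)` pairs, which is not the format written by bestofn.py, and the helper names are illustrative only.

```python
def best_of_n(candidates, aggregate=min):
    """Return the (answer, step_scores) pair with the highest aggregated score."""
    return max(candidates, key=lambda item: aggregate(item[1]))


def best_of_n_accuracy(problems, aggregate=min):
    """Fraction of problems where the selected answer equals the reference.

    problems is a list of (reference_answer, candidates) pairs.
    """
    hits = 0
    for reference, candidates in problems:
        answer, _ = best_of_n(candidates, aggregate)
        hits += answer == reference
    return hits / len(problems)
```

The choice of `aggregate` can change which candidate wins: min favors trajectories with no weak step, while sum tends to favor longer trajectories with many moderately scored steps.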
| Model | HuggingFace Path |
|---|---|
| Qwen2.5-Math-7B-Prm-Three | https://huggingface.co/kingsleykim/Qwen2.5-Math-7B-Prm-Three/ |
| Qwen2.5-Math-1.5B-Prm-Three | https://huggingface.co/kingsleykim/Qwen2.5-Math-PRM-1.5B-Three |
| Qwen2.5-Math-7B-Instruct (base) | Qwen/Qwen2.5-Math-7B-Instruct |
| Qwen2.5-Math-PRM-1.5B-Single (one-head PRM, 1.5B) | https://huggingface.co/kingsleykim/Qwen2.5-Math-PRM-1.5B-Single |
| Qwen2.5-Math-PRM-7B-Single (one-head PRM, 7B) | https://huggingface.co/kingsleykim/Qwen2.5-Math-7B-PRM-Single/ |
We also trained and ablated models with two, four, and five reward model heads. If you would like those model weights, please email me at kingsleykimm@gmail.com.
git lfs install
# Via git lfs clone
git lfs clone https://huggingface.co/kingsleykim/Qwen2.5-Math-7B-Prm-Three
# Or via huggingface-cli
huggingface-cli download kingsleykim/Qwen2.5-Math-7B-Prm-Three --local-dir ./models/Qwen2.5-Math-7B-Prm-Three
Training datasets are HuggingFace datasets with three columns:
| Column | Type | Description |
|---|---|---|
| inputs | string | Problem + solution with <extra_0> step markers |
| hard_labels | list[int] | Per-step label: 0=correct, 1=neutral, 2=incorrect |
| correct | bool | Whether the final answer is correct |
Note: hard_labels is only used for the Vanilla PRM, which uses the per-step labels in the cross-entropy loss. For the PRMs we describe in the paper, the loss depends only on the correct column.
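Before training on a new dataset, a quick schema check can catch malformed records early. The sketch below validates one record against the three columns described above. The helper name `validate_record` is hypothetical, and the rule that hard_labels has exactly one entry per `<extra_0>` marker is an assumption, not something this README states.

```python
STEP_MARKER = "<extra_0>"
VALID_LABELS = {0, 1, 2}  # 0=correct, 1=neutral, 2=incorrect


def validate_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    inputs = record.get("inputs")
    labels = record.get("hard_labels")
    if not isinstance(inputs, str):
        problems.append("inputs must be a string")
    if not isinstance(record.get("correct"), bool):
        problems.append("correct must be a bool")
    if not isinstance(labels, list) or any(label not in VALID_LABELS for label in labels):
        problems.append("hard_labels must be a list of 0/1/2")
    elif isinstance(inputs, str) and inputs.count(STEP_MARKER) != len(labels):
        # Assumption: one label per step marker.
        problems.append("hard_labels length differs from the number of step markers")
    return problems
```

Running this over a sample of rows and printing any non-empty result is usually enough to spot a column that was serialized with the wrong type.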
Training logs to Weights & Biases. Set WANDB_API_KEY before training.
- vLLM 3-label patch: vLLM 0.11.0 must be manually patched for 3-label PRM scoring (see Installation)
- Incompatible environments: Training and evaluation require different torch/transformers versions; use separate virtual environments
- flash-attn build: Requires CUDA headers at build time. Ensure CUDA_HOME is set and the CUDA toolkit is accessible. The uv and flash-attn projects both document installation best practices in detail. What I found helpful is to always download the wheel that matches your PyTorch and CUDA versions, or as close as you can get.
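For the vLLM 3-label patch in the first item above, it can be handy to compute the file location and list the lines worth inspecting before editing by hand. This is a small sketch with hypothetical helper names. It only derives the path from the value of `vllm.__file__` and searches text for `num_labels`; it does not modify anything, and the exact contents of qwen2_rm.py are not assumed.

```python
from pathlib import Path


def qwen2_rm_path(vllm_init_file):
    """Map the value of vllm.__file__ to the qwen2_rm.py file named in the patch steps."""
    return Path(vllm_init_file).parent / "model_executor" / "models" / "qwen2_rm.py"


def lines_mentioning_num_labels(source_text):
    """Return (line_number, stripped_line) pairs that mention num_labels."""
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), start=1)
        if "num_labels" in line
    ]
```

In the eval environment, passing `vllm.__file__` to `qwen2_rm_path` and the file's text to `lines_mentioning_num_labels` shows where the value to change from 2 to 3 appears.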
@inproceedings{kim2026learning,
title={Learning Discriminative Process Reward Models without Step Labels},
author={Kim, Kingsley and Liu, Haolin and Wei, Chen-Yu},
booktitle={ICLR 2026 Workshop on Scaling Post-training for LLMs (SPOT @ ICLR)},
year={2026},
url={https://openreview.net/forum?id=df3p10k2kq}
}
License: TBD