kingsleykimm/discriminative_prms

[SPOT @ ICLR 2026] Learning Discriminative Process Reward Models without Step Labels

Python 3.11+ CUDA 12.8 uv

Training and evaluation code for Process Reward Models (PRMs) that score intermediate reasoning steps in mathematical problem-solving. We train discriminative PRMs on Qwen2.5-Math (1.5B, 7B) using trajectory-level supervision (min, softmin, product aggregation) and active curriculum learning, then evaluate via Best-of-N selection on AIME 2024/2025, MATH-500, and AIMO benchmarks.

Requirements

Hardware:

  • NVIDIA GPU(s) with CUDA 12.8 support
  • Multi-GPU recommended for 7B models (DeepSpeed ZeRO)

Software:

  • Python >= 3.11
  • CUDA Toolkit 12.8
  • Ninja build system
  • Git LFS (for downloading models/datasets)

Warning: Training and evaluation require incompatible package versions (different torch, transformers, and vllm), so you must use separate virtual environments. See Installation.

Installation

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh
# Or: pip install uv

2. Clone the repository

git clone <repo_url>
cd active_prm

3. CUDA Toolkit

Option A: System-wide — Follow the NVIDIA CUDA Toolkit installation guide for your OS.

Option B: HPC with lmod

module load cuda/12.8.0
module load ninja

4. Create virtual environments

Warning: Training and evaluation use different versions of torch, transformers, and vllm. You must use separate environments.

Package Training env Evaluation env
torch 2.7.0 2.8.0
transformers 4.53.2 4.55.2
openai 1.99.0 1.99.1
deepspeed 0.16.7
accelerate 1.6.0
trl 0.17.0
flash-attn >=2.8.3
liger-kernel 0.5.8
vllm 0.11.0
math-verify 0.7.0

Training environment:

uv sync --extra train
source .venv/bin/activate

Evaluation environment (use a separate directory or venv):

uv sync --extra eval
source .venv/bin/activate
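
Both activation commands above point at `.venv`, so it is easy to lose track of which environment is active. A minimal sketch of a version check follows. The pins are copied from the rows of the package table that list a version for both environments; the function names (`installed_version`, `mismatches`) are hypothetical and not part of this repository.

```python
from importlib import metadata

# Pins copied from the package table above (only rows that list both environments).
EXPECTED = {
    "train": {"torch": "2.7.0", "transformers": "4.53.2", "openai": "1.99.0"},
    "eval": {"torch": "2.8.0", "transformers": "4.55.2", "openai": "1.99.1"},
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def mismatches(env_name, lookup=installed_version):
    """List (package, expected, found) for every pin the active environment violates."""
    problems = []
    for package, expected in EXPECTED[env_name].items():
        found = lookup(package)
        # A local build tag such as "2.7.0+cu128" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            problems.append((package, expected, found))
    return problems
```

Calling `mismatches("train")` inside the training environment should return an empty list if the pins hold. A non-empty list names each package whose installed version differs from the table.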

PRMBench: The PRMBench directory under the eval/ folder is a fork of the PRMBench repository, modified to support discriminative PRM scoring. See PRMBench/README.md for details and setup instructions. It also uses uv for environment management.

5. vLLM patch for 3-label PRM

Warning: vLLM 0.11.0 defaults to 2-label PRM scoring. You must patch it to use 3 labels.

After installing the eval environment, edit your local vLLM installation:

# Find the file:
python -c "import vllm; print(vllm.__file__)"
# Edit: <vllm_path>/model_executor/models/qwen2_rm.py
# Change num_labels in Qwen2ForProcessRewardModel from 2 to 3
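
To make the file easier to find, the path in the comment above can be derived from `vllm.__file__`. This is a small sketch under that assumption. The helper names (`qwen2_rm_source`, `find_num_labels`) are hypothetical, and the sketch only locates candidate lines; the edit itself is still made by hand.

```python
from pathlib import Path


def qwen2_rm_source(vllm_init_file):
    """Map vllm.__file__ (the package's __init__.py) to the file named in the comment above."""
    return Path(vllm_init_file).parent / "model_executor" / "models" / "qwen2_rm.py"


def find_num_labels(source_text):
    """Return (line_number, stripped_line) for every line that mentions num_labels."""
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), start=1)
        if "num_labels" in line
    ]
```

In the eval environment, `qwen2_rm_source(vllm.__file__)` after `import vllm` gives the path to open, and `find_num_labels(path.read_text())` lists the lines worth inspecting.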

6. Environment variables

export TOKENIZERS_PARALLELISM=false
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export WANDB_API_KEY=<your_wandb_api_key>
export HF_HOME=<your_hf_cache_dir>       # Optional

Project Structure

active_prm/
├── config/                          # DeepSpeed and training configs
│   ├── 7b_config.json              # Config for 7B model training
│   ├── warmup_config.json          # Config for warmup phase
│   ├── ds_config.json              # Standard DeepSpeed config
│   └── base.yaml                   # Hydra base config
├── prm_modelling/                   # Model definitions
│   ├── modeling_qwen2_rm.py        # Qwen2 PRM model
│   └── configuration_qwen2_rm.py   # Model configuration
├── scripts/
│   ├── trainer/                    # Training shell scripts
│   │   ├── tiny_train.sh          # Small-scale training
│   │   ├── vanilla_train.sh       # Full vanilla training
│   │   ├── actprm_run.sh          # Active PRM training
│   │   └── multi_node.sh          # Multi-node distributed
│   ├── eval/                       # Evaluation scripts
│   │   ├── bestofn.sh             # Best-of-N full pipeline
│   │   └── search.sh             # Search-based evaluation
│   └── rollouts/                   # Rollout generation scripts
├── offline_train.py                # Main training entry point
├── dataset.py                      # Dataset and collator classes
├── trainer.py                      # Custom trainer logic
├── bestofn.py                      # Best-of-N scoring (inference)
├── evaluator.py                    # Score aggregation and metrics
├── outputs.py                      # Parse evaluation outputs to JSON
├── graph.py                        # Generate comparison plots
├── rollout.py                      # Rollout generation
└── utils.py                        # Shared utilities

Quick Start

# 1. Download a model
git lfs install
git lfs clone https://huggingface.co/kingsleykim/Qwen2.5-Math-1.5B-Prm-Three <your_model_dir>/Qwen2.5-Math-1.5B-Prm-Three

# 2. Train (in training env)
deepspeed active_prm/offline_train.py \
    --config_path active_prm/config/ds_config.json \
    --model_path <your_model_dir>/Qwen2.5-Math-1.5B-Prm-Three \
    --data_path <your_data_dir>/prm_training_data \
    --output_path <your_output_dir>/output_1.5b \
    --vanilla --num_points 1000 --epochs 1

# 3. Evaluate (in eval env) — see EVALUATE.md for full pipeline
python active_prm/bestofn.py \
    --parquet-file <your_data_dir>/rollouts.parquet \
    --model-path <your_model_dir>/Qwen2.5-Math-1.5B-Prm-Three \
    --intermediate-save-path results/scores.pkl

python active_prm/evaluator.py --file results/scores.pkl --score-type min

For detailed instructions see TRAINING.md and EVALUATE.md.

Training

Three training modes are supported:

Vanilla PRM — Standard supervised fine-tuning with optional trajectory-level loss:

deepspeed active_prm/offline_train.py \
    --config_path active_prm/config/ds_config.json \
    --model_path <model> --data_path <data> --output_path <output> \
    --vanilla --traj_loss --agg min --epochs 1
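
The `--traj_loss` and `--agg` flags above pair a trajectory-level loss with an aggregation rule. As a rough illustration of the idea only (this is not the code in `trainer.py`), the sketch below collapses per-step probabilities that a step is correct into one trajectory score and applies binary cross-entropy against the `correct` label. The function names, the softmin definition, the temperature, and the choice of binary cross-entropy are all assumptions made for illustration.

```python
import math


def aggregate(step_probs, agg="min", temperature=0.1):
    """Collapse per-step 'correct' probabilities into one trajectory score in [0, 1]."""
    if not step_probs:
        raise ValueError("need at least one step")
    if agg == "min":
        return min(step_probs)
    if agg == "product":
        return math.prod(step_probs)
    if agg == "softmin":
        # Softmin-weighted average: low-scoring steps receive the largest weights.
        # Both this definition and the temperature are assumptions.
        weights = [math.exp(-p / temperature) for p in step_probs]
        return sum(w * p for w, p in zip(weights, step_probs)) / sum(weights)
    raise ValueError(f"unknown aggregation: {agg}")


def trajectory_bce(step_probs, correct, agg="min", eps=1e-7):
    """Binary cross-entropy between the aggregated score and the final-answer label."""
    score = min(max(aggregate(step_probs, agg), eps), 1.0 - eps)
    return -math.log(score) if correct else -math.log(1.0 - score)
```

With `min`, a single weak step caps the score of the whole trajectory, so a correct final answer pushes up the weakest step. The softmin variant spreads that pressure over the lowest-scoring steps.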

Active PRM — Curriculum learning with k-weighted step sampling:

deepspeed active_prm/offline_train.py \
    --config_path active_prm/config/ds_config.json \
    --model_path <model> --data_path <data> --output_path <output> \
    --k1 0.3 --k2 0.3 --k3 0.4 --epochs 1

Combined — Joint step-level and trajectory-level training:

deepspeed active_prm/offline_train.py \
    --config_path active_prm/config/ds_config.json \
    --model_path <model> --data_path <data> --output_path <output> \
    --combined --traj_loss --agg min --epochs 1

See TRAINING.md for the full CLI reference, loss configurations, and example scripts.

Evaluation

The evaluation pipeline has four stages:

  1. bestofn.py — Run PRM inference to score trajectories and save raw probabilities
  2. evaluator.py — Aggregate step-level probabilities using different strategies
  3. outputs.py — Parse evaluation outputs into structured JSON
  4. graph.py — Generate comparison plots across models and benchmarks

Supported aggregation strategies: min, sum, softmin, orm, sum_norm, norm.
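
To show how an aggregation strategy feeds Best-of-N selection, here is a minimal sketch. It assumes each candidate solution is a pair of a final answer and a list of step scores; that layout and the name `best_of_n` are hypothetical and do not reflect the format `bestofn.py` writes. Only `min` and `sum` are shown, using the Python built-ins.

```python
def best_of_n(candidates, aggregate=min):
    """Return the answer of the candidate whose aggregated step score is highest.

    `candidates` is a list of (answer, step_scores) pairs for one problem.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    answer, _ = max(candidates, key=lambda item: aggregate(item[1]))
    return answer


candidates = [
    ("42", [0.95, 0.2, 0.9]),   # one very weak step
    ("17", [0.6, 0.7, 0.65]),   # uniformly moderate steps
    ("42", [0.5, 0.5, 0.5]),
]
picked_by_min = best_of_n(candidates)                 # "17"
picked_by_sum = best_of_n(candidates, aggregate=sum)  # "42"
```

The two strategies disagree on this toy input: `min` penalizes the trajectory with a single weak step, while `sum` rewards its two strong steps.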

See EVALUATE.md for arguments, output formats, and example scripts.

Models and Datasets

Pre-trained Models

Model HuggingFace Path
Qwen2.5-Math-7B-Prm-Three https://huggingface.co/kingsleykim/Qwen2.5-Math-7B-Prm-Three/
Qwen2.5-Math-1.5B-Prm-Three https://huggingface.co/kingsleykim/Qwen2.5-Math-PRM-1.5B-Three
Qwen2.5-Math-7B-Instruct (base) Qwen/Qwen2.5-Math-7B-Instruct
Qwen2.5-Math-PRM-1.5B-Single (one-head PRM, 1.5B) https://huggingface.co/kingsleykim/Qwen2.5-Math-PRM-1.5B-Single
Qwen2.5-Math-PRM-7B-Single (one-head PRM, 7B) https://huggingface.co/kingsleykim/Qwen2.5-Math-7B-PRM-Single/

We also ran ablations with two, four, and five reward model heads. If you would like those model weights, please email me at kingsleykimm@gmail.com.

Downloading

git lfs install
# Via git lfs clone
git lfs clone https://huggingface.co/kingsleykim/Qwen2.5-Math-7B-Prm-Three

# Or via huggingface-cli
huggingface-cli download kingsleykim/Qwen2.5-Math-7B-Prm-Three --local-dir ./models/Qwen2.5-Math-7B-Prm-Three

Dataset Format

Training datasets are HuggingFace datasets with three columns:

Column Type Description
inputs string Problem + solution with <extra_0> step markers
hard_labels list[int] Per-step label: 0=correct, 1=neutral, 2=incorrect
correct bool Whether the final answer is correct

Note: hard_labels is used only by the Vanilla PRM, whose cross-entropy loss takes the per-step labels. For the PRMs described in the paper, the loss depends only on the correct column.
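
For illustration, one row in this format could be assembled as below. This is a sketch under stated assumptions: the helper names are hypothetical, and the exact prompt layout (where the problem ends, how steps are separated, whether a chat template is applied) is not specified here, so the sketch simply closes each step with one `<extra_0>` marker.

```python
STEP_MARKER = "<extra_0>"


def make_record(problem, steps, hard_labels, correct):
    """Build one training row with the three columns described above."""
    if len(steps) != len(hard_labels):
        raise ValueError("expected one label per step")
    if any(label not in (0, 1, 2) for label in hard_labels):
        raise ValueError("labels must be 0, 1, or 2")
    # Assumption: the problem comes first and each step ends with one marker.
    inputs = problem + "\n" + "".join(step + STEP_MARKER for step in steps)
    return {"inputs": inputs, "hard_labels": list(hard_labels), "correct": bool(correct)}


def is_consistent(record):
    """Check that the number of step markers matches the number of labels."""
    return record["inputs"].count(STEP_MARKER) == len(record["hard_labels"])


record = make_record(
    problem="What is 2 + 3 * 4?",
    steps=["Multiply first: 3 * 4 = 12.", "Add: 2 + 12 = 15."],
    hard_labels=[0, 2],  # 0=correct, 2=incorrect, as in the table above
    correct=False,
)
```

A check like `is_consistent` is cheap to run over a dataset before training, since a mismatch between markers and labels would misalign the per-step loss.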

Experiment Tracking

Training logs to Weights & Biases. Set WANDB_API_KEY before training.

Known Issues

  • vLLM 3-label patch: vLLM 0.11.0 must be manually patched for 3-label PRM scoring (see Installation)
  • Incompatible environments: Training and evaluation require different torch/transformers versions; use separate virtual environments
  • flash-attn build: Requires CUDA headers at build time. Ensure CUDA_HOME is set and the CUDA toolkit is accessible. The uv and flash-attn documentation both cover installation best practices in detail. In my experience, it helps to download the prebuilt wheel that matches your PyTorch and CUDA versions, or the closest one available.

Citation

@inproceedings{kim2026learning,
  title={Learning Discriminative Process Reward Models without Step Labels},
  author={Kim, Kingsley and Liu, Haolin and Wei, Chen-Yu},
  booktitle={ICLR 2026 Workshop on Scaling Post-training for LLMs (SPOT @ ICLR)},
  year={2026},
  url={https://openreview.net/forum?id=df3p10k2kq}
}

License

TBD
