[SPOT @ ICLR 2026] Learning Discriminative Process Reward Models without Step Labels

Training and evaluation code for Process Reward Models (PRMs) that score intermediate reasoning steps in mathematical problem-solving. We train discriminative PRMs on Qwen2.5-Math (1.5B, 7B) using trajectory-level supervision (min, softmin, product aggregation) and active curriculum learning, then evaluate via Best-of-N selection on AIME 2024/2025, MATH-500, and AIMO benchmarks.

Requirements

Hardware:

NVIDIA GPU(s) with CUDA 12.8 support
Multi-GPU recommended for 7B models (DeepSpeed ZeRO)

Software:

Python >= 3.11
CUDA Toolkit 12.8
Ninja build system
Git LFS (for downloading models/datasets)

Warning: Training and evaluation require incompatible package versions (different torch, transformers, and vllm), so you must use separate virtual environments. See Installation.

Installation

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh
# Or: pip install uv

2. Clone the repository

git clone <repo_url>
cd active_prm

3. CUDA Toolkit

Option A: System-wide — Follow the NVIDIA CUDA Toolkit installation guide for your OS.

Option B: HPC with lmod

module load cuda/12.8.0
module load ninja

4. Create virtual environments

Warning: Training and evaluation use different versions of torch, transformers, and vllm, so you must use separate environments.

Package	Training env	Evaluation env
torch	2.7.0	2.8.0
transformers	4.53.2	4.55.2
openai	1.99.0	1.99.1
deepspeed	0.16.7	—
accelerate	1.6.0	—
trl	0.17.0	—
flash-attn	>=2.8.3	—
liger-kernel	0.5.8	—
vllm	—	0.11.0
math-verify	—	0.7.0
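
Because both environments are activated the same way, it is easy to launch a job in the wrong one. The sketch below compares installed versions against the exact pins in the table. `PINS`, `installed_version`, and `mismatches` are names made up for this example and are not part of the repository; flash-attn is left out because the table gives only a lower bound for it.

```python
from importlib import metadata

# Exact pins copied from the table above. The keys "train" and "eval"
# are labels chosen for this sketch, not names used by the repository.
PINS = {
    "train": {
        "torch": "2.7.0", "transformers": "4.53.2", "openai": "1.99.0",
        "deepspeed": "0.16.7", "accelerate": "1.6.0", "trl": "0.17.0",
        "liger-kernel": "0.5.8",
    },
    "eval": {
        "torch": "2.8.0", "transformers": "4.55.2", "openai": "1.99.1",
        "vllm": "0.11.0", "math-verify": "0.7.0",
    },
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def mismatches(env, lookup=installed_version):
    """Map each package that differs from its pin to (expected, found)."""
    report = {}
    for package, expected in PINS[env].items():
        found = lookup(package)
        # Ignore local build suffixes such as "+cu128" when comparing.
        if found is None or found.split("+")[0] != expected:
            report[package] = (expected, found)
    return report
```

Calling `mismatches("train")` or `mismatches("eval")` inside an activated environment returns an empty dict when every pin matches.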

Training environment:

uv sync --extra train
source .venv/bin/activate

Evaluation environment (run this in a separate checkout or venv so that it does not overwrite the training environment):

uv sync --extra eval
source .venv/bin/activate

PRMBench: The PRMBench directory under eval/ is a fork of the PRMBench repository, modified to support discriminative PRM scoring. See PRMBench/README.md for details and setup instructions; it also uses uv for environment management.

5. vLLM patch for 3-label PRM

Warning: vLLM 0.11.0 defaults to 2-label PRM scoring. You must patch it to use 3 labels.

After installing the eval environment, edit your local vLLM installation:

# Find the file:
python -c "import vllm; print(vllm.__file__)"
# Edit: <vllm_path>/model_executor/models/qwen2_rm.py
# Change num_labels in Qwen2ForProcessRewardModel from 2 to 3
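
Importing vllm just to print its path can be slow, so the lookup can also be done without importing the package. This is a small sketch using only the standard library; `module_file` is a hypothetical helper, and the relative path is the one given in the step above.

```python
from importlib.util import find_spec
from pathlib import Path


def module_file(package, relative_path):
    """Return the path of a file inside an installed package without importing it."""
    spec = find_spec(package)
    # Namespace packages have no origin, so treat them like a missing install.
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package} is not installed in this environment")
    # spec.origin points at the package's __init__.py.
    return Path(spec.origin).parent / relative_path
```

In the eval environment, `module_file("vllm", "model_executor/models/qwen2_rm.py")` gives the file to edit. The edit itself is still manual.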

6. Environment variables

export TOKENIZERS_PARALLELISM=false
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export WANDB_API_KEY=<your_wandb_api_key>
export HF_HOME=<your_hf_cache_dir>       # Optional

Project Structure

active_prm/
├── config/                          # DeepSpeed and training configs
│   ├── 7b_config.json              # Config for 7B model training
│   ├── warmup_config.json          # Config for warmup phase
│   ├── ds_config.json              # Standard DeepSpeed config
│   └── base.yaml                   # Hydra base config
├── prm_modelling/                   # Model definitions
│   ├── modeling_qwen2_rm.py        # Qwen2 PRM model
│   └── configuration_qwen2_rm.py   # Model configuration
├── scripts/
│   ├── trainer/                    # Training shell scripts
│   │   ├── tiny_train.sh          # Small-scale training
│   │   ├── vanilla_train.sh       # Full vanilla training
│   │   ├── actprm_run.sh          # Active PRM training
│   │   └── multi_node.sh          # Multi-node distributed
│   ├── eval/                       # Evaluation scripts
│   │   ├── bestofn.sh             # Best-of-N full pipeline
│   │   └── search.sh             # Search-based evaluation
│   └── rollouts/                   # Rollout generation scripts
├── offline_train.py                # Main training entry point
├── dataset.py                      # Dataset and collator classes
├── trainer.py                      # Custom trainer logic
├── bestofn.py                      # Best-of-N scoring (inference)
├── evaluator.py                    # Score aggregation and metrics
├── outputs.py                      # Parse evaluation outputs to JSON
├── graph.py                        # Generate comparison plots
├── rollout.py                      # Rollout generation
└── utils.py                        # Shared utilities

Quick Start

# 1. Download a model (after git lfs install, a plain git clone fetches the weights; git lfs clone is deprecated)
git lfs install
git clone https://huggingface.co/kingsleykim/Qwen2.5-Math-1.5B-Prm-Three <your_model_dir>/Qwen2.5-Math-1.5B-Prm-Three

# 2. Train (in training env)
deepspeed active_prm/offline_train.py \
    --config_path active_prm/config/ds_config.json \
    --model_path <your_model_dir>/Qwen2.5-Math-1.5B-Prm-Three \
    --data_path <your_data_dir>/prm_training_data \
    --output_path <your_output_dir>/output_1.5b \
    --vanilla --num_points 1000 --epochs 1

# 3. Evaluate (in eval env) — see EVALUATE.md for full pipeline
python active_prm/bestofn.py \
    --parquet-file <your_data_dir>/rollouts.parquet \
    --model-path <your_model_dir>/Qwen2.5-Math-1.5B-Prm-Three \
    --intermediate-save-path results/scores.pkl

python active_prm/evaluator.py --file results/scores.pkl --score-type min

For detailed instructions see TRAINING.md and EVALUATE.md.

Training

Three training modes are supported:

Vanilla PRM — Standard supervised fine-tuning with optional trajectory-level loss:

deepspeed active_prm/offline_train.py \
    --config_path active_prm/config/ds_config.json \
    --model_path <model> --data_path <data> --output_path <output> \
    --vanilla --traj_loss --agg min --epochs 1

Active PRM — Curriculum learning with k-weighted step sampling:

deepspeed active_prm/offline_train.py \
    --config_path active_prm/config/ds_config.json \
    --model_path <model> --data_path <data> --output_path <output> \
    --k1 0.3 --k2 0.3 --k3 0.4 --epochs 1

Combined — Joint step-level and trajectory-level training:

deepspeed active_prm/offline_train.py \
    --config_path active_prm/config/ds_config.json \
    --model_path <model> --data_path <data> --output_path <output> \
    --combined --traj_loss --agg min --epochs 1

See TRAINING.md for the full CLI reference, loss configurations, and example scripts.

Evaluation

The evaluation pipeline has four stages:

bestofn.py — Run PRM inference to score trajectories and save raw probabilities
evaluator.py — Aggregate step-level probabilities using different strategies
outputs.py — Parse evaluation outputs into structured JSON
graph.py — Generate comparison plots across models and benchmarks
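
For readers new to Best-of-N, the selection step that the first two stages implement can be sketched in a few lines: score every step of each candidate solution, collapse the step scores into one number, and keep the candidate with the highest value. This is an illustration only, not the code in bestofn.py or evaluator.py; `Candidate` and `best_of_n` are hypothetical names and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str               # final answer extracted from the rollout
    step_scores: list[float]  # per-step PRM probabilities, each in [0, 1]


def best_of_n(candidates, aggregate=min):
    """Return the candidate whose aggregated step score is highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: aggregate(c.step_scores))


# Made-up scores for three rollouts of the same problem.
rollouts = [
    Candidate("42", [0.95, 0.95, 0.2]),  # one weak step drags the min down
    Candidate("41", [0.7, 0.7, 0.6]),
    Candidate("42", [0.95, 0.5, 0.4]),
]

print(best_of_n(rollouts).answer)                 # min aggregation
print(best_of_n(rollouts, aggregate=sum).answer)  # sum aggregation
```

With these scores the two aggregations pick different candidates, which is why the aggregation strategy is exposed as a separate option.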

Supported aggregation strategies: min, sum, softmin, orm, sum_norm, norm.
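
To make the names concrete, here is a minimal sketch of four ways to collapse per-step scores into one trajectory score: min and sum from the list above, softmin, and the product aggregation mentioned in the introduction. The exact definitions in evaluator.py may differ. In particular, softmin is written here as a softmin-weighted average with a `temperature` parameter, which is an assumption, and the `agg_*` function names are invented for the example.

```python
import math


def agg_min(scores):
    """Score a trajectory by its weakest step."""
    return min(scores)


def agg_sum(scores):
    """Score a trajectory by the sum of its step scores."""
    return sum(scores)


def agg_product(scores):
    """Score a trajectory by the product of its step scores."""
    return math.prod(scores)


def agg_softmin(scores, temperature=0.1):
    """Softmin-weighted average: low scores get exponentially more weight.

    Approaches min(scores) as temperature goes to 0 and the plain mean
    as temperature grows large.
    """
    lowest = min(scores)
    # Subtracting the minimum keeps every exponential in (0, 1].
    weights = [math.exp(-(s - lowest) / temperature) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


steps = [0.9, 0.8, 0.3]  # made-up step probabilities
for name, fn in [("min", agg_min), ("sum", agg_sum),
                 ("product", agg_product), ("softmin", agg_softmin)]:
    print(f"{name:8s}{fn(steps):.3f}")
```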

See EVALUATE.md for arguments, output formats, and example scripts.

Models and Datasets

Pre-trained Models

Model	HuggingFace Path
Qwen2.5-Math-7B-Prm-Three	`https://huggingface.co/kingsleykim/Qwen2.5-Math-7B-Prm-Three/`
Qwen2.5-Math-1.5B-Prm-Three	`https://huggingface.co/kingsleykim/Qwen2.5-Math-PRM-1.5B-Three`
Qwen2.5-Math-7B-Instruct (base)	`Qwen/Qwen2.5-Math-7B-Instruct`
Qwen2.5-Math-PRM-1.5B-Single (one-head PRM, 1.5B)	`https://huggingface.co/kingsleykim/Qwen2.5-Math-PRM-1.5B-Single`
Qwen2.5-Math-PRM-7B-Single (one-head PRM, 7B)	`https://huggingface.co/kingsleykim/Qwen2.5-Math-7B-PRM-Single/`

We also trained and ablated PRMs with two, four, and five reward-model heads. If you would like those model weights, please email me at kingsleykimm@gmail.com.

Downloading

git lfs install
# Via git clone (with Git LFS installed, a plain git clone fetches the weights; git lfs clone is deprecated)
git clone https://huggingface.co/kingsleykim/Qwen2.5-Math-7B-Prm-Three

# Or via huggingface-cli
huggingface-cli download kingsleykim/Qwen2.5-Math-7B-Prm-Three --local-dir ./models/Qwen2.5-Math-7B-Prm-Three

Dataset Format

Training datasets are HuggingFace datasets with three columns:

Column	Type	Description
`inputs`	string	Problem + solution with `<extra_0>` step markers
`hard_labels`	list[int]	Per-step label: 0=correct, 1=neutral, 2=incorrect
`correct`	bool	Whether the final answer is correct
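
As an illustration of the table, here is what a single record might look like, plus a small consistency check. The column names and label values come from the table; the problem text, the assumption that each step ends with exactly one `<extra_0>` marker, and the `check_record` helper are made up for this sketch. The real data may also wrap the problem in a chat template.

```python
STEP_MARKER = "<extra_0>"

# A made-up record: a two-step solution, each step closed by one marker.
record = {
    "inputs": (
        "What is 2 + 3 * 4? "
        "First compute 3 * 4 = 12." + STEP_MARKER
        + "Then add 2 to get 14." + STEP_MARKER
    ),
    "hard_labels": [0, 0],  # 0=correct, 1=neutral, 2=incorrect
    "correct": True,
}


def check_record(rec):
    """Return the number of steps, or raise ValueError if the record is inconsistent."""
    n_steps = rec["inputs"].count(STEP_MARKER)
    if n_steps != len(rec["hard_labels"]):
        raise ValueError(
            f"{n_steps} step markers but {len(rec['hard_labels'])} labels"
        )
    if any(label not in (0, 1, 2) for label in rec["hard_labels"]):
        raise ValueError("hard_labels must contain only 0, 1, or 2")
    if not isinstance(rec["correct"], bool):
        raise ValueError("correct must be a bool")
    return n_steps


print(check_record(record))
```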

Note: hard_labels is used only by the Vanilla PRM, whose cross-entropy loss is computed on the per-step labels. The PRMs described in the paper are trained with a loss that depends only on the correct column.
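
To show how a loss can depend only on the correct column, here is a rough sketch of one trajectory-level objective: aggregate the per-step probabilities with min, then apply binary cross-entropy against the final-answer label. This is an assumption about the general shape of such a loss, not the implementation in trainer.py. It is written in plain Python for readability, and `trajectory_loss` and `step_probs` are hypothetical names.

```python
import math


def trajectory_loss(step_probs, correct, eps=1e-7):
    """Binary cross-entropy between the min-aggregated step score and the outcome.

    step_probs: model probability that each step is correct, one per step.
    correct:    whether the trajectory's final answer is correct.
    """
    p = min(step_probs)              # trajectory score under min aggregation
    p = min(max(p, eps), 1.0 - eps)  # clamp so the logs stay finite
    target = 1.0 if correct else 0.0
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# A correct trajectory is penalized for its weakest step only.
print(trajectory_loss([0.9, 0.8], correct=True))
# An incorrect trajectory is rewarded when at least one step scores low.
print(trajectory_loss([0.9, 0.1], correct=False))
```

No per-step label appears anywhere in the computation: the only supervision is the single boolean per trajectory.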

Experiment Tracking

Training logs to Weights & Biases. Set WANDB_API_KEY before training.

Known Issues

vLLM 3-label patch: vLLM 0.11.0 must be manually patched for 3-label PRM scoring (see Installation)
Incompatible environments: Training and evaluation require different torch/transformers versions; use separate virtual environments
flash-attn build: Requires CUDA headers at build time, so ensure CUDA_HOME is set and the CUDA toolkit is accessible. The uv and flash-attn documentation both cover installation best practices in detail. In my experience, the most reliable approach is to download the prebuilt wheel that matches your PyTorch and CUDA versions, or the closest available match.

Citation

@inproceedings{kim2026learning,
  title={Learning Discriminative Process Reward Models without Step Labels},
  author={Kim, Kingsley and Liu, Haolin and Wei, Chen-Yu},
  booktitle={ICLR 2026 Workshop on Scaling Post-training for LLMs (SPOT @ ICLR)},
  year={2026},
  url={https://openreview.net/forum?id=df3p10k2kq}
}

License

TBD
