A Unified Reinforcement Learning Framework for Mathematical Reasoning
Train small language models to reason like giants. Nano-Reasoner implements state-of-the-art RL algorithms for mathematical reasoning, achieving 84%+ accuracy on GSM8K with just a 1.5B parameter model.
- 6 RL Algorithms — PPO, GRPO, Dr.GRPO, GSPO, DAPO, and GRPO-LEAD in one unified framework
- Memory Efficient — 4-bit quantization + LoRA + gradient checkpointing (runs on 16GB GPUs)
- Two-Phase Training — SFT cold start followed by RL fine-tuning for optimal results
- Production Ready — Checkpoint resumption, W&B logging, and modular design
- Colab Compatible — Full training pipeline in a single notebook
```
┌─────────────────────────────────────────────────────────────────┐
│                     NANO-REASONER PIPELINE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Phase 1: SFT Cold Start                                        │
│  ━━━━━━━━━━━━━━━━━━━━━━━                                        │
│  • Teaches model the <think>...</think> format                  │
│  • Train for 1 epoch on format examples                         │
│  • Saves to: checkpoints/sft/                                   │
│                     │                                           │
│                     ▼                                           │
│  Phase 2: RL Training (Choose Algorithm)                        │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                        │
│  • Loads SFT checkpoint as base                                 │
│  • PPO / GRPO / Dr.GRPO / GSPO / DAPO / GRPO-LEAD               │
│  • Optimizes for correctness via reward signal                  │
│  • Saves to: checkpoints/<algorithm>/                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Why SFT First? RL algorithms require the model to spontaneously emit <think> tags. An untrained model won't do this. SFT "cold start" teaches the format before RL optimizes for correctness.
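The target format looks roughly like this (an illustrative completion for a GSM8K-style problem, not verbatim training data):

```
<think>
Natalia sold 48 clips in April and half as many in May: 48 / 2 = 24.
In total she sold 48 + 24 = 72 clips.
</think>
The answer is \boxed{72}.
```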
| Algorithm | Description | Memory | Best For |
|---|---|---|---|
| GRPO | Group Relative Policy Optimization | Low | Baseline, stable training |
| Dr.GRPO | Length-corrected GRPO | Low | Avoiding length bias |
| PPO | Proximal Policy Optimization | High | When you have VRAM to spare |
| GSPO | Group Sequence Policy Optimization | Low | Sequence-level optimization |
| DAPO | Decoupled Clip and Dynamic sAmpling Policy Optimization | Low | Dynamic sampling scenarios |
| GRPO-LEAD | Length & Difficulty Aware GRPO | Low | Curriculum-style training |
- Python 3.9+
- CUDA 11.8+ (for GPU training)
- 16GB+ GPU memory recommended
```bash
# Clone the repository
git clone https://github.com/your-org/nano-reasoner.git
cd nano-reasoner

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
make install
# Or manually:
pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # optional, speeds up training
```

Start with the SFT cold start:

```bash
make train-sft
```

This teaches the model the reasoning format (`<think>...</think>` and `\boxed{}`).
Choose your algorithm:

```bash
# Recommended: Dr.GRPO (length-corrected, stable)
make train-dr-grpo

# Or try other algorithms
make train-grpo       # Baseline GRPO
make train-ppo        # PPO (higher memory)
make train-gspo       # Sequence-level optimization
make train-dapo       # Decoupled clipping
make train-grpo-lead  # Length & difficulty aware
```

Then evaluate the trained checkpoint:

```bash
make evaluate CKPT=checkpoints/dr_grpo/best
```

Edit `configs/base.yaml` to customize training:
```yaml
seed: 42

model:
  name: "Qwen/Qwen2.5-Math-1.5B-Instruct"

data:
  dataset_name: "openai/gsm8k"  # or "HuggingFaceH4/openr1-math-220k"

training:
  algo: "DR.GRPO"        # PPO, GRPO, DR.GRPO, GSPO, DAPO, GRPO-LEAD
  group_size: 8          # Number of samples per prompt for GRPO variants
  batch_size: 2          # Batch size per GPU
  epochs: 1
  learning_rate: 5.0e-6
  max_samples: 5000      # null for full dataset
  max_new_tokens: 384    # Max reasoning length
  ppo_epochs: 2          # Inner optimization iterations
  save_steps: 100

paths:
  output_dir: "checkpoints"

logging:
  use_wandb: false
  project: "NanoReasoner"
```

| GPU | VRAM | Recommended Settings |
|---|---|---|
| T4 | 16GB | batch_size: 1, group_size: 2 |
| L4 | 24GB | batch_size: 1, group_size: 4 |
| A100 40GB | 40GB | batch_size: 2, group_size: 8 |
| A100 80GB | 80GB | batch_size: 8, group_size: 8 |
Training on GSM8K with Qwen2.5-Math-1.5B-Instruct:
| Algorithm | Accuracy | Training Time | Notes |
|---|---|---|---|
| Base Model | ~65% | - | Zero-shot |
| + SFT | ~72% | ~30 min | Format tuning only |
| + Dr.GRPO | 84% | ~10 hrs | 5000 samples, A100 |
| + GRPO | 82% | ~10 hrs | Baseline RL |
Results on A100-80GB with default hyperparameters.
```
nano-reasoner/
├── src/
│   ├── __init__.py        # Package exports
│   ├── config_parser.py   # YAML config loading
│   ├── dataset.py         # MathReasoningDataset
│   ├── model.py           # UnifiedPolicyModel (4-bit + LoRA)
│   ├── trainer.py         # UnifiedReasoningTrainer (all algorithms)
│   └── utils.py           # Utilities (seeding, logging, checkpoints)
├── scripts/
│   ├── train.py           # Training entry point
│   └── inference.py       # Evaluation script
├── configs/
│   ├── base.yaml          # RL training config
│   └── sft.yaml           # SFT cold start config
├── notebooks/
│   └── unified_rl_training_v2.ipynb  # Colab notebook
├── Makefile               # Training commands
├── requirements.txt       # Dependencies
└── pyproject.toml         # Package metadata
```
For cloud training, use the included Jupyter notebook, `notebooks/unified_rl_training_v2.ipynb`.
The notebook includes:
- Google Drive integration for checkpoint persistence
- Interactive configuration widgets
- Real-time training metrics
- Built-in evaluation and inference
Training automatically resumes from the latest checkpoint:

```python
from src import find_latest_checkpoint

checkpoint_dir = "checkpoints/dr_grpo"
path, step = find_latest_checkpoint(checkpoint_dir)
print(f"Resuming from step {step}: {path}")
```

Extend `MathReasoningDataset` for new datasets:
```python
from src import MathReasoningDataset

dataset = MathReasoningDataset(
    tokenizer=tokenizer,
    split="train",
    max_samples=10000,
    mode="rl",  # or "sft"
    dataset_name="your-org/your-math-dataset"
)
```

The full training loop can also be driven directly from Python:

```python
from src import (
    UnifiedPolicyModel,
    UnifiedReasoningTrainer,
    MathReasoningDataset,
    seed_everything
)

seed_everything(42)

# Initialize model
model = UnifiedPolicyModel("Qwen/Qwen2.5-Math-1.5B-Instruct", algo="DR.GRPO")

# Configure trainer
config = {
    "algo": "DR.GRPO",
    "group_size": 8,
    "learning_rate": 5e-6,
    "max_new_tokens": 384,
    "ppo_epochs": 1
}
trainer = UnifiedReasoningTrainer(model, config, device="cuda")

# Train
for batch in dataloader:
    metrics = trainer.train_step(batch)
    print(f"Loss: {metrics['loss']:.4f}, Accuracy: {metrics['accuracy']:.1%}")
```

GRPO is the standard baseline: it normalizes advantages within groups of samples drawn for the same prompt.
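The group-relative advantage at the heart of GRPO and its variants can be sketched in a few lines of NumPy (the reward values and group size below are illustrative, not the project's defaults):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    # Subtract the group mean and divide by the group std (eps avoids /0
    # when every sample in the group gets the same reward)
    return (r - r.mean()) / (r.std() + eps)

# 8 sampled completions for one prompt: 1.0 = correct, 0.0 = incorrect
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
adv = group_relative_advantages(rewards)
print(adv.round(3))  # correct samples get positive advantage, incorrect negative
```

Because the baseline is the group mean rather than a learned value function, no separate critic model is needed, which is where the memory savings over PPO come from.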
Dr.GRPO addresses GRPO's length bias by rescaling each sample's contribution by its response length relative to the group mean:

```
scale_i = L_i / mean(L_group)
```
GRPO-LEAD combines a length penalty with difficulty weighting based on the prompt's pass rate:

```
difficulty_weight = 2.0 - pass_rate
length_penalty    = exp(-0.1 * |z_score|)
```
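A sketch of how those two terms might combine into a per-sample weight. The constants (2.0, -0.1) come from the formulas above; combining them into a single multiplier is an assumption about the implementation:

```python
import math

def grpo_lead_weight(pass_rate, z_score):
    """Per-sample weight: harder prompts (low pass rate) count for more,
    and responses whose length is far from the group's typical length
    (large |z_score|) are penalized."""
    difficulty_weight = 2.0 - pass_rate            # in [1.0, 2.0]
    length_penalty = math.exp(-0.1 * abs(z_score)) # in (0.0, 1.0]
    return difficulty_weight * length_penalty

# A hard prompt (25% of samples correct), response of typical length
print(round(grpo_lead_weight(pass_rate=0.25, z_score=0.0), 3))  # 1.75
```

The effect is curriculum-like: prompts the model already solves reliably contribute less gradient than prompts it mostly fails.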
Out of memory:
- Reduce `batch_size` and `group_size`
- Enable gradient checkpointing (default: enabled)
- Use 4-bit quantization (default: enabled)

Rewards stuck at zero:
- Verify the SFT phase completed successfully
- Check the reward function: `trainer.extract_answer()` must find `\boxed{}`
- Ensure the model generates `<think>` tags
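It can help to test answer extraction in isolation. `trainer.extract_answer()` is the project's own method; the regex below is only an assumed stand-in for checking that a completion contains a parseable `\boxed{}`:

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in a completion, or None.
    Note: [^{}]* does not handle nested braces like \\boxed{\\frac{1}{2}}."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

completion = "<think>48 / 2 = 24, so 48 + 24 = 72.</think> The answer is \\boxed{72}."
print(extract_boxed(completion))              # 72
print(extract_boxed("no final answer here"))  # None
```

If this returns `None` on your model's actual generations, the reward signal is zero everywhere and RL cannot make progress — which is exactly the failure the SFT cold start is meant to prevent.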
Slow training:
- Install Flash Attention 2
- Use larger batch sizes if memory allows
- Check GPU utilization with `nvidia-smi`
If you use Nano-Reasoner in your research, please cite:
```bibtex
@software{nano_reasoner,
  title = {Nano-Reasoner: Unified RL for Mathematical Reasoning},
  year  = {2024},
  url   = {https://github.com/your-org/nano-reasoner}
}
```

MIT License - see LICENSE for details.
- Qwen2.5-Math for the base model
- Unsloth for optimized training
- GRPO and Dr.GRPO papers for algorithm inspiration