feat: online reward normalization (Welford’s algorithm)#427
Open
RUFFY-369 wants to merge 4 commits into NousResearch:main
Conversation
…lity

Add `RewardNormalizer` to `atroposlib/envs/` with:
- Welford's online algorithm for running mean/variance (no data storage)
- Z-score and min-max normalization modes
- Configurable reward clipping and warmup period
- Checkpoint save/load support
- Opt-in integration in `BaseEnv` via 3 new config fields
- WandB metrics for normalization statistics

21/21 tests passing.
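For context, here is a minimal sketch of what the Welford-based tracker described in that commit might look like. The class name `RewardNormalizer` comes from the commit message; the method names, fields, and checkpoint format below are illustrative assumptions, not the PR's actual API.

```python
# Sketch of Welford's online mean/variance tracking as the commit
# describes it: O(1) memory, no stored reward history. Method and
# field names are assumptions for illustration.

class RewardNormalizer:
    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, reward: float) -> None:
        # Welford's update: fold one sample into the running statistics.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        # The second delta uses the *updated* mean; this is what keeps
        # the variance estimate numerically stable.
        self.m2 += delta * (reward - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count > 1 else 0.0

    @property
    def std(self) -> float:
        return self.variance ** 0.5

    def state_dict(self) -> dict:
        # Checkpoint support is cheap: the full state is three scalars.
        return {"count": self.count, "mean": self.mean, "m2": self.m2}

    def load_state_dict(self, state: dict) -> None:
        self.count = state["count"]
        self.mean = state["mean"]
        self.m2 = state["m2"]
```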
PR Type
📝 General Information
Description
Added an online reward normalizer to `BaseEnv` to keep training stable as the reward distribution shifts. I used Welford's online algorithm for the running z-score calculation to keep memory usage O(1) and avoid storing large reward histories. I also included a configurable `warmup_steps` phase so rewards aren't rescaled until the mean/std estimates have stabilized. This should address the gradient-explosion issues often seen in the early stages of RL training.
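Roughly, the warmup-gated z-score path looks like this (a simplified sketch building on the `RewardNormalizer` outline earlier in this thread; the pass-through behavior during warmup, the default value, and the epsilon guard are illustrative assumptions):

```python
EPS = 1e-8  # assumed guard against division by a near-zero std

def normalize(normalizer: RewardNormalizer, reward: float,
              warmup_steps: int = 100) -> float:
    # Always fold the sample into the running statistics.
    normalizer.update(reward)
    if normalizer.count < warmup_steps:
        # During warmup, pass the raw reward through so unstable early
        # mean/std estimates don't distort the training signal.
        return reward
    # After warmup, apply the running z-score.
    return (reward - normalizer.mean) / (normalizer.std + EPS)
```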
Related Issues

Type of Change
✅ Developer & Reviewer Checklist