feat: reward ensembles and inter-rater reliability metrics #426

Open
RUFFY-369 wants to merge 3 commits into NousResearch:main from RUFFY-369:feat/reward-ensemble

Conversation

RUFFY-369 commented Mar 30, 2026

PR Type

  • RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

I’ve added a new EnsembleReward class to atroposlib/envs/reward_fns/ to mitigate reward hacking and high variance in RL training. Instead of relying on a single scoring model, it aggregates the outputs of multiple scorers using one of four strategies: mean, median, min, or majority_vote.
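For reviewers who want a quick mental model of the four strategies, here is a minimal standalone sketch. It is not the code in this PR: `Scorer`, `AGGREGATORS`, and `ensemble_score` are hypothetical names used only for illustration, and the tie-breaking rule for majority_vote (lowest tied score wins) is an assumption, not something the PR description specifies.

```python
from collections import Counter
from statistics import mean, median
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer signature: maps a completion string to a float reward.
Scorer = Callable[[str], float]


def _majority_vote(scores: Sequence[float]) -> float:
    """Return the most frequent score; ties go to the lowest tied score (assumed rule)."""
    counts = Counter(scores)
    top = max(counts.values())
    return min(score for score, count in counts.items() if count == top)


# Strategy names mirror the ones listed in the PR description.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": mean,
    "median": median,
    "min": min,
    "majority_vote": _majority_vote,
}


def ensemble_score(completion: str, scorers: List[Scorer], strategy: str = "mean") -> float:
    """Score one completion with every scorer, then aggregate the scores."""
    if strategy not in AGGREGATORS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if not scorers:
        raise ValueError("at least one scorer is required")
    scores = [float(scorer(completion)) for scorer in scorers]
    return float(AGGREGATORS[strategy](scores))
```

The choice of strategy is a trade-off: min is the most conservative (a completion must satisfy every scorer), median tolerates a single outlier scorer, and majority_vote only makes sense when scorers emit discrete values.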

I also integrated Krippendorff's alpha to track inter-rater reliability (IRR) across the ensemble. This is an observability tool for catching cases where the scorers fundamentally disagree, which is a common signal that the agent has found an unaligned edge case.
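As a rough illustration of the metric, the sketch below computes Krippendorff's alpha as 1 - D_o / D_e (observed over expected disagreement) with the interval distance (squared difference). It assumes a complete ratings matrix with no missing scores, and the function name `krippendorff_alpha_interval` is hypothetical; the implementation in this PR may use a different distance metric or handle missing values.

```python
import math
from itertools import combinations
from typing import Sequence


def krippendorff_alpha_interval(ratings: Sequence[Sequence[float]]) -> float:
    """Krippendorff's alpha for interval data.

    ratings[r][i] is scorer r's score for item i. Assumes every scorer
    rated every item (no missing values).
    """
    n_raters = len(ratings)
    if n_raters < 2:
        raise ValueError("need at least two raters")
    n_items = len(ratings[0])
    if any(len(row) != n_items for row in ratings):
        raise ValueError("all raters must score the same number of items")

    n_total = n_raters * n_items

    # Observed disagreement: squared differences between scores given to the
    # same item, over ordered pairs (hence the factor 2), weighted by 1/(m - 1).
    observed = 0.0
    for i in range(n_items):
        unit = [ratings[r][i] for r in range(n_raters)]
        pair_sum = sum((a - b) ** 2 for a, b in combinations(unit, 2))
        observed += 2.0 * pair_sum / (n_raters - 1)
    observed /= n_total

    # Expected disagreement: squared differences between all pairs of scores,
    # ignoring which item they belong to. This is O(n_total^2), fine for a sketch.
    all_values = [v for row in ratings for v in row]
    all_pairs = sum((a - b) ** 2 for a, b in combinations(all_values, 2))
    expected = 2.0 * all_pairs / (n_total * (n_total - 1))

    if expected == 0.0:
        # Every score is identical, so alpha is undefined.
        return math.nan
    return 1.0 - observed / expected
```

An alpha of 1.0 means the scorers agree perfectly, values near 0 mean agreement is no better than chance, and negative values indicate systematic disagreement. A sudden drop during training is the kind of signal described above.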

Related Issues

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Code refactor (no functional changes)
  • Build/CI/CD related changes
  • Other (please describe):

✅ Developer & Reviewer Checklist

  • Code follows project style (black, isort, flake8 pass with pre-commit)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes (17/17 verified)
  • Docstrings added for all new public classes / functions
  • If .env vars required, did you add it to the .env.example in repo root? (N/A)

RUFFY-369 and others added 3 commits March 28, 2026 03:22
…ability

Add EnsembleReward to atroposlib/envs/reward_fns/ with:
- Multiple aggregation strategies: mean, median, min, majority_vote
- Krippendorff's alpha inter-rater reliability metric
- Per-item disagreement tracking for reward hacking detection
- Full integration with RewardRegistry

17/17 tests passing.