Is your feature request related to a problem? Please describe.
As we scale RL training in atropos, several infrastructural gaps have emerged that impact training reliability:
- Reward Hacking: The current reward_fns/ lack a consensus mechanism, making single-rater models prone to exploitation.
- Non-Stationary Rewards: While combined_reward.py has basic scaling, there is no online Z-score normalization (Welford's) to stabilize long-term training.
- Difficulty Scaling: The core environment currently lacks a unified CurriculumScheduler for automated "easy-first" sampling.
- Observability Gap: There is no high-resolution profiling for trainer-inference throughput (items/sec) or node-level p95/p99 latency tracking.
These gaps make it difficult to maintain stable, predictable training runs without manual tuning, and they leave performance bottlenecks opaque.
Describe the solution you'd like
I've developed a modular suite of infrastructure enhancements to harden the RL loop and provide much-needed observability:
- Ensemble Reward Aggregator: Consolidated scoring with Krippendorff’s Alpha for tracking rater reliability.
- Online Normalization: Stationary reward scaling via Welford’s Algorithm (O(1) memory).
- Curriculum Scheduler: Extensible sampling strategies for competence-weighted difficulty scaling.
- Numerical Verification Suite: Automated health checks for reward distribution bias and advantage stability.
- Throughput Profiling: A high-resolution APIPerformanceTracker for real-time node latency monitoring.
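To illustrate the online normalization item above, here is a minimal sketch of Z-score reward scaling via Welford's algorithm in O(1) memory. The class and method names (`RewardNormalizer`, `update`) are illustrative only and do not reflect the actual atropos API or the code in the linked PRs.

```python
import math

class RewardNormalizer:
    """Online Z-score normalization via Welford's algorithm (O(1) memory).

    Illustrative sketch only; not the actual atropos implementation.
    """

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero early in training

    def update(self, reward: float) -> float:
        """Fold one reward into the running stats and return its Z-score."""
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        std = math.sqrt(self.m2 / self.count) if self.count > 1 else 0.0
        return (reward - self.mean) / (std + self.eps)
```

Because the statistics are updated incrementally, this tracks drifting (non-stationary) reward distributions without storing a history buffer, which is the advantage over fixed-threshold clipping mentioned below.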
Describe alternatives you've considered
I considered a monolithic PR, but chose to split this into 5 atomic units (#426–#430) to ensure a clean, reviewable history for each component. I also explored fixed-threshold clipping but opted for Welford's online approach to better handle non-stationary distributions without manual intervention.
Additional context
Verified through 92 unit tests and a 20-step E2E rollout on an RTX 3090. Fully compatible with hermes-agent. I have opened these as a series of 5 interconnected PRs:
#426 (Ensemble Reward)
#427 (Reward Normalization)
#428 (Curriculum Learning)
#429 (Numerical Health Checks)
#430 (Throughput Profiling & Final Integration)
cc @dmahan93 @teknium1