Is your feature request related to a problem? Please describe.
As we scale RL training in atropos, several infrastructural gaps have emerged that impact training reliability:
- Reward Hacking: The current reward_fns/ lack a consensus mechanism, making single-rater models prone to exploitation.
- Non-Stationary Rewards: While combined_reward.py has basic scaling, there is no online Z-score normalization (Welford's) to stabilize long-term training.
- Difficulty Scaling: The core environment currently lacks a unified CurriculumScheduler for automated "easy-first" sampling.
- Observability Gap: There is no high-resolution profiling for trainer-inference throughput (items/sec) or node-level p95/p99 latency tracking.
These gaps make it difficult to maintain stable, predictable training runs without manual tuning, and they leave performance bottlenecks opaque.
Describe the solution you'd like
I've developed a modular suite of infrastructure enhancements to harden the RL loop and provide much-needed observability:
- Ensemble Reward Aggregator: Consolidated scoring with Krippendorff’s Alpha for tracking rater reliability.
- Online Normalization: Stationary reward scaling via Welford’s Algorithm (O(1) memory).
- Curriculum Scheduler: Extensible sampling strategies for competence-weighted difficulty scaling.
- Numerical Verification Suite: Automated health checks for reward distribution bias and advantage stability.
- Throughput Profiling: A high-resolution APIPerformanceTracker for real-time node latency monitoring.
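To illustrate the online normalization item above, here is a minimal sketch of Z-score reward scaling via Welford's algorithm in O(1) memory. The class and method names (`RewardNormalizer`, `update`) are illustrative only and do not reflect the actual atropos API or the code in the linked PRs.

```python
import math

class RewardNormalizer:
    """Online Z-score normalization via Welford's algorithm (O(1) memory).

    Illustrative sketch only; not the actual atropos implementation.
    """

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero early in training

    def update(self, reward: float) -> float:
        """Fold one reward into the running stats and return its Z-score."""
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        std = math.sqrt(self.m2 / self.count) if self.count > 1 else 0.0
        return (reward - self.mean) / (std + self.eps)
```

Because the statistics are updated incrementally, this tracks drifting (non-stationary) reward distributions without storing a history buffer, which is the advantage over fixed-threshold clipping mentioned below.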
Describe alternatives you've considered
I considered a monolithic PR, but chose to split this into 5 atomic units (#426–#430) to ensure a clean, reviewable history for each component. I also explored fixed-threshold clipping but opted for Welford's online approach to better handle non-stationary distributions without manual intervention.
Additional context
Verified through 92 unit tests and a 20-step E2E rollout on an RTX 3090. Fully compatible with hermes-agent. I have opened these as a series of 5 interconnected PRs:
#426 (Ensemble Reward)
#427 (Reward Normalization)
#428 (Curriculum Learning)
#429 (Numerical Health Checks)
#430 (Throughput Profiling & Final Integration)
cc @dmahan93 @teknium1