[Enhancement] RL Training Infrastructure Stabilization & Observability #431

@RUFFY-369

Description

Is your feature request related to a problem? Please describe.

As we scale RL training in atropos, several infrastructural gaps have emerged that impact training reliability:

  • Reward Hacking: The current reward_fns/ lack a consensus mechanism, making single-rater models prone to exploitation.
  • Non-Stationary Rewards: While combined_reward.py has basic scaling, there is no online Z-score normalization (Welford's) to stabilize long-term training.
  • Difficulty Scaling: The core environment currently lacks a unified CurriculumScheduler for automated "easy-first" sampling.
  • Observability Gap: There is no high-resolution profiling for trainer-inference throughput (items/sec) or node-level p95/p99 latency tracking.
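To illustrate the normalization gap concretely, an online Z-score normalizer in the Welford style could look like the sketch below. This is a minimal illustration with made-up names, not the interface proposed in #427:

```python
class OnlineRewardNormalizer:
    """Online Z-score normalization via Welford's algorithm (O(1) memory).

    Illustrative sketch only; the class and method names are hypothetical
    and not part of the atropos codebase.
    """

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, reward: float) -> None:
        # Welford's single-pass update of mean and variance.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    def normalize(self, reward: float) -> float:
        # Z-score against the running statistics; neutral until
        # enough samples exist to estimate variance.
        if self.count < 2:
            return 0.0
        std = (self.m2 / (self.count - 1)) ** 0.5
        return (reward - self.mean) / (std + self.eps)
```

Because the statistics update in O(1) per reward, the normalizer tracks slow drift in the reward distribution without storing a history buffer or requiring manual re-tuning.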

These gaps make it difficult to maintain stable, predictable training runs without manual tuning and opaque performance bottlenecks.

Describe the solution you'd like

I've developed a modular suite of infrastructure enhancements to harden the RL loop and provide much-needed observability:

  • Ensemble Reward Aggregator: Consolidated scoring with Krippendorff’s Alpha for tracking rater reliability.
  • Online Normalization: Stationary reward scaling via Welford’s Algorithm (O(1) memory).
  • Curriculum Scheduler: Extensible sampling strategies for competence-weighted difficulty scaling.
  • Numerical Verification Suite: Automated health checks for reward distribution bias and advantage stability.
  • Throughput Profiling: A high-resolution APIPerformanceTracker for real-time node latency monitoring.
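For a sense of the profiling surface, a latency/throughput tracker in the spirit of the last bullet could be sketched as follows. The class name matches the one referenced in this issue, but the interface here is an assumption for illustration, not the implementation in #430:

```python
import time
from contextlib import contextmanager


class APIPerformanceTracker:
    """Sketch of a high-resolution node latency/throughput tracker.

    Hypothetical interface; only the class name comes from the issue text.
    """

    def __init__(self):
        self.latencies: list[float] = []
        self.items = 0
        self.start = time.perf_counter()

    @contextmanager
    def track(self, n_items: int = 1):
        # Time one trainer-inference round trip and count items processed.
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.latencies.append(time.perf_counter() - t0)
            self.items += n_items

    def percentile(self, p: float) -> float:
        # Nearest-rank percentile over all recorded latencies.
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]

    def summary(self) -> dict:
        elapsed = time.perf_counter() - self.start
        return {
            "items_per_sec": self.items / max(elapsed, 1e-9),
            "p95_latency_s": self.percentile(95),
            "p99_latency_s": self.percentile(99),
        }
```

Wrapping each inference call in `with tracker.track(n_items=batch_size):` would surface the items/sec and p95/p99 numbers described above.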

Describe alternatives you've considered

I considered a monolithic PR, but chose to split this into 5 atomic units (#426–#430) to ensure a clean, reviewable history for each component. I also explored fixed-threshold clipping but opted for Welford's online approach to better handle non-stationary distributions without manual intervention.

Additional context

Verified through 92 unit tests and a 20-step E2E rollout on an RTX 3090. Fully compatible with hermes-agent. I have opened these as a series of 5 interconnected PRs:

#426 (Ensemble Reward)
#427 (Reward Normalization)
#428 (Curriculum Learning)
#429 (Numerical Health Checks)
#430 (Throughput Profiling & Final Integration)

cc @dmahan93 @teknium1
