feat: Atropos Elastic Orchestration (DEO) with PID Control #437

@RUFFY-369


Is your feature request related to a problem? Please describe.

Static environment allocation in atroposlib is susceptible to VRAM fragmentation and rollout starvation during high-variance RL workloads. Specifically, atroposlib/api/server.py on main (line 234) appends scored data to the queue with no capacity limit and no backpressure signal, so the queue can grow without bound when environments outpace the trainer, and the trainer can sit idle when they fall behind. The library also has no mechanism for cordoning thermally throttled GPUs or for isolating and cleaning up worker processes, which can leave zombie processes after a crash or degrade throughput on unstable hardware.
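
To make the missing backpressure signal concrete, here is a minimal sketch of a bounded buffer that refuses new items once a capacity limit is reached. The class `BoundedRolloutQueue` and its methods are hypothetical names for illustration only; they are not part of atroposlib and do not mirror the actual queue in server.py.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List


@dataclass
class BoundedRolloutQueue:
    """Hypothetical bounded buffer for scored data (not an atroposlib API)."""

    max_items: int
    _items: Deque[Any] = field(default_factory=deque)

    def try_put(self, item: Any) -> bool:
        """Append the item and return True, or return False when the buffer is full.

        A False return is the backpressure signal: the caller should slow down
        or retry later instead of growing the queue further.
        """
        if len(self._items) >= self.max_items:
            return False
        self._items.append(item)
        return True

    def take(self, n: int) -> List[Any]:
        """Remove and return up to n items in FIFO order."""
        count = min(n, len(self._items))
        return [self._items.popleft() for _ in range(count)]

    def __len__(self) -> int:
        return len(self._items)
```

In a server, a rejected `try_put` could be reported to the environment as a "retry later" response; how that is surfaced is left open here.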

Describe the solution you'd like

Implement a Dynamic Environment Orchestrator (DEO) to act as a resilient control plane for scaling.

  • PID-Style Control Loop: A damped scaling calculation driven by Rollout Pressure (queue depth / batch size), with hysteresis, to determine the target number of actors.
  • Hardware-Aware Scaling: Integrated pre-flight VRAM checks and nvidia-smi telemetry to skip scaling onto thermally throttled or saturated GPUs.
  • Process Isolation & Adoption: Implementation of os.setpgrp for group-level isolation and logic to adopt existing worker processes on startup to prevent resource leakage.
  • Graceful Draining: Support for SIGUSR1 signal handling, allowing workers to finish their current rollout and sync data before exiting during a scale-down event.
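
The control-loop bullet above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: the class name, gains, target pressure, and actor limits are all hypothetical, a dead band around the target stands in for hysteresis, and it assumes that low pressure (an under-filled queue) calls for more actors while high pressure calls for fewer.

```python
from dataclasses import dataclass


@dataclass
class PressureController:
    """Hypothetical PID-style actor scaler driven by rollout pressure."""

    target_pressure: float = 1.0  # assumed set point: queue holds one batch
    kp: float = 2.0               # proportional gain (illustrative value)
    ki: float = 0.1               # integral gain (illustrative value)
    kd: float = 0.5               # derivative gain (illustrative value)
    deadband: float = 0.25        # errors within this band cause no scaling
    min_actors: int = 1
    max_actors: int = 64
    _integral: float = 0.0
    _prev_error: float = 0.0

    def step(self, queue_depth: int, batch_size: int,
             current_actors: int, dt: float = 1.0) -> int:
        """Return the actor count to aim for after one control tick."""
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        pressure = queue_depth / batch_size
        # Positive error: queue is under-filled, so more actors are needed.
        error = self.target_pressure - pressure
        if abs(error) <= self.deadband:
            # Inside the dead band: hold steady to avoid flapping.
            self._prev_error = error
            return current_actors
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        delta = self.kp * error + self.ki * self._integral + self.kd * derivative
        desired = current_actors + round(delta)
        # Clamp to the configured actor limits.
        return max(self.min_actors, min(self.max_actors, desired))
```

A caller would invoke `step` once per polling interval with the latest queue depth and then spawn or drain workers to reach the returned count. A production controller would likely also need integral anti-windup, which is omitted here for brevity.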

Describe alternatives you've considered

  • Static Allocation: Leads to consistent idle-time on expensive hardware when workload pressure is low.
  • K8s/Slurm Native Operators: Scheduling latency is too high for the sub-minute environment resets common in RL research.
  • Raw Thresholding: Lacks damping, so the actor count can flap during transient throughput spikes.

Additional context

Verified that the existing API lacks a global metrics endpoint for orchestration; this proposal includes adding a /global-status endpoint to expose rollout pressure metrics to the orchestrator.
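
As a rough idea of what a /global-status response might carry, the sketch below builds a JSON-serializable payload from queue state. The function name and every field name are assumptions made for illustration; the issue does not specify the response schema, and wiring this into the server's routing is intentionally left out.

```python
import json
from typing import Any, Dict


def build_global_status(queue_depth: int, batch_size: int,
                        active_actors: int) -> Dict[str, Any]:
    """Build a hypothetical /global-status payload (field names are assumed)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return {
        "queue_depth": queue_depth,
        "batch_size": batch_size,
        # Rollout pressure as defined in the proposal: queue / batch size.
        "rollout_pressure": queue_depth / batch_size,
        "active_actors": active_actors,
    }


if __name__ == "__main__":
    status = build_global_status(queue_depth=48, batch_size=32, active_actors=6)
    print(json.dumps(status))
```

An orchestrator polling this endpoint would read `rollout_pressure` on each tick and feed it into its scaling decision.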

cc @dmahan93 @teknium1
