Is your feature request related to a problem? Please describe.
Current static environment allocation in atroposlib is susceptible to VRAM fragmentation and rollout starvation during high-variance RL workloads. Specifically, the atroposlib/api/server.py implementation on main (line 234) appends scored data to the queue without capacity limits or backpressure signals, which can lead to unbounded queue growth or trainer idle time. Furthermore, the library currently has no mechanism for cordoning thermally throttled GPUs or for systematic worker-process isolation and cleanup, which can leave zombie processes behind after a crash or degrade throughput on unstable hardware.
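For illustration, the bounded-queue behavior this issue is asking for could look roughly like the sketch below. This is a hypothetical helper, not the current atroposlib API; the queue object, capacity constant, and function name are all assumptions for discussion.

```python
import asyncio

MAX_QUEUE_SIZE = 1024  # assumed capacity limit, tunable in practice

def try_enqueue(queue: asyncio.Queue, item: dict) -> bool:
    """Attempt a non-blocking enqueue of scored data.

    Returns False instead of letting the queue grow without bound,
    giving the caller an explicit backpressure signal (e.g. to return
    an HTTP 429 or retry with backoff).
    """
    try:
        queue.put_nowait(item)
        return True
    except asyncio.QueueFull:
        return False

# Example: a server-side queue with a hard cap
scored_data_queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_SIZE)
```

The key point is that the producer gets a boolean it can act on, rather than the current unconditional append.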
Describe the solution you'd like
Implement a Dynamic Environment Orchestrator (DEO) to act as a resilient control plane for scaling.
- PID-Style Control Loop: A dampened scaling calculation that uses rollout pressure (queue size / batch size) with hysteresis to determine the target number of actors.
- Hardware-Aware Scaling: Integrated pre-flight VRAM checks and nvidia-smi telemetry to skip scaling onto thermally throttled or saturated GPUs.
- Process Isolation & Adoption: Implementation of os.setpgrp for group-level isolation and logic to adopt existing worker processes on startup to prevent resource leakage.
- Graceful Draining: Support for SIGUSR1 signal handling, allowing workers to finish their current rollout and sync data before exiting during a scale-down event.
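To make the control-loop bullet concrete, here is a minimal sketch of a dampened, hysteresis-guarded scaling rule. The `ScalerConfig` dataclass, the gain and deadband constants, and the function name are assumptions for this proposal, not existing atroposlib code.

```python
from dataclasses import dataclass


@dataclass
class ScalerConfig:
    target_pressure: float = 1.0  # desired queue_size / batch_size ratio
    gain: float = 0.5             # proportional gain < 1.0 dampens the response
    deadband: float = 0.25        # hysteresis band: ignore small deviations
    min_actors: int = 1
    max_actors: int = 64


def plan_actor_count(current_actors: int, queue_size: int,
                     batch_size: int, cfg: ScalerConfig) -> int:
    """Recommend a new actor count from rollout pressure.

    Low pressure (starving trainer) scales up; high pressure scales
    down. The deadband prevents flapping on transient spikes, and the
    fractional gain prevents overshoot.
    """
    pressure = queue_size / max(batch_size, 1)
    error = cfg.target_pressure - pressure
    if abs(error) <= cfg.deadband:
        return current_actors  # inside the hysteresis band: hold steady
    step = round(cfg.gain * error * current_actors)
    proposed = current_actors + step
    return max(cfg.min_actors, min(cfg.max_actors, proposed))
```

With the defaults above, an empty queue against 8 actors would suggest 12 (a dampened +50% step rather than a jump to the cap), while a deeply backed-up queue clamps down toward `min_actors`.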
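The graceful-draining bullet could be implemented along these lines. The `DrainableWorker` class and its loop are hypothetical; the only load-bearing idea is that SIGUSR1 sets a flag checked between rollouts, so an in-flight rollout is never killed mid-way. (SIGUSR1 is POSIX-only; Windows would need a different mechanism.)

```python
import signal


class DrainableWorker:
    """Worker that finishes its current rollout before exiting on SIGUSR1."""

    def __init__(self) -> None:
        self.draining = False
        # Must be installed from the main thread.
        signal.signal(signal.SIGUSR1, self._request_drain)

    def _request_drain(self, signum, frame) -> None:
        # Never exit mid-rollout; just flag the loop to stop afterwards.
        self.draining = True

    def run(self, rollouts) -> list:
        completed = []
        for rollout in rollouts:
            completed.append(rollout())  # finish the in-flight rollout
            if self.draining:
                break  # sync completed data, then exit cleanly
        return completed
```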
Describe alternatives you've considered
- Static Allocation: Leads to sustained idle time on expensive hardware when workload pressure is low.
- K8s/Slurm Native Operators: High-latency overhead for the sub-minute environment resets common in RL research.
- Raw Thresholding: Lacks dampening, which can cause scaling flapping during transient throughput spikes.
Additional context
Verified that the existing API lacks a global metrics endpoint for orchestration; this proposal includes adding a /global-status endpoint to expose rollout pressure metrics to the orchestrator.
cc @dmahan93 @teknium1