Is your feature request related to a problem? Please describe.
Current static environment allocation in atroposlib is susceptible to VRAM fragmentation and rollout starvation during high-variance RL workloads. Specifically, the atroposlib/api/server.py implementation on main (line 234) appends scored data to the queue without capacity limits or backpressure signals, which can lead to unbounded queue growth or trainer idle time. Furthermore, the library currently has no mechanism for cordoning thermally throttled GPUs or for systematic worker-process isolation and cleanup, which can leave zombie processes behind after a crash or degrade throughput on unstable hardware.
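For illustration, the bounded-queue behavior this issue is asking for could look roughly like the sketch below. This is a hypothetical helper, not the current atroposlib API; the queue object, capacity constant, and function name are all assumptions for discussion.

```python
import asyncio

MAX_QUEUE_SIZE = 1024  # assumed capacity limit, tunable in practice

def try_enqueue(queue: asyncio.Queue, item: dict) -> bool:
    """Attempt a non-blocking enqueue of scored data.

    Returns False instead of letting the queue grow without bound,
    giving the caller an explicit backpressure signal (e.g. to return
    an HTTP 429 or retry with backoff).
    """
    try:
        queue.put_nowait(item)
        return True
    except asyncio.QueueFull:
        return False

# Example: a server-side queue with a hard cap
scored_data_queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_SIZE)
```

The key point is that the producer gets a boolean it can act on, rather than the current unconditional append.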
Describe the solution you'd like
Implement a Dynamic Environment Orchestrator (DEO) to act as a resilient control plane for scaling.
- PID-Style Control Loop: A dampened scaling calculation that uses rollout pressure (queue size / batch size) with hysteresis to determine the target number of actors.
- Hardware-Aware Scaling: Integrated pre-flight VRAM checks and nvidia-smi telemetry to skip scaling onto thermally throttled or saturated GPUs.
- Process Isolation & Adoption: Implementation of os.setpgrp for group-level isolation and logic to adopt existing worker processes on startup to prevent resource leakage.
- Graceful Draining: Support for SIGUSR1 signal handling, allowing workers to finish their current rollout and sync data before exiting during a scale-down event.
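To make the control-loop bullet concrete, here is a minimal sketch of a dampened, hysteresis-guarded scaling rule. The `ScalerConfig` dataclass, the gain and deadband constants, and the function name are assumptions for this proposal, not existing atroposlib code.

```python
from dataclasses import dataclass


@dataclass
class ScalerConfig:
    target_pressure: float = 1.0  # desired queue_size / batch_size ratio
    gain: float = 0.5             # proportional gain < 1.0 dampens the response
    deadband: float = 0.25        # hysteresis band: ignore small deviations
    min_actors: int = 1
    max_actors: int = 64


def plan_actor_count(current_actors: int, queue_size: int,
                     batch_size: int, cfg: ScalerConfig) -> int:
    """Recommend a new actor count from rollout pressure.

    Low pressure (starving trainer) scales up; high pressure scales
    down. The deadband prevents flapping on transient spikes, and the
    fractional gain prevents overshoot.
    """
    pressure = queue_size / max(batch_size, 1)
    error = cfg.target_pressure - pressure
    if abs(error) <= cfg.deadband:
        return current_actors  # inside the hysteresis band: hold steady
    step = round(cfg.gain * error * current_actors)
    proposed = current_actors + step
    return max(cfg.min_actors, min(cfg.max_actors, proposed))
```

With the defaults above, an empty queue against 8 actors would suggest 12 (a dampened +50% step rather than a jump to the cap), while a deeply backed-up queue clamps down toward `min_actors`.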
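The graceful-draining bullet could be implemented along these lines. The `DrainableWorker` class and its loop are hypothetical; the only load-bearing idea is that SIGUSR1 sets a flag checked between rollouts, so an in-flight rollout is never killed mid-way. (SIGUSR1 is POSIX-only; Windows would need a different mechanism.)

```python
import signal


class DrainableWorker:
    """Worker that finishes its current rollout before exiting on SIGUSR1."""

    def __init__(self) -> None:
        self.draining = False
        # Must be installed from the main thread.
        signal.signal(signal.SIGUSR1, self._request_drain)

    def _request_drain(self, signum, frame) -> None:
        # Never exit mid-rollout; just flag the loop to stop afterwards.
        self.draining = True

    def run(self, rollouts) -> list:
        completed = []
        for rollout in rollouts:
            completed.append(rollout())  # finish the in-flight rollout
            if self.draining:
                break  # sync completed data, then exit cleanly
        return completed
```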
Describe alternatives you've considered
- Static Allocation: Leads to sustained idle time on expensive hardware when workload pressure is low.
- K8s/Slurm Native Operators: High-latency overhead for the sub-minute environment resets common in RL research.
- Raw Thresholding: Lacks dampening, which can cause scaling flapping during transient throughput spikes.
Additional context
Verified that the existing API lacks a global metrics endpoint for orchestration; this proposal includes adding a /global-status endpoint to expose rollout pressure metrics to the orchestrator.
cc @dmahan93 @teknium1