feat: replace max_async_level with on_policy debug flag#2328
Conversation
Removes max_async_level and strict_async_level. Async training now always uses the latest available policy with no upper bound on the step gap (max_off_policy_steps still bounds rollout staleness). Adds no_async boolean for debug-only synchronous on-policy runs, where the orchestrator blocks until the trainer checkpoint for the current step is ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
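A minimal sketch of the synchronous debug path the description refers to — the orchestrator polling until the trainer checkpoint for the current step exists. Function and parameter names here are illustrative, not the repo's actual API:

```python
import time


def wait_for_checkpoint(step: int, ckpt_ready, poll_s: float = 1.0) -> int:
    """Block until the trainer checkpoint for `step` is available.

    In the synchronous on-policy debug mode, the orchestrator refuses to
    generate rollouts for `step` until the trainer has produced the matching
    checkpoint, i.e. the step gap is forced to zero.
    """
    while not ckpt_ready(step):
        time.sleep(poll_s)
    return step
```

In async mode this wait would be skipped (or relaxed), which is exactly why the flag is debug-only: the orchestrator sits idle for the full duration of every trainer step.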
mikasenghaas
left a comment
There was a problem hiding this comment.
nice, love it. would check tests and allow nccl weight broadcast with on_policy=True as well
Removed the NCCL + on_policy compatibility check on both the orchestrator and trainer configs — NCCL broadcast no longer cares about the on_policy flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```diff
-max_async_level: Annotated[
-    int | None,
+on_policy: Annotated[
```
i think we can remove this from the shared config? the trainer doesn't need to know about the async level anymore. thus far, it needed to know at which point it can start cleaning broadcast checkpoints. since now async level is implicitly <=1, we are fine with cleaning dirs >=2 steps away
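The cleanup policy this comment suggests — unconditionally keep the two most recent broadcast dirs — could be sketched as follows. The `step_<n>` directory layout and the function name are assumptions, not the repo's actual code:

```python
import shutil
from pathlib import Path


def clean_broadcast_dirs(broadcast_root: Path, keep: int = 2) -> list[Path]:
    """Delete all but the `keep` most recent broadcast step directories.

    With the async level implicitly capped at 1, the trainer no longer needs
    a configurable retention window: any broadcast >= 2 steps behind the
    newest one can never be served again and is safe to remove.
    """
    # Sort step dirs numerically so step_10 sorts after step_9, not after step_1.
    step_dirs = sorted(
        (d for d in broadcast_root.iterdir()
         if d.is_dir() and d.name.startswith("step_")),
        key=lambda d: int(d.name.split("_")[1]),
    )
    stale = step_dirs[:-keep] if keep else step_dirs
    for d in stale:
        shutil.rmtree(d)
    return stale
```

The hardcoded `keep=2` is what lets the shared `on_policy` field disappear: retention no longer varies with the configured async level.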
```diff
-if self.max_async_level is not None:
-    self.trainer.max_async_level = self.max_async_level
-    self.orchestrator.max_async_level = self.max_async_level
+def auto_setup_on_policy(self):
```
```diff
-max_async_level: Annotated[
-    int,
+on_policy: Annotated[
```
```diff
-if self.strict_async_level:
-    return async_away_ckpt_step
-return max(async_away_ckpt_step, latest_ckpt_step)
+if self.on_policy:
```
we still need the async_away_ckpt_step but hardcoded with max_async_level=1. otherwise the orchestrator will race away from the trainer
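A hedged sketch of the checkpoint-selection logic this comment asks for, with the async level hardcoded to 1 instead of configurable (names follow the diff above, but the surrounding scheduler code is assumed):

```python
def select_ckpt_step(current_step: int, latest_ckpt_step: int,
                     on_policy: bool) -> int:
    """Pick which trainer checkpoint step the orchestrator should wait for.

    on_policy: wait for the checkpoint matching the current step exactly.
    async (default): wait for at least `current_step - 1` -- the hardcoded
    1-step async level that stops the orchestrator from racing ahead of the
    trainer -- but use a newer checkpoint if one is already available.
    """
    if on_policy:
        return current_step
    async_away_ckpt_step = max(current_step - 1, 0)
    return max(async_away_ckpt_step, latest_ckpt_step)
```

The `max(..., latest_ckpt_step)` keeps the "always use the latest available policy" behavior, while the `current_step - 1` floor restores the back-pressure that `max_async_level` used to provide.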
```diff
-def validate_shared_max_async_level(
+def validate_shared_on_policy(
```
CI integration tests showed the orchestrator racing 16 steps ahead of the trainer (Async Level: 16, Max. Off-Policy Level: 0) because removing max_async_level also removed the back-pressure that paused the orchestrator until the trainer had broadcast recent weights. The trainer ended up consuming severely off-policy batches and reward regressed (~0.2 vs the 0.65 threshold at step 19). Reuse max_off_policy_steps (default 8) as the single back-pressure knob: the orchestrator now waits for a checkpoint at `step - max_off_policy_steps` if the latest ckpt lags below that. on_policy=true still forces strict step-equality waiting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
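The back-pressure target described in this commit reduces to a tiny helper (hypothetical name; the real scheduler would fold this into its wait loop):

```python
def backpressure_target_step(current_step: int,
                             max_off_policy_steps: int = 8) -> int:
    """Minimum trainer checkpoint step the orchestrator must observe before
    generating rollouts for `current_step`.

    If the newest checkpoint lags more than `max_off_policy_steps` behind,
    the orchestrator pauses until the trainer reaches this step, so a single
    knob bounds both rollout staleness and the orchestrator/trainer gap.
    """
    return max(current_step - max_off_policy_steps, 0)
```

With the default of 8, an orchestrator at step 20 blocks until a step-12 checkpoint exists, which is exactly the condition that would have prevented the 16-step race observed in CI.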
Per review:

- Remove on_policy from TrainerConfig and the shared RLConfig. The trainer no longer needs to know about async level: broadcast cleanup can always keep the two most recent dirs, and the last-step broadcast skip uses a hardcoded 1-step tail.
- Drop auto_setup_on_policy + validate_shared_on_policy.
- Restore async_away_ckpt_step in the scheduler, hardcoded at 1 step, so the orchestrator can't race ahead of the trainer in async mode.
- Add CHANGELOG entry for the removed max_async_level / strict_async_level fields and the new orchestrator.on_policy flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 7c7817b.
```python
def auto_setup_bench(self):
    if self.bench:
        self.max_steps = 4  # Run for 1 warmup step + 3 evaluation steps
        self.max_async_level = int(1e9)  # Never wait for RL weight checkpoints
```
Bench mode lost "never wait" async override
Medium Severity
The auto_setup_bench method previously set self.max_async_level = int(1e9) with the explicit comment "Never wait for RL weight checkpoints." This line was removed without any replacement. Bench mode now uses the hardcoded async level of 1, meaning the orchestrator will block waiting for trainer checkpoints at step 2+. This defeats the purpose of benchmark mode, which is to measure orchestrator throughput without trainer bottlenecks.
oh this is valid i think. if we hardcode async level 1 then the benchmark mode will stop at the async barrier of 1. maybe we can circumvent this by setting enable_policy_updates=False
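One way the suggested workaround might look, assuming `bench`, `max_steps`, and `enable_policy_updates` config fields (the field names come from this thread; the exact setup hook is an assumption):

```python
def auto_setup_bench(cfg) -> None:
    """Hypothetical bench-mode setup after the max_async_level removal.

    Instead of the old `max_async_level = int(1e9)` "never wait" override,
    disable policy updates entirely so the orchestrator never blocks on
    trainer checkpoints and pure rollout throughput can be measured.
    """
    if cfg.bench:
        cfg.max_steps = 4  # 1 warmup step + 3 evaluation steps
        cfg.enable_policy_updates = False  # never wait for RL weight checkpoints
```

This sidesteps the hardcoded 1-step async barrier without reintroducing a configurable async level, at the cost of bench mode always serving the initial policy.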
Summary
- Removes `max_async_level` and `strict_async_level` from the orchestrator/trainer/rl configs and from the scheduler. `max_off_policy_steps` continues to bound rollout staleness.
- Adds an `on_policy: bool` flag (default `false`) on the orchestrator, trainer, and rl configs. When `true`, the orchestrator blocks until the trainer has produced a checkpoint for the current step, i.e. fully synchronous on-policy RL. This is debug-only: it will be significantly slower than async training in this setup.
- Uses `on_policy=false` instead of `max_async_level=1`.
- Broadcast cleanup no longer depends on the async level (`max_async_level`).
- Updates `docs/async.md` and removes `max_async_level` from debug and CI configs.

Note
Medium Risk
Changes rollout scheduling and checkpoint/broadcast cleanup semantics in the RL training loop, which can affect throughput and correctness of policy updates despite being conceptually straightforward.
Overview
Removes the `max_async_level`/`strict_async_level` configuration knobs across RL (`trainer`, `orchestrator`, and shared `rl` config) and deletes the associated cross-config validation.

Updates the orchestrator scheduler to a fixed async policy: it now serves rollouts from the latest available checkpoint, caps itself to at most one step ahead of the trainer, and relies on `max_off_policy_steps` to drop overly stale rollouts.

Adds `orchestrator.on_policy` (debug-only) to force fully synchronous on-policy RL by blocking rollout generation until the trainer checkpoint for the current step is available, and adjusts log messaging/docs and debug/CI configs accordingly.

Simplifies weight broadcast tail-step and cleanup behavior: trainer broadcast skipping is now based on the final step (not async level), and filesystem broadcast cleanup keeps the two most recent broadcasts unconditionally.
Reviewed by Cursor Bugbot for commit 7c7817b. Bugbot is set up for automated code reviews on this repo.