
feat: replace max_async_level with on_policy debug flag #2328

Open

samsja wants to merge 5 commits into main from
feat/replace-max-async-level-with-no-async

Conversation

@samsja
Member

@samsja samsja commented Apr 19, 2026

Summary

  • Removes max_async_level and strict_async_level from the orchestrator/trainer/rl configs and from the scheduler.
  • Async training now always serves rollouts from the latest available policy (no upper bound on the step gap between trainer and inference). max_off_policy_steps continues to bound rollout staleness.
  • Adds an on_policy: bool flag (default false) on the orchestrator, trainer, and rl configs. When true, the orchestrator blocks until the trainer has produced a checkpoint for the current step, i.e. fully synchronous on-policy RL. This is debug-only: it will be significantly slower than async training in this setup.
  • NCCL weight broadcast validation now checks on_policy=false instead of max_async_level=1.
  • Filesystem weight-broadcast cleanup now keeps the two most recent broadcasts unconditionally (previously computed from max_async_level).
  • Updates docs/async.md and removes max_async_level from debug and CI configs.
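To make the shape of the new flag concrete, here is a minimal sketch of what the orchestrator config field might look like. The real repo uses Annotated/pydantic-style configs (as the diff snippets below show); a plain dataclass stands in here, and everything except the `on_policy` name is illustrative:

```python
from dataclasses import dataclass


@dataclass
class OrchestratorConfig:
    # Debug-only: when True, the orchestrator blocks until the trainer has
    # produced the checkpoint for the current step (fully synchronous,
    # on-policy RL). Defaults to False, i.e. normal async training.
    on_policy: bool = False
```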

Note

Medium Risk
Changes rollout scheduling and checkpoint/broadcast cleanup semantics in the RL training loop, which can affect throughput and correctness of policy updates despite being conceptually straightforward.

Overview
Removes the max_async_level/strict_async_level configuration knobs across RL (trainer, orchestrator, and shared rl config) and deletes the associated cross-config validation.

Updates the orchestrator scheduler to a fixed async policy: it now serves rollouts from the latest available checkpoint, caps itself to at most one step ahead of the trainer, and relies on max_off_policy_steps to drop overly stale rollouts.

Adds orchestrator.on_policy (debug-only) to force fully synchronous on-policy RL by blocking rollout generation until the trainer checkpoint for the current step is available, and adjusts log messaging/docs and debug/CI configs accordingly.

Simplifies weight broadcast tail-step and cleanup behavior: trainer broadcast skipping is now based on the final step (not async level), and filesystem broadcast cleanup keeps the two most recent broadcasts unconditionally.
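The simplified cleanup rule (keep the two most recent broadcasts unconditionally, rather than a window derived from max_async_level) can be sketched as follows; the function name and list-of-steps interface are hypothetical:

```python
def broadcast_steps_to_delete(broadcast_steps: list[int]) -> list[int]:
    """Return the broadcast steps eligible for cleanup, unconditionally
    keeping the two most recent ones (previously the retention window was
    computed from max_async_level)."""
    keep = set(sorted(broadcast_steps)[-2:])
    return [s for s in broadcast_steps if s not in keep]
```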

Reviewed by Cursor Bugbot for commit 7c7817b. Bugbot is set up for automated code reviews on this repo.

Removes max_async_level and strict_async_level. Async training now
always uses the latest available policy with no upper bound on the
step gap (max_off_policy_steps still bounds rollout staleness).
Adds no_async boolean for debug-only synchronous on-policy runs,
where the orchestrator blocks until the trainer checkpoint for the
current step is ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samsja samsja marked this pull request as ready for review April 19, 2026 20:36
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samsja samsja changed the title feat: replace max_async_level with no_async debug flag feat: replace max_async_level with on_policy debug flag Apr 19, 2026
Comment thread src/prime_rl/configs/orchestrator.py
Member

@mikasenghaas mikasenghaas left a comment


nice, love it. would check tests and allow nccl weight broadcast with on_policy=True as well

Removed the NCCL + on_policy compatibility check on both the
orchestrator and trainer configs — NCCL broadcast no longer cares
about the on_policy flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/prime_rl/configs/orchestrator.py
Comment thread src/prime_rl/configs/rl.py Outdated

- max_async_level: Annotated[
-     int | None,
+ on_policy: Annotated[
Member

i think we can remove this from the shared config? the trainer doesn't need to know about the async level anymore. thus far, it needed to know at which point it can start cleaning broadcast checkpoints. since now the async level is implicitly <=1, we are fine with cleaning dirs >=2 steps away

Comment thread src/prime_rl/configs/rl.py Outdated
- if self.max_async_level is not None:
-     self.trainer.max_async_level = self.max_async_level
-     self.orchestrator.max_async_level = self.max_async_level
+ def auto_setup_on_policy(self):
Member

can remove this as well

Comment thread src/prime_rl/configs/trainer.py Outdated

- max_async_level: Annotated[
-     int,
+ on_policy: Annotated[
Member

can remove

- if self.strict_async_level:
-     return async_away_ckpt_step
- return max(async_away_ckpt_step, latest_ckpt_step)
+ if self.on_policy:
Member

we still need the async_away_ckpt_step but hardcoded with max_async_level=1. otherwise the orchestrator will race away from the trainer

Comment thread src/prime_rl/utils/validation.py Outdated


- def validate_shared_max_async_level(
+ def validate_shared_on_policy(
Member

can remove

CI integration tests showed the orchestrator was racing 16 steps ahead
of the trainer (Async Level: 16, Max. Off-Policy Level: 0) because
removing max_async_level also removed the back-pressure that paused
the orchestrator until the trainer had broadcast recent weights. The
trainer ended up consuming severely off-policy batches and reward
regressed (~0.2 vs 0.65 threshold at step 19).

Reuse max_off_policy_steps (default 8) as the single back-pressure
knob: the orchestrator now waits for a checkpoint at step -
max_off_policy_steps if the latest ckpt lags below that. on_policy=true
still forces strict step-equality waiting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
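The intermediate back-pressure rule described in this commit message can be sketched roughly as follows (the function name is hypothetical; the default of 8 comes from the message above):

```python
def min_required_ckpt_step(current_step: int, max_off_policy_steps: int = 8) -> int:
    """Lowest checkpoint step the trainer must have broadcast before the
    orchestrator may proceed at `current_step`: the latest checkpoint is
    allowed to lag by at most max_off_policy_steps."""
    return max(current_step - max_off_policy_steps, 0)
```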
Comment thread src/prime_rl/orchestrator/scheduler.py Outdated
Per review:
- Remove on_policy from TrainerConfig and the shared RLConfig. The
  trainer no longer needs to know about async level: broadcast cleanup
  can always keep the two most recent dirs, and the last-step broadcast
  skip uses a hardcoded 1-step tail.
- Drop auto_setup_on_policy + validate_shared_on_policy.
- Restore async_away_ckpt_step in the scheduler, hardcoded at 1 step,
  so the orchestrator can't race ahead of the trainer in async mode.
- Add CHANGELOG entry for the removed max_async_level / strict_async_level
  fields and the new orchestrator.on_policy flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
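Putting the review feedback together, the resulting checkpoint-selection logic might look roughly like this. The `async_away_ckpt_step` name follows the diff snippets above, but the function signature and body are a sketch, not the actual implementation:

```python
def ckpt_step_to_serve(step: int, latest_ckpt_step: int, on_policy: bool) -> int:
    """Which checkpoint step the orchestrator should use for rollouts at
    `step`."""
    if on_policy:
        # Debug mode: strict step equality with the trainer's checkpoint.
        return step
    # Async mode with the async level hardcoded to 1: require at least the
    # checkpoint from one step back (so the orchestrator can't race away
    # from the trainer), but serve the latest available checkpoint.
    async_away_ckpt_step = max(step - 1, 0)
    return max(async_away_ckpt_step, latest_ckpt_step)
```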
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.



  def auto_setup_bench(self):
      if self.bench:
          self.max_steps = 4  # Run for 1 warmup step + 3 evaluation steps
-         self.max_async_level = int(1e9)  # Never wait for RL weight checkpoints

Bench mode lost "never wait" async override

Medium Severity

The auto_setup_bench method previously set self.max_async_level = int(1e9) with the explicit comment "Never wait for RL weight checkpoints." This line was removed without any replacement. Bench mode now uses the hardcoded async level of 1, meaning the orchestrator will block waiting for trainer checkpoints at step 2+. This defeats the purpose of benchmark mode, which is to measure orchestrator throughput without trainer bottlenecks.


Member

oh this is valid i think. if we hardcode async level 1 then the benchmark mode will stop at the async barrier of 1. maybe we can circumvent this by setting enable_policy_updates=False


2 participants