
feat: replace max_async_level with on_policy debug flag #2328

Open

samsja wants to merge 5 commits into main from
feat/replace-max-async-level-with-no-async

Conversation

@samsja
Member

@samsja samsja commented Apr 19, 2026

Summary

  • Removes max_async_level and strict_async_level from the orchestrator/trainer/rl configs and from the scheduler.
  • Async training now always serves rollouts from the latest available policy (no upper bound on the step gap between trainer and inference). max_off_policy_steps continues to bound rollout staleness.
  • Adds an on_policy: bool flag (default false) on the orchestrator, trainer, and rl configs. When true, the orchestrator blocks until the trainer has produced a checkpoint for the current step, i.e. fully synchronous on-policy RL. This is debug-only: it will be significantly slower than async training in this setup.
  • NCCL weight broadcast validation now checks on_policy=false instead of max_async_level=1.
  • Filesystem weight-broadcast cleanup now keeps the two most recent broadcasts unconditionally (previously computed from max_async_level).
  • Updates docs/async.md and removes max_async_level from debug and CI configs.
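To make the shape of the new flag concrete, here is a minimal sketch of what the orchestrator config field might look like. The real repo uses Annotated/pydantic-style configs (as the diff snippets below show); a plain dataclass stands in here, and everything except the `on_policy` name is illustrative:

```python
from dataclasses import dataclass


@dataclass
class OrchestratorConfig:
    # Debug-only: when True, the orchestrator blocks until the trainer has
    # produced the checkpoint for the current step (fully synchronous,
    # on-policy RL). Defaults to False, i.e. normal async training.
    on_policy: bool = False
```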

Note

Medium Risk
Changes rollout scheduling and checkpoint/broadcast cleanup semantics in the RL training loop, which can affect throughput and correctness of policy updates despite being conceptually straightforward.

Overview
Removes the max_async_level/strict_async_level configuration knobs across RL (trainer, orchestrator, and shared rl config) and deletes the associated cross-config validation.

Updates the orchestrator scheduler to a fixed async policy: it now serves rollouts from the latest available checkpoint, caps itself to at most one step ahead of the trainer, and relies on max_off_policy_steps to drop overly stale rollouts.

Adds orchestrator.on_policy (debug-only) to force fully synchronous on-policy RL by blocking rollout generation until the trainer checkpoint for the current step is available, and adjusts log messaging/docs and debug/CI configs accordingly.

Simplifies weight broadcast tail-step and cleanup behavior: trainer broadcast skipping is now based on the final step (not async level), and filesystem broadcast cleanup keeps the two most recent broadcasts unconditionally.
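The simplified cleanup rule (keep the two most recent broadcasts unconditionally, rather than a window derived from max_async_level) can be sketched as follows; the function name and list-of-steps interface are hypothetical:

```python
def broadcast_steps_to_delete(broadcast_steps: list[int]) -> list[int]:
    """Return the broadcast steps eligible for cleanup, unconditionally
    keeping the two most recent ones (previously the retention window was
    computed from max_async_level)."""
    keep = set(sorted(broadcast_steps)[-2:])
    return [s for s in broadcast_steps if s not in keep]
```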

Reviewed by Cursor Bugbot for commit 7c7817b. Bugbot is set up for automated code reviews on this repo.

Removes max_async_level and strict_async_level. Async training now
always uses the latest available policy with no upper bound on the
step gap (max_off_policy_steps still bounds rollout staleness).
Adds no_async boolean for debug-only synchronous on-policy runs,
where the orchestrator blocks until the trainer checkpoint for the
current step is ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samsja samsja marked this pull request as ready for review April 19, 2026 20:36
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samsja samsja changed the title feat: replace max_async_level with no_async debug flag feat: replace max_async_level with on_policy debug flag Apr 19, 2026
Comment thread src/prime_rl/configs/orchestrator.py
Member

@mikasenghaas mikasenghaas left a comment


nice, love it. would check tests and allow nccl weight broadcast with on_policy=True as well

Removed the NCCL + on_policy compatibility check on both the
orchestrator and trainer configs — NCCL broadcast no longer cares
about the on_policy flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/prime_rl/configs/orchestrator.py
Comment thread src/prime_rl/configs/rl.py Outdated

- max_async_level: Annotated[
-     int | None,
+ on_policy: Annotated[
Member

i think we can remove this from the shared config? the trainer doesn't need to know about the async level anymore. thus far, it needed to know at which point it can start cleaning broadcast checkpoints. since now the async level is implicitly <=1, we are fine with cleaning dirs >=2 steps away

Comment thread src/prime_rl/configs/rl.py Outdated
- if self.max_async_level is not None:
-     self.trainer.max_async_level = self.max_async_level
-     self.orchestrator.max_async_level = self.max_async_level
+ def auto_setup_on_policy(self):
Member

can remove this as well

Comment thread src/prime_rl/configs/trainer.py Outdated

- max_async_level: Annotated[
-     int,
+ on_policy: Annotated[
Member

can remove

- if self.strict_async_level:
-     return async_away_ckpt_step
- return max(async_away_ckpt_step, latest_ckpt_step)
+ if self.on_policy:
Member

we still need the async_away_ckpt_step but hardcoded with max_async_level=1. otherwise the orchestrator will race away from the trainer

Comment thread src/prime_rl/utils/validation.py Outdated


- def validate_shared_max_async_level(
+ def validate_shared_on_policy(
Member

can remove

CI integration tests showed the orchestrator was racing 16 steps ahead
of the trainer (Async Level: 16, Max. Off-Policy Level: 0) because
removing max_async_level also removed the back-pressure that paused
the orchestrator until the trainer had broadcast recent weights. The
trainer ended up consuming severely off-policy batches and reward
regressed (~0.2 vs 0.65 threshold at step 19).

Reuse max_off_policy_steps (default 8) as the single back-pressure
knob: the orchestrator now waits for a checkpoint at step -
max_off_policy_steps if the latest ckpt lags below that. on_policy=true
still forces strict step-equality waiting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
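The intermediate back-pressure rule described in this commit message can be sketched roughly as follows (the function name is hypothetical; the default of 8 comes from the message above):

```python
def min_required_ckpt_step(current_step: int, max_off_policy_steps: int = 8) -> int:
    """Lowest checkpoint step the trainer must have broadcast before the
    orchestrator may proceed at `current_step`: the latest checkpoint is
    allowed to lag by at most max_off_policy_steps."""
    return max(current_step - max_off_policy_steps, 0)
```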
Comment thread src/prime_rl/orchestrator/scheduler.py Outdated
Per review:
- Remove on_policy from TrainerConfig and the shared RLConfig. The
  trainer no longer needs to know about async level: broadcast cleanup
  can always keep the two most recent dirs, and the last-step broadcast
  skip uses a hardcoded 1-step tail.
- Drop auto_setup_on_policy + validate_shared_on_policy.
- Restore async_away_ckpt_step in the scheduler, hardcoded at 1 step,
  so the orchestrator can't race ahead of the trainer in async mode.
- Add CHANGELOG entry for the removed max_async_level / strict_async_level
  fields and the new orchestrator.on_policy flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
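Putting the review feedback together, the resulting checkpoint-selection logic might look roughly like this. The `async_away_ckpt_step` name follows the diff snippets above, but the function signature and body are a sketch, not the actual implementation:

```python
def ckpt_step_to_serve(step: int, latest_ckpt_step: int, on_policy: bool) -> int:
    """Which checkpoint step the orchestrator should use for rollouts at
    `step`."""
    if on_policy:
        # Debug mode: strict step equality with the trainer's checkpoint.
        return step
    # Async mode with the async level hardcoded to 1: require at least the
    # checkpoint from one step back (so the orchestrator can't race away
    # from the trainer), but serve the latest available checkpoint.
    async_away_ckpt_step = max(step - 1, 0)
    return max(async_away_ckpt_step, latest_ckpt_step)
```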
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.



  def auto_setup_bench(self):
      if self.bench:
          self.max_steps = 4  # Run for 1 warmup step + 3 evaluation steps
-         self.max_async_level = int(1e9)  # Never wait for RL weight checkpoints

Bench mode lost "never wait" async override

Medium Severity

The auto_setup_bench method previously set self.max_async_level = int(1e9) with the explicit comment "Never wait for RL weight checkpoints." This line was removed without any replacement. Bench mode now uses the hardcoded async level of 1, meaning the orchestrator will block waiting for trainer checkpoints at step 2+. This defeats the purpose of benchmark mode, which is to measure orchestrator throughput without trainer bottlenecks.


Member

oh this is valid i think. if we hardcode async level 1 then the benchmark mode will stop at the async barrier of 1. maybe we can circumvent this by setting enable_policy_updates=False


2 participants