Skip to content

fix(reborn): reconcile scheduler concurrency default and guard worker_count=1#5224

Closed
henrypark133 wants to merge 2 commits into
mainfrom
fix/reborn-runner-concurrency
Closed

fix(reborn): reconcile scheduler concurrency default and guard worker_count=1#5224
henrypark133 wants to merge 2 commits into
mainfrom
fix/reborn-runner-concurrency

Conversation

@henrypark133

Copy link
Copy Markdown
Collaborator

Root cause: NOT a scheduler concurrency regression, and NOT a worker_count=1 config foot-gun

This started as an investigation of a production report — "triggered runs starving; 7 of 23 never reached turn run started; a single TurnRunnerId serving runs serially → effective concurrency ≈ 1."

The log evidence disproves the starvation hypothesis. From logs.1782348290172.log:

  • The scheduler runs runs concurrently — three runs overlap in-flight in multiple windows:
    • 23:30 — 81de47f4 (07.45→16.85), 6200c910 (10.59→23.70), 9461329b (13.01→24.67) all in flight ~23:30:13–16.
    • Same 3-way overlap at 00:00 and 00:30.
  • The single TurnRunnerId(ce43288a…) is one scheduler instance (one process), by design — it is not a per-run worker. Concurrency is genuinely > 1, so production is not running at worker_count=1.
  • The "… did not finish before Slack delivery timeout" / "never started" lines all originate in ironclaw_reborn_composition::slack_delivery, not the scheduler. Real runs reach BlockedApproval within ~7s (5e46d384: started 23:20:32, blocked 23:20:39, finished 23:20:40). The Slack triggered-run delivery delivers the first gate, then re-waits on the now-permanently-blocked run for the full DEFAULT_TRIGGERED_RUN_DELIVERY_MAX_WAIT (30 min) and finally reports Failed — the timeout fires exactly 30 min after the run blocked (5e46d384 blocked 23:20:39 → wait-failed 23:50:45). The "orphan" run_ids are prior-cycle / duplicate delivery waits.

That Slack-delivery availability behavior is a separate track and is intentionally not touched here.

What this PR ships

Since the scheduler is healthy, there is no behavioral fix. This PR ships only the genuine, independently-verified config-hygiene gaps uncovered while verifying:

  1. Reconcile the divergent default. TurnRunSchedulerConfig::default().max_concurrent_runs was the literal 4, while production overrides with worker_count (default 16) — so any Default-only caller silently under-provisioned. Introduce ironclaw_host_runtime::DEFAULT_MAX_CONCURRENT_RUNS = 16 as the single source of truth; Default uses it; ironclaw_reborn::DEFAULT_TURN_RUNNER_WORKER_COUNT now derives from it (legal one-way import), making the two equal by construction — no literal duplication, no drift.
  2. Guard the degenerate value. The CLI resolver (runner_settings) now emits a startup warn! when worker_count resolves to 1 (legal, but serializes all runs through one slot). The operator's explicit value is honoured, not silently overridden.
  3. Fix the stale doc. [runner].worker_count "defaults to 4" → corrected to 16.
  4. Record the invariant in crates/ironclaw_host_runtime/CLAUDE.md so a future agent does not re-introduce a divergent default or assume the scheduler is serial.

Tests (red→green verified on the unpatched base)

  • ironclaw_host_runtime: default_config_uses_canonical_max_concurrent_runs (asserts 16; on base Default was 4red), plus with_max_concurrent_runs floor boundaries (0→1, 1→1).
  • ironclaw_reborn: worker_count_default_matches_scheduler_default (equals the scheduler constant and is > 1).
  • ironclaw_reborn_cli: drives the real runner_settings resolver — asserts a WARN fires for worker_count = 1 (removing the warn → red) and does not fire for a normal value, per "Test Through the Caller".

Quality gate

  • cargo fmt --all — clean.
  • cargo clippy -p ironclaw_host_runtime -p ironclaw_reborn_cli -p ironclaw_reborn -p ironclaw_reborn_config --all-targets --all-features0 warnings.
  • All new/touched tests pass. (3 pre-existing sandbox_process tests fail locally only because no Docker daemon is present — unrelated to this diff.)

Review

Ran the multi-agent /code-review skill. The strongest finding (derive the worker-count const from the scheduler const instead of duplicating the literal 16 + pinning with a test) was applied — it is strictly better and removes the drift failure mode entirely. The stale-doc and two test-coverage findings were also addressed.

🤖 Generated with Claude Code

…_count=1

Investigation of a production "triggered runs starving / single runner serving
serially" report. The log evidence disproves the starvation hypothesis: the
TurnRunScheduler runs runs concurrently (three runs overlap in-flight at 23:30,
00:00 and 00:30 in the captured logs), and the single TurnRunnerId is one
scheduler instance by design, not a per-run worker. Production was not running
at worker_count=1. The "never started" / "did not finish before Slack delivery
timeout" lines originate in the Slack triggered-run delivery layer (runs reach
BlockedApproval within ~7s; the delivery re-waits on the permanently-blocked run
for the full 30-minute max_wait), which is a separate availability track.

No scheduler behavioral change is warranted. This commit ships only the
genuine, independently-verified config-hygiene gaps found while verifying:

- TurnRunSchedulerConfig::default().max_concurrent_runs was the literal 4 while
  production used 16, so any Default-only caller silently under-provisioned.
  Introduce ironclaw_host_runtime::DEFAULT_MAX_CONCURRENT_RUNS (= 16) as the
  single source of truth and have Default use it. ironclaw_reborn's
  DEFAULT_TURN_RUNNER_WORKER_COUNT now derives from that constant (the import is
  legal one-way), so the scheduler cap and worker-count default are equal by
  construction.
- The CLI resolver (runner_settings) now emits a startup warn when worker_count
  resolves to 1, which is legal but serializes all runs through one slot. The
  operator's explicit value is honoured, not silently overridden.
- Fix the stale "defaults to 4" doc on [runner].worker_count (real default 16).
- Record the scheduler concurrency invariant in the host_runtime CLAUDE.md spec.

Tests (red->green verified):
- ironclaw_host_runtime: Default equals DEFAULT_MAX_CONCURRENT_RUNS (was 4 ->
  red), plus with_max_concurrent_runs floor boundaries.
- ironclaw_reborn: worker-count default equals the scheduler constant and is >1.
- ironclaw_reborn_cli: the real resolver warns on worker_count=1 and does NOT
  warn on a normal value, driving runner_settings per "Test Through the Caller".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5224 June 25, 2026 07:05 Destroyed
@github-actions github-actions Bot added scope: docs Documentation size: M 50-199 changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Jun 25, 2026
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 91b0d633-f000-4ac8-be98-9d695fa76ca4

📥 Commits

Reviewing files that changed from the base of the PR and between 78ec682 and 36b6a15.

📒 Files selected for processing (1)
  • docs/plans/2026-06-25-runner-concurrency.md

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes
    • Increased the default turn-processing concurrency to a higher, consistent value across the app.
    • Ensures runner concurrency settings default consistently and validates key defaults with new unit tests.
    • Added a warning when the runner is configured to use a single worker, since it serializes execution.
  • Documentation
    • Updated configuration and concurrency documentation to reflect the current defaults.
    • Added a guardrail/incident investigation plan to track and prevent future concurrency drift.

Walkthrough

Centralizes the scheduler concurrency default, derives the reborn worker-count default from it, warns when a resolved worker count is 1, updates the worker-count comment, and adds tests plus an incident plan document.

Changes

Runner concurrency guardrails

Layer / File(s) Summary
Canonical scheduler default
crates/ironclaw_host_runtime/src/turn_scheduler.rs, crates/ironclaw_host_runtime/src/lib.rs, crates/ironclaw_host_runtime/src/turn_scheduler/tests.rs, crates/ironclaw_host_runtime/CLAUDE.md
Defines the shared scheduler default constant, uses it in TurnRunSchedulerConfig::default(), re-exports the constant, adds default/concurrency tests, and documents the scheduler-instance and default-consistency invariants.
Derived runtime worker default
crates/ironclaw_reborn/src/runtime.rs, crates/ironclaw_reborn_config/src/config_file.rs
Derives DEFAULT_TURN_RUNNER_WORKER_COUNT from the shared scheduler default, adds a matching-value test, and updates the [runner].worker_count comment to reflect the new default and 1 behavior.
Single-worker warning
crates/ironclaw_reborn_cli/src/runtime/mod.rs
Logs a warning when resolved worker_count is 1 and extends resolver tests to capture the warning and preserve the single-worker setting.
Runner concurrency plan
docs/plans/2026-06-25-runner-concurrency.md
Adds a concurrency incident note with evidence, wiring facts, guardrail targets, implementation steps, non-goals, and expected test outcomes.

Estimated review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • nearai/ironclaw#5085: Introduced the scheduler concurrency control point that this PR centralizes and propagates into runtime defaults.

Suggested reviewers

  • hanakannzashi
  • zmanian

Poem

A semaphore kept the turnstile bright,
One default traveled clean and light.
A warning rang for single-slot rain,
While tests and docs stayed snug in lane.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description is detailed, but it omits required template sections like Change Type, Linked Issue, Security Impact, and the trust-boundary checklist. Add the missing template sections and fill in the required fields, especially Change Type, Linked Issue, Security Impact, Reborn Trust-Boundary Checklist, and Rollback Plan.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the PR’s main change and follows Conventional Commits style.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/plans/2026-06-25-runner-concurrency.md`:
- Line 11: The incident note currently embeds a machine-local absolute path in
the “Hard log evidence” entry, which should be sanitized before committing.
Update the note in the runner concurrency plan to replace the `/Users/henry/...`
reference with a neutral placeholder or a repo-relative artifact reference,
keeping the same evidence context but removing host-specific path details.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0359a6d5-88f8-47e7-9bef-2a2ff18f79fd

📥 Commits

Reviewing files that changed from the base of the PR and between 3cbde9b and 78ec682.

📒 Files selected for processing (8)
  • crates/ironclaw_host_runtime/CLAUDE.md
  • crates/ironclaw_host_runtime/src/lib.rs
  • crates/ironclaw_host_runtime/src/turn_scheduler.rs
  • crates/ironclaw_host_runtime/src/turn_scheduler/tests.rs
  • crates/ironclaw_reborn/src/runtime.rs
  • crates/ironclaw_reborn_cli/src/runtime/mod.rs
  • crates/ironclaw_reborn_config/src/config_file.rs
  • docs/plans/2026-06-25-runner-concurrency.md

Comment thread docs/plans/2026-06-25-runner-concurrency.md Outdated
@railway-app

railway-app Bot commented Jun 25, 2026

Copy link
Copy Markdown

🚅 Deployed to the ironclaw-pr-5224 environment in ironclaw-ci-preview

Service Status Web Updated (UTC)
ironclaw ✅ Success (View Logs) Web Jun 25, 2026 at 8:07 am

@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5224 June 25, 2026 08:00 Destroyed

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Unify the turn-run concurrency default across runtime crates, warn on worker_count=1, and update docs and tests.

Stats: 0 findings (from 1 raw, 0 after aggregation/intent filtering) across 0 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.

No actionable findings survived aggregation. The only raw item questioned whether the new worker_count = 1 startup message should be warn!; I dropped it because the PR body and adjacent comment explicitly make that loud boot-time warning the intended behavior for a legal-but-degenerate concurrency setting.

@henrypark133

Copy link
Copy Markdown
Collaborator Author

Closing — the premise was a misdiagnosis. Deeper log analysis showed production runs were genuinely concurrent (3 runs in-flight simultaneously at 23:30:07–13 in logs.1782348290172), so the single TurnRunnerId is one scheduler instance, not effective-concurrency-1. The "7 runs never started / starvation" symptom was actually the Slack triggered-run delivery re-waiting 30 min on permanently-blocked runs (fixed in #5222), not a scheduler/worker_count problem.

This PR only added config-hygiene guardrails (default 416 reconcile, a worker_count==1 warn, a doc fix) for a foot-gun that did not actually fire in prod — net new code for a non-problem. Closing to keep the change set aligned with the real root cause (the Postgres round-trip/pool cascade). If we later want the Default 4-vs-16 reconcile as standalone hygiene, it can be re-raised as a one-line change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules scope: docs Documentation size: M 50-199 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants