fix(reborn): reconcile scheduler concurrency default and guard worker_count=1 by henrypark133 · Pull Request #5224 · nearai/ironclaw

henrypark133 · 2026-06-25T07:05:31Z

Root cause: NOT a scheduler concurrency regression, and NOT a `worker_count=1` config foot-gun

This started as an investigation of a production report — "triggered runs starving; 7 of 23 never reached turn run started; a single TurnRunnerId serving runs serially → effective concurrency ≈ 1."

The log evidence disproves the starvation hypothesis. From logs.1782348290172.log:

The scheduler runs runs concurrently — three runs overlap in-flight in multiple windows:
- 23:30 — 81de47f4 (07.45→16.85), 6200c910 (10.59→23.70), 9461329b (13.01→24.67) all in flight ~23:30:13–16.
- Same 3-way overlap at 00:00 and 00:30.
The single TurnRunnerId(ce43288a…) is one scheduler instance (one process), by design — it is not a per-run worker. Concurrency is genuinely > 1, so production is not running at worker_count=1.
The "… did not finish before Slack delivery timeout" / "never started" lines all originate in ironclaw_reborn_composition::slack_delivery, not the scheduler. Real runs reach BlockedApproval within ~7s (5e46d384: started 23:20:32, blocked 23:20:39, finished 23:20:40). The Slack triggered-run delivery delivers the first gate, then re-waits on the now-permanently-blocked run for the full DEFAULT_TRIGGERED_RUN_DELIVERY_MAX_WAIT (30 min) and finally reports Failed — the timeout fires exactly 30 min after the run blocked (5e46d384 blocked 23:20:39 → wait-failed 23:50:45). The "orphan" run_ids are prior-cycle / duplicate delivery waits.

That Slack-delivery availability behavior is a separate track and is intentionally not touched here.

What this PR ships

Since the scheduler is healthy, there is no behavioral fix. This PR ships only the genuine, independently-verified config-hygiene gaps uncovered while verifying:

Reconcile the divergent default. TurnRunSchedulerConfig::default().max_concurrent_runs was the literal 4, while production overrides with worker_count (default 16) — so any Default-only caller silently under-provisioned. Introduce ironclaw_host_runtime::DEFAULT_MAX_CONCURRENT_RUNS = 16 as the single source of truth; Default uses it; ironclaw_reborn::DEFAULT_TURN_RUNNER_WORKER_COUNT now derives from it (legal one-way import), making the two equal by construction — no literal duplication, no drift.
Guard the degenerate value. The CLI resolver (runner_settings) now emits a startup warn! when worker_count resolves to 1 (legal, but serializes all runs through one slot). The operator's explicit value is honoured, not silently overridden.
Fix the stale doc. [runner].worker_count "defaults to 4" → corrected to 16.
Record the invariant in crates/ironclaw_host_runtime/CLAUDE.md so a future agent does not re-introduce a divergent default or assume the scheduler is serial.

Tests (red→green verified on the unpatched base)

ironclaw_host_runtime: default_config_uses_canonical_max_concurrent_runs (asserts 16; on base Default was 4 → red), plus with_max_concurrent_runs floor boundaries (0→1, 1→1).
ironclaw_reborn: worker_count_default_matches_scheduler_default (equals the scheduler constant and is > 1).
ironclaw_reborn_cli: drives the real runner_settings resolver — asserts a WARN fires for worker_count = 1 (removing the warn → red) and does not fire for a normal value, per "Test Through the Caller".

Quality gate

cargo fmt --all — clean.
cargo clippy -p ironclaw_host_runtime -p ironclaw_reborn_cli -p ironclaw_reborn -p ironclaw_reborn_config --all-targets --all-features — 0 warnings.
All new/touched tests pass. (3 pre-existing sandbox_process tests fail locally only because no Docker daemon is present — unrelated to this diff.)

Review

Ran the multi-agent /code-review skill. The strongest finding (derive the worker-count const from the scheduler const instead of duplicating the literal 16 + pinning with a test) was applied — it is strictly better and removes the drift failure mode entirely. The stale-doc and two test-coverage findings were also addressed.

🤖 Generated with Claude Code

…_count=1 Investigation of a production "triggered runs starving / single runner serving serially" report. The log evidence disproves the starvation hypothesis: the TurnRunScheduler runs runs concurrently (three runs overlap in-flight at 23:30, 00:00 and 00:30 in the captured logs), and the single TurnRunnerId is one scheduler instance by design, not a per-run worker. Production was not running at worker_count=1. The "never started" / "did not finish before Slack delivery timeout" lines originate in the Slack triggered-run delivery layer (runs reach BlockedApproval within ~7s; the delivery re-waits on the permanently-blocked run for the full 30-minute max_wait), which is a separate availability track. No scheduler behavioral change is warranted. This commit ships only the genuine, independently-verified config-hygiene gaps found while verifying: - TurnRunSchedulerConfig::default().max_concurrent_runs was the literal 4 while production used 16, so any Default-only caller silently under-provisioned. Introduce ironclaw_host_runtime::DEFAULT_MAX_CONCURRENT_RUNS (= 16) as the single source of truth and have Default use it. ironclaw_reborn's DEFAULT_TURN_RUNNER_WORKER_COUNT now derives from that constant (the import is legal one-way), so the scheduler cap and worker-count default are equal by construction. - The CLI resolver (runner_settings) now emits a startup warn when worker_count resolves to 1, which is legal but serializes all runs through one slot. The operator's explicit value is honoured, not silently overridden. - Fix the stale "defaults to 4" doc on [runner].worker_count (real default 16). - Record the scheduler concurrency invariant in the host_runtime CLAUDE.md spec. Tests (red->green verified): - ironclaw_host_runtime: Default equals DEFAULT_MAX_CONCURRENT_RUNS (was 4 -> red), plus with_max_concurrent_runs floor boundaries. - ironclaw_reborn: worker-count default equals the scheduler constant and is >1. - ironclaw_reborn_cli: the real resolver warns on worker_count=1 and does NOT warn on a normal value, driving runner_settings per "Test Through the Caller". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-06-25T07:05:36Z

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

coderabbitai · 2026-06-25T07:07:52Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 91b0d633-f000-4ac8-be98-9d695fa76ca4

📥 Commits

Reviewing files that changed from the base of the PR and between 78ec682 and 36b6a15.

📒 Files selected for processing (1)

docs/plans/2026-06-25-runner-concurrency.md

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Increased the default turn-processing concurrency to a higher, consistent value across the app.
- Ensures runner concurrency settings default consistently and validates key defaults with new unit tests.
- Added a warning when the runner is configured to use a single worker, since it serializes execution.
Documentation
- Updated configuration and concurrency documentation to reflect the current defaults.
- Added a guardrail/incident investigation plan to track and prevent future concurrency drift.

Walkthrough

Centralizes the scheduler concurrency default, derives the reborn worker-count default from it, warns when a resolved worker count is 1, updates the worker-count comment, and adds tests plus an incident plan document.

Changes

Runner concurrency guardrails

Layer / File(s)	Summary
Canonical scheduler default `crates/ironclaw_host_runtime/src/turn_scheduler.rs`, `crates/ironclaw_host_runtime/src/lib.rs`, `crates/ironclaw_host_runtime/src/turn_scheduler/tests.rs`, `crates/ironclaw_host_runtime/CLAUDE.md`	Defines the shared scheduler default constant, uses it in `TurnRunSchedulerConfig::default()`, re-exports the constant, adds default/concurrency tests, and documents the scheduler-instance and default-consistency invariants.
Derived runtime worker default `crates/ironclaw_reborn/src/runtime.rs`, `crates/ironclaw_reborn_config/src/config_file.rs`	Derives `DEFAULT_TURN_RUNNER_WORKER_COUNT` from the shared scheduler default, adds a matching-value test, and updates the `[runner].worker_count` comment to reflect the new default and `1` behavior.
Single-worker warning `crates/ironclaw_reborn_cli/src/runtime/mod.rs`	Logs a warning when resolved `worker_count` is `1` and extends resolver tests to capture the warning and preserve the single-worker setting.
Runner concurrency plan `docs/plans/2026-06-25-runner-concurrency.md`	Adds a concurrency incident note with evidence, wiring facts, guardrail targets, implementation steps, non-goals, and expected test outcomes.

Estimated review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

nearai/ironclaw#5085: Introduced the scheduler concurrency control point that this PR centralizes and propagates into runtime defaults.

Suggested reviewers

hanakannzashi
zmanian

Poem

A semaphore kept the turnstile bright,
One default traveled clean and light.
A warning rang for single-slot rain,
While tests and docs stayed snug in lane.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description is detailed, but it omits required template sections like Change Type, Linked Issue, Security Impact, and the trust-boundary checklist.	Add the missing template sections and fill in the required fields, especially Change Type, Linked Issue, Security Impact, Reborn Trust-Boundary Checklist, and Rollback Plan.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title matches the PR’s main change and follows Conventional Commits style.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/plans/2026-06-25-runner-concurrency.md`:
- Line 11: The incident note currently embeds a machine-local absolute path in
the “Hard log evidence” entry, which should be sanitized before committing.
Update the note in the runner concurrency plan to replace the `/Users/henry/...`
reference with a neutral placeholder or a repo-relative artifact reference,
keeping the same evidence context but removing host-specific path details.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0359a6d5-88f8-47e7-9bef-2a2ff18f79fd

📥 Commits

Reviewing files that changed from the base of the PR and between 3cbde9b and 78ec682.

📒 Files selected for processing (8)

crates/ironclaw_host_runtime/CLAUDE.md
crates/ironclaw_host_runtime/src/lib.rs
crates/ironclaw_host_runtime/src/turn_scheduler.rs
crates/ironclaw_host_runtime/src/turn_scheduler/tests.rs
crates/ironclaw_reborn/src/runtime.rs
crates/ironclaw_reborn_cli/src/runtime/mod.rs
crates/ironclaw_reborn_config/src/config_file.rs
docs/plans/2026-06-25-runner-concurrency.md

railway-app · 2026-06-25T07:30:25Z

🚅 Deployed to the ironclaw-pr-5224 environment in ironclaw-ci-preview

Service	Status	Web	Updated (UTC)
ironclaw	✅ Success (View Logs)	Web	Jun 25, 2026 at 8:07 am

henrypark133

Code Review (multi-agent)

Intent: Unify the turn-run concurrency default across runtime crates, warn on worker_count=1, and update docs and tests.

Stats: 0 findings (from 1 raw, 0 after aggregation/intent filtering) across 0 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.

No actionable findings survived aggregation. The only raw item questioned whether the new worker_count = 1 startup message should be warn!; I dropped it because the PR body and adjacent comment explicitly make that loud boot-time warning the intended behavior for a legal-but-degenerate concurrency setting.

henrypark133 · 2026-06-25T16:56:15Z

Closing — the premise was a misdiagnosis. Deeper log analysis showed production runs were genuinely concurrent (3 runs in-flight simultaneously at 23:30:07–13 in logs.1782348290172), so the single TurnRunnerId is one scheduler instance, not effective-concurrency-1. The "7 runs never started / starvation" symptom was actually the Slack triggered-run delivery re-waiting 30 min on permanently-blocked runs (fixed in #5222), not a scheduler/worker_count problem.

This PR only added config-hygiene guardrails (default 4→16 reconcile, a worker_count==1 warn, a doc fix) for a foot-gun that did not actually fire in prod — net new code for a non-problem. Closing to keep the change set aligned with the real root cause (the Postgres round-trip/pool cascade). If we later want the Default 4-vs-16 reconcile as standalone hygiene, it can be re-raised as a one-line change.

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5224 June 25, 2026 07:05 Destroyed

github-actions Bot added scope: docs Documentation size: M 50-199 changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread docs/plans/2026-06-25-runner-concurrency.md Outdated

docs(reborn): sanitize runner concurrency note

36b6a15

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5224 June 25, 2026 08:00 Destroyed

coderabbitai Bot mentioned this pull request Jun 25, 2026

fix(reborn): release Slack admission permit once inbound is durably accepted #5225

Merged

henrypark133 commented Jun 25, 2026

View reviewed changes

henrypark133 closed this Jun 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(reborn): reconcile scheduler concurrency default and guard worker_count=1#5224

fix(reborn): reconcile scheduler concurrency default and guard worker_count=1#5224
henrypark133 wants to merge 2 commits into
mainfrom
fix/reborn-runner-concurrency

henrypark133 commented Jun 25, 2026

Uh oh!

gemini-code-assist Bot commented Jun 25, 2026

Uh oh!

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading

Summary by CodeRabbit

Walkthrough

Changes

Estimated review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

railway-app Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

henrypark133 left a comment

Uh oh!

henrypark133 commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

henrypark133 commented Jun 25, 2026

Root cause: NOT a scheduler concurrency regression, and NOT a worker_count=1 config foot-gun

What this PR ships

Tests (red→green verified on the unpatched base)

Quality gate

Review

Uh oh!

gemini-code-assist Bot commented Jun 25, 2026

Uh oh!

coderabbitai Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

railway-app Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

henrypark133 left a comment

Choose a reason for hiding this comment

Code Review (multi-agent)

Uh oh!

henrypark133 commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Root cause: NOT a scheduler concurrency regression, and NOT a `worker_count=1` config foot-gun

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading

railway-app Bot commented Jun 25, 2026 •

edited

Loading