fix(reborn): reconcile scheduler concurrency default and guard worker_count=1 #5224
henrypark133 wants to merge 2 commits into
…_count=1

Investigation of a production "triggered runs starving / single runner serving serially" report. The log evidence disproves the starvation hypothesis: the TurnRunScheduler executes runs concurrently (three runs overlap in-flight at 23:30, 00:00 and 00:30 in the captured logs), and the single TurnRunnerId is one scheduler instance by design, not a per-run worker. Production was not running at worker_count=1.

The "never started" / "did not finish before Slack delivery timeout" lines originate in the Slack triggered-run delivery layer (runs reach BlockedApproval within ~7s; the delivery re-waits on the permanently-blocked run for the full 30-minute max_wait), which is a separate availability track. No scheduler behavioral change is warranted.

This commit ships only the genuine, independently-verified config-hygiene gaps found while verifying:

- TurnRunSchedulerConfig::default().max_concurrent_runs was the literal 4 while production used 16, so any Default-only caller silently under-provisioned. Introduce ironclaw_host_runtime::DEFAULT_MAX_CONCURRENT_RUNS (= 16) as the single source of truth and have Default use it. ironclaw_reborn's DEFAULT_TURN_RUNNER_WORKER_COUNT now derives from that constant (the import is legal one-way), so the scheduler cap and worker-count default are equal by construction.
- The CLI resolver (runner_settings) now emits a startup warn when worker_count resolves to 1, which is legal but serializes all runs through one slot. The operator's explicit value is honoured, not silently overridden.
- Fix the stale "defaults to 4" doc on [runner].worker_count (real default 16).
- Record the scheduler concurrency invariant in the host_runtime CLAUDE.md spec.

Tests (red->green verified):

- ironclaw_host_runtime: Default equals DEFAULT_MAX_CONCURRENT_RUNS (was 4 -> red), plus with_max_concurrent_runs floor boundaries.
- ironclaw_reborn: worker-count default equals the scheduler constant and is >1.
- ironclaw_reborn_cli: the real resolver warns on worker_count=1 and does NOT warn on a normal value, driving runner_settings per "Test Through the Caller".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
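The "single source of truth" arrangement described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual crate code: the `usize` type, the one-field struct layout, and the builder signature are assumptions, and both constants are shown in one file although the PR places them in separate crates.

```rust
/// Canonical scheduler cap (value 16 taken from the commit message).
pub const DEFAULT_MAX_CONCURRENT_RUNS: usize = 16;

/// Derived rather than duplicated, so the two defaults cannot drift apart.
/// In the PR this constant lives in `ironclaw_reborn` and imports the one above.
pub const DEFAULT_TURN_RUNNER_WORKER_COUNT: usize = DEFAULT_MAX_CONCURRENT_RUNS;

/// Assumed shape: the real config may carry more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TurnRunSchedulerConfig {
    pub max_concurrent_runs: usize,
}

impl Default for TurnRunSchedulerConfig {
    fn default() -> Self {
        // Default reads the constant instead of repeating a literal.
        Self {
            max_concurrent_runs: DEFAULT_MAX_CONCURRENT_RUNS,
        }
    }
}

impl TurnRunSchedulerConfig {
    /// Floors the cap at 1 (0 -> 1, 1 -> 1), matching the boundaries the tests name.
    pub fn with_max_concurrent_runs(mut self, n: usize) -> Self {
        self.max_concurrent_runs = n.max(1);
        self
    }
}
```

With this layout, changing the scheduler cap changes the worker-count default in the same edit, which is the drift failure mode the commit sets out to remove.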
Walkthrough
Centralizes the scheduler concurrency default, derives the reborn worker-count default from it, warns when a resolved worker count is …
Changes
Runner concurrency guardrails
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/plans/2026-06-25-runner-concurrency.md`:
- Line 11: The incident note currently embeds a machine-local absolute path in
the “Hard log evidence” entry, which should be sanitized before committing.
Update the note in the runner concurrency plan to replace the `/Users/henry/...`
reference with a neutral placeholder or a repo-relative artifact reference,
keeping the same evidence context but removing host-specific path details.
📒 Files selected for processing (8)
- crates/ironclaw_host_runtime/CLAUDE.md
- crates/ironclaw_host_runtime/src/lib.rs
- crates/ironclaw_host_runtime/src/turn_scheduler.rs
- crates/ironclaw_host_runtime/src/turn_scheduler/tests.rs
- crates/ironclaw_reborn/src/runtime.rs
- crates/ironclaw_reborn_cli/src/runtime/mod.rs
- crates/ironclaw_reborn_config/src/config_file.rs
- docs/plans/2026-06-25-runner-concurrency.md
🚅 Deployed to the ironclaw-pr-5224 environment in ironclaw-ci-preview
henrypark133 left a comment
Code Review (multi-agent)
Intent: Unify the turn-run concurrency default across runtime crates, warn on worker_count=1, and update docs and tests.
Stats: 0 findings (from 1 raw, 0 after aggregation/intent filtering) across 0 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.
No actionable findings survived aggregation. The only raw item questioned whether the new worker_count = 1 startup message should be warn!; I dropped it because the PR body and adjacent comment explicitly make that loud boot-time warning the intended behavior for a legal-but-degenerate concurrency setting.
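The "honour the operator's value, but warn loudly" behavior discussed above might look something like the sketch below. The function name `resolve_runner_settings`, the `RunnerSettings` struct, and the returned warning string are hypothetical; the real `runner_settings` resolver logs through `warn!` and its signature is not shown in this PR. Returning the message keeps the example free of logging dependencies.

```rust
/// Hypothetical resolved settings; the real resolver's return type is not shown here.
#[derive(Debug, PartialEq, Eq)]
pub struct RunnerSettings {
    pub worker_count: usize,
}

/// Assumed default, mirroring the value 16 stated in the PR.
const DEFAULT_WORKER_COUNT: usize = 16;

/// Resolves the worker count and returns an optional warning for the caller to log.
/// The configured value is never overridden; a value of 1 only produces a message.
pub fn resolve_runner_settings(configured: Option<usize>) -> (RunnerSettings, Option<String>) {
    let worker_count = configured.unwrap_or(DEFAULT_WORKER_COUNT);
    let warning = (worker_count == 1).then(|| {
        "worker_count=1 is legal but serializes all runs through one slot".to_string()
    });
    (RunnerSettings { worker_count }, warning)
}
```

A caller would log the returned message at warn level during startup and proceed with the settings unchanged, which is also what makes the behavior testable through the resolver itself.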
Closing: the premise was a misdiagnosis. Deeper log analysis showed production runs were genuinely concurrent (3 runs in-flight simultaneously at 23:30:07–13 in …). This PR only added config-hygiene guardrails (default …
Root cause: NOT a scheduler concurrency regression, and NOT a `worker_count=1` config foot-gun

This started as an investigation of a production report: "triggered runs starving; 7 of 23 never reached `turn run started`; a single `TurnRunnerId` serving runs serially → effective concurrency ≈ 1."

The log evidence disproves the starvation hypothesis. From `logs.1782348290172.log`:

- `81de47f4` (07.45→16.85), `6200c910` (10.59→23.70), `9461329b` (13.01→24.67) all in flight ~23:30:13–16.
- `TurnRunnerId` (`ce43288a…`) is one scheduler instance (one process), by design; it is not a per-run worker. Concurrency is genuinely > 1, so production is not running at `worker_count=1`.
- The "… did not finish before Slack delivery timeout" / "never started" lines all originate in `ironclaw_reborn_composition::slack_delivery`, not the scheduler. Real runs reach `BlockedApproval` within ~7s (`5e46d384`: started 23:20:32, blocked 23:20:39, finished 23:20:40). The Slack triggered-run delivery delivers the first gate, then re-waits on the now-permanently-blocked run for the full `DEFAULT_TRIGGERED_RUN_DELIVERY_MAX_WAIT` (30 min) and finally reports `Failed`: the timeout fires about 30 min after the run blocked (`5e46d384` blocked 23:20:39 → wait-failed 23:50:45). The "orphan" run_ids are prior-cycle / duplicate delivery waits.

That Slack-delivery availability behavior is a separate track and is intentionally not touched here.
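The overlap claim can be checked mechanically from the three (start, end) pairs quoted above. The sketch below is an illustration only: it assumes the figures are seconds past 23:30:00, and the function names are hypothetical, not part of the codebase.

```rust
/// Peak number of simultaneously in-flight runs, given (start, end) pairs.
pub fn max_in_flight(runs: &[(f64, f64)]) -> usize {
    // One +1 event per start and one -1 event per end.
    let mut events: Vec<(f64, i32)> = Vec::with_capacity(runs.len() * 2);
    for &(start, end) in runs {
        events.push((start, 1));
        events.push((end, -1));
    }
    // Sort by time; at equal times ends sort first, so back-to-back runs
    // are not counted as overlapping.
    events.sort_by(|a, b| a.0.total_cmp(&b.0).then(a.1.cmp(&b.1)));

    let mut current = 0i32;
    let mut peak = 0i32;
    for (_, delta) in events {
        current += delta;
        peak = peak.max(current);
    }
    peak as usize
}

/// Window during which every run is in flight, if such a window exists.
pub fn common_window(runs: &[(f64, f64)]) -> Option<(f64, f64)> {
    let start = runs.iter().map(|r| r.0).fold(f64::NEG_INFINITY, f64::max);
    let end = runs.iter().map(|r| r.1).fold(f64::INFINITY, f64::min);
    (!runs.is_empty() && start < end).then_some((start, end))
}
```

Fed the three intervals from the log, `max_in_flight` yields 3 and `common_window` yields 13.01 to 16.85, consistent with the "~23:30:13–16" window stated above. A strictly serial runner would never exceed a peak of 1.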
What this PR ships

Since the scheduler is healthy, there is no behavioral fix. This PR ships only the genuine, independently-verified config-hygiene gaps uncovered while verifying:

- `TurnRunSchedulerConfig::default().max_concurrent_runs` was the literal `4`, while production overrides with `worker_count` (default `16`), so any `Default`-only caller silently under-provisioned. Introduce `ironclaw_host_runtime::DEFAULT_MAX_CONCURRENT_RUNS = 16` as the single source of truth; `Default` uses it; `ironclaw_reborn::DEFAULT_TURN_RUNNER_WORKER_COUNT` now derives from it (legal one-way import), making the two equal by construction: no literal duplication, no drift.
- The CLI resolver (`runner_settings`) now emits a startup `warn!` when `worker_count` resolves to `1` (legal, but serializes all runs through one slot). The operator's explicit value is honoured, not silently overridden.
- The stale `[runner].worker_count` doc ("defaults to 4") is corrected to 16.
- The scheduler concurrency invariant is recorded in `crates/ironclaw_host_runtime/CLAUDE.md` so a future agent does not re-introduce a divergent default or assume the scheduler is serial.

Tests (red→green verified on the unpatched base)

- `ironclaw_host_runtime`: `default_config_uses_canonical_max_concurrent_runs` (asserts `16`; on base `Default` was `4` → red), plus `with_max_concurrent_runs` floor boundaries (`0→1`, `1→1`).
- `ironclaw_reborn`: `worker_count_default_matches_scheduler_default` (equals the scheduler constant and is `> 1`).
- `ironclaw_reborn_cli`: drives the real `runner_settings` resolver; asserts a `WARN` fires for `worker_count = 1` (removing the warn → red) and does not fire for a normal value, per "Test Through the Caller".

Quality gate

- `cargo fmt --all`: clean.
- `cargo clippy -p ironclaw_host_runtime -p ironclaw_reborn_cli -p ironclaw_reborn -p ironclaw_reborn_config --all-targets --all-features`: 0 warnings.
- (`sandbox_process` tests fail locally only because no Docker daemon is present; this is unrelated to this diff.)

Review

Ran the multi-agent `/code-review` skill. The strongest finding (derive the worker-count const from the scheduler const instead of duplicating the literal `16` + pinning with a test) was applied; it is strictly better and removes the drift failure mode entirely. The stale-doc and two test-coverage findings were also addressed.

🤖 Generated with Claude Code