fix(reborn): stop WASM execution from starving the tokio worker pool#5206
fix(reborn): stop WASM execution from starving the tokio worker pool#5206henrypark133 wants to merge 9 commits into
Conversation
During the 2026-06-24 meltdown, a burst of ~40 concurrent turns drove the runtime into a ~4-minute full freeze. WASM tool execution ran the synchronous wasmtime guest call inline on tokio worker threads, and the WASM host HTTP egress / credential restage used block_in_place — so a burst could park every worker, starving the scheduler, trigger poller, and lease heartbeats until the process died and restarted. Changes: - Offload the synchronous wasmtime call to spawn_blocking (mirroring the legacy src/tools/wasm path) so it never parks a tokio worker. The adapter now holds Arc<WitToolRuntime>. - Bound concurrent native WASM execution with a process-wide semaphore (64, well under tokio's 512 blocking-pool ceiling) so a storm queues instead of exhausting the pool. - Remove the block_in_place branches in host.rs and wasm_credentials.rs; always route host egress / credential restage through the existing single-thread side runtimes, which multiplex concurrent async I/O without touching turn workers. - Preserve panic->error mapping across the change and make the credential restage path consistent with the egress path. Surface a static error message on a blocking-task panic instead of the raw JoinError payload. Tested through the real run_wasm_execution_blocking caller: panic mapping + permit release, and shared-semaphore gating. This is PR A of the P1 track. PR B (removing the redundant per-record turn-state lock convoys via a shared CAS helper) is a separate, larger concurrency refactor and follows in its own PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (1)
📝 WalkthroughSummary by CodeRabbit
WalkthroughThe PR makes ChangesRuntime offload wedge
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses WASM worker-pool starvation by offloading synchronous WASM execution to Tokio's blocking thread pool (spawn_blocking) and bounding concurrency with a shared semaphore. It also removes the use of block_in_place for HTTP egress and credential restaging, routing them through dedicated side-runtimes instead. A critical concurrency issue was identified where the semaphore permit is leaked upon cancellation because the permit is not moved into the spawn_blocking closure, which could allow more concurrent tasks to run than the configured limit.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c67338d9f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 1443-1453: The test
`wasm_exec_semaphore_starts_open_at_configured_bound` reads
`WASM_EXEC_SEMAPHORE.available_permits()` without serialization, so it can race
with the async semaphore-draining tests and fail nondeterministically. Update
this test to acquire `SEMAPHORE_TEST_LOCK` before asserting the semaphore state,
matching the other semaphore-related tests in `runtime_adapters.rs`, so the
shared global `WASM_EXEC_SEMAPHORE` is checked under the same test isolation.
- Around line 743-756: The WASM concurrency guard in runtime_adapters::execute
still releases the semaphore permit when the outer async future is canceled,
even though the spawned blocking work keeps running. Move the acquired permit
into the tokio::task::spawn_blocking closure so it stays alive for the full
runtime.execute call, and keep the existing WasmError handling for gate closure
and panics. Add a cancellation regression around the execute path to verify that
dropping the future after the blocking task starts does not bypass
MAX_CONCURRENT_WASM_EXEC.
In `@crates/ironclaw_wasm/tests/wit_tool_runtime_contract.rs`:
- Around line 313-328: Update the contract test for WitToolRuntime so it
exercises the actual Clone implementation instead of only cloning
Arc<WitToolRuntime>. In
cloned_runtime_executes_inside_spawn_blocking_concurrently, create a real cloned
runtime from WitToolRuntime::new(...) and use that cloned runtime inside the
spawn_blocking tasks, while keeping the prepared tool shared as needed. This
will ensure the test fails if the WitToolRuntime: Clone contract regresses.
In `@docs/plans/2026-06-24-p1-runtime-wedge-impl.md`:
- Around line 69-72: The plan wording is too absolute in the PR B summary.
Update the statement in the runtime wedge plan to soften the claim from “Nothing
was found invalid” to language like “no contradicting evidence was found,” and
keep the rest of the framing aligned with the canonical `#5142` approach.
- Around line 141-150: The current test plan only proves the code returns under
multi_thread, which can still pass even if block_in_place is used. Update the
ironclaw_host_runtime / ironclaw_wasm test to assert an observable side-runtime
marker from the actual egress/credential restage path, using dispatch_json as
the caller, so the test fails unless work truly runs on the side runtime and not
the worker thread. Make the assertion specific to the side-runtime execution
path rather than just completion timing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 66b4ca4e-4ac1-464e-9c17-d2ec85bca9b9
📒 Files selected for processing (8)
crates/ironclaw_host_runtime/src/services.rscrates/ironclaw_host_runtime/src/services/runtime_adapters.rscrates/ironclaw_host_runtime/src/wasm_credentials.rscrates/ironclaw_wasm/src/error.rscrates/ironclaw_wasm/src/host.rscrates/ironclaw_wasm/src/runtime.rscrates/ironclaw_wasm/tests/wit_tool_runtime_contract.rsdocs/plans/2026-06-24-p1-runtime-wedge-impl.md
|
🚅 Deployed to the ironclaw-pr-5206 environment in ironclaw-ci-preview
|
henrypark133
left a comment
There was a problem hiding this comment.
Code Review (multi-agent)
Intent: Stop WASM execution from starving the tokio worker pool by offloading synchronous wasmtime calls and bounding concurrency.
Stats: 2 findings (from 2 raw, 2 after dedup) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.
Tests
- Medium Missing panic-path coverage for credential restage on the worker runtime (
crates/ironclaw_host_runtime/src/wasm_credentials.rs:285-322, confidence 73) — anchor:crates/ironclaw_host_runtime/src/wasm_credentials.rs:303
The new worker-runtime restage path maps a panicking restage future toCredentialStageError::Backend, but the tests only cover successful credential restage. A regression here would silently fall back to the closed-channel path or a misleading error.
Maintainability
- Medium Keep WitToolRuntime by value instead of wrapping it in Arc (
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:471-490, confidence 84) — anchor:crates/ironclaw_wasm/src/runtime.rs:15
WitToolRuntimealready documents cheap cloning and shares the underlying wasmtime engine, so the adapter-levelArc<WitToolRuntime>adds another ownership layer and extraArc::cloneplumbing without making the blocking execution path safer.
…-safe The runtime-wedge offload added an `.await` (the spawn_blocking join handle for WASM, the async handler dispatch for first-party) between `governor.reserve()` and `reconcile()/release()`. The turn scheduler's select! heartbeat arm (turn_scheduler.rs) drops the executor future mid-await on user cancel, lease expiry, or heartbeat-store timeout. The detached blocking task ran to completion but reconcile/release never did, and ironclaw_resources has no TTL/sweep for orphaned reservations, so each cancelled turn permanently inflated `reserved_by_account` until the scope hit false resource exhaustion. Also move the WASM execution semaphore permit into the spawn_blocking closure so the slot is held for the real blocking work, not the cancellable outer future (flagged by gemini/coderabbit/codex). Introduce a file-local RAII `ReservationGuard` that releases the reservation on drop unless settled (reconcile / account_failed / disarm). Held across the await, it releases on future-drop, closing the leak. Applied to both `execute_prepared_wasm` and the first-party `dispatch_json` path (same leak shape), deleting the three scattered `release_*`/`account_or_release_*` helpers. Behavior-preserving on every non-cancellation path. Tests: 5 guard unit tests plus a cancellation regression test that times out the dispatch future and asserts the reserved tally returns to baseline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
henrypark133
left a comment
There was a problem hiding this comment.
Code Review (multi-agent)
Intent: Prevent Reborn WASM execution from starving tokio workers by offloading blocking work, bounding concurrency, and removing worker-blocking host calls.
Stats: 3 findings (from 6 raw, 3 after validation) across 1 file. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 1
Performance
- High Cache misses still compile WASM synchronously on worker threads (
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:595-619, confidence 90) — anchor:crates/ironclaw_wasm/src/runtime.rs:50
The PR offloads prepared execution, but the cold-cache path still callsWitToolRuntime::preparebefore reaching that gate.prepareperforms wasmtime component compilation and metadata extraction synchronously, so a cold-start or eviction burst can still pin tokio workers before the new semaphore/spawn_blocking path applies.
Tests
- Medium Handler-error accounting needs a reconcile-failure test (
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:383-413, confidence 82) — anchor:crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:383
The newReservationGuard::account_failedbranch is not covered for a first-party handler error with accountable usage whengovernor.reconcileitself fails. That branch should preserve the original handler error and release the reservation.
Maintainability
- Medium Split the WASM execution machinery out of this adapter file (
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:487-865, confidence 86) — anchor:crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:487(body-only)
runtime_adapters.rsnow owns adapter wiring, the reservation RAII state machine, the semaphore gate, thespawn_blockingbridge, and the WASM execution path. Moving the reservation guard and blocking bridge into a focused sibling module would keep this central adapter easier to scan.
Close the remaining worker-starvation hole from the runtime wedge: on a prepared-tool cache miss, `WasmRuntimeAdapter::dispatch_json` called `WitToolRuntime::prepare` (synchronous wasmtime component compilation) directly on the async worker, so a cold-start or cache-eviction burst re-pinned tokio workers exactly as the execution path used to. Route prepare through the same `spawn_blocking` + `WASM_EXEC_SEMAPHORE` gate as execution via a new `run_wasm_prepare_blocking` helper (permit moved into the closure; compile panics now caught and surfaced as a WasmError instead of unwinding the worker). Drop the redundant `Arc<WitToolRuntime>` wrapper while touching this plumbing: `WitToolRuntime` is a cheap `Clone` (its `Engine` is internally reference-counted, its only other field is a small limits struct), so the adapter holds it by value and hands owned clones to both blocking helpers. Add the missing-coverage tests flagged in review: - run_wasm_prepare_blocking_is_gated_by_shared_semaphore - run_wasm_prepare_blocking_invalid_bytes_maps_to_wasm_error - block_on_credential_restage_maps_panicking_future_to_backend_on_worker_runtime - first_party_adapter_preserves_handler_error_when_account_failed_reconcile_fails Single-flight dedup of concurrent in-flight preparation is intentionally left as a follow-up: the semaphore already bounds concurrent compiles, so this closes the starvation bug; dedup is a separate compute-waste optimization needing its own barrier primitive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs (1)
588-611: 🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy liftDeduplicate in-flight prepares before entering the shared WASM gate.
Line 589 observes an empty cache, but lines 603-611 compile before any in-flight marker is inserted. Concurrent misses for the same module can all take
WASM_EXEC_SEMAPHOREpermits and compile identical bytes, with all but one discarded at lines 613-620. A single cold module can occupy the process-wide gate and delay unrelated hot executions. Add per-cache_keyin-flight prepare sharing, then cover it throughdispatch_json. As per path instructions, “Test through the caller.”Also applies to: 613-620
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs` around lines 588 - 611, The cold-path prepare flow in runtime_adapters::dispatch_json is allowing duplicate compiles because the cache is checked before any in-flight marker exists, so multiple requests for the same cache_key can all enter run_wasm_prepare_blocking and consume the shared WASM gate. Add per-cache_key in-flight prepare sharing/deduplication around the prepared_guard() miss path so only one prepare runs and the rest await it, then keep the cached result handling in dispatch_json and cover the behavior through the caller rather than testing internals directly.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 588-611: The cold-path prepare flow in
runtime_adapters::dispatch_json is allowing duplicate compiles because the cache
is checked before any in-flight marker exists, so multiple requests for the same
cache_key can all enter run_wasm_prepare_blocking and consume the shared WASM
gate. Add per-cache_key in-flight prepare sharing/deduplication around the
prepared_guard() miss path so only one prepare runs and the rest await it, then
keep the cached result handling in dispatch_json and cover the behavior through
the caller rather than testing internals directly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 4de12711-4227-404a-97b9-2970e58d4e7b
📒 Files selected for processing (3)
crates/ironclaw_host_runtime/src/services/runtime_adapters.rscrates/ironclaw_host_runtime/src/services/tests/first_party_runtime_adapter.rscrates/ironclaw_host_runtime/src/wasm_credentials.rs
henrypark133
left a comment
There was a problem hiding this comment.
Code Review (multi-agent)
Intent: Prevent Reborn WASM execution from starving the tokio worker pool by moving blocking work off workers and bounding concurrency.
Stats: 2 findings (from 8 raw, 2 after dedup/manual validation) across 1 file. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0
Bugs
- High Cancelled WASM dispatch can keep running after its reservation is released (
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:848-901, confidence 82) — anchor:crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:860
execute_prepared_wasmnow awaits aspawn_blockingjoin handle while holding theReservationGuardin the outer future. If the turn future is dropped on cancel/lease expiry/timeout, the guard releases the resource reservation, but Tokio does not stop an already-running blocking task; the guest can continue through host HTTP or credential restage after cancellation with no live reservation/accounting.
Performance
- Medium Cold-cache compilation shares the execution semaphore (
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:872-901, confidence 82) — anchor:crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:891
run_wasm_prepare_blockingandrun_wasm_execution_blockingboth acquireWASM_EXEC_SEMAPHORE, so a cache-miss or eviction storm can spend all 64 permits compiling components and make already-prepared executions queue behind cold compile work. That couples the steady-state hot path to cold-start contention.
…tarve execution run_wasm_prepare_blocking and run_wasm_execution_blocking both acquired the shared WASM_EXEC_SEMAPHORE, so a cold-cache compile storm could spend every permit on component compilation and make already-prepared hot executions queue behind cold-start work. Add a dedicated, smaller WASM_PREPARE_SEMAPHORE (a quarter of the execution bound) used only by preparation. The two gates are independent, so their combined blocking-pool usage is still far below tokio's default ceiling, and hot executions no longer contend with compile bursts. Tests: prepare now gated by its own semaphore (run_wasm_prepare_blocking_is_gated_by_prepare_semaphore), plus a decoupling test that a fully-drained execution semaphore does not block preparation (wasm_prepare_is_not_blocked_by_full_execution_semaphore). The separate accounting/cancellation concern (a cancelled turn's detached guest can keep running after its reservation is released) is tracked in #5216 — it needs an owned governor threaded through RuntimeAdapterRequest to co-locate reconcile with execution, out of scope here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 1828-1875: The current regression test only covers the prepare
path, but the inverse invariant is missing: a saturated WASM_PREPARE_SEMAPHORE
must not block already-prepared execution. Add a caller-path regression near
wasm_prepare_is_not_blocked_by_full_execution_semaphore that drains all prepare
permits, then invokes a real prepared WASM execution flow through the existing
caller path and verifies it completes promptly via WASM_EXEC_SEMAPHORE. Use the
relevant runtime helpers and symbols already present in runtime_adapters.rs (for
example WitToolRuntime, run_wasm_prepare_blocking, and the execution entrypoint
used by the caller path) so the test proves hot execution is independent of the
prepare gate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: dedb788f-f829-46d3-a408-799961f29434
📒 Files selected for processing (1)
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs
henrypark133
left a comment
There was a problem hiding this comment.
Code Review (multi-agent)
Intent: Prevent WASM execution from starving tokio workers by moving blocking work off the worker pool and removing inline blocking in host egress and credential restage.
Stats: 4 findings (from 7 raw, 4 after dedupe/filter) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0
Prior-review handling: I did not repost the cancelled-WASM/reservation-accounting concern because the same issue was already accepted and deferred to #5216. I also dropped the partial credential-restage failure because the loop is identical at the base SHA, so it is pre-existing/off-diff rather than introduced by this PR.
maintainability
- Medium Move the WASM execution machinery out of the adapter module (
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:487-865, confidence 86) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:487
This file now mixes adapter orchestration with the reservation RAII state machine, the blocking-pool semaphore gates, and the spawn_blocking bridge for WASM prepare/execute. That packs several unrelated ownership concerns into the central adapter module, so readers have to hold too many concepts at once just to follow dispatch_json.
performance
- Low Guard the semaphore-state assertion with the shared test lock (
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:1908-1918, confidence 93) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:1909
This test reads the process-wideWASM_EXEC_SEMAPHOREstate withoutSEMAPHORE_TEST_LOCK, so it can race with the new async tests that drain the same semaphore and intermittently observe 0 permits. That makes the suite flaky under parallel test execution even though the production code is fine.
local-patterns
- Nit Doc comment narrows the helper to execution only (
crates/ironclaw_wasm/src/error.rs:30-34, confidence 66) — anchor: crates/ironclaw_wasm/src/error.rs:30
The constructor doc says it is for callers that offload execution, but the new blocking-prepare path also uses it. That makes the comment narrower than the actual call sites and will mislead the next maintainer trying to find the right helper for blocking-task failures. - Nit Preparation gate failure reuses the execution gate message (
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:920, confidence 58) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:920
The prepare path reportswasm execution gate closed, which makes prepare and exec contention indistinguishable in logs now that they have separate semaphores. That breaks the local naming pattern established by the neighboring preparation-specific panic message.
Address review nits on the separate WASM prepare/exec gates: - Add wasm_execution_is_not_blocked_by_full_prepare_semaphore — the inverse of the prepare decoupling test: drains WASM_PREPARE_SEMAPHORE and asserts a real prepared execution still completes via the exec gate. - Guard wasm_exec_semaphore_starts_open_at_configured_bound with SEMAPHORE_TEST_LOCK (now async) so it can't race the drain tests. - run_wasm_prepare_blocking gate-closed message now says "wasm preparation gate closed" (distinct from execution in logs). - WasmError::execution_failed doc now describes blocking-offload join/ semaphore failures generically (execution and preparation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n module Pure move, no behavior change. runtime_adapters.rs had grown to mix adapter dispatch selection with the reservation RAII state machine, the blocking-pool semaphore gates, and the spawn_blocking bridge for WASM prepare/execute — too many ownership concerns for one file. Move the semaphore gates, ReservationGuard, execute_prepared_wasm, run_wasm_execution_blocking, run_wasm_prepare_blocking, the WASM-only error/guest helpers, and their tests into a new sibling module services/wasm_execution.rs. ReservationGuard and execute_prepared_wasm are pub(super) since the first-party and WASM adapters (which stay in runtime_adapters.rs) still use them; wasm_error_kind stays put and is imported back. runtime_adapters.rs: 2089 -> 708 lines. No logic, signatures, or messages changed; clippy clean, full lib suite green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs (1)
609-613: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winMake pre-guard reservation release fail loud.
This helper runs before
ReservationGuardowns the reservation.let _ = governor.release(...)silently hides a storage failure and can leave the prepared reservation stuck. Have callers return aResourcedispatch error on release failure, or at minimum emit a sanitized debug diagnostic.As per path instructions, “Fail loud: flag silent-failure patterns.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs` around lines 609 - 613, The release_first_party_reservation helper in runtime_adapters silently drops errors from governor.release, which can hide a failed pre-guard reservation cleanup. Update this path so callers of release_first_party_reservation propagate a Resource dispatch error when release fails, or at minimum add a sanitized debug diagnostic in the same failure branch. Keep the fix anchored around release_first_party_reservation and its call sites that run before ReservationGuard takes ownership.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_host_runtime/src/services/wasm_execution.rs`:
- Around line 121-150: The failed reservation settlement paths in the wasm
execution cleanup are swallowing `ResourceGovernor::reconcile` and `release`
errors, which hides durable storage failures and can leave budget reserved
silently. Update `account_failed` and the reconcile/release branch around
`self.governor.reconcile`/`self.governor.release` to preserve and return the
underlying dispatch error whenever the method already returns `Result`, and in
the `Drop`-based cleanup path emit a sanitized diagnostic that includes the
`reservation_id` when settlement fails. Use the existing `ResourceGovernor`,
`account_failed`, and `reservation_id` symbols to locate and fix the
silent-failure branches.
---
Outside diff comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 609-613: The release_first_party_reservation helper in
runtime_adapters silently drops errors from governor.release, which can hide a
failed pre-guard reservation cleanup. Update this path so callers of
release_first_party_reservation propagate a Resource dispatch error when release
fails, or at minimum add a sanitized debug diagnostic in the same failure
branch. Keep the fix anchored around release_first_party_reservation and its
call sites that run before ReservationGuard takes ownership.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: b3e07c1d-1572-4999-8ff8-72da590c4bd8
📒 Files selected for processing (4)
crates/ironclaw_host_runtime/src/services.rscrates/ironclaw_host_runtime/src/services/runtime_adapters.rscrates/ironclaw_host_runtime/src/services/wasm_execution.rscrates/ironclaw_wasm/src/error.rs
…act test Two CI failures, both false alarms from the round 1-4 refactor — no behavior regression, verified before changing either. 1. no-panics gate flagged the two `const _: () = assert!(...)` bound invariants in wasm_execution.rs. These are const-context assertions evaluated at compile time (they fail the build, never panic at runtime), so they are false positives for a runtime-panic check. Switch them to `static_assertions::const_assert!` — the purpose-built compile-time assertion, already used by sibling crates (ironclaw_host_api/trust/turns) — which is both clearer and not matched by the gate's `assert!` pattern. 2. host_runtime_services_wasm_input_encode_releases_prepared_reservation was a source-grep contract test asserting the literal `release_wasm_reservation(...)` explicit-cleanup branch in runtime_adapters.rs. The round-1 ReservationGuard (RAII) replaced that explicit branch with Drop-based release, and the round-4 extraction moved the code to wasm_execution.rs, so the grep no longer matched. The guarded behavior — an input-encode failure must release the prepared reservation — is preserved and in fact strengthened: the guard releases on *every* early return, not just this branch. Rewrite the test to read wasm_execution.rs and assert the RAII invariant (reservation bound -> ReservationGuard constructed -> input encoded), which is the modern, less-brittle equivalent. The guard's actual Drop/release behavior is covered by unit tests in services::wasm_execution::tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
henrypark133
left a comment
There was a problem hiding this comment.
Code Review (multi-agent)
Intent: Prevent WASM and host runtime execution from starving Tokio workers by offloading blocking work and adding caller-path tests.
Stats: 1 finding (from 3 parsed raw findings, 1 after validation) across 1 file. Reviewers run: security, local-patterns, maintainability, approach. Reviewers failed: bugs, performance, tests, conventions (gpt-5.4-mini capacity after retries). Body-only: 1.
Maintainability
- Low Split the new wasm_execution module before it becomes a monolith (
crates/ironclaw_host_runtime/src/services/wasm_execution.rs:1-1396, confidence 78) - anchor:crates/ironclaw_host_runtime/src/services/wasm_execution.rs:1(body-only)
The new module concentrates reservation RAII, blocking-pool gates, guest-error decoding, and a large embedded WASM test harness into one 1,396-line file. The production path is now isolated fromruntime_adapters.rs, but the test fixture volume still makes this new module harder to scan than it needs to be. Move the large WAT fixtures plus execution-gate tests into the existing services test area or a focusedwasm_execution_testsmodule, keepingwasm_execution.rsfocused on production execution/accounting.
Notes
- I did not repost the cancelled-WASM/accounting-lifetime concern: the prior thread accepted it as valid and deferred it to #5216 with a concrete cross-crate-contract rationale.
- I also did not duplicate the current unresolved CodeRabbit thread about failed reservation settlement logging; that thread already covers the issue directly.
- No fresh blockers from the completed lenses; the remaining finding is cleanup-level.
…g them CodeRabbit (Major): ReservationGuard's reconcile/account_failed/Drop and release_first_party_reservation swallowed ResourceGovernor::reconcile and release errors (Err(_) / let _ = ...). Those come from durable storage, so a failed cleanup could leave scope budget permanently reserved with no signal — a silent-failure pattern banned by .claude/rules/error-handling.md. Surface the cause: the explicit settlement paths and the Drop safety net now emit a sanitized tracing::warn! keyed by reservation_id when reconcile or release fails, matching the existing release_adapter_reservation precedent. Behavior is otherwise unchanged — the reservation is still released on reconcile failure (covered by first_party_adapter_releases_reservation_when_reconcile_fails_after_success and first_party_adapter_preserves_handler_error_when_account_failed_reconcile_fails); the dispatch boundary still returns the sanitized DispatchError. Drop cannot return a Result, so logging is the only available signal there. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs (1)
428-432: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winDo not log raw governor release errors.
Both changed release-failure paths emit
error = %.... That can leak backend/storage details from host-runtime logs. Keep the warning andreservation_id, but omit raw error display or map it to a sanitized kind.As per coding guidelines,
crates/ironclaw_host_runtime/**/*.rsmust not include backend error details in logs. Based on learnings, reservation settlement failures should be logged in sanitized form keyed byreservation_id.Also applies to: 613-618
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs` around lines 428 - 432, The release-failure logging in the runtime adapter should stop emitting raw governor errors, since the current `request.governor.release(...)` paths log backend details via `error = %...`. Update the warning sites in `runtime_adapters.rs` (including the reconcile-failure release path and the other release-failure block mentioned in the review) to keep `reservation_id` and the warning message, but remove direct error display or replace it with a sanitized error kind/status derived from the `release_error` in the relevant function.Sources: Coding guidelines, Learnings
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_host_runtime/src/services/wasm_execution.rs`:
- Around line 127-133: Sanitize the settlement warnings in wasm_execution.rs by
removing raw backend error details from the tracing::warn! calls in the
reservation reconciliation/release guard. Update the Err(error) branches in the
relevant settlement flow so they log only the reservation_id and a generic
failure message, without %error or other internal ResourceGovernor details.
Apply the same sanitized pattern consistently to the other release/reconcile
warning sites referenced in this guard.
---
Outside diff comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 428-432: The release-failure logging in the runtime adapter should
stop emitting raw governor errors, since the current
`request.governor.release(...)` paths log backend details via `error = %...`.
Update the warning sites in `runtime_adapters.rs` (including the
reconcile-failure release path and the other release-failure block mentioned in
the review) to keep `reservation_id` and the warning message, but remove direct
error display or replace it with a sanitized error kind/status derived from the
`release_error` in the relevant function.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: db9a37c6-715f-45f2-b4d9-ebca4e0b5302
📒 Files selected for processing (2)
crates/ironclaw_host_runtime/src/services/runtime_adapters.rscrates/ironclaw_host_runtime/src/services/wasm_execution.rs
Problem
During the 2026-06-24 Reborn meltdown, a burst of ~40 concurrent turns drove the runtime into a ~4-minute total freeze (every log emitter silent 19:51:05→19:54:59), after which the process restarted with mass
lease_expired. Root cause amplifier (Reborn-only):execute_prepared_wasm→WitToolRuntime::execute, nospawn_blocking).host.rs) and credential restage (wasm_credentials.rs) usedblock_in_place(|| handle.block_on(..)).When concurrent WASM calls reach the worker count, every worker parks → the scheduler, trigger poller, and lease heartbeats get no CPU → leases can't renew → wedge.
Fix (PR A — worker-pool de-wedge)
spawn_blockingthe synchronous wasmtime call (mirrors the correct legacysrc/tools/wasm/wrapper.rspattern). Adapter now holdsArc<WitToolRuntime>. AJoinError(panic) maps to aWasmErrorexecution failure with a static message (no panic-payload leak).Semaphore(64)(compile-asserted> 0 && < 512, tokio's default blocking-pool ceiling) gates native WASM execution so a storm queues instead of exhausting the pool. Permit held for the blocking task's lifetime, released on all paths incl. panic.block_in_place: host egress / credential restage always route through the existing single-thread side runtimes (WASM_HTTP_EGRESS_RUNTIME/WASM_CREDENTIAL_RESTAGE_RUNTIME), which multiplex concurrent async I/O without parking turn workers. Panic→error mapping preserved and made consistent across both paths.Tests / quality gate
cargo clippy -p ironclaw_host_runtime -p ironclaw_wasm --all-targets— 0 warnings.cargo test -p ironclaw_host_runtime -p ironclaw_wasm— all pass except 3 pre-existingsandbox_process::tests::*that require a Docker socket (SocketNotFoundError("/var/run/docker.sock")), unrelated to this change (sandbox_process.rsis untouched).run_wasm_execution_blocking: panic→ExecutionFailed+ permit release; and shared-semaphore gating (drains all 64 permits, confirms the real path queues until a permit frees).A note on the side runtimes
Code review flagged the single-thread side runtimes as a potential serialization bottleneck. That finding misreads async semantics: a 1-thread tokio runtime multiplexes many concurrent async I/O futures (it does not serialize them), the goal of protecting the 16 turn workers is achieved, and blocking-pool occupancy is bounded by the semaphore (64 ≪ 512). Those runtimes also pre-existed; this PR only routes to them universally. Left as-is by design; flagged here for reviewer visibility.
Scope
PR A of the P1 track. PR B — removing the redundant per-record turn-state lock convoys (
run_state/threads/resources/secrets) via one shared CAS helper, plus theironclaw_secretsArc→Weakmap leak — is a larger, concurrency-sensitive refactor and follows as a separate PR. Seedocs/plans/2026-06-24-p1-runtime-wedge-impl.md.🤖 Generated with Claude Code