Skip to content

fix(reborn): stop WASM execution from starving the tokio worker pool#5206

Open
henrypark133 wants to merge 9 commits into
mainfrom
fix/reborn-p1-runtime-wedge
Open

fix(reborn): stop WASM execution from starving the tokio worker pool#5206
henrypark133 wants to merge 9 commits into
mainfrom
fix/reborn-p1-runtime-wedge

Conversation

@henrypark133

Copy link
Copy Markdown
Collaborator

Problem

During the 2026-06-24 Reborn meltdown, a burst of ~40 concurrent turns drove the runtime into a ~4-minute total freeze (every log emitter silent 19:51:05→19:54:59), after which the process restarted with mass lease_expired. Root cause amplifier (Reborn-only):

  • WASM tool execution ran the synchronous wasmtime guest call inline on a tokio worker thread (execute_prepared_wasmWitToolRuntime::execute, no spawn_blocking).
  • WASM host HTTP egress (host.rs) and credential restage (wasm_credentials.rs) used block_in_place(|| handle.block_on(..)).

When concurrent WASM calls reach the worker count, every worker parks → the scheduler, trigger poller, and lease heartbeats get no CPU → leases can't renew → wedge.

Fix (PR A — worker-pool de-wedge)

  • spawn_blocking the synchronous wasmtime call (mirrors the correct legacy src/tools/wasm/wrapper.rs pattern). Adapter now holds Arc<WitToolRuntime>. A JoinError (panic) maps to a WasmError execution failure with a static message (no panic-payload leak).
  • Bounded concurrency: a process-wide Semaphore(64) (compile-asserted > 0 && < 512, tokio's default blocking-pool ceiling) gates native WASM execution so a storm queues instead of exhausting the pool. Permit held for the blocking task's lifetime, released on all paths incl. panic.
  • Remove block_in_place: host egress / credential restage always route through the existing single-thread side runtimes (WASM_HTTP_EGRESS_RUNTIME / WASM_CREDENTIAL_RESTAGE_RUNTIME), which multiplex concurrent async I/O without parking turn workers. Panic→error mapping preserved and made consistent across both paths.

Tests / quality gate

  • cargo clippy -p ironclaw_host_runtime -p ironclaw_wasm --all-targets0 warnings.
  • cargo test -p ironclaw_host_runtime -p ironclaw_wasm — all pass except 3 pre-existing sandbox_process::tests::* that require a Docker socket (SocketNotFoundError("/var/run/docker.sock")), unrelated to this change (sandbox_process.rs is untouched).
  • New caller-path tests (per the repo "Test Through the Caller" rule) driving the real run_wasm_execution_blocking: panic→ExecutionFailed + permit release; and shared-semaphore gating (drains all 64 permits, confirms the real path queues until a permit frees).

A note on the side runtimes

Code review flagged the single-thread side runtimes as a potential serialization bottleneck. That finding misreads async semantics: a 1-thread tokio runtime multiplexes many concurrent async I/O futures (it does not serialize them), the goal of protecting the 16 turn workers is achieved, and blocking-pool occupancy is bounded by the semaphore (64 ≪ 512). Those runtimes also pre-existed; this PR only routes to them universally. Left as-is by design; flagged here for reviewer visibility.

Scope

PR A of the P1 track. PR B — removing the redundant per-record turn-state lock convoys (run_state / threads / resources / secrets) via one shared CAS helper, plus the ironclaw_secrets ArcWeak map leak — is a larger, concurrency-sensitive refactor and follows as a separate PR. See docs/plans/2026-06-24-p1-runtime-wedge-impl.md.

🤖 Generated with Claude Code

During the 2026-06-24 meltdown, a burst of ~40 concurrent turns drove the
runtime into a ~4-minute full freeze. WASM tool execution ran the
synchronous wasmtime guest call inline on tokio worker threads, and the
WASM host HTTP egress / credential restage used block_in_place — so a
burst could park every worker, starving the scheduler, trigger poller,
and lease heartbeats until the process died and restarted.

Changes:
- Offload the synchronous wasmtime call to spawn_blocking (mirroring the
  legacy src/tools/wasm path) so it never parks a tokio worker. The
  adapter now holds Arc<WitToolRuntime>.
- Bound concurrent native WASM execution with a process-wide semaphore
  (64, well under tokio's 512 blocking-pool ceiling) so a storm queues
  instead of exhausting the pool.
- Remove the block_in_place branches in host.rs and wasm_credentials.rs;
  always route host egress / credential restage through the existing
  single-thread side runtimes, which multiplex concurrent async I/O
  without touching turn workers.
- Preserve panic->error mapping across the change and make the credential
  restage path consistent with the egress path. Surface a static error
  message on a blocking-task panic instead of the raw JoinError payload.

Tested through the real run_wasm_execution_blocking caller: panic mapping
+ permit release, and shared-semaphore gating.

This is PR A of the P1 track. PR B (removing the redundant per-record
turn-state lock convoys via a shared CAS helper) is a separate, larger
concurrency refactor and follows in its own PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5206 June 25, 2026 01:39 Destroyed
@github-actions github-actions Bot added scope: docs Documentation size: XL 500+ changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Jun 25, 2026
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: c7cddcb9-6d41-45fb-9c31-8ea3fbb46860

📥 Commits

Reviewing files that changed from the base of the PR and between 339b682 and 4822003.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock, !**/Cargo.lock
📒 Files selected for processing (1)
  • crates/ironclaw_host_runtime/Cargo.toml

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added bounded, semaphore-gated concurrency for WASM component preparation vs execution.
    • Made WitToolRuntime cloneable to support safe concurrent usage.
  • Bug Fixes
    • Improved panic/join-failure handling and error mapping across WASM execution, egress, and credential restaging.
    • Hardened reserved-resource lifecycle so reservations are properly released or reconciled on cancellation and failure paths.
  • Tests
    • Added regression tests for concurrent cloned-runtime execution and for reservation cleanup on cancellation/reconcile failures.
  • Documentation
    • Added an implementation plan for “P1 Reborn Runtime Wedge.”

Walkthrough

The PR makes WitToolRuntime cloneable, exposes WasmError::execution_failed, routes egress and credential restage through watchdog workers, adds semaphore-gated WASM blocking helpers with reservation-guard settlement, expands regression tests, and adds a runtime wedge implementation plan.

Changes

Runtime offload wedge

Layer / File(s) Summary
Public runtime contracts and clone test
crates/ironclaw_host_runtime/src/services.rs, crates/ironclaw_wasm/src/error.rs, crates/ironclaw_wasm/src/runtime.rs, crates/ironclaw_wasm/tests/wit_tool_runtime_contract.rs
WitToolRuntime becomes Clone, WasmError::execution_failed becomes public, the host service import list includes WitToolExecution, and a concurrent spawn-blocking runtime test is added.
Egress and restage watchdogs
crates/ironclaw_wasm/src/host.rs, crates/ironclaw_host_runtime/src/wasm_credentials.rs
Current-runtime egress and credential restage paths stop using block_in_place; both now route through worker tasks that await join handles and map panics or join failures into backend errors.
Shared WASM execution engine
crates/ironclaw_host_runtime/src/services.rs, crates/ironclaw_host_runtime/src/services/wasm_execution.rs, crates/ironclaw_host_runtime/Cargo.toml, crates/ironclaw_host_runtime/tests/host_runtime_services_contract.rs
Adds the shared WASM execution module with semaphore gating, reservation-guard settlement, blocking offload helpers, guest error classification, and regression tests for execution and preparation behavior.
Reservation guard and dispatch wiring
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs, crates/ironclaw_host_runtime/src/services/tests/first_party_runtime_adapter.rs
First-party dispatch now uses ReservationGuard for cancellation-safe settlement, and prepared WASM dispatch switches to shared async blocking helpers with updated reconciliation and error mapping paths.
Runtime wedge implementation plan
docs/plans/2026-06-24-p1-runtime-wedge-impl.md
Adds the P1 runtime wedge plan, including scope, verification, PR ordering, PR A steps, PR B follow-up notes, quality gates, and out-of-scope items.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Suggested reviewers

  • think-in-universe
  • aiworkbot

Poem

A semaphore stood watch at night,
While watchdogs kept the joins in sight.
Clone the runtime, let guards hold fast,
And settle reservations clean at last.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description is detailed, but it misses most required template sections, including change type, linked issue, security, trust-boundary, rollback, and review track. Reformat to the repository template and add the missing sections: Summary bullets, Change Type, Linked Issue, Validation, Security Impact, Trust-Boundary Checklist, Database Impact, Blast Radius, Rollback Plan, Review Follow-Through, and Review track.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the PR's main fix and uses Conventional Commits style with the reborn scope.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses WASM worker-pool starvation by offloading synchronous WASM execution to Tokio's blocking thread pool (spawn_blocking) and bounding concurrency with a shared semaphore. It also removes the use of block_in_place for HTTP egress and credential restaging, routing them through dedicated side-runtimes instead. A critical concurrency issue was identified where the semaphore permit is leaked upon cancellation because the permit is not moved into the spawn_blocking closure, which could allow more concurrent tasks to run than the configured limit.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c67338d9f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 1443-1453: The test
`wasm_exec_semaphore_starts_open_at_configured_bound` reads
`WASM_EXEC_SEMAPHORE.available_permits()` without serialization, so it can race
with the async semaphore-draining tests and fail nondeterministically. Update
this test to acquire `SEMAPHORE_TEST_LOCK` before asserting the semaphore state,
matching the other semaphore-related tests in `runtime_adapters.rs`, so the
shared global `WASM_EXEC_SEMAPHORE` is checked under the same test isolation.
- Around line 743-756: The WASM concurrency guard in runtime_adapters::execute
still releases the semaphore permit when the outer async future is canceled,
even though the spawned blocking work keeps running. Move the acquired permit
into the tokio::task::spawn_blocking closure so it stays alive for the full
runtime.execute call, and keep the existing WasmError handling for gate closure
and panics. Add a cancellation regression around the execute path to verify that
dropping the future after the blocking task starts does not bypass
MAX_CONCURRENT_WASM_EXEC.

In `@crates/ironclaw_wasm/tests/wit_tool_runtime_contract.rs`:
- Around line 313-328: Update the contract test for WitToolRuntime so it
exercises the actual Clone implementation instead of only cloning
Arc<WitToolRuntime>. In
cloned_runtime_executes_inside_spawn_blocking_concurrently, create a real cloned
runtime from WitToolRuntime::new(...) and use that cloned runtime inside the
spawn_blocking tasks, while keeping the prepared tool shared as needed. This
will ensure the test fails if the WitToolRuntime: Clone contract regresses.

In `@docs/plans/2026-06-24-p1-runtime-wedge-impl.md`:
- Around line 69-72: The plan wording is too absolute in the PR B summary.
Update the statement in the runtime wedge plan to soften the claim from “Nothing
was found invalid” to language like “no contradicting evidence was found,” and
keep the rest of the framing aligned with the canonical `#5142` approach.
- Around line 141-150: The current test plan only proves the code returns under
multi_thread, which can still pass even if block_in_place is used. Update the
ironclaw_host_runtime / ironclaw_wasm test to assert an observable side-runtime
marker from the actual egress/credential restage path, using dispatch_json as
the caller, so the test fails unless work truly runs on the side runtime and not
the worker thread. Make the assertion specific to the side-runtime execution
path rather than just completion timing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 66b4ca4e-4ac1-464e-9c17-d2ec85bca9b9

📥 Commits

Reviewing files that changed from the base of the PR and between 4f28feb and 3c67338.

📒 Files selected for processing (8)
  • crates/ironclaw_host_runtime/src/services.rs
  • crates/ironclaw_host_runtime/src/services/runtime_adapters.rs
  • crates/ironclaw_host_runtime/src/wasm_credentials.rs
  • crates/ironclaw_wasm/src/error.rs
  • crates/ironclaw_wasm/src/host.rs
  • crates/ironclaw_wasm/src/runtime.rs
  • crates/ironclaw_wasm/tests/wit_tool_runtime_contract.rs
  • docs/plans/2026-06-24-p1-runtime-wedge-impl.md

Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated
Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated
Comment thread crates/ironclaw_wasm/tests/wit_tool_runtime_contract.rs
Comment thread docs/plans/2026-06-24-p1-runtime-wedge-impl.md
Comment thread docs/plans/2026-06-24-p1-runtime-wedge-impl.md
@railway-app

railway-app Bot commented Jun 25, 2026

Copy link
Copy Markdown

🚅 Deployed to the ironclaw-pr-5206 environment in ironclaw-ci-preview

Service Status Web Updated (UTC)
ironclaw ✅ Success (View Logs) Web Jun 25, 2026 at 6:26 pm

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Stop WASM execution from starving the tokio worker pool by offloading synchronous wasmtime calls and bounding concurrency.

Stats: 2 findings (from 2 raw, 2 after dedup) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.

Tests

  1. Medium Missing panic-path coverage for credential restage on the worker runtime (crates/ironclaw_host_runtime/src/wasm_credentials.rs:285-322, confidence 73) — anchor: crates/ironclaw_host_runtime/src/wasm_credentials.rs:303
    The new worker-runtime restage path maps a panicking restage future to CredentialStageError::Backend, but the tests only cover successful credential restage. A regression here would silently fall back to the closed-channel path or a misleading error.

Maintainability

  1. Medium Keep WitToolRuntime by value instead of wrapping it in Arc (crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:471-490, confidence 84) — anchor: crates/ironclaw_wasm/src/runtime.rs:15
    WitToolRuntime already documents cheap cloning and shares the underlying wasmtime engine, so the adapter-level Arc<WitToolRuntime> adds another ownership layer and extra Arc::clone plumbing without making the blocking execution path safer.

Comment thread crates/ironclaw_host_runtime/src/wasm_credentials.rs
Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated
…-safe

The runtime-wedge offload added an `.await` (the spawn_blocking join
handle for WASM, the async handler dispatch for first-party) between
`governor.reserve()` and `reconcile()/release()`. The turn scheduler's
select! heartbeat arm (turn_scheduler.rs) drops the executor future
mid-await on user cancel, lease expiry, or heartbeat-store timeout. The
detached blocking task ran to completion but reconcile/release never
did, and ironclaw_resources has no TTL/sweep for orphaned reservations,
so each cancelled turn permanently inflated `reserved_by_account` until
the scope hit false resource exhaustion.

Also move the WASM execution semaphore permit into the spawn_blocking
closure so the slot is held for the real blocking work, not the
cancellable outer future (flagged by gemini/coderabbit/codex).

Introduce a file-local RAII `ReservationGuard` that releases the
reservation on drop unless settled (reconcile / account_failed /
disarm). Held across the await, it releases on future-drop, closing the
leak. Applied to both `execute_prepared_wasm` and the first-party
`dispatch_json` path (same leak shape), deleting the three scattered
`release_*`/`account_or_release_*` helpers. Behavior-preserving on every
non-cancellation path.

Tests: 5 guard unit tests plus a cancellation regression test that
times out the dispatch future and asserts the reserved tally returns to
baseline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5206 June 25, 2026 04:31 Destroyed

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Prevent Reborn WASM execution from starving tokio workers by offloading blocking work, bounding concurrency, and removing worker-blocking host calls.

Stats: 3 findings (from 6 raw, 3 after validation) across 1 file. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 1

Performance

  1. High Cache misses still compile WASM synchronously on worker threads (crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:595-619, confidence 90) — anchor: crates/ironclaw_wasm/src/runtime.rs:50
    The PR offloads prepared execution, but the cold-cache path still calls WitToolRuntime::prepare before reaching that gate. prepare performs wasmtime component compilation and metadata extraction synchronously, so a cold-start or eviction burst can still pin tokio workers before the new semaphore/spawn_blocking path applies.

Tests

  1. Medium Handler-error accounting needs a reconcile-failure test (crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:383-413, confidence 82) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:383
    The new ReservationGuard::account_failed branch is not covered for a first-party handler error with accountable usage when governor.reconcile itself fails. That branch should preserve the original handler error and release the reservation.

Maintainability

  1. Medium Split the WASM execution machinery out of this adapter file (crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:487-865, confidence 86) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:487 (body-only)
    runtime_adapters.rs now owns adapter wiring, the reservation RAII state machine, the semaphore gate, the spawn_blocking bridge, and the WASM execution path. Moving the reservation guard and blocking bridge into a focused sibling module would keep this central adapter easier to scan.

Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated
Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs
Close the remaining worker-starvation hole from the runtime wedge: on a
prepared-tool cache miss, `WasmRuntimeAdapter::dispatch_json` called
`WitToolRuntime::prepare` (synchronous wasmtime component compilation)
directly on the async worker, so a cold-start or cache-eviction burst
re-pinned tokio workers exactly as the execution path used to. Route
prepare through the same `spawn_blocking` + `WASM_EXEC_SEMAPHORE` gate as
execution via a new `run_wasm_prepare_blocking` helper (permit moved into
the closure; compile panics now caught and surfaced as a WasmError
instead of unwinding the worker).

Drop the redundant `Arc<WitToolRuntime>` wrapper while touching this
plumbing: `WitToolRuntime` is a cheap `Clone` (its `Engine` is internally
reference-counted, its only other field is a small limits struct), so the
adapter holds it by value and hands owned clones to both blocking
helpers.

Add the missing-coverage tests flagged in review:
- run_wasm_prepare_blocking_is_gated_by_shared_semaphore
- run_wasm_prepare_blocking_invalid_bytes_maps_to_wasm_error
- block_on_credential_restage_maps_panicking_future_to_backend_on_worker_runtime
- first_party_adapter_preserves_handler_error_when_account_failed_reconcile_fails

Single-flight dedup of concurrent in-flight preparation is intentionally
left as a follow-up: the semaphore already bounds concurrent compiles, so
this closes the starvation bug; dedup is a separate compute-waste
optimization needing its own barrier primitive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs (1)

588-611: 🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy lift

Deduplicate in-flight prepares before entering the shared WASM gate.

Line 589 observes an empty cache, but lines 603-611 compile before any in-flight marker is inserted. Concurrent misses for the same module can all take WASM_EXEC_SEMAPHORE permits and compile identical bytes, with all but one discarded at lines 613-620. A single cold module can occupy the process-wide gate and delay unrelated hot executions. Add per-cache_key in-flight prepare sharing, then cover it through dispatch_json. As per path instructions, “Test through the caller.”

Also applies to: 613-620

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs` around lines
588 - 611, The cold-path prepare flow in runtime_adapters::dispatch_json is
allowing duplicate compiles because the cache is checked before any in-flight
marker exists, so multiple requests for the same cache_key can all enter
run_wasm_prepare_blocking and consume the shared WASM gate. Add per-cache_key
in-flight prepare sharing/deduplication around the prepared_guard() miss path so
only one prepare runs and the rest await it, then keep the cached result
handling in dispatch_json and cover the behavior through the caller rather than
testing internals directly.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 588-611: The cold-path prepare flow in
runtime_adapters::dispatch_json is allowing duplicate compiles because the cache
is checked before any in-flight marker exists, so multiple requests for the same
cache_key can all enter run_wasm_prepare_blocking and consume the shared WASM
gate. Add per-cache_key in-flight prepare sharing/deduplication around the
prepared_guard() miss path so only one prepare runs and the rest await it, then
keep the cached result handling in dispatch_json and cover the behavior through
the caller rather than testing internals directly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 4de12711-4227-404a-97b9-2970e58d4e7b

📥 Commits

Reviewing files that changed from the base of the PR and between 4c52e37 and f7b298c.

📒 Files selected for processing (3)
  • crates/ironclaw_host_runtime/src/services/runtime_adapters.rs
  • crates/ironclaw_host_runtime/src/services/tests/first_party_runtime_adapter.rs
  • crates/ironclaw_host_runtime/src/wasm_credentials.rs

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Prevent Reborn WASM execution from starving the tokio worker pool by moving blocking work off workers and bounding concurrency.
Stats: 2 findings (from 8 raw, 2 after dedup/manual validation) across 1 file. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0

Bugs

  1. High Cancelled WASM dispatch can keep running after its reservation is released (crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:848-901, confidence 82) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:860
    execute_prepared_wasm now awaits a spawn_blocking join handle while holding the ReservationGuard in the outer future. If the turn future is dropped on cancel/lease expiry/timeout, the guard releases the resource reservation, but Tokio does not stop an already-running blocking task; the guest can continue through host HTTP or credential restage after cancellation with no live reservation/accounting.

Performance

  1. Medium Cold-cache compilation shares the execution semaphore (crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:872-901, confidence 82) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:891
    run_wasm_prepare_blocking and run_wasm_execution_blocking both acquire WASM_EXEC_SEMAPHORE, so a cache-miss or eviction storm can spend all 64 permits compiling components and make already-prepared executions queue behind cold compile work. That couples the steady-state hot path to cold-start contention.

Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated
Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated
…tarve execution

run_wasm_prepare_blocking and run_wasm_execution_blocking both acquired
the shared WASM_EXEC_SEMAPHORE, so a cold-cache compile storm could spend
every permit on component compilation and make already-prepared hot
executions queue behind cold-start work.

Add a dedicated, smaller WASM_PREPARE_SEMAPHORE (a quarter of the
execution bound) used only by preparation. The two gates are independent,
so their combined blocking-pool usage is still far below tokio's default
ceiling, and hot executions no longer contend with compile bursts.

Tests: prepare now gated by its own semaphore
(run_wasm_prepare_blocking_is_gated_by_prepare_semaphore), plus a
decoupling test that a fully-drained execution semaphore does not block
preparation (wasm_prepare_is_not_blocked_by_full_execution_semaphore).

The separate accounting/cancellation concern (a cancelled turn's detached
guest can keep running after its reservation is released) is tracked
in #5216 — it needs an owned governor threaded through RuntimeAdapterRequest
to co-locate reconcile with execution, out of scope here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5206 June 25, 2026 05:39 Destroyed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 1828-1875: The current regression test only covers the prepare
path, but the inverse invariant is missing: a saturated WASM_PREPARE_SEMAPHORE
must not block already-prepared execution. Add a caller-path regression near
wasm_prepare_is_not_blocked_by_full_execution_semaphore that drains all prepare
permits, then invokes a real prepared WASM execution flow through the existing
caller path and verifies it completes promptly via WASM_EXEC_SEMAPHORE. Use the
relevant runtime helpers and symbols already present in runtime_adapters.rs (for
example WitToolRuntime, run_wasm_prepare_blocking, and the execution entrypoint
used by the caller path) so the test proves hot execution is independent of the
prepare gate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: dedb788f-f829-46d3-a408-799961f29434

📥 Commits

Reviewing files that changed from the base of the PR and between f7b298c and 311e9b9.

📒 Files selected for processing (1)
  • crates/ironclaw_host_runtime/src/services/runtime_adapters.rs

Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Prevent WASM execution from starving tokio workers by moving blocking work off the worker pool and removing inline blocking in host egress and credential restage.
Stats: 4 findings (from 7 raw, 4 after dedupe/filter) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0

Prior-review handling: I did not repost the cancelled-WASM/reservation-accounting concern because the same issue was already accepted and deferred to #5216. I also dropped the partial credential-restage failure because the loop is identical at the base SHA, so it is pre-existing/off-diff rather than introduced by this PR.

maintainability

  1. Medium Move the WASM execution machinery out of the adapter module (crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:487-865, confidence 86) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:487
    This file now mixes adapter orchestration with the reservation RAII state machine, the blocking-pool semaphore gates, and the spawn_blocking bridge for WASM prepare/execute. That packs several unrelated ownership concerns into the central adapter module, so readers have to hold too many concepts at once just to follow dispatch_json.

performance

  1. Low Guard the semaphore-state assertion with the shared test lock (crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:1908-1918, confidence 93) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:1909
    This test reads the process-wide WASM_EXEC_SEMAPHORE state without SEMAPHORE_TEST_LOCK, so it can race with the new async tests that drain the same semaphore and intermittently observe 0 permits. That makes the suite flaky under parallel test execution even though the production code is fine.

local-patterns

  1. Nit Doc comment narrows the helper to execution only (crates/ironclaw_wasm/src/error.rs:30-34, confidence 66) — anchor: crates/ironclaw_wasm/src/error.rs:30
    The constructor doc says it is for callers that offload execution, but the new blocking-prepare path also uses it. That makes the comment narrower than the actual call sites and will mislead the next maintainer trying to find the right helper for blocking-task failures.
  2. Nit Preparation gate failure reuses the execution gate message (crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:920, confidence 58) — anchor: crates/ironclaw_host_runtime/src/services/runtime_adapters.rs:920
    The prepare path reports wasm execution gate closed, which makes prepare and exec contention indistinguishable in logs now that they have separate semaphores. That breaks the local naming pattern established by the neighboring preparation-specific panic message.

Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated
Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated
Comment thread crates/ironclaw_wasm/src/error.rs
Comment thread crates/ironclaw_host_runtime/src/services/runtime_adapters.rs Outdated
henrypark133 and others added 2 commits June 24, 2026 23:09
Address review nits on the separate WASM prepare/exec gates:
- Add wasm_execution_is_not_blocked_by_full_prepare_semaphore — the
  inverse of the prepare decoupling test: drains WASM_PREPARE_SEMAPHORE
  and asserts a real prepared execution still completes via the exec gate.
- Guard wasm_exec_semaphore_starts_open_at_configured_bound with
  SEMAPHORE_TEST_LOCK (now async) so it can't race the drain tests.
- run_wasm_prepare_blocking gate-closed message now says "wasm
  preparation gate closed" (distinct from execution in logs).
- WasmError::execution_failed doc now describes blocking-offload join/
  semaphore failures generically (execution and preparation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n module

Pure move, no behavior change. runtime_adapters.rs had grown to mix
adapter dispatch selection with the reservation RAII state machine, the
blocking-pool semaphore gates, and the spawn_blocking bridge for WASM
prepare/execute — too many ownership concerns for one file.

Move the semaphore gates, ReservationGuard, execute_prepared_wasm,
run_wasm_execution_blocking, run_wasm_prepare_blocking, the WASM-only
error/guest helpers, and their tests into a new sibling module
services/wasm_execution.rs. ReservationGuard and execute_prepared_wasm
are pub(super) since the first-party and WASM adapters (which stay in
runtime_adapters.rs) still use them; wasm_error_kind stays put and is
imported back.

runtime_adapters.rs: 2089 -> 708 lines. No logic, signatures, or
messages changed; clippy clean, full lib suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5206 June 25, 2026 06:23 Destroyed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs (1)

609-613: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Make pre-guard reservation release fail loud.

This helper runs before ReservationGuard owns the reservation. let _ = governor.release(...) silently hides a storage failure and can leave the prepared reservation stuck. Have callers return a Resource dispatch error on release failure, or at minimum emit a sanitized debug diagnostic.

As per path instructions, “Fail loud: flag silent-failure patterns.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs` around lines
609 - 613, The release_first_party_reservation helper in runtime_adapters
silently drops errors from governor.release, which can hide a failed pre-guard
reservation cleanup. Update this path so callers of
release_first_party_reservation propagate a Resource dispatch error when release
fails, or at minimum add a sanitized debug diagnostic in the same failure
branch. Keep the fix anchored around release_first_party_reservation and its
call sites that run before ReservationGuard takes ownership.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_host_runtime/src/services/wasm_execution.rs`:
- Around line 121-150: The failed reservation settlement paths in the wasm
execution cleanup are swallowing `ResourceGovernor::reconcile` and `release`
errors, which hides durable storage failures and can leave budget reserved
silently. Update `account_failed` and the reconcile/release branch around
`self.governor.reconcile`/`self.governor.release` to preserve and return the
underlying dispatch error whenever the method already returns `Result`, and in
the `Drop`-based cleanup path emit a sanitized diagnostic that includes the
`reservation_id` when settlement fails. Use the existing `ResourceGovernor`,
`account_failed`, and `reservation_id` symbols to locate and fix the
silent-failure branches.

---

Outside diff comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 609-613: The release_first_party_reservation helper in
runtime_adapters silently drops errors from governor.release, which can hide a
failed pre-guard reservation cleanup. Update this path so callers of
release_first_party_reservation propagate a Resource dispatch error when release
fails, or at minimum add a sanitized debug diagnostic in the same failure
branch. Keep the fix anchored around release_first_party_reservation and its
call sites that run before ReservationGuard takes ownership.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: b3e07c1d-1572-4999-8ff8-72da590c4bd8

📥 Commits

Reviewing files that changed from the base of the PR and between 311e9b9 and f1b420f.

📒 Files selected for processing (4)
  • crates/ironclaw_host_runtime/src/services.rs
  • crates/ironclaw_host_runtime/src/services/runtime_adapters.rs
  • crates/ironclaw_host_runtime/src/services/wasm_execution.rs
  • crates/ironclaw_wasm/src/error.rs

Comment thread crates/ironclaw_host_runtime/src/services/wasm_execution.rs Outdated
…act test

Two CI failures, both false alarms from the round 1-4 refactor — no
behavior regression, verified before changing either.

1. no-panics gate flagged the two `const _: () = assert!(...)` bound
   invariants in wasm_execution.rs. These are const-context assertions
   evaluated at compile time (they fail the build, never panic at
   runtime), so they are false positives for a runtime-panic check.
   Switch them to `static_assertions::const_assert!` — the purpose-built
   compile-time assertion, already used by sibling crates
   (ironclaw_host_api/trust/turns) — which is both clearer and not
   matched by the gate's `assert!` pattern.

2. host_runtime_services_wasm_input_encode_releases_prepared_reservation
   was a source-grep contract test asserting the literal
   `release_wasm_reservation(...)` explicit-cleanup branch in
   runtime_adapters.rs. The round-1 ReservationGuard (RAII) replaced that
   explicit branch with Drop-based release, and the round-4 extraction
   moved the code to wasm_execution.rs, so the grep no longer matched.
   The guarded behavior — an input-encode failure must release the
   prepared reservation — is preserved and in fact strengthened: the
   guard releases on *every* early return, not just this branch. Rewrite
   the test to read wasm_execution.rs and assert the RAII invariant
   (reservation bound -> ReservationGuard constructed -> input encoded),
   which is the modern, less-brittle equivalent. The guard's actual
   Drop/release behavior is covered by unit tests in
   services::wasm_execution::tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5206 June 25, 2026 07:00 Destroyed
@github-actions github-actions Bot added the scope: dependencies Dependency updates label Jun 25, 2026

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Prevent WASM and host runtime execution from starving Tokio workers by offloading blocking work and adding caller-path tests.

Stats: 1 finding (from 3 parsed raw findings, 1 after validation) across 1 file. Reviewers run: security, local-patterns, maintainability, approach. Reviewers failed: bugs, performance, tests, conventions (gpt-5.4-mini capacity after retries). Body-only: 1.

Maintainability

  1. Low Split the new wasm_execution module before it becomes a monolith (crates/ironclaw_host_runtime/src/services/wasm_execution.rs:1-1396, confidence 78) - anchor: crates/ironclaw_host_runtime/src/services/wasm_execution.rs:1 (body-only)
    The new module concentrates reservation RAII, blocking-pool gates, guest-error decoding, and a large embedded WASM test harness into one 1,396-line file. The production path is now isolated from runtime_adapters.rs, but the test fixture volume still makes this new module harder to scan than it needs to be. Move the large WAT fixtures plus execution-gate tests into the existing services test area or a focused wasm_execution_tests module, keeping wasm_execution.rs focused on production execution/accounting.

Notes

  • I did not repost the cancelled-WASM/accounting-lifetime concern: the prior thread accepted it as valid and deferred it to #5216 with a concrete cross-crate-contract rationale.
  • I also did not duplicate the current unresolved CodeRabbit thread about failed reservation settlement logging; that thread already covers the issue directly.
  • No fresh blockers from the completed lenses; the remaining finding is cleanup-level.

…g them

CodeRabbit (Major): ReservationGuard's reconcile/account_failed/Drop and
release_first_party_reservation swallowed ResourceGovernor::reconcile and
release errors (Err(_) / let _ = ...). Those come from durable storage, so
a failed cleanup could leave scope budget permanently reserved with no
signal — a silent-failure pattern banned by .claude/rules/error-handling.md.

Surface the cause: the explicit settlement paths and the Drop safety net
now emit a sanitized tracing::warn! keyed by reservation_id when reconcile
or release fails, matching the existing release_adapter_reservation
precedent. Behavior is otherwise unchanged — the reservation is still
released on reconcile failure (covered by
first_party_adapter_releases_reservation_when_reconcile_fails_after_success
and first_party_adapter_preserves_handler_error_when_account_failed_reconcile_fails);
the dispatch boundary still returns the sanitized DispatchError. Drop
cannot return a Result, so logging is the only available signal there.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5206 June 25, 2026 16:40 Destroyed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/ironclaw_host_runtime/src/services/runtime_adapters.rs (1)

428-432: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Do not log raw governor release errors.

Both changed release-failure paths emit error = %.... That can leak backend/storage details from host-runtime logs. Keep the warning and reservation_id, but omit raw error display or map it to a sanitized kind.

As per coding guidelines, crates/ironclaw_host_runtime/**/*.rs must not include backend error details in logs. Based on learnings, reservation settlement failures should be logged in sanitized form keyed by reservation_id.

Also applies to: 613-618

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs` around lines
428 - 432, The release-failure logging in the runtime adapter should stop
emitting raw governor errors, since the current `request.governor.release(...)`
paths log backend details via `error = %...`. Update the warning sites in
`runtime_adapters.rs` (including the reconcile-failure release path and the
other release-failure block mentioned in the review) to keep `reservation_id`
and the warning message, but remove direct error display or replace it with a
sanitized error kind/status derived from the `release_error` in the relevant
function.

Sources: Coding guidelines, Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_host_runtime/src/services/wasm_execution.rs`:
- Around line 127-133: Sanitize the settlement warnings in wasm_execution.rs by
removing raw backend error details from the tracing::warn! calls in the
reservation reconciliation/release guard. Update the Err(error) branches in the
relevant settlement flow so they log only the reservation_id and a generic
failure message, without %error or other internal ResourceGovernor details.
Apply the same sanitized pattern consistently to the other release/reconcile
warning sites referenced in this guard.

---

Outside diff comments:
In `@crates/ironclaw_host_runtime/src/services/runtime_adapters.rs`:
- Around line 428-432: The release-failure logging in the runtime adapter should
stop emitting raw governor errors, since the current
`request.governor.release(...)` paths log backend details via `error = %...`.
Update the warning sites in `runtime_adapters.rs` (including the
reconcile-failure release path and the other release-failure block mentioned in
the review) to keep `reservation_id` and the warning message, but remove direct
error display or replace it with a sanitized error kind/status derived from the
`release_error` in the relevant function.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: db9a37c6-715f-45f2-b4d9-ebca4e0b5302

📥 Commits

Reviewing files that changed from the base of the PR and between 0ba2fd3 and 339b682.

📒 Files selected for processing (2)
  • crates/ironclaw_host_runtime/src/services/runtime_adapters.rs
  • crates/ironclaw_host_runtime/src/services/wasm_execution.rs

Comment thread crates/ironclaw_host_runtime/src/services/wasm_execution.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules scope: dependencies Dependency updates scope: docs Documentation size: XL 500+ changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant