fix(reborn): release Slack admission permit once inbound is durably accepted#5225
Conversation
…ccepted The immediate-ACK webhook path acquired an admission permit (the max_in_flight intake semaphore, capacity 64 for Slack) and held it for the entire spawned post-ACK task — including the observer delivery poll. For Slack that observer (SlackFinalReplyDeliveryObserver) polls the submitted run for its final reply for up to max_wait (default 120s) inside observe_workflow_ack, well outside the ~2s workflow_timeout that bounds submit_inbound. Net effect: each turn pinned an admission slot for up to ~120s even though submit_inbound durably accepted the run in ~2s. With 64 slow turns, every intake slot is exhausted and new inbound webhooks are rejected TooManyInFlight (HTTP 429) for the whole delivery window; under sustained load Slack exhausts its bounded retries and user messages are silently lost. Fix: decouple admission from delivery. Once submit_inbound returns a durable ack (ProductInboundAck::is_durable_outcome), drop the permit before invoking the observer, so admission gates only the fast intake window. Non-durable acks and error/timeout paths retain the permit until scope end so admission still backpressures un-accepted intake. The unbounded delivery wait is bounded by its own mechanism on the delivery side (semaphore / single-flight guard / max_wait), unchanged. Red->green coverage: admission_released_after_durable_accept_not_held_ across_delivery (fails TooManyInFlight on base, passes after fix) plus admission_retained_across_observer_for_non_durable_ack pinning the is_durable_outcome guard. Guardrail rule added to the crate CLAUDE.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📝 WalkthroughSummary by CodeRabbit
WalkthroughThe PR changes immediate-ACK admission handling so durable inbound acknowledgements release the intake permit before post-ACK observation, while non-durable outcomes keep it held. It also adds a regression plan, a blocking observer harness, and tests for both durable release and retained backpressure. ChangesSlack admission permit lifetime
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e9d10631d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if ack.is_durable_outcome() { | ||
| drop(permit.take()); |
There was a problem hiding this comment.
Bound pending post-ACK delivery tasks separately
When a durable ack comes back, this drops the only runner-level bound before calling the observer. I checked the Slack observer path: observe_workflow_ack waits on delivery_permits.acquire_owned() in slack_delivery.rs:1385, while the separate max_pending_deliveries guard exists only for TriggeredRunDeliveryDriver (slack_delivery.rs:1839-1941), not for webhook observer tasks. Under sustained accepted Slack webhooks with slow final-reply polls, everything beyond the 64 delivery permits is still spawned into immediate_ack_tasks and can wait up to the 120s max_wait with no pending cap, trading 429s for unbounded task/envelope accumulation; add a delivery-side pending permit/queue limit for this observer path before releasing admission or before entering the observer.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request addresses a critical bug where the admission permit was held across unbounded downstream waits, leading to potential webhook rejection under load. The fix ensures that the permit is released early once the inbound is durably accepted, and includes regression tests to verify the behavior. The reviewer suggested restructuring the newly added architectural guideline in CLAUDE.md into a bulleted list to improve readability.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| - Runtime-specific WIT/request shapes may live here, but shared product-surface DTOs stay in `ironclaw_product_adapters`. | ||
| - ProductAdapter components may use v1-style minimal WASI p2 for wasm32-wasip2 compatibility: clock/random are allowed; env, args, stdio, preopened directories, inherited network, and DNS lookup must stay disabled. Current slice is parse/render-only: the ProductAdapter WIT `http-egress` import fails closed until host-runtime egress wiring is injected in a follow-up. | ||
| - Keep dependencies minimal. Avoid workflow/runtime crates unless a concrete host-glue call site requires them and architecture tests are updated deliberately. | ||
| - Admission/intake backpressure (the `max_in_flight` permit in `runner.rs`/`runner_immediate_ack.rs`) gates fast intake only — auth, parse, stamp, and `submit_inbound`. An admission/intake permit must NOT be held across an unbounded downstream wait (post-ACK delivery poll, final-reply wait, LLM poll). Release it once the work is durably accepted (`ProductInboundAck::is_durable_outcome`) — before invoking the post-ACK observer — and bound the downstream work with its own mechanism (the delivery-side semaphore/single-flight guard/`max_wait`). Conflating admission with work duration lets a handful of slow turns exhaust every intake slot and silently reject new inbound webhooks under load. |
There was a problem hiding this comment.
This is an important rule to add. For better readability and to make it easier to digest as a guideline, consider restructuring this paragraph into a main bullet point with sub-bullets. This will help future developers quickly grasp the core principles.
| - Admission/intake backpressure (the `max_in_flight` permit in `runner.rs`/`runner_immediate_ack.rs`) gates fast intake only — auth, parse, stamp, and `submit_inbound`. An admission/intake permit must NOT be held across an unbounded downstream wait (post-ACK delivery poll, final-reply wait, LLM poll). Release it once the work is durably accepted (`ProductInboundAck::is_durable_outcome`) — before invoking the post-ACK observer — and bound the downstream work with its own mechanism (the delivery-side semaphore/single-flight guard/`max_wait`). Conflating admission with work duration lets a handful of slow turns exhaust every intake slot and silently reject new inbound webhooks under load. | |
| - **Admission Permit Lifetime:** | |
| - The `max_in_flight` admission permit in `runner.rs`/`runner_immediate_ack.rs` gates fast intake only: auth, parse, stamp, and `submit_inbound`. | |
| - Do not hold the permit across unbounded downstream waits (e.g., post-ACK delivery polls, final-reply waits). Release it once work is durably accepted (`ProductInboundAck::is_durable_outcome`), before invoking the post-ACK observer. | |
| - Bound downstream work with its own mechanisms (e.g., delivery-side semaphores, single-flight guards, `max_wait`). | |
| - **Rationale:** Conflating admission with work duration allows a few slow downstream tasks to exhaust all intake slots, causing silent rejection of new inbound webhooks under load. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs`:
- Around line 625-629: The wait loop in the immediate-ack test can hang forever
if `observe_workflow_ack` never starts; update the blocking wait in
`runner_immediate_ack` to use a short `tokio::time::timeout(...)` around the
`yield_now()` polling. Apply the same bounded-wait pattern to both observer
readiness waits in this flow, and make the test fail fast with a clear timeout
error instead of spinning indefinitely.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 999ba238-6fb7-4269-9192-9ef8340543cd
📒 Files selected for processing (3)
crates/ironclaw_wasm_product_adapters/CLAUDE.mdcrates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rsdocs/plans/2026-06-25-slack-admission-permit.md
|
🚅 Deployed to the ironclaw-pr-5225 environment in ironclaw-ci-preview
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs (1)
99-102: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winSoften the timeout guarantee in this comment.
workflow_timeoutonly wrapssubmit_inboundat Lines 117-119; the permit is acquired at Line 84 and is already held through parse/stamp before this block. Say only the submit wait is bounded.Proposed comment adjustment
- // The admission permit gates only fast intake (auth/parse/stamp/ - // submit), bounded by `workflow_timeout`. It must NOT be held across + // The admission permit gates intake (parse/stamp) and the workflow + // submit; only the async submit wait is bounded by `workflow_timeout`. + // It must NOT be held acrossAs per coding guidelines, "Comments that promise guarantees across layers must either be enforced by code/tests or softened to describe intent".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs` around lines 99 - 102, The comment in runner_immediate_ack::ack handling overstates the scope of workflow_timeout and implies it bounds the entire fast intake path, but the code only wraps submit_inbound while the permit is already held through parse/stamp. Update the comment near the admission permit and post-ACK observer delivery logic to soften the guarantee and describe only that the submit wait is bounded by workflow_timeout, keeping the note that the permit must not be held across the observer poll.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs`:
- Around line 99-102: The comment in runner_immediate_ack::ack handling
overstates the scope of workflow_timeout and implies it bounds the entire fast
intake path, but the code only wraps submit_inbound while the permit is already
held through parse/stamp. Update the comment near the admission permit and
post-ACK observer delivery logic to soften the guarantee and describe only that
the submit wait is bounded by workflow_timeout, keeping the note that the permit
must not be held across the observer poll.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 053cc026-8770-4eb9-a93e-bf16efd40c92
📒 Files selected for processing (1)
crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs
The bug (verified)
The immediate-ACK webhook path acquires an admission permit — the
max_in_flightintake semaphore (SLACK_MAX_IN_FLIGHT_WEBHOOKS = 64,slack_host_beta.rs:84) — inprepare_inbound_envelope(
runner.rs:316), then moves it into the spawned post-ACK task and holdsit for the task's whole lifetime, including the observer delivery poll
(
runner_immediate_ack.rs).For Slack the observer is
SlackFinalReplyDeliveryObserver(
slack_delivery.rs:1230). Itsobserve_workflow_ackcallsdeliver_final_reply(slack_delivery.rs:1392), which polls thesubmitted run for its final reply for up to
max_wait— default 120s(
slack_delivery.rs:109). That poll runs outside the ~2sworkflow_timeout(SLACK_WEBHOOK_WORKFLOW_TIMEOUT,slack_host_beta.rs:83)that bounds
submit_inbound.So each turn pins an admission slot for up to ~120s even though
submit_inbounddurably accepted the run in ~2s. With 64 slow turns,every intake slot is exhausted; new inbound webhooks are rejected
TooManyInFlight→ HTTP 429 (slack_serve.rs:337). Slack retries abounded number of times and then gives up — under sustained load this is
silent user-message loss. The anti-pattern: an admission/intake slot
conflated with work duration and held across an unbounded downstream wait.
Red → green evidence
New test
admission_released_after_durable_accept_not_held_across_delivery(
max_in_flight = 1, workflow durably accepts immediately, observer blocksto model the 120s poll):
TooManyInFlight { max_in_flight: 1 }while the first is still indelivery — test fails RED, reproducing the bug exactly.
Counterpart
admission_retained_across_observer_for_non_durable_ackpinsthe
is_durable_outcome()guard: aRejected(Retryable)ack keeps thepermit across the observer, so admission still backpressures un-accepted
intake.
The fix (decouple admission from delivery)
Once
submit_inboundreturns a durable ack(
ProductInboundAck::is_durable_outcome()— the canonical predicate),dropthe permit before invoking the observer. Admission now gatesonly the fast intake window (auth/parse/stamp/submit, bounded by
workflow_timeout). The unbounded delivery wait is bounded by its ownmechanism on the delivery side (delivery semaphore / single-flight guard /
max_wait) —slack_delivery.rsis unchanged. Non-durable acks anderror/timeout paths retain the permit until scope end (
OwnedSemaphorePermitreleases on drop on every path; no
unwrap/expect).No new durable-retry subsystem is needed: the run is already durable when
submit_inboundreturns, so the long delivery poll no longer needs anintake slot.
Guardrail
Added a rule to
crates/ironclaw_wasm_product_adapters/CLAUDE.md: anadmission/intake permit must NOT be held across an unbounded downstream
wait (delivery / final-reply / LLM poll); release it once work is durably
accepted and bound the work with its own mechanism.
Quality gate
cargo fmt --all --check: cleancargo clippy -p ironclaw_wasm_product_adapters --all-targets: zero warningscargo test -p ironclaw_wasm_product_adapters: 65 passed (incl. both new tests)cargo test -p ironclaw_reborn_composition --features slack-v2-host-beta,test-support:1203 passed; 1 pre-existing load-induced timing flake in an unrelated
runtime.rstest (multi_tool_call_response_survives_surface_change_mid_register,RunTimeout) that passes in isolation — not related to this change.Plan:
docs/plans/2026-06-25-slack-admission-permit.md.🤖 Generated with Claude Code