Skip to content

fix(reborn): release Slack admission permit once inbound is durably accepted#5225

Merged
serrrfirat merged 2 commits into
mainfrom
fix/reborn-slack-admission-permit
Jun 25, 2026
Merged

fix(reborn): release Slack admission permit once inbound is durably accepted#5225
serrrfirat merged 2 commits into
mainfrom
fix/reborn-slack-admission-permit

Conversation

@henrypark133

Copy link
Copy Markdown
Collaborator

The bug (verified)

The immediate-ACK webhook path acquires an admission permit — the
max_in_flight intake semaphore (SLACK_MAX_IN_FLIGHT_WEBHOOKS = 64,
slack_host_beta.rs:84) — in prepare_inbound_envelope
(runner.rs:316), then moves it into the spawned post-ACK task and holds
it for the task's whole lifetime
, including the observer delivery poll
(runner_immediate_ack.rs).

For Slack the observer is SlackFinalReplyDeliveryObserver
(slack_delivery.rs:1230). Its observe_workflow_ack calls
deliver_final_reply (slack_delivery.rs:1392), which polls the
submitted run for its final reply for up to max_wait — default 120s

(slack_delivery.rs:109). That poll runs outside the ~2s
workflow_timeout (SLACK_WEBHOOK_WORKFLOW_TIMEOUT, slack_host_beta.rs:83)
that bounds submit_inbound.

So each turn pins an admission slot for up to ~120s even though
submit_inbound durably accepted the run in ~2s. With 64 slow turns,
every intake slot is exhausted; new inbound webhooks are rejected
TooManyInFlight → HTTP 429 (slack_serve.rs:337). Slack retries a
bounded number of times and then gives up — under sustained load this is
silent user-message loss. The anti-pattern: an admission/intake slot
conflated with work duration and held across an unbounded downstream wait.

Red → green evidence

New test admission_released_after_durable_accept_not_held_across_delivery
(max_in_flight = 1, workflow durably accepts immediately, observer blocks
to model the 120s poll):

  • On the unpatched base: the second webhook is rejected
    TooManyInFlight { max_in_flight: 1 } while the first is still in
    delivery — test fails RED, reproducing the bug exactly.
  • After the fix: the second webhook is admitted — test passes GREEN.

Counterpart admission_retained_across_observer_for_non_durable_ack pins
the is_durable_outcome() guard: a Rejected(Retryable) ack keeps the
permit across the observer, so admission still backpressures un-accepted
intake.

The fix (decouple admission from delivery)

Once submit_inbound returns a durable ack
(ProductInboundAck::is_durable_outcome() — the canonical predicate),
drop the permit before invoking the observer. Admission now gates
only the fast intake window (auth/parse/stamp/submit, bounded by
workflow_timeout). The unbounded delivery wait is bounded by its own
mechanism on the delivery side (delivery semaphore / single-flight guard /
max_wait) — slack_delivery.rs is unchanged. Non-durable acks and
error/timeout paths retain the permit until scope end (OwnedSemaphorePermit
releases on drop on every path; no unwrap/expect).

No new durable-retry subsystem is needed: the run is already durable when
submit_inbound returns, so the long delivery poll no longer needs an
intake slot.

Guardrail

Added a rule to crates/ironclaw_wasm_product_adapters/CLAUDE.md: an
admission/intake permit must NOT be held across an unbounded downstream
wait (delivery / final-reply / LLM poll); release it once work is durably
accepted and bound the work with its own mechanism.

Quality gate

  • cargo fmt --all --check: clean
  • cargo clippy -p ironclaw_wasm_product_adapters --all-targets: zero warnings
  • cargo test -p ironclaw_wasm_product_adapters: 65 passed (incl. both new tests)
  • cargo test -p ironclaw_reborn_composition --features slack-v2-host-beta,test-support:
    1203 passed; 1 pre-existing load-induced timing flake in an unrelated
    runtime.rs test (multi_tool_call_response_survives_surface_change_mid_register,
    RunTimeout) that passes in isolation — not related to this change.

Plan: docs/plans/2026-06-25-slack-admission-permit.md.

🤖 Generated with Claude Code

…ccepted

The immediate-ACK webhook path acquired an admission permit (the
max_in_flight intake semaphore, capacity 64 for Slack) and held it for
the entire spawned post-ACK task — including the observer delivery poll.
For Slack that observer (SlackFinalReplyDeliveryObserver) polls the
submitted run for its final reply for up to max_wait (default 120s)
inside observe_workflow_ack, well outside the ~2s workflow_timeout that
bounds submit_inbound.

Net effect: each turn pinned an admission slot for up to ~120s even
though submit_inbound durably accepted the run in ~2s. With 64 slow
turns, every intake slot is exhausted and new inbound webhooks are
rejected TooManyInFlight (HTTP 429) for the whole delivery window;
under sustained load Slack exhausts its bounded retries and user
messages are silently lost.

Fix: decouple admission from delivery. Once submit_inbound returns a
durable ack (ProductInboundAck::is_durable_outcome), drop the permit
before invoking the observer, so admission gates only the fast intake
window. Non-durable acks and error/timeout paths retain the permit
until scope end so admission still backpressures un-accepted intake.
The unbounded delivery wait is bounded by its own mechanism on the
delivery side (semaphore / single-flight guard / max_wait), unchanged.

Red->green coverage: admission_released_after_durable_accept_not_held_
across_delivery (fails TooManyInFlight on base, passes after fix) plus
admission_retained_across_observer_for_non_durable_ack pinning the
is_durable_outcome guard. Guardrail rule added to the crate CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5225 June 25, 2026 07:21 Destroyed
@github-actions github-actions Bot added scope: docs Documentation size: L 200-499 changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Jun 25, 2026
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Fixed webhook intake backpressure so admission permits are released immediately after a durable inbound acceptance, reducing prolonged intake slot exhaustion and unexpected 429 responses under load.
    • Ensured non-durable outcomes keep admission backpressured through the post-ack observer, preserving correct retry/rejection behavior.
  • Documentation

    • Added a plan outlining the intended “admission permit vs. delivery polling” behavior and the correct durability signal for releasing intake capacity.
  • Tests

    • Added regression coverage for both durable and non-durable acknowledgment paths and observer synchronization.

Walkthrough

The PR changes immediate-ACK admission handling so durable inbound acknowledgements release the intake permit before post-ACK observation, while non-durable outcomes keep it held. It also adds a regression plan, a blocking observer harness, and tests for both durable release and retained backpressure.

Changes

Slack admission permit lifetime

Layer / File(s) Summary
Admission lifetime contract
docs/plans/2026-06-25-slack-admission-permit.md, crates/ironclaw_wasm_product_adapters/CLAUDE.md
The plan records the failure mode, the durable-release condition, the regression scenario, the fix outline, and the guardrail text.
Immediate-ACK permit release
crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs
The spawned immediate-ACK task stores the permit in an option and drops it early only when the ACK is durable, before calling the post-ACK observer.
Regression tests
crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs
The test module adds a non-durable rejection workflow stub, rewrites the blocking observer latch, and covers durable release versus non-durable retention.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • serrrfirat

Poem

A permit held its breath in line,
Then dropped at ACK when all was fine.
Durable green let work proceed,
While red kept backpressure in need.
🌙

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Conventional-commit title accurately summarizes the Slack admission-permit release fix.
Description check ✅ Passed It covers the bug, fix, regression tests, and validation, though it doesn't use the template's sectioned format.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e9d10631d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +132 to +133
if ack.is_durable_outcome() {
drop(permit.take());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Bound pending post-ACK delivery tasks separately

When a durable ack comes back, this drops the only runner-level bound before calling the observer. I checked the Slack observer path: observe_workflow_ack waits on delivery_permits.acquire_owned() in slack_delivery.rs:1385, while the separate max_pending_deliveries guard exists only for TriggeredRunDeliveryDriver (slack_delivery.rs:1839-1941), not for webhook observer tasks. Under sustained accepted Slack webhooks with slow final-reply polls, everything beyond the 64 delivery permits is still spawned into immediate_ack_tasks and can wait up to the 120s max_wait with no pending cap, trading 429s for unbounded task/envelope accumulation; add a delivery-side pending permit/queue limit for this observer path before releasing admission or before entering the observer.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a critical bug where the admission permit was held across unbounded downstream waits, leading to potential webhook rejection under load. The fix ensures that the permit is released early once the inbound is durably accepted, and includes regression tests to verify the behavior. The reviewer suggested restructuring the newly added architectural guideline in CLAUDE.md into a bulleted list to improve readability.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

- Runtime-specific WIT/request shapes may live here, but shared product-surface DTOs stay in `ironclaw_product_adapters`.
- ProductAdapter components may use v1-style minimal WASI p2 for wasm32-wasip2 compatibility: clock/random are allowed; env, args, stdio, preopened directories, inherited network, and DNS lookup must stay disabled. Current slice is parse/render-only: the ProductAdapter WIT `http-egress` import fails closed until host-runtime egress wiring is injected in a follow-up.
- Keep dependencies minimal. Avoid workflow/runtime crates unless a concrete host-glue call site requires them and architecture tests are updated deliberately.
- Admission/intake backpressure (the `max_in_flight` permit in `runner.rs`/`runner_immediate_ack.rs`) gates fast intake only — auth, parse, stamp, and `submit_inbound`. An admission/intake permit must NOT be held across an unbounded downstream wait (post-ACK delivery poll, final-reply wait, LLM poll). Release it once the work is durably accepted (`ProductInboundAck::is_durable_outcome`) — before invoking the post-ACK observer — and bound the downstream work with its own mechanism (the delivery-side semaphore/single-flight guard/`max_wait`). Conflating admission with work duration lets a handful of slow turns exhaust every intake slot and silently reject new inbound webhooks under load.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This is an important rule to add. For better readability and to make it easier to digest as a guideline, consider restructuring this paragraph into a main bullet point with sub-bullets. This will help future developers quickly grasp the core principles.

Suggested change
- Admission/intake backpressure (the `max_in_flight` permit in `runner.rs`/`runner_immediate_ack.rs`) gates fast intake only — auth, parse, stamp, and `submit_inbound`. An admission/intake permit must NOT be held across an unbounded downstream wait (post-ACK delivery poll, final-reply wait, LLM poll). Release it once the work is durably accepted (`ProductInboundAck::is_durable_outcome`) — before invoking the post-ACK observer — and bound the downstream work with its own mechanism (the delivery-side semaphore/single-flight guard/`max_wait`). Conflating admission with work duration lets a handful of slow turns exhaust every intake slot and silently reject new inbound webhooks under load.
- **Admission Permit Lifetime:**
- The `max_in_flight` admission permit in `runner.rs`/`runner_immediate_ack.rs` gates fast intake only: auth, parse, stamp, and `submit_inbound`.
- Do not hold the permit across unbounded downstream waits (e.g., post-ACK delivery polls, final-reply waits). Release it once work is durably accepted (`ProductInboundAck::is_durable_outcome`), before invoking the post-ACK observer.
- Bound downstream work with its own mechanisms (e.g., delivery-side semaphores, single-flight guards, `max_wait`).
- **Rationale:** Conflating admission with work duration allows a few slow downstream tasks to exhaust all intake slots, causing silent rejection of new inbound webhooks under load.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs`:
- Around line 625-629: The wait loop in the immediate-ack test can hang forever
if `observe_workflow_ack` never starts; update the blocking wait in
`runner_immediate_ack` to use a short `tokio::time::timeout(...)` around the
`yield_now()` polling. Apply the same bounded-wait pattern to both observer
readiness waits in this flow, and make the test fail fast with a clear timeout
error instead of spinning indefinitely.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 999ba238-6fb7-4269-9192-9ef8340543cd

📥 Commits

Reviewing files that changed from the base of the PR and between 3cbde9b and 5e9d106.

📒 Files selected for processing (3)
  • crates/ironclaw_wasm_product_adapters/CLAUDE.md
  • crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs
  • docs/plans/2026-06-25-slack-admission-permit.md

Comment thread crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs Outdated
@railway-app

railway-app Bot commented Jun 25, 2026

Copy link
Copy Markdown

🚅 Deployed to the ironclaw-pr-5225 environment in ironclaw-ci-preview

Service Status Web Updated (UTC)
ironclaw ✅ Success (View Logs) Web Jun 25, 2026 at 8:07 am

@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5225 June 25, 2026 08:00 Destroyed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs (1)

99-102: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Soften the timeout guarantee in this comment.

workflow_timeout only wraps submit_inbound at Lines 117-119; the permit is acquired at Line 84 and is already held through parse/stamp before this block. Say only the submit wait is bounded.

Proposed comment adjustment
-            // The admission permit gates only fast intake (auth/parse/stamp/
-            // submit), bounded by `workflow_timeout`. It must NOT be held across
+            // The admission permit gates intake (parse/stamp) and the workflow
+            // submit; only the async submit wait is bounded by `workflow_timeout`.
+            // It must NOT be held across

As per coding guidelines, "Comments that promise guarantees across layers must either be enforced by code/tests or softened to describe intent".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs` around
lines 99 - 102, The comment in runner_immediate_ack::ack handling overstates the
scope of workflow_timeout and implies it bounds the entire fast intake path, but
the code only wraps submit_inbound while the permit is already held through
parse/stamp. Update the comment near the admission permit and post-ACK observer
delivery logic to soften the guarantee and describe only that the submit wait is
bounded by workflow_timeout, keeping the note that the permit must not be held
across the observer poll.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs`:
- Around line 99-102: The comment in runner_immediate_ack::ack handling
overstates the scope of workflow_timeout and implies it bounds the entire fast
intake path, but the code only wraps submit_inbound while the permit is already
held through parse/stamp. Update the comment near the admission permit and
post-ACK observer delivery logic to soften the guarantee and describe only that
the submit wait is bounded by workflow_timeout, keeping the note that the permit
must not be held across the observer poll.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 053cc026-8770-4eb9-a93e-bf16efd40c92

📥 Commits

Reviewing files that changed from the base of the PR and between 5e9d106 and fb2a80f.

📒 Files selected for processing (1)
  • crates/ironclaw_wasm_product_adapters/src/runner_immediate_ack.rs

@serrrfirat serrrfirat merged commit 8b87367 into main Jun 25, 2026
119 checks passed
@serrrfirat serrrfirat deleted the fix/reborn-slack-admission-permit branch June 25, 2026 08:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules scope: docs Documentation size: L 200-499 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants