fix(reborn): remove per-record lock convoys via shared cas_update helper by henrypark133 · Pull Request #5234 · nearai/ironclaw

henrypark133 · 2026-06-25T11:53:19Z

Problem

Each Reborn persistence store wrapped its filesystem read-modify-write in a per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) held across .await — a redundant in-process serializer over backends that already do versioned CAS. Under burst, one writer stalled inside its critical section blocks every other writer for that scope (the convoy that contributed to the runtime wedge). PR #5142 removed exactly this from ironclaw_turns; this PR finishes the job for the remaining stores via one shared helper.

Change

Shared helper (ironclaw_filesystem::cas_update): one mutex-free, bounded read-modify-write loop — read versioned snapshot → idempotent apply closure → CAS put → on VersionMismatch re-read and retry (32×, jittered 2–50ms backoff, 15s timeout), with a fail-closed capability gate. Generic over record/outcome/error; leaks no store types.

Migrated all 5 stores onto it and deleted every per-record mutex:

ironclaw_turns — re-homed its local fix(turns): prevent turn-state write convoy #5142 copy onto the shared helper (one owner).
ironclaw_run_state, ironclaw_threads — dropped the mutex; route through the helper.
ironclaw_resources — gains a retry loop it never had (the lock was its only serializer).
ironclaw_secrets — deletes the Arc-keyed lock map → fixes an unbounded memory leak (it stored Arc, never pruned).

Post-condition: grep FILESYSTEM_RECORD_LOCKS / record_lock.lock().await / filesystem_secret_lock across the 5 store crates is empty.

Guardrail: .claude/rules/database.md records the invariant — filesystem read-modify-write must go through cas_update; never wrap it in a per-record mutex held across .await.

Tri-backend parity (no diverging experience by host)

The helper is RootFilesystem-level (backend-agnostic). Verified across all three ironclaw_filesystem backends:

in-memory (default): clippy 0 across the 6 crates; all store tests pass (turns/run_state/threads/resources/secrets).
postgres (--features postgres): CAS contract tests pass — postgres_native_put_cas_absent_rejects_existing_path, postgres_native_put_cas_any_increments_existing_version, postgres_transaction_rollback_discards_prior_put_after_later_cas_conflict.
libsql (--features libsql): CAS contract tests pass. (One unrelated append/dir contract test is flaky under parallel suite execution — passes 3/3 in isolation; db.rs is untouched by this PR, so it is pre-existing test-infra flakiness, not a regression.)

Notes

Concurrency-sensitive; the helper carries a high-contention storm test (N concurrent writers, no lost update), retries-exhausted, capability-gate-rejects-non-CAS, and create-if-absent.
Stacks conceptually on fix(turns): prevent turn-state write convoy #5142 (already merged). Companion to the runtime-wedge work (fix(reborn): stop WASM execution from starving the tokio worker pool #5206).

🤖 Generated with Claude Code

Extract the proven mutex-free CAS pattern from ironclaw_turns (PR #5142) into one shared helper in ironclaw_filesystem so the remaining stores can drop their per-record tokio::sync::Mutex convoys. cas_update runs a bounded read-modify-write loop: read versioned snapshot, run the caller's idempotent apply closure, CAS-put with the read version, and on VersionMismatch re-read and retry with jittered exponential backoff (2ms..50ms), capped at 32 retries and wrapped in a 15s timeout. A fail-closed capability gate rejects backends that cannot CAS rather than silently blind-overwriting. The helper is generic over the record type, the outcome, and the caller's error; it never leaks store-specific types. No store migrated yet — that follows in the next commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…te helper ironclaw_resources had NO CAS-retry loop — its per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) held across the backend get/put awaits was the ONLY serializer. Under burst, one writer stalled inside its critical section parked every other same-scope writer (the convoy that contributed to the 2026-06-24 runtime wedge). Naively deleting that mutex without a retry loop loses updates: a racing writer's single-attempt CAS detects the version mismatch and errors out (cross-process CAS contention) instead of retrying. Route update_snapshot through ironclaw_filesystem::cas_update (bounded CAS retry + jittered backoff + 15s timeout + fail-closed capability gate) and delete the per-record mutex, FilesystemRecordLock, the local PutError, and the fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The helper fails closed instead; verified safe — these store aliases only ever resolve to CAS-capable db/in-memory backends in production. The public update API widens from FnOnce to FnMut because the helper re-runs the closure against a freshly read snapshot on every CAS retry; leaf closures that previously moved captured values now clone per invocation. Add PartialEq to BudgetGateSnapshot (helper needs S: Clone + PartialEq). Red->green regression: cas_snapshot::tests::concurrent_increments_have_no_lost_updates proves RED on lock-removed-no-retry (lost updates / spurious contention) and GREEN through the helper (every concurrent increment lands, no convoy). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ant per-record mutex run_state held BOTH a store-local CAS-retry loop AND a per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) across the backend get/put awaits — belt-and-suspenders where the mutex was a redundant in-process serializer over a backend that already does versioned CAS. Under burst the mutex convoyed same-scope writers behind one stalled writer (the 2026-06-24 wedge pattern). Route apply_update, update_status, start, and save_pending through ironclaw_filesystem::cas_update (one shared CAS implementation: bounded retry + jittered backoff + 15s timeout + fail-closed capability gate). Delete the store-local FILESYSTEM_CAS_RETRIES loop, the per-record lock + accessor + its two unit tests, the local PutError, and the fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The scope-ownership check and the approval Pending guard move inside the re-runnable apply closure; the ApprovalStatus guard returns Err from apply. discard_pending keeps its read-then-delete, only the lock is dropped. The run_state contract tests previously ran against LocalFilesystem (byte-only, no versioned CAS) where the OLD fail-open fallback masked the missing CAS. Production never routes run_state to LocalFilesystem — only to CAS-capable db/in-memory backends — so the tests now use InMemoryBackend, matching the production capability shape (CAS) instead of a non-production blind-overwrite path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…p per-record mutex ensure_thread held the only per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) in this crate, across the backend get/put awaits, to serialize the check-then-create against Unsupported(WriteFile)-fallback backends. Route it through ironclaw_filesystem::cas_update: the apply closure returns the existing record unchanged (no-op, no write) when the thread already exists with a matching scope, or builds the fresh StoredThreadRecord when absent. A concurrent create-if-absent loser hits VersionMismatch, the helper re-reads and re-runs apply which now sees the winner and reconciles scope — exactly the old single-reconcile semantics, now via the shared CAS-retry loop without any lock. Delete the lock infra (Weak map + accessor); add PartialEq to StoredThreadRecord (helper needs S: Clone + PartialEq). The three other record/txn loops were already lock-free and are left as-is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…cord mutex (Arc leak fix) The secrets filesystem store kept FILESYSTEM_RECORD_LOCKS as a HashMap of *strong* Arc<Mutex<()>> (not Weak), one entry per (scope,lease)/session path ever seen, NEVER pruned — an unbounded memory leak. The per-record mutex was also held across the backend get/put awaits, convoying same-key writers behind one stalled writer (the 2026-06-24 wedge pattern). Route consume/revoke/consume_session_use through ironclaw_filesystem::cas_update (one shared CAS implementation), mapping the local CasDecision onto CasApply: Commit/BestEffortCommit -> changed snapshot; Settle(Ok) -> unchanged snapshot (helper skips the write via PartialEq no-op); Settle(Err)/not-found -> apply error. Crypto decrypt + use-count increment run inside the re-runnable apply closure (pure / recomputed from the freshly read record each retry). validate_session is a pure read — its lock is simply deleted. Delete the entire lock map (leak gone), cas_mutate/CasDecision/CAS_RETRY_ATTEMPTS, and the fail-open put_with_version_fallback (helper fails closed). Add PartialEq to StoredLease and StoredSession. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

PR #5142 removed the per-record mutex from ironclaw_turns but left a local copy of the CAS read-modify-write loop (apply_with_retry, put_with_cas, PutError, cas_retry_backoff + constants). Re-home it onto ironclaw_filesystem::cas_update so there is ONE CAS implementation across the codebase. Behavior is preserved: the 18 call sites and their closure shape (FnMut(InMemoryTurnStateStore) -> Fut) are unchanged; the exact timeout/exhaustion error strings are preserved. A BridgeError<T> sentinel carries the absent-record + default-snapshot no-op through the helper's apply-error channel (the helper's own no-op check only fires for Some(existing)==new; the turns store additionally must skip creating a file for an empty default store, which the old new==old check covered because read returned default on absent). The 500ms SNAPSHOT_READ_CACHE_TTL read cache is kept as a separate layer: cleared around the CAS call and repopulated on the next read (the helper does its own fresh get and does not surface the new RecordVersion; every read_snapshot caller already discards the version, so caching None is safe). Delete the now-unused local loop, put_with_cas, PutError, cas_retry_backoff, and the duplicate CAS constants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…cord mutex Add the invariant to crates/ironclaw_filesystem/CLAUDE.md (#2 "CAS is the floor") and a pointer in .claude/rules/database.md: every filesystem read-modify-write must go through the one shared ironclaw_filesystem::cas_update helper; never wrap it in a per-record tokio::sync::Mutex held across the backend .await (redundant serializer + convoy/wedge risk + leak). Also commit Cargo.lock (async-trait dev-dep added to ironclaw_resources for the convoy regression test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request migrates multiple persistence stores (ironclaw_resources, ironclaw_run_state, ironclaw_threads, ironclaw_secrets, and ironclaw_turns) to use a shared, lock-free compare-and-swap (CAS) helper (cas_update). This eliminates redundant per-record mutexes held across .await boundaries, preventing runtime-wedging convoys under high contention. Feedback on the changes highlights an opportunity to improve the jitter calculation in cas_retry_backoff using RandomState to avoid zero-jitter scenarios in low clock resolution environments (such as VMs or Windows). Additionally, it is recommended to explicitly add Send bounds to the closure generic constraints in cas_update and cas_update_loop to catch any non-Send regressions at the helper's definition site.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-25T11:56:05Z

+async fn cas_retry_backoff(attempt: usize) {
+    let shift = attempt.min(8) as u32;
+    let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
+    let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
+        .saturating_mul(multiplier)
+        .min(FILESYSTEM_CAS_BACKOFF_MAX);
+    let jitter = SystemTime::now()
+        .duration_since(UNIX_EPOCH)
+        .map(|elapsed| {
+            let jitter_ceiling = base_delay.as_millis().max(1);
+            Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64)
+        })
+        .unwrap_or_default();
+    tokio::time::sleep(base_delay.saturating_add(jitter)).await;
+}


Using SystemTime::now() for jitter calculation can lead to zero jitter on systems with low clock resolution (e.g., 1ms resolution in VMs, containers, or Windows). Since base_delay.as_millis() is at most 50, and 1,000,000 (1ms in nanoseconds) is divisible by all common backoff ceilings (2, 4, 8, 16, 32, 50), the modulo operation elapsed.as_nanos() % jitter_ceiling will always evaluate to exactly 0. This completely defeats the purpose of jitter under high contention.

Instead, use std::collections::hash_map::RandomState to obtain a high-quality, thread-safe, and un-correlated pseudo-random value seeded from the OS without adding external dependencies.

Suggested change

async fn cas_retry_backoff(attempt: usize) {

let shift = attempt.min(8) as u32;

let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);

let base_delay = FILESYSTEM_CAS_BACKOFF_BASE

.saturating_mul(multiplier)

.min(FILESYSTEM_CAS_BACKOFF_MAX);

let jitter = SystemTime::now()

.duration_since(UNIX_EPOCH)

.map(|elapsed| {

let jitter_ceiling = base_delay.as_millis().max(1);

Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64)

})

.unwrap_or_default();

tokio::time::sleep(base_delay.saturating_add(jitter)).await;

}

async fn cas_retry_backoff(attempt: usize) {

let shift = attempt.min(8) as u32;

let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);

let base_delay = FILESYSTEM_CAS_BACKOFF_BASE

.saturating_mul(multiplier)

.min(FILESYSTEM_CAS_BACKOFF_MAX);

let jitter = {

use std::collections::hash_map::RandomState;

use std::hash::{BuildHasher, Hash, Hasher};

let mut hasher = RandomState::new().build_hasher();

attempt.hash(&mut hasher);

let hash = hasher.finish();

let jitter_ceiling = base_delay.as_millis().max(1) as u64;

Duration::from_millis(hash % jitter_ceiling)

};

tokio::time::sleep(base_delay.saturating_add(jitter)).await;

}

gemini-code-assist · 2026-06-25T11:56:05Z

+pub async fn cas_update<F, S, T, E, D, N, A, Fut>(
+    filesystem: &ScopedFilesystem<F>,
+    scope: &ResourceScope,
+    path: &ScopedPath,
+    decode: D,
+    encode: N,
+    mut apply: A,
+) -> Result<T, CasUpdateError<E>>
+where
+    F: RootFilesystem + ?Sized,
+    S: PartialEq + Clone,
+    D: Fn(&[u8]) -> Result<S, E>,
+    N: Fn(&S) -> Result<Entry, E>,
+    A: FnMut(Option<S>) -> Fut,
+    Fut: Future<Output = Result<CasApply<S, T>, E>>,


According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.

pub async fn cas_update<F, S, T, E, D, N, A, Fut>( filesystem: &ScopedFilesystem<F>, scope: &ResourceScope, path: &ScopedPath, decode: D, encode: N, mut apply: A, ) -> Result<T, CasUpdateError<E>> where F: RootFilesystem + ?Sized, S: PartialEq + Clone + Send, T: Send, E: Send, D: Fn(&[u8]) -> Result<S, E> + Send, N: Fn(&S) -> Result<Entry, E> + Send, A: FnMut(Option<S>) -> Fut + Send, Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,

References

In Rust, explicitly add Send bounds to closure generic constraints in async helper functions that return futures intended to be Send. This ensures that non-Send regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by Send at the call site.

gemini-code-assist · 2026-06-25T11:56:05Z

+async fn cas_update_loop<F, S, T, E, D, N, A, Fut>(
+    filesystem: &ScopedFilesystem<F>,
+    scope: &ResourceScope,
+    path: &ScopedPath,
+    decode: &D,
+    encode: &N,
+    apply: &mut A,
+) -> Result<T, CasUpdateError<E>>
+where
+    F: RootFilesystem + ?Sized,
+    S: PartialEq + Clone,
+    D: Fn(&[u8]) -> Result<S, E>,
+    N: Fn(&S) -> Result<Entry, E>,
+    A: FnMut(Option<S>) -> Fut,
+    Fut: Future<Output = Result<CasApply<S, T>, E>>,


According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.

async fn cas_update_loop<F, S, T, E, D, N, A, Fut>( filesystem: &ScopedFilesystem<F>, scope: &ResourceScope, path: &ScopedPath, decode: &D, encode: &N, apply: &mut A, ) -> Result<T, CasUpdateError<E>> where F: RootFilesystem + ?Sized, S: PartialEq + Clone + Send, T: Send, E: Send, D: Fn(&[u8]) -> Result<S, E> + Send, N: Fn(&S) -> Result<Entry, E> + Send, A: FnMut(Option<S>) -> Fut + Send, Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,

References

In Rust, explicitly add Send bounds to closure generic constraints in async helper functions that return futures intended to be Send. This ensures that non-Send regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by Send at the call site.

coderabbitai · 2026-06-25T11:58:10Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added a shared, lock-free update path for filesystem-backed data, improving consistency under contention.
- Exposed filesystem capability information for smarter update handling.
Bug Fixes
- Reduced lost updates and convoying during concurrent writes.
- Improved handling of no-op updates, retries, timeouts, and unsupported backends.
Refactor
- Migrated several stores to a common update flow, removing older per-record locking approaches and simplifying concurrent persistence.

Walkthrough

Introduced a shared filesystem cas_update helper with capability gating, bounded retries, backoff, and no-op detection, then migrated run-state, approval, resource, secret, thread, and turn stores to retryable CAS update paths with updated error mapping, tests, and migration notes.

Changes

Shared filesystem CAS migration

Layer / File(s)	Summary
Contract and public API surface `.claude/rules/database.md`, `crates/ironclaw_filesystem/CLAUDE.md`, `docs/plans/2026-06-25-cas-migration.md`, `crates/ironclaw_filesystem/src/scoped.rs`, `crates/ironclaw_filesystem/src/lib.rs`, `crates/ironclaw_filesystem/src/cas.rs`	Filesystem invariants and migration notes now require `cas_update`; `ScopedFilesystem` exposes capabilities; and the crate root exports the new CAS helper, types, and timing constants.
cas_update loop and tests `crates/ironclaw_filesystem/src/cas.rs`, `crates/ironclaw_filesystem/src/cas/tests.rs`	The shared helper performs capability-gated CAS retries with timeout, no-op detection, backoff, and mapped error variants, with behavior covered by tests.
CasSnapshotStore migration `crates/ironclaw_resources/src/cas_snapshot.rs`, `crates/ironclaw_resources/Cargo.toml`	`CasSnapshotStore` switches to shared CAS retries, requires re-runnable snapshot updates, and adds contention regression coverage.
Resource governor retries `crates/ironclaw_resources/src/filesystem_store.rs`, `crates/ironclaw_resources/src/lib.rs`	`ResourceGovernorStore` and `FilesystemBudgetGateStore` accept re-runnable closures, clone captured values on retries, and add `PartialEq` to the budget snapshot.
Run-state record updates `crates/ironclaw_run_state/src/lib.rs`	`FilesystemRunStateStore` moves record mutation and creation onto `cas_update` and removes the local lock map and legacy CAS write helpers.
Approval request updates and tests `crates/ironclaw_run_state/src/lib.rs`, `crates/ironclaw_run_state/tests/*`	`FilesystemApprovalRequestStore` uses `cas_update` for status transitions and creation, and the run-state harness now uses `InMemoryBackend` in the affected tests.
Secret and session CAS updates `crates/ironclaw_secrets/src/filesystem_store.rs`	`SecretStore` and `CredentialSessionStore` now use shared CAS retries, and stored records derive `PartialEq` for no-op detection.
Thread service CAS updates `crates/ironclaw_threads/src/filesystem_service.rs`	`FilesystemSessionThreadService::ensure_thread` now uses `cas_update`, the stored thread record derives `PartialEq`, and per-path locking is removed.
Turn-state writes and cache handling `crates/ironclaw_turns/src/filesystem_store.rs`	`FilesystemTurnStateStore` routes writes through `cas_update`, the snapshot cache is cleared and repopulated around retries, and the old manual retry helpers are removed.

Sequence Diagram(s)

sequenceDiagram
  participant FilesystemRunStateStore
  participant cas_update
  participant ScopedFilesystem

  FilesystemRunStateStore->>cas_update: apply_update
  cas_update->>ScopedFilesystem: capabilities()
  ScopedFilesystem-->>cas_update: BackendCapabilities
  cas_update->>ScopedFilesystem: get(path)
  cas_update->>ScopedFilesystem: put(path, CasExpectation::Version)
  ScopedFilesystem-->>cas_update: VersionMismatch
  cas_update->>cas_update: retry with backoff
  cas_update-->>FilesystemRunStateStore: Result

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

nearai/ironclaw#5142 — Also rewrites crates/ironclaw_turns/src/filesystem_store.rs toward bounded CAS retries and cache-clearing semantics.
nearai/ironclaw#5232 — Also touches crates/ironclaw_turns/src/filesystem_store.rs and related CAS update flows for runner-lease sidecar overlays.

Suggested reviewers

think-in-universe

Poem

A mutex slept beside the gate,
Then CAS arrived and would not wait.
Retries danced in bounded time,
Snapshots kept their steady rhyme,
And versioned writes rang clear and bright ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	It has useful content, but it misses most required template sections and headings, so the PR description is largely incomplete.	Rewrite it with the repository template and add Summary, Change Type, Linked Issue, Validation, Security Impact, Blast Radius, Rollback Plan, Review Follow-Through, and Review track.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	Conventional-Commits style and accurately summarizes the shared cas_update migration across the filesystem stores.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

⚔️ Resolve merge conflicts

Resolve merge conflict in branch fix/reborn-p1-cas-helper

_{Comment @coderabbitai help to get the list of available commands.}

chatgpt-codex-connector

💡 Codex Review

ironclaw/crates/ironclaw_run_state/src/lib.rs

Line 932 in 3223540

self.filesystem.delete(scope, &path).await?;

Guard discard deletes with CAS

When discard_pending races in the same process with approve or deny for the same request, this unconditional delete can now run after the resolver's CAS-protected status update and remove the terminal approval record. The removed per-record lock used to serialize that get/check/delete sequence with status updates; without either that lock or a CAS/tombstone transition here, a user approval can be lost and later look like UnknownApprovalRequest rather than Approved.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/ironclaw_run_state/src/lib.rs (1)
922-932: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Make discard_pending atomic before dropping the lock.

This path reads Pending, then deletes later with no CAS expectation. A concurrent approve()/deny() can win the CAS update between Line 922 and Line 932, then this delete removes a non-pending approval record. Use a CAS-aware delete/transactional transition, or model discard as a CAS status update instead of read-then-delete. As per path instructions: “Fail loud” and filesystem read-modify-write paths must use the CAS invariant rather than split multi-step state changes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_run_state/src/lib.rs` around lines 922 - 932, The
discard_pending flow in RunState::discard_pending is currently split into a
read/validate step and a later filesystem delete, which can race with
approve()/deny() and remove a record after its status has already changed.
Update this path to be atomic by using a CAS-aware transition or transactional
delete that enforces the same status invariant at the point of mutation, rather
than relying on a prior Pending check; keep the failure path loud if the
expected state no longer matches.
Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas.rs`:
- Line 193: The new Clippy allow on cas.rs needs the required arch-exempt
rationale comment immediately above #[allow(clippy::too_many_arguments)]. Update
the same site near the cas:: function signature that triggered the lint by
adding the mandatory // arch-exempt: too_many_args, <reason naming the missing
aggregation>, plan `#NNNN` annotation, and make sure the reason justifies why this
many parameters is unavoidable until the missing aggregation is introduced.
- Around line 295-306: The CAS preflight is treating
BackendCapabilities::default() as “unknown,” which lets a backend with no
advertised capabilities bypass the fail-closed gate. Update capabilities_known()
in cas.rs to only treat an explicit unknown state as unknown and not overload
BackendCapabilities::empty()/default(), then make the cas_update path rely on
that check so unsupported backends still return CasUpdateError::CasUnsupported
instead of falling back to op-time behavior.
- Around line 262-265: The no-op shortcut in cas_update_loop currently only
returns early when current is Some(existing) and matches the snapshot, while
absent records still fall through to encoding and writing; update the logic to
either treat None plus an unchanged/default snapshot as a true no-op or adjust
the CasApply/cas_update contract and surrounding comments to state the shortcut
only applies to existing records. Use the existing cas_update_loop and CasApply
symbols to keep the behavior and docs aligned, and remove any need for
downstream NoOp suppression if you choose to enforce it here.

In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 184-399: Add a regression test in cas/tests.rs for the timeout
path in cas_update, since the current tests cover retries, CAS rejection, and
apply errors but never exercise CasUpdateError::Timeout. Introduce a tiny wedged
backend or stub that blocks one of the CAS operations (get, put, or apply), and
use paused Tokio time so the outer timeout in cas_update deterministically
fires. Reference cas_update and CasUpdateError::Timeout in the new test, and
assert that the timeout branch wins over the hung backend behavior.

In `@crates/ironclaw_resources/src/cas_snapshot.rs`:
- Line 212: The CasUpdateError::Backend handling in cas_snapshot.rs currently
forwards the backend error string directly via E::storage_from(inner), which can
leak virtual paths and raw backend details into StorageError. Update this
mapping to return a stable sanitized public message from the error conversion
path, and preserve the detailed backend error only in internal diagnostics or
source context associated with E::storage_from.

In `@crates/ironclaw_run_state/src/lib.rs`:
- Line 1109: The `CasUpdateError::Backend` branch in `RunStateError` currently
converts the filesystem failure to a string, which drops the structured
`FilesystemError` context. Update this match arm in `lib.rs` to preserve the
typed error by using the existing `?`-style conversion path or direct typed
`From`/`Into` conversion used elsewhere in this file, so the `FilesystemError`
variant and its path/operation details continue to propagate through
`RunStateError::Filesystem`.

In `@crates/ironclaw_secrets/src/filesystem_store.rs`:
- Around line 479-483: The no-op CAS comment in filesystem_store.rs is
inaccurate: the `already_marked` branch in the lease update path returns
`Err(SecretStoreError::LeaseExpired { lease_id })` and goes through
`CasUpdateError::Apply`, not the unchanged record/`PartialEq` skip path. Update
the comment to describe the actual intent, or change the branch in the relevant
lease/CAS helper to return `CasApply::new(lease, Err(...))` if that is the
desired behavior; use the `already_marked`, `SecretLeaseStatus::Expired`, and
`CasUpdateError::Apply` symbols to locate and align the code with the documented
contract.

In `@crates/ironclaw_turns/src/filesystem_store.rs`:
- Around line 378-385: The CAS error mapper in map_cas_error should not panic on
BridgeError::NoOp, since this is production error-handling code. Replace the
unreachable! fallback in map_cas_error with a typed unavailable TurnError that
preserves context about the unexpected NoOp leak, while keeping the
BridgeError::Real(inner) path unchanged and consistent with CasUpdateError
handling.

In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 137-140: The quality gate in the migration plan is outdated and
should match the repo’s required backend validation for persistence changes.
Update the “Quality gate” section to include the feature-isolation `cargo check`
matrix for the default/postgres build, `--no-default-features --features
libsql`, and `--all-features`, and strengthen the clippy step to the full `cargo
clippy --all --benches --tests --examples --all-features -- -D warnings`. Keep
the existing formatting/tests coverage, but ensure the plan explicitly reflects
the required checks for the legacy persistence migration.

---

Outside diff comments:
In `@crates/ironclaw_run_state/src/lib.rs`:
- Around line 922-932: The discard_pending flow in RunState::discard_pending is
currently split into a read/validate step and a later filesystem delete, which
can race with approve()/deny() and remove a record after its status has already
changed. Update this path to be atomic by using a CAS-aware transition or
transactional delete that enforces the same status invariant at the point of
mutation, rather than relying on a prior Pending check; keep the failure path
loud if the expected state no longer matches.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1c594037-4929-451d-97ad-73e096ed325d

📥 Commits

Reviewing files that changed from the base of the PR and between 44f063d and 3223540.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock, !**/Cargo.lock

📒 Files selected for processing (17)

.claude/rules/database.md
crates/ironclaw_filesystem/CLAUDE.md
crates/ironclaw_filesystem/src/cas.rs
crates/ironclaw_filesystem/src/cas/tests.rs
crates/ironclaw_filesystem/src/lib.rs
crates/ironclaw_filesystem/src/scoped.rs
crates/ironclaw_resources/Cargo.toml
crates/ironclaw_resources/src/cas_snapshot.rs
crates/ironclaw_resources/src/filesystem_store.rs
crates/ironclaw_resources/src/lib.rs
crates/ironclaw_run_state/src/lib.rs
crates/ironclaw_run_state/tests/approval_resolution_contract.rs
crates/ironclaw_run_state/tests/run_state_contract.rs
crates/ironclaw_secrets/src/filesystem_store.rs
crates/ironclaw_threads/src/filesystem_service.rs
crates/ironclaw_turns/src/filesystem_store.rs
docs/plans/2026-06-25-cas-migration.md

coderabbitai · 2026-06-25T12:09:28Z

+/// changed); `S: Clone` lets each retry hand `apply` an owned snapshot while the
+/// helper retains a copy for that equality check — both mirror the turns
+/// reference.
+#[allow(clippy::too_many_arguments)]


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Add the required arch-exempt annotation above this Clippy allow.

The repo's architecture rule rejects new #[allow(clippy::too_many_arguments)] without a same-site rationale. As per coding guidelines, "When introducing #[allow(clippy::too_many_arguments)], annotate with // arch-exempt: too_many_args, <reason naming the missing aggregation>, plan #NNNN``." Based on learnings, require the exemption comment immediately above the attribute and explain why this signature is justified.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_filesystem/src/cas.rs` at line 193, The new Clippy allow on cas.rs needs the required arch-exempt rationale comment immediately above #[allow(clippy::too_many_arguments)]. Update the same site near the cas:: function signature that triggered the lint by adding the mandatory // arch-exempt: too_many_args, <reason naming the missing aggregation>, plan `#NNNN` annotation, and make sure the reason justifies why this many parameters is unavoidable until the missing aggregation is introduced.

Sources: Coding guidelines, Learnings

coderabbitai · 2026-06-25T12:09:28Z

+        // 3. No-op fast path: nothing changed, so skip the write entirely.
+        if matches!(&current, Some(existing) if *existing == snapshot) {
+            return Ok(outcome);
+        }


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

The no-op shortcut doesn't cover absent records, but the public contract reads as if it does.

cas_update_loop() only skips the write when current is Some(existing) and existing == snapshot. None plus a default/unchanged snapshot still encodes and writes, which is why crates/ironclaw_turns/src/filesystem_store.rs had to add a separate BridgeError::NoOp path to suppress the write. Either implement the absent-record no-op here or narrow the docs around CasApply/cas_update to say the shortcut only applies when a record already exists. As per coding guidelines, "Comments that promise guarantees across layers must either be enforced by code/tests or softened to describe intent."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_filesystem/src/cas.rs` around lines 262 - 265, The no-op shortcut in cas_update_loop currently only returns early when current is Some(existing) and matches the snapshot, while absent records still fall through to encoding and writing; update the logic to either treat None plus an unchanged/default snapshot as a true no-op or adjust the CasApply/cas_update contract and surrounding comments to state the shortcut only applies to existing records. Use the existing cas_update_loop and CasApply symbols to keep the behavior and docs aligned, and remove any need for downstream NoOp suppression if you choose to enforce it here.

Source: Coding guidelines

coderabbitai · 2026-06-25T12:09:28Z

+fn capabilities_known(capabilities: &BackendCapabilities) -> bool {
+    *capabilities != BackendCapabilities::default()
+}
+
+/// `true` when the backend's transaction tier is at least
+/// [`TxnCapability::Cas`].
+fn capabilities_support_cas(capabilities: &BackendCapabilities) -> bool {
+    matches!(
+        capabilities.txn(),
+        TxnCapability::Cas | TxnCapability::MultiKey
+    )
+}


🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Don't overload BackendCapabilities::default() to mean "unknown".

BackendCapabilities::empty() is the crate's canonical "no capabilities advertised" shape, but capabilities_known() treats that same value as "unknown" and skips the pre-flight CAS gate. A concrete backend that inherits the default capabilities() now silently falls back to op-time behavior, which weakens the new fail-closed cas_update contract. As per coding guidelines, "Declare backend capabilities up front with BackendCapabilities" and "cas_update must fail closed with CasUpdateError::CasUnsupported."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_filesystem/src/cas.rs` around lines 295 - 306, The CAS preflight is treating BackendCapabilities::default() as “unknown,” which lets a backend with no advertised capabilities bypass the fail-closed gate. Update capabilities_known() in cas.rs to only treat an explicit unknown state as unknown and not overload BackendCapabilities::empty()/default(), then make the cas_update path rely on that check so unsupported backends still return CasUpdateError::CasUnsupported instead of falling back to op-time behavior.

Source: Coding guidelines

coderabbitai · 2026-06-25T12:09:28Z

+#[tokio::test]
+async fn create_if_absent_first_write_succeeds() {
+    let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
+    let scope = ResourceScope::system();
+
+    let outcome = cas_update(
+        fs.as_ref(),
+        &scope,
+        &counter_path(),
+        decode_counter,
+        encode_counter,
+        increment,
+    )
+    .await
+    .unwrap();
+
+    assert_eq!(outcome, 1, "first write returns the new value");
+
+    // The record now exists at version 1 with the expected body.
+    let stored = fs
+        .get(&scope, &counter_path())
+        .await
+        .unwrap()
+        .expect("counter persisted");
+    let counter = decode_counter(&stored.entry.body).unwrap();
+    assert_eq!(counter.value, 1);
+}
+
+#[tokio::test]
+async fn no_op_apply_skips_write() {
+    let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
+    let scope = ResourceScope::system();
+
+    // Seed a value of 5.
+    cas_update(
+        fs.as_ref(),
+        &scope,
+        &counter_path(),
+        decode_counter,
+        encode_counter,
+        |current: Option<Counter>| async move {
+            let _ = current;
+            Ok::<_, TestError>(CasApply::new(Counter { value: 5 }, ()))
+        },
+    )
+    .await
+    .unwrap();
+
+    let version_before = fs
+        .get(&scope, &counter_path())
+        .await
+        .unwrap()
+        .unwrap()
+        .version;
+
+    // An apply that returns the unchanged snapshot must not bump the version.
+    cas_update(
+        fs.as_ref(),
+        &scope,
+        &counter_path(),
+        decode_counter,
+        encode_counter,
+        |current: Option<Counter>| async move {
+            let snapshot = current.unwrap();
+            Ok::<_, TestError>(CasApply::new(snapshot, ()))
+        },
+    )
+    .await
+    .unwrap();
+
+    let version_after = fs
+        .get(&scope, &counter_path())
+        .await
+        .unwrap()
+        .unwrap()
+        .version;
+    assert_eq!(
+        version_before, version_after,
+        "no-op apply must not issue a write"
+    );
+}
+
+#[tokio::test]
+async fn high_contention_storm_has_no_lost_updates() {
+    const WRITERS: u64 = 50;
+
+    let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
+    let scope = ResourceScope::system();
+
+    let mut handles = Vec::new();
+    for _ in 0..WRITERS {
+        let fs = fs.clone();
+        let scope = scope.clone();
+        handles.push(tokio::spawn(async move {
+            cas_update(
+                fs.as_ref(),
+                &scope,
+                &counter_path(),
+                decode_counter,
+                encode_counter,
+                increment,
+            )
+            .await
+        }));
+    }
+
+    let mut observed = Vec::new();
+    for handle in handles {
+        observed.push(handle.await.unwrap().expect("writer succeeded"));
+    }
+
+    // Final value must equal the number of writers — every increment landed.
+    let final_counter = decode_counter(
+        &fs.get(&scope, &counter_path())
+            .await
+            .unwrap()
+            .unwrap()
+            .entry
+            .body,
+    )
+    .unwrap();
+    assert_eq!(
+        final_counter.value, WRITERS,
+        "every concurrent increment must be observed (no lost update)"
+    );
+
+    // Each writer observed a distinct increment value in 1..=WRITERS.
+    observed.sort_unstable();
+    let expected: Vec<u64> = (1..=WRITERS).collect();
+    assert_eq!(
+        observed, expected,
+        "each writer's returned outcome must be a unique increment"
+    );
+}
+
+#[tokio::test]
+async fn persistent_version_mismatch_exhausts_retries() {
+    let backend = Arc::new(AlwaysMismatchBackend::new());
+    let fs = Arc::new(scoped(backend.clone()));
+    let scope = ResourceScope::system();
+
+    // `get` always returns a synthetic existing record, so every attempt takes
+    // the put path and races into a VersionMismatch — no seed write needed.
+    let result = cas_update(
+        fs.as_ref(),
+        &scope,
+        &counter_path(),
+        decode_counter,
+        encode_counter,
+        increment,
+    )
+    .await;
+
+    assert!(
+        matches!(result, Err(CasUpdateError::RetriesExhausted)),
+        "persistent VersionMismatch must terminate with RetriesExhausted, got {result:?}"
+    );
+    assert_eq!(
+        backend.put_attempts.load(Ordering::SeqCst),
+        super::FILESYSTEM_CAS_RETRIES,
+        "the loop must attempt exactly the retry cap before giving up"
+    );
+}
+
+#[tokio::test]
+async fn non_cas_backend_is_rejected_not_overwritten() {
+    let backend = Arc::new(NonCasBackend::new());
+    let fs = Arc::new(scoped(backend.clone()));
+    let scope = ResourceScope::system();
+
+    let result = cas_update(
+        fs.as_ref(),
+        &scope,
+        &counter_path(),
+        decode_counter,
+        encode_counter,
+        increment,
+    )
+    .await;
+
+    assert!(
+        matches!(result, Err(CasUpdateError::CasUnsupported)),
+        "a non-CAS backend must be rejected by the capability gate, got {result:?}"
+    );
+
+    // Critically: nothing was written. The pre-flight gate refused before the
+    // blind-overwrite `put` could run.
+    let stored = fs.get(&scope, &counter_path()).await.unwrap();
+    assert!(
+        stored.is_none(),
+        "the capability gate must reject before any write (no blind overwrite)"
+    );
+}
+
+#[tokio::test]
+async fn apply_error_is_carried_through_unwrapped() {
+    let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
+    let scope = ResourceScope::system();
+
+    let result: Result<u64, CasUpdateError<TestError>> = cas_update(
+        fs.as_ref(),
+        &scope,
+        &counter_path(),
+        decode_counter,
+        encode_counter,
+        |_current: Option<Counter>| async move {
+            Err::<CasApply<Counter, u64>, _>(TestError("boom".to_string()))
+        },
+    )
+    .await;
+
+    match result {
+        Err(CasUpdateError::Apply(TestError(reason))) => assert_eq!(reason, "boom"),
+        other => panic!("expected Apply error carried through, got {other:?}"),
+    }
+}


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Add a regression test for the timeout branch.

CasUpdateError::Timeout is part of the new public contract and is the load-bearing guard for the wedged-backend failure mode, but this suite never stalls get, put, or apply long enough to prove the outer timeout wins. A tiny hanging backend plus paused Tokio time would cover the branch deterministically. As per coding guidelines, "Every bug fix must include a regression test" and Rust tests should cover changed critical paths.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_filesystem/src/cas/tests.rs` around lines 184 - 399, Add a regression test in cas/tests.rs for the timeout path in cas_update, since the current tests cover retries, CAS rejection, and apply errors but never exercise CasUpdateError::Timeout. Introduce a tiny wedged backend or stub that blocks one of the CAS operations (get, put, or apply), and use paused Tokio time so the outer timeout in cas_update deterministically fires. Reference cas_update and CasUpdateError::Timeout in the new test, and assert that the timeout branch wins over the hung backend behavior.

Source: Coding guidelines

coderabbitai · 2026-06-25T12:09:28Z

+        CasUpdateError::CasUnsupported => {
+            E::storage("snapshot backend does not support versioned compare-and-swap".to_string())
+        }
+        CasUpdateError::Backend(inner) => E::storage_from(inner),


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Sanitize backend errors before returning StorageError.

Line 212 forwards FilesystemError::to_string() into the store error; those displays can include backend virtual paths and raw backend reasons. Return a stable sanitized message here, and keep the detailed error only in internal diagnostics/source context. As per path instructions, do not expose backend paths or raw backend errors across public surfaces.

Suggested direction

- CasUpdateError::Backend(inner) => E::storage_from(inner), + CasUpdateError::Backend(_inner) => { + E::storage("snapshot filesystem backend error".to_string()) + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

CasUpdateError::Backend(inner) => E::storage_from(inner),

CasUpdateError::Backend(_inner) => {

E::storage("snapshot filesystem backend error".to_string())

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_resources/src/cas_snapshot.rs` at line 212, The CasUpdateError::Backend handling in cas_snapshot.rs currently forwards the backend error string directly via E::storage_from(inner), which can leak virtual paths and raw backend details into StorageError. Update this mapping to return a stable sanitized public message from the error conversion path, and preserve the detailed backend error only in internal diagnostics or source context associated with E::storage_from.

Source: Path instructions

coderabbitai · 2026-06-25T12:09:28Z

+        CasUpdateError::CasUnsupported => RunStateError::Backend(
+            "backend does not support versioned compare-and-swap".to_string(),
+        ),
+        CasUpdateError::Backend(fs_err) => RunStateError::Filesystem(fs_err.to_string()),


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Preserve the typed filesystem error.

fs_err.to_string() drops the structured FilesystemError variant and path/operation context that existing ? conversions preserve elsewhere in this file. Prefer the existing typed conversion:

Proposed fix

- CasUpdateError::Backend(fs_err) => RunStateError::Filesystem(fs_err.to_string()), + CasUpdateError::Backend(fs_err) => fs_err.into(),

As per path instructions: errors should propagate with context into typed errors.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

CasUpdateError::Backend(fs_err) => RunStateError::Filesystem(fs_err.to_string()),

CasUpdateError::Backend(fs_err) => fs_err.into(),

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_run_state/src/lib.rs` at line 1109, The `CasUpdateError::Backend` branch in `RunStateError` currently converts the filesystem failure to a string, which drops the structured `FilesystemError` context. Update this match arm in `lib.rs` to preserve the typed error by using the existing `?`-style conversion path or direct typed `From`/`Into` conversion used elsewhere in this file, so the `FilesystemError` variant and its path/operation details continue to propagate through `RunStateError::Filesystem`.

Sources: Coding guidelines, Path instructions

coderabbitai · 2026-06-25T12:09:28Z

+                            let already_marked = lease.status == SecretLeaseStatus::Expired;
+                            if already_marked {
+                                // No-op: return unchanged record so helper skips
+                                // the write (PartialEq equality path).
+                                Err(SecretStoreError::LeaseExpired { lease_id })


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the CAS no-op comment.

Line 481 says this returns the unchanged record for the PartialEq no-op path, but the code returns Err(LeaseExpired) and exits through CasUpdateError::Apply. Update the comment or return CasApply::new(lease, Err(...)) to match the documented behavior. As per coding guidelines, comments that promise guarantees across layers must be enforced by code/tests or softened to describe intent.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_secrets/src/filesystem_store.rs` around lines 479 - 483, The no-op CAS comment in filesystem_store.rs is inaccurate: the `already_marked` branch in the lease update path returns `Err(SecretStoreError::LeaseExpired { lease_id })` and goes through `CasUpdateError::Apply`, not the unchanged record/`PartialEq` skip path. Update the comment to describe the actual intent, or change the branch in the relevant lease/CAS helper to return `CasApply::new(lease, Err(...))` if that is the desired behavior; use the `already_marked`, `SecretLeaseStatus::Expired`, and `CasUpdateError::Apply` symbols to locate and align the code with the documented contract.

Source: Coding guidelines

coderabbitai · 2026-06-25T12:09:28Z

+fn map_cas_error<T>(error: CasUpdateError<BridgeError<T>>) -> TurnError {
+    match error {
+        CasUpdateError::Apply(BridgeError::Real(inner)) => inner,
+        CasUpdateError::Apply(BridgeError::NoOp(_)) => {
+            // Should be unreachable: the caller extracts NoOp before calling
+            // map_cas_error. Defensive fallback.
+            unreachable!("NoOp bridge error must be handled by the apply caller")
        }


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don’t panic in the CAS error mapper.

This is production error-handling code; if the NoOp sentinel ever leaks past the caller, unreachable! turns a recoverable mapping bug into a process panic. Return a typed unavailable error instead.

Suggested fix

CasUpdateError::Apply(BridgeError::NoOp(_)) => { - // Should be unreachable: the caller extracts NoOp before calling - // map_cas_error. Defensive fallback. - unreachable!("NoOp bridge error must be handled by the apply caller") + tracing::debug!("turn state CAS no-op sentinel reached error mapper"); + TurnError::Unavailable { + reason: "turn state persistence temporarily unavailable".to_string(), + } }

As per coding guidelines, production Rust should avoid panic-style failure paths and map errors with context.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

fn map_cas_error<T>(error: CasUpdateError<BridgeError<T>>) -> TurnError {

match error {

CasUpdateError::Apply(BridgeError::Real(inner)) => inner,

CasUpdateError::Apply(BridgeError::NoOp(_)) => {

// Should be unreachable: the caller extracts NoOp before calling

// map_cas_error. Defensive fallback.

unreachable!("NoOp bridge error must be handled by the apply caller")

}

fn map_cas_error<T>(error: CasUpdateError<BridgeError<T>>) -> TurnError {

match error {

CasUpdateError::Apply(BridgeError::Real(inner)) => inner,

CasUpdateError::Apply(BridgeError::NoOp(_)) => {

tracing::debug!("turn state CAS no-op sentinel reached error mapper");

TurnError::Unavailable {

reason: "turn state persistence temporarily unavailable".to_string(),

}

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_turns/src/filesystem_store.rs` around lines 378 - 385, The CAS error mapper in map_cas_error should not panic on BridgeError::NoOp, since this is production error-handling code. Replace the unreachable! fallback in map_cas_error with a typed unavailable TurnError that preserves context about the unexpected NoOp leak, while keeping the BridgeError::Real(inner) path unchanged and consistent with CasUpdateError handling.

Sources: Coding guidelines, Path instructions

coderabbitai · 2026-06-25T12:09:28Z

+## Quality gate
+
+Per touched crate: `cargo fmt --all`, `cargo clippy --all-targets` (zero
+warnings), `cargo test` (+ `--features integration` where stores have it).


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Align the plan's validation steps with the repo's required backend checks.

This plan documents a legacy persistence migration, but its quality gate drops the required feature-isolation cargo check matrix and weakens clippy to cargo clippy --all-targets. That makes the migration plan a stale source of truth for the postgres/libSQL split this PR is changing. As per coding guidelines, legacy persistence changes must "Test feature isolation" with default, --no-default-features --features libsql, and --all-features, and "Before You Open a PR" requires cargo clippy --all --benches --tests --examples --all-features -- -D warnings. Based on learnings, verify feature isolation with the required cargo check combinations for postgres, libSQL-only, and all-features builds.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/2026-06-25-cas-migration.md` around lines 137 - 140, The quality gate in the migration plan is outdated and should match the repo’s required backend validation for persistence changes. Update the “Quality gate” section to include the feature-isolation `cargo check` matrix for the default/postgres build, `--no-default-features --features libsql`, and `--all-features`, and strengthen the clippy step to the full `cargo clippy --all --benches --tests --examples --all-features -- -D warnings`. Keep the existing formatting/tests coverage, but ensure the plan explicitly reflects the required checks for the legacy persistence migration.

Sources: Coding guidelines, Learnings

railway-app · 2026-06-25T12:13:10Z

🚅 Deployed to the ironclaw-pr-5234 environment in ironclaw-ci-preview

Service	Status	Web	Updated (UTC)
ironclaw	✅ Success (View Logs)	Web	Jun 25, 2026 at 12:03 pm

Fold in PR review (henrypark133) + a durable-write hot-path audit. Adds §12 framing the deployed lease-expiry cascade as three independent layers: - Layer A: in-process lock convoy → #5234 (open); write-behind composes on the post-#5234 CAS path. - Layer B: pool starvation — DEFAULT_POSTGRES_POOL_MAX_SIZE=2 shared across all Postgres FS I/O. Notes the existing 30s checkout guard (closes the infinite hang) but flags pool-too-small + checkout(30s)>apply(15s); cheap mitigations (raise pool, reserve a critical connection, align checkout<apply) prior to and complementary with write-behind. - Layer C: the synchronous hot-path write map (events / governor / thread-append / lease / memory) with the write-behind-vs-batch-coalesce distinction. Events are the top batch-coalesce target (highest churn, O(1) INSERT no CAS); must stay DURABLE (source of truth), per-step flush to preserve live SSE. Memory stays synchronous (FTS-only, no embedding write; agent-initiated, low churn). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

henrypark133 and others added 7 commits June 24, 2026 19:01

railway-app Bot deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 25, 2026 11:53 View deployment

github-actions Bot added scope: docs Documentation scope: dependencies Dependency updates size: XL 500+ changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Jun 25, 2026

gemini-code-assist Bot reviewed Jun 25, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 25, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

henrypark133 mentioned this pull request Jun 25, 2026

docs(reborn): design — write-behind lease durability + side-effect gate #5249

Open

henrypark133 mentioned this pull request Jun 25, 2026

Heartbeat lease write-behind: move runner-lease renewal off the synchronous Postgres path #5253

Open

-        CasUpdateError::Backend(inner) => E::storage_from(inner),
+        CasUpdateError::Backend(_inner) => {
+            E::storage("snapshot filesystem backend error".to_string())
+        }

	CasUpdateError::Backend(fs_err) => RunStateError::Filesystem(fs_err.to_string()),
	CasUpdateError::Backend(fs_err) => fs_err.into(),

Uh oh!

Conversation

henrypark133 commented Jun 25, 2026

Problem

Change

Tri-backend parity (no diverging experience by host)

Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

railway-app Bot commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading