fix(reborn): remove per-record lock convoys via shared cas_update helper #5234
henrypark133 wants to merge 7 commits into
Extract the proven mutex-free CAS pattern from ironclaw_turns (PR #5142) into one shared helper in ironclaw_filesystem so the remaining stores can drop their per-record tokio::sync::Mutex convoys. cas_update runs a bounded read-modify-write loop: read versioned snapshot, run the caller's idempotent apply closure, CAS-put with the read version, and on VersionMismatch re-read and retry with jittered exponential backoff (2ms..50ms), capped at 32 retries and wrapped in a 15s timeout. A fail-closed capability gate rejects backends that cannot CAS rather than silently blind-overwriting. The helper is generic over the record type, the outcome, and the caller's error; it never leaks store-specific types. No store migrated yet — that follows in the next commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
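The read, apply, CAS-put, retry sequence described in this commit can be sketched with the standard library alone. This is a minimal synchronous illustration, not the `cas_update` implementation: `VersionedCell`, `put_if_version`, `cas_update_sketch`, and `concurrent_increments` are hypothetical names, jitter is omitted, and the short internal `Mutex` only stands in for a backend that performs an atomic versioned put.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical in-memory versioned cell standing in for a CAS-capable backend.
/// The mutex is held only inside each single get/put, never across a whole
/// read-modify-write.
pub struct VersionedCell {
    inner: Mutex<(u64, u64)>, // (version, value)
}

impl VersionedCell {
    pub fn new(value: u64) -> Self {
        Self { inner: Mutex::new((0, value)) }
    }

    /// Read a versioned snapshot.
    pub fn get(&self) -> (u64, u64) {
        *self.inner.lock().unwrap()
    }

    /// Write only if the stored version still equals `expected`.
    pub fn put_if_version(&self, expected: u64, value: u64) -> bool {
        let mut guard = self.inner.lock().unwrap();
        if guard.0 != expected {
            return false; // version mismatch: another writer won
        }
        *guard = (expected + 1, value);
        true
    }
}

const MAX_RETRIES: usize = 32;

/// Exponential backoff from 2ms, capped at 50ms (jitter left out of this sketch).
fn backoff(attempt: usize) -> Duration {
    let shift = attempt.min(8) as u32;
    Duration::from_millis(2)
        .saturating_mul(1 << shift)
        .min(Duration::from_millis(50))
}

/// Read, apply, CAS-put; on version mismatch re-read and retry, bounded.
pub fn cas_update_sketch<A>(cell: &VersionedCell, mut apply: A) -> Result<u64, &'static str>
where
    A: FnMut(u64) -> u64,
{
    for attempt in 0..=MAX_RETRIES {
        let (version, current) = cell.get();
        let next = apply(current);
        if next == current {
            return Ok(current); // unchanged snapshot: skip the write
        }
        if cell.put_if_version(version, next) {
            return Ok(next);
        }
        thread::sleep(backoff(attempt));
    }
    Err("CAS retries exhausted")
}

/// Spawn `writers` threads that each increment once; return the final value.
pub fn concurrent_increments(writers: u64) -> u64 {
    let cell = Arc::new(VersionedCell::new(0));
    let handles: Vec<_> = (0..writers)
        .map(|_| {
            let cell = Arc::clone(&cell);
            thread::spawn(move || cas_update_sketch(&cell, |value| value + 1).unwrap())
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    cell.get().1
}
```

Each failed put means some other writer succeeded, so with fewer writers than the retry cap every increment eventually lands without any lock spanning the read and the write.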
…te helper ironclaw_resources had NO CAS-retry loop — its per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) held across the backend get/put awaits was the ONLY serializer. Under burst, one writer stalled inside its critical section parked every other same-scope writer (the convoy that contributed to the 2026-06-24 runtime wedge). Naively deleting that mutex without a retry loop loses updates: a racing writer's single-attempt CAS detects the version mismatch and errors out (cross-process CAS contention) instead of retrying. Route update_snapshot through ironclaw_filesystem::cas_update (bounded CAS retry + jittered backoff + 15s timeout + fail-closed capability gate) and delete the per-record mutex, FilesystemRecordLock, the local PutError, and the fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The helper fails closed instead; verified safe — these store aliases only ever resolve to CAS-capable db/in-memory backends in production. The public update API widens from FnOnce to FnMut because the helper re-runs the closure against a freshly read snapshot on every CAS retry; leaf closures that previously moved captured values now clone per invocation. Add PartialEq to BudgetGateSnapshot (helper needs S: Clone + PartialEq). Red->green regression: cas_snapshot::tests::concurrent_increments_have_no_lost_updates proves RED on lock-removed-no-retry (lost updates / spurious contention) and GREEN through the helper (every concurrent increment lands, no convoy). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
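The `FnOnce` to `FnMut` widening mentioned in this commit can be illustrated in isolation. The sketch below uses hypothetical names (`run_with_retries`, `label_attempts`, `consume`) and assumes only that a retrying driver calls the closure once per attempt, so a closure that moved a captured value out on its first call could not be passed.

```rust
/// Hypothetical retrying driver: it calls `apply` once per attempt,
/// which is why the bound has to be `FnMut` rather than `FnOnce`.
pub fn run_with_retries<A>(attempts: usize, mut apply: A) -> Vec<String>
where
    A: FnMut(usize) -> String,
{
    (0..attempts).map(|attempt| apply(attempt)).collect()
}

/// Takes ownership of its argument, like a leaf operation that
/// previously received a moved capture.
fn consume(owner: String, attempt: usize) -> String {
    format!("{owner}#{attempt}")
}

pub fn label_attempts(owner: String, attempts: usize) -> Vec<String> {
    // Passing `owner` straight into `consume` would move it out of the
    // closure and make the closure `FnOnce`. Cloning per invocation keeps
    // the closure re-runnable on every retry.
    run_with_retries(attempts, |attempt| {
        let owner = owner.clone();
        consume(owner, attempt)
    })
}
```

The clone is paid once per attempt, which is usually negligible next to a backend round trip.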
…ant per-record mutex run_state held BOTH a store-local CAS-retry loop AND a per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) across the backend get/put awaits — belt-and-suspenders where the mutex was a redundant in-process serializer over a backend that already does versioned CAS. Under burst the mutex convoyed same-scope writers behind one stalled writer (the 2026-06-24 wedge pattern). Route apply_update, update_status, start, and save_pending through ironclaw_filesystem::cas_update (one shared CAS implementation: bounded retry + jittered backoff + 15s timeout + fail-closed capability gate). Delete the store-local FILESYSTEM_CAS_RETRIES loop, the per-record lock + accessor + its two unit tests, the local PutError, and the fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The scope-ownership check and the approval Pending guard move inside the re-runnable apply closure; the ApprovalStatus guard returns Err from apply. discard_pending keeps its read-then-delete, only the lock is dropped. The run_state contract tests previously ran against LocalFilesystem (byte-only, no versioned CAS) where the OLD fail-open fallback masked the missing CAS. Production never routes run_state to LocalFilesystem — only to CAS-capable db/in-memory backends — so the tests now use InMemoryBackend, matching the production capability shape (CAS) instead of a non-production blind-overwrite path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p per-record mutex ensure_thread held the only per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) in this crate, across the backend get/put awaits, to serialize the check-then-create against Unsupported(WriteFile)-fallback backends. Route it through ironclaw_filesystem::cas_update: the apply closure returns the existing record unchanged (no-op, no write) when the thread already exists with a matching scope, or builds the fresh StoredThreadRecord when absent. A concurrent create-if-absent loser hits VersionMismatch, the helper re-reads and re-runs apply which now sees the winner and reconciles scope — exactly the old single-reconcile semantics, now via the shared CAS-retry loop without any lock. Delete the lock infra (Weak map + accessor); add PartialEq to StoredThreadRecord (helper needs S: Clone + PartialEq). The three other record/txn loops were already lock-free and are left as-is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord mutex (Arc leak fix) The secrets filesystem store kept FILESYSTEM_RECORD_LOCKS as a HashMap of *strong* Arc<Mutex<()>> (not Weak), one entry per (scope,lease)/session path ever seen, NEVER pruned — an unbounded memory leak. The per-record mutex was also held across the backend get/put awaits, convoying same-key writers behind one stalled writer (the 2026-06-24 wedge pattern). Route consume/revoke/consume_session_use through ironclaw_filesystem::cas_update (one shared CAS implementation), mapping the local CasDecision onto CasApply: Commit/BestEffortCommit -> changed snapshot; Settle(Ok) -> unchanged snapshot (helper skips the write via PartialEq no-op); Settle(Err)/not-found -> apply error. Crypto decrypt + use-count increment run inside the re-runnable apply closure (pure / recomputed from the freshly read record each retry). validate_session is a pure read — its lock is simply deleted. Delete the entire lock map (leak gone), cas_mutate/CasDecision/CAS_RETRY_ATTEMPTS, and the fail-open put_with_version_fallback (helper fails closed). Add PartialEq to StoredLease and StoredSession. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR #5142 removed the per-record mutex from ironclaw_turns but left a local copy of the CAS read-modify-write loop (apply_with_retry, put_with_cas, PutError, cas_retry_backoff + constants). Re-home it onto ironclaw_filesystem::cas_update so there is ONE CAS implementation across the codebase. Behavior is preserved: the 18 call sites and their closure shape (FnMut(InMemoryTurnStateStore) -> Fut) are unchanged; the exact timeout/exhaustion error strings are preserved. A BridgeError<T> sentinel carries the absent-record + default-snapshot no-op through the helper's apply-error channel (the helper's own no-op check only fires for Some(existing)==new; the turns store additionally must skip creating a file for an empty default store, which the old new==old check covered because read returned default on absent). The 500ms SNAPSHOT_READ_CACHE_TTL read cache is kept as a separate layer: cleared around the CAS call and repopulated on the next read (the helper does its own fresh get and does not surface the new RecordVersion; every read_snapshot caller already discards the version, so caching None is safe). Delete the now-unused local loop, put_with_cas, PutError, cas_retry_backoff, and the duplicate CAS constants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord mutex Add the invariant to crates/ironclaw_filesystem/CLAUDE.md (#2 "CAS is the floor") and a pointer in .claude/rules/database.md: every filesystem read-modify-write must go through the one shared ironclaw_filesystem::cas_update helper; never wrap it in a per-record tokio::sync::Mutex held across the backend .await (redundant serializer + convoy/wedge risk + leak). Also commit Cargo.lock (async-trait dev-dep added to ironclaw_resources for the convoy regression test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review
This pull request migrates multiple persistence stores (ironclaw_resources, ironclaw_run_state, ironclaw_threads, ironclaw_secrets, and ironclaw_turns) to use a shared, lock-free compare-and-swap (CAS) helper (cas_update). This eliminates redundant per-record mutexes held across .await boundaries, preventing runtime-wedging convoys under high contention. Feedback on the changes highlights an opportunity to improve the jitter calculation in cas_retry_backoff using RandomState to avoid zero-jitter scenarios in low clock resolution environments (such as VMs or Windows). Additionally, it is recommended to explicitly add Send bounds to the closure generic constraints in cas_update and cas_update_loop to catch any non-Send regressions at the helper's definition site.
async fn cas_retry_backoff(attempt: usize) {
    let shift = attempt.min(8) as u32;
    let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
    let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
        .saturating_mul(multiplier)
        .min(FILESYSTEM_CAS_BACKOFF_MAX);
    let jitter = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|elapsed| {
            let jitter_ceiling = base_delay.as_millis().max(1);
            Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64)
        })
        .unwrap_or_default();
    tokio::time::sleep(base_delay.saturating_add(jitter)).await;
}
Using SystemTime::now() for jitter calculation can lead to zero jitter on systems with low clock resolution (e.g., 1ms resolution in VMs, containers, or Windows). Since base_delay.as_millis() is at most 50, and 1,000,000 (1ms in nanoseconds) is divisible by all common backoff ceilings (2, 4, 8, 16, 32, 50), the modulo operation elapsed.as_nanos() % jitter_ceiling will always evaluate to exactly 0. This completely defeats the purpose of jitter under high contention.
Instead, use std::collections::hash_map::RandomState to obtain a high-quality, thread-safe, and un-correlated pseudo-random value seeded from the OS without adding external dependencies.
async fn cas_retry_backoff(attempt: usize) {
    let shift = attempt.min(8) as u32;
    let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
    let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
        .saturating_mul(multiplier)
        .min(FILESYSTEM_CAS_BACKOFF_MAX);
    let jitter = {
        use std::collections::hash_map::RandomState;
        use std::hash::{BuildHasher, Hash, Hasher};
        let mut hasher = RandomState::new().build_hasher();
        attempt.hash(&mut hasher);
        let hash = hasher.finish();
        let jitter_ceiling = base_delay.as_millis().max(1) as u64;
        Duration::from_millis(hash % jitter_ceiling)
    };
    tokio::time::sleep(base_delay.saturating_add(jitter)).await;
}
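The zero-jitter arithmetic behind this comment can be checked directly. The sketch below reproduces only the modulo step from the original backoff; `clock_jitter_ms` and `coarse` are hypothetical helper names, and the 1ms rounding models the coarse clock the comment describes rather than any specific platform.

```rust
/// Jitter in milliseconds as the original backoff derives it:
/// elapsed nanoseconds modulo the millisecond ceiling.
pub fn clock_jitter_ms(elapsed_nanos: u128, ceiling_ms: u128) -> u64 {
    (elapsed_nanos % ceiling_ms.max(1)) as u64
}

/// Round a nanosecond timestamp down to a 1ms tick, modelling a clock
/// that only advances in whole milliseconds.
pub fn coarse(elapsed_nanos: u128) -> u128 {
    elapsed_nanos - elapsed_nanos % 1_000_000
}
```

Because 1,000,000 = 2^6 · 5^6, every ceiling in the list (2, 4, 8, 16, 32, 50) divides any whole-millisecond timestamp, so `clock_jitter_ms(coarse(t), ceiling)` is 0 for all of them. A nanosecond-resolution timestamp such as 1_750_000_123_456_789 gives 39 for a ceiling of 50.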
pub async fn cas_update<F, S, T, E, D, N, A, Fut>(
    filesystem: &ScopedFilesystem<F>,
    scope: &ResourceScope,
    path: &ScopedPath,
    decode: D,
    encode: N,
    mut apply: A,
) -> Result<T, CasUpdateError<E>>
where
    F: RootFilesystem + ?Sized,
    S: PartialEq + Clone,
    D: Fn(&[u8]) -> Result<S, E>,
    N: Fn(&S) -> Result<Entry, E>,
    A: FnMut(Option<S>) -> Fut,
    Fut: Future<Output = Result<CasApply<S, T>, E>>,
According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.
pub async fn cas_update<F, S, T, E, D, N, A, Fut>(
    filesystem: &ScopedFilesystem<F>,
    scope: &ResourceScope,
    path: &ScopedPath,
    decode: D,
    encode: N,
    mut apply: A,
) -> Result<T, CasUpdateError<E>>
where
    F: RootFilesystem + ?Sized,
    S: PartialEq + Clone + Send,
    T: Send,
    E: Send,
    D: Fn(&[u8]) -> Result<S, E> + Send,
    N: Fn(&S) -> Result<Entry, E> + Send,
    A: FnMut(Option<S>) -> Fut + Send,
    Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,
References
- In Rust, explicitly add `Send` bounds to closure generic constraints in async helper functions that return futures intended to be `Send`. This ensures that non-`Send` regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by `Send` at the call site.
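A reduced illustration of the `Send`-bound rule, using only the standard library. Everything here is hypothetical (`apply_twice`, `require_send`, `block_on`, `demo`); the tiny busy-polling executor exists only so the sketch runs without an async runtime and is suitable only for futures that never return `Pending`.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical async helper with `Send` stated on the closure and on the
/// future it returns, in the style the review comment asks for.
pub async fn apply_twice<A, Fut>(mut apply: A) -> u64
where
    A: FnMut(u64) -> Fut + Send,
    Fut: Future<Output = u64> + Send,
{
    let first = apply(1).await;
    apply(first).await
}

/// Compile-time probe: accepts only `Send` values and returns them unchanged.
pub fn require_send<T: Send>(value: T) -> T {
    value
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor for futures that complete without pending.
pub fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut future = pin!(future);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
    }
}

pub fn demo() -> u64 {
    // With the bounds above, a closure capturing an `Rc` would be rejected
    // when passed to `apply_twice`, with the error naming the helper's bound.
    let future = require_send(apply_twice(|n| async move { n + 10 }));
    block_on(future)
}
```

Here `demo()` evaluates `apply(1)` to 11 and `apply(11)` to 21, and `require_send` confirms at compile time that the helper's future can cross threads.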
async fn cas_update_loop<F, S, T, E, D, N, A, Fut>(
    filesystem: &ScopedFilesystem<F>,
    scope: &ResourceScope,
    path: &ScopedPath,
    decode: &D,
    encode: &N,
    apply: &mut A,
) -> Result<T, CasUpdateError<E>>
where
    F: RootFilesystem + ?Sized,
    S: PartialEq + Clone,
    D: Fn(&[u8]) -> Result<S, E>,
    N: Fn(&S) -> Result<Entry, E>,
    A: FnMut(Option<S>) -> Fut,
    Fut: Future<Output = Result<CasApply<S, T>, E>>,
According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.
async fn cas_update_loop<F, S, T, E, D, N, A, Fut>(
    filesystem: &ScopedFilesystem<F>,
    scope: &ResourceScope,
    path: &ScopedPath,
    decode: &D,
    encode: &N,
    apply: &mut A,
) -> Result<T, CasUpdateError<E>>
where
    F: RootFilesystem + ?Sized,
    S: PartialEq + Clone + Send,
    T: Send,
    E: Send,
    D: Fn(&[u8]) -> Result<S, E> + Send,
    N: Fn(&S) -> Result<Entry, E> + Send,
    A: FnMut(Option<S>) -> Fut + Send,
    Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,
References
- In Rust, explicitly add `Send` bounds to closure generic constraints in async helper functions that return futures intended to be `Send`. This ensures that non-`Send` regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by `Send` at the call site.
📝 Walkthrough
Introduced a shared filesystem
Changes
Shared filesystem CAS migration
Sequence Diagram(s)
sequenceDiagram
participant FilesystemRunStateStore
participant cas_update
participant ScopedFilesystem
FilesystemRunStateStore->>cas_update: apply_update
cas_update->>ScopedFilesystem: capabilities()
ScopedFilesystem-->>cas_update: BackendCapabilities
cas_update->>ScopedFilesystem: get(path)
cas_update->>ScopedFilesystem: put(path, CasExpectation::Version)
ScopedFilesystem-->>cas_update: VersionMismatch
cas_update->>cas_update: retry with backoff
cas_update-->>FilesystemRunStateStore: Result
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
💡 Codex Review
ironclaw/crates/ironclaw_run_state/src/lib.rs
Line 932 in 3223540
When discard_pending races in the same process with approve or deny for the same request, this unconditional delete can now run after the resolver's CAS-protected status update and remove the terminal approval record. The removed per-record lock used to serialize that get/check/delete sequence with status updates; without either that lock or a CAS/tombstone transition here, a user approval can be lost and later look like UnknownApprovalRequest rather than Approved.
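One way to picture the fix direction this comment points at is a delete that carries the version it read, so a stale discard is refused once a resolver has advanced the record. The sketch below is an assumption-laden illustration: `ApprovalSlot`, `put_if_version`, `delete_if_version`, and `interleaved_discard_then_approve` are hypothetical, and whether the real backend offers a versioned delete is not stated in this thread.

```rust
use std::sync::Mutex;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Status {
    Pending,
    Approved,
}

/// Hypothetical single-record store with versioned writes and deletes.
pub struct ApprovalSlot {
    inner: Mutex<Option<(u64, Status)>>, // (version, status)
}

impl ApprovalSlot {
    pub fn pending() -> Self {
        Self { inner: Mutex::new(Some((1, Status::Pending))) }
    }

    pub fn get(&self) -> Option<(u64, Status)> {
        *self.inner.lock().unwrap()
    }

    /// Update the status only if the stored version equals `expected`.
    pub fn put_if_version(&self, expected: u64, status: Status) -> bool {
        let mut guard = self.inner.lock().unwrap();
        let current = *guard;
        match current {
            Some((version, _)) if version == expected => {
                *guard = Some((version + 1, status));
                true
            }
            _ => false,
        }
    }

    /// Delete only if the stored version equals `expected`.
    pub fn delete_if_version(&self, expected: u64) -> bool {
        let mut guard = self.inner.lock().unwrap();
        let current = *guard;
        match current {
            Some((version, _)) if version == expected => {
                *guard = None;
                true
            }
            _ => false,
        }
    }
}

/// Replays the racy interleaving deterministically: discard reads `Pending`,
/// approve commits, then the discard's delete arrives with a stale version.
pub fn interleaved_discard_then_approve() -> (bool, Option<Status>) {
    let slot = ApprovalSlot::pending();
    let (seen_version, seen_status) = slot.get().expect("record exists");
    assert_eq!(seen_status, Status::Pending);
    // The resolver wins the race and bumps the version.
    assert!(slot.put_if_version(seen_version, Status::Approved));
    // The stale delete is refused, so the terminal record survives.
    let deleted = slot.delete_if_version(seen_version);
    (deleted, slot.get().map(|(_, status)| status))
}
```

With an unconditional delete the final state would be an absent record; with the versioned delete the function returns `(false, Some(Status::Approved))`.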
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/ironclaw_run_state/src/lib.rs (1)
922-932: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Make `discard_pending` atomic before dropping the lock.
This path reads `Pending`, then deletes later with no CAS expectation. A concurrent `approve()`/`deny()` can win the CAS update between Line 922 and Line 932, then this delete removes a non-pending approval record. Use a CAS-aware delete/transactional transition, or model discard as a CAS status update instead of read-then-delete. As per path instructions: “Fail loud” and filesystem read-modify-write paths must use the CAS invariant rather than split multi-step state changes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_run_state/src/lib.rs` around lines 922 - 932, The discard_pending flow in RunState::discard_pending is currently split into a read/validate step and a later filesystem delete, which can race with approve()/deny() and remove a record after its status has already changed. Update this path to be atomic by using a CAS-aware transition or transactional delete that enforces the same status invariant at the point of mutation, rather than relying on a prior Pending check; keep the failure path loud if the expected state no longer matches.
Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_filesystem/src/cas.rs`:
- Line 193: The new Clippy allow on cas.rs needs the required arch-exempt
rationale comment immediately above #[allow(clippy::too_many_arguments)]. Update
the same site near the cas:: function signature that triggered the lint by
adding the mandatory // arch-exempt: too_many_args, <reason naming the missing
aggregation>, plan `#NNNN` annotation, and make sure the reason justifies why this
many parameters is unavoidable until the missing aggregation is introduced.
- Around line 295-306: The CAS preflight is treating
BackendCapabilities::default() as “unknown,” which lets a backend with no
advertised capabilities bypass the fail-closed gate. Update capabilities_known()
in cas.rs to only treat an explicit unknown state as unknown and not overload
BackendCapabilities::empty()/default(), then make the cas_update path rely on
that check so unsupported backends still return CasUpdateError::CasUnsupported
instead of falling back to op-time behavior.
- Around line 262-265: The no-op shortcut in cas_update_loop currently only
returns early when current is Some(existing) and matches the snapshot, while
absent records still fall through to encoding and writing; update the logic to
either treat None plus an unchanged/default snapshot as a true no-op or adjust
the CasApply/cas_update contract and surrounding comments to state the shortcut
only applies to existing records. Use the existing cas_update_loop and CasApply
symbols to keep the behavior and docs aligned, and remove any need for
downstream NoOp suppression if you choose to enforce it here.
In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 184-399: Add a regression test in cas/tests.rs for the timeout
path in cas_update, since the current tests cover retries, CAS rejection, and
apply errors but never exercise CasUpdateError::Timeout. Introduce a tiny wedged
backend or stub that blocks one of the CAS operations (get, put, or apply), and
use paused Tokio time so the outer timeout in cas_update deterministically
fires. Reference cas_update and CasUpdateError::Timeout in the new test, and
assert that the timeout branch wins over the hung backend behavior.
In `@crates/ironclaw_resources/src/cas_snapshot.rs`:
- Line 212: The CasUpdateError::Backend handling in cas_snapshot.rs currently
forwards the backend error string directly via E::storage_from(inner), which can
leak virtual paths and raw backend details into StorageError. Update this
mapping to return a stable sanitized public message from the error conversion
path, and preserve the detailed backend error only in internal diagnostics or
source context associated with E::storage_from.
In `@crates/ironclaw_run_state/src/lib.rs`:
- Line 1109: The `CasUpdateError::Backend` branch in `RunStateError` currently
converts the filesystem failure to a string, which drops the structured
`FilesystemError` context. Update this match arm in `lib.rs` to preserve the
typed error by using the existing `?`-style conversion path or direct typed
`From`/`Into` conversion used elsewhere in this file, so the `FilesystemError`
variant and its path/operation details continue to propagate through
`RunStateError::Filesystem`.
In `@crates/ironclaw_secrets/src/filesystem_store.rs`:
- Around line 479-483: The no-op CAS comment in filesystem_store.rs is
inaccurate: the `already_marked` branch in the lease update path returns
`Err(SecretStoreError::LeaseExpired { lease_id })` and goes through
`CasUpdateError::Apply`, not the unchanged record/`PartialEq` skip path. Update
the comment to describe the actual intent, or change the branch in the relevant
lease/CAS helper to return `CasApply::new(lease, Err(...))` if that is the
desired behavior; use the `already_marked`, `SecretLeaseStatus::Expired`, and
`CasUpdateError::Apply` symbols to locate and align the code with the documented
contract.
In `@crates/ironclaw_turns/src/filesystem_store.rs`:
- Around line 378-385: The CAS error mapper in map_cas_error should not panic on
BridgeError::NoOp, since this is production error-handling code. Replace the
unreachable! fallback in map_cas_error with a typed unavailable TurnError that
preserves context about the unexpected NoOp leak, while keeping the
BridgeError::Real(inner) path unchanged and consistent with CasUpdateError
handling.
In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 137-140: The quality gate in the migration plan is outdated and
should match the repo’s required backend validation for persistence changes.
Update the “Quality gate” section to include the feature-isolation `cargo check`
matrix for the default/postgres build, `--no-default-features --features
libsql`, and `--all-features`, and strengthen the clippy step to the full `cargo
clippy --all --benches --tests --examples --all-features -- -D warnings`. Keep
the existing formatting/tests coverage, but ensure the plan explicitly reflects
the required checks for the legacy persistence migration.
---
Outside diff comments:
In `@crates/ironclaw_run_state/src/lib.rs`:
- Around line 922-932: The discard_pending flow in RunState::discard_pending is
currently split into a read/validate step and a later filesystem delete, which
can race with approve()/deny() and remove a record after its status has already
changed. Update this path to be atomic by using a CAS-aware transition or
transactional delete that enforces the same status invariant at the point of
mutation, rather than relying on a prior Pending check; keep the failure path
loud if the expected state no longer matches.
⛔ Files ignored due to path filters (1)
- `Cargo.lock` is excluded by `!**/*.lock`, `!**/Cargo.lock`
📒 Files selected for processing (17)
- .claude/rules/database.md
- crates/ironclaw_filesystem/CLAUDE.md
- crates/ironclaw_filesystem/src/cas.rs
- crates/ironclaw_filesystem/src/cas/tests.rs
- crates/ironclaw_filesystem/src/lib.rs
- crates/ironclaw_filesystem/src/scoped.rs
- crates/ironclaw_resources/Cargo.toml
- crates/ironclaw_resources/src/cas_snapshot.rs
- crates/ironclaw_resources/src/filesystem_store.rs
- crates/ironclaw_resources/src/lib.rs
- crates/ironclaw_run_state/src/lib.rs
- crates/ironclaw_run_state/tests/approval_resolution_contract.rs
- crates/ironclaw_run_state/tests/run_state_contract.rs
- crates/ironclaw_secrets/src/filesystem_store.rs
- crates/ironclaw_threads/src/filesystem_service.rs
- crates/ironclaw_turns/src/filesystem_store.rs
- docs/plans/2026-06-25-cas-migration.md
/// changed); `S: Clone` lets each retry hand `apply` an owned snapshot while the
/// helper retains a copy for that equality check — both mirror the turns
/// reference.
#[allow(clippy::too_many_arguments)]
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Add the required arch-exempt annotation above this Clippy allow.
The repo's architecture rule rejects new `#[allow(clippy::too_many_arguments)]` without a same-site rationale. As per coding guidelines, "When introducing `#[allow(clippy::too_many_arguments)]`, annotate with `// arch-exempt: too_many_args, <reason naming the missing aggregation>, plan #NNNN`." Based on learnings, require the exemption comment immediately above the attribute and explain why this signature is justified.
Sources: Coding guidelines, Learnings
// 3. No-op fast path: nothing changed, so skip the write entirely.
if matches!(&current, Some(existing) if *existing == snapshot) {
    return Ok(outcome);
}
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
The no-op shortcut doesn't cover absent records, but the public contract reads as if it does.
cas_update_loop() only skips the write when current is Some(existing) and existing == snapshot. None plus a default/unchanged snapshot still encodes and writes, which is why crates/ironclaw_turns/src/filesystem_store.rs had to add a separate BridgeError::NoOp path to suppress the write. Either implement the absent-record no-op here or narrow the docs around CasApply/cas_update to say the shortcut only applies when a record already exists. As per coding guidelines, "Comments that promise guarantees across layers must either be enforced by code/tests or softened to describe intent."
Source: Coding guidelines
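The two behaviours contrasted in the no-op finding above can be written as small decision functions. This is a sketch under assumptions: `WriteDecision`, `decide_existing_only`, and `decide_with_absent_default` are hypothetical names, and the second variant assumes an `S: Default` bound that the helper's stated `S: PartialEq + Clone` constraint does not include.

```rust
#[derive(Debug, PartialEq)]
pub enum WriteDecision {
    Skip,
    Write,
}

/// Behaviour as the finding describes it: skip only when a record
/// already exists and equals the new snapshot.
pub fn decide_existing_only<S: PartialEq>(current: Option<&S>, snapshot: &S) -> WriteDecision {
    match current {
        Some(existing) if existing == snapshot => WriteDecision::Skip,
        _ => WriteDecision::Write,
    }
}

/// One possible widening (assumption: `S: Default`): an absent record plus
/// a default snapshot is also treated as a no-op, so no file is created.
pub fn decide_with_absent_default<S: PartialEq + Default>(
    current: Option<&S>,
    snapshot: &S,
) -> WriteDecision {
    match current {
        Some(existing) if existing == snapshot => WriteDecision::Skip,
        None if *snapshot == S::default() => WriteDecision::Skip,
        _ => WriteDecision::Write,
    }
}
```

The first function writes for an absent record even when the snapshot is the default value, which matches the case the turns store had to suppress on its own. The second skips that write but still writes a non-default snapshot.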
fn capabilities_known(capabilities: &BackendCapabilities) -> bool {
    *capabilities != BackendCapabilities::default()
}

/// `true` when the backend's transaction tier is at least
/// [`TxnCapability::Cas`].
fn capabilities_support_cas(capabilities: &BackendCapabilities) -> bool {
    matches!(
        capabilities.txn(),
        TxnCapability::Cas | TxnCapability::MultiKey
    )
}
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Don't overload BackendCapabilities::default() to mean "unknown".
BackendCapabilities::empty() is the crate's canonical "no capabilities advertised" shape, but capabilities_known() treats that same value as "unknown" and skips the pre-flight CAS gate. A concrete backend that inherits the default capabilities() now silently falls back to op-time behavior, which weakens the new fail-closed cas_update contract. As per coding guidelines, "Declare backend capabilities up front with BackendCapabilities" and "cas_update must fail closed with CasUpdateError::CasUnsupported."
Source: Coding guidelines
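The distinction the capability finding above draws, between "unknown" and "declared, but empty", can be modelled with an explicit tri-state. The types below (`TxnTier`, `Declared`, `Gate`, `preflight`) are hypothetical stand-ins and do not reflect the actual `BackendCapabilities` API; the sketch only shows a gate in which an explicitly empty declaration fails closed.

```rust
/// Hypothetical transaction tiers a backend may declare.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TxnTier {
    NoTxn,
    Cas,
    MultiKey,
}

/// Hypothetical declaration state: "unknown" is its own variant rather
/// than being inferred from an empty or default capability set.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Declared {
    Unknown,
    Known(TxnTier),
}

#[derive(Debug, PartialEq)]
pub enum Gate {
    Proceed,
    DeferToOpTime,
    CasUnsupported,
}

/// Pre-flight gate: only an explicit `Unknown` defers to op-time behaviour;
/// a backend that declares no transaction support is rejected up front.
pub fn preflight(declared: Declared) -> Gate {
    match declared {
        Declared::Known(TxnTier::Cas | TxnTier::MultiKey) => Gate::Proceed,
        Declared::Known(TxnTier::NoTxn) => Gate::CasUnsupported,
        Declared::Unknown => Gate::DeferToOpTime,
    }
}
```

A stricter variant could map `Declared::Unknown` to `Gate::CasUnsupported` as well; which of the two the project wants is a design decision this sketch does not settle.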
#[tokio::test]
async fn create_if_absent_first_write_succeeds() {
    let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
    let scope = ResourceScope::system();

    let outcome = cas_update(
        fs.as_ref(),
        &scope,
        &counter_path(),
        decode_counter,
        encode_counter,
        increment,
    )
    .await
    .unwrap();

    assert_eq!(outcome, 1, "first write returns the new value");

    // The record now exists at version 1 with the expected body.
    let stored = fs
        .get(&scope, &counter_path())
        .await
        .unwrap()
        .expect("counter persisted");
    let counter = decode_counter(&stored.entry.body).unwrap();
    assert_eq!(counter.value, 1);
}

#[tokio::test]
async fn no_op_apply_skips_write() {
    let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
    let scope = ResourceScope::system();

    // Seed a value of 5.
    cas_update(
        fs.as_ref(),
        &scope,
        &counter_path(),
        decode_counter,
        encode_counter,
        |current: Option<Counter>| async move {
            let _ = current;
            Ok::<_, TestError>(CasApply::new(Counter { value: 5 }, ()))
        },
    )
    .await
    .unwrap();

    let version_before = fs
        .get(&scope, &counter_path())
        .await
        .unwrap()
        .unwrap()
        .version;

    // An apply that returns the unchanged snapshot must not bump the version.
    cas_update(
        fs.as_ref(),
        &scope,
        &counter_path(),
        decode_counter,
        encode_counter,
        |current: Option<Counter>| async move {
            let snapshot = current.unwrap();
            Ok::<_, TestError>(CasApply::new(snapshot, ()))
        },
    )
    .await
    .unwrap();

    let version_after = fs
        .get(&scope, &counter_path())
        .await
        .unwrap()
        .unwrap()
        .version;
    assert_eq!(
        version_before, version_after,
        "no-op apply must not issue a write"
    );
}
|
|
||
| #[tokio::test] | ||
| async fn high_contention_storm_has_no_lost_updates() { | ||
| const WRITERS: u64 = 50; | ||
|
|
||
| let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new()))); | ||
| let scope = ResourceScope::system(); | ||
|
|
||
| let mut handles = Vec::new(); | ||
| for _ in 0..WRITERS { | ||
| let fs = fs.clone(); | ||
| let scope = scope.clone(); | ||
| handles.push(tokio::spawn(async move { | ||
| cas_update( | ||
| fs.as_ref(), | ||
| &scope, | ||
| &counter_path(), | ||
| decode_counter, | ||
| encode_counter, | ||
| increment, | ||
| ) | ||
| .await | ||
| })); | ||
| } | ||
|
|
||
| let mut observed = Vec::new(); | ||
| for handle in handles { | ||
| observed.push(handle.await.unwrap().expect("writer succeeded")); | ||
| } | ||
|
|
||
| // Final value must equal the number of writers — every increment landed. | ||
| let final_counter = decode_counter( | ||
| &fs.get(&scope, &counter_path()) | ||
| .await | ||
| .unwrap() | ||
| .unwrap() | ||
| .entry | ||
| .body, | ||
| ) | ||
| .unwrap(); | ||
| assert_eq!( | ||
| final_counter.value, WRITERS, | ||
| "every concurrent increment must be observed (no lost update)" | ||
| ); | ||
|
|
||
| // Each writer observed a distinct increment value in 1..=WRITERS. | ||
| observed.sort_unstable(); | ||
| let expected: Vec<u64> = (1..=WRITERS).collect(); | ||
| assert_eq!( | ||
| observed, expected, | ||
| "each writer's returned outcome must be a unique increment" | ||
| ); | ||
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn persistent_version_mismatch_exhausts_retries() { | ||
| let backend = Arc::new(AlwaysMismatchBackend::new()); | ||
| let fs = Arc::new(scoped(backend.clone())); | ||
| let scope = ResourceScope::system(); | ||
|
|
||
| // `get` always returns a synthetic existing record, so every attempt takes | ||
| // the put path and races into a VersionMismatch — no seed write needed. | ||
| let result = cas_update( | ||
| fs.as_ref(), | ||
| &scope, | ||
| &counter_path(), | ||
| decode_counter, | ||
| encode_counter, | ||
| increment, | ||
| ) | ||
| .await; | ||
|
|
||
| assert!( | ||
| matches!(result, Err(CasUpdateError::RetriesExhausted)), | ||
| "persistent VersionMismatch must terminate with RetriesExhausted, got {result:?}" | ||
| ); | ||
| assert_eq!( | ||
| backend.put_attempts.load(Ordering::SeqCst), | ||
| super::FILESYSTEM_CAS_RETRIES, | ||
| "the loop must attempt exactly the retry cap before giving up" | ||
| ); | ||
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn non_cas_backend_is_rejected_not_overwritten() { | ||
| let backend = Arc::new(NonCasBackend::new()); | ||
| let fs = Arc::new(scoped(backend.clone())); | ||
| let scope = ResourceScope::system(); | ||
|
|
||
| let result = cas_update( | ||
| fs.as_ref(), | ||
| &scope, | ||
| &counter_path(), | ||
| decode_counter, | ||
| encode_counter, | ||
| increment, | ||
| ) | ||
| .await; | ||
|
|
||
| assert!( | ||
| matches!(result, Err(CasUpdateError::CasUnsupported)), | ||
| "a non-CAS backend must be rejected by the capability gate, got {result:?}" | ||
| ); | ||
|
|
||
| // Critically: nothing was written. The pre-flight gate refused before the | ||
| // blind-overwrite `put` could run. | ||
| let stored = fs.get(&scope, &counter_path()).await.unwrap(); | ||
| assert!( | ||
| stored.is_none(), | ||
| "the capability gate must reject before any write (no blind overwrite)" | ||
| ); | ||
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn apply_error_is_carried_through_unwrapped() { | ||
| let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new()))); | ||
| let scope = ResourceScope::system(); | ||
|
|
||
| let result: Result<u64, CasUpdateError<TestError>> = cas_update( | ||
| fs.as_ref(), | ||
| &scope, | ||
| &counter_path(), | ||
| decode_counter, | ||
| encode_counter, | ||
| |_current: Option<Counter>| async move { | ||
| Err::<CasApply<Counter, u64>, _>(TestError("boom".to_string())) | ||
| }, | ||
| ) | ||
| .await; | ||
|
|
||
| match result { | ||
| Err(CasUpdateError::Apply(TestError(reason))) => assert_eq!(reason, "boom"), | ||
| other => panic!("expected Apply error carried through, got {other:?}"), | ||
| } | ||
| } |
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Add a regression test for the timeout branch.
CasUpdateError::Timeout is part of the new public contract and is the load-bearing guard for the wedged-backend failure mode, but this suite never stalls get, put, or apply long enough to prove the outer timeout wins. A tiny hanging backend plus paused Tokio time would cover the branch deterministically. As per coding guidelines, "Every bug fix must include a regression test" and Rust tests should cover changed critical paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ironclaw_filesystem/src/cas/tests.rs` around lines 184 - 399, Add a
regression test in cas/tests.rs for the timeout path in cas_update, since the
current tests cover retries, CAS rejection, and apply errors but never exercise
CasUpdateError::Timeout. Introduce a tiny wedged backend or stub that blocks one
of the CAS operations (get, put, or apply), and use paused Tokio time so the
outer timeout in cas_update deterministically fires. Reference cas_update and
CasUpdateError::Timeout in the new test, and assert that the timeout branch wins
over the hung backend behavior.
Source: Coding guidelines
```rust
CasUpdateError::CasUnsupported => {
    E::storage("snapshot backend does not support versioned compare-and-swap".to_string())
}
CasUpdateError::Backend(inner) => E::storage_from(inner),
```
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Sanitize backend errors before returning StorageError.
Line 212 forwards FilesystemError::to_string() into the store error; those displays can include backend virtual paths and raw backend reasons. Return a stable sanitized message here, and keep the detailed error only in internal diagnostics/source context. As per path instructions, do not expose backend paths or raw backend errors across public surfaces.
Suggested direction
```diff
- CasUpdateError::Backend(inner) => E::storage_from(inner),
+ CasUpdateError::Backend(_inner) => {
+     E::storage("snapshot filesystem backend error".to_string())
+ }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ironclaw_resources/src/cas_snapshot.rs` at line 212, The
CasUpdateError::Backend handling in cas_snapshot.rs currently forwards the
backend error string directly via E::storage_from(inner), which can leak virtual
paths and raw backend details into StorageError. Update this mapping to return a
stable sanitized public message from the error conversion path, and preserve the
detailed backend error only in internal diagnostics or source context associated
with E::storage_from.
Source: Path instructions
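One way to keep a stable public message while retaining the detail for diagnostics is to carry the backend error as the `source()` of the public error. The sketch below assumes hypothetical types (`BackendFailure`, `PublicStorageError`, `from_backend`); it does not reflect the actual `StorageError` or `E::storage_from` signatures in `ironclaw_resources`.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical stand-in for a backend error whose Display leaks a virtual path.
#[derive(Debug)]
pub struct BackendFailure {
    pub path: String,
    pub reason: String,
}

impl fmt::Display for BackendFailure {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.path, self.reason)
    }
}

impl Error for BackendFailure {}

/// Hypothetical public error: Display is a fixed message, detail lives in `source`.
#[derive(Debug)]
pub struct PublicStorageError {
    message: &'static str,
    source: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for PublicStorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only the stable message crosses the public surface.
        f.write_str(self.message)
    }
}

impl Error for PublicStorageError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        // Internal diagnostics can still walk the chain to the backend error.
        self.source.as_deref()
    }
}

/// Maps a backend failure to the sanitized public error.
pub fn from_backend(inner: BackendFailure) -> PublicStorageError {
    PublicStorageError {
        message: "snapshot filesystem backend error",
        source: Some(Box::new(inner)),
    }
}
```

Whether `source()` itself counts as a public surface depends on how callers render errors; if they print the full chain, the detail would need to go to a log instead.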
```rust
CasUpdateError::CasUnsupported => RunStateError::Backend(
    "backend does not support versioned compare-and-swap".to_string(),
),
CasUpdateError::Backend(fs_err) => RunStateError::Filesystem(fs_err.to_string()),
```
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Preserve the typed filesystem error.
fs_err.to_string() drops the structured FilesystemError variant and path/operation context that existing ? conversions preserve elsewhere in this file. Prefer the existing typed conversion:
Proposed fix
```diff
- CasUpdateError::Backend(fs_err) => RunStateError::Filesystem(fs_err.to_string()),
+ CasUpdateError::Backend(fs_err) => fs_err.into(),
```
As per path instructions: errors should propagate with context into typed errors.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ironclaw_run_state/src/lib.rs` at line 1109, The
`CasUpdateError::Backend` branch in `RunStateError` currently converts the
filesystem failure to a string, which drops the structured `FilesystemError`
context. Update this match arm in `lib.rs` to preserve the typed error by using
the existing `?`-style conversion path or direct typed `From`/`Into` conversion
used elsewhere in this file, so the `FilesystemError` variant and its
path/operation details continue to propagate through
`RunStateError::Filesystem`.
Sources: Coding guidelines, Path instructions
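For readers unfamiliar with the pattern, the typed conversion the finding refers to looks roughly like this. All types here (`FsError`, `StoreError`, `UpdateError`, `map_update_error`) are hypothetical stand-ins; the real `FilesystemError` and `RunStateError` definitions, and whether a `From` impl already exists between them, are assumptions taken from the review text.

```rust
/// Hypothetical structured filesystem error with path context.
#[derive(Debug, PartialEq)]
pub enum FsError {
    NotFound { path: String },
    VersionMismatch { path: String },
}

/// Hypothetical store-level error that wraps the typed filesystem error.
#[derive(Debug, PartialEq)]
pub enum StoreError {
    Filesystem(FsError),
    Backend(String),
}

impl From<FsError> for StoreError {
    fn from(err: FsError) -> Self {
        // The variant and its fields survive; nothing is flattened to a string.
        StoreError::Filesystem(err)
    }
}

/// Hypothetical helper-level error, mirroring two arms of the mapper.
pub enum UpdateError {
    Backend(FsError),
    Unsupported,
}

pub fn map_update_error(err: UpdateError) -> StoreError {
    match err {
        // Same conversion that `?` would apply elsewhere in the file.
        UpdateError::Backend(fs_err) => fs_err.into(),
        UpdateError::Unsupported => StoreError::Backend(
            "backend does not support versioned compare-and-swap".to_string(),
        ),
    }
}
```

Callers can then match on `StoreError::Filesystem(FsError::VersionMismatch { .. })` instead of parsing a message string.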
```rust
let already_marked = lease.status == SecretLeaseStatus::Expired;
if already_marked {
    // No-op: return unchanged record so helper skips
    // the write (PartialEq equality path).
    Err(SecretStoreError::LeaseExpired { lease_id })
```
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix the CAS no-op comment.
Line 481 says this returns the unchanged record for the PartialEq no-op path, but the code returns Err(LeaseExpired) and exits through CasUpdateError::Apply. Update the comment or return CasApply::new(lease, Err(...)) to match the documented behavior. As per coding guidelines, comments that promise guarantees across layers must be enforced by code/tests or softened to describe intent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ironclaw_secrets/src/filesystem_store.rs` around lines 479 - 483, The
no-op CAS comment in filesystem_store.rs is inaccurate: the `already_marked`
branch in the lease update path returns `Err(SecretStoreError::LeaseExpired {
lease_id })` and goes through `CasUpdateError::Apply`, not the unchanged
record/`PartialEq` skip path. Update the comment to describe the actual intent,
or change the branch in the relevant lease/CAS helper to return
`CasApply::new(lease, Err(...))` if that is the desired behavior; use the
`already_marked`, `SecretLeaseStatus::Expired`, and `CasUpdateError::Apply`
symbols to locate and align the code with the documented contract.
Source: Coding guidelines
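The two exits the finding contrasts can be shown in isolation. This is a simplified, synchronous sketch with hypothetical names (`Lease`, `Apply`, `Path`, `run_once`); it models only the decision after one `apply` call and assumes the helper skips the write when the returned snapshot equals the one it read.

```rust
/// Hypothetical record; only equality matters for the skip decision.
#[derive(Debug, Clone, PartialEq)]
pub struct Lease {
    pub expired: bool,
}

/// Hypothetical analogue of `CasApply`: new snapshot plus caller outcome.
pub struct Apply<S, O> {
    pub snapshot: S,
    pub outcome: O,
}

/// Which exit a single attempt took.
#[derive(Debug, PartialEq)]
pub enum Path<O, E> {
    Wrote(O),
    SkippedWrite(O),
    ApplyError(E),
}

/// One attempt: an `Err` from `apply` leaves before any equality check happens.
pub fn run_once<S, O, E>(current: S, apply: impl FnOnce(S) -> Result<Apply<S, O>, E>) -> Path<O, E>
where
    S: Clone + PartialEq,
{
    match apply(current.clone()) {
        Err(e) => Path::ApplyError(e),
        Ok(applied) if applied.snapshot == current => Path::SkippedWrite(applied.outcome),
        Ok(applied) => Path::Wrote(applied.outcome),
    }
}
```

Returning `Err(...)` takes the `ApplyError` exit, while returning the unchanged snapshot with an `Err` outcome takes the `SkippedWrite` exit. Neither issues a write, but only the second matches what the comment describes.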
```rust
fn map_cas_error<T>(error: CasUpdateError<BridgeError<T>>) -> TurnError {
    match error {
        CasUpdateError::Apply(BridgeError::Real(inner)) => inner,
        CasUpdateError::Apply(BridgeError::NoOp(_)) => {
            // Should be unreachable: the caller extracts NoOp before calling
            // map_cas_error. Defensive fallback.
            unreachable!("NoOp bridge error must be handled by the apply caller")
        }
```
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Don’t panic in the CAS error mapper.
This is production error-handling code; if the NoOp sentinel ever leaks past the caller, unreachable! turns a recoverable mapping bug into a process panic. Return a typed unavailable error instead.
Suggested fix
```diff
  CasUpdateError::Apply(BridgeError::NoOp(_)) => {
-     // Should be unreachable: the caller extracts NoOp before calling
-     // map_cas_error. Defensive fallback.
-     unreachable!("NoOp bridge error must be handled by the apply caller")
+     tracing::debug!("turn state CAS no-op sentinel reached error mapper");
+     TurnError::Unavailable {
+         reason: "turn state persistence temporarily unavailable".to_string(),
+     }
  }
```
As per coding guidelines, production Rust should avoid panic-style failure paths and map errors with context.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ironclaw_turns/src/filesystem_store.rs` around lines 378 - 385, The
CAS error mapper in map_cas_error should not panic on BridgeError::NoOp, since
this is production error-handling code. Replace the unreachable! fallback in
map_cas_error with a typed unavailable TurnError that preserves context about
the unexpected NoOp leak, while keeping the BridgeError::Real(inner) path
unchanged and consistent with CasUpdateError handling.
Sources: Coding guidelines, Path instructions
```markdown
## Quality gate

Per touched crate: `cargo fmt --all`, `cargo clippy --all-targets` (zero
warnings), `cargo test` (+ `--features integration` where stores have it).
```
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Align the plan's validation steps with the repo's required backend checks.
This plan documents a legacy persistence migration, but its quality gate drops the required feature-isolation cargo check matrix and weakens clippy to cargo clippy --all-targets. That makes the migration plan a stale source of truth for the postgres/libSQL split this PR is changing. As per coding guidelines, legacy persistence changes must "Test feature isolation" with default, --no-default-features --features libsql, and --all-features, and "Before You Open a PR" requires cargo clippy --all --benches --tests --examples --all-features -- -D warnings. Based on learnings, verify feature isolation with the required cargo check combinations for postgres, libSQL-only, and all-features builds.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/plans/2026-06-25-cas-migration.md` around lines 137 - 140, The quality
gate in the migration plan is outdated and should match the repo’s required
backend validation for persistence changes. Update the “Quality gate” section to
include the feature-isolation `cargo check` matrix for the default/postgres
build, `--no-default-features --features libsql`, and `--all-features`, and
strengthen the clippy step to the full `cargo clippy --all --benches --tests
--examples --all-features -- -D warnings`. Keep the existing formatting/tests
coverage, but ensure the plan explicitly reflects the required checks for the
legacy persistence migration.
Sources: Coding guidelines, Learnings
🚅 Deployed to the ironclaw-pr-5234 environment in ironclaw-ci-preview
Fold in PR review (henrypark133) + a durable-write hot-path audit.

Adds §12 framing the deployed lease-expiry cascade as three independent layers:
- Layer A: in-process lock convoy → #5234 (open); write-behind composes on the post-#5234 CAS path.
- Layer B: pool starvation (DEFAULT_POSTGRES_POOL_MAX_SIZE=2 shared across all Postgres FS I/O). Notes the existing 30s checkout guard (closes the infinite hang) but flags pool-too-small + checkout(30s)>apply(15s); cheap mitigations (raise pool, reserve a critical connection, align checkout<apply) prior to and complementary with write-behind.
- Layer C: the synchronous hot-path write map (events / governor / thread-append / lease / memory) with the write-behind-vs-batch-coalesce distinction. Events are the top batch-coalesce target (highest churn, O(1) INSERT no CAS); must stay DURABLE (source of truth), per-step flush to preserve live SSE. Memory stays synchronous (FTS-only, no embedding write; agent-initiated, low churn).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem

Each Reborn persistence store wrapped its filesystem read-modify-write in a per-record `tokio::sync::Mutex` (`FILESYSTEM_RECORD_LOCKS`) held across `.await`: a redundant in-process serializer over backends that already do versioned CAS. Under burst, one writer stalled inside its critical section blocks every other writer for that scope (the convoy that contributed to the runtime wedge). PR #5142 removed exactly this from `ironclaw_turns`; this PR finishes the job for the remaining stores via one shared helper.

Change

Shared helper (`ironclaw_filesystem::cas_update`): one mutex-free, bounded read-modify-write loop (read versioned snapshot → idempotent `apply` closure → CAS put → on `VersionMismatch` re-read and retry; 32×, jittered 2–50ms backoff, 15s timeout), with a fail-closed capability gate. Generic over record/outcome/error; leaks no store types.

Migrated all 5 stores onto it and deleted every per-record mutex:
- `ironclaw_turns`: re-homed its local fix(turns): prevent turn-state write convoy #5142 copy onto the shared helper (one owner).
- `ironclaw_run_state`, `ironclaw_threads`: dropped the mutex; route through the helper.
- `ironclaw_resources`: gains a retry loop it never had (the lock was its only serializer).
- `ironclaw_secrets`: deletes the `Arc`-keyed lock map → fixes an unbounded memory leak (it stored `Arc`, never pruned).

Post-condition: `grep` for `FILESYSTEM_RECORD_LOCKS` / `record_lock.lock().await` / `filesystem_secret_lock` across the 5 store crates is empty.

Guardrail: `.claude/rules/database.md` records the invariant that filesystem read-modify-write must go through `cas_update`; never wrap it in a per-record mutex held across `.await`.

Tri-backend parity (no diverging experience by host)

The helper is `RootFilesystem`-level (backend-agnostic). Verified across all three `ironclaw_filesystem` backends:
- `--features postgres`: CAS contract tests pass (`postgres_native_put_cas_absent_rejects_existing_path`, `postgres_native_put_cas_any_increments_existing_version`, `postgres_transaction_rollback_discards_prior_put_after_later_cas_conflict`).
- `--features libsql`: CAS contract tests pass. (One unrelated append/dir contract test is flaky under parallel suite execution and passes 3/3 in isolation; `db.rs` is untouched by this PR, so it is pre-existing test-infra flakiness, not a regression.)

Notes

🤖 Generated with Claude Code
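
The read, apply, CAS-put, retry loop described under Change can be illustrated with a small std-only sketch. It is not the `cas_update` implementation: `VersionedCell`, `cas_update_sketch`, `run_writers`, and `UpdateError` are hypothetical, the code is synchronous rather than async, and jitter, the 15s timeout, and the capability gate are omitted. The attempt cap and the 2ms to 50ms backoff bounds are taken from the description above. The cell's internal lock is held only for the duration of a single `get` or `put`, never across the whole read-modify-write.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical in-memory stand-in for a backend with versioned CAS.
#[derive(Default)]
pub struct VersionedCell {
    inner: Mutex<(u64, u64)>, // (version, value)
}

impl VersionedCell {
    /// Returns the current (version, value) snapshot.
    pub fn get(&self) -> (u64, u64) {
        *self.inner.lock().unwrap()
    }

    /// Writes only if the stored version still equals `expected`.
    pub fn put_if_version(&self, expected: u64, value: u64) -> bool {
        let mut guard = self.inner.lock().unwrap();
        if guard.0 != expected {
            return false; // version mismatch: another writer got there first
        }
        *guard = (expected + 1, value);
        true
    }
}

#[derive(Debug, PartialEq)]
pub enum UpdateError {
    RetriesExhausted,
}

const MAX_ATTEMPTS: u32 = 32;

/// Bounded read-modify-write: `apply` may run several times, so it must be idempotent.
pub fn cas_update_sketch(cell: &VersionedCell, apply: impl Fn(u64) -> u64) -> Result<u64, UpdateError> {
    let mut backoff = Duration::from_millis(2);
    for _ in 0..MAX_ATTEMPTS {
        let (version, current) = cell.get();
        let next = apply(current);
        if next == current {
            return Ok(current); // unchanged snapshot: skip the write
        }
        if cell.put_if_version(version, next) {
            return Ok(next);
        }
        // Lost the race: back off, then re-read a fresh snapshot.
        thread::sleep(backoff);
        backoff = (backoff * 2).min(Duration::from_millis(50));
    }
    Err(UpdateError::RetriesExhausted)
}

/// Spawns `writers` threads that each increment once; returns the final value.
pub fn run_writers(writers: u64) -> u64 {
    let cell = Arc::new(VersionedCell::default());
    let handles: Vec<_> = (0..writers)
        .map(|_| {
            let cell = Arc::clone(&cell);
            thread::spawn(move || cas_update_sketch(&cell, |v| v + 1))
        })
        .collect();
    for handle in handles {
        handle
            .join()
            .expect("writer thread panicked")
            .expect("writer exhausted retries");
    }
    cell.get().1
}
```

In this sketch every failed put means some other writer succeeded, so with one increment per writer a given writer fails at most `writers - 1` times; `run_writers(8)` therefore stays well under the attempt cap and every increment lands.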