Skip to content

fix(reborn): remove per-record lock convoys via shared cas_update helper#5234

Open
henrypark133 wants to merge 7 commits into
mainfrom
fix/reborn-p1-cas-helper
Open

fix(reborn): remove per-record lock convoys via shared cas_update helper#5234
henrypark133 wants to merge 7 commits into
mainfrom
fix/reborn-p1-cas-helper

Conversation

@henrypark133

Copy link
Copy Markdown
Collaborator

Problem

Each Reborn persistence store wrapped its filesystem read-modify-write in a per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) held across .await — a redundant in-process serializer over backends that already do versioned CAS. Under burst, one writer stalled inside its critical section blocks every other writer for that scope (the convoy that contributed to the runtime wedge). PR #5142 removed exactly this from ironclaw_turns; this PR finishes the job for the remaining stores via one shared helper.

Change

Shared helper (ironclaw_filesystem::cas_update): one mutex-free, bounded read-modify-write loop — read versioned snapshot → idempotent apply closure → CAS put → on VersionMismatch re-read and retry (32×, jittered 2–50ms backoff, 15s timeout), with a fail-closed capability gate. Generic over record/outcome/error; leaks no store types.

Migrated all 5 stores onto it and deleted every per-record mutex:

  • ironclaw_turns — re-homed its local fix(turns): prevent turn-state write convoy #5142 copy onto the shared helper (one owner).
  • ironclaw_run_state, ironclaw_threads — dropped the mutex; route through the helper.
  • ironclaw_resourcesgains a retry loop it never had (the lock was its only serializer).
  • ironclaw_secrets — deletes the Arc-keyed lock map → fixes an unbounded memory leak (it stored Arc, never pruned).

Post-condition: grep FILESYSTEM_RECORD_LOCKS / record_lock.lock().await / filesystem_secret_lock across the 5 store crates is empty.

Guardrail: .claude/rules/database.md records the invariant — filesystem read-modify-write must go through cas_update; never wrap it in a per-record mutex held across .await.

Tri-backend parity (no diverging experience by host)

The helper is RootFilesystem-level (backend-agnostic). Verified across all three ironclaw_filesystem backends:

  • in-memory (default): clippy 0 across the 6 crates; all store tests pass (turns/run_state/threads/resources/secrets).
  • postgres (--features postgres): CAS contract tests pass — postgres_native_put_cas_absent_rejects_existing_path, postgres_native_put_cas_any_increments_existing_version, postgres_transaction_rollback_discards_prior_put_after_later_cas_conflict.
  • libsql (--features libsql): CAS contract tests pass. (One unrelated append/dir contract test is flaky under parallel suite execution — passes 3/3 in isolation; db.rs is untouched by this PR, so it is pre-existing test-infra flakiness, not a regression.)

Notes

🤖 Generated with Claude Code

henrypark133 and others added 7 commits June 24, 2026 19:01
Extract the proven mutex-free CAS pattern from ironclaw_turns (PR #5142)
into one shared helper in ironclaw_filesystem so the remaining stores can
drop their per-record tokio::sync::Mutex convoys.

cas_update runs a bounded read-modify-write loop: read versioned snapshot,
run the caller's idempotent apply closure, CAS-put with the read version,
and on VersionMismatch re-read and retry with jittered exponential backoff
(2ms..50ms), capped at 32 retries and wrapped in a 15s timeout. A
fail-closed capability gate rejects backends that cannot CAS rather than
silently blind-overwriting. The helper is generic over the record type,
the outcome, and the caller's error; it never leaks store-specific types.

No store migrated yet — that follows in the next commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te helper

ironclaw_resources had NO CAS-retry loop — its per-record tokio::sync::Mutex
(FILESYSTEM_RECORD_LOCKS) held across the backend get/put awaits was the ONLY
serializer. Under burst, one writer stalled inside its critical section parked
every other same-scope writer (the convoy that contributed to the 2026-06-24
runtime wedge). Naively deleting that mutex without a retry loop loses updates:
a racing writer's single-attempt CAS detects the version mismatch and errors
out (cross-process CAS contention) instead of retrying.

Route update_snapshot through ironclaw_filesystem::cas_update (bounded CAS
retry + jittered backoff + 15s timeout + fail-closed capability gate) and
delete the per-record mutex, FilesystemRecordLock, the local PutError, and the
fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The
helper fails closed instead; verified safe — these store aliases only ever
resolve to CAS-capable db/in-memory backends in production.

The public update API widens from FnOnce to FnMut because the helper re-runs
the closure against a freshly read snapshot on every CAS retry; leaf closures
that previously moved captured values now clone per invocation. Add PartialEq
to BudgetGateSnapshot (helper needs S: Clone + PartialEq).

Red->green regression: cas_snapshot::tests::concurrent_increments_have_no_lost_updates
proves RED on lock-removed-no-retry (lost updates / spurious contention) and
GREEN through the helper (every concurrent increment lands, no convoy).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ant per-record mutex

run_state held BOTH a store-local CAS-retry loop AND a per-record
tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) across the backend get/put awaits
— belt-and-suspenders where the mutex was a redundant in-process serializer
over a backend that already does versioned CAS. Under burst the mutex convoyed
same-scope writers behind one stalled writer (the 2026-06-24 wedge pattern).

Route apply_update, update_status, start, and save_pending through
ironclaw_filesystem::cas_update (one shared CAS implementation: bounded retry +
jittered backoff + 15s timeout + fail-closed capability gate). Delete the
store-local FILESYSTEM_CAS_RETRIES loop, the per-record lock + accessor + its
two unit tests, the local PutError, and the fail-open put_with_cas
(CasExpectation::Any blind-overwrite fallback). The scope-ownership check and
the approval Pending guard move inside the re-runnable apply closure; the
ApprovalStatus guard returns Err from apply. discard_pending keeps its
read-then-delete, only the lock is dropped.

The run_state contract tests previously ran against LocalFilesystem (byte-only,
no versioned CAS) where the OLD fail-open fallback masked the missing CAS.
Production never routes run_state to LocalFilesystem — only to CAS-capable
db/in-memory backends — so the tests now use InMemoryBackend, matching the
production capability shape (CAS) instead of a non-production blind-overwrite path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p per-record mutex

ensure_thread held the only per-record tokio::sync::Mutex
(FILESYSTEM_RECORD_LOCKS) in this crate, across the backend get/put awaits, to
serialize the check-then-create against Unsupported(WriteFile)-fallback
backends. Route it through ironclaw_filesystem::cas_update: the apply closure
returns the existing record unchanged (no-op, no write) when the thread already
exists with a matching scope, or builds the fresh StoredThreadRecord when
absent. A concurrent create-if-absent loser hits VersionMismatch, the helper
re-reads and re-runs apply which now sees the winner and reconciles scope —
exactly the old single-reconcile semantics, now via the shared CAS-retry loop
without any lock. Delete the lock infra (Weak map + accessor); add PartialEq to
StoredThreadRecord (helper needs S: Clone + PartialEq). The three other
record/txn loops were already lock-free and are left as-is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord mutex (Arc leak fix)

The secrets filesystem store kept FILESYSTEM_RECORD_LOCKS as a HashMap of
*strong* Arc<Mutex<()>> (not Weak), one entry per (scope,lease)/session path
ever seen, NEVER pruned — an unbounded memory leak. The per-record mutex was
also held across the backend get/put awaits, convoying same-key writers behind
one stalled writer (the 2026-06-24 wedge pattern).

Route consume/revoke/consume_session_use through ironclaw_filesystem::cas_update
(one shared CAS implementation), mapping the local CasDecision onto CasApply:
Commit/BestEffortCommit -> changed snapshot; Settle(Ok) -> unchanged snapshot
(helper skips the write via PartialEq no-op); Settle(Err)/not-found -> apply
error. Crypto decrypt + use-count increment run inside the re-runnable apply
closure (pure / recomputed from the freshly read record each retry).
validate_session is a pure read — its lock is simply deleted. Delete the entire
lock map (leak gone), cas_mutate/CasDecision/CAS_RETRY_ATTEMPTS, and the
fail-open put_with_version_fallback (helper fails closed). Add PartialEq to
StoredLease and StoredSession.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR #5142 removed the per-record mutex from ironclaw_turns but left a local copy
of the CAS read-modify-write loop (apply_with_retry, put_with_cas, PutError,
cas_retry_backoff + constants). Re-home it onto ironclaw_filesystem::cas_update
so there is ONE CAS implementation across the codebase. Behavior is preserved:
the 18 call sites and their closure shape (FnMut(InMemoryTurnStateStore) -> Fut)
are unchanged; the exact timeout/exhaustion error strings are preserved.

A BridgeError<T> sentinel carries the absent-record + default-snapshot no-op
through the helper's apply-error channel (the helper's own no-op check only
fires for Some(existing)==new; the turns store additionally must skip creating a
file for an empty default store, which the old new==old check covered because
read returned default on absent). The 500ms SNAPSHOT_READ_CACHE_TTL read cache
is kept as a separate layer: cleared around the CAS call and repopulated on the
next read (the helper does its own fresh get and does not surface the new
RecordVersion; every read_snapshot caller already discards the version, so
caching None is safe). Delete the now-unused local loop, put_with_cas, PutError,
cas_retry_backoff, and the duplicate CAS constants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord mutex

Add the invariant to crates/ironclaw_filesystem/CLAUDE.md (#2 "CAS is the
floor") and a pointer in .claude/rules/database.md: every filesystem
read-modify-write must go through the one shared ironclaw_filesystem::cas_update
helper; never wrap it in a per-record tokio::sync::Mutex held across the backend
.await (redundant serializer + convoy/wedge risk + leak). Also commit Cargo.lock
(async-trait dev-dep added to ironclaw_resources for the convoy regression test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added scope: docs Documentation scope: dependencies Dependency updates size: XL 500+ changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Jun 25, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request migrates multiple persistence stores (ironclaw_resources, ironclaw_run_state, ironclaw_threads, ironclaw_secrets, and ironclaw_turns) to use a shared, lock-free compare-and-swap (CAS) helper (cas_update). This eliminates redundant per-record mutexes held across .await boundaries, preventing runtime-wedging convoys under high contention. Feedback on the changes highlights an opportunity to improve the jitter calculation in cas_retry_backoff using RandomState to avoid zero-jitter scenarios in low clock resolution environments (such as VMs or Windows). Additionally, it is recommended to explicitly add Send bounds to the closure generic constraints in cas_update and cas_update_loop to catch any non-Send regressions at the helper's definition site.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +311 to +325
async fn cas_retry_backoff(attempt: usize) {
let shift = attempt.min(8) as u32;
let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
.saturating_mul(multiplier)
.min(FILESYSTEM_CAS_BACKOFF_MAX);
let jitter = SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|elapsed| {
let jitter_ceiling = base_delay.as_millis().max(1);
Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64)
})
.unwrap_or_default();
tokio::time::sleep(base_delay.saturating_add(jitter)).await;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using SystemTime::now() for jitter calculation can lead to zero jitter on systems with low clock resolution (e.g., 1ms resolution in VMs, containers, or Windows). Since base_delay.as_millis() is at most 50, and 1,000,000 (1ms in nanoseconds) is divisible by all common backoff ceilings (2, 4, 8, 16, 32, 50), the modulo operation elapsed.as_nanos() % jitter_ceiling will always evaluate to exactly 0. This completely defeats the purpose of jitter under high contention.

Instead, use std::collections::hash_map::RandomState to obtain a high-quality, thread-safe, and un-correlated pseudo-random value seeded from the OS without adding external dependencies.

Suggested change
async fn cas_retry_backoff(attempt: usize) {
let shift = attempt.min(8) as u32;
let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
.saturating_mul(multiplier)
.min(FILESYSTEM_CAS_BACKOFF_MAX);
let jitter = SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|elapsed| {
let jitter_ceiling = base_delay.as_millis().max(1);
Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64)
})
.unwrap_or_default();
tokio::time::sleep(base_delay.saturating_add(jitter)).await;
}
async fn cas_retry_backoff(attempt: usize) {
let shift = attempt.min(8) as u32;
let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
.saturating_mul(multiplier)
.min(FILESYSTEM_CAS_BACKOFF_MAX);
let jitter = {
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};
let mut hasher = RandomState::new().build_hasher();
attempt.hash(&mut hasher);
let hash = hasher.finish();
let jitter_ceiling = base_delay.as_millis().max(1) as u64;
Duration::from_millis(hash % jitter_ceiling)
};
tokio::time::sleep(base_delay.saturating_add(jitter)).await;
}

Comment on lines +194 to +208
pub async fn cas_update<F, S, T, E, D, N, A, Fut>(
filesystem: &ScopedFilesystem<F>,
scope: &ResourceScope,
path: &ScopedPath,
decode: D,
encode: N,
mut apply: A,
) -> Result<T, CasUpdateError<E>>
where
F: RootFilesystem + ?Sized,
S: PartialEq + Clone,
D: Fn(&[u8]) -> Result<S, E>,
N: Fn(&S) -> Result<Entry, E>,
A: FnMut(Option<S>) -> Fut,
Fut: Future<Output = Result<CasApply<S, T>, E>>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.

pub async fn cas_update<F, S, T, E, D, N, A, Fut>(
    filesystem: &ScopedFilesystem<F>,
    scope: &ResourceScope,
    path: &ScopedPath,
    decode: D,
    encode: N,
    mut apply: A,
) -> Result<T, CasUpdateError<E>>
where
    F: RootFilesystem + ?Sized,
    S: PartialEq + Clone + Send,
    T: Send,
    E: Send,
    D: Fn(&[u8]) -> Result<S, E> + Send,
    N: Fn(&S) -> Result<Entry, E> + Send,
    A: FnMut(Option<S>) -> Fut + Send,
    Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,
References
  1. In Rust, explicitly add Send bounds to closure generic constraints in async helper functions that return futures intended to be Send. This ensures that non-Send regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by Send at the call site.

Comment on lines +226 to +240
async fn cas_update_loop<F, S, T, E, D, N, A, Fut>(
filesystem: &ScopedFilesystem<F>,
scope: &ResourceScope,
path: &ScopedPath,
decode: &D,
encode: &N,
apply: &mut A,
) -> Result<T, CasUpdateError<E>>
where
F: RootFilesystem + ?Sized,
S: PartialEq + Clone,
D: Fn(&[u8]) -> Result<S, E>,
N: Fn(&S) -> Result<Entry, E>,
A: FnMut(Option<S>) -> Fut,
Fut: Future<Output = Result<CasApply<S, T>, E>>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.

async fn cas_update_loop<F, S, T, E, D, N, A, Fut>(
    filesystem: &ScopedFilesystem<F>,
    scope: &ResourceScope,
    path: &ScopedPath,
    decode: &D,
    encode: &N,
    apply: &mut A,
) -> Result<T, CasUpdateError<E>>
where
    F: RootFilesystem + ?Sized,
    S: PartialEq + Clone + Send,
    T: Send,
    E: Send,
    D: Fn(&[u8]) -> Result<S, E> + Send,
    N: Fn(&S) -> Result<Entry, E> + Send,
    A: FnMut(Option<S>) -> Fut + Send,
    Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,
References
  1. In Rust, explicitly add Send bounds to closure generic constraints in async helper functions that return futures intended to be Send. This ensures that non-Send regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by Send at the call site.

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added a shared, lock-free update path for filesystem-backed data, improving consistency under contention.
    • Exposed filesystem capability information for smarter update handling.
  • Bug Fixes

    • Reduced lost updates and convoying during concurrent writes.
    • Improved handling of no-op updates, retries, timeouts, and unsupported backends.
  • Refactor

    • Migrated several stores to a common update flow, removing older per-record locking approaches and simplifying concurrent persistence.

Walkthrough

Introduced a shared filesystem cas_update helper with capability gating, bounded retries, backoff, and no-op detection, then migrated run-state, approval, resource, secret, thread, and turn stores to retryable CAS update paths with updated error mapping, tests, and migration notes.

Changes

Shared filesystem CAS migration

Layer / File(s) Summary
Contract and public API surface
.claude/rules/database.md, crates/ironclaw_filesystem/CLAUDE.md, docs/plans/2026-06-25-cas-migration.md, crates/ironclaw_filesystem/src/scoped.rs, crates/ironclaw_filesystem/src/lib.rs, crates/ironclaw_filesystem/src/cas.rs
Filesystem invariants and migration notes now require cas_update; ScopedFilesystem exposes capabilities; and the crate root exports the new CAS helper, types, and timing constants.
cas_update loop and tests
crates/ironclaw_filesystem/src/cas.rs, crates/ironclaw_filesystem/src/cas/tests.rs
The shared helper performs capability-gated CAS retries with timeout, no-op detection, backoff, and mapped error variants, with behavior covered by tests.
CasSnapshotStore migration
crates/ironclaw_resources/src/cas_snapshot.rs, crates/ironclaw_resources/Cargo.toml
CasSnapshotStore switches to shared CAS retries, requires re-runnable snapshot updates, and adds contention regression coverage.
Resource governor retries
crates/ironclaw_resources/src/filesystem_store.rs, crates/ironclaw_resources/src/lib.rs
ResourceGovernorStore and FilesystemBudgetGateStore accept re-runnable closures, clone captured values on retries, and add PartialEq to the budget snapshot.
Run-state record updates
crates/ironclaw_run_state/src/lib.rs
FilesystemRunStateStore moves record mutation and creation onto cas_update and removes the local lock map and legacy CAS write helpers.
Approval request updates and tests
crates/ironclaw_run_state/src/lib.rs, crates/ironclaw_run_state/tests/*
FilesystemApprovalRequestStore uses cas_update for status transitions and creation, and the run-state harness now uses InMemoryBackend in the affected tests.
Secret and session CAS updates
crates/ironclaw_secrets/src/filesystem_store.rs
SecretStore and CredentialSessionStore now use shared CAS retries, and stored records derive PartialEq for no-op detection.
Thread service CAS updates
crates/ironclaw_threads/src/filesystem_service.rs
FilesystemSessionThreadService::ensure_thread now uses cas_update, the stored thread record derives PartialEq, and per-path locking is removed.
Turn-state writes and cache handling
crates/ironclaw_turns/src/filesystem_store.rs
FilesystemTurnStateStore routes writes through cas_update, the snapshot cache is cleared and repopulated around retries, and the old manual retry helpers are removed.

Sequence Diagram(s)

sequenceDiagram
  participant FilesystemRunStateStore
  participant cas_update
  participant ScopedFilesystem

  FilesystemRunStateStore->>cas_update: apply_update
  cas_update->>ScopedFilesystem: capabilities()
  ScopedFilesystem-->>cas_update: BackendCapabilities
  cas_update->>ScopedFilesystem: get(path)
  cas_update->>ScopedFilesystem: put(path, CasExpectation::Version)
  ScopedFilesystem-->>cas_update: VersionMismatch
  cas_update->>cas_update: retry with backoff
  cas_update-->>FilesystemRunStateStore: Result
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • nearai/ironclaw#5142 — Also rewrites crates/ironclaw_turns/src/filesystem_store.rs toward bounded CAS retries and cache-clearing semantics.
  • nearai/ironclaw#5232 — Also touches crates/ironclaw_turns/src/filesystem_store.rs and related CAS update flows for runner-lease sidecar overlays.

Suggested reviewers

  • think-in-universe

Poem

A mutex slept beside the gate,
Then CAS arrived and would not wait.
Retries danced in bounded time,
Snapshots kept their steady rhyme,
And versioned writes rang clear and bright ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning It has useful content, but it misses most required template sections and headings, so the PR description is largely incomplete. Rewrite it with the repository template and add Summary, Change Type, Linked Issue, Validation, Security Impact, Blast Radius, Rollback Plan, Review Follow-Through, and Review track.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed Conventional-Commits style and accurately summarizes the shared cas_update migration across the filesystem stores.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch fix/reborn-p1-cas-helper

Comment @coderabbitai help to get the list of available commands.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

self.filesystem.delete(scope, &path).await?;

P1 Badge Guard discard deletes with CAS

When discard_pending races in the same process with approve or deny for the same request, this unconditional delete can now run after the resolver's CAS-protected status update and remove the terminal approval record. The removed per-record lock used to serialize that get/check/delete sequence with status updates; without either that lock or a CAS/tombstone transition here, a user approval can be lost and later look like UnknownApprovalRequest rather than Approved.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/ironclaw_run_state/src/lib.rs (1)

922-932: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Make discard_pending atomic before dropping the lock.

This path reads Pending, then deletes later with no CAS expectation. A concurrent approve()/deny() can win the CAS update between Line 922 and Line 932, then this delete removes a non-pending approval record. Use a CAS-aware delete/transactional transition, or model discard as a CAS status update instead of read-then-delete. As per path instructions: “Fail loud” and filesystem read-modify-write paths must use the CAS invariant rather than split multi-step state changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_run_state/src/lib.rs` around lines 922 - 932, The
discard_pending flow in RunState::discard_pending is currently split into a
read/validate step and a later filesystem delete, which can race with
approve()/deny() and remove a record after its status has already changed.
Update this path to be atomic by using a CAS-aware transition or transactional
delete that enforces the same status invariant at the point of mutation, rather
than relying on a prior Pending check; keep the failure path loud if the
expected state no longer matches.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas.rs`:
- Line 193: The new Clippy allow on cas.rs needs the required arch-exempt
rationale comment immediately above #[allow(clippy::too_many_arguments)]. Update
the same site near the cas:: function signature that triggered the lint by
adding the mandatory // arch-exempt: too_many_args, <reason naming the missing
aggregation>, plan `#NNNN` annotation, and make sure the reason justifies why this
many parameters is unavoidable until the missing aggregation is introduced.
- Around line 295-306: The CAS preflight is treating
BackendCapabilities::default() as “unknown,” which lets a backend with no
advertised capabilities bypass the fail-closed gate. Update capabilities_known()
in cas.rs to only treat an explicit unknown state as unknown and not overload
BackendCapabilities::empty()/default(), then make the cas_update path rely on
that check so unsupported backends still return CasUpdateError::CasUnsupported
instead of falling back to op-time behavior.
- Around line 262-265: The no-op shortcut in cas_update_loop currently only
returns early when current is Some(existing) and matches the snapshot, while
absent records still fall through to encoding and writing; update the logic to
either treat None plus an unchanged/default snapshot as a true no-op or adjust
the CasApply/cas_update contract and surrounding comments to state the shortcut
only applies to existing records. Use the existing cas_update_loop and CasApply
symbols to keep the behavior and docs aligned, and remove any need for
downstream NoOp suppression if you choose to enforce it here.

In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 184-399: Add a regression test in cas/tests.rs for the timeout
path in cas_update, since the current tests cover retries, CAS rejection, and
apply errors but never exercise CasUpdateError::Timeout. Introduce a tiny wedged
backend or stub that blocks one of the CAS operations (get, put, or apply), and
use paused Tokio time so the outer timeout in cas_update deterministically
fires. Reference cas_update and CasUpdateError::Timeout in the new test, and
assert that the timeout branch wins over the hung backend behavior.

In `@crates/ironclaw_resources/src/cas_snapshot.rs`:
- Line 212: The CasUpdateError::Backend handling in cas_snapshot.rs currently
forwards the backend error string directly via E::storage_from(inner), which can
leak virtual paths and raw backend details into StorageError. Update this
mapping to return a stable sanitized public message from the error conversion
path, and preserve the detailed backend error only in internal diagnostics or
source context associated with E::storage_from.

In `@crates/ironclaw_run_state/src/lib.rs`:
- Line 1109: The `CasUpdateError::Backend` branch in `RunStateError` currently
converts the filesystem failure to a string, which drops the structured
`FilesystemError` context. Update this match arm in `lib.rs` to preserve the
typed error by using the existing `?`-style conversion path or direct typed
`From`/`Into` conversion used elsewhere in this file, so the `FilesystemError`
variant and its path/operation details continue to propagate through
`RunStateError::Filesystem`.

In `@crates/ironclaw_secrets/src/filesystem_store.rs`:
- Around line 479-483: The no-op CAS comment in filesystem_store.rs is
inaccurate: the `already_marked` branch in the lease update path returns
`Err(SecretStoreError::LeaseExpired { lease_id })` and goes through
`CasUpdateError::Apply`, not the unchanged record/`PartialEq` skip path. Update
the comment to describe the actual intent, or change the branch in the relevant
lease/CAS helper to return `CasApply::new(lease, Err(...))` if that is the
desired behavior; use the `already_marked`, `SecretLeaseStatus::Expired`, and
`CasUpdateError::Apply` symbols to locate and align the code with the documented
contract.

In `@crates/ironclaw_turns/src/filesystem_store.rs`:
- Around line 378-385: The CAS error mapper in map_cas_error should not panic on
BridgeError::NoOp, since this is production error-handling code. Replace the
unreachable! fallback in map_cas_error with a typed unavailable TurnError that
preserves context about the unexpected NoOp leak, while keeping the
BridgeError::Real(inner) path unchanged and consistent with CasUpdateError
handling.

In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 137-140: The quality gate in the migration plan is outdated and
should match the repo’s required backend validation for persistence changes.
Update the “Quality gate” section to include the feature-isolation `cargo check`
matrix for the default/postgres build, `--no-default-features --features
libsql`, and `--all-features`, and strengthen the clippy step to the full `cargo
clippy --all --benches --tests --examples --all-features -- -D warnings`. Keep
the existing formatting/tests coverage, but ensure the plan explicitly reflects
the required checks for the legacy persistence migration.

---

Outside diff comments:
In `@crates/ironclaw_run_state/src/lib.rs`:
- Around line 922-932: The discard_pending flow in RunState::discard_pending is
currently split into a read/validate step and a later filesystem delete, which
can race with approve()/deny() and remove a record after its status has already
changed. Update this path to be atomic by using a CAS-aware transition or
transactional delete that enforces the same status invariant at the point of
mutation, rather than relying on a prior Pending check; keep the failure path
loud if the expected state no longer matches.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1c594037-4929-451d-97ad-73e096ed325d

📥 Commits

Reviewing files that changed from the base of the PR and between 44f063d and 3223540.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock, !**/Cargo.lock
📒 Files selected for processing (17)
  • .claude/rules/database.md
  • crates/ironclaw_filesystem/CLAUDE.md
  • crates/ironclaw_filesystem/src/cas.rs
  • crates/ironclaw_filesystem/src/cas/tests.rs
  • crates/ironclaw_filesystem/src/lib.rs
  • crates/ironclaw_filesystem/src/scoped.rs
  • crates/ironclaw_resources/Cargo.toml
  • crates/ironclaw_resources/src/cas_snapshot.rs
  • crates/ironclaw_resources/src/filesystem_store.rs
  • crates/ironclaw_resources/src/lib.rs
  • crates/ironclaw_run_state/src/lib.rs
  • crates/ironclaw_run_state/tests/approval_resolution_contract.rs
  • crates/ironclaw_run_state/tests/run_state_contract.rs
  • crates/ironclaw_secrets/src/filesystem_store.rs
  • crates/ironclaw_threads/src/filesystem_service.rs
  • crates/ironclaw_turns/src/filesystem_store.rs
  • docs/plans/2026-06-25-cas-migration.md

/// changed); `S: Clone` lets each retry hand `apply` an owned snapshot while the
/// helper retains a copy for that equality check — both mirror the turns
/// reference.
#[allow(clippy::too_many_arguments)]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Add the required arch-exempt annotation above this Clippy allow.

The repo's architecture rule rejects new #[allow(clippy::too_many_arguments)] without a same-site rationale. As per coding guidelines, "When introducing #[allow(clippy::too_many_arguments)], annotate with // arch-exempt: too_many_args, <reason naming the missing aggregation>, plan #NNNN``." Based on learnings, require the exemption comment immediately above the attribute and explain why this signature is justified.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_filesystem/src/cas.rs` at line 193, The new Clippy allow on
cas.rs needs the required arch-exempt rationale comment immediately above
#[allow(clippy::too_many_arguments)]. Update the same site near the cas::
function signature that triggered the lint by adding the mandatory //
arch-exempt: too_many_args, <reason naming the missing aggregation>, plan `#NNNN`
annotation, and make sure the reason justifies why this many parameters is
unavoidable until the missing aggregation is introduced.

Sources: Coding guidelines, Learnings

Comment on lines +262 to +265
// 3. No-op fast path: nothing changed, so skip the write entirely.
if matches!(&current, Some(existing) if *existing == snapshot) {
return Ok(outcome);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

The no-op shortcut doesn't cover absent records, but the public contract reads as if it does.

cas_update_loop() only skips the write when current is Some(existing) and existing == snapshot. None plus a default/unchanged snapshot still encodes and writes, which is why crates/ironclaw_turns/src/filesystem_store.rs had to add a separate BridgeError::NoOp path to suppress the write. Either implement the absent-record no-op here or narrow the docs around CasApply/cas_update to say the shortcut only applies when a record already exists. As per coding guidelines, "Comments that promise guarantees across layers must either be enforced by code/tests or softened to describe intent."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_filesystem/src/cas.rs` around lines 262 - 265, The no-op
shortcut in cas_update_loop currently only returns early when current is
Some(existing) and matches the snapshot, while absent records still fall through
to encoding and writing; update the logic to either treat None plus an
unchanged/default snapshot as a true no-op or adjust the CasApply/cas_update
contract and surrounding comments to state the shortcut only applies to existing
records. Use the existing cas_update_loop and CasApply symbols to keep the
behavior and docs aligned, and remove any need for downstream NoOp suppression
if you choose to enforce it here.

Source: Coding guidelines

Comment on lines +295 to +306
fn capabilities_known(capabilities: &BackendCapabilities) -> bool {
*capabilities != BackendCapabilities::default()
}

/// `true` when the backend's transaction tier is at least
/// [`TxnCapability::Cas`].
fn capabilities_support_cas(capabilities: &BackendCapabilities) -> bool {
matches!(
capabilities.txn(),
TxnCapability::Cas | TxnCapability::MultiKey
)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Don't overload BackendCapabilities::default() to mean "unknown".

BackendCapabilities::empty() is the crate's canonical "no capabilities advertised" shape, but capabilities_known() treats that same value as "unknown" and skips the pre-flight CAS gate. A concrete backend that inherits the default capabilities() now silently falls back to op-time behavior, which weakens the new fail-closed cas_update contract. As per coding guidelines, "Declare backend capabilities up front with BackendCapabilities" and "cas_update must fail closed with CasUpdateError::CasUnsupported."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_filesystem/src/cas.rs` around lines 295 - 306, The CAS
preflight is treating BackendCapabilities::default() as “unknown,” which lets a
backend with no advertised capabilities bypass the fail-closed gate. Update
capabilities_known() in cas.rs to only treat an explicit unknown state as
unknown and not overload BackendCapabilities::empty()/default(), then make the
cas_update path rely on that check so unsupported backends still return
CasUpdateError::CasUnsupported instead of falling back to op-time behavior.

Source: Coding guidelines

Comment on lines +184 to +399
#[tokio::test]
async fn create_if_absent_first_write_succeeds() {
let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
let scope = ResourceScope::system();

let outcome = cas_update(
fs.as_ref(),
&scope,
&counter_path(),
decode_counter,
encode_counter,
increment,
)
.await
.unwrap();

assert_eq!(outcome, 1, "first write returns the new value");

// The record now exists at version 1 with the expected body.
let stored = fs
.get(&scope, &counter_path())
.await
.unwrap()
.expect("counter persisted");
let counter = decode_counter(&stored.entry.body).unwrap();
assert_eq!(counter.value, 1);
}

#[tokio::test]
async fn no_op_apply_skips_write() {
let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
let scope = ResourceScope::system();

// Seed a value of 5.
cas_update(
fs.as_ref(),
&scope,
&counter_path(),
decode_counter,
encode_counter,
|current: Option<Counter>| async move {
let _ = current;
Ok::<_, TestError>(CasApply::new(Counter { value: 5 }, ()))
},
)
.await
.unwrap();

let version_before = fs
.get(&scope, &counter_path())
.await
.unwrap()
.unwrap()
.version;

// An apply that returns the unchanged snapshot must not bump the version.
cas_update(
fs.as_ref(),
&scope,
&counter_path(),
decode_counter,
encode_counter,
|current: Option<Counter>| async move {
let snapshot = current.unwrap();
Ok::<_, TestError>(CasApply::new(snapshot, ()))
},
)
.await
.unwrap();

let version_after = fs
.get(&scope, &counter_path())
.await
.unwrap()
.unwrap()
.version;
assert_eq!(
version_before, version_after,
"no-op apply must not issue a write"
);
}

#[tokio::test]
async fn high_contention_storm_has_no_lost_updates() {
const WRITERS: u64 = 50;

let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
let scope = ResourceScope::system();

let mut handles = Vec::new();
for _ in 0..WRITERS {
let fs = fs.clone();
let scope = scope.clone();
handles.push(tokio::spawn(async move {
cas_update(
fs.as_ref(),
&scope,
&counter_path(),
decode_counter,
encode_counter,
increment,
)
.await
}));
}

let mut observed = Vec::new();
for handle in handles {
observed.push(handle.await.unwrap().expect("writer succeeded"));
}

// Final value must equal the number of writers — every increment landed.
let final_counter = decode_counter(
&fs.get(&scope, &counter_path())
.await
.unwrap()
.unwrap()
.entry
.body,
)
.unwrap();
assert_eq!(
final_counter.value, WRITERS,
"every concurrent increment must be observed (no lost update)"
);

// Each writer observed a distinct increment value in 1..=WRITERS.
observed.sort_unstable();
let expected: Vec<u64> = (1..=WRITERS).collect();
assert_eq!(
observed, expected,
"each writer's returned outcome must be a unique increment"
);
}

#[tokio::test]
async fn persistent_version_mismatch_exhausts_retries() {
let backend = Arc::new(AlwaysMismatchBackend::new());
let fs = Arc::new(scoped(backend.clone()));
let scope = ResourceScope::system();

// `get` always returns a synthetic existing record, so every attempt takes
// the put path and races into a VersionMismatch — no seed write needed.
let result = cas_update(
fs.as_ref(),
&scope,
&counter_path(),
decode_counter,
encode_counter,
increment,
)
.await;

assert!(
matches!(result, Err(CasUpdateError::RetriesExhausted)),
"persistent VersionMismatch must terminate with RetriesExhausted, got {result:?}"
);
assert_eq!(
backend.put_attempts.load(Ordering::SeqCst),
super::FILESYSTEM_CAS_RETRIES,
"the loop must attempt exactly the retry cap before giving up"
);
}

#[tokio::test]
async fn non_cas_backend_is_rejected_not_overwritten() {
let backend = Arc::new(NonCasBackend::new());
let fs = Arc::new(scoped(backend.clone()));
let scope = ResourceScope::system();

let result = cas_update(
fs.as_ref(),
&scope,
&counter_path(),
decode_counter,
encode_counter,
increment,
)
.await;

assert!(
matches!(result, Err(CasUpdateError::CasUnsupported)),
"a non-CAS backend must be rejected by the capability gate, got {result:?}"
);

// Critically: nothing was written. The pre-flight gate refused before the
// blind-overwrite `put` could run.
let stored = fs.get(&scope, &counter_path()).await.unwrap();
assert!(
stored.is_none(),
"the capability gate must reject before any write (no blind overwrite)"
);
}

#[tokio::test]
async fn apply_error_is_carried_through_unwrapped() {
let fs = Arc::new(scoped(Arc::new(InMemoryBackend::new())));
let scope = ResourceScope::system();

let result: Result<u64, CasUpdateError<TestError>> = cas_update(
fs.as_ref(),
&scope,
&counter_path(),
decode_counter,
encode_counter,
|_current: Option<Counter>| async move {
Err::<CasApply<Counter, u64>, _>(TestError("boom".to_string()))
},
)
.await;

match result {
Err(CasUpdateError::Apply(TestError(reason))) => assert_eq!(reason, "boom"),
other => panic!("expected Apply error carried through, got {other:?}"),
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Add a regression test for the timeout branch.

CasUpdateError::Timeout is part of the new public contract and is the load-bearing guard for the wedged-backend failure mode, but this suite never stalls get, put, or apply long enough to prove the outer timeout wins. A tiny hanging backend plus paused Tokio time would cover the branch deterministically. As per coding guidelines, "Every bug fix must include a regression test" and Rust tests should cover changed critical paths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_filesystem/src/cas/tests.rs` around lines 184 - 399, Add a
regression test in cas/tests.rs for the timeout path in cas_update, since the
current tests cover retries, CAS rejection, and apply errors but never exercise
CasUpdateError::Timeout. Introduce a tiny wedged backend or stub that blocks one
of the CAS operations (get, put, or apply), and use paused Tokio time so the
outer timeout in cas_update deterministically fires. Reference cas_update and
CasUpdateError::Timeout in the new test, and assert that the timeout branch wins
over the hung backend behavior.

Source: Coding guidelines

CasUpdateError::CasUnsupported => {
E::storage("snapshot backend does not support versioned compare-and-swap".to_string())
}
CasUpdateError::Backend(inner) => E::storage_from(inner),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Sanitize backend errors before returning StorageError.

Line 212 forwards FilesystemError::to_string() into the store error; those displays can include backend virtual paths and raw backend reasons. Return a stable sanitized message here, and keep the detailed error only in internal diagnostics/source context. As per path instructions, do not expose backend paths or raw backend errors across public surfaces.

Suggested direction
-        CasUpdateError::Backend(inner) => E::storage_from(inner),
+        CasUpdateError::Backend(_inner) => {
+            E::storage("snapshot filesystem backend error".to_string())
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
CasUpdateError::Backend(inner) => E::storage_from(inner),
CasUpdateError::Backend(_inner) => {
E::storage("snapshot filesystem backend error".to_string())
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_resources/src/cas_snapshot.rs` at line 212, The
CasUpdateError::Backend handling in cas_snapshot.rs currently forwards the
backend error string directly via E::storage_from(inner), which can leak virtual
paths and raw backend details into StorageError. Update this mapping to return a
stable sanitized public message from the error conversion path, and preserve the
detailed backend error only in internal diagnostics or source context associated
with E::storage_from.

Source: Path instructions

CasUpdateError::CasUnsupported => RunStateError::Backend(
"backend does not support versioned compare-and-swap".to_string(),
),
CasUpdateError::Backend(fs_err) => RunStateError::Filesystem(fs_err.to_string()),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Preserve the typed filesystem error.

fs_err.to_string() drops the structured FilesystemError variant and path/operation context that existing ? conversions preserve elsewhere in this file. Prefer the existing typed conversion:

Proposed fix
-        CasUpdateError::Backend(fs_err) => RunStateError::Filesystem(fs_err.to_string()),
+        CasUpdateError::Backend(fs_err) => fs_err.into(),

As per path instructions: errors should propagate with context into typed errors.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
CasUpdateError::Backend(fs_err) => RunStateError::Filesystem(fs_err.to_string()),
CasUpdateError::Backend(fs_err) => fs_err.into(),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_run_state/src/lib.rs` at line 1109, The
`CasUpdateError::Backend` branch in `RunStateError` currently converts the
filesystem failure to a string, which drops the structured `FilesystemError`
context. Update this match arm in `lib.rs` to preserve the typed error by using
the existing `?`-style conversion path or direct typed `From`/`Into` conversion
used elsewhere in this file, so the `FilesystemError` variant and its
path/operation details continue to propagate through
`RunStateError::Filesystem`.

Sources: Coding guidelines, Path instructions

Comment on lines +479 to +483
let already_marked = lease.status == SecretLeaseStatus::Expired;
if already_marked {
// No-op: return unchanged record so helper skips
// the write (PartialEq equality path).
Err(SecretStoreError::LeaseExpired { lease_id })

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the CAS no-op comment.

Line 481 says this returns the unchanged record for the PartialEq no-op path, but the code returns Err(LeaseExpired) and exits through CasUpdateError::Apply. Update the comment or return CasApply::new(lease, Err(...)) to match the documented behavior. As per coding guidelines, comments that promise guarantees across layers must be enforced by code/tests or softened to describe intent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_secrets/src/filesystem_store.rs` around lines 479 - 483, The
no-op CAS comment in filesystem_store.rs is inaccurate: the `already_marked`
branch in the lease update path returns `Err(SecretStoreError::LeaseExpired {
lease_id })` and goes through `CasUpdateError::Apply`, not the unchanged
record/`PartialEq` skip path. Update the comment to describe the actual intent,
or change the branch in the relevant lease/CAS helper to return
`CasApply::new(lease, Err(...))` if that is the desired behavior; use the
`already_marked`, `SecretLeaseStatus::Expired`, and `CasUpdateError::Apply`
symbols to locate and align the code with the documented contract.

Source: Coding guidelines

Comment on lines +378 to 385
fn map_cas_error<T>(error: CasUpdateError<BridgeError<T>>) -> TurnError {
match error {
CasUpdateError::Apply(BridgeError::Real(inner)) => inner,
CasUpdateError::Apply(BridgeError::NoOp(_)) => {
// Should be unreachable: the caller extracts NoOp before calling
// map_cas_error. Defensive fallback.
unreachable!("NoOp bridge error must be handled by the apply caller")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don’t panic in the CAS error mapper.

This is production error-handling code; if the NoOp sentinel ever leaks past the caller, unreachable! turns a recoverable mapping bug into a process panic. Return a typed unavailable error instead.

Suggested fix
         CasUpdateError::Apply(BridgeError::NoOp(_)) => {
-            // Should be unreachable: the caller extracts NoOp before calling
-            // map_cas_error. Defensive fallback.
-            unreachable!("NoOp bridge error must be handled by the apply caller")
+            tracing::debug!("turn state CAS no-op sentinel reached error mapper");
+            TurnError::Unavailable {
+                reason: "turn state persistence temporarily unavailable".to_string(),
+            }
         }

As per coding guidelines, production Rust should avoid panic-style failure paths and map errors with context.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn map_cas_error<T>(error: CasUpdateError<BridgeError<T>>) -> TurnError {
match error {
CasUpdateError::Apply(BridgeError::Real(inner)) => inner,
CasUpdateError::Apply(BridgeError::NoOp(_)) => {
// Should be unreachable: the caller extracts NoOp before calling
// map_cas_error. Defensive fallback.
unreachable!("NoOp bridge error must be handled by the apply caller")
}
fn map_cas_error<T>(error: CasUpdateError<BridgeError<T>>) -> TurnError {
match error {
CasUpdateError::Apply(BridgeError::Real(inner)) => inner,
CasUpdateError::Apply(BridgeError::NoOp(_)) => {
tracing::debug!("turn state CAS no-op sentinel reached error mapper");
TurnError::Unavailable {
reason: "turn state persistence temporarily unavailable".to_string(),
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_turns/src/filesystem_store.rs` around lines 378 - 385, The
CAS error mapper in map_cas_error should not panic on BridgeError::NoOp, since
this is production error-handling code. Replace the unreachable! fallback in
map_cas_error with a typed unavailable TurnError that preserves context about
the unexpected NoOp leak, while keeping the BridgeError::Real(inner) path
unchanged and consistent with CasUpdateError handling.

Sources: Coding guidelines, Path instructions

Comment on lines +137 to +140
## Quality gate

Per touched crate: `cargo fmt --all`, `cargo clippy --all-targets` (zero
warnings), `cargo test` (+ `--features integration` where stores have it).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Align the plan's validation steps with the repo's required backend checks.

This plan documents a legacy persistence migration, but its quality gate drops the required feature-isolation cargo check matrix and weakens clippy to cargo clippy --all-targets. That makes the migration plan a stale source of truth for the postgres/libSQL split this PR is changing. As per coding guidelines, legacy persistence changes must "Test feature isolation" with default, --no-default-features --features libsql, and --all-features, and "Before You Open a PR" requires cargo clippy --all --benches --tests --examples --all-features -- -D warnings. Based on learnings, verify feature isolation with the required cargo check combinations for postgres, libSQL-only, and all-features builds.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/2026-06-25-cas-migration.md` around lines 137 - 140, The quality
gate in the migration plan is outdated and should match the repo’s required
backend validation for persistence changes. Update the “Quality gate” section to
include the feature-isolation `cargo check` matrix for the default/postgres
build, `--no-default-features --features libsql`, and `--all-features`, and
strengthen the clippy step to the full `cargo clippy --all --benches --tests
--examples --all-features -- -D warnings`. Keep the existing formatting/tests
coverage, but ensure the plan explicitly reflects the required checks for the
legacy persistence migration.

Sources: Coding guidelines, Learnings

@railway-app

railway-app Bot commented Jun 25, 2026

Copy link
Copy Markdown

🚅 Deployed to the ironclaw-pr-5234 environment in ironclaw-ci-preview

Service Status Web Updated (UTC)
ironclaw ✅ Success (View Logs) Web Jun 25, 2026 at 12:03 pm

serrrfirat added a commit that referenced this pull request Jun 25, 2026
Fold in PR review (henrypark133) + a durable-write hot-path audit. Adds §12
framing the deployed lease-expiry cascade as three independent layers:
- Layer A: in-process lock convoy → #5234 (open); write-behind composes on the
  post-#5234 CAS path.
- Layer B: pool starvation — DEFAULT_POSTGRES_POOL_MAX_SIZE=2 shared across all
  Postgres FS I/O. Notes the existing 30s checkout guard (closes the infinite
  hang) but flags pool-too-small + checkout(30s)>apply(15s); cheap mitigations
  (raise pool, reserve a critical connection, align checkout<apply) prior to and
  complementary with write-behind.
- Layer C: the synchronous hot-path write map (events / governor / thread-append
  / lease / memory) with the write-behind-vs-batch-coalesce distinction. Events
  are the top batch-coalesce target (highest churn, O(1) INSERT no CAS); must
  stay DURABLE (source of truth), per-step flush to preserve live SSE. Memory
  stays synchronous (FTS-only, no embedding write; agent-initiated, low churn).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules scope: dependencies Dependency updates scope: docs Documentation size: XL 500+ changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant