Backlog item from the Reborn write caching/batching audit (docs/plans/2026-06-25-write-caching-batching-classification.md).
Problem
The runner heartbeat fires every 30s and renews the lease (lease_expires_at = now+90s) with a synchronous CAS write to the turn-state store. That write shares the /turns/state.json blob with claim/complete and contends for the 2-connection Postgres pool (DEFAULT_POSTGRES_POOL_MAX_SIZE = 2). On a cross-region deployment (~100-200ms per round-trip), a pool-starved heartbeat hits VersionMismatch retries. Once the renewal runs past runner_heartbeat_interval it is treated as fatal (turn_scheduler.rs:660), which converts clean runs into scheduler_heartbeat_failed failures (the "did not finish before timeout" / lease-expiry cascade).
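A rough model of this failure mode, as a sketch only. It uses simulated time rather than real I/O, and every name in it (simulate_heartbeat, HeartbeatOutcome, the pool_wait and round_trips_per_attempt parameters) is hypothetical and not taken from turn_scheduler.rs. It assumes each CAS attempt pays one pool checkout wait plus a fixed number of round-trips, and that the renewal is fatal once the accumulated time exceeds the heartbeat interval.

```rust
use std::time::Duration;

/// Outcome of one simulated synchronous heartbeat renewal.
#[derive(Debug, PartialEq)]
enum HeartbeatOutcome {
    /// The CAS write landed within the deadline.
    Renewed { attempts: u32, elapsed: Duration },
    /// The renewal ran past the deadline and would be treated as fatal.
    Fatal { attempts: u32, elapsed: Duration },
}

/// Simulates a CAS retry loop under contention (assumed cost model).
///
/// Each attempt costs `pool_wait` (time spent waiting for a connection)
/// plus `round_trips_per_attempt * round_trip`. The first
/// `conflicts_before_success` attempts fail with a version mismatch.
fn simulate_heartbeat(
    round_trip: Duration,
    round_trips_per_attempt: u32,
    pool_wait: Duration,
    conflicts_before_success: u32,
    deadline: Duration,
) -> HeartbeatOutcome {
    let mut elapsed = Duration::ZERO;
    let mut attempts = 0;
    loop {
        attempts += 1;
        elapsed += pool_wait + round_trip * round_trips_per_attempt;
        if elapsed > deadline {
            return HeartbeatOutcome::Fatal { attempts, elapsed };
        }
        if attempts > conflicts_before_success {
            return HeartbeatOutcome::Renewed { attempts, elapsed };
        }
    }
}
```

With illustrative inputs of a 200ms round-trip, 3 round-trips per attempt, and a 30s deadline, an uncontended renewal finishes in 600ms. With a 5s pool wait per attempt (an invented figure) and repeated version mismatches, the sixth attempt crosses the 30s deadline even though the run itself is healthy.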
Proposal
The heartbeat is pure liveness: the 90s TTL is the backstop and the write result is discarded. Move it to a non-blocking in-memory write-behind (coalescing drain), off the synchronous hot path. Ownership-critical writes (claim/complete/fail/block) stay synchronously durable.
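One possible shape for the coalescing write-behind, as a minimal sketch rather than the intended implementation. HeartbeatBuffer, LeaseStore, drain_once, MemoryStore, and the u64 run id are all hypothetical names introduced for illustration. The sketch assumes the drain runs on a background task and that a failed renewal can be dropped, because the next heartbeat supersedes it and the 90s TTL remains the backstop.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical run identifier; the real key type is not given in the issue.
type RunId = u64;

/// Lease TTL from the issue: lease_expires_at = now + 90s.
const LEASE_TTL: Duration = Duration::from_secs(90);

/// Pending renewals, coalesced to the newest deadline per run.
#[derive(Default)]
struct HeartbeatBuffer {
    pending: Mutex<HashMap<RunId, Instant>>,
}

impl HeartbeatBuffer {
    /// Hot path: memory-only, never touches the backend.
    fn heartbeat(&self, run: RunId, now: Instant) {
        let mut pending = self.pending.lock().unwrap();
        pending.insert(run, now + LEASE_TTL);
    }

    /// Drain side: take every pending renewal as one batch.
    fn take_batch(&self) -> Vec<(RunId, Instant)> {
        let mut pending = self.pending.lock().unwrap();
        pending.drain().collect()
    }
}

/// Stand-in for the durable turn-state store (assumed interface).
trait LeaseStore {
    fn renew(&mut self, run: RunId, lease_expires_at: Instant) -> Result<(), String>;
}

/// One drain pass, returning how many renewals were written.
/// Failures are dropped instead of being surfaced as fatal.
fn drain_once(buffer: &HeartbeatBuffer, store: &mut dyn LeaseStore) -> usize {
    let mut written = 0;
    for (run, expires_at) in buffer.take_batch() {
        if store.renew(run, expires_at).is_ok() {
            written += 1;
        }
    }
    written
}

/// Toy in-memory store used only to exercise the sketch.
#[derive(Default)]
struct MemoryStore {
    leases: HashMap<RunId, Instant>,
    fail: bool,
}

impl LeaseStore for MemoryStore {
    fn renew(&mut self, run: RunId, lease_expires_at: Instant) -> Result<(), String> {
        if self.fail {
            return Err("pool checkout timed out".to_string());
        }
        self.leases.insert(run, lease_expires_at);
        Ok(())
    }
}
```

Two heartbeats for the same run between drain passes collapse into a single backend write carrying the later deadline, so backend latency affects only the drain, not the caller. Claim/complete/fail/block would bypass this buffer entirely and keep their synchronous path.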
Relationship
Acceptance
- Heartbeat renewal does not block on a Postgres round-trip on the hot path.
- A pool-starved/contended backend no longer converts a live run to scheduler_heartbeat_failed.
- Ownership writes remain synchronously durable (no double-processing or split-brain-on-lease regression).
Related: the put 3→1 round-trip reduction, and Postgres pool sizing + checkout timeout.