
Heartbeat lease write-behind: move runner-lease renewal off the synchronous Postgres path #5253

Description

@henrypark133

Backlog item from the Reborn write caching/batching audit (docs/plans/2026-06-25-write-caching-batching-classification.md).

Problem

The runner heartbeat fires every 30s and renews the lease (lease_expires_at = now+90s) with a synchronous CAS write to the turn-state store. That write shares the /turns/state.json blob with claim/complete and contends for the 2-connection Postgres pool (DEFAULT_POSTGRES_POOL_MAX_SIZE = 2). On a cross-region deployment (~100-200ms per round-trip), a pool-starved heartbeat hits VersionMismatch retries; once it runs past runner_heartbeat_interval it is treated as fatal (turn_scheduler.rs:660), which converts clean runs into scheduler_heartbeat_failed failures (the "did not finish before timeout" / lease-expiry cascade).
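To make the failure mode concrete, here is a toy model of a synchronous renewal that retries on VersionMismatch until it succeeds or overruns the heartbeat interval. It is only a sketch: `simulate_sync_heartbeat`, `HeartbeatOutcome`, and the idea of a fixed per-attempt cost (pool wait plus round-trip) are assumptions for illustration, not the logic in turn_scheduler.rs.

```rust
/// Outcome of one simulated heartbeat tick (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
pub enum HeartbeatOutcome {
    /// The CAS write eventually landed.
    Renewed { attempts: u32 },
    /// Retrying overran the heartbeat interval, so the run is failed.
    Fatal { attempts: u32 },
}

/// Toy model: every attempt costs `attempt_cost_ms` (pool wait + round-trip),
/// the first `mismatches_before_success` attempts return VersionMismatch,
/// and running past `interval_ms` is fatal.
pub fn simulate_sync_heartbeat(
    attempt_cost_ms: u64,
    mismatches_before_success: u32,
    interval_ms: u64,
) -> HeartbeatOutcome {
    let mut elapsed_ms: u64 = 0;
    let mut attempts: u32 = 0;
    loop {
        attempts += 1;
        elapsed_ms += attempt_cost_ms;
        if elapsed_ms > interval_ms {
            return HeartbeatOutcome::Fatal { attempts };
        }
        if attempts > mismatches_before_success {
            return HeartbeatOutcome::Renewed { attempts };
        }
    }
}
```

With an uncontended pool (say 200ms per attempt) a few mismatches are harmless. If pool starvation pushes each attempt to several seconds, the same number of mismatches overruns the 30s interval even though the run itself is healthy.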

Proposal

The heartbeat is pure liveness: the 90s TTL is the backstop and the write result is discarded. Move it to a non-blocking in-memory write-behind (coalescing drain), off the synchronous hot path. Ownership-critical writes (claim/complete/fail/block) stay synchronously durable.
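A minimal sketch of what a coalescing write-behind could look like, using only the standard library. All names here (`LeaseStore`, `HeartbeatBuffer`, `InMemoryStore`, `record`, `drain`) are hypothetical and not taken from the codebase; the real store interface and drain scheduling may differ.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the durable turn-state store.
pub trait LeaseStore {
    /// Persist a renewed expiry for `run_id`; may be slow or fail.
    fn renew_lease(&mut self, run_id: &str, lease_expires_at_ms: u64) -> Result<(), String>;
}

/// In-memory write-behind buffer holding at most one pending renewal per run.
#[derive(Default)]
pub struct HeartbeatBuffer {
    pending: Mutex<HashMap<String, u64>>,
}

impl HeartbeatBuffer {
    pub fn new() -> Self {
        Self::default()
    }

    /// Hot path: remember the newest expiry and return immediately.
    /// This never touches the store, so it cannot wait on a round-trip.
    pub fn record(&self, run_id: &str, lease_expires_at_ms: u64) {
        let mut pending = self.pending.lock().unwrap();
        let slot = pending.entry(run_id.to_string()).or_insert(0);
        // Coalesce: only the latest expiry for a run is worth writing.
        *slot = (*slot).max(lease_expires_at_ms);
    }

    /// Background path: take everything pending and flush it to the store.
    /// A failed renewal is dropped, not escalated; the next heartbeat
    /// supersedes it and the lease TTL remains the backstop.
    pub fn drain<S: LeaseStore>(&self, store: &mut S) -> usize {
        // The lock is held only for the swap, not during store writes.
        let batch = std::mem::take(&mut *self.pending.lock().unwrap());
        let mut written = 0;
        for (run_id, expiry) in batch {
            if store.renew_lease(&run_id, expiry).is_ok() {
                written += 1;
            }
        }
        written
    }
}

/// Toy store used to exercise the buffer (illustration only).
#[derive(Default)]
pub struct InMemoryStore {
    pub leases: HashMap<String, u64>,
    pub write_count: usize,
    pub fail: bool,
}

impl LeaseStore for InMemoryStore {
    fn renew_lease(&mut self, run_id: &str, lease_expires_at_ms: u64) -> Result<(), String> {
        if self.fail {
            return Err("backend unavailable".to_string());
        }
        self.leases.insert(run_id.to_string(), lease_expires_at_ms);
        self.write_count += 1;
        Ok(())
    }
}
```

In this shape the runner only calls `record` on each heartbeat, while a separate background task calls `drain` on its own schedule. Claim/complete/fail/block would bypass the buffer entirely and keep their synchronous CAS writes.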

Relationship

Acceptance

  • Heartbeat renewal does not block on a Postgres round-trip on the hot path.
  • A pool-starved/contended backend no longer converts a live run to scheduler_heartbeat_failed.
  • Ownership writes remain synchronously durable (no double-processing or split-brain-on-lease regression).
