feat(publisher): file-backed CNF storage + lock-free serve for 8k/day throughput #265

Merged
wallscaler merged 1 commit into main from feat/publisher-cnf-throughput-8k
Jun 9, 2026

@wallscaler

Why

The publisher needs to feed up to ~8,000 challenges/day, and today it can't. Import currently stores the full ~0.5–1.8 MB DIMACS body as inline sqlite TEXT under the global db_write_lock, where it serializes against the open-window solve-write flood (~12 s/import observed → ~4.6/min ceiling; the board couldn't even hold 50 challenges). Inline text was chosen because the file-backed serve path (_CnfSnapshotCache) held a process-global asyncio.Lock and pruned on every fetch, which wedged /cnf under miner load. This PR fixes both sides.

Generator is not the bottleneck (healthy: ready_depth 31, producing on demand). No validator changes — validators pull signed eval rows, not CNFs.

What

  • File-backed import (sat_generator_import.py): the row carries cnf_path + metadata, no inline cnf_text, so the lock-held insert is tiny. Drops per-import lock hold from ~12 s to ~ms (also relieves the solve flood). Widen the minted challenge_id suffix to 64 bits (at 8k/day, 32 bits was collision-prone and a collision can overwrite a still-pending row).
  • Lock-free serve (challenge_cnf.py): per-request read+verify off the event loop that hashes the exact bytes it returns (no check/use gap) and holds them in memory (an in-flight retirement unlink can't corrupt the response). Preserves token/status/grace gating, size cap, regular-file check, and digest enforcement. Removes the global-locked snapshot cache. Inline cnf_text rows still serve → backward compatible, no migration.
  • Env-tunable retirement thresholds (sat_fill.py): age / distinct-solvers are now read from env so turnover can balance against import rate without a redeploy (~50 slots ÷ 8k/day, i.e. ≈ 5.6 imports/min, ≈ 9-min lifetime per slot).
  • DB-driven CNF GC (sat_fill.py): unlinks retired CNF files after a grace window to bound disk under churn. Targets only status='retired' rows, so it can never delete an in-flight import (those are 'pending').
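
To make the file-backed import shape concrete, here is a minimal sketch, not the code in sat_generator_import.py. The table name `challenges`, the columns `cnf_sha256` and `cnf_bytes`, the `sat-` id prefix, and the use of a `threading.Lock` as a stand-in for db_write_lock are all assumptions for illustration. The point it shows is that the slow file write happens outside the lock, and only a small metadata row is inserted while the lock is held.

```python
import hashlib
import os
import secrets
import sqlite3
import tempfile
import threading
from pathlib import Path

# Stand-in for the publisher's global db_write_lock (assumed type).
db_write_lock = threading.Lock()


def import_challenge(conn: sqlite3.Connection, cnf_dir: Path, body: bytes) -> str:
    """Write the DIMACS body to disk, then insert a small metadata row."""
    digest = hashlib.sha256(body).hexdigest()
    # 64-bit random suffix: 8 random bytes rendered as 16 hex characters.
    # The "sat-" prefix is hypothetical.
    challenge_id = f"sat-{secrets.token_hex(8)}"
    final_path = cnf_dir / f"{challenge_id}.cnf"

    # Slow work (write + fsync) runs outside the lock. Writing to a temp
    # file and renaming means a reader never sees a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=cnf_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(body)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, final_path)

    # Only this small insert is serialized against other writers.
    with db_write_lock:
        conn.execute(
            "INSERT INTO challenges"
            " (challenge_id, status, cnf_path, cnf_sha256, cnf_bytes)"
            " VALUES (?, 'pending', ?, ?, ?)",
            (challenge_id, str(final_path), digest, len(body)),
        )
        conn.commit()
    return challenge_id
```

Because the row holds only a path, a digest, and a size, the time spent under the lock no longer scales with the 0.5–1.8 MB body.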

⚠️ Behavior change to review

On an in-place file mutation, the serve endpoint now returns 404 (digest mismatch) instead of serving a pinned original snapshot. It never serves bytes that don't match the announced cnf_sha256. For generator-owned immutable files this is equivalent; please confirm you're happy dropping the snapshot-isolation behavior for the operator-managed file path.
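
The new serve semantics can be sketched roughly as follows. This is an illustration under assumptions, not the code in challenge_cnf.py: `serve_cnf`, `_read_verified`, and `MAX_CNF_BYTES` are hypothetical names, the size cap value is made up, and the token/status/grace gating is omitted. It shows the two properties described above: the digest is computed over the exact bytes that are returned, and the blocking read runs off the event loop.

```python
import asyncio
import hashlib
import os
import stat
from typing import Optional

MAX_CNF_BYTES = 4 * 1024 * 1024  # hypothetical size cap


def _read_verified(path: str, expected_sha256: str) -> Optional[bytes]:
    """Return the file's bytes only if they hash to expected_sha256."""
    try:
        with open(path, "rb") as fh:
            # Check type and size on the open handle, not on the path,
            # so the check applies to the file actually being read.
            st = os.fstat(fh.fileno())
            if not stat.S_ISREG(st.st_mode) or st.st_size > MAX_CNF_BYTES:
                return None
            data = fh.read(MAX_CNF_BYTES + 1)
    except OSError:
        return None  # missing or unreadable file
    if len(data) > MAX_CNF_BYTES:
        return None
    # Hash the same in-memory bytes that will be returned: no check/use gap.
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        return None
    return data


async def serve_cnf(path: str, expected_sha256: str) -> tuple[int, bytes]:
    """Return (status, body); any verification failure maps to 404."""
    data = await asyncio.to_thread(_read_verified, path, expected_sha256)
    if data is None:
        return 404, b""
    return 200, data
```

Since the response body is already in memory once verification passes, a concurrent unlink of the file cannot change what is sent. A file mutated in place simply fails the digest comparison and yields 404.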

Out of scope / follow-ups

  • Parallel import (lease/fetch concurrency): intentionally deferred — file-backed import alone clears well past 8k/day serially. Codex flagged shared-connection risks; leaving it out keeps this PR low-risk.
  • One-time migration of existing inline cnf_text rows to file-backed + VACUUM to reclaim disk (not required for correctness).
  • Pre-existing ASYNC240 on sat_generator_import.py:92 (unchanged line).

Testing

  • New tests: lock-free serve (+ mutation→404, missing-file→404), file-backed import shape + 64-bit id, GC (removes past-grace / keeps within-grace / ignores active), env-tunable thresholds. Existing snapshot-cache tests replaced to match new serve semantics.
  • Changed-module suites green; ruff clean on changed files.
  • Full tests/publisher run identical to main (43 failed / 42 errors on both — pre-existing full-suite-in-one-process flakiness from real-interval tests; this branch adds 3 passing tests, zero new failures).

Provenance

Plan independently reviewed by a separate Codex pass before implementation (no blockers; HIGH/MEDIUM findings folded in: full CBMC-style flag parity not relevant here, but the _hotkey_for-equivalent integrity checks, GC in-flight-import safety, 64-bit id, and serve-race were all addressed).

… throughput

Raise publisher CNF throughput so the board can sustain up to ~8k
challenges/day. Publisher-only; no validator changes (validators pull
signed eval rows, not CNFs).

Root cause: import stored the full ~0.5-1.8 MB DIMACS body as inline
sqlite TEXT under the global db_write_lock, serializing against the
open-window solve-write flood (~12s/import). Inline text was chosen
because the file-backed serve path (_CnfSnapshotCache) held a process-
global asyncio.Lock + prune-on-every-fetch that wedged /cnf under load.

Changes:
- Import is now file-backed: the row carries cnf_path + metadata, no
  inline cnf_text, so the lock-held insert is tiny. Widen the minted
  challenge_id suffix to 64 bits (8k/day made 32 bits collision-prone,
  and a collision can overwrite a still-pending row).
- Serve file-backed CNFs lock-free: a per-request read+verify off the
  event loop that hashes the exact bytes it returns (no check/use gap)
  and holds them in memory (an in-flight retirement unlink can't corrupt
  the response). Preserves token/status/grace gating, size cap, regular-
  file check, and digest enforcement. Removes the global-locked snapshot
  cache. Inline cnf_text rows still serve (backward compatible; no
  migration). NOTE: on an in-place file mutation the endpoint now returns
  404 (digest mismatch) rather than serving a pinned original snapshot —
  it never serves bytes that don't match the announced cnf_sha256.
- Retirement thresholds (age / distinct-solvers) are now env-tunable so
  turnover can balance against import rate without a redeploy.
- DB-driven GC unlinks retired CNF files after a grace window, bounding
  disk under high churn. It targets only status='retired' rows, so it can
  never delete an in-flight import (those are 'pending').
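
As an illustration of why a status-filtered GC cannot touch an in-flight import, here is a minimal sketch. The function name `gc_retired_cnfs`, the `challenges` table, and the `retired_at` column (epoch seconds) are assumptions; the actual code in sat_fill.py may be organized differently.

```python
import os
import sqlite3
import time
from typing import Optional


def gc_retired_cnfs(
    conn: sqlite3.Connection,
    grace_seconds: float,
    now: Optional[float] = None,
) -> int:
    """Unlink CNF files of rows retired more than grace_seconds ago.

    Returns the number of files actually removed.
    """
    now = time.time() if now is None else now
    cutoff = now - grace_seconds
    # The status filter is the safety property: 'pending' (in-flight) and
    # active rows never match, whatever their age.
    rows = conn.execute(
        "SELECT cnf_path FROM challenges"
        " WHERE status = 'retired' AND retired_at <= ?"
        " AND cnf_path IS NOT NULL",
        (cutoff,),
    ).fetchall()
    removed = 0
    for (cnf_path,) in rows:
        try:
            os.unlink(cnf_path)
            removed += 1
        except FileNotFoundError:
            pass  # already collected on an earlier pass
    return removed
```

Treating a missing file as a no-op keeps the pass idempotent, so it can run on a timer without tracking which rows were already collected.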

Tests: lock-free serve (incl. mutation->404, missing-file->404), file-
backed import shape + 64-bit id, GC removes past-grace/keeps recent/
ignores active, env-tunable thresholds. Existing snapshot-cache tests
replaced to match the new serve semantics.

Parallel import (lease/fetch concurrency) intentionally left as a
follow-up: file-backed import already drops the per-import lock hold from
~12s to ~ms, so serial imports clear well past 8k/day.
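
The throughput figures quoted in this PR can be checked with a few lines of arithmetic. The inputs are the numbers stated above (8k/day target, ~4.6 imports/min observed ceiling, ~50 board slots); the variable names are only for illustration.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 1_440

target_per_day = 8_000
observed_ceiling_per_min = 4.6  # from the ~12 s/import observation
board_slots = 50

# Import rate the target demands, and the time budget per serial import.
required_per_min = target_per_day / MINUTES_PER_DAY      # about 5.56
budget_s_per_import = SECONDS_PER_DAY / target_per_day   # 10.8

# What the old inline-text path could sustain over a full day.
old_ceiling_per_day = observed_ceiling_per_min * MINUTES_PER_DAY  # about 6,624

# Average time a challenge occupies a slot at the target rate.
slot_lifetime_min = board_slots / required_per_min       # about 9.0

print(f"required: {required_per_min:.2f}/min, budget: {budget_s_per_import:.1f} s/import")
print(f"old ceiling: {old_ceiling_per_day:.0f}/day, slot lifetime: {slot_lifetime_min:.1f} min")
```

A serial importer has 10.8 s per import at the target rate, so the old ~12 s lock hold falls short (about 6,624/day), while a hold measured in milliseconds leaves the budget almost entirely to file I/O.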
@wallscaler wallscaler merged commit 8ff39a5 into main Jun 9, 2026
3 checks passed
@wallscaler wallscaler deleted the feat/publisher-cnf-throughput-8k branch June 9, 2026 16:04