feat(publisher): file-backed CNF storage + lock-free serve for 8k/day throughput #265
Merged
… throughput

Raise publisher CNF throughput so the board can sustain up to ~8k challenges/day. Publisher-only; no validator changes (validators pull signed eval rows, not CNFs).

Root cause: import stored the full ~0.5-1.8 MB DIMACS body as inline sqlite TEXT under the global db_write_lock, serializing against the open-window solve-write flood (~12s/import). Inline text was chosen because the file-backed serve path (_CnfSnapshotCache) held a process-global asyncio.Lock + prune-on-every-fetch that wedged /cnf under load.

Changes:
- Import is now file-backed: the row carries cnf_path + metadata, no inline cnf_text, so the lock-held insert is tiny. Widen the minted challenge_id suffix to 64 bits (8k/day made 32 bits collision-prone, and a collision can overwrite a still-pending row).
- Serve file-backed CNFs lock-free: a per-request read+verify off the event loop that hashes the exact bytes it returns (no check/use gap) and holds them in memory (an in-flight retirement unlink can't corrupt the response). Preserves token/status/grace gating, size cap, regular-file check, and digest enforcement. Removes the global-locked snapshot cache. Inline cnf_text rows still serve (backward compatible; no migration).
  NOTE: on an in-place file mutation the endpoint now returns 404 (digest mismatch) rather than serving a pinned original snapshot; it never serves bytes that don't match the announced cnf_sha256.
- Retirement thresholds (age / distinct-solvers) are now env-tunable so turnover can balance against import rate without a redeploy.
- DB-driven GC unlinks retired CNF files after a grace window, bounding disk under high churn. It targets only status='retired' rows, so it can never delete an in-flight import (those are 'pending').

Tests: lock-free serve (incl. mutation->404, missing-file->404), file-backed import shape + 64-bit id, GC removes past-grace/keeps recent/ignores active, env-tunable thresholds. Existing snapshot-cache tests replaced to match the new serve semantics.

Parallel import (lease/fetch concurrency) intentionally left as a follow-up: file-backed import already drops the per-import lock hold from ~12s to ~ms, so serial imports clear well past 8k/day.
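To put a number on "8k/day made 32 bits collision-prone", here is a rough birthday-bound estimate. It is a sketch under stated assumptions, not the publisher's minting code: it assumes suffixes are drawn uniformly at random and that every id minted in the window shares one namespace. `collision_probability` and `mint_suffix` are hypothetical names, and the hex encoding of the suffix is also an assumption.

```python
import math
import secrets


def collision_probability(n_ids: int, bits: int) -> float:
    """Birthday-bound approximation: chance that any two of n_ids
    uniformly random `bits`-bit values are equal."""
    pairs = n_ids * (n_ids - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed stably for tiny probabilities.
    return -math.expm1(-pairs / 2.0**bits)


def mint_suffix(bits: int = 64) -> str:
    """Hypothetical minting helper: a random suffix of `bits` bits as hex."""
    return secrets.token_hex(bits // 8)


PER_DAY = 8_000  # target import rate from the PR description
for days in (1, 30, 365):
    n = PER_DAY * days
    p32 = collision_probability(n, 32)
    p64 = collision_probability(n, 64)
    print(f"{days:>3} day(s): 32-bit {p32:.3g}, 64-bit {p64:.3g}")
```

Under these assumptions a 32-bit suffix has roughly a 0.7% chance of a collision within a single day's worth of ids and approaches certainty over a year, while 64 bits stays far below one in a million over the same year.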
Why

The publisher needs to feed up to ~8,000 challenges/day. It can't today: import stored the full ~0.5–1.8 MB DIMACS body as inline sqlite TEXT under the global db_write_lock, serialized against the open-window solve-write flood (~12 s/import observed → ~4.6/min ceiling; the board couldn't even hold 50). Inline text was chosen because the file-backed serve path (_CnfSnapshotCache) held a process-global asyncio.Lock + prune-on-every-fetch that wedged /cnf under miner load. This PR fixes both sides.

Generator is not the bottleneck (healthy: ready_depth 31, producing on demand). No validator changes; validators pull signed eval rows, not CNFs.

What
- Import (sat_generator_import.py): the row carries cnf_path + metadata, no inline cnf_text, so the lock-held insert is tiny. Drops per-import lock hold from ~12 s to ~ms (also relieves the solve flood). Widens the minted challenge_id suffix to 64 bits (at 8k/day, 32 bits was collision-prone, and a collision can overwrite a still-pending row).
- Serve (challenge_cnf.py): per-request read+verify off the event loop that hashes the exact bytes it returns (no check/use gap) and holds them in memory (an in-flight retirement unlink can't corrupt the response). Preserves token/status/grace gating, size cap, regular-file check, and digest enforcement. Removes the global-locked snapshot cache. Inline cnf_text rows still serve → backward compatible, no migration.
- Retirement thresholds (sat_fill.py): age / distinct-solvers now read from env so turnover can balance against import rate without a redeploy. At 8k/day across ~50 slots, each slot lives about 9 min (86,400 s ÷ 8,000 ≈ 10.8 s per import, × 50 slots ≈ 540 s).
- GC (sat_fill.py): unlinks retired CNF files after a grace window to bound disk under churn. Targets only status='retired' rows, so it can never delete an in-flight import (those are 'pending').

On an in-place file mutation, the serve endpoint now returns 404 (digest mismatch) instead of serving a pinned original snapshot. It never serves bytes that don't match the announced cnf_sha256. For generator-owned immutable files this is equivalent; please confirm you're happy dropping the snapshot-isolation behavior for the operator-managed file path.

Out of scope / follow-ups
- Migrate inline cnf_text rows to file-backed + VACUUM to reclaim disk (not required for correctness).
- ASYNC240 on sat_generator_import.py:92 (unchanged line).

Testing
- ruff clean on changed files.
- tests/publisher run identical to main (43 failed / 42 errors on both; pre-existing full-suite-in-one-process flakiness from real-interval tests; this branch adds 3 passing tests, zero new failures).

Provenance
Plan independently reviewed by a separate Codex pass before implementation (no blockers; HIGH/MEDIUM findings folded in: full CBMC-style flag parity not relevant here, but the _hotkey_for-equivalent integrity checks, GC in-flight-import safety, 64-bit id, and serve-race were all addressed).
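For reviewers who want the serve-race fix in concrete terms, the read+verify step might look roughly like the sketch below. It is not the code in challenge_cnf.py: `read_verified`, `load_cnf`, and `MAX_CNF_BYTES` are hypothetical names, the cap value is made up, and the token/status/grace gating is left out. It only illustrates reading the file once, hashing those same bytes, and returning them from memory.

```python
import asyncio
import hashlib
import os
import stat
from pathlib import Path
from typing import Optional

MAX_CNF_BYTES = 4 * 1024 * 1024  # hypothetical size cap, not the real setting


def read_verified(path: Path, expected_sha256: str,
                  max_bytes: int = MAX_CNF_BYTES) -> Optional[bytes]:
    """Blocking helper: return the file's bytes only if they hash to
    expected_sha256; return None (caller answers 404) otherwise."""
    try:
        with open(path, "rb") as fh:
            st = os.fstat(fh.fileno())  # stat the handle we actually read from
            if not stat.S_ISREG(st.st_mode) or st.st_size > max_bytes:
                return None
            data = fh.read(max_bytes + 1)  # one read; these bytes are hashed and returned
    except OSError:
        return None  # missing, unlinked before open, or unreadable
    if len(data) > max_bytes:
        return None  # file grew past the cap after the fstat
    if hashlib.sha256(data).hexdigest() != expected_sha256.lower():
        return None  # in-place mutation or corruption: never serve mismatched bytes
    return data


async def load_cnf(path: Path, expected_sha256: str) -> Optional[bytes]:
    """Run the blocking read+hash in a worker thread, off the event loop,
    without taking any process-global lock."""
    return await asyncio.to_thread(read_verified, path, expected_sha256)
```

Because the digest is computed over the in-memory buffer that is later returned, there is no window between check and use, and a retirement unlink that lands after the read cannot change the response.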