
perf(publisher): 30s response cache + ETag/304 on hot read endpoints #267

Merged
wallscaler merged 2 commits into main from perf/active-challenges-cache
Jun 9, 2026


@wallscaler

Contributor

Problem

`GET /api/cathedral/v1/synthetic-boolean/active-challenges` receives ~11 req/s. The ~32 KB JSON payload (51 challenges) was rebuilt from SQLite on every request, and SQLite reads queue behind the `AsyncDbWriteLock` held by eval writes, producing:

  • Median latency: 748 ms
  • Average latency: 16.7 s (max 200 s during write bursts)
  • 15-20% of requests end as client-timeout 499s and are retried immediately, a thundering herd that amplifies write-lock pressure

`GET /v1/leaderboard/recent` (~39 KB avg, up to 458 KB) is polled by every validator on every pull loop tick. No caching headers meant every poll transferred the full body even when there were no new rows.

Estimated egress from active-challenges alone: ~30 GB/day (11 req/s × 32 KB × 86,400 s/day).
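The egress estimate can be checked with a few lines of arithmetic. This is a back-of-envelope sketch that assumes decimal units (1 GB = 1,000,000 KB) and a constant request rate; the constant names are illustrative only.

```python
# Back-of-envelope check of the egress estimate (decimal units assumed).
REQS_PER_SEC = 11        # observed request rate from the PR description
PAYLOAD_KB = 32          # approximate JSON payload size
SECONDS_PER_DAY = 86_400

kb_per_day = REQS_PER_SEC * PAYLOAD_KB * SECONDS_PER_DAY
gb_per_day = kb_per_day / 1_000_000  # KB -> GB, decimal

print(f"{gb_per_day:.1f} GB/day")  # prints "30.4 GB/day"
```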

Changes

`src/cathedral/publisher/response_cache.py` (new)

`TtlResponseCache` — pure-TTL single-flight cache:

  • 30s TTL, no explicit invalidation
  • One coroutine rebuilds; concurrent callers get stale bytes immediately (never queue N callers against SQLite)
  • Stores serialised JSON bytes — serialisation cost paid once per TTL window
  • Lives on `app.state` (not a module global) so each app instance/test gets an isolated cache
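The behaviour in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's `TtlResponseCache`: the class name `TtlCacheSketch`, the `get` method, and the `builder` parameter are assumptions made for the example, and the cold-start behaviour (callers wait for the first build because there is nothing stale to serve) is a choice of this sketch.

```python
import asyncio
import json
import time
from typing import Any, Awaitable, Callable, Optional


class TtlCacheSketch:
    """Illustrative pure-TTL single-flight cache (hypothetical, not the PR's class)."""

    def __init__(self, ttl_seconds: float = 30.0) -> None:
        self._ttl = ttl_seconds
        self._body: Optional[bytes] = None   # serialised JSON bytes
        self._expires_at = 0.0               # monotonic deadline
        self._lock = asyncio.Lock()          # held by the one rebuilding coroutine

    async def get(self, builder: Callable[[], Awaitable[Any]]) -> bytes:
        # Fresh hit: return the cached bytes without touching the builder.
        if self._body is not None and time.monotonic() < self._expires_at:
            return self._body
        # Expired, but a rebuild is already in flight: serve stale immediately.
        if self._body is not None and self._lock.locked():
            return self._body
        async with self._lock:
            # Re-check: another coroutine may have refreshed while we waited.
            if self._body is not None and time.monotonic() < self._expires_at:
                return self._body
            payload = await builder()
            # Serialise once per TTL window and store the bytes.
            self._body = json.dumps(payload).encode("utf-8")
            self._expires_at = time.monotonic() + self._ttl
            return self._body
```

Storing bytes rather than Python objects means the handler can hand the cached value straight to the response without serialising it again.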

`etag_response()` — shared ETag helper:

  • Computes `sha256[:16]` ETag over response body
  • Adds `Cache-Control: public, max-age=15` to every response
  • Returns `304 Not Modified` (empty body) when `If-None-Match` matches
  • Normalises `W/"..."` weak ETags per RFC 9110
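A framework-free sketch of the ETag logic described above might look like the following. The function names `compute_etag`, `if_none_match_matches`, and `etag_result` are hypothetical, the use of a hex digest for `sha256[:16]` is an assumption, and the real `etag_response()` presumably returns a FastAPI response object rather than a tuple. The sketch needs Python 3.9+ for `str.removeprefix`.

```python
import hashlib
from typing import Dict, Optional, Tuple


def compute_etag(body: bytes) -> str:
    """Quoted ETag from the first 16 hex chars of the SHA-256 digest (assumed hex)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def if_none_match_matches(header: Optional[str], etag: str) -> bool:
    """Weak comparison of an If-None-Match header value against one ETag."""
    if not header:
        return False
    if header.strip() == "*":
        return True  # star form matches any current representation
    wanted = etag.removeprefix("W/").strip('"')
    # Splitting on commas is safe here because hex ETags never contain commas.
    for member in header.split(","):
        candidate = member.strip().removeprefix("W/").strip('"')
        if candidate == wanted:
            return True
    return False


def etag_result(
    body: bytes, if_none_match: Optional[str]
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body); 304 with an empty body on a match."""
    etag = compute_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=15"}
    if if_none_match_matches(if_none_match, etag):
        return 304, headers, b""
    return 200, headers, body
```

Both the 200 and the 304 carry the same `ETag` and `Cache-Control` headers, which matches the "headers present on 304" case in the test list.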

`src/cathedral/publisher/submit.py`

`list_active_sat_challenges` now:

  1. Fetches bytes from the per-app `TtlResponseCache` (single SQLite round-trip per 30s window)
  2. Returns via `etag_response()` — Cache-Control + ETag + optional 304

`src/cathedral/publisher/reads.py`

`get_leaderboard_recent` now returns via `etag_response()`. No in-process TTL cache here — cursor params vary per validator, so per-response ETags are correct. Validators polling an up-to-date cursor get a 304 with no body to decode.
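From the validator side, conditional polling could look roughly like this. It is a hypothetical client sketch, not code from this PR: `poll_once` and the `Fetch` callable are invented for illustration, with `Fetch` standing in for whatever HTTP client a validator actually uses.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport: takes request headers, returns (status, headers, body).
Fetch = Callable[[Dict[str, str]], Tuple[int, Dict[str, str], bytes]]


def poll_once(
    fetch: Fetch, last_etag: Optional[str]
) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (new_body_or_None, etag_to_send_next_time)."""
    request_headers = {"If-None-Match": last_etag} if last_etag else {}
    status, response_headers, body = fetch(request_headers)
    if status == 304:
        # Nothing changed: no body to decode, keep the same ETag.
        return None, last_etag
    return body, response_headers.get("ETag")
```

A polling loop would keep the returned ETag between ticks and only decode the body when it is not `None`.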

`src/cathedral/publisher/app.py`

Seeds `app.state.active_challenges_cache = TtlResponseCache(ttl_seconds=30.0)` in the lifespan so each test's fresh FastAPI app gets its own clean cache instance.

`tests/publisher/test_response_cache.py` (new)

21 unit + integration tests covering:

  • `_compute_etag`: determinism, length, uniqueness
  • `etag_response`: 200 with headers, 304 on match, weak ETag normalisation, pre-serialised fast-path, headers present on 304
  • `TtlResponseCache`: cold start, TTL expiry, stale-serve, JSON byte storage, error propagation, concurrent single-flight
  • HTTP-level smoke tests for both endpoints: ETag presence, 304 on repeat

Staleness rationale

`active-challenges` embeds only absolute fields (challenge_id, tier, cnf_sha256, score_multiplier, difficulty_label, etc.). There are no relative "seconds remaining" counters. A 30s stale window is safe — challenge open/close events happen at most once per several minutes in production, and every miner polling loop already tolerates eventual consistency.

Rollback

Revert this commit. Both endpoints are purely stateless without the cache — no DB schema, no env vars, no Railway config changes required.

🤖 Generated with Claude Code

wallscaler added 2 commits June 9, 2026 17:16
active-challenges was rebuilt from SQLite on every request (~11 req/s).
AsyncDbWriteLock contention from eval writes pushed median latency to
748ms and average to 16.7s (max 200s). 15-20% of requests timed out
as 499s and immediately retried, creating a thundering herd.

Changes:
- Add TtlResponseCache (response_cache.py): per-app 30s TTL single-flight
  cache; one coroutine rebuilds, concurrent callers get stale bytes
  immediately rather than queuing behind the lock.
- active-challenges endpoint: caches the full serialised response for 30s.
  Cache lives on app.state (not module-global) so each test app gets an
  isolated instance.
- Both active-challenges and leaderboard/recent: serve Cache-Control:
  public, max-age=15 + ETag (sha256[:16]); honour If-None-Match with 304.
  Shared etag_response() helper — no copy-paste.
- Add tests/publisher/test_response_cache.py: 21 unit + integration tests
  covering TTL expiry, single-flight, ETag generation, 304 on match,
  weak ETag normalisation, and both endpoint smoke tests.

Staleness rationale: active-challenges embeds absolute fields only
(cnf_sha256, challenge_id, tier, score_multiplier); no relative
seconds-remaining fields. 30s staleness is safe: challenge state
changes at most once per several minutes in production.

Rollback: revert this commit; endpoints are purely stateless without cache.
…h list/star parsing

- TtlResponseCache._refresh: when builder raises and a stale body exists,
  log a warning and return the stale body instead of propagating; propagate
  only on cold start (no body at all), so a transient DB error no longer
  causes a 500 on an already-populated cache.
- etag_response: replace single-value lstrip hack with proper RFC 9110
  If-None-Match parsing: split on commas, strip whitespace + W/ prefix via
  removeprefix + strip quotes per member; handle star (*) form.
- Tests: 6 new cases covering expired+fail→stale, cold+fail→raise,
  fail-then-recover, list-form match, star match, list-form no-match.
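
The stale-on-error rule from the first bullet above can be sketched in isolation. The helper name `refresh_or_stale` and its parameters are hypothetical; the actual `TtlResponseCache._refresh` is not shown in this PR page and may be structured differently.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

log = logging.getLogger("response_cache_sketch")  # hypothetical logger name


async def refresh_or_stale(
    build: Callable[[], Awaitable[bytes]],
    stale_body: Optional[bytes],
) -> bytes:
    """Rebuild the body; on failure fall back to the stale copy if one exists."""
    try:
        return await build()
    except Exception:
        if stale_body is None:
            raise  # cold start: nothing to fall back on, so propagate
        log.warning("rebuild failed; serving stale body", exc_info=True)
        return stale_body
```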
wallscaler merged commit 57e5703 into main Jun 9, 2026
3 checks passed
wallscaler deleted the perf/active-challenges-cache branch June 9, 2026 21:38


1 participant