perf(publisher): 30s response cache + ETag/304 on hot read endpoints by wallscaler · Pull Request #267 · cathedralai/cathedral

wallscaler · 2026-06-09T21:17:32Z

Problem

`GET /api/cathedral/v1/synthetic-boolean/active-challenges` receives ~11 req/s. The ~32 KB JSON payload (51 challenges) was rebuilt from SQLite on every request, and those SQLite reads queued behind the `AsyncDbWriteLock` held by eval writes, producing:

Median latency: 748 ms
Average latency: 16.7 s (max 200 s during write bursts)
15-20% of requests die as client-timeout 499s and are immediately retried, a thundering herd that amplifies write-lock pressure

`GET /v1/leaderboard/recent` (~39 KB avg, up to 458 KB) is polled by every validator on every pull-loop tick. With no caching headers, every poll transferred the full body even when there were no new rows.

Estimated egress from active-challenges alone: ~30 GB/day at 11 req/s x 32 KB.
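For reference, the egress estimate works out as follows (decimal units, 1 GB = 10^6 KB, assuming every request transfers the full body):

```latex
\[
11\,\tfrac{\text{req}}{\text{s}} \times 32\,\tfrac{\text{KB}}{\text{req}} \times 86\,400\,\tfrac{\text{s}}{\text{day}}
= 30\,412\,800\,\tfrac{\text{KB}}{\text{day}} \approx 30.4\,\tfrac{\text{GB}}{\text{day}}
\]
```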

Changes

`src/cathedral/publisher/response_cache.py` (new)

`TtlResponseCache` — pure-TTL single-flight cache:

30s TTL, no explicit invalidation
One coroutine rebuilds; concurrent callers get stale bytes immediately (never queue N callers against SQLite)
Stores serialised JSON bytes — serialisation cost paid once per TTL window
Lives on `app.state` (not a module global) so each app instance/test gets an isolated cache
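A minimal asyncio sketch of the single-flight behaviour described above, also folding in the stale-on-error rule from the follow-up commit. This is an illustration under assumptions, not the module's actual code: the `builder` callable returning a JSON-serialisable object, the injectable `clock`, and the compact `json.dumps` separators are all hypothetical choices.

```python
import asyncio
import json
import logging
import time
from typing import Any, Awaitable, Callable, Optional

logger = logging.getLogger(__name__)

# Hypothetical builder type: an async callable returning a JSON-serialisable object.
Builder = Callable[[], Awaitable[Any]]


class TtlResponseCache:
    """Pure-TTL, single-flight cache holding one serialised JSON body."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock            # injectable for tests (assumption)
        self._body: Optional[bytes] = None
        self._built_at = 0.0
        self._lock = asyncio.Lock()    # held only by the coroutine that rebuilds

    def _is_fresh(self) -> bool:
        return self._body is not None and self._clock() - self._built_at < self._ttl

    async def get(self, builder: Builder) -> bytes:
        if self._is_fresh():
            return self._body  # type: ignore[return-value]
        if self._lock.locked() and self._body is not None:
            # Another coroutine is already rebuilding: serve stale bytes
            # instead of queuing behind it (and behind SQLite).
            return self._body
        async with self._lock:
            # Cold-start waiters land here after the first build finishes.
            if self._is_fresh():
                return self._body  # type: ignore[return-value]
            return await self._refresh(builder)

    async def _refresh(self, builder: Builder) -> bytes:
        try:
            payload = await builder()
        except Exception:
            if self._body is None:
                raise  # cold start: nothing to fall back to
            logger.warning("cache rebuild failed; serving stale body", exc_info=True)
            return self._body
        # Serialise once per TTL window; every hit reuses these bytes.
        self._body = json.dumps(payload, separators=(",", ":")).encode()
        self._built_at = self._clock()
        return self._body
```

In this sketch a failed rebuild leaves `_built_at` untouched, so the next request after the lock is released retries the build while concurrent callers keep receiving the stale bytes. On a cold-start failure, callers already queued on the lock each retry in turn rather than sharing one exception. On Python versions before 3.10, `asyncio.Lock` binds to an event loop at construction, so creating the cache inside the lifespan (as `app.py` does) is also the safe choice.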

`etag_response()` — shared ETag helper:

Computes a `sha256[:16]` ETag over the response body
Adds `Cache-Control: public, max-age=15` to every response
Returns `304 Not Modified` (empty body) when `If-None-Match` matches
Normalises `W/"..."` weak ETags per RFC 9110

`src/cathedral/publisher/submit.py`

`list_active_sat_challenges` now:

Fetches bytes from the per-app `TtlResponseCache` (single SQLite round-trip per 30s window)
Returns via `etag_response()` — Cache-Control + ETag + optional 304

`src/cathedral/publisher/reads.py`

`get_leaderboard_recent` now returns via `etag_response()`. There is no in-process TTL cache here: cursor params vary per validator, so a single shared cached body does not fit this endpoint, whereas an ETag computed per response works for every cursor. Validators polling an up-to-date cursor get a 304 with no body to decode.
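To illustrate the client side, here is a minimal validator polling sketch using only the standard library. `conditional_get` is a hypothetical helper, not code from this repo; it assumes the validator keeps the last ETag and body between ticks. One stdlib detail worth knowing: `urllib.request.urlopen` raises `HTTPError` for a 304, so the "not modified" path lives in the `except` branch.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def conditional_get(url: str, etag: Optional[str] = None,
                    cached_body: Optional[bytes] = None,
                    timeout: float = 10.0) -> Tuple[Optional[str], Optional[bytes], bool]:
    """Return (etag, body, changed) for one poll of an ETag-aware endpoint.

    Hypothetical validator-side helper: resends the last ETag in If-None-Match
    and reuses the cached body when the server answers 304.
    """
    request = urllib.request.Request(url)
    if etag is not None:
        request.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.headers.get("ETag"), resp.read(), True
    except urllib.error.HTTPError as err:
        # urllib treats every non-2xx status, including 304, as an error.
        if err.code == 304:
            return err.headers.get("ETag", etag), cached_body, False
        raise
```

A caller would pass the returned ETag and body back in on the next tick; `changed` is `False` whenever the server answered 304, so the decode and row-processing work can be skipped.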

`src/cathedral/publisher/app.py`

Seeds `app.state.active_challenges_cache = TtlResponseCache(ttl_seconds=30.0)` in the lifespan so each test's fresh FastAPI app gets its own clean cache instance.

`tests/publisher/test_response_cache.py` (new)

21 unit + integration tests covering:

`_compute_etag`: determinism, length, uniqueness
`etag_response`: 200 with headers, 304 on match, weak ETag normalisation, pre-serialised fast-path, headers present on 304
`TtlResponseCache`: cold start, TTL expiry, stale-serve, JSON byte storage, error propagation, concurrent single-flight
HTTP-level smoke tests for both endpoints: ETag presence, 304 on repeat

Staleness rationale

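`active-challenges` embeds only absolute fields (challenge_id, tier, cnf_sha256, score_multiplier, difficulty_label, etc.); there are no relative "seconds remaining" counters. A 30s stale window is safe: challenge open/close events happen at most once every several minutes in production, and every miner polling loop already tolerates eventual consistency. Because responses also carry `max-age=15`, a client or shared cache may keep a body for up to 15s after fetching it, so end-to-end staleness is bounded at roughly 45s in normal operation, still well inside that margin.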

Rollback

Revert this commit. Both endpoints are purely stateless without the cache — no DB schema, no env vars, no Railway config changes required.

🤖 Generated with Claude Code

active-challenges was rebuilt from SQLite on every request (~11 req/s). AsyncDbWriteLock contention from eval writes pushed median latency to 748ms and average to 16.7s (max 200s). 15-20% of requests timed out as 499s and were immediately retried, creating a thundering herd.

Changes:
- Add TtlResponseCache (response_cache.py): per-app 30s TTL single-flight cache; one coroutine rebuilds, concurrent callers get stale bytes immediately rather than queuing behind the lock.
- active-challenges endpoint: caches the full serialised response for 30s. Cache lives on app.state (not module-global) so each test app gets an isolated instance.
- Both active-challenges and leaderboard/recent: serve Cache-Control: public, max-age=15 + ETag (sha256[:16]); honour If-None-Match with 304. Shared etag_response() helper, no copy-paste.
- Add tests/publisher/test_response_cache.py: 21 unit + integration tests covering TTL expiry, single-flight, ETag generation, 304 on match, weak ETag normalisation, and both endpoint smoke tests.

Staleness rationale: active-challenges embeds absolute fields only (cnf_sha256, challenge_id, tier, score_multiplier); no relative seconds-remaining fields. 30s staleness is safe; challenge state changes at most once every several minutes in production.

Rollback: revert this commit; endpoints are purely stateless without cache.

…h list/star parsing

- TtlResponseCache._refresh: when builder raises and a stale body exists, log a warning and return the stale body instead of propagating; propagate only on cold start (no body at all), so a transient DB error no longer causes a 500 on an already-populated cache.
- etag_response: replace single-value lstrip hack with proper RFC 9110 If-None-Match parsing: split on commas, strip whitespace + W/ prefix via removeprefix + strip quotes per member; handle star (*) form.
- Tests: 6 new cases covering expired+fail→stale, cold+fail→raise, fail-then-recover, list-form match, star match, list-form no-match.

wallscaler added 2 commits June 9, 2026 17:16

wallscaler merged commit 57e5703 into main Jun 9, 2026
3 checks passed

wallscaler deleted the perf/active-challenges-cache branch June 9, 2026 21:38

