Skip to content

Statement store index optimization, stage 2a#12304

Open
s0me0ne-unkn0wn wants to merge 3 commits into
masterfrom
s0me0ne/statement-store-read-index-on-disk
Open

Statement store index optimization, stage 2a#12304
s0me0ne-unkn0wn wants to merge 3 commits into
masterfrom
s0me0ne/statement-store-read-index-on-disk

Conversation

@s0me0ne-unkn0wn

Copy link
Copy Markdown
Contributor

This PR implements stage 2a of the optimizations proposed in #10910 (Move Read Index to Disk with LRU Cache).

As a notable difference from the original issue, the evicted index has also been moved to disk (AFAIR, it used to consume ~150 MiB of RAM with 4M statements).

@s0me0ne-unkn0wn s0me0ne-unkn0wn added the T0-node This PR/Issue is related to the topic “node”. label Jun 8, 2026
@alexggh

alexggh commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

It would be interesting to have some performance benchmarks here, to better understand what is the penalty for keeping this indexes in memory vs in the database.

@s0me0ne-unkn0wn

Copy link
Copy Markdown
Contributor Author

@alexggh This is just the "cost" side benchmark (without the "benefit" side — requires some memory measurement harness)

Benchmark RAM index Disk index Slowdown
Reads — point
has_statement (1k/10k/50k) ~19 ns ~117 ns ~6× (flat across size)
statement_lookup (by hash) 6.83 ms 8.55 ms 1.25×
Reads — topic queries
cache/hit 2.00 µs 10.4 µs ~5×
cache/miss 3.29 µs 19.4 µs ~6×
read_scaling/broadcasts 1k 31.9 µs 188 µs 5.9×
read_scaling/broadcasts 10k 344 µs 2.73 ms 7.9×
read_scaling/broadcasts 50k 1.93 ms 18.9 ms 9.8×
broadcasts (1k store, 640 concurrent) 16.4 ms 227 ms ~14×
posted 16.5 ms 25.6 ms 1.6×
statements_all 166 ms 257 ms 1.55×
Writes
submit 10.7 ms 21.7 ms ~2×
remove 10.6 ms 22.4 ms 2.1×
maintain 9.8 ms 19.4 ms 2.0×
Mixed / concurrency
mixed_workload 119 ms 861 ms 7.2×
contention_read_under_write 47.5 ms 439 ms 9.2×

Some analysis by our common friend follow:

What it means

  • Writes ~2× — expected: index mutations are folded into the same commit, plus the extra columns. Moderate.
  • Reads 5–14×, and worse as the store grows (broadcasts: 6×→10× as size goes 1k→50k). At 50k a topic query goes from 1.9 ms to 18.9 ms.
  • Why broadcasts is 14× but posted is only 1.6×. Both are MatchAll, but broadcasts(&[t0, t1]) intersects three sets: it materializes the smallest (a topic set, ~100) and then does a disk get_size membership probe into each of the other sets (including the large dec_key = None set) for every candidate — ~200 disk probes per query × 640 concurrent. posted(&[], key) has no topics, hence no membership probes, just body fetches. So the dominant read cost is O(smallest_set × number_of_other_sets) disk probes in MatchAll.
  • Even a cache hit is ~5× slower than RAM (10 µs vs 2 µs): on a hit for the topic set, the membership check against the dec_key set still goes to disk.

Actionable (surfaced by the benchmark)

On a cache hit for the "other" sets in iterate_with_match_all, use the cached set (in-RAM contains) instead of a db.get_size per candidate. Only the smallest set is cached today; caching and intersecting the other sets in RAM on a hit should bring broadcasts / cache-hit reads close to stage 1. This is a direct read-path optimization (and matters for stage 2b too).

Caveats

  • Reduced sampling (--sample-size 10) → directional, not certified; the multi-threaded benchmarks are noisy.
  • Local temp DB with a hot OS page cache → on real disk the absolutes are higher, but the ratios are indicative.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cla-bot-2021

cla-bot-2021 Bot commented Jun 9, 2026

Copy link
Copy Markdown

User @claude, please sign the CLA here.

@alexggh

alexggh commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@alexggh This is just the "cost" side benchmark (without the "benefit" side — requires some memory measurement harness)

We do need the benefit side as well, both an analysis how much we save an some benchmark to try to prove it.

let key_set = IndexSet::DecKey(key);
for topic in topics {
let prefix = topic[..].to_vec();
let mut iter =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this use the cache index somehow ?

let mut result = Vec::new();
let query_index = self.query_index.read();
self.collect_statements_locked(key, topic_filter, &query_index, &mut result, f)?;
let mut query_index = self.query_index.lock();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a big problem now, because now you can't run any queries in parallel ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ready to discuss ways around that. The problem is that the LRU cache needs to update recency when hit so it naturally needs &mut self. Interior mutability may be the way, but let me know if you have other ideas

@@ -1261,11 +1532,25 @@ impl StatementStore for Store {
}

fn has_statement(&self, hash: &Hash) -> bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is on the hot path of everything, I don't expect we can go to disk here too often, I would have expected hitting a LRU cache more often than not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

T0-node This PR/Issue is related to the topic “node”.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants