Abstract tantivy's data storage behind traits for pluggable backends #2868
Conversation
Force-pushed from 64b93ce to 20bdf2a
Extract trait interfaces from tantivy's core reader types so that alternative storage backends (e.g. Quickwit) can provide their own implementations while tantivy's query engine works through dynamic dispatch.

Reader trait extraction:
- `SegmentReader` is now a trait; the concrete implementation is renamed to `TantivySegmentReader`.
- `DynInvertedIndexReader` trait for object-safe dynamic dispatch, plus a typed `InvertedIndexReader` trait with associated `Postings`/`DocSet` types for static dispatch. The concrete reader becomes `TantivyInvertedIndexReader`.
- `StoreReader` is now a trait; the concrete implementation is renamed to `TantivyStoreReader`. `get()` returns `TantivyDocument` directly instead of requiring a generic `DocumentDeserialize` bound.

Typed downcast for performance-critical paths:
- `try_downcast_and_call()` + `TypedInvertedIndexReaderCb` allow query weights (`TermWeight`, `PhraseWeight`) to attempt a downcast to the concrete `TantivyInvertedIndexReader`, obtaining typed postings for zero-cost scoring, and falling back to the dynamic path otherwise.
- `TermScorer<TPostings>` is now generic over its postings type.
- The `PostingsWithBlockMax` trait enables block-max WAND acceleration through the trait boundary.
- `block_wand()` and `block_wand_single_scorer()` are generic over `PostingsWithBlockMax`, and `for_each_pruning` is dispatched through the `SegmentReader` trait so custom backends can provide their own block-max implementations.

Searcher decoupled from Index:
- New `SearcherContext` holds schema, executor, and tokenizers.
- `Searcher` can be constructed from `Vec<Arc<dyn SegmentReader>>` via `Searcher::from_segment_readers()`, without needing an `Index`.
- `Searcher::index()` is deprecated in favor of `Searcher::context()`.

Postings and DocSet changes:
- The `Postings` trait gains `doc_freq() -> DocFreq` (`Exact`/`Approximate`) and `has_freq()`.
- The `RawPostingsData` struct carries raw postings bytes across the trait boundary for custom reader implementations.
- `BlockSegmentPostings::open()` takes `OwnedBytes` instead of `FileSlice`.
- `DocSet` gains a `fill_bitset()` method.

Scorer improvements:
- The `Scorer` trait absorbs `for_each`, `for_each_pruning`, and `explain` (previously free functions or on `Weight`).
- A `box_scorer()` helper avoids double-boxing `Box<dyn Scorer>`.
- `BoxedTermScorer` wraps a type-erased term scorer.
- `BufferedUnionScorer` initialization is fixed to avoid an extra `advance()` on construction.

Other changes:
- `Document::to_json()` now returns `serde_json::Value`; the old string serialization is renamed to `to_serialized_json()`.
- `DocumentDeserialize` is removed from the store reader public API.
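The downcast fast path described above can be sketched roughly as follows. Every name here (`DynReader`, `ConcreteReader`, `count_docs`) is a simplified stand-in for illustration, not tantivy's actual API:

```rust
use std::any::Any;

// Simplified stand-in for an object-safe reader trait.
trait DynReader {
    fn doc_count(&self) -> u32;
    fn as_any(&self) -> &dyn Any;
}

// Concrete implementation that hot paths want to specialize for.
struct ConcreteReader {
    docs: u32,
}

impl DynReader for ConcreteReader {
    fn doc_count(&self) -> u32 {
        self.docs
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Try the typed (monomorphized) path first; fall back to dynamic dispatch.
fn count_docs(reader: &dyn DynReader) -> u32 {
    if let Some(concrete) = reader.as_any().downcast_ref::<ConcreteReader>() {
        // Typed fast path: the compiler sees the concrete type.
        concrete.docs
    } else {
        // Dynamic fallback for custom backends.
        reader.doc_count()
    }
}
```

The single branch is the only runtime cost on the fast path; custom backends that are not the concrete type simply take the virtual call.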
Force-pushed from 20bdf2a to 3cbcb5d
```rust
/// Same as [`Searcher::from_segment_readers`] but allows setting
/// a custom generation id.
pub fn from_segment_readers_with_generation_id<Ctx: Into<SearcherContext>>(
```
Is there a real use case for exposing a public function with the generation id?
Currently not that I'm aware of. It should probably be moved to SearcherContext anyway.
```diff
 /// The searcher uses the segment ordinal to route the
 /// request to the right `Segment`.
-pub fn doc<D: DocumentDeserialize>(&self, doc_address: DocAddress) -> crate::Result<D> {
+pub fn doc(&self, doc_address: DocAddress) -> crate::Result<TantivyDocument> {
```
This is reverting a public feature. Can we check this is not hurting people? Do we use it in Quickwit?
Can you make sure this shows up in the changelog?
We use TantivyDocument in Quickwit currently. It's unlikely to be relevant for performance; I'll ask in the chat.
src/index/segment_reader.rs
```rust
fn schema(&self) -> &Schema;

/// Performs a for_each_pruning operation on the given scorer.
fn for_each_pruning(
```
What is the justification for having for_each_pruning in there?
Is the goal to allow custom projects to have their own specialized impl?
This is actually why I converted the PR to a draft: to remove it from there first. We don't need it with the downcast approach.
```diff
 );

-let mut segment_postings_containing_the_term: Vec<(usize, SegmentPostings)> = vec![];
+let mut segment_postings_containing_the_term: Vec<(usize, Box<dyn Postings>)> =
```
Is there really no regression here!?
I didn't apply the downcast trick here yet; currently there's a regression.
```diff
 ///
-/// `InvertedIndexReader` are created by calling
+/// `TantivyInvertedIndexReader` instances are created by calling
 /// [`SegmentReader::inverted_index()`](crate::SegmentReader::inverted_index).
```
why was it moved here?
I'm not sure what you are referring to. TantivyInvertedIndexReader?
```rust
    .read_raw_postings_data(term_info, IndexRecordOption::Basic)
    .unwrap();
let mut block_postings = BlockSegmentPostings::open(
    term_info.doc_freq,
    raw.postings_data,
```
That's a code smell: the inverted index at this point is a DynInvertedIndexReader, yet you rely on it being the standard one to interpret the raw postings data.
src/index/inverted_index_reader.rs
```rust
fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>>;

/// Returns the raw postings bytes and metadata for a term.
fn read_raw_postings_data(
```
Why do we need read_raw_postings_data? It seems very strange.
Also, I expected InvertedIndexReader: DynInvertedIndexReader
Agree on both: removed read_raw_postings_data and added InvertedIndexReader: DynInvertedIndexReader.
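The supertrait relationship agreed on here could look roughly like this. The signatures are hypothetical simplifications, not tantivy's actual API:

```rust
// Object-safe trait: usable as `Box<dyn DynInvertedIndexReader>`.
trait DynInvertedIndexReader {
    fn doc_freq(&self, term: &str) -> u32;
}

// Typed trait with an associated postings type for static dispatch.
// The supertrait bound means every typed reader is also usable dynamically.
trait InvertedIndexReader: DynInvertedIndexReader {
    type Postings;
    fn typed_postings(&self, term: &str) -> Self::Postings;
}

// Toy implementation exercising both dispatch modes.
struct ToyReader;

impl DynInvertedIndexReader for ToyReader {
    fn doc_freq(&self, _term: &str) -> u32 {
        1
    }
}

impl InvertedIndexReader for ToyReader {
    type Postings = Vec<u32>;
    fn typed_postings(&self, _term: &str) -> Vec<u32> {
        vec![0]
    }
}
```

Callers that only know `dyn DynInvertedIndexReader` get the object-safe surface; code that names the concrete type can use the associated `Postings` type for monomorphized hot paths.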
```rust
/// enough to work with both postings types.
///
/// # Example
///
```
Does this thing help much? It keeps things dynamic but attempts to replace the dynamic part by an if-statement in some cases.
The example is a bit silly. The main use case is to build query types with concrete posting list types, e.g. TermScorer<TantivySegmentPostings>.
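The idea of building query types around a concrete posting list type can be sketched like this (all names are hypothetical stand-ins, not tantivy's real types):

```rust
// Minimal stand-in for the Postings trait.
trait Postings {
    fn doc(&self) -> u32;
    fn term_freq(&self) -> u32;
}

// A trivial concrete postings implementation.
struct VecPostings {
    current_doc: u32,
    tf: u32,
}

impl Postings for VecPostings {
    fn doc(&self) -> u32 {
        self.current_doc
    }
    fn term_freq(&self) -> u32 {
        self.tf
    }
}

// Generic over TPostings: monomorphized per concrete postings type,
// so scoring compiles down to direct calls instead of virtual dispatch.
struct TermScorer<TPostings: Postings> {
    postings: TPostings,
}

impl<TPostings: Postings> TermScorer<TPostings> {
    fn score(&self) -> f32 {
        // Placeholder scoring: just the term frequency.
        self.postings.term_freq() as f32
    }
}
```

A `TermScorer<Box<dyn Postings>>` would still work through the same code, which is what the dynamic fallback path relies on.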
```diff
 /// It will also preload positions, if positions are available in the SegmentPostings.
-pub fn load(segment_postings: &mut SegmentPostings) -> LoadedPostings {
-    let num_docs = segment_postings.doc_freq() as usize;
+pub fn load(postings: &mut Box<dyn Postings>) -> LoadedPostings {
```
&mut dyn Postings
I don't think there is a case for the Box, is there?
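On the `&mut dyn Postings` suggestion: a caller holding a `Box<dyn Postings>` can still reborrow it as `&mut dyn Postings`, so the Box is indeed not needed in the signature. A minimal sketch with a stand-in trait:

```rust
// Stand-in trait for illustration only.
trait Postings {
    fn doc(&self) -> u32;
}

struct OnePosting;

impl Postings for OnePosting {
    fn doc(&self) -> u32 {
        42
    }
}

// Taking `&mut dyn Postings` works for boxed and unboxed callers alike,
// and avoids forcing an allocation on callers that have a concrete value.
fn first_doc(postings: &mut dyn Postings) -> u32 {
    postings.doc()
}
```

A boxed caller just reborrows: `first_doc(&mut *boxed)`.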
```diff
 let mut positions = Vec::with_capacity(num_docs);
 let mut position_offsets = Vec::with_capacity(num_docs);
-while segment_postings.doc() != TERMINATED {
+while postings.doc() != TERMINATED {
```
That will be slow. If it is used in RegexPhraseQuery, we might want to push this into Postings.
Although... the current impl is pretty terrible: we should probably not load positions but instead save something that makes them easy to load later, like an offset.
```rust
/// Raw postings bytes and metadata read from storage.
#[derive(Debug, Clone)]
pub struct RawPostingsData {
```
Do we need this to be pub?
```diff
 let mut min_new_target = TERMINATED;

-for docset in self.docsets.iter_mut() {
+for docset in self.scorers.iter_mut() {
```
Rename the `docset` variable to `scorer` maybe?
```diff
-for obsolete_tinyset in self.bitsets.iter_mut() {
-    *obsolete_tinyset = TinySet::empty();
-}
+self.bitsets.fill(TinySet::empty());
```
I wonder if we should make the compiler's life easier by having a `const EMPTY: TinySet`.
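The suggested constant might look like this; `TinySet` being a `u64` bitset is an assumption for the sketch:

```rust
// Hypothetical simplified TinySet: a 64-bit bitset.
#[derive(Clone, Copy, PartialEq, Debug)]
struct TinySet(u64);

impl TinySet {
    // An associated const is usable in const contexts and lets
    // `fill` copy a precomputed value rather than call a function.
    const EMPTY: TinySet = TinySet(0);
}

// Resetting all buckets with slice::fill and the constant.
fn clear_buckets(bitsets: &mut [TinySet]) {
    bitsets.fill(TinySet::EMPTY);
}
```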
```diff
 bitsets: Box::new([TinySet::empty(); HORIZON_NUM_TINYBITSETS]),
-scores: Box::new([score_combiner_fn(); HORIZON as usize]),
+scores: Box::new([score_combiner; HORIZON as usize]),
 bucket_idx: HORIZON_NUM_TINYBITSETS,
```
I think you borrowed that code from my PR. Just so this is on your radar: the scary part is that, to avoid a regression, I made it so that the Scorer does not buffer docs on initialization.
(For instance, on a TopK union it would be a waste to load the first block.)
This is probably the highest risk of regression, so it might be worth having an extra look at the logic.
tokenizer_for_field is a duplicate of SearcherContext::tokenizer_for_field.
Raw postings are an implementation detail of TantivyInvertedIndexReader, not something custom backends should need to produce. Make read_postings_from_terminfo the required trait method instead.
Without it, Searcher::doc_async produces a non-Send future, breaking the Send bound on SearchService::fetch_docs.