
Abstract tantivy's data storage behind traits for pluggable backends #2868

Draft
PSeitz wants to merge 8 commits into main from storage_abstraction

Conversation


PSeitz (Collaborator) commented Mar 30, 2026

Extract trait interfaces from tantivy's core reader types so that alternative storage backends (e.g. Quickwit) can provide their own implementations while tantivy's query engine works through dynamic dispatch.

Reader trait extraction:

  • SegmentReader is now a trait; the concrete implementation is renamed to TantivySegmentReader.
  • DynInvertedIndexReader trait for object-safe dynamic dispatch, plus a typed InvertedIndexReader trait with associated Postings/DocSet types for static dispatch. The concrete reader becomes TantivyInvertedIndexReader.
  • StoreReader is now a trait; the concrete implementation is renamed to TantivyStoreReader. get() returns TantivyDocument directly instead of requiring a generic DocumentDeserialize bound.
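
The extraction follows one pattern throughout: the old concrete type keeps its behavior under a Tantivy-prefixed name, and consumers switch to a trait object. A minimal self-contained sketch (illustrative signatures only, not the actual trait definitions):

```rust
use std::sync::Arc;

/// What used to be the concrete `SegmentReader` type is now a trait.
trait SegmentReader {
    fn num_docs(&self) -> u32;
}

/// The former concrete type, renamed with a `Tantivy` prefix.
struct TantivySegmentReader {
    num_docs: u32,
}

impl SegmentReader for TantivySegmentReader {
    fn num_docs(&self) -> u32 {
        self.num_docs
    }
}

/// The query engine works against `Arc<dyn SegmentReader>`, so a custom
/// backend can supply its own implementation.
fn total_docs(readers: &[Arc<dyn SegmentReader>]) -> u64 {
    readers.iter().map(|r| u64::from(r.num_docs())).sum()
}

fn main() {
    let readers: Vec<Arc<dyn SegmentReader>> =
        vec![Arc::new(TantivySegmentReader { num_docs: 10 })];
    assert_eq!(total_docs(&readers), 10);
}
```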

Typed downcast for performance-critical paths:

  • try_downcast_and_call() + TypedInvertedIndexReaderCb allow query weights (TermWeight, PhraseWeight) to attempt a downcast to the concrete TantivyInvertedIndexReader, obtaining typed postings for zero-cost scoring, and falling back to the dynamic path otherwise.
  • TermScorer is now generic over its postings type.
  • PostingsWithBlockMax trait enables block-max WAND acceleration through the trait boundary.
  • block_wand() and block_wand_single_scorer() are generic over PostingsWithBlockMax, and for_each_pruning is dispatched through the SegmentReader trait so custom backends can provide their own block-max implementations.
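
The fast-path idea can be sketched with std::any::Any. The PR's actual mechanism (try_downcast_and_call() plus the TypedInvertedIndexReaderCb callback trait) is more elaborate than this; the as_any()/doc_freq names below are placeholders:

```rust
use std::any::Any;

trait DynInvertedIndexReader {
    fn as_any(&self) -> &dyn Any;
    /// Dynamic-dispatch path available to any backend.
    fn doc_freq_dyn(&self, term: &str) -> u32;
}

struct TantivyInvertedIndexReader;

impl TantivyInvertedIndexReader {
    /// Stand-in for the typed path that yields concrete postings
    /// and therefore zero-cost scoring.
    fn doc_freq_typed(&self, _term: &str) -> u32 {
        42
    }
}

impl DynInvertedIndexReader for TantivyInvertedIndexReader {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn doc_freq_dyn(&self, term: &str) -> u32 {
        self.doc_freq_typed(term)
    }
}

fn doc_freq(reader: &dyn DynInvertedIndexReader, term: &str) -> u32 {
    // Fast path: downcast to the concrete reader for static dispatch.
    if let Some(typed) = reader.as_any().downcast_ref::<TantivyInvertedIndexReader>() {
        typed.doc_freq_typed(term)
    } else {
        // Fallback: dynamic dispatch for custom backends.
        reader.doc_freq_dyn(term)
    }
}

fn main() {
    assert_eq!(doc_freq(&TantivyInvertedIndexReader, "foo"), 42);
}
```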

Searcher decoupled from Index:

  • New SearcherContext holds schema, executor, and tokenizers.
  • Searcher can be constructed from Vec<Arc<dyn SegmentReader>> via Searcher::from_segment_readers(), without needing an Index.
  • Searcher::index() is deprecated in favor of Searcher::context().
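
A rough sketch of the decoupling, with placeholder fields (the real SearcherContext holds the schema, executor, and tokenizers, and the real constructor signature may differ):

```rust
use std::sync::Arc;

trait SegmentReader {
    fn num_docs(&self) -> u32;
}

struct SearcherContext {
    // Placeholder: the PR's SearcherContext carries schema, executor,
    // and tokenizers instead.
    schema_name: String,
}

struct Searcher {
    context: Arc<SearcherContext>,
    segment_readers: Vec<Arc<dyn SegmentReader>>,
}

impl Searcher {
    /// Build a searcher directly from segment readers; no Index involved.
    fn from_segment_readers(
        context: Arc<SearcherContext>,
        segment_readers: Vec<Arc<dyn SegmentReader>>,
    ) -> Searcher {
        Searcher { context, segment_readers }
    }

    /// Replaces the deprecated `index()` accessor in this sketch.
    fn context(&self) -> &SearcherContext {
        &self.context
    }
}

struct StubReader;
impl SegmentReader for StubReader {
    fn num_docs(&self) -> u32 {
        0
    }
}

fn main() {
    let ctx = Arc::new(SearcherContext { schema_name: "docs".to_string() });
    let searcher = Searcher::from_segment_readers(ctx, vec![Arc::new(StubReader)]);
    assert_eq!(searcher.context().schema_name, "docs");
    assert_eq!(searcher.segment_readers.len(), 1);
}
```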

Postings and DocSet changes:

  • Postings trait gains doc_freq() -> DocFreq (Exact/Approximate) and has_freq().
  • RawPostingsData struct carries raw postings bytes across the trait boundary for custom reader implementations.
  • BlockSegmentPostings::open() takes OwnedBytes instead of FileSlice.
  • DocSet gains fill_bitset() method.
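
The Exact/Approximate split might look roughly like this. This is an assumed definition, not the PR's actual DocFreq, and the upper_bound() helper is hypothetical:

```rust
/// Sketch of a doc-frequency value that records whether the backend
/// knows the count exactly or only as a bound.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DocFreq {
    /// The exact number of documents containing the term.
    Exact(u32),
    /// An upper bound only, e.g. for backends that filter or merge
    /// postings lazily (assumption, not stated in the PR).
    Approximate(u32),
}

impl DocFreq {
    /// Both variants give an upper bound; callers needing exactness
    /// can match on the variant instead.
    fn upper_bound(&self) -> u32 {
        match *self {
            DocFreq::Exact(n) | DocFreq::Approximate(n) => n,
        }
    }
}

fn main() {
    assert_eq!(DocFreq::Exact(7).upper_bound(), 7);
    assert_eq!(DocFreq::Approximate(100).upper_bound(), 100);
}
```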

Scorer improvements:

  • Scorer trait absorbs for_each, for_each_pruning, and explain (previously free functions or on Weight).
  • box_scorer() helper avoids double-boxing Box<dyn Scorer>.
  • BoxedTermScorer wraps a type-erased term scorer.
  • BufferedUnionScorer initialization fixed to avoid an extra advance() on construction.
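
One safe-Rust way to get the "avoid double-boxing" behavior is an Any round trip that unwraps an already-boxed scorer. The PR's actual box_scorer() may well be implemented differently, so treat this purely as an illustration of the goal:

```rust
use std::any::Any;

trait Scorer: 'static {
    fn score(&self) -> f32;
}

// Boxed scorers are themselves scorers, which is what makes accidental
// double-boxing possible in generic code.
impl Scorer for Box<dyn Scorer> {
    fn score(&self) -> f32 {
        (**self).score()
    }
}

struct ConstScorer(f32);
impl Scorer for ConstScorer {
    fn score(&self) -> f32 {
        self.0
    }
}

/// Box a scorer, but if it is already a `Box<dyn Scorer>`, return that
/// box instead of wrapping it a second time.
fn box_scorer<S: Scorer>(scorer: S) -> Box<dyn Scorer> {
    let any_box: Box<dyn Any> = Box::new(scorer);
    match any_box.downcast::<Box<dyn Scorer>>() {
        // Already boxed: hand back the inner box, not Box<Box<dyn Scorer>>.
        Ok(already_boxed) => *already_boxed,
        // Plain scorer: recover the concrete type and box it once.
        Err(any_box) => {
            let concrete: Box<S> = any_box.downcast::<S>().expect("same type");
            concrete
        }
    }
}

fn main() {
    assert_eq!(box_scorer(ConstScorer(1.5)).score(), 1.5);

    let pre_boxed: Box<dyn Scorer> = Box::new(ConstScorer(2.5));
    assert_eq!(box_scorer(pre_boxed).score(), 2.5);
}
```

The point is that box_scorer(pre_boxed) yields the original Box<dyn Scorer> rather than a Box<Box<dyn Scorer>>, so dynamic dispatch stays one indirection deep.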

Other changes:

  • Document::to_json() now returns serde_json::Value; the old string serialization is renamed to to_serialized_json().
  • DocumentDeserialize removed from the store reader public API.

PSeitz-dd force-pushed the storage_abstraction branch from 64b93ce to 20bdf2a on March 30, 2026 04:52
PSeitz-dd force-pushed the storage_abstraction branch from 20bdf2a to 3cbcb5d on March 30, 2026 04:58
PSeitz marked this pull request as draft March 31, 2026 09:07

/// Same as [`Searcher::from_segment_readers`] but allows setting
/// a custom generation id.
pub fn from_segment_readers_with_generation_id<Ctx: Into<SearcherContext>>(
Collaborator:

Is there a real use case for exposing a public function with the generation id?

Collaborator (Author):

Currently not that I'm aware of. It should probably be moved to SearcherContext anyway.

/// The searcher uses the segment ordinal to route the
/// request to the right `Segment`.
pub fn doc<D: DocumentDeserialize>(&self, doc_address: DocAddress) -> crate::Result<D> {
pub fn doc(&self, doc_address: DocAddress) -> crate::Result<TantivyDocument> {
fulmicoton-dd (Collaborator) commented Mar 31, 2026:

This is reverting a public feature. Can we check this is not hurting people? Do we use it in Quickwit?
Can you make sure this shows up in the changelog?

Collaborator (Author):

We use TantivyDocument in Quickwit currently. It's unlikely to be relevant for performance; I'll ask in the chat.

fn schema(&self) -> &Schema;

/// Performs a for_each_pruning operation on the given scorer.
fn for_each_pruning(
Collaborator:

What is the justification for having for_each_pruning in there?
Is the goal to allow custom projects to have their own specialized impl?

Collaborator (Author):

This is actually the reason I converted it to a draft to remove it from there first. We don't need it with the downcast approach.

);

let mut segment_postings_containing_the_term: Vec<(usize, SegmentPostings)> = vec![];
let mut segment_postings_containing_the_term: Vec<(usize, Box<dyn Postings>)> =
Collaborator:

Is there really no regression here!?

Collaborator (Author):

I didn't apply the downcast trick here yet; currently there's a regression.

///
/// `InvertedIndexReader` are created by calling
/// `TantivyInvertedIndexReader` instances are created by calling
/// [`SegmentReader::inverted_index()`](crate::SegmentReader::inverted_index).
Collaborator:

Why was it moved here?

Collaborator (Author):

I'm not sure what you are referring to. TantivyInvertedIndexReader?

Comment on lines +58 to +62
.read_raw_postings_data(term_info, IndexRecordOption::Basic)
.unwrap();
let mut block_postings = BlockSegmentPostings::open(
term_info.doc_freq,
raw.postings_data,
Collaborator:

That's a code smell here: the inverted index at this point is a DynInvertedIndexReader, yet you rely on it being the standard one to interpret the raw postings data.

fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>>;

/// Returns the raw postings bytes and metadata for a term.
fn read_raw_postings_data(
Collaborator:

Why do we need read_raw_postings_data? It seems very strange.

Also, I expected InvertedIndexReader: DynInvertedIndexReader

Collaborator (Author):

Agree on both: removed read_raw_postings_data and added InvertedIndexReader: DynInvertedIndexReader.

/// enough to work with both postings types.
///
/// # Example
///
Collaborator:

Does this thing help much? It keeps things dynamic but attempts to replace the dynamic part with an if-statement in some cases.

PSeitz (Collaborator, Author) commented Apr 1, 2026:

The example is a bit silly. The main use case is to build query types with concrete posting list types, e.g. TermScorer<TantivySegmentPostings>.

/// It will also preload positions, if positions are available in the SegmentPostings.
pub fn load(segment_postings: &mut SegmentPostings) -> LoadedPostings {
let num_docs = segment_postings.doc_freq() as usize;
pub fn load(postings: &mut Box<dyn Postings>) -> LoadedPostings {
Collaborator:

&mut dyn Postings

I don't think there is a case for the Box, is there?

let mut positions = Vec::with_capacity(num_docs);
let mut position_offsets = Vec::with_capacity(num_docs);
while segment_postings.doc() != TERMINATED {
while postings.doc() != TERMINATED {
Collaborator:

That will be slow. If it is used in RegexPhraseQuery, we might want to push this to Postings.

Although... the current impl is pretty terrible: we should probably not load positions but instead save something that makes it easy to load them later, like an offset.


/// Raw postings bytes and metadata read from storage.
#[derive(Debug, Clone)]
pub struct RawPostingsData {
Collaborator:

Do we need this to be pub?

let mut min_new_target = TERMINATED;

for docset in self.docsets.iter_mut() {
for docset in self.scorers.iter_mut() {
Collaborator:

rename scorer maybe?

for obsolete_tinyset in self.bitsets.iter_mut() {
*obsolete_tinyset = TinySet::empty();
}
self.bitsets.fill(TinySet::empty());
Collaborator:

I wonder if we should make the compiler's life easier by having a const EMPTY: TinySet.

bitsets: Box::new([TinySet::empty(); HORIZON_NUM_TINYBITSETS]),
scores: Box::new([score_combiner_fn(); HORIZON as usize]),
scores: Box::new([score_combiner; HORIZON as usize]),
bucket_idx: HORIZON_NUM_TINYBITSETS,
Collaborator:

I think you borrowed that code from my PR. Just so this is on your radar: the scary part is that, to avoid a regression, I made it so that the Scorer does not buffer docs on initialization.

(For instance, on a top-k union, it would be a waste to load the first block.)

This is probably the highest risk of regression, so it might be worth taking an extra look at the logic.

fulmicoton (Collaborator) left a comment:

See comments.

PSeitz added 6 commits April 2, 2026 13:10

- tokenizer_for_field is a duplicate of SearcherContext::tokenizer_for_field.
- Raw postings are an implementation detail of TantivyInvertedIndexReader, not something custom backends should need to produce. Make read_postings_from_terminfo the required trait method instead.
- Without it, Searcher::doc_async produces a non-Send future, breaking the Send bound on SearchService::fetch_docs.