Abstract tantivy's data storage behind traits for pluggable backends #2868
Conversation
Force-pushed from 64b93ce to 20bdf2a
Extract trait interfaces from tantivy's core reader types so that alternative storage backends (e.g. Quickwit) can provide their own implementations while tantivy's query engine works through dynamic dispatch.

Reader trait extraction:
- `SegmentReader` is now a trait; the concrete implementation is renamed to `TantivySegmentReader`.
- `DynInvertedIndexReader` trait for object-safe dynamic dispatch, plus a typed `InvertedIndexReader` trait with associated `Postings`/`DocSet` types for static dispatch. The concrete reader becomes `TantivyInvertedIndexReader`.
- `StoreReader` is now a trait; the concrete implementation is renamed to `TantivyStoreReader`. `get()` returns `TantivyDocument` directly instead of requiring a generic `DocumentDeserialize` bound.

Typed downcast for performance-critical paths:
- `try_downcast_and_call()` + `TypedInvertedIndexReaderCb` allow query weights (`TermWeight`, `PhraseWeight`) to attempt a downcast to the concrete `TantivyInvertedIndexReader`, obtaining typed postings for zero-cost scoring, and falling back to the dynamic path otherwise.
- `TermScorer<TPostings>` is now generic over its postings type.
- The `PostingsWithBlockMax` trait enables block-max WAND acceleration through the trait boundary.
- `block_wand()` and `block_wand_single_scorer()` are generic over `PostingsWithBlockMax`, and `for_each_pruning` is dispatched through the `SegmentReader` trait so custom backends can provide their own block-max implementations.

Searcher decoupled from Index:
- New `SearcherContext` holds schema, executor, and tokenizers.
- `Searcher` can be constructed from `Vec<Arc<dyn SegmentReader>>` via `Searcher::from_segment_readers()`, without needing an `Index`.
- `Searcher::index()` is deprecated in favor of `Searcher::context()`.

Postings and DocSet changes:
- The `Postings` trait gains `doc_freq() -> DocFreq` (`Exact`/`Approximate`) and `has_freq()`.
- The `RawPostingsData` struct carries raw postings bytes across the trait boundary for custom reader implementations.
- `BlockSegmentPostings::open()` takes `OwnedBytes` instead of `FileSlice`.
- `DocSet` gains a `fill_bitset()` method.

Scorer improvements:
- The `Scorer` trait absorbs `for_each`, `for_each_pruning`, and `explain` (previously free functions or on `Weight`).
- A `box_scorer()` helper avoids double-boxing `Box<dyn Scorer>`.
- `BoxedTermScorer` wraps a type-erased term scorer.
- `BufferedUnionScorer` initialization is fixed to avoid an extra `advance()` on construction.

Other changes:
- `Document::to_json()` now returns `serde_json::Value`; the old string serialization is renamed to `to_serialized_json()`.
- `DocumentDeserialize` is removed from the store reader public API.
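The downcast fast path described above can be sketched roughly as follows. Every name here (`DynReader`, `ConcreteReader`, `count_docs`) is a simplified stand-in for illustration, not tantivy's actual API:

```rust
use std::any::Any;

// Simplified stand-in for an object-safe reader trait.
trait DynReader {
    fn doc_count(&self) -> u32;
    fn as_any(&self) -> &dyn Any;
}

// Concrete implementation that hot paths want to specialize for.
struct ConcreteReader {
    docs: u32,
}

impl DynReader for ConcreteReader {
    fn doc_count(&self) -> u32 {
        self.docs
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Try the typed (monomorphized) path first; fall back to dynamic dispatch.
fn count_docs(reader: &dyn DynReader) -> u32 {
    if let Some(concrete) = reader.as_any().downcast_ref::<ConcreteReader>() {
        // Typed fast path: the compiler sees the concrete type.
        concrete.docs
    } else {
        // Dynamic fallback for custom backends.
        reader.doc_count()
    }
}
```

The single branch is the only runtime cost on the fast path; custom backends that are not the concrete type simply take the virtual call.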
Force-pushed from 20bdf2a to 3cbcb5d
```rust
/// Same as [`Searcher::from_segment_readers`] but allows setting
/// a custom generation id.
pub fn from_segment_readers_with_generation_id<Ctx: Into<SearcherContext>>(
```
Is there a real use case for exposing a public function with the generation id?
Currently not that I'm aware of. It should probably be moved to SearcherContext anyway.
```diff
 /// The searcher uses the segment ordinal to route the
 /// request to the right `Segment`.
-pub fn doc<D: DocumentDeserialize>(&self, doc_address: DocAddress) -> crate::Result<D> {
+pub fn doc(&self, doc_address: DocAddress) -> crate::Result<TantivyDocument> {
```
This is reverting a public feature. Can we check this is not hurting people? Do we use it in Quickwit?
Can you make sure this shows up in the changelog?
We use TantivyDocument in Quickwit currently. It's unlikely to be relevant for performance; I'll ask in the chat.
src/index/segment_reader.rs
```rust
fn schema(&self) -> &Schema;

/// Performs a for_each_pruning operation on the given scorer.
fn for_each_pruning(
```
What is the justification for having for_each_pruning in there?
Is the goal to allow custom projects to have their own specialized impl?
This is actually why I converted the PR to a draft: to remove it from there first. We don't need it with the downcast approach.
```diff
 );

-let mut segment_postings_containing_the_term: Vec<(usize, SegmentPostings)> = vec![];
+let mut segment_postings_containing_the_term: Vec<(usize, Box<dyn Postings>)> =
```
Is there really no regression here!?
I didn't apply the downcast trick here yet; currently there's a regression.
```diff
 ///
-/// `InvertedIndexReader` are created by calling
+/// `TantivyInvertedIndexReader` instances are created by calling
 /// [`SegmentReader::inverted_index()`](crate::SegmentReader::inverted_index).
```
why was it moved here?
I'm not sure what you are referring to. TantivyInvertedIndexReader?
```rust
    .read_raw_postings_data(term_info, IndexRecordOption::Basic)
    .unwrap();
let mut block_postings = BlockSegmentPostings::open(
    term_info.doc_freq,
    raw.postings_data,
```
That's a code smell: the inverted index at this point is a DynInvertedIndexReader, yet you rely on it being the standard one to interpret the raw postings data.
src/index/inverted_index_reader.rs
```rust
fn list_encoded_json_fields(&self) -> io::Result<Vec<InvertedIndexFieldSpace>>;

/// Returns the raw postings bytes and metadata for a term.
fn read_raw_postings_data(
```
Why do we need read_raw_postings_data? It seems very strange.
Also, I expected InvertedIndexReader: DynInvertedIndexReader
Agree on both: removed read_raw_postings_data and added InvertedIndexReader: DynInvertedIndexReader.
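The supertrait relationship agreed on here could look roughly like this. The signatures are hypothetical simplifications, not tantivy's actual API:

```rust
// Object-safe trait: usable as `Box<dyn DynInvertedIndexReader>`.
trait DynInvertedIndexReader {
    fn doc_freq(&self, term: &str) -> u32;
}

// Typed trait with an associated postings type for static dispatch.
// The supertrait bound means every typed reader is also usable dynamically.
trait InvertedIndexReader: DynInvertedIndexReader {
    type Postings;
    fn typed_postings(&self, term: &str) -> Self::Postings;
}

// Toy implementation exercising both dispatch modes.
struct ToyReader;

impl DynInvertedIndexReader for ToyReader {
    fn doc_freq(&self, _term: &str) -> u32 {
        1
    }
}

impl InvertedIndexReader for ToyReader {
    type Postings = Vec<u32>;
    fn typed_postings(&self, _term: &str) -> Vec<u32> {
        vec![0]
    }
}
```

Callers that only know `dyn DynInvertedIndexReader` get the object-safe surface; code that names the concrete type can use the associated `Postings` type for monomorphized hot paths.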
```rust
/// enough to work with both postings types.
///
/// # Example
///
```
Does this thing help much? It keeps things dynamic but attempts to replace the dynamic part by an if-statement in some cases.
The example is a bit silly. The main use case is to build query types with concrete posting list types, e.g. TermScorer<TantivySegmentPostings>.
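The idea of building query types around a concrete posting list type can be sketched like this (all names are hypothetical stand-ins, not tantivy's real types):

```rust
// Minimal stand-in for the Postings trait.
trait Postings {
    fn doc(&self) -> u32;
    fn term_freq(&self) -> u32;
}

// A trivial concrete postings implementation.
struct VecPostings {
    current_doc: u32,
    tf: u32,
}

impl Postings for VecPostings {
    fn doc(&self) -> u32 {
        self.current_doc
    }
    fn term_freq(&self) -> u32 {
        self.tf
    }
}

// Generic over TPostings: monomorphized per concrete postings type,
// so scoring compiles down to direct calls instead of virtual dispatch.
struct TermScorer<TPostings: Postings> {
    postings: TPostings,
}

impl<TPostings: Postings> TermScorer<TPostings> {
    fn score(&self) -> f32 {
        // Placeholder scoring: just the term frequency.
        self.postings.term_freq() as f32
    }
}
```

A `TermScorer<Box<dyn Postings>>` would still work through the same code, which is what the dynamic fallback path relies on.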
```diff
 /// It will also preload positions, if positions are available in the SegmentPostings.
-pub fn load(segment_postings: &mut SegmentPostings) -> LoadedPostings {
-    let num_docs = segment_postings.doc_freq() as usize;
+pub fn load(postings: &mut Box<dyn Postings>) -> LoadedPostings {
```
&mut dyn Postings
I don't think there is a case for the Box, is there?
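On the `&mut dyn Postings` suggestion: a caller holding a `Box<dyn Postings>` can still reborrow it as `&mut dyn Postings`, so the Box is indeed not needed in the signature. A minimal sketch with a stand-in trait:

```rust
// Stand-in trait for illustration only.
trait Postings {
    fn doc(&self) -> u32;
}

struct OnePosting;

impl Postings for OnePosting {
    fn doc(&self) -> u32 {
        42
    }
}

// Taking `&mut dyn Postings` works for boxed and unboxed callers alike,
// and avoids forcing an allocation on callers that have a concrete value.
fn first_doc(postings: &mut dyn Postings) -> u32 {
    postings.doc()
}
```

A boxed caller just reborrows: `first_doc(&mut *boxed)`.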
```diff
 let mut positions = Vec::with_capacity(num_docs);
 let mut position_offsets = Vec::with_capacity(num_docs);
-while segment_postings.doc() != TERMINATED {
+while postings.doc() != TERMINATED {
```
That will be slow. If it is used in RegexPhraseQuery, we might want to push this into Postings.
Although... the current impl is pretty terrible: we should probably not load positions but instead save something that makes them easy to load later, like an offset.
```rust
/// Raw postings bytes and metadata read from storage.
#[derive(Debug, Clone)]
pub struct RawPostingsData {
```
Do we need this to be pub?
```diff
 let mut min_new_target = TERMINATED;

-for docset in self.docsets.iter_mut() {
+for docset in self.scorers.iter_mut() {
```
Rename the `docset` variable to `scorer` maybe?
```diff
-for obsolete_tinyset in self.bitsets.iter_mut() {
-    *obsolete_tinyset = TinySet::empty();
-}
+self.bitsets.fill(TinySet::empty());
```
I wonder if we should make the compiler's life easier by having a `const EMPTY: TinySet`.
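The suggested constant might look like this; `TinySet` being a `u64` bitset is an assumption for the sketch:

```rust
// Hypothetical simplified TinySet: a 64-bit bitset.
#[derive(Clone, Copy, PartialEq, Debug)]
struct TinySet(u64);

impl TinySet {
    // An associated const is usable in const contexts and lets
    // `fill` copy a precomputed value rather than call a function.
    const EMPTY: TinySet = TinySet(0);
}

// Resetting all buckets with slice::fill and the constant.
fn clear_buckets(bitsets: &mut [TinySet]) {
    bitsets.fill(TinySet::EMPTY);
}
```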
```diff
 bitsets: Box::new([TinySet::empty(); HORIZON_NUM_TINYBITSETS]),
-scores: Box::new([score_combiner_fn(); HORIZON as usize]),
+scores: Box::new([score_combiner; HORIZON as usize]),
 bucket_idx: HORIZON_NUM_TINYBITSETS,
```
I think you borrowed that code from my PR. Just so this is on your radar: the scary part is that, to avoid a regression, I made it so that the Scorer does not buffer docs on initialization.
(For instance, on a TopK union it would be a waste to load the first block.)
This is probably the highest risk of regression, so it might be worth having an extra look at the logic.
tokenizer_for_field is a duplicate of SearcherContext::tokenizer_for_field.
Raw postings are an implementation detail of TantivyInvertedIndexReader, not something custom backends should need to produce. Make read_postings_from_terminfo the required trait method instead.
Without it, Searcher::doc_async produces a non-Send future, breaking the Send bound on SearchService::fetch_docs.