LanceDB PoC, DO NOT MERGE (relational queries over lance_vector_search) #844
Draft
ran-yuan-rui wants to merge 2 commits into
Sirius now recognizes the DuckDB lance extension's lance_vector_search(...) table function so the relational work after vector retrieval (filter, projection, join, aggregation, sort) runs on GPU instead of forcing a whole-query CPU fallback.
The function is routed through the existing generic DUCKDB_SCAN path (the same path seq_scan uses): added to the supported-scan whitelist in the physical plan generator and dispatched to sirius_physical_duckdb_scan in both the engine and the pipeline converter.
lance reports a full output schema that always includes a vector column (FLOAT[N]) which Sirius cannot represent. Two changes make this work:
 - A plan-time guard rejects any projected unsupported column type (LIST, ARRAY, STRUCT, HUGEINT, UHUGEINT, DECIMAL width > 18) by throwing NotImplementedException, so such queries fall back cleanly to CPU rather than erroring during execution. The guard inspects only projected columns.
 - The un-referenced vector column (pruned from column_ids by projection pushdown, never materialized) is tolerated during the scan node's returned-types conversion via a benign placeholder, so plan generation no longer aborts on it. Strict conversion is kept for all other functions.
The duckdb scan now moves the table function's bind data instead of copying it: the scan node is converted exactly once and never reads its bind data afterwards, so transferring ownership is behavior-preserving and additionally supports table functions whose bind data is movable but not copyable (the lance extension's bind data does not implement Copy). The parquet and iceberg scan operators keep their existing copy behavior.
A two-step form is also supported and needs no special handling: materialize the scalar candidate columns into a temporary table (excluding the vector column) and run gpu_execution over that table, which scans as an ordinary seq_scan. This is the recommended path when a query would otherwise project the vector column.
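The plan-time guard described above can be illustrated with a minimal, self-contained sketch. It does not use the real DuckDB or Sirius types: TypeId, ColumnType, NotImplemented, check_projected_types, plans_on_gpu and example_schema are all hypothetical names chosen for illustration, and the actual guard in sirius_plan_get.cpp may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for DuckDB's logical types; not the real DuckDB API.
enum class TypeId { BIGINT, DOUBLE, VARCHAR, DECIMAL, HUGEINT, UHUGEINT, LIST, ARRAY, STRUCT };

struct ColumnType {
  TypeId id;
  int decimal_width;  // only meaningful when id == TypeId::DECIMAL
};

// Stand-in for DuckDB's NotImplementedException.
struct NotImplemented : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// True when the GPU path can represent the type, following the list in the description.
bool is_gpu_representable(const ColumnType& type) {
  switch (type.id) {
    case TypeId::LIST:
    case TypeId::ARRAY:
    case TypeId::STRUCT:
    case TypeId::HUGEINT:
    case TypeId::UHUGEINT:
      return false;
    case TypeId::DECIMAL:
      return type.decimal_width <= 18;
    default:
      return true;
  }
}

// Throws if any projected column has an unsupported type.
// Columns that are not projected are never inspected.
void check_projected_types(const std::vector<ColumnType>& schema,
                           const std::vector<std::size_t>& projected) {
  for (std::size_t idx : projected) {
    if (!is_gpu_representable(schema.at(idx))) {
      throw NotImplemented("projected column " + std::to_string(idx) +
                           " has a type the GPU path cannot represent");
    }
  }
}

// Models the caller: an exception at plan time means "fall back to CPU".
bool plans_on_gpu(const std::vector<ColumnType>& schema,
                  const std::vector<std::size_t>& projected) {
  try {
    check_projected_types(schema, projected);
    return true;
  } catch (const NotImplemented&) {
    return false;
  }
}

// Example schema: id, vector (FLOAT[N]), a DECIMAL of width 18, a DECIMAL of width 19.
std::vector<ColumnType> example_schema() {
  return {{TypeId::BIGINT, 0}, {TypeId::ARRAY, 0}, {TypeId::DECIMAL, 18}, {TypeId::DECIMAL, 19}};
}

// Usage example: projecting around the vector column keeps the plan on GPU.
void guard_usage_example() {
  assert(plans_on_gpu(example_schema(), {0, 2}));
  assert(!plans_on_gpu(example_schema(), {1}));
}
```

Because the check iterates over the projected indices rather than the full schema, the always-present vector column only triggers a fallback when a query actually selects it.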
Tests cover whitelist routing, the projected-type guard (including the DECIMAL width boundary and no over-rejection of unprojected columns), tolerance of an unprojected ARRAY/LIST column, and the bind-data move path, using an in-process fake table function so they run offline. A self-skipping integration test scaffolds end-to-end checks for when the lance extension is available.
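The bind-data move path mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the code in sirius_physical_duckdb_scan.cpp: FunctionData here is a local stand-in rather than DuckDB's class, MoveOnlyBindData, TableScanNode and DuckdbScan are hypothetical names, and the choice to throw on a second conversion is an assumption (the description only says the null-guard keeps that case defined).

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Local stand-in for a bind-data base class; not the real DuckDB type.
struct FunctionData {
  virtual ~FunctionData() = default;
};

// Bind data that cannot be copied, mirroring a table function with no Copy().
// Ownership of the heap object can still be transferred through unique_ptr.
struct MoveOnlyBindData : FunctionData {
  explicit MoveOnlyBindData(int k) : top_k(k) {}
  MoveOnlyBindData(const MoveOnlyBindData&) = delete;
  MoveOnlyBindData& operator=(const MoveOnlyBindData&) = delete;
  int top_k;
};

// Hypothetical table-scan node that owns the bind data until conversion.
struct TableScanNode {
  std::unique_ptr<FunctionData> bind_data;
};

// Hypothetical duckdb-scan operator: takes ownership instead of copying.
struct DuckdbScan {
  explicit DuckdbScan(TableScanNode& node) {
    // Null guard: a second conversion is reported instead of dereferencing null.
    if (!node.bind_data) {
      throw std::logic_error("table-scan node was already converted");
    }
    bind_data = std::move(node.bind_data);  // node.bind_data is null afterwards
  }
  std::unique_ptr<FunctionData> bind_data;
};

// Usage example: the first conversion succeeds, a second one is rejected.
bool second_conversion_is_rejected() {
  TableScanNode node{std::make_unique<MoveOnlyBindData>(10)};
  DuckdbScan first(node);
  assert(first.bind_data != nullptr);
  assert(node.bind_data == nullptr);
  try {
    DuckdbScan second(node);
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

The sketch also shows the cost of the move: after conversion the source node holds a null pointer, so any later reader of that node would need its own guard.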
Add a hidden, self-skipping integration test that builds its own Lance dataset in DuckDB (COPY ... FORMAT lance) and verifies, against the real DuckDB lance extension, that a scalar lance_vector_search tail runs on GPU with results matching CPU, and that projecting the vector column falls back cleanly. The test is hidden by default (Catch2 "[.]" tag) so it does not run in normal CI; run it explicitly with: sirius_unittest "[.lance_e2e_selfcontained]". It self-skips when the lance extension is unavailable, and builds its dataset in a temp dir so it needs no external fixture or environment query.
Marking as draft (because: do not merge)
Summary
Sirius now GPU-accelerates the relational work that follows a vector search over the DuckDB lance extension's lance_vector_search(...) table function. Previously any query whose source was lance_vector_search fell back to CPU wholesale; now the scan and everything after it (filter, projection, join, aggregation, sort) run on GPU.
No LanceDB/Lance code is changed. The integration lives entirely at the DuckDB extension boundary; see Design.
Use case
Retrieval-heavy analytics (RAG candidate generation, recommendation recall), where a vector search returns a large candidate set and is followed by meaningful relational work.
Lance performs the vector search (CPU); Sirius runs the GROUP BY/filter/join on GPU. The win concentrates on large candidate sets plus heavy downstream relational work; small top-k (10/20) RAG queries are already CPU-fast and are not the target.
Design
Three independent pieces share one DuckDB process:
lance_vector_search is an ordinary DuckDB table function. When Sirius's physical-plan generator sees a LogicalGet whose function.name == "lance_vector_search", it routes the scan through the existing generic DUCKDB_SCAN path (the same path seq_scan uses), which drives the table function on CPU, ingests its output rows to the GPU, and runs the rest of the plan on GPU. Production changes:
 - Add lance_vector_search to the supported-scan whitelist (sirius_plan_get.cpp) and dispatch it to sirius_physical_duckdb_scan in the engine and pipeline converter (sirius_engine.cpp, sirius_pipeline_converter.cpp).
 - lance reports a full output schema that always includes a vector column (FLOAT[N]). If a query projects an unsupported column type (LIST/ARRAY/STRUCT/HUGEINT/UHUGEINT/DECIMAL width > 18), the plan throws NotImplementedException at plan time, giving a clean CPU fallback (no execution error). The guard inspects projected columns only.
A two-step form also works with no special handling and is the recommended path when a query would otherwise project the vector column: materialize the scalar candidate columns into a temp table (excluding the vector), then run gpu_execution over that table (an ordinary seq_scan).
In scope / out of scope
In: single-query routing + GPU execution of the relational tail; clean fallback for unsupported projections; offline planner tests.
Out: GPU vector search itself (Lance still does retrieval on CPU); lance_fts/lance_hybrid_search (different column contract); consuming Lance index artifacts on GPU (that is the future Option B / Phase 2, which is the only direction that would touch the Lance side).
Workarounds to follow up
1. bind_data is moved, not copied (sirius_physical_duckdb_scan.cpp). The real lance bind data does not implement Copy(), so the duckdb-scan ctor now std::moves it out of the table-scan node instead of copying. Verified safe today (the scan node is converted exactly once and never reads its bind data afterward; the null-guard keeps a hypothetical double-conversion defined rather than UB), but it leaves the original table-scan node with a null bind_data.
Follow-up: move ownership to a shared_ptr so the scan node and the duckdb-scan share one FunctionData, which is cleaner and immune to the double-conversion edge.
2. Placeholder type for the un-referenced vector column (sirius_plan_get.cpp). Sirius cannot represent FLOAT[N] (ARRAY). Because projection pushdown prunes the vector column from column_ids (it is never materialized), the scan node's returned_types conversion substitutes a benign placeholder for it so plan generation does not abort. Strict conversion is kept for all other functions.
Follow-up: this is the likely source of a benign runtime warning gpu_pipeline_task: column count mismatch: got 4, expected 3 (an extra column, probably lance_rowid, flowing through the pipeline; results are correct). Properly prune/account for the un-referenced column.
Test plan
 - Planner tests (test/cpp/planner/test_lance_vector_search_routing.cpp, [lance_vector_search]): whitelist routing, the projected-type guard (incl. DECIMAL width boundary and no over-rejection of unprojected columns), tolerance of an unprojected ARRAY/LIST column, the bind-data move path, using an in-process fake table function so the suite runs offline. 10/10.
 - Regression suites ([planner], [transparent]): 19/19, no change to seq_scan/parquet/iceberg routing.
 - End to end: a lance_vector_search query runs on GPU with results identical to CPU; projecting the vector column falls back cleanly. A self-skipping integration scaffold (test_lance_vector_search_transparent.cpp, SIRIUS_TEST_LANCE_* env) is included for when the lance extension is available in CI.
Note for e2e reviewers
Use a real GPU execution config. The planner-unit minimal.yaml is not set up for execution and segfaults in the drain_after_error path (pre-existing Sirius error-path robustness, unrelated to this change); the repo's integration.yaml requests a 32 GB GPU pool, so reduce it for smaller cards.
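As a rough illustration of workaround 2 above (the placeholder for the un-referenced vector column), the returned-types conversion might look like the sketch below. All names are hypothetical (SourceType, GpuType, convert_returned_types, the tolerate_unreferenced flag, lance_like_schema); the real conversion in sirius_plan_get.cpp operates on DuckDB types and may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical source-side and GPU-side type tags; not the real DuckDB/Sirius types.
enum class SourceType { BIGINT, VARCHAR, FLOAT_ARRAY };
enum class GpuType { INT64, STRING, PLACEHOLDER };

// Converts every returned type. An unrepresentable column is tolerated only
// when it is absent from column_ids (pruned by projection pushdown) and the
// caller opted in; otherwise conversion stays strict and throws.
std::vector<GpuType> convert_returned_types(const std::vector<SourceType>& returned_types,
                                            const std::vector<std::size_t>& column_ids,
                                            bool tolerate_unreferenced) {
  std::vector<GpuType> out;
  out.reserve(returned_types.size());
  for (std::size_t i = 0; i < returned_types.size(); ++i) {
    switch (returned_types[i]) {
      case SourceType::BIGINT:
        out.push_back(GpuType::INT64);
        break;
      case SourceType::VARCHAR:
        out.push_back(GpuType::STRING);
        break;
      case SourceType::FLOAT_ARRAY: {
        const bool referenced =
            std::find(column_ids.begin(), column_ids.end(), i) != column_ids.end();
        if (referenced || !tolerate_unreferenced) {
          throw std::invalid_argument("unsupported column type");
        }
        out.push_back(GpuType::PLACEHOLDER);  // never materialized, type is irrelevant
        break;
      }
    }
  }
  return out;
}

// Example schema shaped like a vector-search result: id, vector, label.
std::vector<SourceType> lance_like_schema() {
  return {SourceType::BIGINT, SourceType::FLOAT_ARRAY, SourceType::VARCHAR};
}

// Usage example: strict mode rejects the same schema that tolerant mode accepts.
bool strict_conversion_throws() {
  try {
    convert_returned_types(lance_like_schema(), {0, 2}, false);
  } catch (const std::invalid_argument&) {
    return true;
  }
  return false;
}
```

Note that the output keeps one entry per returned column, placeholder included. A conversion of this shape produces more types than there are referenced columns, which is one plausible way a column-count mismatch like the one in the follow-up could arise.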