LanceDB PoC, DO NOT MERGE (relational queries over lance_vector_search) #844

Draft
ran-yuan-rui wants to merge 2 commits into sirius-db:dev from ran-yuan-rui:feature-lance-vector-search-phase1

@ran-yuan-rui

Contributor

Summary

Sirius now GPU-accelerates the relational work that follows a vector search performed by the DuckDB lance extension's lance_vector_search(...) table function. Previously, any query whose source was lance_vector_search fell back to CPU wholesale; now the scan's output is ingested to the GPU and everything after it (filter, projection, join, aggregation, sort) runs on GPU.

No LanceDB/Lance code is changed. The integration lives entirely at the DuckDB extension boundary — see Design.

Use case

Retrieval-heavy analytics — RAG candidate generation, recommendation recall — where a vector search returns a large candidate set and is followed by meaningful relational work:

LOAD lance;        -- DuckDB lance extension (vector retrieval, CPU)
LOAD sirius;       -- GPU SQL

CALL gpu_execution('
  SELECT cat, count(*) AS n, avg(_distance) AS ad
  FROM lance_vector_search(''/data/items.lance'', ''vec'', [...]::FLOAT[768], k => 10000)
  WHERE ...
  GROUP BY cat ORDER BY n DESC
');

Lance performs the vector search on CPU; Sirius runs the filter, join, and GROUP BY on GPU. The benefit is concentrated in queries with large candidate sets and heavy downstream relational work; small top-k RAG queries (k of 10 or 20) are already fast on CPU and are not the target.

Design

Three independent pieces share one DuckDB process:

DuckDB process
├── lance extension   (INSTALL lance; LOAD lance)  ← separate, published; unmodified
│     exposes lance_vector_search(...) as a DuckDB table function
├── sirius extension  (LOAD sirius)                ← all changes in this PR
└── DuckDB core                                    ← parser/planner/glue

lance_vector_search is an ordinary DuckDB table function. When Sirius's physical-plan generator sees a LogicalGet whose function.name == "lance_vector_search", it routes the scan through the existing generic DUCKDB_SCAN path (the same path seq_scan uses), which drives the table function on CPU, ingests its output rows to the GPU, and runs the rest of the plan on GPU. Production changes:

  • Routing — add lance_vector_search to the supported-scan whitelist (sirius_plan_get.cpp) and dispatch it to sirius_physical_duckdb_scan in the engine and pipeline converter (sirius_engine.cpp, sirius_pipeline_converter.cpp).
  • Projected-type guard: lance returns a full schema that always includes the vector column (FLOAT[N]). If a query projects an unsupported column type (LIST, ARRAY, STRUCT, HUGEINT, UHUGEINT, or DECIMAL with width > 18), plan generation throws NotImplementedException, which yields a clean CPU fallback rather than an execution error. The guard inspects projected columns only.
  • See Workarounds for the two remaining hacks.
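
For reviewers who want the routing check and the projected-type guard in one place, here is a minimal, self-contained sketch. It is not the code in sirius_plan_get.cpp: TypeId, ColumnType, NotImplementedError, UsesGenericDuckDBScan, and CheckProjectedTypes are hypothetical stand-ins for the DuckDB/Sirius types, introduced only to illustrate the rule "reject unsupported types, but only among projected columns".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for DuckDB's logical type ids (not the real API).
enum class TypeId { BIGINT, DOUBLE, VARCHAR, DECIMAL, HUGEINT, UHUGEINT, LIST, ARRAY, STRUCT };

struct ColumnType {
    TypeId id;
    std::uint8_t decimal_width;  // only meaningful when id == DECIMAL
};

// Hypothetical stand-in for duckdb::NotImplementedException.
struct NotImplementedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Routing (assumed shape): table functions that take the generic scan path.
inline bool UsesGenericDuckDBScan(const std::string &function_name) {
    return function_name == "seq_scan" || function_name == "lance_vector_search";
}

// Types the PR description lists as unsupported are rejected here.
inline bool IsGpuRepresentable(const ColumnType &type) {
    switch (type.id) {
    case TypeId::LIST:
    case TypeId::ARRAY:
    case TypeId::STRUCT:
    case TypeId::HUGEINT:
    case TypeId::UHUGEINT:
        return false;
    case TypeId::DECIMAL:
        return type.decimal_width <= 18;  // width > 18 is unsupported
    default:
        return true;
    }
}

// Plan-time guard: inspects only the projected columns (column_ids), so an
// unprojected vector column in the full schema does not cause a rejection.
inline void CheckProjectedTypes(const std::vector<ColumnType> &schema,
                                const std::vector<std::size_t> &column_ids) {
    for (std::size_t idx : column_ids) {
        assert(idx < schema.size());
        if (!IsGpuRepresentable(schema[idx])) {
            throw NotImplementedError("unsupported projected column type at index " +
                                      std::to_string(idx));
        }
    }
}
```

In this sketch, a caller would catch NotImplementedError during planning and hand the whole query back to the CPU engine; how Sirius actually performs the fallback is not shown here.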

A two-step form also works with no special handling and is the recommended path when a query would otherwise project the vector column: materialize the scalar candidate columns into a temp table (excluding the vector), then gpu_execution over that table (an ordinary seq_scan).

In scope / out of scope

In: single-query routing + GPU execution of the relational tail; clean fallback for unsupported projections; offline planner tests.

Out: GPU vector search itself (Lance still does retrieval on CPU); lance_fts / lance_hybrid_search (different column contract); consuming Lance index artifacts on GPU (that is the future Option B / Phase 2, which is the only direction that would touch the Lance side).

Workarounds to follow up

1. bind_data is moved, not copied (sirius_physical_duckdb_scan.cpp). The real lance bind data does not implement Copy(), so the duckdb-scan ctor now std::moves it out of the table-scan node instead of copying. Verified safe today (the scan node is converted exactly once and never reads its bind data afterward; the null-guard keeps a hypothetical double-conversion defined rather than UB), but it leaves the original table-scan node with a null bind_data.
Follow-up: move ownership to a shared_ptr so the scan node and the duckdb-scan share one FunctionData — cleaner and immune to the double-conversion edge.
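
The trade-off between the current move and the proposed shared_ptr follow-up can be sketched as below. This is an illustration under assumptions, not the code in sirius_physical_duckdb_scan.cpp: FunctionData, TableScanNode, DuckDBScan, and the Shared* variants are hypothetical stand-ins, and throwing on a second conversion is just one way a null-guard could keep that case well defined.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for bind data that can be moved but has no Copy().
struct FunctionData {
    std::string dataset_path;
};

// Current approach (sketch): the scan node owns the bind data exclusively.
struct TableScanNode {
    std::unique_ptr<FunctionData> bind_data;
};

class DuckDBScan {
public:
    explicit DuckDBScan(TableScanNode &node) {
        // Null-guard: a second conversion fails loudly instead of
        // dereferencing a moved-from (null) pointer later.
        if (!node.bind_data) {
            throw std::logic_error("table-scan node was already converted");
        }
        bind_data_ = std::move(node.bind_data);  // node.bind_data is now null
    }
    const FunctionData &bind_data() const { return *bind_data_; }

private:
    std::unique_ptr<FunctionData> bind_data_;
};

// Follow-up idea (sketch): both sides share one FunctionData, so the
// original node stays intact and double conversion is harmless.
struct SharedTableScanNode {
    std::shared_ptr<FunctionData> bind_data;
};

class SharedDuckDBScan {
public:
    explicit SharedDuckDBScan(const SharedTableScanNode &node) : bind_data_(node.bind_data) {
        assert(bind_data_ != nullptr);
    }
    const FunctionData &bind_data() const { return *bind_data_; }

private:
    std::shared_ptr<FunctionData> bind_data_;
};
```

With the shared variant the scan node keeps a valid pointer after conversion, which removes the "null bind_data left behind" caveat at the cost of reference counting.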

2. Placeholder type for the un-referenced vector column (sirius_plan_get.cpp). Sirius cannot represent FLOAT[N] (ARRAY). Because projection pushdown prunes the vector column from column_ids (it is never materialized), the scan node's returned_types conversion substitutes a benign placeholder for it so plan generation does not abort. Strict conversion is kept for all other functions.
Follow-up: this is the likely source of a benign runtime warning gpu_pipeline_task: column count mismatch: got 4, expected 3 (an extra column — probably lance _rowid — flowing through the pipeline; results are correct). Properly prune/account for the un-referenced column.
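
A rough sketch of the placeholder idea follows, assuming a simplified type model. DuckType, GpuType, ConvertStrict, and ConvertReturnedTypes are hypothetical names, not Sirius or DuckDB identifiers; the point is only that a column absent from column_ids may receive a placeholder, while a referenced column still goes through strict conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified type models (not the real DuckDB/Sirius types).
enum class DuckType { BIGINT, DOUBLE, VARCHAR, FLOAT_ARRAY };
enum class GpuType { INT64, FLOAT64, STRING, PLACEHOLDER };

// Strict conversion: fails on any type the GPU side cannot represent.
inline GpuType ConvertStrict(DuckType type) {
    switch (type) {
    case DuckType::BIGINT:
        return GpuType::INT64;
    case DuckType::DOUBLE:
        return GpuType::FLOAT64;
    case DuckType::VARCHAR:
        return GpuType::STRING;
    case DuckType::FLOAT_ARRAY:
        throw std::runtime_error("ARRAY is not representable on the GPU side");
    }
    throw std::logic_error("unknown type");
}

// Converts the scan's full returned_types. When tolerate_unreferenced is set
// (assumed to be enabled only for lance_vector_search), an unrepresentable
// column that is NOT in column_ids gets a placeholder instead of aborting.
inline std::vector<GpuType> ConvertReturnedTypes(const std::vector<DuckType> &returned_types,
                                                 const std::vector<std::size_t> &column_ids,
                                                 bool tolerate_unreferenced) {
    const std::unordered_set<std::size_t> referenced(column_ids.begin(), column_ids.end());
    std::vector<GpuType> converted;
    converted.reserve(returned_types.size());
    for (std::size_t i = 0; i < returned_types.size(); ++i) {
        const bool unreferenced = referenced.count(i) == 0;
        const bool unrepresentable = returned_types[i] == DuckType::FLOAT_ARRAY;
        if (tolerate_unreferenced && unreferenced && unrepresentable) {
            converted.push_back(GpuType::PLACEHOLDER);  // never materialized
        } else {
            converted.push_back(ConvertStrict(returned_types[i]));
        }
    }
    return converted;
}
```

Note that the converted list in this sketch still has one entry per schema column, including the placeholder. Whether a similar length difference between the full schema and the projected columns is what triggers the column count warning is a guess, not something this sketch establishes.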

Test plan

  • Unit (test/cpp/planner/test_lance_vector_search_routing.cpp, [lance_vector_search]): whitelist routing, the projected-type guard (incl. DECIMAL width boundary and no over-rejection of unprojected columns), tolerance of an unprojected ARRAY/LIST column, the bind-data move path — using an in-process fake table function so the suite runs offline. 10/10.
  • Regression ([planner],[transparent]): 19/19, no change to seq_scan/parquet/iceberg routing.
  • End-to-end (manual, real lance dataset, a real GPU exec config): a scalar lance_vector_search query runs on GPU with results identical to CPU; projecting the vector column falls back cleanly. A self-skipping integration scaffold (test_lance_vector_search_transparent.cpp, SIRIUS_TEST_LANCE_* env) is included for when the lance extension is available in CI.

Note for e2e reviewers

Use a real GPU execution config. The planner-unit minimal.yaml is not set up for execution and segfaults in the drain_after_error path (pre-existing Sirius error-path robustness, unrelated to this change); the repo's integration.yaml requests a 32 GB GPU pool, so reduce it for smaller cards.

Sirius now recognizes the DuckDB lance extension's lance_vector_search(...)
table function so the relational work after vector retrieval (filter,
projection, join, aggregation, sort) runs on GPU instead of forcing a
whole-query CPU fallback.

The function is routed through the existing generic DUCKDB_SCAN path (the
same path seq_scan uses): added to the supported-scan whitelist in the
physical plan generator and dispatched to sirius_physical_duckdb_scan in
both the engine and the pipeline converter.

lance reports a full output schema that always includes a vector column
(FLOAT[N]) which Sirius cannot represent. Two changes make this work:
- A plan-time guard rejects any projected unsupported column type (LIST,
  ARRAY, STRUCT, HUGEINT, UHUGEINT, DECIMAL width > 18) by throwing
  NotImplementedException, so such queries fall back cleanly to CPU rather
  than erroring during execution. The guard inspects only projected columns.
- The un-referenced vector column (pruned from column_ids by projection
  pushdown, never materialized) is tolerated during the scan node's
  returned-types conversion via a benign placeholder, so plan generation no
  longer aborts on it. Strict conversion is kept for all other functions.

The duckdb scan now moves the table function's bind data instead of copying
it: the scan node is converted exactly once and never reads its bind data
afterwards, so transferring ownership is behavior-preserving and additionally
supports table functions whose bind data is movable but not copyable (the
lance extension's bind data does not implement Copy). The parquet and iceberg
scan operators keep their existing copy behavior.

A two-step form is also supported and needs no special handling: materialize
the scalar candidate columns into a temporary table (excluding the vector
column) and run gpu_execution over that table, which scans as an ordinary
seq_scan. This is the recommended path when a query would otherwise project
the vector column.

Tests cover whitelist routing, the projected-type guard (including the
DECIMAL width boundary and no over-rejection of unprojected columns),
tolerance of an unprojected ARRAY/LIST column, and the bind-data move path,
using an in-process fake table function so they run offline. A self-skipping
integration test scaffolds end-to-end checks for when the lance extension is
available.
@ran-yuan-rui ran-yuan-rui requested a review from bwyogatama June 1, 2026 14:00
A hidden, self-skipping integration test that builds its own Lance dataset
in DuckDB (COPY ... FORMAT lance) and verifies, against the real DuckDB lance
extension, that a scalar lance_vector_search tail runs on GPU with results
matching CPU, and that projecting the vector column falls back cleanly.

Hidden by default (Catch2 "[.]" tag) so it does not run in normal CI; run
explicitly with: sirius_unittest "[.lance_e2e_selfcontained]". Self-skips when
the lance extension is unavailable, and builds its dataset in a temp dir so it
needs no external fixture or environment query.
@mbrobbel

mbrobbel commented Jun 4, 2026

Member

Marking as draft (because: do not merge)

@mbrobbel mbrobbel marked this pull request as draft June 4, 2026 20:44