Lance is a modern columnar data format optimized for ML workflows and datasets, providing high-performance random access, vector search, zero-copy automatic versioning, and ecosystem integrations. The vision is to become the de facto standard columnar data format for machine learning and large language models.
Also see directory-specific guidelines: rust/ | python/ | java/ | protos/ | docs/src/format/
Rust workspace with Python and Java bindings:
- `rust/lance/` - Main library implementing the columnar format
- `rust/lance-core/` - Core types, traits, and utilities
- `rust/lance-arrow/` - Apache Arrow integration layer
- `rust/lance-encoding/` - Data encoding and compression algorithms
- `rust/lance-file/` - File format reading/writing
- `rust/lance-index/` - Vector and scalar indexing
- `rust/lance-io/` - I/O operations and object store integration
- `rust/lance-linalg/` - Linear algebra for vector search
- `rust/lance-table/` - Table format and operations
- `rust/lance-geo/` - Geospatial data support
- `rust/lance-datagen/` - Data generation for tests and benchmarks
- `rust/lance-namespace/` / `rust/lance-namespace-impls/` - Namespace/catalog interfaces
- `rust/lance-test-macros/` / `rust/lance-testing/` - Test infrastructure
- `rust/lance-tools/` - CLI and developer tooling
- `rust/examples/` - Sample binaries and demonstrations
- `rust/compression/bitpacking/` / `rust/compression/fsst/` - Compression codecs
- `rust/lance-datafusion/` - DataFusion integration (built separately)
- `python/` - Python bindings (PyO3/maturin)
- `java/` - Java bindings (JNI)
Key technical traits: async-first (tokio), Arrow-native, versioned writes with manifest tracking, custom ML-optimized encodings, unified object store interface (local/S3/Azure/GCS).
- Check: `cargo check --workspace --tests --benches`
- Test: `cargo test --workspace` or `cargo test -p <package> <test_name>`
- Lint: `cargo clippy --all --tests --benches -- -D warnings`
- Format: `cargo fmt --all`
- Coverage: `cargo +nightly llvm-cov -q -p <crate> --branch`
- Coverage HTML: `cargo +nightly llvm-cov -q -p <crate> --branch --html`
- Coverage for file: `python ci/coverage.py -p <crate> -f <file_path>`
See python/AGENTS.md and java/AGENTS.md.
cd test_data && docker compose up -d
AWS_DEFAULT_REGION=us-east-1 pytest --run-integration python/tests/test_s3_ddb.py

- Always use English in code, examples, and comments.
- Code is for readability, not just execution. Only add meaningful comments and tests.
- Comments should explain non-obvious "why" reasoning, not restate what the code does.
- Remove debug prints (`println!`, `dbg!`, `print()`) before merging; use `tracing` or logging frameworks.
- Extract logic repeated in 2+ places into a shared helper; inline single-use logic at its call site.
- Keep PRs focused — no drive-by refactors, reformatting, or cosmetic changes.
- Be mindful of memory use: avoid collecting streams of `RecordBatch` into memory; use `RoaringBitmap` instead of `HashSet<u32>`.
- Keep Python and Java bindings as thin wrappers — centralize validation and logic in the Rust core.
- Keep parameter names consistent across all bindings (Rust, Python, Java) — rename everywhere or nowhere.
- Never break public API signatures: deprecate with `#[deprecated]` / `@deprecated` and add a new method.
- Replace mutually exclusive boolean flags with a single enum/mode parameter.
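The two API rules above (deprecate instead of breaking, and prefer one mode enum over conflicting booleans) can be sketched together. This is a minimal illustration only: `Reader`, `ScanMode`, `scan`, and `scan_with_mode` are hypothetical names, not part of the Lance API.

```rust
/// Hypothetical mode enum replacing two mutually exclusive booleans
/// (for example `use_index: bool` and `force_full_scan: bool`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScanMode {
    Indexed,
    FullScan,
}

/// Hypothetical reader type, used only for illustration.
pub struct Reader;

impl Reader {
    /// The old signature stays in place so existing callers keep compiling;
    /// it forwards to the new method.
    #[deprecated(since = "0.2.0", note = "use `scan_with_mode` instead")]
    pub fn scan(&self, force_full_scan: bool) -> &'static str {
        let mode = if force_full_scan {
            ScanMode::FullScan
        } else {
            ScanMode::Indexed
        };
        self.scan_with_mode(mode)
    }

    /// New method: a single enum parameter, so conflicting flag
    /// combinations cannot be expressed at all.
    pub fn scan_with_mode(&self, mode: ScanMode) -> &'static str {
        match mode {
            ScanMode::Indexed => "indexed scan",
            ScanMode::FullScan => "full scan",
        }
    }
}
```

Callers of the old `scan` get a compiler warning pointing at the replacement, while their code continues to build.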
- Name variables after what the value is (e.g., `partition_id` not `mask`); precise names act as inline docs.
- Drop redundant prefixes when the struct/module already implies the domain.
- Use `indices` (not `indexes`) consistently in all APIs and docs.
- Use storage-agnostic terms in API names (e.g., `base` not `bucket`).
- When renaming a type/struct/enum, update all references (methods, fields, variables, test names).
- Validate inputs and reject invalid values with descriptive errors at API boundaries — never silently clamp or adjust.
- Validate mutually exclusive options in builders/configs — throw a clear error if both are set.
- Include full context in error messages: variable names, values, sizes, types.
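The validation rules above can be illustrated with a small builder. This is a sketch under stated assumptions: `SampleBuilder`, `Sample`, and `ConfigError` are hypothetical types invented for the example, and real code would return the crate's own error type. The builder rejects out-of-range values instead of clamping them, rejects conflicting options, and names the offending option and value in each message.

```rust
use std::fmt;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    InvalidValue { name: &'static str, value: String, expected: &'static str },
    MutuallyExclusive { first: &'static str, second: &'static str },
    Missing { expected: &'static str },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InvalidValue { name, value, expected } => {
                write!(f, "invalid value for `{name}`: got {value}, expected {expected}")
            }
            Self::MutuallyExclusive { first, second } => {
                write!(f, "`{first}` and `{second}` are mutually exclusive; set only one")
            }
            Self::Missing { expected } => write!(f, "missing option: expected {expected}"),
        }
    }
}

impl std::error::Error for ConfigError {}

/// Hypothetical validated configuration.
#[derive(Debug, PartialEq)]
pub enum Sample {
    Rows(usize),
    Fraction(f64),
}

/// Hypothetical builder with two mutually exclusive options.
#[derive(Default)]
pub struct SampleBuilder {
    num_rows: Option<usize>,
    fraction: Option<f64>,
}

impl SampleBuilder {
    pub fn num_rows(mut self, n: usize) -> Self {
        self.num_rows = Some(n);
        self
    }

    pub fn fraction(mut self, f: f64) -> Self {
        self.fraction = Some(f);
        self
    }

    /// Validates at the API boundary; nothing is silently clamped or adjusted.
    pub fn build(self) -> Result<Sample, ConfigError> {
        match (self.num_rows, self.fraction) {
            (Some(_), Some(_)) => Err(ConfigError::MutuallyExclusive {
                first: "num_rows",
                second: "fraction",
            }),
            (Some(0), None) => Err(ConfigError::InvalidValue {
                name: "num_rows",
                value: "0".to_string(),
                expected: "an integer >= 1",
            }),
            (Some(n), None) => Ok(Sample::Rows(n)),
            // The guard also rejects NaN, because every comparison with NaN is false.
            (None, Some(f)) if f > 0.0 && f <= 1.0 => Ok(Sample::Fraction(f)),
            (None, Some(f)) => Err(ConfigError::InvalidValue {
                name: "fraction",
                value: f.to_string(),
                expected: "a float in (0.0, 1.0]",
            }),
            (None, None) => Err(ConfigError::Missing {
                expected: "`num_rows` or `fraction`",
            }),
        }
    }
}
```

For example, `SampleBuilder::default().fraction(1.5).build()` returns an error rather than quietly treating the value as `1.0`.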
- Prefer implementing functionality with the standard library or existing workspace dependencies before adding new external crates.
- Keep `Cargo.lock` changes intentional; revert unrelated dependency bumps. Pin broken deps with a comment linking the upstream issue.
- Gate optional/domain-specific deps behind Cargo feature flags. Prefer separate crates for domain functionality (geo, NLP).
- All bugfixes and features must have corresponding tests. We do not merge code without tests.
- Use `rstest` (Rust) or `@pytest.mark.parametrize` (Python) for tests that differ only in inputs. Use `#[case::{name}(...)]` for readable case names.
- Replace `print()` in tests with `assert`; prints don't catch regressions.
- Extend existing tests instead of adding overlapping new ones. Add to existing test files.
- Link a GitHub issue when skipping a test; never use a bare `@pytest.mark.skip` or `@Ignore` without a tracking URL.
- Include multi-fragment scenarios for dataset operations (reads, indexes, scans).
- Cover NULL edge cases in index tests: null items, all-null collections, empty collections, null columns.
- Vector index tests must assert recall metrics (>=0.5 threshold), not just verify creation succeeds.
- For backwards compatibility, use the `test_data` directory with checked-in datasets from older versions. Include a `datagen.py` that asserts the Lance version used. Use `copy_test_data_to_tmp` to read this data.
- Avoid `ignore` in doctests; write Rust doctests that compile a function instead:

  ```rust
  /// ```
  /// # use lance::{Dataset, Result};
  /// # async fn test(dataset: &mut Dataset) -> Result<()> {
  /// dataset.delete("id = 25").await?;
  /// # Ok(())
  /// # }
  /// ```
  ```

- Skip coverage for test utilities using `#[cfg_attr(coverage, coverage(off))]`.
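The recall rule for vector index tests can be sketched as follows. This is a minimal illustration, not Lance test code: `recall`, `assert_min_recall`, and `MIN_RECALL` are hypothetical names, and the ground-truth ids are assumed to come from an exact (brute-force) search over the same data.

```rust
/// Threshold taken from the guideline above (>= 0.5).
const MIN_RECALL: f64 = 0.5;

/// Fraction of ground-truth neighbor ids that also appear in the
/// approximate search results.
fn recall(ground_truth: &[u32], results: &[u32]) -> f64 {
    assert!(
        !ground_truth.is_empty(),
        "ground_truth must contain at least one id"
    );
    // A linear scan is fine here because top-k lists in tests are small.
    let hits = ground_truth
        .iter()
        .filter(|&&id| results.contains(&id))
        .count();
    hits as f64 / ground_truth.len() as f64
}

/// Fails with full context (measured recall, threshold, both id lists)
/// when the index returns too few true neighbors.
fn assert_min_recall(ground_truth: &[u32], results: &[u32]) {
    let measured = recall(ground_truth, results);
    assert!(
        measured >= MIN_RECALL,
        "recall {measured} is below threshold {MIN_RECALL}: \
         ground_truth={ground_truth:?}, results={results:?}"
    );
}
```

A test would call `assert_min_recall(&exact_ids, &index_ids)` after querying the index, so a build that succeeds but returns poor neighbors still fails the test.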
- All public APIs must have documentation with examples. Link to relevant structs and methods.
- Use ASCII tree diagrams for hierarchical structures (encoding layers, file formats, storage layouts).
- Keep doc examples in sync with actual API signatures — update when refactoring.
- Indent content under MkDocs admonition directives (`!!! note`, etc.) with 4 spaces.
- Proofread comments and docs for typos before committing.
Contributor and maintainer attention is the most valuable resource. Less is more.
- Be concise and clear. Focus on P0/P1 issues: severe bugs, performance degradation, security concerns.
- Do not reiterate detailed changes or repeat what's already well done.
- Check naming consistency, error handling patterns, and test coverage.