feat: shared ScanScheduler, FileScheduler cache, and V1 reader cache on Dataset #3

Open · jeremywgleeson wants to merge 14 commits into `main`
Conversation
… reads

Previously, a new ScanScheduler (HTTP connection pool) was created on every fragment read in open_reader(). This caused significant overhead for scattered random-access patterns such as admin-lite's id->url lookups across thousands of fragments. Now the ScanScheduler is created once per Dataset at construction time and reused by all fragment reads via the existing FragReadConfig path.

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
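The reuse pattern this commit describes can be sketched as follows. `ScanScheduler` and `Dataset` here are minimal stand-ins (the real types live in `lance-io` and `lance`); the point is that the `Arc` is built once at dataset construction and cloned per read, instead of constructing a new scheduler in every `open_reader()` call:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

// Stand-in for lance's ScanScheduler; construction is expensive in the
// real code (spawns an I/O loop and an HTTP connection pool).
struct ScanScheduler;

static SCHEDULERS_CREATED: AtomicUsize = AtomicUsize::new(0);

impl ScanScheduler {
    fn new() -> Arc<Self> {
        SCHEDULERS_CREATED.fetch_add(1, Ordering::SeqCst);
        Arc::new(ScanScheduler)
    }
}

// Stand-in for the Dataset struct: the scheduler is built exactly once here.
struct Dataset {
    scan_scheduler: Arc<ScanScheduler>,
}

impl Dataset {
    fn open() -> Self {
        Self {
            scan_scheduler: ScanScheduler::new(),
        }
    }

    // Before: each open_reader() called ScanScheduler::new().
    // After: it clones the shared Arc (a refcount bump, no new pool).
    fn open_reader(&self) -> Arc<ScanScheduler> {
        self.scan_scheduler.clone()
    }
}

// Counts how many schedulers get constructed for `reads` fragment reads.
fn schedulers_created_for(reads: usize) -> usize {
    let before = SCHEDULERS_CREATED.load(Ordering::SeqCst);
    let ds = Dataset::open();
    for _ in 0..reads {
        let _scheduler = ds.open_reader();
    }
    SCHEDULERS_CREATED.load(Ordering::SeqCst) - before
}

fn main() {
    // 5,000 scattered reads still construct exactly one scheduler.
    assert_eq!(schedulers_created_for(5000), 1);
    println!("schedulers created per dataset: 1");
}
```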
… per request

Cache opened FileScheduler instances (keyed by file path) on the Dataset struct using a DashMap. This avoids re-opening ~4,600 S3 file handles on every request for scattered random-access patterns such as id->url lookups. The cache is only used for the default read path (no custom base_id or scan_scheduler) and is cleared when the object store changes via with_object_store().

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
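A minimal sketch of the path-keyed cache shape this commit adds. The PR uses `DashMap`; a `Mutex<HashMap>` stands in here to keep the example dependency-free, and the single-threaded check-then-insert below would be a `DashMap::entry` call in concurrent code to avoid racing double-opens:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for lance's FileScheduler; opening one in the real code costs
// an S3 file-open round trip plus metadata reads.
#[derive(Clone)]
struct FileScheduler {
    path: String,
}

// Stand-in Dataset carrying the cache; the real field is
// Arc<DashMap<Path, FileScheduler>>.
struct Dataset {
    file_scheduler_cache: Arc<Mutex<HashMap<String, FileScheduler>>>,
    opens: Arc<Mutex<usize>>, // counts simulated "real" file opens
}

impl Dataset {
    fn new() -> Self {
        Self {
            file_scheduler_cache: Arc::new(Mutex::new(HashMap::new())),
            opens: Arc::new(Mutex::new(0)),
        }
    }

    // Default read path: consult the cache before opening the file.
    fn open_scheduler(&self, path: &str) -> FileScheduler {
        if let Some(hit) = self.file_scheduler_cache.lock().unwrap().get(path) {
            return hit.clone(); // cache hit: no new S3 handle
        }
        *self.opens.lock().unwrap() += 1; // simulate the expensive open
        let sched = FileScheduler {
            path: path.to_string(),
        };
        self.file_scheduler_cache
            .lock()
            .unwrap()
            .insert(path.to_string(), sched.clone());
        sched
    }
}

fn main() {
    let ds = Dataset::new();
    for _ in 0..3 {
        ds.open_scheduler("data/frag-0001.lance");
    }
    ds.open_scheduler("data/frag-0002.lance");
    // Two distinct paths => exactly two real opens across four reads.
    assert_eq!(*ds.opens.lock().unwrap(), 2);
    println!("real opens: {}", *ds.opens.lock().unwrap());
}
```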
…equest S3 reader creation

Each fragment read in V1 (legacy) format was creating a new PreviousFileReader, which involves opening a new S3 file handle and reading metadata/page tables. For 5k scattered ID lookups across ~4,600 fragments, this created ~4,600 new readers per request. Now we cache PreviousFileReader instances in a DashMap keyed by file path on the Dataset struct. On a cache hit, we clone the reader (cheap, since its fields are all Arcs) instead of creating a new one. This eliminates the per-request overhead of S3 file handle creation and metadata reads for V1 format files.

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
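The clone-on-hit pattern relies on the reader's heavy state sitting behind `Arc`s, so `clone()` is a refcount bump rather than a copy. A std-only sketch (the real code uses `DashMap` keyed by `Path`, and `PreviousFileReader::try_new_with_fragment_id` does the expensive S3 work; both are stand-ins here):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for PreviousFileReader: the heavy state (object reader,
// page table) is behind Arc, so Clone shares it instead of copying.
#[derive(Clone)]
struct PreviousFileReader {
    page_table: Arc<Vec<u64>>, // pretend this was read from S3
}

// Stand-in for the expensive constructor: in the real code this is an
// S3 file open plus metadata/page-table reads.
fn expensive_try_new(path: &str, opens: &mut usize) -> PreviousFileReader {
    let _ = path;
    *opens += 1;
    PreviousFileReader {
        page_table: Arc::new(vec![0; 16]),
    }
}

fn main() {
    let cache: Mutex<HashMap<String, PreviousFileReader>> =
        Mutex::new(HashMap::new());
    let mut opens = 0;

    // Many reads of the same data file: only the first pays the S3 cost.
    for _ in 0..4600 {
        let path = "data/frag-0001.lance".to_string();
        let mut guard = cache.lock().unwrap();
        let reader = guard
            .entry(path.clone())
            .or_insert_with(|| expensive_try_new(&path, &mut opens))
            .clone();
        // The clone and the cached copy share one page-table allocation.
        assert!(Arc::strong_count(&reader.page_table) >= 2);
    }
    assert_eq!(opens, 1);
    println!("expensive reader constructions: {}", opens);
}
```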
…it.rs

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Summary
Adds three layers of caching/reuse to the `Dataset` struct to reduce per-fragment overhead for scattered random-access workloads (e.g., 5000 ID lookups across ~3500 fragments):

- Shared `ScanScheduler` (`Arc<ScanScheduler>`): created once at dataset open time. Fragment reads reuse this scheduler instead of spawning a new I/O loop + connection pool per `open_reader()` call.
- `FileScheduler` cache (`Arc<DashMap<Path, FileScheduler>>`): caches opened V2 file handles so subsequent reads of the same fragment data file skip the file-open overhead.
- V1 `PreviousFileReader` cache (`Arc<DashMap<Path, PreviousFileReader>>`): caches opened V1 (legacy format) file readers. Without this, every fragment read creates a new `PreviousFileReader` — involving `object_store.open(path)`, schema projection setup, and page table initialization. The cache eliminates this per-request object creation overhead for V1 data files.

Changes:

- `Dataset` struct gains `pub scan_scheduler: Arc<ScanScheduler>`, `pub file_scheduler_cache: Arc<DashMap<Path, FileScheduler>>`, and `pub v1_reader_cache: Arc<DashMap<Path, PreviousFileReader>>` fields
- `checkout_manifest()` creates all three once
- `with_object_store()` creates new instances when the object store changes
- The `commit.rs` `WriteDestination::Uri` branch includes all three new fields
- The `fragment.rs` fallback path clones `self.dataset.scan_scheduler` instead of calling `ScanScheduler::new()`
- `fragment.rs` `open_reader()` checks `file_scheduler_cache` before calling `open_file_with_priority()`, and inserts newly opened schedulers into the cache
- The `fragment.rs` `open_reader()` V1 path checks `v1_reader_cache` before calling `PreviousFileReader::try_new_with_fragment_id()`, and inserts newly created readers into the cache
- S3 call counters added to `lance-io` (`v1_s3_calls()`, `v1_s3_bytes()`, `s3_requests_counter()`, etc.) for profiling

Benchmark results (5k scattered ID lookups, url-only, staging with prod data, S3 concurrency=2000):
The V1 reader cache provides a 56% cold-start improvement (9.7s → 4.2s) by eliminating per-request `PreviousFileReader` creation. Warm performance is roughly unchanged (~2.1–2.3s) because the file metadata was already served from `file_metadata_cache` on warm requests — the V1 reader cache avoids object construction overhead but not the underlying S3 data reads.

Root cause of the remaining ~6x gap vs jata: both systems issue the same number of S3 calls (~7,000 per request: 2 per fragment × ~3,500 fragments) with the same concurrency model (semaphore=2000). The gap is in per-S3-call latency: Lance V1 reads go through `CloudObjectReader` → `object_store::ObjectStore::get_range()` (reqwest HTTP client), while jata uses `aws_sdk_s3::Client::get_object()` (hyper HTTP client) directly. The `object_store` layer adds per-request overhead (signing, retry wrappers, path normalization, different connection pooling). With ~7,000 calls, even small per-call overhead compounds significantly. Additionally, Lance constructs intermediate Arrow arrays (positions, offsets, `GenericByteArray`) per fragment that jata avoids by working with raw bytes. These are metadata/abstraction-layer costs, not data caching — further improvement requires either bypassing `object_store` for V1 reads or a purpose-built direct-S3 reader.

Review & Testing Checklist for Human
- `file_scheduler_cache` and `v1_reader_cache` have no eviction policy, TTL, or size limit. For datasets with tens of thousands of fragments, these caches will hold state for every unique file path ever read. Verify memory growth is acceptable, or add LRU eviction.
- `PreviousFileReader::clone()` semantics: the V1 reader cache returns `cached.clone()`. Verify that `PreviousFileReader::clone()` shares underlying state (object reader, page table) cheaply rather than deep-copying. If it deep-copies, the cache provides no benefit and doubles memory usage.
- Cached `FileScheduler`/`PreviousFileReader` objects may serve stale data. Caches are only cleared on `with_object_store()`. Verify this is sufficient for your data lifecycle, or clear caches on dataset version change.
- Caches are keyed only by `Path` (file path). Verify that different schema projections for the same file don't produce incorrect results — the cached reader has a fixed schema from its first creation. If different callers project different columns, the cached reader's schema may not match.

Suggested test plan: deploy to staging with `ADMIN_LITE_USE_LANCE=true`, send repeated 5k-ID url-only requests, and verify latency (via the `[admin-lite-perf] COMPLETE` logs).

Notes
- The `SchedulerConfig::max_bandwidth` config uses `32 MiB * io_parallelism` as the I/O buffer size. With `LANCE_IO_THREADS=2048` this is ~64 GB of buffer — in practice the buffer is a backpressure limit, not allocated memory, but worth being aware of.
- `LANCE_IO_THREADS` tuning had no measurable impact with the caches in place (~2,160ms at the default of 64 vs ~2,160ms at 2048), suggesting the bottleneck is per-S3-call latency, not I/O thread count.
- The S3 call counters (`v1_s3_calls()`, `s3_requests_counter()`, etc.) are included for future profiling. These are atomic counters with negligible runtime cost.
- CI failures in `cargo-deny` (dependency vulnerability in AWS-LC/tar-rs) and `clippy` (missing `package.readme` on arrow-scalar) are unrelated to this PR. `format`, `build`, `rustdoc`, and `Rust Clippy and Fmt Check` all pass.
- Further improvement would require bypassing `object_store` for V1 reads and using the AWS SDK S3 client directly (like jata does), or caching decoded position arrays in memory to eliminate half the S3 calls.

Link to Devin session: https://app.devin.ai/sessions/0e1908318d87468ca0ecf900e16b6502
Requested by: @jeremywgleeson
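If the unbounded-growth concern in the checklist needs addressing, one option is a size-bounded cache. A hypothetical std-only FIFO sketch (not part of this PR; a real fix would more likely use an LRU or moka-style cache crate, and the names below are illustrative):

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical size-bounded cache illustrating the "add eviction"
// checklist item. FIFO eviction keeps the sketch short; LRU would
// additionally move a key to the back on every hit.
struct BoundedCache<V> {
    map: HashMap<String, V>,
    order: VecDeque<String>, // insertion order, oldest at the front
    capacity: usize,
}

impl<V: Clone> BoundedCache<V> {
    fn new(capacity: usize) -> Self {
        Self {
            map: HashMap::new(),
            order: VecDeque::new(),
            capacity,
        }
    }

    fn get(&self, key: &str) -> Option<V> {
        self.map.get(key).cloned()
    }

    fn insert(&mut self, key: String, value: V) {
        // Evict the oldest entry when inserting a new key at capacity.
        if !self.map.contains_key(&key) && self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        if !self.map.contains_key(&key) {
            self.order.push_back(key.clone());
        }
        self.map.insert(key, value);
    }

    fn len(&self) -> usize {
        self.map.len()
    }
}

fn main() {
    let mut cache = BoundedCache::new(2);
    cache.insert("a".into(), 1);
    cache.insert("b".into(), 2);
    cache.insert("c".into(), 3); // evicts "a"
    assert_eq!(cache.len(), 2);
    assert!(cache.get("a").is_none());
    assert_eq!(cache.get("c"), Some(3));
    println!("cache size stays at {}", cache.len());
}
```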