
feat: shared ScanScheduler, FileScheduler cache, and V1 reader cache on Dataset #3

Open
jeremywgleeson wants to merge 14 commits into main from devin/1774288046-cache-fragment-readers

Conversation


@jeremywgleeson jeremywgleeson commented Mar 23, 2026

Summary

Adds three layers of caching/reuse to the Dataset struct to reduce per-fragment overhead for scattered random-access workloads (e.g., 5000 ID lookups across ~3500 fragments):

  1. Shared ScanScheduler (Arc<ScanScheduler>): Created once at dataset open time. Fragment reads reuse this scheduler instead of spawning a new I/O loop + connection pool per open_reader() call.

  2. FileScheduler cache (Arc<DashMap<Path, FileScheduler>>): Caches opened V2 file handles so subsequent reads of the same fragment data file skip the file-open overhead.

  3. V1 PreviousFileReader cache (Arc<DashMap<Path, PreviousFileReader>>): Caches opened V1 (legacy format) file readers. Without this, every fragment read creates a new PreviousFileReader — involving object_store.open(path), schema projection setup, and page table initialization. The cache eliminates this per-request object creation overhead for V1 data files.
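As a rough sketch of how these three fields hang off the Dataset, using `Mutex<HashMap>` as a stand-in for the `DashMap` the PR actually uses, and empty placeholder types for the Lance scheduler/reader types:

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};

// Placeholder stand-ins for the real Lance types (ScanScheduler,
// FileScheduler, PreviousFileReader); the PR itself uses DashMap
// rather than Mutex<HashMap> for lock-free concurrent access.
struct ScanScheduler;
#[derive(Clone)]
struct FileScheduler;
#[derive(Clone)]
struct PreviousFileReader;

struct Dataset {
    // Created once at dataset open time and shared by all fragment reads.
    scan_scheduler: Arc<ScanScheduler>,
    // Keyed by data-file path; entries live until the object store changes.
    file_scheduler_cache: Arc<Mutex<HashMap<PathBuf, FileScheduler>>>,
    v1_reader_cache: Arc<Mutex<HashMap<PathBuf, PreviousFileReader>>>,
}

impl Dataset {
    fn new() -> Self {
        Dataset {
            scan_scheduler: Arc::new(ScanScheduler),
            file_scheduler_cache: Arc::new(Mutex::new(HashMap::new())),
            v1_reader_cache: Arc::new(Mutex::new(HashMap::new())),
        }
    }
}
```

Because everything is behind `Arc`, cloning the Dataset (or handing the caches to fragment readers) shares the same underlying maps rather than copying them.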

Changes:

  • Dataset struct gains pub scan_scheduler: Arc<ScanScheduler>, pub file_scheduler_cache: Arc<DashMap<Path, FileScheduler>>, and pub v1_reader_cache: Arc<DashMap<Path, PreviousFileReader>> fields
  • checkout_manifest() creates all three once
  • with_object_store() creates new instances when the object store changes
  • commit.rs WriteDestination::Uri branch includes all three new fields
  • fragment.rs fallback path clones self.dataset.scan_scheduler instead of calling ScanScheduler::new()
  • fragment.rs open_reader() checks file_scheduler_cache before calling open_file_with_priority(), and inserts newly opened schedulers into the cache
  • fragment.rs open_reader() V1 path checks v1_reader_cache before calling PreviousFileReader::try_new_with_fragment_id(), and inserts newly created readers into the cache
  • S3 call counting instrumentation added to lance-io (v1_s3_calls(), v1_s3_bytes(), s3_requests_counter(), etc.) for profiling
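The check-then-insert pattern described for open_reader() can be sketched as follows. This is illustrative only: `open_file_with_priority()` is a stub here, and a `Mutex<HashMap>` entry API stands in for the `DashMap` the PR uses.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::Mutex;

#[derive(Clone)]
struct FileScheduler {
    path: PathBuf,
}

// Stub for the real open_file_with_priority(), which opens an S3 file
// handle and reads the V2 file metadata.
fn open_file_with_priority(path: &Path) -> FileScheduler {
    FileScheduler { path: path.to_path_buf() }
}

// Check the cache first; open and insert only on a miss, then hand the
// caller a cheap clone of the cached scheduler.
fn get_or_open(
    cache: &Mutex<HashMap<PathBuf, FileScheduler>>,
    path: &Path,
) -> FileScheduler {
    let mut map = cache.lock().unwrap();
    map.entry(path.to_path_buf())
        .or_insert_with(|| open_file_with_priority(path))
        .clone()
}
```

The V1 path follows the same shape, with `PreviousFileReader::try_new_with_fragment_id()` in place of the open call.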

Benchmark results (5k scattered ID lookups, url-only, staging with prod data, S3 concurrency=2000):

| Config | Cold | Warm (stabilized) |
| --- | --- | --- |
| Lance baseline (no optimizations) | ~9,200 ms | ~2,830 ms |
| + shared ScanScheduler | ~6,100 ms | ~2,560 ms |
| + ScanScheduler + FileScheduler cache | ~9,680 ms | ~2,100 ms |
| + all caches (incl. V1 reader cache) | ~4,230 ms | ~2,300 ms |
| Jata baseline (S3 concurrency=2000) | ~800 ms | ~364 ms |

The V1 reader cache provides a 56% cold-start improvement (9.7s → 4.2s) by eliminating per-request PreviousFileReader creation. Warm performance is roughly unchanged (~2.1-2.3s) because the file metadata was already served from file_metadata_cache on warm requests — the V1 reader cache avoids object construction overhead but not the underlying S3 data reads.

Root cause of the remaining ~6x gap vs jata: Both systems issue the same number of S3 calls (~7,000 per request: 2 per fragment × ~3,500 fragments) with the same concurrency model (semaphore=2000). The gap is in per-S3-call latency: Lance V1 reads go through CloudObjectReader, which calls object_store::ObjectStore::get_range() (reqwest HTTP client), while jata calls aws_sdk_s3::Client::get_object() (hyper HTTP client) directly. The object_store layer adds per-request overhead (signing, retry wrappers, path normalization, different connection pooling), and with ~7,000 calls even a small per-call overhead compounds significantly. Additionally, Lance constructs intermediate Arrow arrays (positions, offsets, GenericByteArray) per fragment that jata avoids by working with raw bytes. These are metadata/abstraction-layer costs, not data-caching misses: further improvement requires either bypassing object_store for V1 reads or a purpose-built direct-S3 reader.

Review & Testing Checklist for Human

  • Unbounded cache growth (all three caches): file_scheduler_cache and v1_reader_cache have no eviction policy, TTL, or size limit. For datasets with tens of thousands of fragments, these caches will hold state for every unique file path ever read. Verify memory growth is acceptable, or add LRU eviction.
  • PreviousFileReader::clone() semantics: The V1 reader cache returns cached.clone(). Verify that PreviousFileReader::clone() shares underlying state (object reader, page table) cheaply rather than deep-copying. If it deep-copies, the cache provides no benefit and doubles memory usage.
  • Cache staleness: If S3 files are replaced at the same path (compaction, rewrite), cached FileScheduler/PreviousFileReader objects may serve stale data. Caches are only cleared on with_object_store(). Verify this is sufficient for your data lifecycle, or clear caches on dataset version change.
  • V1 reader cache key correctness: The cache is keyed by Path (file path). Verify that different schema projections for the same file don't produce incorrect results — the cached reader has a fixed schema from its first creation. If different callers project different columns, the cached reader's schema may not match.
  • Warm performance regression check: The V1 reader cache showed ~2,300ms warm vs ~2,100ms without it. Verify this isn't a consistent regression (could be measurement noise, or the DashMap lookup + clone overhead may slightly outweigh the construction savings on warm paths where metadata is already cached).
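If unbounded growth does turn out to be a problem, a capacity-bounded LRU along these lines is one possible mitigation. This is a sketch only: `BoundedCache` and its methods are hypothetical, and a production fix would more likely wrap the existing DashMap caches with an established crate such as `lru` or `moka`.

```rust
use std::collections::{HashMap, VecDeque};

// Minimal capacity-bounded cache illustrating the suggested LRU
// eviction: least-recently-used keys are dropped once the cache is full.
struct BoundedCache<V> {
    map: HashMap<String, V>,
    order: VecDeque<String>, // front = least recently used
    capacity: usize,
}

impl<V: Clone> BoundedCache<V> {
    fn new(capacity: usize) -> Self {
        BoundedCache { map: HashMap::new(), order: VecDeque::new(), capacity }
    }

    fn get_or_insert_with(&mut self, key: &str, make: impl FnOnce() -> V) -> V {
        if let Some(v) = self.map.get(key) {
            // Hit: move the key to the back (most recently used).
            self.order.retain(|k| k != key);
            self.order.push_back(key.to_string());
            return v.clone();
        }
        // Miss: evict the least-recently-used entry if at capacity.
        if self.map.len() >= self.capacity {
            if let Some(old) = self.order.pop_front() {
                self.map.remove(&old);
            }
        }
        let v = make();
        self.map.insert(key.to_string(), v.clone());
        self.order.push_back(key.to_string());
        v
    }
}
```

Note the O(n) `retain` on every hit; a real LRU uses an intrusive list or a crate that provides one.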

Suggested test plan: Deploy to staging with ADMIN_LITE_USE_LANCE=true, send repeated 5k-ID url-only requests, and verify:

  1. Cold start latency ~4-5s (check [admin-lite-perf] COMPLETE logs)
  2. Warm latency stabilizes around ~2.1-2.3s
  3. Memory usage doesn't grow unboundedly over many requests
  4. Data correctness — returned URLs match what jata returns for the same IDs

Notes

  • The SchedulerConfig::max_bandwidth config uses 32 MiB * io_parallelism as the I/O buffer size. With LANCE_IO_THREADS=2048 this is ~64 GB of buffer — in practice the buffer is a backpressure limit, not allocated memory, but worth being aware of.
  • LANCE_IO_THREADS tuning had no measurable impact with the caches (2,160ms at default 64 vs ~2,160ms at 2048), suggesting the bottleneck is per-S3-call latency, not I/O thread count.
  • S3 call counting instrumentation (v1_s3_calls(), s3_requests_counter(), etc.) is included for future profiling. These are atomic counters with negligible runtime cost.
  • Pre-existing CI failures: cargo-deny (dependency vulnerability in AWS-LC/tar-rs) and clippy (missing package.readme on arrow-scalar) are unrelated to this PR. format, build, rustdoc, Rust Clippy and Fmt Check all pass.
  • To close the remaining 6x gap, the most impactful next step would be bypassing object_store for V1 reads and using the AWS SDK S3 client directly (like jata does), or caching decoded position arrays in memory to eliminate half the S3 calls.
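The counter style behind the S3 instrumentation can be sketched like this; the function bodies below are illustrative, and only the names v1_s3_calls()/v1_s3_bytes() come from the PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide statistics counters, incremented on every S3 GET.
static S3_CALLS: AtomicU64 = AtomicU64::new(0);
static S3_BYTES: AtomicU64 = AtomicU64::new(0);

fn record_s3_get(len: u64) {
    // Relaxed ordering suffices for statistics: we need atomicity of each
    // increment, not ordering relative to other memory operations, which
    // is why the runtime cost is negligible.
    S3_CALLS.fetch_add(1, Ordering::Relaxed);
    S3_BYTES.fetch_add(len, Ordering::Relaxed);
}

fn v1_s3_calls() -> u64 {
    S3_CALLS.load(Ordering::Relaxed)
}

fn v1_s3_bytes() -> u64 {
    S3_BYTES.load(Ordering::Relaxed)
}
```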

Link to Devin session: https://app.devin.ai/sessions/0e1908318d87468ca0ecf900e16b6502
Requested by: @jeremywgleeson

… reads

Previously, a new ScanScheduler (HTTP connection pool) was created on
every fragment read in open_reader(). This caused significant overhead
for scattered random access patterns like admin-lite's id->url lookups
across thousands of fragments.

Now the ScanScheduler is created once per Dataset at construction time
and reused by all fragment reads via the existing FragReadConfig path.

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions github-actions bot added the enhancement New feature or request label Mar 23, 2026
devin-ai-integration bot and others added 4 commits March 23, 2026 18:40
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
… per request

Cache opened FileScheduler instances (keyed by file path) on the Dataset
struct using a DashMap. This avoids re-opening ~4,600 S3 file handles on
every request for scattered random access patterns like id->url lookups.

The cache is only used for the default read path (no custom base_id or
scan_scheduler). It is cleared when the object store changes via
with_object_store().

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
@devin-ai-integration devin-ai-integration bot changed the title from "feat: store shared ScanScheduler on Dataset for reuse across fragment reads" to "feat: shared ScanScheduler + FileScheduler cache on Dataset for fragment read reuse" Mar 23, 2026
devin-ai-integration bot and others added 9 commits March 23, 2026 21:29
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
…equest S3 reader creation

Each fragment read in V1 (legacy) format was creating a new PreviousFileReader,
which involves opening a new S3 file handle and reading metadata/page tables.
For 5k scattered ID lookups across ~4600 fragments, this created ~4600 new
readers per request.

Now we cache PreviousFileReader instances in a DashMap keyed by file path on
the Dataset struct. On cache hit, we clone the reader (cheap - all Arc fields)
instead of creating a new one. This eliminates the per-request overhead of
S3 file handle creation and metadata reads for V1 format files.

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
…it.rs

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
@devin-ai-integration devin-ai-integration bot changed the title from "feat: shared ScanScheduler + FileScheduler cache on Dataset for fragment read reuse" to "feat: shared ScanScheduler, FileScheduler cache, and V1 reader cache on Dataset" Mar 24, 2026