feat: shared ScanScheduler, FileScheduler cache, and V1 reader cache on Dataset #3

Open · jeremywgleeson wants to merge 14 commits into `main`
Conversation
… reads

Previously, a new ScanScheduler (HTTP connection pool) was created on every fragment read in open_reader(). This caused significant overhead for scattered random-access patterns such as admin-lite's id->url lookups across thousands of fragments. Now the ScanScheduler is created once per Dataset at construction time and reused by all fragment reads via the existing FragReadConfig path.

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
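The reuse pattern this commit describes can be sketched as follows. `ScanScheduler` and `Dataset` here are minimal stand-ins (the real types live in `lance-io` and `lance`); the point is that the `Arc` is built once at dataset construction and cloned per read, instead of constructing a new scheduler in every `open_reader()` call:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

// Stand-in for lance's ScanScheduler; construction is expensive in the
// real code (spawns an I/O loop and an HTTP connection pool).
struct ScanScheduler;

static SCHEDULERS_CREATED: AtomicUsize = AtomicUsize::new(0);

impl ScanScheduler {
    fn new() -> Arc<Self> {
        SCHEDULERS_CREATED.fetch_add(1, Ordering::SeqCst);
        Arc::new(ScanScheduler)
    }
}

// Stand-in for the Dataset struct: the scheduler is built exactly once here.
struct Dataset {
    scan_scheduler: Arc<ScanScheduler>,
}

impl Dataset {
    fn open() -> Self {
        Self {
            scan_scheduler: ScanScheduler::new(),
        }
    }

    // Before: each open_reader() called ScanScheduler::new().
    // After: it clones the shared Arc (a refcount bump, no new pool).
    fn open_reader(&self) -> Arc<ScanScheduler> {
        self.scan_scheduler.clone()
    }
}

// Counts how many schedulers get constructed for `reads` fragment reads.
fn schedulers_created_for(reads: usize) -> usize {
    let before = SCHEDULERS_CREATED.load(Ordering::SeqCst);
    let ds = Dataset::open();
    for _ in 0..reads {
        let _scheduler = ds.open_reader();
    }
    SCHEDULERS_CREATED.load(Ordering::SeqCst) - before
}

fn main() {
    // 5,000 scattered reads still construct exactly one scheduler.
    assert_eq!(schedulers_created_for(5000), 1);
    println!("schedulers created per dataset: 1");
}
```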
… per request

Cache opened FileScheduler instances (keyed by file path) on the Dataset struct using a DashMap. This avoids re-opening ~4,600 S3 file handles on every request for scattered random-access patterns such as id->url lookups. The cache is only used for the default read path (no custom base_id or scan_scheduler) and is cleared when the object store changes via with_object_store().

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
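A minimal sketch of the path-keyed cache shape this commit adds. The PR uses `DashMap`; a `Mutex<HashMap>` stands in here to keep the example dependency-free, and the single-threaded check-then-insert below would be a `DashMap::entry` call in concurrent code to avoid racing double-opens:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for lance's FileScheduler; opening one in the real code costs
// an S3 file-open round trip plus metadata reads.
#[derive(Clone)]
struct FileScheduler {
    path: String,
}

// Stand-in Dataset carrying the cache; the real field is
// Arc<DashMap<Path, FileScheduler>>.
struct Dataset {
    file_scheduler_cache: Arc<Mutex<HashMap<String, FileScheduler>>>,
    opens: Arc<Mutex<usize>>, // counts simulated "real" file opens
}

impl Dataset {
    fn new() -> Self {
        Self {
            file_scheduler_cache: Arc::new(Mutex::new(HashMap::new())),
            opens: Arc::new(Mutex::new(0)),
        }
    }

    // Default read path: consult the cache before opening the file.
    fn open_scheduler(&self, path: &str) -> FileScheduler {
        if let Some(hit) = self.file_scheduler_cache.lock().unwrap().get(path) {
            return hit.clone(); // cache hit: no new S3 handle
        }
        *self.opens.lock().unwrap() += 1; // simulate the expensive open
        let sched = FileScheduler {
            path: path.to_string(),
        };
        self.file_scheduler_cache
            .lock()
            .unwrap()
            .insert(path.to_string(), sched.clone());
        sched
    }
}

fn main() {
    let ds = Dataset::new();
    for _ in 0..3 {
        ds.open_scheduler("data/frag-0001.lance");
    }
    ds.open_scheduler("data/frag-0002.lance");
    // Two distinct paths => exactly two real opens across four reads.
    assert_eq!(*ds.opens.lock().unwrap(), 2);
    println!("real opens: {}", *ds.opens.lock().unwrap());
}
```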
…equest S3 reader creation

Each fragment read in V1 (legacy) format was creating a new PreviousFileReader, which involves opening a new S3 file handle and reading metadata/page tables. For 5k scattered ID lookups across ~4,600 fragments, this created ~4,600 new readers per request. Now we cache PreviousFileReader instances in a DashMap keyed by file path on the Dataset struct. On a cache hit, we clone the reader (cheap, since its fields are all Arcs) instead of creating a new one. This eliminates the per-request overhead of S3 file handle creation and metadata reads for V1 format files.

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
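The clone-on-hit pattern relies on the reader's heavy state sitting behind `Arc`s, so `clone()` is a refcount bump rather than a copy. A std-only sketch (the real code uses `DashMap` keyed by `Path`, and `PreviousFileReader::try_new_with_fragment_id` does the expensive S3 work; both are stand-ins here):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for PreviousFileReader: the heavy state (object reader,
// page table) is behind Arc, so Clone shares it instead of copying.
#[derive(Clone)]
struct PreviousFileReader {
    page_table: Arc<Vec<u64>>, // pretend this was read from S3
}

// Stand-in for the expensive constructor: in the real code this is an
// S3 file open plus metadata/page-table reads.
fn expensive_try_new(path: &str, opens: &mut usize) -> PreviousFileReader {
    let _ = path;
    *opens += 1;
    PreviousFileReader {
        page_table: Arc::new(vec![0; 16]),
    }
}

fn main() {
    let cache: Mutex<HashMap<String, PreviousFileReader>> =
        Mutex::new(HashMap::new());
    let mut opens = 0;

    // Many reads of the same data file: only the first pays the S3 cost.
    for _ in 0..4600 {
        let path = "data/frag-0001.lance".to_string();
        let mut guard = cache.lock().unwrap();
        let reader = guard
            .entry(path.clone())
            .or_insert_with(|| expensive_try_new(&path, &mut opens))
            .clone();
        // The clone and the cached copy share one page-table allocation.
        assert!(Arc::strong_count(&reader.page_table) >= 2);
    }
    assert_eq!(opens, 1);
    println!("expensive reader constructions: {}", opens);
}
```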
…it.rs

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Summary
Adds three layers of caching/reuse to the `Dataset` struct to reduce per-fragment overhead for scattered random-access workloads (e.g., 5000 ID lookups across ~3500 fragments):

- Shared `ScanScheduler` (`Arc<ScanScheduler>`): created once at dataset open time. Fragment reads reuse this scheduler instead of spawning a new I/O loop + connection pool per `open_reader()` call.
- `FileScheduler` cache (`Arc<DashMap<Path, FileScheduler>>`): caches opened V2 file handles so subsequent reads of the same fragment data file skip the file-open overhead.
- V1 `PreviousFileReader` cache (`Arc<DashMap<Path, PreviousFileReader>>`): caches opened V1 (legacy format) file readers. Without this, every fragment read creates a new `PreviousFileReader` — involving `object_store.open(path)`, schema projection setup, and page table initialization. The cache eliminates this per-request object creation overhead for V1 data files.

Changes:

- `Dataset` struct gains `pub scan_scheduler: Arc<ScanScheduler>`, `pub file_scheduler_cache: Arc<DashMap<Path, FileScheduler>>`, and `pub v1_reader_cache: Arc<DashMap<Path, PreviousFileReader>>` fields
- `checkout_manifest()` creates all three once
- `with_object_store()` creates new instances when the object store changes
- The `commit.rs` `WriteDestination::Uri` branch includes all three new fields
- The `fragment.rs` fallback path clones `self.dataset.scan_scheduler` instead of calling `ScanScheduler::new()`
- `fragment.rs` `open_reader()` checks `file_scheduler_cache` before calling `open_file_with_priority()`, and inserts newly opened schedulers into the cache
- The `fragment.rs` `open_reader()` V1 path checks `v1_reader_cache` before calling `PreviousFileReader::try_new_with_fragment_id()`, and inserts newly created readers into the cache
- S3 call counters added to `lance-io` (`v1_s3_calls()`, `v1_s3_bytes()`, `s3_requests_counter()`, etc.) for profiling

Benchmark results (5k scattered ID lookups, url-only, staging with prod data, S3 concurrency=2000):
The V1 reader cache provides a 56% cold-start improvement (9.7s → 4.2s) by eliminating per-request `PreviousFileReader` creation. Warm performance is roughly unchanged (~2.1–2.3s) because the file metadata was already served from `file_metadata_cache` on warm requests — the V1 reader cache avoids object construction overhead but not the underlying S3 data reads.

Root cause of the remaining ~6x gap vs jata: both systems issue the same number of S3 calls (~7,000 per request: 2 per fragment × ~3,500 fragments) with the same concurrency model (semaphore=2000). The gap is in per-S3-call latency: Lance V1 reads go through `CloudObjectReader` → `object_store::ObjectStore::get_range()` (reqwest HTTP client), while jata uses `aws_sdk_s3::Client::get_object()` (hyper HTTP client) directly. The `object_store` layer adds per-request overhead (signing, retry wrappers, path normalization, different connection pooling). With ~7,000 calls, even small per-call overhead compounds significantly. Additionally, Lance constructs intermediate Arrow arrays (positions, offsets, `GenericByteArray`) per fragment that jata avoids by working with raw bytes. These are metadata/abstraction-layer costs, not data caching — further improvement requires either bypassing `object_store` for V1 reads or a purpose-built direct-S3 reader.

Review & Testing Checklist for Human
- `file_scheduler_cache` and `v1_reader_cache` have no eviction policy, TTL, or size limit. For datasets with tens of thousands of fragments, these caches will hold state for every unique file path ever read. Verify memory growth is acceptable, or add LRU eviction.
- `PreviousFileReader::clone()` semantics: the V1 reader cache returns `cached.clone()`. Verify that `PreviousFileReader::clone()` shares underlying state (object reader, page table) cheaply rather than deep-copying. If it deep-copies, the cache provides no benefit and doubles memory usage.
- Cached `FileScheduler`/`PreviousFileReader` objects may serve stale data. Caches are only cleared on `with_object_store()`. Verify this is sufficient for your data lifecycle, or clear caches on dataset version change.
- Caches are keyed only by `Path` (file path). Verify that different schema projections for the same file don't produce incorrect results — the cached reader has a fixed schema from its first creation. If different callers project different columns, the cached reader's schema may not match.

Suggested test plan: deploy to staging with `ADMIN_LITE_USE_LANCE=true`, send repeated 5k-ID url-only requests, and verify latency (via the `[admin-lite-perf] COMPLETE` logs).

Notes
- The `SchedulerConfig::max_bandwidth` config uses `32 MiB * io_parallelism` as the I/O buffer size. With `LANCE_IO_THREADS=2048` this is ~64 GB of buffer — in practice the buffer is a backpressure limit, not allocated memory, but worth being aware of.
- `LANCE_IO_THREADS` tuning had no measurable impact with the caches in place (~2,160ms at the default of 64 vs ~2,160ms at 2048), suggesting the bottleneck is per-S3-call latency, not I/O thread count.
- The S3 call counters (`v1_s3_calls()`, `s3_requests_counter()`, etc.) are included for future profiling. These are atomic counters with negligible runtime cost.
- CI failures in `cargo-deny` (dependency vulnerability in AWS-LC/tar-rs) and `clippy` (missing `package.readme` on arrow-scalar) are unrelated to this PR. `format`, `build`, `rustdoc`, and `Rust Clippy and Fmt Check` all pass.
- Further improvement would require bypassing `object_store` for V1 reads and using the AWS SDK S3 client directly (like jata does), or caching decoded position arrays in memory to eliminate half the S3 calls.

Link to Devin session: https://app.devin.ai/sessions/0e1908318d87468ca0ecf900e16b6502
Requested by: @jeremywgleeson
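If the unbounded-growth concern in the checklist needs addressing, one option is a size-bounded cache. A hypothetical std-only FIFO sketch (not part of this PR; a real fix would more likely use an LRU or moka-style cache crate, and the names below are illustrative):

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical size-bounded cache illustrating the "add eviction"
// checklist item. FIFO eviction keeps the sketch short; LRU would
// additionally move a key to the back on every hit.
struct BoundedCache<V> {
    map: HashMap<String, V>,
    order: VecDeque<String>, // insertion order, oldest at the front
    capacity: usize,
}

impl<V: Clone> BoundedCache<V> {
    fn new(capacity: usize) -> Self {
        Self {
            map: HashMap::new(),
            order: VecDeque::new(),
            capacity,
        }
    }

    fn get(&self, key: &str) -> Option<V> {
        self.map.get(key).cloned()
    }

    fn insert(&mut self, key: String, value: V) {
        // Evict the oldest entry when inserting a new key at capacity.
        if !self.map.contains_key(&key) && self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        if !self.map.contains_key(&key) {
            self.order.push_back(key.clone());
        }
        self.map.insert(key, value);
    }

    fn len(&self) -> usize {
        self.map.len()
    }
}

fn main() {
    let mut cache = BoundedCache::new(2);
    cache.insert("a".into(), 1);
    cache.insert("b".into(), 2);
    cache.insert("c".into(), 3); // evicts "a"
    assert_eq!(cache.len(), 2);
    assert!(cache.get("a").is_none());
    assert_eq!(cache.get("c"), Some(3));
    println!("cache size stays at {}", cache.len());
}
```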