Commit 7affaee

zaidoon1 authored and meta-codesync[bot] committed
Add use_direct_io_for_compaction_reads option (#14743)
Summary:

Adds a new DB option, `use_direct_io_for_compaction_reads` (default false). When on, compaction-input SST files are opened with `O_DIRECT` so that the sequential, read-once data from compaction doesn't pollute the OS page cache and evict the hot user-read working set. User reads keep going through the buffered fast path. This protects user-read tail latency on write-heavy workloads without forcing user reads onto the existing global `use_direct_reads` knob (which pays in throughput and P50; see the bench below).

The interesting bit is that just flipping the FileOptions returned by `FileSystem::OptimizeForCompactionTableRead` doesn't actually trigger `O_DIRECT` at the kernel level. The TableCache (and `FileMetaData::pinned_reader`) is already holding buffered handles opened at flush time or at `DB::Open` via `LoadTableHandlers`. When compaction asks for an iterator, it gets back the cached buffered handle and the kernel never sees the `O_DIRECT` flag.

So this PR also adds a small bypass path:

- `TableCache::FindTable` / `NewIterator` learn an `open_ephemeral_table_reader` mode. When set, the pinned-reader fast path and the shared cache are skipped, `GetTableReader` is called directly with the caller's FileOptions, and ownership of the freshly opened TableReader is handed back via a `unique_ptr`. The iterator takes ownership via `RegisterCleanup` and frees the reader on destruction.
- `VersionSet::MakeInputIterator` and `LevelIterator` plumb the flag through both L0 and L1+ compaction-input paths.
- `CompactionJob::ProcessKeyValueCompaction` turns the bypass on when `use_direct_io_for_compaction_reads` is set, the global `use_direct_reads` is off, and `OptimizeForCompactionTableRead` produced `use_direct_reads=true` in the compaction-read FileOptions.

The option is opt-in: when off, nothing changes for existing users. When on, only the compaction-input opens take the bypass path; user reads keep hitting the TableCache and the buffered fast path normally.
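The reason the cached handle defeats the new FileOptions can be shown with a toy model. This is a minimal sketch, not RocksDB code: `ToyReader`, `ToyTableCache`, `FindCached`, and `OpenEphemeral` are hypothetical names invented for illustration, and the real `TableCache` is considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a table reader: remembers how its file was opened.
struct ToyReader {
  std::string file;
  bool direct_io;
};

// Hypothetical stand-in for a table cache keyed by file name.
class ToyTableCache {
 public:
  // Cached path: the first open decides the I/O mode. Later callers get the
  // same handle back, whatever mode they ask for.
  std::shared_ptr<ToyReader> FindCached(const std::string& file,
                                        bool want_direct) {
    auto it = cache_.find(file);
    if (it == cache_.end()) {
      it = cache_
               .emplace(file, std::make_shared<ToyReader>(
                                  ToyReader{file, want_direct}))
               .first;
    }
    return it->second;
  }

  // Bypass path: always opens a fresh reader in the caller's mode and hands
  // ownership to the caller. The shared cache is not touched.
  std::unique_ptr<ToyReader> OpenEphemeral(const std::string& file,
                                           bool want_direct) const {
    return std::make_unique<ToyReader>(ToyReader{file, want_direct});
  }

  std::size_t CachedCount() const { return cache_.size(); }

 private:
  std::unordered_map<std::string, std::shared_ptr<ToyReader>> cache_;
};

// Walks through the scenario from the summary using the toy types.
inline void DemonstrateBypass() {
  ToyTableCache cache;
  // The file is first opened buffered, e.g. at flush time.
  auto at_flush = cache.FindCached("000012.sst", /*want_direct=*/false);
  // Compaction asks for direct I/O but receives the cached buffered handle.
  auto for_compaction = cache.FindCached("000012.sst", /*want_direct=*/true);
  assert(for_compaction == at_flush);
  assert(!for_compaction->direct_io);
  // The ephemeral path honours the requested mode and leaves the cache alone.
  auto ephemeral = cache.OpenEphemeral("000012.sst", /*want_direct=*/true);
  assert(ephemeral->direct_io);
  assert(cache.CachedCount() == 1);
}
```

Returning a `unique_ptr` from the bypass path mirrors the ownership handoff described in the first bullet: the caller, not the cache, is responsible for freeing the reader.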
There's also a small db_bench helper in the same PR: a new `--bgwriter_num` flag that lets the writer thread in `readwhilewriting` (and the other "while writing" variants) spread its puts across `[0, bgwriter_num)` instead of `[0, num)`. Without this, the readers and writer share a key range and you can't have both a hot read subset and meaningful compaction work; the flag lets you have both.

### Benchmark

Setup: Ubuntu 24.04 (kernel 7.0.5, OrbStack Linux VM on Apple Silicon), 14 vCPUs, virtio-blk disk, btrfs. MGLRU disabled (`echo 0 > /sys/kernel/mm/lru_gen/enabled`) so the kernel uses the classic active/inactive LRU. 14 GB DB (3.5M keys × 4 KB values), no compression. Each measurement run is pinned to a 1 GB cgroup via `systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0`. Page cache is dropped between configs. db_bench is a Release build.

Workload: `readwhilewriting` for 120s. 4 reader threads doing random reads over a hot key subset, plus 1 writer thread spreading overwrites across the full 3.5M-key keyspace (via `--bgwriter_num=3500000`) throttled at 200 MB/s, so there's continuous compaction running while the readers go.

The size of the hot reader subset relative to available page cache controls how visible the optimization is. The Cassandra blog ([Lightfoot 2026](https://lightfoot.dev/direct-i-o-for-cassandra-compaction-cutting-p99-read-latency-by-5x/)) documented the same thing: biggest wins when the hot set is big enough to actually compete for cache, smaller wins when the hot set trivially fits, neutral when the hot set is way bigger than cache. So I ran two hot-set sizes.

#### Small hot set: ~30 MB (~3% of the 1 GB cgroup), N=5 iterations, mean (CV)

`--num=7500`. The hot set is small enough that the page cache holds it without much trouble even under compaction, so the wins here are real but on the modest side.
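The hot-set sizes quoted in this benchmark follow from the key counts and the 4 KB value size. The sketch below reproduces that arithmetic under two simplifying assumptions: only value bytes are counted (keys, indexes, and block overhead are ignored), and `MemoryMax=1G` is treated as 1 GiB. The helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Approximate data footprint: number of keys times the value size in bytes.
// Keys, index blocks, and filter blocks are ignored (simplifying assumption).
constexpr std::uint64_t HotSetBytes(std::uint64_t num_keys,
                                    std::uint64_t value_size) {
  return num_keys * value_size;
}

// Footprint expressed as a percentage of a memory limit.
inline double PercentOfLimit(std::uint64_t bytes, std::uint64_t limit_bytes) {
  assert(limit_bytes > 0);
  return 100.0 * static_cast<double>(bytes) / static_cast<double>(limit_bytes);
}

constexpr std::uint64_t kValueSize = 4096;          // --value_size=4096
constexpr std::uint64_t kCgroupLimit = 1ULL << 30;  // MemoryMax=1G, as 1 GiB

// --num=7500: roughly 30 MB of hot values.
static_assert(HotSetBytes(7500, kValueSize) == 30720000ULL, "small hot set");
// --num=100000: roughly 400 MB of hot values.
static_assert(HotSetBytes(100000, kValueSize) == 409600000ULL, "large hot set");
// --bgwriter_num=3500000: roughly 14 GB across the whole keyspace.
static_assert(HotSetBytes(3500000, kValueSize) == 14336000000ULL, "full DB");
```

With these inputs, `PercentOfLimit(HotSetBytes(7500, kValueSize), kCgroupLimit)` comes out just under 3%, and the 100000-key case comes out a little under 40%, matching the rounded figures in the headings.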
| Config | Throughput (ops/s) | Read P50 (µs) | Read P99 (µs) | Read P99.9 (µs) | Read P99.99 (µs) |
|---|---|---|---|---|---|
| buffered (default) | 233,477 (8.2%) | 16.09 | 82.24 | 721.0 | 2,102.5 |
| direct_compaction_writes_only (existing knob alone) | 287,405 (2.8%), **+23.1%** | 13.00 (−19.2%) | **66.77 (−18.8%)** | 553.9 (−23.2%) | 1,787.6 (−15.0%) |
| direct_compaction_read_only (new knob alone) | 250,669 (2.4%), +7.4% | 14.16 (−12.0%) | 102.99 (+25.2%) | 689.8 (−4.3%) | 1,801.3 (−14.3%) |
| direct_compaction_read_write (new + existing, recommended) | 277,920 (3.3%), **+19.0%** | **12.99 (−19.3%)** | 84.23 (+2.4%) | 613.4 (−14.9%) | **1,738.2 (−17.3%)** |
| use_direct_reads=true (existing global) + write-side | 249,014 (2.5%), +6.7% | 15.95 (−0.9%) | 68.78 (−16.4%) | **450.8 (−37.5%)** | 1,814.5 (−13.7%) |

CV is 2.4–3.3% on the optimized configs (8.2% on buffered), so the deltas are real. With a hot set this small, the existing `use_direct_io_for_flush_and_compaction` knob is already doing most of the work. The new flag's main extra contribution here is P99.99 (combined wins it by ~2 points vs writes-only-alone).

Worth noting: the new flag *alone* (without the existing write-side flag) improves P99.99 but regresses P99 by 25% on this small-hot-set workload, because direct compaction reads lose kernel readahead and compaction-output writes are still hitting the page cache. That regression goes away once you combine with the existing write-side flag, or once the hot set is bigger (see next table). So if you're using just one knob, use the existing one. If you're using this PR's flag, pair it with `use_direct_io_for_flush_and_compaction=true`.

#### Larger hot set: ~400 MB (~40% of cache), N=5 iterations, mean (CV)

`--num=100000`. This is the case the Cassandra blog calls out: a hot set big enough to actually fight compaction for cache. Their analogous setup (1M hot partitions, ~33% hot/cache) reported 1.93× p99 improvement.
Numbers here are the headline:

| Config | Throughput (ops/s) | Read P50 (µs) | Read P99 (µs) | Read P99.9 (µs) | Read P99.99 (µs) |
|---|---|---|---|---|---|
| buffered (default) | 68,959 (7.7%) | 44.81 | 541.22 | 2,225.2 | 11,334.5 |
| direct_compaction_writes_only (existing knob alone) | 73,973 (10.3%), +7.3% | 42.22 (−5.8%) | 456.27 (−15.7%) | 2,016.9 (−9.4%) | 9,190.0 (−18.9%) |
| direct_compaction_read_only (new knob alone) | 84,337 (2.3%), +22.3% | 38.66 (−13.7%) | 386.97 (−28.5%) | 1,644.8 (−26.1%) | 4,837.9 (−57.3%, 2.34×) |
| direct_compaction_read_write (new + existing, recommended) | **104,923 (8.4%), +52.2%** | **34.26 (−23.5%)** | **290.97 (−46.2%)** | **1,143.4 (−48.6%)** | **3,080.3 (−72.8%, 3.68×)** |
| use_direct_reads=true (existing global) + write-side | 71,598 (9.1%), +3.8% | 51.33 (+14.5%) | 297.91 (−45.0%) | 1,663.6 (−25.2%) | 6,530.0 (−42.4%) |

Combined config gets a 3.68× p99.99 win, 1.86× p99, p50 down 23%, throughput up 52%. Same shape as the Cassandra blog's 1.93× p99 result; the improvement just lands at deeper percentiles for us because RocksDB's baseline data path is roughly 40× faster than Cassandra's (their buffered p99 was 35 ms, ours is 0.54 ms), so the cache-miss tail is further out.

A few things worth calling out from this table:

- The new flag is doing real work on top of the existing write-side flag here, not just shifting things around. Combined throughput is +42% over `direct_compaction_writes_only` alone, and combined p99.99 is 3× better. The existing knob alone gives a fairly modest +7% throughput / −19% p99.99 in this case; there's a clear gap that the new flag fills.
- The new flag *alone* (no existing write-side flag) is also a real improvement here: +22% throughput, p99.99 down 57%. The P99 regression we saw in the small-hot-set case is gone, because the cache-protection effect now dominates the lost-readahead cost.
- `use_direct_reads=true` (the existing global flag) actually regresses P50 by 14.5% in this workload: taking user reads off the page cache hurts you when the hot data could have been cached. It also gets the worst throughput of any direct config. It's not an equivalent way to get these gains.

### `compaction_readahead_size` matters when this flag is on

Direct I/O bypasses kernel readahead, so RocksDB's own `DBOptions::compaction_readahead_size` becomes the only prefetch the iterator has. The default of 2 MB is enough and real users will get it automatically. **But `db_bench`'s `--compaction_readahead_size` CLI default is 0**, which defeats prefetch and makes direct compaction look slower than it actually is. If you're reproducing the numbers above, pass `--compaction_readahead_size=2097152` (or larger).

- Recommended production config is `use_direct_io_for_compaction_reads=true` + `use_direct_io_for_flush_and_compaction=true`. It is the strongest configuration at every percentile and in throughput in the larger-hot-set bench, and the best at P50 and P99.99 in the small-hot-set bench.
- The new flag is the read-side counterpart to `use_direct_io_for_flush_and_compaction`, which handles compaction-write cache pollution. They address different sources of pollution and compose. The gap between "combined" and "writes-only-alone" is about 2 percentage points on p99.99 in the small-hot-set bench (−17.3% vs −15.0%) and 54 points in the larger one (−72.8% vs −18.9%), so the new flag is contributing real value, especially as the hot set grows.
- The new flag alone is also a real improvement when the hot set is big enough to compete with cache (+22% throughput, 2.34× p99.99 in the larger-hot-set bench). On a very small hot set it improves p99.99 but regresses p99, so pairing with the existing write-side flag is safer.
- The benefit is workload-dependent. Small hot sets get modest tail-latency wins. Hot sets sized to actually compete for cache get the big multi-percentile wins shown above.
  Hot sets bigger than cache (not benched here but covered in the Cassandra blog) see no change either way, since every read misses regardless.

### Reproducing

Any Linux host (or a Linux VM on macOS via OrbStack / Multipass / lima):

```bash
sudo apt-get install -y build-essential clang cmake git pkg-config \
  libgflags-dev libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev libzstd-dev
cmake -DCMAKE_BUILD_TYPE=Release -DPORTABLE=1 -DWITH_GFLAGS=1 -DWITH_TESTS=0 ..
make -j db_bench
echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled
```

Build the source DB once, unrestricted memory:

```bash
./db_bench --benchmarks=fillrandom,compact,waitforcompaction,stats \
  --db=/path/to/source_db --num=3500000 --key_size=16 --value_size=4096 \
  --write_buffer_size=16777216 --target_file_size_base=16777216 \
  --max_background_jobs=4 --compression_type=none --cache_size=4194304 \
  --max_bytes_for_level_base=67108864 --disable_wal=1 --sync=0
```

For each config, copy `source_db -> scratch_db`, run `sync && echo 3 > /proc/sys/vm/drop_caches`, then:

```bash
sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 \
  ./db_bench --use_existing_db=1 \
  --benchmarks=readwhilewriting,stats --db=/path/to/scratch_db \
  --threads=5 --duration=120 --statistics=true --histogram=1 \
  --num=7500 --bgwriter_num=3500000 \
  --key_size=16 --value_size=4096 \
  --write_buffer_size=16777216 --target_file_size_base=16777216 \
  --max_background_jobs=4 --compression_type=none \
  --cache_size=4194304 --open_files=200 \
  --skip_stats_update_on_db_open=true \
  --max_bytes_for_level_base=67108864 \
  --benchmark_write_rate_limit=209715200 \
  --compaction_readahead_size=2097152 \
  --rate_limiter_bytes_per_sec=0 \
  --use_direct_reads={true|false} \
  --use_direct_io_for_compaction_reads={true|false} \
  --use_direct_io_for_flush_and_compaction={true|false}
```

For the larger hot-set table, change `--num=7500` to `--num=100000`.

The five configs in the tables:

- `buffered`: all three flags false.
- `direct_compaction_writes_only`: `use_direct_io_for_flush_and_compaction=true`, the other two false. This is what users have today without this PR.
- `direct_compaction_read_only`: `use_direct_io_for_compaction_reads=true`, the other two false.
- `direct_compaction_read_write`: `use_direct_io_for_compaction_reads=true`, `use_direct_io_for_flush_and_compaction=true`, `use_direct_reads=false`. **Recommended.**
- `direct_all`: `use_direct_reads=true`, `use_direct_io_for_flush_and_compaction=true`, `use_direct_io_for_compaction_reads=false`. This is the row labeled "use_direct_reads=true (existing global) + write-side" in the tables.

Pull Request resolved: #14743
Reviewed By: pdillinger
Differential Revision: D108017601
Pulled By: xingbowang
fbshipit-source-id: 4039d490d7e77b476db7a477a2f3d24738db6336
1 parent 828f6d1 commit 7affaee

34 files changed

Lines changed: 1137 additions & 65 deletions

db/c.cc

Lines changed: 10 additions & 0 deletions
@@ -5062,6 +5062,16 @@ unsigned char rocksdb_options_get_use_direct_io_for_flush_and_compaction(
   return opt->rep.use_direct_io_for_flush_and_compaction;
 }

+void rocksdb_options_set_use_direct_io_for_compaction_reads(
+    rocksdb_options_t* opt, unsigned char v) {
+  opt->rep.use_direct_io_for_compaction_reads = v;
+}
+
+unsigned char rocksdb_options_get_use_direct_io_for_compaction_reads(
+    rocksdb_options_t* opt) {
+  return opt->rep.use_direct_io_for_compaction_reads;
+}
+
 void rocksdb_options_set_allow_mmap_reads(rocksdb_options_t* opt,
                                           unsigned char v) {
   opt->rep.allow_mmap_reads = v;

db/c_test.c

Lines changed: 12 additions & 0 deletions
@@ -2766,6 +2766,10 @@ int main(int argc, char** argv) {
     CheckCondition(
         1 == rocksdb_options_get_use_direct_io_for_flush_and_compaction(o));

+    rocksdb_options_set_use_direct_io_for_compaction_reads(o, 1);
+    CheckCondition(1 ==
+                   rocksdb_options_get_use_direct_io_for_compaction_reads(o));
+
     rocksdb_options_set_is_fd_close_on_exec(o, 1);
     CheckCondition(1 == rocksdb_options_get_is_fd_close_on_exec(o));

@@ -2993,6 +2997,8 @@ int main(int argc, char** argv) {
     CheckCondition(1 == rocksdb_options_get_use_direct_reads(copy));
     CheckCondition(
         1 == rocksdb_options_get_use_direct_io_for_flush_and_compaction(copy));
+    CheckCondition(
+        1 == rocksdb_options_get_use_direct_io_for_compaction_reads(copy));
     CheckCondition(1 == rocksdb_options_get_is_fd_close_on_exec(copy));
     CheckCondition(18 == rocksdb_options_get_stats_dump_period_sec(copy));
     CheckCondition(5 == rocksdb_options_get_stats_persist_period_sec(copy));
@@ -3271,6 +3277,12 @@ int main(int argc, char** argv) {
     CheckCondition(
         1 == rocksdb_options_get_use_direct_io_for_flush_and_compaction(o));

+    rocksdb_options_set_use_direct_io_for_compaction_reads(copy, 0);
+    CheckCondition(
+        0 == rocksdb_options_get_use_direct_io_for_compaction_reads(copy));
+    CheckCondition(1 ==
+                   rocksdb_options_get_use_direct_io_for_compaction_reads(o));
+
     rocksdb_options_set_is_fd_close_on_exec(copy, 0);
     CheckCondition(0 == rocksdb_options_get_is_fd_close_on_exec(copy));
     CheckCondition(1 == rocksdb_options_get_is_fd_close_on_exec(o));

db/compaction/compaction_job.cc

Lines changed: 17 additions & 1 deletion
@@ -173,6 +173,7 @@ CompactionJob::CompactionJob(
       fs_(db_options.fs, io_tracer),
       file_options_for_read_(
           fs_->OptimizeForCompactionTableRead(file_options, db_options_)),
+      file_options_for_compaction_input_read_(file_options_for_read_),
       versions_(versions),
       shutting_down_(shutting_down),
       manual_compaction_canceled_(manual_compaction_canceled),
@@ -203,6 +204,11 @@ CompactionJob::CompactionJob(
   assert(job_context);
   assert(job_context->snapshot_context_initialized);

+  if (db_options_.use_direct_io_for_compaction_reads &&
+      !db_options_.use_direct_reads) {
+    file_options_for_compaction_input_read_.use_direct_reads = true;
+  }
+
   const auto* cfd = compact_->compaction->column_family_data();
   ThreadStatusUtil::SetEnableTracking(db_options_.enable_thread_tracking);
   ThreadStatusUtil::SetColumnFamily(cfd);
@@ -1536,10 +1542,20 @@ InternalIterator* CompactionJob::CreateInputIterator(

   // Although the v2 aggregator is what the level iterator(s) know about,
   // the AddTombstones calls will be propagated down to the v1 aggregator.
+  const bool open_ephemeral_table_reader =
+      db_options_.use_direct_io_for_compaction_reads &&
+      !db_options_.use_direct_reads;
+  FileOptions& input_file_options =
+      open_ephemeral_table_reader ? file_options_for_compaction_input_read_
+                                  : file_options_for_read_;
+  TEST_SYNC_POINT_CALLBACK(
+      "CompactionJob::CreateInputIterator:InputFileOptions",
+      &input_file_options);
   iterators.raw_input =
       std::unique_ptr<InternalIterator>(versions_->MakeInputIterator(
           read_options, sub_compact->compaction, sub_compact->RangeDelAgg(),
-          file_options_for_read_, boundaries.start, boundaries.end));
+          input_file_options, boundaries.start, boundaries.end,
+          open_ephemeral_table_reader));
   InternalIterator* input = iterators.raw_input.get();

   if (boundaries.start.has_value() || boundaries.end.has_value()) {
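The selection logic in the hunks above reduces to a small truth table over two options. The following is a standalone sketch of that logic as read from this diff only: `ToyDbOptions`, `CompactionInputPlan`, and `PlanCompactionInput` are hypothetical names, and `base_direct_reads` stands in for whatever `use_direct_reads` value `OptimizeForCompactionTableRead` left in `file_options_for_read_`.

```cpp
#include <cassert>

// Hypothetical, trimmed-down mirror of the two DB options the diff consults.
struct ToyDbOptions {
  bool use_direct_reads = false;
  bool use_direct_io_for_compaction_reads = false;
};

// What the compaction-input iterator ends up with.
struct CompactionInputPlan {
  bool open_ephemeral_table_reader;  // bypass the cached/pinned readers?
  bool input_direct_reads;           // use_direct_reads in the FileOptions used
};

// Mirrors the condition in CreateInputIterator: the bypass is taken only when
// the new option is on and the global use_direct_reads is off. In that case
// the compaction-input FileOptions copy has use_direct_reads forced to true;
// otherwise the regular compaction-read FileOptions are used unchanged.
inline CompactionInputPlan PlanCompactionInput(const ToyDbOptions& opts,
                                               bool base_direct_reads) {
  const bool bypass =
      opts.use_direct_io_for_compaction_reads && !opts.use_direct_reads;
  return CompactionInputPlan{bypass, bypass ? true : base_direct_reads};
}

// Spot-checks the default (off) and the opt-in case.
inline void CheckPlanExamples() {
  ToyDbOptions opts;
  assert(!PlanCompactionInput(opts, false).open_ephemeral_table_reader);
  opts.use_direct_io_for_compaction_reads = true;
  assert(PlanCompactionInput(opts, false).open_ephemeral_table_reader);
  assert(PlanCompactionInput(opts, false).input_direct_reads);
}
```

When the global `use_direct_reads` is already on, the sketch never takes the bypass, which matches the diff: in that configuration the cached readers were presumably opened for direct reads already, so there is nothing to work around.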

db/compaction/compaction_job.h

Lines changed: 2 additions & 0 deletions
@@ -462,6 +462,8 @@ class CompactionJob {
   FileSystemPtr fs_;
   // env_option optimized for compaction table reads
   FileOptions file_options_for_read_;
+  // file_options_for_read_ plus compaction-input-only overrides.
+  FileOptions file_options_for_compaction_input_read_;
   VersionSet* versions_;
   const std::atomic<bool>* shutting_down_;
   const std::atomic<bool>& manual_compaction_canceled_;
