Commit 7affaee
Add use_direct_io_for_compaction_reads option (#14743)
Summary:
Adds a new `DBOptions` option, `use_direct_io_for_compaction_reads` (default false). When on, compaction-input SST files are opened with `O_DIRECT` so that compaction's sequential, read-once data doesn't pollute the OS page cache and evict the hot user-read working set. User reads keep going through the buffered fast path. This protects user-read tail latency on write-heavy workloads without forcing user reads onto the existing global `use_direct_reads` knob, which costs throughput and P50 (see the bench below).
The interesting bit is that just flipping the FileOptions returned by `FileSystem::OptimizeForCompactionTableRead` doesn't actually trigger `O_DIRECT` at the kernel level. The TableCache (and `FileMetaData::pinned_reader`) is already holding buffered handles opened at flush time or at `DB::Open` via `LoadTableHandlers`. When compaction asks for an iterator, it gets back the cached buffered handle and the kernel never sees the `O_DIRECT` flag.
So this PR also adds a small bypass path:
- `TableCache::FindTable` / `NewIterator` learn an `open_ephemeral_table_reader` mode. When set, the pinned-reader fast path and the shared cache are skipped, `GetTableReader` is called directly with the caller's FileOptions, and ownership of the freshly opened TableReader is handed back via a `unique_ptr`. The iterator takes ownership via `RegisterCleanup` and frees the reader on destruction.
- `VersionSet::MakeInputIterator` and `LevelIterator` plumb the flag through both L0 and L1+ compaction-input paths.
- `CompactionJob::ProcessKeyValueCompaction` turns the bypass on when `use_direct_io_for_compaction_reads` is set, the global `use_direct_reads` is off, and `OptimizeForCompactionTableRead` produced `use_direct_reads=true` in the compaction-read FileOptions.
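The cached-handle problem and the bypass can be illustrated with a toy model. This is a minimal sketch, not RocksDB code: `ToyTableCache`, `ToyTableReader`, `ToyFileOptions`, and `ToyLookup` are hypothetical stand-ins, and the real `TableCache` API differs. It only shows that a cache hit ignores the caller's options, while the ephemeral path opens a fresh reader and hands ownership to the caller.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical stand-ins; none of these are the real RocksDB types.
struct ToyFileOptions {
  bool use_direct_reads = false;
};

struct ToyTableReader {
  bool opened_direct;  // whether the "file" was opened with O_DIRECT
};

// What a lookup hands back: either a borrowed pointer into the cache, or a
// freshly opened reader that the caller now owns.
struct ToyLookup {
  ToyTableReader* reader = nullptr;
  std::unique_ptr<ToyTableReader> owned;  // non-null only on the ephemeral path
};

class ToyTableCache {
 public:
  ToyLookup FindTable(uint64_t file_number, const ToyFileOptions& opts,
                      bool open_ephemeral_table_reader) {
    ToyLookup result;
    if (open_ephemeral_table_reader) {
      // Bypass: skip the shared cache and honor the caller's options.
      result.owned = std::make_unique<ToyTableReader>(
          ToyTableReader{opts.use_direct_reads});
      result.reader = result.owned.get();
      return result;
    }
    // Normal path: the first opener's options win. Later callers get the
    // cached handle no matter what `opts` says.
    auto it = cache_.find(file_number);
    if (it == cache_.end()) {
      it = cache_
               .emplace(file_number, std::make_unique<ToyTableReader>(
                                         ToyTableReader{opts.use_direct_reads}))
               .first;
    }
    result.reader = it->second.get();
    return result;
  }

 private:
  std::unordered_map<uint64_t, std::unique_ptr<ToyTableReader>> cache_;
};
```

In this model the caller keeps `owned` alive for as long as it uses `reader`, and the reader is destroyed with it. That plays the role the description assigns to `RegisterCleanup` on the iterator.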
The option is opt-in: when off, nothing changes for existing users. When on, only the compaction-input opens take the bypass path; user reads keep hitting the TableCache and the buffered fast path normally.
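The three-part enabling condition can be written as a single predicate. The sketch below uses hypothetical structs (`ToyDbOptions`, `ToyCompactionFileOptions`) and a hypothetical function name. It mirrors only the condition as stated in this description, not the actual code in `CompactionJob`.

```cpp
// Hypothetical, simplified mirrors of the settings named above.
struct ToyDbOptions {
  bool use_direct_reads = false;                    // existing global knob
  bool use_direct_io_for_compaction_reads = false;  // option added by this PR
};

struct ToyCompactionFileOptions {
  // As produced by OptimizeForCompactionTableRead in the description.
  bool use_direct_reads = false;
};

// True only when the new option is on, the global direct-read knob is off,
// and the compaction-read FileOptions actually request direct reads.
constexpr bool ShouldBypassTableCache(const ToyDbOptions& db,
                                      const ToyCompactionFileOptions& fo) {
  return db.use_direct_io_for_compaction_reads && !db.use_direct_reads &&
         fo.use_direct_reads;
}
```

With the defaults the predicate is false, which matches the opt-in behavior. When the global `use_direct_reads` is already on, the predicate is false as well, so no bypass is taken in that case.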
There's also a small db_bench helper in the same PR: a new `--bgwriter_num` flag that lets the writer thread in `readwhilewriting` (and the other "while writing" variants) spread its puts across `[0, bgwriter_num)` instead of `[0, num)`. Without this the readers and writer share a key range and you can't have both a hot read subset and meaningful compaction work — this lets you have both.
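A rough sketch of the key-range split follows. `NextKeyIndex` is a hypothetical helper, not db_bench code, and treating `bgwriter_num == 0` as "unset, fall back to `num`" is an assumption made for illustration.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: pick the next key index for a benchmark thread.
// Readers draw from [0, num). The background writer draws from
// [0, bgwriter_num) when that value is non-zero, otherwise from [0, num).
// The zero-means-unset fallback is an assumption, not taken from db_bench.
inline uint64_t NextKeyIndex(std::mt19937_64& rng, bool is_writer,
                             uint64_t num, uint64_t bgwriter_num) {
  const uint64_t range = (is_writer && bgwriter_num > 0) ? bgwriter_num : num;
  std::uniform_int_distribution<uint64_t> dist(0, range - 1);
  return dist(rng);
}
```

With `num=7500` and `bgwriter_num=3500000`, readers stay inside a small hot subset while the writer touches the whole keyspace, which is what keeps compaction busy.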
### Benchmark
Setup: Ubuntu 24.04 (kernel 7.0.5, OrbStack Linux VM on Apple Silicon), 14 vCPUs, virtio-blk disk, btrfs. MGLRU disabled (`echo 0 > /sys/kernel/mm/lru_gen/enabled`) so the kernel uses the classic active/inactive LRU. 14 GB DB (3.5M keys × 4 KB values), no compression. Each measurement run is pinned to a 1 GB cgroup via `systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0`. Page cache is dropped between configs. db_bench is a Release build.
Workload: `readwhilewriting` for 120s. 4 reader threads doing random reads over a hot key subset, plus 1 writer thread spreading overwrites across the full 3.5M-key keyspace (via `--bgwriter_num=3500000`) throttled at 200 MB/s, so there's continuous compaction running while the readers go.
The size of the hot reader subset relative to available page cache controls how visible the optimization is. The Cassandra blog ([Lightfoot 2026](https://lightfoot.dev/direct-i-o-for-cassandra-compaction-cutting-p99-read-latency-by-5x/)) documented the same thing: biggest wins when the hot set is big enough to actually compete for cache, smaller wins when the hot set trivially fits, neutral when the hot set is way bigger than cache. So I ran two hot-set sizes.
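The two hot-set sizes used in the sections below follow from `--num` × `--value_size`. The sketch below is a back-of-the-envelope check that counts values only, ignoring keys and block overhead. The helper names are made up for illustration.

```cpp
#include <cstdint>

// Approximate hot-set footprint: values only, ignoring keys and block overhead.
constexpr uint64_t HotSetBytes(uint64_t num_keys, uint64_t value_size) {
  return num_keys * value_size;
}

// Percentage of a memory limit that the hot set would occupy.
constexpr double PercentOfLimit(uint64_t bytes, uint64_t limit_bytes) {
  return 100.0 * static_cast<double>(bytes) / static_cast<double>(limit_bytes);
}

constexpr uint64_t kValueSize = 4096;          // --value_size
constexpr uint64_t kCgroupLimit = 1ULL << 30;  // MemoryMax=1G (1 GiB)

// --num=7500 gives roughly 30 MB; --num=100000 gives roughly 400 MB.
static_assert(HotSetBytes(7500, kValueSize) == 30720000, "~30 MB");
static_assert(HotSetBytes(100000, kValueSize) == 409600000, "~400 MB");
```

Against the 1 GiB cgroup limit these come to a little under 3% and a little under 40% respectively, in line with the rounded figures in the section headings.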
#### Small hot set: ~30 MB (~3% of the 1 GB cgroup) — N=5 iterations, mean (CV)
`--num=7500`. The hot set is small enough that the page cache holds it without much trouble even under compaction, so the wins here are real but on the modest side.
| Config | Throughput (ops/s) | Read P50 (µs) | Read P99 (µs) | Read P99.9 (µs) | Read P99.99 (µs) |
|---|---|---|---|---|---|
| buffered (default) | 233,477 (8.2%) | 16.09 | 82.24 | 721.0 | 2,102.5 |
| direct_compaction_writes_only (existing knob alone) | 287,405 (2.8%) — **+23.1%** | 13.00 (−19.2%) | **66.77 (−18.8%)** | 553.9 (−23.2%) | 1,787.6 (−15.0%) |
| direct_compaction_read_only (new knob alone) | 250,669 (2.4%) — +7.4% | 14.16 (−12.0%) | 102.99 (+25.2%) | 689.8 (−4.3%) | 1,801.3 (−14.3%) |
| direct_compaction_read_write (new + existing, recommended) | 277,920 (3.3%) — **+19.0%** | **12.99 (−19.3%)** | 84.23 (+2.4%) | 613.4 (−14.9%) | **1,738.2 (−17.3%)** |
| use_direct_reads=true (existing global) + write-side | 249,014 (2.5%) — +6.7% | 15.95 (−0.9%) | 68.78 (−16.4%) | **450.8 (−37.5%)** | 1,814.5 (−13.7%) |
CV is 2.4–3.3% on the optimized configs (8.2% on buffered), so the deltas are real. With a hot set this small, the existing `use_direct_io_for_flush_and_compaction` knob is already doing most of the work — the new flag's main extra contribution here is P99.99 (combined wins it by ~2 points vs writes-only-alone). Worth noting: the new flag *alone* (without the existing write-side flag) improves P99.99 but regresses P99 by 25% on this small-hot-set workload, because direct compaction reads lose kernel readahead and compaction-output writes are still hitting the page cache. That regression goes away once you combine with the existing write-side flag, or once the hot set is bigger (see next table). So if you're using just one knob, use the existing one. If you're using this PR's flag, pair it with `use_direct_io_for_flush_and_compaction=true`.
#### Larger hot set: ~400 MB (~40% of cache) — N=5 iterations, mean (CV)
`--num=100000`. This is the case the Cassandra blog calls out — hot set big enough to actually fight compaction for cache. Their analogous setup (1M hot partitions, ~33% hot/cache) reported 1.93× p99 improvement. Numbers here are the headline:
| Config | Throughput (ops/s) | Read P50 (µs) | Read P99 (µs) | Read P99.9 (µs) | Read P99.99 (µs) |
|---|---|---|---|---|---|
| buffered (default) | 68,959 (7.7%) | 44.81 | 541.22 | 2,225.2 | 11,334.5 |
| direct_compaction_writes_only (existing knob alone) | 73,973 (10.3%) — +7.3% | 42.22 (−5.8%) | 456.27 (−15.7%) | 2,016.9 (−9.4%) | 9,190.0 (−18.9%) |
| direct_compaction_read_only (new knob alone) | 84,337 (2.3%) — +22.3% | 38.66 (−13.7%) | 386.97 (−28.5%) | 1,644.8 (−26.1%) | 4,837.9 (−57.3%, 2.34×) |
| direct_compaction_read_write (new + existing, recommended) | **104,923 (8.4%) — +52.2%** | **34.26 (−23.5%)** | **290.97 (−46.2%)** | **1,143.4 (−48.6%)** | **3,080.3 (−72.8%, 3.68×)** |
| use_direct_reads=true (existing global) + write-side | 71,598 (9.1%) — +3.8% | 51.33 (+14.5%) | 297.91 (−45.0%) | 1,663.6 (−25.2%) | 6,530.0 (−42.4%) |
Combined config gets a 3.68× p99.99 win, 1.86× p99, p50 down 23%, throughput up 52%. Same shape as the Cassandra blog's 1.93× p99 result. The improvement just lands at deeper percentiles for us because RocksDB's baseline data path is roughly 65× faster than Cassandra's (their buffered p99 was 35 ms, ours is 0.54 ms), so the cache-miss tail is further out.
A few things worth calling out from this table:
- The new flag is doing real work on top of the existing write-side flag here, not just shifting things around. Combined throughput is +42% over `direct_compaction_writes_only` alone, and combined p99.99 is 3× better. The existing knob alone gives a fairly modest +7% throughput / -19% p99.99 in this case — there's a clear gap that the new flag fills.
- The new flag *alone* (no existing write-side flag) is also a real improvement here: +22% throughput, p99.99 down 57%. The P99 regression we saw in the small-hot-set case is gone, because the cache-protection effect now dominates the lost-readahead cost.
- `use_direct_reads=true` (the existing global flag) actually regresses P50 by 14.5% in this workload — taking user reads off the page cache hurts you when the hot data could have been cached. It also gets the worst throughput of any direct config. It's not an equivalent way to get these gains.
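The percent deltas and N× factors quoted for these tables are plain ratios against the buffered baseline, or against another row where stated. A small sketch for re-deriving them from the raw columns follows. The function names are illustrative only.

```cpp
// Percent change relative to a baseline; negative means below the baseline.
constexpr double PercentDelta(double baseline, double value) {
  return 100.0 * (value - baseline) / baseline;
}

// Improvement factor for a latency: baseline divided by the new value.
constexpr double ImprovementFactor(double baseline_latency, double latency) {
  return baseline_latency / latency;
}
```

For the combined row in the larger-hot-set table, `ImprovementFactor(11334.5, 3080.3)` is about 3.68 and `PercentDelta(68959, 104923)` is about +52.2. Against the writes-only row, `PercentDelta(73973, 104923)` is about +42 and `ImprovementFactor(9190.0, 3080.3)` is about 3.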
### `compaction_readahead_size` matters when this flag is on
Direct I/O bypasses kernel readahead, so RocksDB's own `DBOptions::compaction_readahead_size` becomes the only prefetch the iterator has. The default of 2 MB is enough and real users will get it automatically. **But `db_bench`'s `--compaction_readahead_size` CLI default is 0**, which defeats prefetch and makes direct compaction look slower than it actually is. If you're reproducing the numbers above, pass `--compaction_readahead_size=2097152` (or larger).
- Recommended production config is `use_direct_io_for_compaction_reads=true` + `use_direct_io_for_flush_and_compaction=true`. It is the strongest configuration at every percentile and in throughput in the larger-hot-set bench. In the small-hot-set bench it is best at P50 and P99.99, while `direct_compaction_writes_only` leads on throughput and P99.
- The new flag is the read-side counterpart to `use_direct_io_for_flush_and_compaction`, which handles compaction-write cache pollution. They address different sources of pollution and compose. The gap between "combined" and "writes-only-alone" is about 2 percentage points on p99.99 in the small-hot-set bench and 54 points in the larger one, so the new flag is contributing real value, especially as the hot set grows.
- The new flag alone is also a real improvement when the hot set is big enough to compete with cache (+22% throughput, 2.34× p99.99 in the larger-hot-set bench). On a very small hot set it improves p99.99 but regresses p99, so pairing with the existing write-side flag is safer.
- The benefit is workload-dependent. Small hot sets get modest tail-latency wins. Hot sets sized to actually compete for cache get the big multi-percentile wins shown above. Hot sets bigger than cache (not benched here but covered in the Cassandra blog) see no change either way — every read misses regardless.
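For an application (as opposed to `db_bench`), the recommended combination might be set roughly as follows. This is a configuration sketch, not code from the PR: `MakeRecommendedOptions` is a hypothetical helper name, and `use_direct_io_for_compaction_reads` is available only in RocksDB builds that include this commit.

```cpp
#include <rocksdb/options.h>

// Sketch only. The new field below exists only in builds with this commit.
rocksdb::Options MakeRecommendedOptions() {
  rocksdb::Options options;
  options.use_direct_reads = false;                       // user reads stay buffered
  options.use_direct_io_for_flush_and_compaction = true;  // existing write-side knob
  options.use_direct_io_for_compaction_reads = true;      // new read-side knob
  // Direct I/O gets no kernel readahead, so keep RocksDB's own prefetch on.
  // 2 MB is the library default described above; set explicitly for clarity.
  options.compaction_readahead_size = 2 * 1024 * 1024;
  return options;
}
```

Setting `compaction_readahead_size` explicitly is redundant with the library default, but it guards against a wrapper or tool that zeroes it the way the `db_bench` CLI default does.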
### Reproducing
Any Linux host (or a Linux VM on macOS via OrbStack / Multipass / lima):
```bash
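# Run from the root of a RocksDB checkout.
sudo apt-get install -y build-essential clang cmake git pkg-config \
  libgflags-dev libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev libzstd-dev
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DPORTABLE=1 -DWITH_GFLAGS=1 -DWITH_TESTS=0 ..
make -j"$(nproc)" db_bench
# Disable MGLRU so the kernel uses the classic active/inactive LRU.
echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled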
```
Build the source DB once, unrestricted memory:
```bash
./db_bench --benchmarks=fillrandom,compact,waitforcompaction,stats \
  --db=/path/to/source_db --num=3500000 --key_size=16 --value_size=4096 \
  --write_buffer_size=16777216 --target_file_size_base=16777216 \
  --max_background_jobs=4 --compression_type=none --cache_size=4194304 \
  --max_bytes_for_level_base=67108864 --disable_wal=1 --sync=0
```
For each config, copy `source_db` to `scratch_db`, drop the page cache with `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`, then:
```bash
# Replace each {true|false} below with the value for the config under test.
sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 \
  ./db_bench --use_existing_db=1 \
  --benchmarks=readwhilewriting,stats --db=/path/to/scratch_db \
  --threads=5 --duration=120 --statistics=true --histogram=1 \
  --num=7500 --bgwriter_num=3500000 \
  --key_size=16 --value_size=4096 \
  --write_buffer_size=16777216 --target_file_size_base=16777216 \
  --max_background_jobs=4 --compression_type=none \
  --cache_size=4194304 --open_files=200 \
  --skip_stats_update_on_db_open=true \
  --max_bytes_for_level_base=67108864 \
  --benchmark_write_rate_limit=209715200 \
  --compaction_readahead_size=2097152 \
  --rate_limiter_bytes_per_sec=0 \
  --use_direct_reads={true|false} \
  --use_direct_io_for_compaction_reads={true|false} \
  --use_direct_io_for_flush_and_compaction={true|false}
```
For the larger hot-set table, change `--num=7500` to `--num=100000`.
The five configs in the tables:
- `buffered`: all three flags false.
- `direct_compaction_writes_only`: `use_direct_io_for_flush_and_compaction=true`, the other two false. This is what users have today without this PR.
- `direct_compaction_read_only`: `use_direct_io_for_compaction_reads=true`, the other two false.
- `direct_compaction_read_write`: `use_direct_io_for_compaction_reads=true`, `use_direct_io_for_flush_and_compaction=true`, `use_direct_reads=false`. **Recommended.**
- `direct_all` (the "use_direct_reads=true (existing global) + write-side" row in the tables): `use_direct_reads=true`, `use_direct_io_for_flush_and_compaction=true`, `use_direct_io_for_compaction_reads=false`.
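The five configs form a small matrix over the three flags. A sketch that encodes the list above and renders the corresponding `db_bench` arguments follows. `BenchConfig`, `AllConfigs`, and `FlagsFor` are names made up for illustration and are not part of the PR.

```cpp
#include <array>
#include <string>

// One benchmark configuration: a label plus the three direct-I/O flags.
struct BenchConfig {
  const char* name;
  bool use_direct_reads;
  bool use_direct_io_for_compaction_reads;
  bool use_direct_io_for_flush_and_compaction;
};

// The five configs, in the order they are listed above.
inline std::array<BenchConfig, 5> AllConfigs() {
  return {{
      {"buffered", false, false, false},
      {"direct_compaction_writes_only", false, false, true},
      {"direct_compaction_read_only", false, true, false},
      {"direct_compaction_read_write", false, true, true},
      {"direct_all", true, false, true},
  }};
}

// Render the three flags as db_bench command-line arguments.
inline std::string FlagsFor(const BenchConfig& c) {
  auto b = [](bool v) { return v ? "true" : "false"; };
  return std::string("--use_direct_reads=") + b(c.use_direct_reads) +
         " --use_direct_io_for_compaction_reads=" +
         b(c.use_direct_io_for_compaction_reads) +
         " --use_direct_io_for_flush_and_compaction=" +
         b(c.use_direct_io_for_flush_and_compaction);
}
```

A driver script could loop over `AllConfigs()` and append `FlagsFor(c)` to the fixed part of the command, copying the source DB and dropping the page cache before each run.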
Pull Request resolved: #14743
Reviewed By: pdillinger
Differential Revision: D108017601
Pulled By: xingbowang
fbshipit-source-id: 4039d490d7e77b476db7a477a2f3d24738db6336
1 parent 828f6d1 commit 7affaee
34 files changed
Lines changed: 1137 additions & 65 deletions
File tree
- db_stress_tool
- db
  - compaction
  - db_impl
- env
- include/rocksdb
- options
- table
  - block_based
- test_util
- tools
- unreleased_history
  - behavior_changes
  - new_features