Pass file metadata to IngestExternalFile to improve ingestion latency (#14837)#14837
Pass file metadata to IngestExternalFile to improve ingestion latency (#14837)#14837joshkang97 wants to merge 2 commits into
Conversation
|
@joshkang97 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107721261. |
|
| Check | Count |
|---|---|
cppcoreguidelines-special-member-functions |
1 |
| Total | 1 |
Details
db/db_impl/db_impl.cc (1 warning(s))
db/db_impl/db_impl.cc:6724:7: warning: class 'FileIngestionHandleImpl' defines a non-default destructor but does not define a move constructor or a move assignment operator [cppcoreguidelines-special-member-functions]
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 8d5e7ff ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 8d5e7ff SummaryWell-designed optimization that reduces external file ingestion latency ~34% by reusing SstFileWriter metadata. The code is largely correct, with the two main code paths (fast and scan) producing equivalent results for all common scenarios. A few gaps in test coverage and a minor hot-path performance concern should be addressed. High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🟡 MEDIUMM1. Extra per-key string copy in
|
| Context | Executes? | Correct? | Notes |
|---|---|---|---|
| WritePreparedTxnDB | YES | YES | Seqno assignment in Run() unaffected |
| ReadOnly DB | NO | N/A | IngestExternalFile not available |
| allow_db_generated_files | YES | YES (with caveat L1) | SstFileWriter seqnos are always 0 |
| User-defined timestamps | YES | YES | Padding logic verified correct |
| UDT in Memtables only | YES | YES (needs test M2) | Strip/pad logic consistent |
| write_global_seqno=true | NO | N/A | Correctly rejected in validation |
| Range-del-only files | YES | Likely correct but untested (M3) | MaybeUpdateRange handles it |
| Range-del-only + UDT | YES | BUG | Empty InternalKey::Encode() assertion |
| FIFO/Universal compaction | YES | YES | Affects Run(), not Prepare() |
| atomic_replace_range | YES | YES | Orthogonal to metadata acquisition |
| verify_checksums_before_ingest | YES | YES | Opens file only for verification |
Key verified equivalences:
MaybeUpdateRangetakes min/max of boundaries regardless of call count, so single-synthetic-tombstone fast path equals per-tombstone scan pathInternalKey::DecodeFrom()copies data (rep_.assign()), no lifetime concernsexternal_sst_file_global_seqno_offsetis 0 in in-memory properties, safely handled sincewrite_global_seqnois rejectedGetTableProperties()returns by value, safe afterFinish()- Internal key encoding in SstFileWriter matches what TableReader returns
Positive Observations
- Clean separation of fast/scan paths with shared validation and checksum verification
- Good use of sync points for testing (
GetIngestedFileInfo:ReadPath) - Comprehensive db_stress integration with appropriate exclusions
- Well-designed benchmark in db_bench that isolates the optimization
- Thorough input validation (size mismatch, null pointers, write_global_seqno incompatibility)
- Ingestion timing histograms are a valuable addition independent of the fast path
- The code correctly handles the UDT timestamp lifecycle (strip in writer, pad in ingestion)
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
…facebook#14837) Summary: External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied. This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns an opaque `PreparedFileInfo` containing file size, table properties, and prepared internal `smallest`/`largest` bounds. Point-key and range-deletion bounds are folded into that same bound pair, so ingestion has one prepared boundary shape regardless of whether the file contains point keys, range deletions, or both. A new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and checksum verification still opens the file when `verify_checksums_before_ingest` is set. Range tombstones and user-defined-timestamp files (including the "UDT in Memtables only" format, whose persisted boundary keys carry no timestamp) are handled. Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use the RocksDB comparators, and the no-persisted-timestamp path pads prepared bounds back to the internal timestamp shape before installing the file. ## Benchmark results `db_bench` `ingestexternalfile` benchmark, Buck opt build (`mode/opt`), on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. | Config | Prepare SUM / P50 | Run SUM / P50 | Total benchmark time | | --- | ---: | ---: | ---: | | 1M keys/SST, baseline | 205.304 ms / 19.930 ms | 73.880 ms / 5.622 ms | 44.227 s | | 1M keys/SST, `file_info=true` | 73.077 ms / 8.250 ms | 125.509 ms / 11.950 ms | 42.813 s | | 2M keys/SST, baseline | 243.766 ms / 25.930 ms | 104.723 ms / 8.433 ms | 84.616 s | | 2M keys/SST, `file_info=true` | 70.712 ms / 7.733 ms | 240.838 ms / 18.000 ms | 84.946 s | The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly expected link/sync durability work (`link()` and `fdatasync()`), not metadata scanning. Command shape: ``` cd /data/users/jkangs/fbsource/fbcode buck run mode/opt //internal_repo_rocksdb/repo:db_bench -- --benchmarks=ingestexternalfile --num=<1000000|2000000> --value_size=100 \ --compression_type=none --ingest_external_file_batch_size=10 \ --ingest_external_file_num_batches=10 --disable_auto_compactions=1 \ --statistics --db=<db_path> --wal_dir=<wal_path> \ --ingest_external_file_use_file_info=<false|true> ``` Differential Revision: D107721261
8d5e7ff to
e634e35
Compare
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit e634e35 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
…facebook#14837) Summary: External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied. This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns a `PreparedFileInfo` with the file size, table properties, and prepared internal `smallest`/`largest` bounds, and a new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when `verify_checksums_before_ingest` is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled. Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file. ## Benchmark Results `db_bench ingestexternalfile` benchmark, release build, on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. Each generated SST was `121,883,007` bytes (`116.24 MiB`). The benchmark used 1M keys/SST and 5 ingest batches per run. | Batch size | Files/run | Config | Prepare P50 | Run P50 | Total ingestion P50 | Total P50 drop | | --- | --- | --- | --- | --- | --- | --- | | 10 | 50 | Baseline | 20.667 ms | 5.500 ms | 26.167 ms | -- | | 10 | 50 | `file_info=true` | 7.838 ms | 9.826 ms | 17.664 ms | 32.5% | | 30 | 150 | Baseline | 59.278 ms | 20.667 ms | 79.945 ms | -- | | 30 | 150 | `file_info=true` | 18.000 ms | 31.167 ms | 49.167 ms | 38.5% | | 50 | 250 | Baseline | 92.500 ms | 37.250 ms | 129.750 ms | -- | | 50 | 250 | `file_info=true` | 29.416 ms | 62.500 ms | 91.916 ms | 29.2% | The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly due to link/sync syscalls. Benchmark args: `--benchmarks=ingestexternalfile --num=1000000 --value_size=100 --compression_type=none --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=5 --disable_auto_compactions=1 --statistics --ingest_external_file_use_file_info=<false|true>`. Differential Revision: D107721261
e634e35 to
69a4b4e
Compare
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit e634e35 SummaryWell-designed performance optimization that adds a metadata fast-path to external SST file ingestion, cutting High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🔴 HIGHNone. 🟡 MEDIUMM1. Missing test for range-deletion-only files via fast-path --
|
| Context | Executes? | Safe? | Notes |
|---|---|---|---|
| WritePreparedTxnDB | YES | YES | Seqno reassigned in common path |
| UDT (persisted) | YES | YES | Boundaries carry timestamps |
| UDT (memtable only) | YES | YES | Padding via ExtractValueType is correct |
| FIFO compaction | YES | YES | Level assignment in common path |
| allow_db_generated_files | YES | YES | Handled explicitly |
| verify_checksums | YES | YES | Opens file in common path |
Key claims verified: unique_id IS set (common path), start_ukey/limit_ukey IS set (common path), boundary computation IS equivalent (uses same SerializeKey/SerializeEndKey logic), write_global_seqno rejection IS enforced.
Positive Observations
- Clean dispatcher pattern with common post-processing
- Good validation (null pointers, size mismatches, write_global_seqno)
- Debug assertions verifying range-del timestamps
- Well-integrated stress test with configurable probability
- Test sync point verifying fast-path is taken
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 69a4b4e ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 69a4b4e SummaryWell-designed performance optimization that cleanly splits ingestion metadata acquisition into a fast-path (reuse SstFileWriter metadata) and scan-path. The boundary computation refactoring is thorough, with careful handling of UDT (both persisted and stripped), range deletions, and sequence number bounds. Good test coverage with unit tests, stress test integration, db_bench support, and crash test configuration. High-severity findings (1):
Full review (click to expand)Findings🔴 HIGHH1. Unconditional boundary computation in Finish() penalizes existing callers --
|
| Context | Fast path handles correctly? | Verification |
|---|---|---|
| UDT persisted | YES | Range dels re-serialized with max_ts in Finish(); tested in TimestampsPersistedIngestWithFileInfoRangeDeletionBounds |
| UDT not persisted (strip) | YES | Timestamps stripped in Finish(), padded back in GetIngestedFileInfoFromFileInfo; debug assertion in scan path validates max_ts invariant for range dels; tested in TimestampsNotPersistedIngestWithFileInfoRangeDeletionBounds |
| Range deletions | YES | Boundary extension tracked via SerializeKey/SerializeEndKey; tested in IngestWithFileInfoRangeDeletion |
| allow_db_generated_files | YES | Seqno bounds from table properties; rejection tested |
| verify_checksums | YES | Opens file just for verification; tested in IngestWithFileInfoVerifiesChecksum |
| write_global_seqno | REJECTED | Validated in db_impl.cc; tested in IngestWithFileInfoRejectsWriteGlobalSeqno |
| Concurrent ingestions | YES | shared_ptr is safe; stress test exercises concurrent paths |
| File modification between Finish/Ingest | Same risk as existing API | Both paths read the file at ingestion time if checksums enabled |
Positive Observations
- Clean separation of concerns:
GetIngestedFileInfoFromFileInfo(fast) vsGetIngestedFileInfoFromFile(scan) with shared post-processing inGetIngestedFileInfowrapper. The common wrapper handles checksum verification, FD assignment, and start_ukey/limit_ukey computation uniformly. - Thorough input validation in
db_impl.cc: null pointer check, size mismatch check, write_global_seqno incompatibility. - Debug assertions in the scan path (new
#ifndef NDEBUGblock) cross-validate the assumption that range tombstone keys use max timestamp, providing a safety net against future regressions. - Well-integrated stress testing with configurable probability via
ingest_external_file_use_file_info_one_in. - db_bench integration provides a reproducible benchmark with clear methodology (link files, isolate ingest path).
- The
IngestWithFileInfotest uses sync points to verify the scan path is actually skipped (read_path_countcheck). - Range deletion boundary tracking refactored to use
InternalKeythroughout SstFileWriter, eliminating a user-key/internal-key impedance mismatch.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
) Summary: This change splits ingestion into two public calls: `PrepareFileIngestion()` performs all of the off-mutex work and returns an opaque `FileIngestionHandle`, and `CommitFileIngestionHandle()` makes the prepared files visible under the mutex. Even though the "Prepare" phase was off the DB mutex, the application layer may still have application logic that depends on the completion of `IngestExternalFile`. `IngestExternalFiles(args)` is also just `PrepareFileIngestion(args)` followed by `CommitFileIngestionHandle()`, so its behavior is unchanged. `CommitFileIngestionHandles()` also supports committing multiple handles atomically. However, it is possible that a CF may be present in multiple handles. In this case, we merge the ingestion jobs together. This allows applications to prepare file ingestions at different times, but still provide a single atomic commit. Differential Revision: D108225105
…facebook#14837) Summary: External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied. This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns a `PreparedFileInfo` with the file size, table properties, and prepared internal `smallest`/`largest` bounds, and a new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when `verify_checksums_before_ingest` is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled. Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file. The `ingestexternalfile` `db_bench` benchmark now also exposes `--ingest_external_file_fill_cache` so ingestion-read block-cache effects can be controlled independently while comparing the file-info fast path against the normal open-and-scan path. ## Benchmark Results Existing `db_bench ingestexternalfile` benchmark, release build, on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. These numbers used the benchmark's previous default `fill_cache=true`. Each generated SST was `121,883,007` bytes (`116.24 MiB`). The benchmark used 1M keys/SST and 5 ingest batches per run. | Batch size | Files/run | Config | Prepare P50 | Run P50 | Total ingestion P50 | Total P50 drop | | --- | --- | --- | --- | --- | --- | --- | | 10 | 50 | Baseline | 20.667 ms | 5.500 ms | 26.167 ms | -- | | 10 | 50 | `file_info=true` | 7.838 ms | 9.826 ms | 17.664 ms | 32.5% | | 30 | 150 | Baseline | 59.278 ms | 20.667 ms | 79.945 ms | -- | | 30 | 150 | `file_info=true` | 18.000 ms | 31.167 ms | 49.167 ms | 38.5% | | 50 | 250 | Baseline | 92.500 ms | 37.250 ms | 129.750 ms | -- | | 50 | 250 | `file_info=true` | 29.416 ms | 62.500 ms | 91.916 ms | 29.2% | The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly due to link/sync syscalls. Updated fill-cache-isolated benchmark args: `--benchmarks=ingestexternalfile --num=1000000 --value_size=100 --compression_type=none --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=5 --disable_auto_compactions=1 --statistics --ingest_external_file_use_file_info=<false|true> --ingest_external_file_fill_cache=false`. Differential Revision: D107721261
69a4b4e to
507675f
Compare
Summary:
External file ingestion (
DB::IngestExternalFile[s]) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied.This change lets a caller skip that work when it already has the file's metadata.
SstFileWriter::Finish()now returns aPreparedFileInfowith the file size, table properties, and prepared internalsmallest/largestbounds, and a newIngestExternalFileArg::file_infosfield carries onePreparedFileInfoper file intoIngestExternalFiles(). When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified whenverify_checksums_before_ingestis set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled.Internally, ingestion-job metadata acquisition was split into
GetIngestedFileInfoFromFileInfo(reuse caller metadata) andGetIngestedFileInfoFromFile(open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file.The
ingestexternalfiledb_benchbenchmark now also exposes--ingest_external_file_fill_cacheso ingestion-read block-cache effects can be controlled independently while comparing the file-info fast path against the normal open-and-scan path.Benchmark Results
Existing
db_bench ingestexternalfilebenchmark, release build, on SSD (btrfs). Files are linked (move_files) so the measurement isolates the ingest path rather than file-copy throughput. These numbers used the benchmark's previous defaultfill_cache=true.Each generated SST was
121,883,007bytes (116.24 MiB). The benchmark used 1M keys/SST and 5 ingest batches per run.file_info=truefile_info=truefile_info=trueThe prepared metadata path cuts
Prepare()substantially.Run()is longer withfile_info=truebecause the baseline path opens/scans the table duringPrepare(), which warms OS/block-cache state before the later table-cache open inRun(). A syscall trace of thefile_info=truepath showed the remaining prepare time is mostly due to link/sync syscalls.Updated fill-cache-isolated benchmark args:
--benchmarks=ingestexternalfile --num=1000000 --value_size=100 --compression_type=none --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=5 --disable_auto_compactions=1 --statistics --ingest_external_file_use_file_info=<false|true> --ingest_external_file_fill_cache=false.Differential Revision: D107721261