Skip to content

Pass file metadata to IngestExternalFile to improve ingestion latency (#14837)#14837

Open
joshkang97 wants to merge 2 commits into
facebook:mainfrom
joshkang97:export-D107721261
Open

Pass file metadata to IngestExternalFile to improve ingestion latency (#14837)#14837
joshkang97 wants to merge 2 commits into
facebook:mainfrom
joshkang97:export-D107721261

Conversation

@joshkang97

@joshkang97 joshkang97 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary:

External file ingestion (DB::IngestExternalFile[s]) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied.

This change lets a caller skip that work when it already has the file's metadata. SstFileWriter::Finish() now returns a PreparedFileInfo with the file size, table properties, and prepared internal smallest/largest bounds, and a new IngestExternalFileArg::file_infos field carries one PreparedFileInfo per file into IngestExternalFiles(). When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when verify_checksums_before_ingest is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled.

Internally, ingestion-job metadata acquisition was split into GetIngestedFileInfoFromFileInfo (reuse caller metadata) and GetIngestedFileInfoFromFile (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file.

The ingestexternalfile db_bench benchmark now also exposes --ingest_external_file_fill_cache so ingestion-read block-cache effects can be controlled independently while comparing the file-info fast path against the normal open-and-scan path.

Benchmark Results

Existing db_bench ingestexternalfile benchmark, release build, on SSD (btrfs). Files are linked (move_files) so the measurement isolates the ingest path rather than file-copy throughput. These numbers used the benchmark's previous default fill_cache=true.

Each generated SST was 121,883,007 bytes (116.24 MiB). The benchmark used 1M keys/SST and 5 ingest batches per run.

Batch size Files/run Config Prepare P50 Run P50 Total ingestion P50 Total P50 drop
10 50 Baseline 20.667 ms 5.500 ms 26.167 ms --
10 50 file_info=true 7.838 ms 9.826 ms 17.664 ms 32.5%
30 150 Baseline 59.278 ms 20.667 ms 79.945 ms --
30 150 file_info=true 18.000 ms 31.167 ms 49.167 ms 38.5%
50 250 Baseline 92.500 ms 37.250 ms 129.750 ms --
50 250 file_info=true 29.416 ms 62.500 ms 91.916 ms 29.2%

The prepared metadata path cuts Prepare() substantially. Run() is longer with file_info=true because the baseline path opens/scans the table during Prepare(), which warms OS/block-cache state before the later table-cache open in Run(). A syscall trace of the file_info=true path showed the remaining prepare time is mostly due to link/sync syscalls.

Updated fill-cache-isolated benchmark args: --benchmarks=ingestexternalfile --num=1000000 --value_size=100 --compression_type=none --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=5 --disable_auto_compactions=1 --statistics --ingest_external_file_use_file_info=<false|true> --ingest_external_file_fill_cache=false.

Differential Revision: D107721261

@meta-cla meta-cla Bot added the CLA Signed label Jun 8, 2026
@meta-codesync

meta-codesync Bot commented Jun 8, 2026

Copy link
Copy Markdown

@joshkang97 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107721261.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

⚠️ clang-tidy: 1 warning(s) on changed lines

Completed in 483.8s.

Summary by check

Check Count
cppcoreguidelines-special-member-functions 1
Total 1

Details

db/db_impl/db_impl.cc (1 warning(s))
db/db_impl/db_impl.cc:6724:7: warning: class 'FileIngestionHandleImpl' defines a non-default destructor but does not define a move constructor or a move assignment operator [cppcoreguidelines-special-member-functions]

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 8d5e7ff


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 8d5e7ff


Summary

Well-designed optimization that reduces external file ingestion latency ~34% by reusing SstFileWriter metadata. The code is largely correct, with the two main code paths (fast and scan) producing equivalent results for all common scenarios. A few gaps in test coverage and a minor hot-path performance concern should be addressed.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Extra per-key string copy in AddImpl() hot path -- table/sst_file_writer.cc:130
  • Issue: largest_internal_key.assign(encoded_ikey.data(), encoded_ikey.size()) is called on every Put()/Merge()/Delete(). This doubles the number of per-key string assignments compared to the baseline (which already does largest_key.assign()). For workloads inserting millions of keys, this adds non-trivial copying overhead even when the caller never requests PreparedFileInfo.
  • Root cause: The tracking fields are unconditionally maintained regardless of whether the caller will use the prepared metadata.
  • Suggested fix: Defer the largest_internal_key capture to Finish(). Since keys are added in strictly increasing order, the last key written to the builder is the largest. At Finish() time, the builder is still alive and the last key is deterministic. Alternatively, only track these fields when a flag (set in Open() or at construction time) indicates the caller intends to request PreparedFileInfo.
M2. Missing test coverage for UDT (user-defined timestamps) fast path -- db/external_sst_file_test.cc
  • Issue: The fast path has dedicated logic for the "UDT in Memtables only" scenario (timestamp stripping in Finish(), timestamp padding in GetIngestedFileInfoFromFileInfo()), but no unit test exercises this path. If the padding logic produces keys that differ from the scan path, ingestion would silently produce incorrect file boundaries.
  • Root cause: The existing UDT ingestion tests (e.g., TimestampedExternalSSTFileTest) do not use ingest_with_file_info=true.
  • Suggested fix: Add a test that creates an SstFileWriter with persist_user_defined_timestamps=false (UDT-in-Memtable-only mode), writes keys, obtains PreparedFileInfo, ingests via file_infos, and verifies the data reads back correctly. Compare against the scan-path result.
M3. Missing test for range-del-only files on fast path -- db/external_sst_file_test.cc
  • Issue: The stress test explicitly excludes range-del-only files from the fast path (!test_standalone_range_deletion), and there is no unit test covering this scenario. While the logic appears correct (MaybeUpdateRange will set boundaries from the range tombstone), this is an untested edge case.
  • Root cause: The stress test exclusion suggests awareness of this gap but no corresponding unit test was added.
  • Suggested fix: Either (a) add a unit test proving range-del-only files work on the fast path, or (b) add an explicit InvalidArgument guard rejecting range-del-only files with file_infos and document the limitation. Note: range-del-only files combined with UDT would hit an assertion in InternalKey::Encode() due to empty keys being padded -- this should be guarded.
M4. Missing test for multiple files in a single IngestExternalFiles call via file_infos -- db/external_sst_file_test.cc
  • Issue: All unit tests ingest a single file via file_infos. The API supports multiple files per call, but multi-file ingestion on the fast path is only exercised by db_bench and potentially db_stress (which uses a single SST file per ingest).
  • Root cause: Test gap.
  • Suggested fix: Add a test that creates 2-3 SST files, obtains PreparedFileInfo for each, and ingests them all in a single IngestExternalFiles call.

🟢 LOW / NIT

L1. allow_db_generated_files + file_infos interaction -- db/external_sst_file_ingestion_job.cc:1095
  • Issue: For allow_db_generated_files=true, the fast path uses table_properties.key_smallest_seqno / key_largest_seqno as seqno bounds. SstFileWriter always writes seqno=0, so these will always be 0. While this is technically correct (SstFileWriter files have seqno=0), allow_db_generated_files is designed for DB-generated files with non-zero seqnos. If someone manually constructs a PreparedFileInfo for a DB-generated file, the seqno bounds might be wrong.
  • Suggested fix: Add a comment clarifying that file_infos is only supported for SstFileWriter-produced files, or add a validation check when allow_db_generated_files=true.
L2. TableProperties deep copy in Finish() -- table/sst_file_writer.cc:565
  • Issue: prepared->table_properties = r->builder->GetTableProperties() performs a full deep copy of TableProperties including all string fields. This is a one-per-file cost so it's not on a hot path, but could be optimized with shared ownership.
  • Suggested fix: Consider using std::shared_ptr<const TableProperties> if the builder supports it, or accept the copy as acceptable warm-path cost.
L3. IngestWithFileInfoRangeDeletion test could cover more scenarios -- db/external_sst_file_test.cc:419
  • Issue: The range deletion test uses a single DeleteRange. Multiple non-contiguous range tombstones would exercise the boundary extension logic more thoroughly. While the logic is correct (MaybeUpdateRange takes min/max regardless of call count), a test with multiple tombstones would provide stronger confidence.
  • Suggested fix: Add a variant with 2-3 non-contiguous range tombstones.

Cross-Component Analysis

Context Executes? Correct? Notes
WritePreparedTxnDB YES YES Seqno assignment in Run() unaffected
ReadOnly DB NO N/A IngestExternalFile not available
allow_db_generated_files YES YES (with caveat L1) SstFileWriter seqnos are always 0
User-defined timestamps YES YES Padding logic verified correct
UDT in Memtables only YES YES (needs test M2) Strip/pad logic consistent
write_global_seqno=true NO N/A Correctly rejected in validation
Range-del-only files YES Likely correct but untested (M3) MaybeUpdateRange handles it
Range-del-only + UDT YES BUG Empty InternalKey::Encode() assertion
FIFO/Universal compaction YES YES Affects Run(), not Prepare()
atomic_replace_range YES YES Orthogonal to metadata acquisition
verify_checksums_before_ingest YES YES Opens file only for verification

Key verified equivalences:

  • MaybeUpdateRange takes min/max of boundaries regardless of call count, so single-synthetic-tombstone fast path equals per-tombstone scan path
  • InternalKey::DecodeFrom() copies data (rep_.assign()), no lifetime concerns
  • external_sst_file_global_seqno_offset is 0 in in-memory properties, safely handled since write_global_seqno is rejected
  • GetTableProperties() returns by value, safe after Finish()
  • Internal key encoding in SstFileWriter matches what TableReader returns

Positive Observations

  • Clean separation of fast/scan paths with shared validation and checksum verification
  • Good use of sync points for testing (GetIngestedFileInfo:ReadPath)
  • Comprehensive db_stress integration with appropriate exclusions
  • Well-designed benchmark in db_bench that isolates the optimization
  • Thorough input validation (size mismatch, null pointers, write_global_seqno incompatibility)
  • Ingestion timing histograms are a valuable addition independent of the fast path
  • The code correctly handles the UDT timestamp lifecycle (strip in writer, pad in ingestion)

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@meta-codesync meta-codesync Bot changed the title Pass file metadata to IngestExternalFile to improve ingestion latency Pass file metadata to IngestExternalFile to improve ingestion latency (#14837) Jun 9, 2026
joshkang97 added a commit to joshkang97/rocksdb that referenced this pull request Jun 9, 2026
…facebook#14837)

Summary:

External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied.

This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns an opaque `PreparedFileInfo` containing file size, table properties, and prepared internal `smallest`/`largest` bounds. Point-key and range-deletion bounds are folded into that same bound pair, so ingestion has one prepared boundary shape regardless of whether the file contains point keys, range deletions, or both. A new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and checksum verification still opens the file when `verify_checksums_before_ingest` is set. Range tombstones and user-defined-timestamp files (including the "UDT in Memtables only" format, whose persisted boundary keys carry no timestamp) are handled.

Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use the RocksDB comparators, and the no-persisted-timestamp path pads prepared bounds back to the internal timestamp shape before installing the file.

## Benchmark results

`db_bench` `ingestexternalfile` benchmark, Buck opt build (`mode/opt`), on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput.

| Config | Prepare SUM / P50 | Run SUM / P50 | Total benchmark time |
| --- | ---: | ---: | ---: |
| 1M keys/SST, baseline | 205.304 ms / 19.930 ms | 73.880 ms / 5.622 ms | 44.227 s |
| 1M keys/SST, `file_info=true` | 73.077 ms / 8.250 ms | 125.509 ms / 11.950 ms | 42.813 s |
| 2M keys/SST, baseline | 243.766 ms / 25.930 ms | 104.723 ms / 8.433 ms | 84.616 s |
| 2M keys/SST, `file_info=true` | 70.712 ms / 7.733 ms | 240.838 ms / 18.000 ms | 84.946 s |

The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly expected link/sync durability work (`link()` and `fdatasync()`), not metadata scanning.

Command shape:

```
cd /data/users/jkangs/fbsource/fbcode
buck run mode/opt //internal_repo_rocksdb/repo:db_bench -- --benchmarks=ingestexternalfile --num=<1000000|2000000> --value_size=100 \
  --compression_type=none --ingest_external_file_batch_size=10 \
  --ingest_external_file_num_batches=10 --disable_auto_compactions=1 \
  --statistics --db=<db_path> --wal_dir=<wal_path> \
  --ingest_external_file_use_file_info=<false|true>
```

Differential Revision: D107721261
@joshkang97 joshkang97 force-pushed the export-D107721261 branch from 8d5e7ff to e634e35 Compare June 9, 2026 23:19
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit e634e35


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

joshkang97 added a commit to joshkang97/rocksdb that referenced this pull request Jun 10, 2026
…facebook#14837)

Summary:

External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied.

This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns a `PreparedFileInfo` with the file size, table properties, and prepared internal `smallest`/`largest` bounds, and a new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when `verify_checksums_before_ingest` is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled.

Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file.

## Benchmark Results

`db_bench ingestexternalfile` benchmark, release build, on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput.

Each generated SST was `121,883,007` bytes (`116.24 MiB`). The benchmark used 1M keys/SST and 5 ingest batches per run.

| Batch size | Files/run | Config | Prepare P50 | Run P50 | Total ingestion P50 | Total P50 drop |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 50 | Baseline | 20.667 ms | 5.500 ms | 26.167 ms | -- |
| 10 | 50 | `file_info=true` | 7.838 ms | 9.826 ms | 17.664 ms | 32.5% |
| 30 | 150 | Baseline | 59.278 ms | 20.667 ms | 79.945 ms | -- |
| 30 | 150 | `file_info=true` | 18.000 ms | 31.167 ms | 49.167 ms | 38.5% |
| 50 | 250 | Baseline | 92.500 ms | 37.250 ms | 129.750 ms | -- |
| 50 | 250 | `file_info=true` | 29.416 ms | 62.500 ms | 91.916 ms | 29.2% |

The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly due to link/sync syscalls.

Benchmark args: `--benchmarks=ingestexternalfile --num=1000000 --value_size=100 --compression_type=none --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=5 --disable_auto_compactions=1 --statistics --ingest_external_file_use_file_info=<false|true>`.

Differential Revision: D107721261
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit e634e35


Summary

Well-designed performance optimization that adds a metadata fast-path to external SST file ingestion, cutting Prepare() latency by ~3x. The implementation is architecturally sound with clean separation between the fast-path (GetIngestedFileInfoFromFileInfo) and scan-path (GetIngestedFileInfoFromFile), proper validation, and good test coverage for the core scenarios. No high-severity correctness bugs found after thorough analysis.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. Missing test for range-deletion-only files via fast-path -- db/external_sst_file_test.cc
  • Issue: No test covers a file that contains ONLY range deletions (no point keys) ingested via file_infos. This exercises a distinct code path in SstFileWriter::Finish() where smallest/largest come entirely from range deletion boundaries, and file_info.num_entries == 0.
  • Suggested fix: Add a test that creates an SST with only DeleteRange() calls, finishes with PreparedFileInfo, and ingests via file_infos.
M2. Missing round-trip equivalence test -- db/external_sst_file_test.cc
  • Issue: No test ingests the SAME file via both the fast-path and scan-path and compares the resulting IngestedFileInfo (or DB state) for equivalence. Such a test would be the strongest correctness guarantee that the two paths produce identical metadata.
  • Suggested fix: Add a test that ingests identical files both ways and compares LiveFileMetaData (smallestkey, largestkey, size, etc.).
M3. Missing multi-file batch test via file_infos -- db/external_sst_file_test.cc
  • Issue: All file_infos tests use a single file. No test covers multiple files in one IngestExternalFiles call via file_infos. The file_infos vector indexing logic in Prepare() should be validated with >1 file.
  • Suggested fix: Add test with 2-3 files using file_infos, including overlapping keys to exercise files_overlap_ detection.
M4. SstFileWriter::Rep fields not reset in Open() -- table/sst_file_writer.cc
  • Issue: The new fields smallest_internal_key, smallest_range_del_internal_key, and largest_range_del_internal_key are not explicitly reset in SstFileWriter::Open(). If a user reuses the same SstFileWriter instance across multiple files (Open() -> Finish() -> Open() -> Finish()), stale values from the first file could leak into the second. file_info IS reset (line 438), but the new InternalKey fields are not.
  • Suggested fix: Reset these fields in Open() via default construction.

🟢 LOW / NIT

L1. PreparedFileInfo defined in internal header but exposed via public API -- include/rocksdb/options.h:2912
  • Issue: Forward-declared in public options.h, defined in internal table/prepared_file_info.h. Opaque design is correct, but C API / Java JNI bindings cannot easily wrap this. Acceptable for initial implementation.
L2. TableProperties copy in Finish() -- table/sst_file_writer.cc:600
  • Issue: Full TableProperties copy including string fields. Not on a hot path (once per file), so acceptable.
L3. Comment explaining max_ts for range deletion re-serialization -- table/sst_file_writer.cc:558-566
  • Issue: When !strip_timestamp && ts_sz > 0, re-serialization with max_ts matches what NewRangeTombstoneIterator() returns. A brief comment explaining this would improve clarity.

Cross-Component Analysis

Context Executes? Safe? Notes
WritePreparedTxnDB YES YES Seqno reassigned in common path
UDT (persisted) YES YES Boundaries carry timestamps
UDT (memtable only) YES YES Padding via ExtractValueType is correct
FIFO compaction YES YES Level assignment in common path
allow_db_generated_files YES YES Handled explicitly
verify_checksums YES YES Opens file in common path

Key claims verified: unique_id IS set (common path), start_ukey/limit_ukey IS set (common path), boundary computation IS equivalent (uses same SerializeKey/SerializeEndKey logic), write_global_seqno rejection IS enforced.

Positive Observations

  • Clean dispatcher pattern with common post-processing
  • Good validation (null pointers, size mismatches, write_global_seqno)
  • Debug assertions verifying range-del timestamps
  • Well-integrated stress test with configurable probability
  • Test sync point verifying fast-path is taken

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 69a4b4e


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 69a4b4e


Summary

Well-designed performance optimization that cleanly splits ingestion metadata acquisition into a fast-path (reuse SstFileWriter metadata) and scan-path. The boundary computation refactoring is thorough, with careful handling of UDT (both persisted and stripped), range deletions, and sequence number bounds. Good test coverage with unit tests, stress test integration, db_bench support, and crash test configuration.

High-severity findings (1):

  • [sst_file_writer.cc:Finish] SstFileWriter::Finish() unconditionally computes PreparedFileInfo boundary data (InternalKey merging, range-del re-serialization, UDT timestamp handling) even when prepared_file_info is nullptr, adding overhead to all existing callers.
Full review (click to expand)

Findings

🔴 HIGH

H1. Unconditional boundary computation in Finish() penalizes existing callers -- sst_file_writer.cc:Finish
  • Issue: The refactored Finish() always computes PreparedFileInfo boundary data -- InternalKey smallest/largest via update_file_boundaries, range deletion bound merging, UDT timestamp re-serialization -- even when prepared_file_info is nullptr. This adds computation (InternalKey comparisons, string allocations, RangeTombstone construction) to every SstFileWriter::Finish() call, including the 100+ existing call sites that do not use the feature.
  • Root cause: The code unconditionally populates finished_prepared_file_info and associated temporaries before checking whether the caller wants it.
  • Suggested fix: Gate the PreparedFileInfo-specific work behind if (prepared_file_info != nullptr). The ExternalSstFileInfo population (user key strings) can be done independently. Alternatively, accept the overhead if benchmarks show it's negligible for the Finish() cold path -- but document the decision.

🟡 MEDIUM

M1. Missing test for range-del-only file with file_infos -- external_sst_file_test.cc
  • Issue: No test covers a file with only range deletions (zero point keys) ingested via the file_infos fast path. The IngestWithFileInfoRangeDeletion test includes a Put(Key(30)) alongside the DeleteRange. A range-del-only file exercises a different code path where smallest_internal_key comes entirely from range deletion bounds, and point-key tracking is skipped.
  • Suggested fix: Add a test creating an SST with only DeleteRange calls, then ingest via file_infos.
M2. Missing test for allow_db_generated_files + file_infos (success path) -- external_sst_file_test.cc
  • Issue: IngestWithFileInfoRejectsMissingDbGeneratedSeqnoBounds tests the rejection case. There is no positive test verifying correct ingestion with allow_db_generated_files=true and valid seqno bounds in PreparedFileInfo.
  • Suggested fix: Add a test that creates a file with valid seqno bounds and successfully ingests it with allow_db_generated_files=true + file_infos.
M3. Missing test for multi-file batch with file_infos -- external_sst_file_test.cc
  • Issue: All file_infos tests use a single file per ingestion call. The batch path (multiple files in IngestExternalFileArg::external_files) with matching file_infos is not tested. The index-based access file_infos[i] in the Prepare() loop should be verified with >1 file.
  • Suggested fix: Test ingesting 2+ files in a single IngestExternalFiles call via file_infos.
M4. SstFileWriter::Rep::DeleteRangeImpl now uses InternalKey comparisons for boundary tracking -- sst_file_writer.cc:~240
  • Issue: The old code compared user keys (internal_comparator.user_comparator()->Compare). The new code compares InternalKey objects with the full internal comparator (internal_comparator.Compare(range_del_start_bound, smallest_range_del_internal_key)). This is semantically correct and more precise, but it changes behavior: two range deletions with the same user key but different sequence numbers could produce different boundary decisions. For SstFileWriter (where all seqnos are 0), this is a no-op difference, but the change is worth documenting.
  • Suggested fix: Add a brief comment noting that InternalKey comparison is used (and why it's safe with seqno=0).
M5. global_seqno_offset validation change is correct but subtle -- external_sst_file_ingestion_job.cc:~993
  • Issue: The old code unconditionally errored when external_sst_file_global_seqno_offset == 0 for V2 files. The new code only errors when write_global_seqno && offset == 0. This is necessary because in-memory table properties from builder->GetTableProperties() have offset=0 (it's computed only when reading the file back). The change is safe because: (1) write_global_seqno=true is rejected when file_infos is set (db_impl.cc validation), and (2) AssignGlobalSeqnoForIngestedFile also checks write_global_seqno && offset == 0 at line 1436.
  • Suggested fix: Add a comment explaining why the condition was relaxed, referencing the in-memory properties limitation.
M6. ExternalSstFileInfo boundary population changed semantics in Finish() -- sst_file_writer.cc
  • Issue: The old Rep::AddImpl tracked file_info.smallest_key and file_info.largest_key as user keys assigned on each entry. The new code sets smallest_internal_key from r->smallest_internal_key (the first point key's InternalKey) and ikey (the last point key's InternalKey), then extracts user keys for finished_file_info. The ordering check was also changed from Compare(user_key, file_info.largest_key) to Compare(user_key, ikey.user_key()). Since ikey is set via ikey.Set(user_key, 0, value_type) in AddImpl, ikey.user_key() returns the same user_key. The semantics are equivalent.
  • Suggested fix: None needed -- verified equivalent.

🟢 LOW / NIT

L1. shared_ptr<const PreparedFileInfo> overhead
  • Issue: PreparedFileInfo uses shared_ptr with atomic ref-counting. Since Finish() is a cold path (once per file) and the ownership pattern is simple (one producer, one or few consumers), unique_ptr could avoid the overhead. However, shared_ptr simplifies user code (no explicit move into the vector of raw pointers), and the PR correctly uses shared_ptr<const> for safe sharing.
  • Suggested fix: Acceptable as-is; document the rationale.
L2. IngestExternalFile single-CF API doesn't expose file_infos
  • Issue: Users must use the multi-CF IngestExternalFiles(vector<IngestExternalFileArg>) API to use file_infos. The convenience single-CF IngestExternalFile doesn't have a file_infos parameter.
  • Suggested fix: Consider adding an overload in a follow-up PR.
L3. PreparedFileInfo struct is not truly opaque
  • Issue: PreparedFileInfo is forward-declared in options.h as opaque but defined in table/prepared_file_info.h with public fields including InternalKey (an internal type). Users who include prepared_file_info.h can see and modify the struct. This is acceptable since users would need to intentionally include an internal header, and the const qualifier on the shared_ptr prevents modification through the normal API.
  • Suggested fix: None -- current approach is pragmatic.
L4. Crash test values for ingest_external_file_use_file_info_one_in
  • Issue: Values [0, 1, 2] in db_crashtest.py provide good coverage: 0=never, 1=always, 2=50%. This is a reasonable distribution for a new feature.
  • Suggested fix: None needed.

Cross-Component Analysis

Context Fast path handles correctly? Verification
UDT persisted YES Range dels re-serialized with max_ts in Finish(); tested in TimestampsPersistedIngestWithFileInfoRangeDeletionBounds
UDT not persisted (strip) YES Timestamps stripped in Finish(), padded back in GetIngestedFileInfoFromFileInfo; debug assertion in scan path validates max_ts invariant for range dels; tested in TimestampsNotPersistedIngestWithFileInfoRangeDeletionBounds
Range deletions YES Boundary extension tracked via SerializeKey/SerializeEndKey; tested in IngestWithFileInfoRangeDeletion
allow_db_generated_files YES Seqno bounds from table properties; rejection tested
verify_checksums YES Opens file just for verification; tested in IngestWithFileInfoVerifiesChecksum
write_global_seqno REJECTED Validated in db_impl.cc; tested in IngestWithFileInfoRejectsWriteGlobalSeqno
Concurrent ingestions YES shared_ptr is safe; stress test exercises concurrent paths
File modification between Finish/Ingest Same risk as existing API Both paths read the file at ingestion time if checksums enabled

Positive Observations

  • Clean separation of concerns: GetIngestedFileInfoFromFileInfo (fast) vs GetIngestedFileInfoFromFile (scan) with shared post-processing in GetIngestedFileInfo wrapper. The common wrapper handles checksum verification, FD assignment, and start_ukey/limit_ukey computation uniformly.
  • Thorough input validation in db_impl.cc: null pointer check, size mismatch check, write_global_seqno incompatibility.
  • Debug assertions in the scan path (new #ifndef NDEBUG block) cross-validate the assumption that range tombstone keys use max timestamp, providing a safety net against future regressions.
  • Well-integrated stress testing with configurable probability via ingest_external_file_use_file_info_one_in.
  • db_bench integration provides a reproducible benchmark with clear methodology (link files, isolate ingest path).
  • The IngestWithFileInfo test uses sync points to verify the scan path is actually skipped (read_path_count check).
  • Range deletion boundary tracking refactored to use InternalKey throughout SstFileWriter, eliminating a user-key/internal-key impedance mismatch.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

)

Summary:

This change splits ingestion into two public calls: `PrepareFileIngestion()` performs all of the off-mutex work and returns an opaque `FileIngestionHandle`, and `CommitFileIngestionHandle()` makes the prepared files visible under the mutex.

Even though the "Prepare" phase was off the DB mutex, the application layer may still have application logic that depends on the completion of `IngestExternalFile`.

`IngestExternalFiles(args)` is also just `PrepareFileIngestion(args)` followed by `CommitFileIngestionHandle()`, so its behavior is unchanged.

`CommitFileIngestionHandles()` also supports committing multiple handles atomically. However, it is possible that a CF may be present in multiple handles. In this case, we merge the ingestion jobs together. This allows applications to prepare file ingestions at different times, but still provide a single atomic commit.

Differential Revision: D108225105
…facebook#14837)

Summary:

External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied.

This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns a `PreparedFileInfo` with the file size, table properties, and prepared internal `smallest`/`largest` bounds, and a new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when `verify_checksums_before_ingest` is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled.

Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file.

The `ingestexternalfile` `db_bench` benchmark now also exposes `--ingest_external_file_fill_cache` so ingestion-read block-cache effects can be controlled independently while comparing the file-info fast path against the normal open-and-scan path.

## Benchmark Results

Existing `db_bench ingestexternalfile` benchmark, release build, on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. These numbers used the benchmark's previous default `fill_cache=true`.

Each generated SST was `121,883,007` bytes (`116.24 MiB`). The benchmark used 1M keys/SST and 5 ingest batches per run.

| Batch size | Files/run | Config | Prepare P50 | Run P50 | Total ingestion P50 | Total P50 drop |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 50 | Baseline | 20.667 ms | 5.500 ms | 26.167 ms | -- |
| 10 | 50 | `file_info=true` | 7.838 ms | 9.826 ms | 17.664 ms | 32.5% |
| 30 | 150 | Baseline | 59.278 ms | 20.667 ms | 79.945 ms | -- |
| 30 | 150 | `file_info=true` | 18.000 ms | 31.167 ms | 49.167 ms | 38.5% |
| 50 | 250 | Baseline | 92.500 ms | 37.250 ms | 129.750 ms | -- |
| 50 | 250 | `file_info=true` | 29.416 ms | 62.500 ms | 91.916 ms | 29.2% |

The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly due to link/sync syscalls.

Updated fill-cache-isolated benchmark args: `--benchmarks=ingestexternalfile --num=1000000 --value_size=100 --compression_type=none --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=5 --disable_auto_compactions=1 --statistics --ingest_external_file_use_file_info=<false|true> --ingest_external_file_fill_cache=false`.

Differential Revision: D107721261
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant