Commit 69a4b4e

joshkang97 authored and facebook-github-bot committed
Pass file metadata to IngestExternalFile to improve ingestion latency (#14837)
Summary:

External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied.

This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns a `PreparedFileInfo` with the file size, table properties, and prepared internal `smallest`/`largest` bounds, and a new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when `verify_checksums_before_ingest` is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled.

Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file.

## Benchmark Results

`db_bench ingestexternalfile` benchmark, release build, on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. Each generated SST was `121,883,007` bytes (`116.24 MiB`). The benchmark used 1M keys/SST and 5 ingest batches per run.

| Batch size | Files/run | Config | Prepare P50 | Run P50 | Total ingestion P50 | Total P50 drop |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 50 | Baseline | 20.667 ms | 5.500 ms | 26.167 ms | -- |
| 10 | 50 | `file_info=true` | 7.838 ms | 9.826 ms | 17.664 ms | 32.5% |
| 30 | 150 | Baseline | 59.278 ms | 20.667 ms | 79.945 ms | -- |
| 30 | 150 | `file_info=true` | 18.000 ms | 31.167 ms | 49.167 ms | 38.5% |
| 50 | 250 | Baseline | 92.500 ms | 37.250 ms | 129.750 ms | -- |
| 50 | 250 | `file_info=true` | 29.416 ms | 62.500 ms | 91.916 ms | 29.2% |

The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly due to link/sync syscalls.

Benchmark args: `--benchmarks=ingestexternalfile --num=1000000 --value_size=100 --compression_type=none --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=5 --disable_auto_compactions=1 --statistics --ingest_external_file_use_file_info=<false|true>`.

Differential Revision: D107721261
1 parent 828f6d1 commit 69a4b4e

15 files changed

Lines changed: 913 additions & 138 deletions

db/db_impl/db_impl.cc

Lines changed: 27 additions & 4 deletions
@@ -6709,6 +6709,27 @@ Status DBImpl::IngestExternalFiles(
           "external_files[" + std::to_string(i) + "] is empty";
       return Status::InvalidArgument(err_msg);
     }
+    if (!args[i].file_infos.empty()) {
+      if (args[i].file_infos.size() != args[i].external_files.size()) {
+        return Status::InvalidArgument("file_infos[" + std::to_string(i) +
+                                       "] size must match external_files[" +
+                                       std::to_string(i) + "] size");
+      }
+      for (const auto* prepared_file_info : args[i].file_infos) {
+        if (prepared_file_info == nullptr) {
+          return Status::InvalidArgument(
+              "file_infos[" + std::to_string(i) +
+              "] contains a null PreparedFileInfo pointer; each entry must "
+              "point to a handle from SstFileWriter::Finish");
+        }
+      }
+      // file_infos avoids opening the file, so it cannot write a global seqno
+      // back into it. (write_global_seqno is deprecated and defaults to false.)
+      if (args[i].options.write_global_seqno) {
+        return Status::InvalidArgument(
+            "write_global_seqno is not supported when file_infos is set");
+      }
+    }
     if (i && args[i].options.fill_cache != args[i - 1].options.fill_cache) {
       return Status::InvalidArgument(
           "fill_cache should be the same across ingestion options.");
@@ -6802,8 +6823,9 @@ Status DBImpl::IngestExternalFiles(
             this);
         Status es = ingestion_jobs[i].Prepare(
             args[i].external_files, args[i].files_checksums,
-            args[i].files_checksum_func_names, args[i].atomic_replace_range,
-            args[i].file_temperature, start_file_number, super_version);
+            args[i].files_checksum_func_names, args[i].file_infos,
+            args[i].atomic_replace_range, args[i].file_temperature,
+            start_file_number, super_version);
         // capture first error only
         if (!es.ok() && status.ok()) {
           status = es;
@@ -6818,8 +6840,9 @@ Status DBImpl::IngestExternalFiles(
             this);
         Status es = ingestion_jobs[0].Prepare(
             args[0].external_files, args[0].files_checksums,
-            args[0].files_checksum_func_names, args[0].atomic_replace_range,
-            args[0].file_temperature, next_file_number, super_version);
+            args[0].files_checksum_func_names, args[0].file_infos,
+            args[0].atomic_replace_range, args[0].file_temperature,
+            next_file_number, super_version);
         if (!es.ok()) {
           status = es;
         }
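
The argument checks added to `DBImpl::IngestExternalFiles` in the first hunk can be exercised in isolation. Below is a minimal, self-contained sketch of the same three rules (size match, no null entries, no `write_global_seqno`). It is not the RocksDB implementation: `IngestOptions`, `IngestArg`, and `ValidateFileInfos` are hypothetical stand-ins, `PreparedFileInfo` is an empty placeholder for the real opaque type, and a plain error string replaces `Status::InvalidArgument`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Placeholder for the opaque handle returned by SstFileWriter::Finish().
struct PreparedFileInfo {};

// Hypothetical stand-ins for IngestExternalFileOptions / IngestExternalFileArg,
// reduced to the fields the checks read.
struct IngestOptions {
  bool write_global_seqno = false;
};
struct IngestArg {
  std::vector<std::string> external_files;
  std::vector<const PreparedFileInfo*> file_infos;
  IngestOptions options;
};

// Returns an empty string when every argument passes, otherwise the message
// of the first failed check (mirroring the early returns in the diff).
std::string ValidateFileInfos(const std::vector<IngestArg>& args) {
  for (std::size_t i = 0; i < args.size(); ++i) {
    const IngestArg& arg = args[i];
    if (arg.file_infos.empty()) {
      continue;  // no prepared metadata: the open-and-scan path applies
    }
    if (arg.file_infos.size() != arg.external_files.size()) {
      return "file_infos[" + std::to_string(i) +
             "] size must match external_files[" + std::to_string(i) +
             "] size";
    }
    for (const PreparedFileInfo* info : arg.file_infos) {
      if (info == nullptr) {
        return "file_infos[" + std::to_string(i) +
               "] contains a null PreparedFileInfo pointer";
      }
    }
    // The fast path does not open the file for writing, so it cannot write
    // a global sequence number back into it.
    if (arg.options.write_global_seqno) {
      return "write_global_seqno is not supported when file_infos is set";
    }
  }
  return "";
}
```

Note that an argument with an empty `file_infos` skips all three checks, so in this sketch `write_global_seqno` is only rejected in combination with prepared metadata, as in the diff.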

db/external_sst_file_ingestion_job.cc

Lines changed: 215 additions & 68 deletions
Large diffs are not rendered by default.

db/external_sst_file_ingestion_job.h

Lines changed: 25 additions & 4 deletions
@@ -248,6 +248,7 @@ class ExternalSstFileIngestionJob {
   Status Prepare(const std::vector<std::string>& external_files_paths,
                  const std::vector<std::string>& files_checksums,
                  const std::vector<std::string>& files_checksum_func_names,
+                 const std::vector<const PreparedFileInfo*>& file_infos,
                  const std::optional<RangeOpt>& atomic_replace_range,
                  const Temperature& file_temperature, uint64_t next_file_number,
                  SuperVersion* sv);
@@ -315,17 +316,37 @@ class ExternalSstFileIngestionJob {
   // different options. For example: when external file does not contain
   // timestamps while column family enables UDT in Memtables only feature.
   Status SanityCheckTableProperties(const std::string& external_file,
-                                    uint64_t new_file_number, SuperVersion* sv,
-                                    IngestedFileInfo* file_to_ingest,
-                                    std::unique_ptr<TableReader>* table_reader);
+                                    const TableProperties& props,
+                                    IngestedFileInfo* file_to_ingest);
 
   // Open the external file and populate `file_to_ingest` with all the
-  // external information we need to ingest this file.
+  // external information we need to ingest this file. When
+  // `prepared_file_info` is non-null, its caller-supplied metadata is reused
+  // instead of opening and scanning the file.
   Status GetIngestedFileInfo(const std::string& external_file,
                              uint64_t new_file_number,
+                             const PreparedFileInfo* prepared_file_info,
                              IngestedFileInfo* file_to_ingest,
                              SuperVersion* sv);
 
+  // Acquire the per-file metadata from the caller-supplied opaque
+  // `PreparedFileInfo` (produced by SstFileWriter::Finish) instead of opening
+  // the file.
+  Status GetIngestedFileInfoFromFileInfo(
+      const std::string& external_file,
+      const PreparedFileInfo& prepared_file_info,
+      IngestedFileInfo* file_to_ingest);
+
+  // Acquire the per-file metadata by opening the external file and scanning it
+  // (table properties, sequence number bounds, and boundary keys including any
+  // range-tombstone extensions). Used when no file_info is available. The
+  // opened `TableReader` is returned via `*table_reader` so the caller can
+  // reuse it (e.g. to verify the file checksum) without re-opening the file.
+  Status GetIngestedFileInfoFromFile(
+      const std::string& external_file, uint64_t new_file_number,
+      IngestedFileInfo* file_to_ingest, SuperVersion* sv,
+      std::unique_ptr<TableReader>* out_table_reader);
+
   // If the input files' key range overlaps themselves, this function divides
   // them in the user specified order into multiple batches. Where the files
   // within a batch do not overlap with each other, but key range could overlap
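
The header comments describe `GetIngestedFileInfo` as choosing between two acquisition paths depending on whether `prepared_file_info` is null. The body of that function lives in `external_sst_file_ingestion_job.cc`, whose diff is not rendered here, so the following is only a sketch of the dispatch under stated assumptions. The struct fields, the `FileScanner` callback, and the `scanned` flag are hypothetical; the real `PreparedFileInfo` is opaque and the real functions return `Status`.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical stand-in: the commit message says the prepared metadata holds
// the file size, table properties, and smallest/largest bounds. Only a few
// fields are modeled here.
struct PreparedFileInfo {
  std::uint64_t file_size = 0;
  std::string smallest;
  std::string largest;
};

// Hypothetical stand-in for the job's per-file record.
struct IngestedFileInfo {
  std::uint64_t file_size = 0;
  std::string smallest;
  std::string largest;
  bool scanned = false;  // illustration only: true if the file was opened
};

// Stand-in for the expensive open-and-scan step.
using FileScanner = std::function<IngestedFileInfo(const std::string& path)>;

// Fast path: copy caller-supplied metadata without touching the file.
IngestedFileInfo FromFileInfo(const PreparedFileInfo& prepared) {
  IngestedFileInfo out;
  out.file_size = prepared.file_size;
  out.smallest = prepared.smallest;
  out.largest = prepared.largest;
  return out;
}

// Slow path: open the file and recompute the metadata.
IngestedFileInfo FromFile(const std::string& path, const FileScanner& scan) {
  IngestedFileInfo out = scan(path);
  out.scanned = true;
  return out;
}

// Dispatch on the nullable pointer, as the header comment describes.
IngestedFileInfo GetIngestedFileInfo(const std::string& path,
                                     const PreparedFileInfo* prepared,
                                     const FileScanner& scan) {
  return prepared != nullptr ? FromFileInfo(*prepared) : FromFile(path, scan);
}
```

Passing the scanner as a callback keeps the sketch free of file I/O and makes it easy to confirm that the fast path never invokes it.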
