Split ingest external files into prepare and commit APIs (#14849)
joshkang97 wants to merge 1 commit into
@joshkang97 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108225105.
✅ clang-tidy: No findings on changed lines. Completed in 462.1s.
Summary: This change splits ingestion into two public calls: `PrepareFileIngestion()` performs all of the off-mutex work and returns an opaque `FileIngestionHandle`, and `CommitFileIngestionHandle()` makes the prepared files visible under the mutex.

Although the "Prepare" phase already ran off the DB mutex, the application layer may still have logic that depends on the completion of `IngestExternalFile`. Exposing the two phases separately lets the application run the preparation ahead of time and wait only on the commit. `IngestExternalFiles(args)` is now just `PrepareFileIngestion(args)` followed by `CommitFileIngestionHandle()`, so its behavior is unchanged.

`CommitFileIngestionHandles()` also supports committing multiple handles atomically. A column family (CF) may be present in more than one handle; in that case the ingestion jobs for that CF are merged. This allows applications to prepare file ingestions at different times while still committing them in a single atomic step.

Differential Revision: D108225105
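The prepare/commit split described in the summary can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyDB`, `ToyHandle`, and `Prepare` are hypothetical names invented for illustration and do not reflect the signatures or internals of `PrepareFileIngestion()` or `FileIngestionHandle`. It shows the expensive work happening before the mutex is taken, a short commit under the mutex, and rollback when a handle is destroyed without being committed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Toy model only: none of these types are the RocksDB implementation.
struct ToyDB {
  std::mutex mu;                        // stands in for the DB mutex
  std::vector<std::string> live_files;  // files visible to readers
  std::atomic<std::size_t> staged{0};   // files prepared but not yet committed
};

// Hypothetical handle: owns prepared files until committed or destroyed.
class ToyHandle {
 public:
  ToyHandle(ToyDB* db, std::vector<std::string> files)
      : db_(db), files_(std::move(files)) {
    db_->staged += files_.size();
  }
  ToyHandle(const ToyHandle&) = delete;
  ToyHandle& operator=(const ToyHandle&) = delete;

  // RAII rollback: an uncommitted handle discards its staged files.
  ~ToyHandle() {
    if (!consumed_) db_->staged -= files_.size();
  }

  // Short critical section: publish the prepared files under the mutex.
  bool Commit() {
    if (consumed_) return false;  // a handle can be committed only once
    std::lock_guard<std::mutex> lock(db_->mu);
    for (auto& f : files_) db_->live_files.push_back(std::move(f));
    db_->staged -= files_.size();
    consumed_ = true;
    return true;
  }

 private:
  ToyDB* db_;
  std::vector<std::string> files_;
  bool consumed_ = false;
};

// Prepare phase: the expensive, off-mutex work (modeled here as a rename).
std::unique_ptr<ToyHandle> Prepare(ToyDB* db,
                                   const std::vector<std::string>& external) {
  std::vector<std::string> staged;
  for (const auto& path : external) staged.push_back(path + ".staged");
  return std::make_unique<ToyHandle>(db, std::move(staged));
}
```

In this model a caller would invoke `Prepare()` early, do unrelated work, and call `Commit()` only at the point where the rest of its logic needs the files to be visible. Letting the returned pointer go out of scope without committing undoes the staging.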
d4137c0 to 221b75d
221b75d to c4de32f
c4de32f to 31acb42
Codex Code Review (obsolete; superseded by a newer AI review)
🟡 Auto-triggered after CI passed, reviewing commit 31acb42.
❌ Codex review failed before producing findings.
Generated by Codex CLI.
Claude Code Review (obsolete; superseded by a newer AI review)
✅ Auto-triggered after CI passed, reviewing commit 31acb42.
Summary
Solid two-phase API design that correctly splits the ingestion hot path. The core Prepare/Commit split and handle lifecycle are well-implemented, with good test coverage via parameterized tests. Several issues need attention before merge.
High-severity findings (3):
Findings
🔴 HIGH
H1. Missing
| Context | Executes? | Assumptions hold? | Action needed? |
|---|---|---|---|
| DBImplReadOnly | YES (inherits DBImpl) | NO: should not allow ingestion | Add NotSupported overrides (H1) |
| DBImplSecondary | YES (inherits DBImpl) | NO: should not allow ingestion | Add NotSupported overrides (H1) |
| WritePreparedTxnDB | YES (via StackableDB) | YES: passthrough is correct | Safe |
| WriteUnpreparedTxnDB | YES (via StackableDB) | YES: passthrough is correct | Safe |
| BlobDB | YES (via StackableDB) | YES: passthrough is correct | Safe |
| Crash between Prepare and Commit | N/A | Files on disk but not in MANIFEST | Safe: recovery treats as obsolete |
| CF dropped between Prepare and Commit | YES | Commit checks `IsDropped()` | Safe: returns error |
Positive Observations
- The parameterized test approach (`two_phase_ingest_` as 3rd bool param) is excellent: it runs all existing ingestion tests through the two-phase path automatically, providing comprehensive coverage.
- The stress test integration with 25% rollback probability is well-designed for catching edge cases.
- The `ingestion_options_` change from reference to value correctly addresses the lifetime issue introduced by the two-phase design.
- The `MergeForSameColumnFamily` design with `ComputeFilesOverlap` + `DivideInputFilesIntoBatches` reuse is clean.
- Good use of RAII pattern for handle cleanup.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
- `/claude-review [context]`: Request a code review
- `/claude-query <question>`: Ask about the PR or codebase
31acb42 to be3849f
Codex Code Review (obsolete; superseded by a newer AI review)
🟡 Auto-triggered after CI passed, reviewing commit be3849f.
❌ Codex review failed before producing findings.
Generated by Codex CLI.
Claude Code Review (obsolete; superseded by a newer AI review)
✅ Auto-triggered after CI passed, reviewing commit be3849f.
Summary
Solid architectural split of
High-severity findings (3):
Findings
🔴 HIGH
H1. Missing
| Context | Safe? | Notes |
|---|---|---|
| WritePreparedTxnDB | Yes | Inherits via StackableDB forwarding; no override of ingestion |
| ReadOnly DB | Yes | Returns NotSupported |
| Secondary Instance | Yes | Returns NotSupported |
| BlobDB | Yes | Does not override IngestExternalFiles |
| Concurrent ingestions | Yes | Same locking as before (per-CF ingest lock + DB mutex) |
| Compactions between Prepare/Commit | Yes | Level assignment happens during Run() under mutex |
| `mutable_db_options_` changes | Yes | Job holds a reference to the member variable; reads current value at Run() time |
| User-defined timestamps | Yes | Same validation as before |
| `atomic_replace_range` + merge | Yes | `MergeForSameColumnFamily` rejects this combination |
Positive Observations
- The `ingestion_options_` change from `const&` to value is correct and essential for the split lifetime.
- `ComputeFilesOverlap` extraction is a good refactoring that makes the overlap check reusable for merge.
- The `ingestion_jobs.reserve(num_cfs)` addition is a nice optimization.
- The `MergeForSameColumnFamily` design correctly recomputes overlap and rebuilds batches.
- Sequence number ordering for merged jobs is correct: appended files get later batches and higher seqnos.
- Stress test coverage with both abort paths (explicit Abort + RAII destructor) is thorough.
- The `operator==` on `IngestExternalFileOptions` using C++20 defaulted comparison is correct (all fields are trivially comparable) and forward-compatible.
- Test parameterization to run all existing tests with `two_phase_ingest_=true` provides excellent regression coverage.
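The merge behavior noted in these observations (jobs for the same CF combined, with appended files ordered after earlier ones) can be sketched as follows. This is a simplified illustration, not the `MergeForSameColumnFamily` implementation: `ToyJob` and `MergeByColumnFamily` are hypothetical names, and the sketch omits the overlap recomputation, batch rebuilding, and the rejection of `atomic_replace_range` mentioned in the review.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-CF job carried by a handle; not a RocksDB type.
struct ToyJob {
  int cf_id;
  std::vector<std::string> files;
};

// Merge jobs that target the same column family. Handles are visited in
// order, so files from later handles are appended after files from earlier
// ones. In this toy model, position in the list stands in for commit order.
std::vector<ToyJob> MergeByColumnFamily(
    const std::vector<std::vector<ToyJob>>& handles) {
  std::vector<ToyJob> merged;
  std::map<int, std::size_t> index_of_cf;  // cf_id -> position in merged
  for (const auto& handle : handles) {
    for (const auto& job : handle) {
      auto it = index_of_cf.find(job.cf_id);
      if (it == index_of_cf.end()) {
        // First job seen for this CF: start a new merged job.
        index_of_cf[job.cf_id] = merged.size();
        merged.push_back(job);
      } else {
        // CF already present: append this job's files to the existing one.
        auto& dst = merged[it->second].files;
        dst.insert(dst.end(), job.files.begin(), job.files.end());
      }
    }
  }
  return merged;
}
```

With two handles where the first touches CFs 1 and 2 and the second touches CF 1 again, the result holds one job per CF, and the second handle's file for CF 1 comes last in that job's list.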
be3849f to e182491
Codex Code Review (obsolete; superseded by a newer AI review)
🟡 Auto-triggered after CI passed, reviewing commit e182491.
❌ Codex review failed before producing findings.
Generated by Codex CLI.
Claude Code Review (obsolete; superseded by a newer AI review)
✅ Auto-triggered after CI passed, reviewing commit e182491.
Summary
Solid design that correctly splits the heavy off-mutex work from the brief on-mutex commit. The refactoring preserves existing
High-severity findings (3):
Findings
🔴 HIGH
H1.
| Context | Overridden? | Action needed? |
|---|---|---|
| ReadOnly DB | Yes → NotSupported | OK |
| SecondaryInstance | Yes → NotSupported | OK |
| CompactedDB | No — inherits DBImpl | Add NotSupported overrides |
| StackableDB | Forwarded | OK |
| TransactionDB | Inherits base | OK |
Positive Observations
- Clean refactoring: `IngestExternalFiles` is now Prepare+Commit, ensuring all existing tests exercise the new path.
- `MergeForSameColumnFamily` correctly rebuilds batches after appending files.
- `consumed_` flag + destructor rollback is a good safety net.
- `ingestion_options_` correctly changed from reference to value.
- Thorough stress test and parametric test coverage.
- `ComputeFilesOverlap` extraction improves reuse.
The full review is also saved to review-findings.md.
e182491 to be2931b
🟡 Codex Code Review
Auto-triggered after CI passed, reviewing commit be2931b.
❌ Codex review failed before producing findings.
Generated by Codex CLI.
✅ Claude Code Review
Auto-triggered after CI passed, reviewing commit be2931b.
Summary
Solid two-phase split of
High-severity findings (2):
Medium-severity findings (5):
Low-severity findings (5): Good `reserve()` addition, `seq_cst` overkill for counter, missing tests for CF-drop and concurrent-writes edge cases.
Full details with root causes and suggested fixes are in
This pull request has been merged in 4d30cfd.