Add experimental embedded blob SST support #14851

Open
pdillinger wants to merge 1 commit into facebook:main from pdillinger:sst_writer_embedded_blob
@pdillinger

Contributor

Summary:
Add EXPERIMENTAL embedded blob SST support for SstFileWriter through OpenWithEmbeddedBlobs(). Eligible large values are written as same-file blob records in a strict prefix of a block-based SST, while table entries store same-file BlobIndex references that readers resolve for Get, MultiGet, and iteration, including mixed embedded and non-embedded wide-column values.

The first Codex implementation put much more of this feature directly in SstFileWriter, including table-builder timing and entry buffering concerns. This updated design is generally less intrusive: SstFileWriter selects the mode, while a BlockBasedTableBuilder wrapper owns prefix writing and replay into the normal builder. That keeps ownership closer to the table-building layer and leaves room for generalization beyond SstFileWriter if that ever becomes useful. Regardless, this experimental feature is expected to have only niche applications.
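As a rough illustration of the wrapper idea (not the code in this PR), the toy sketch below appends eligible values to a prefix buffer as they arrive, buffers the table entries, and replays them into an inner builder at finish time. `ToyTableBuilder`, `ToyEmbeddedBlobBuilder`, the `ref:offset:size` string encoding, and the `>=` eligibility test are all assumptions made for illustration; the real builder and BlobIndex encoding differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the normal table builder: it only collects entries.
struct ToyTableBuilder {
  std::vector<std::pair<std::string, std::string>> entries;
  void Add(const std::string& key, const std::string& value) {
    entries.emplace_back(key, value);
  }
};

// Hypothetical wrapper: blob payloads go to the file prefix right away, while
// table entries are buffered and replayed into the inner builder in Finish().
class ToyEmbeddedBlobBuilder {
 public:
  ToyEmbeddedBlobBuilder(ToyTableBuilder* inner, std::size_t min_blob_size)
      : inner_(inner), min_blob_size_(min_blob_size) {}

  void Add(const std::string& key, const std::string& value) {
    if (value.size() >= min_blob_size_) {  // assumed eligibility rule
      const std::uint64_t offset = prefix_.size();
      prefix_.append(value);
      // Toy textual reference; the real format is a binary BlobIndex.
      buffered_.emplace_back(key, "ref:" + std::to_string(offset) + ":" +
                                      std::to_string(value.size()));
    } else {
      buffered_.emplace_back(key, value);  // small values stay inline
    }
  }

  // Replays buffered entries once the prefix is complete.
  void Finish() {
    for (const auto& kv : buffered_) {
      inner_->Add(kv.first, kv.second);
    }
    buffered_.clear();
  }

  const std::string& prefix() const { return prefix_; }

 private:
  ToyTableBuilder* inner_;
  std::size_t min_blob_size_;
  std::string prefix_;
  std::vector<std::pair<std::string, std::string>> buffered_;
};

// Usage: the inner builder sees nothing until Finish() replays the buffer.
inline void DemoPrefixThenReplay() {
  ToyTableBuilder inner;
  ToyEmbeddedBlobBuilder builder(&inner, /*min_blob_size=*/8);
  builder.Add("a", "small");
  builder.Add("b", "0123456789");
  assert(inner.entries.empty());
  builder.Finish();
  assert(inner.entries.size() == 2);
  assert(inner.entries[1].second == "ref:0:10");
  assert(builder.prefix() == "0123456789");
}
```

The point of the sketch is the ordering constraint: because blob records must form a strict prefix of the file, table entries cannot reach the normal builder until all blob payloads have been written.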

Embedded blob record bounds are stored in a dedicated SST metaindex entry so readers have an explicit sanity/corruption-checking locator for the record range, separate from table block layout assumptions. Blob count and payload-byte totals are stored as an auxiliary table property for diagnostics and can be ignored by readers.
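To show the kind of check such a locator enables, here is a minimal sketch of validating a reference against the recorded bounds. `ToyBlobRange` and `ReferenceWithinRange` are hypothetical names, and the half-open `[begin, end)` convention is an assumption; the actual metaindex encoding and reader checks in this PR may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded form of the metaindex locator: the half-open byte
// range [begin, end) that holds all embedded blob records.
struct ToyBlobRange {
  std::uint64_t begin = 0;
  std::uint64_t end = 0;
};

// Returns true only if [offset, offset + size) lies entirely inside the
// range. The comparisons are ordered so the arithmetic cannot overflow.
inline bool ReferenceWithinRange(const ToyBlobRange& range,
                                 std::uint64_t offset, std::uint64_t size) {
  if (range.begin > range.end) {
    return false;  // malformed locator
  }
  if (offset < range.begin || offset > range.end) {
    return false;  // reference starts outside the record range
  }
  return size <= range.end - offset;
}

// Usage: a reader could treat a failed check as corruption.
inline void DemoBoundsCheck() {
  const ToyBlobRange range{16, 116};
  assert(ReferenceWithinRange(range, 16, 100));
  assert(!ReferenceWithinRange(range, 16, 101));
  assert(!ReferenceWithinRange(range, 8, 10));
}
```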

Same-file BlobIndex references use blob file number 0 as the marker. That value also serves as the invalid blob-file-number sentinel in broader metadata code, but the meanings do not conflict when used carefully: only the embedded-SST reader/writer path interprets 0 as same-file, while generic file-metadata paths continue to reject it as invalid. Using 1 would be worse because legacy "stackable" BlobDB can use low blob file numbers, including 1, so reserving it would collide with real blob files.
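The two readings of 0 can be sketched as two layer-specific predicates over the same value. `ToyBlobIndex`, `kToySameFileBlobFileNumber`, and both function names are hypothetical and exist only for this illustration; they are not the identifiers used in the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constant: 0 doubles as the same-file marker in this sketch.
constexpr std::uint64_t kToySameFileBlobFileNumber = 0;

// Toy reference with the three fields the description implies.
struct ToyBlobIndex {
  std::uint64_t file_number;
  std::uint64_t offset;
  std::uint64_t size;
};

// Embedded-SST reader/writer layer: 0 means "the blob lives in this SST".
inline bool IsSameFileReference(const ToyBlobIndex& index) {
  return index.file_number == kToySameFileBlobFileNumber;
}

// Generic file-metadata layer: 0 is still rejected as an invalid file number.
inline bool IsValidForFileMetadata(const ToyBlobIndex& index) {
  return index.file_number != 0;
}

// Usage: the same reference is accepted by one layer and rejected by the other.
inline void DemoZeroMarker() {
  const ToyBlobIndex embedded{0, 64, 4096};
  const ToyBlobIndex external{1, 64, 4096};
  assert(IsSameFileReference(embedded) && !IsValidForFileMetadata(embedded));
  assert(!IsSameFileReference(external) && IsValidForFileMetadata(external));
}
```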

Compression options remain in the public API as placeholders, but embedded blob compression support is deferred. Integrating compression with BlockBasedTableBuilder while avoiding copied CompressAndVerifyBlock-style logic is tricky enough to deserve a separate, focused PR.

Test Plan:

  • Added BlobIndexTest.SameFileBlobIndex and BlobGarbageMeterTest.SameFileBlobIndex coverage for same-file BlobIndex encoding, display, recognition, and ignoring same-file references in blob-garbage accounting.
  • Extended FileMetaDataTest.UpdateBoundariesBlobIndex to preserve the generic zero-file-number corruption check while keeping same-file embedded blob semantics at the table-reader/writer layers.
  • Added SstFileReader embedded blob coverage for round-trip Get, MultiGet, and iterator reads; format_version gating; ignored placeholder compression options; the 2048-byte default min_blob_size; wide-column mixed embedded/non-embedded values; early append error surfacing; and reader tolerance for ignored bytes before and after the embedded blob record prefix.
  • Evaluated normal iterator CPU risk with DEBUG_LEVEL=0 PORTABLE=0 db_bench binaries against ../rocksdb_main on a non-embedded, cached 5M-key DB. Long forward scans (seekrandom --seek_nexts=65536), short forward scans (--seek_nexts=100), reverse long scans, and readseq showed no forward iterator CPU regression; reverse long scans were flat in cycles/op with a small instruction-count increase.

pdillinger requested a review from xingbowang (June 14, 2026 22:11)
meta-cla Bot added the CLA Signed label (Jun 14, 2026)
@meta-codesync

meta-codesync Bot commented Jun 14, 2026

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D108564468.

@github-actions

⚠️ clang-tidy: 2 warning(s) on changed lines

Completed in 492.7s.

Summary by check

| Check | Count |
|---|---|
| cppcoreguidelines-pro-type-member-init | 1 |
| cppcoreguidelines-special-member-functions | 1 |
| Total | 2 |

Details

table/block_based/block_based_table_builder.cc (2 warning(s))
table/block_based/block_based_table_builder.cc:3141:7: warning: class 'EmbeddedBlobBlockBasedTableBuilder' defines a non-default destructor but does not define a move constructor or a move assignment operator [cppcoreguidelines-special-member-functions]
table/block_based/block_based_table_builder.cc:3379:5: warning: uninitialized record type: 'trailer' [cppcoreguidelines-pro-type-member-init]
