feat(mongodb-storage)!: chunked multi-op bucket documents with range-merging compaction and invariant tests#617

Open
Sleepful wants to merge 110 commits into main from compressed-bucket-storage

@Sleepful Sleepful commented Apr 29, 2026

Collaborator

Summary

Replaces MongoDB bucket storage's single-operation-per-document model with chunked multi-operation documents. Operations are now grouped into BSON documents by a ~1MB data-size threshold, reducing document count and index overhead for workloads with many small rows. The change includes range-merging compaction (rebuild from survivors instead of in-place mutation), document-level checksum aggregation, and a comprehensive edge-case test suite verifying data integrity invariants.

This is a breaking change for existing MongoDB storage deployments — databases using the previous single-op document format are not compatible. No migration path is provided.

What Changed

1. Collapse Dual-Version Abstraction

During development, two document formats coexisted behind an abstraction layer. This PR removes the abstraction and all code for the discarded format, leaving a single direct implementation.

Deleted:

  • v5/ directory and all adapter files (was the alternate/new format during development)
  • document-formats/v3-format.ts — single-op format code
  • document-formats/format-interface.ts — dual-format abstraction interface
  • common/MongoSyncBucketStorageCallbacks.ts — callback indirection layer
  • v3/models.ts and v5/models.ts re-export layers
  • VersionedPowerSyncMongo wrappers — storage now uses PowerSyncMongo directly

Renamed:

  • document-formats/v5-format.ts → document-formats/bucket-document-format.ts
  • BucketDataDocumentV5 → BucketDataDocument
  • BucketOperationV5 → BucketOperation

Architecture before:

AbstractMongoSyncBucketStorage
  └── MongoSyncBucketStorage (concrete, delegates via callbacks)
        └── MongoSyncBucketStorageV3 / V5 (thin adapters)

Architecture after:

AbstractMongoSyncBucketStorage
  └── MongoSyncBucketStorageV3 (concrete, direct implementation)

2. Chunked Multi-Op Document Format

The previous model stored exactly one operation per MongoDB document. For workloads with many small rows, this created excessive document and index overhead.

New document shape: BucketDataDocument stores an ops[] array plus aggregated metadata:

  • _id.o = maximum op_id in the document (used for range queries)
  • min_op = minimum op_id
  • count = number of operations
  • checksum = sum of operation checksums
  • size = total byte size of operation data
  • target_op = maximum target_op across operations
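
As a rough illustration of how the aggregated fields above relate to ops[], here is a minimal sketch. The interface names, the `b` bucket key inside `_id`, the use of `number` for op ids, and the plain (unwrapped) checksum sum are assumptions made for this example; the actual definitions live in document-formats/bucket-document-format.ts and may differ.

```typescript
// Hypothetical shapes for illustration only.
interface SketchOp {
  o: number; // op_id (the real code may use bigint)
  checksum: number;
  data: string | null;
  target_op: number | null;
}

interface SketchDoc {
  _id: { b: string; o: number }; // `b` (bucket name) is an assumed key
  min_op: number;
  count: number;
  checksum: number;
  size: number;
  target_op: number | null;
  ops: SketchOp[];
}

// Builds one document from ops that are already sorted ascending by op_id.
function buildBucketDoc(bucket: string, ops: SketchOp[]): SketchDoc {
  if (ops.length === 0) {
    throw new Error("a document needs at least one op");
  }
  let checksum = 0;
  let size = 0;
  let targetOp: number | null = null;
  for (const op of ops) {
    checksum += op.checksum; // plain sum; 32-bit wrapping is not modeled here
    size += op.data === null ? 0 : op.data.length;
    if (op.target_op !== null && (targetOp === null || op.target_op > targetOp)) {
      targetOp = op.target_op; // max of the non-null target_op values
    }
  }
  return {
    _id: { b: bucket, o: ops[ops.length - 1].o }, // max op_id in the document
    min_op: ops[0].o,
    count: ops.length,
    checksum,
    size,
    target_op: targetOp,
    ops
  };
}
```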

Chunking: The write path groups pending operations by bucket, then chunks them into documents by a 1MB data-size threshold. Each chunk becomes one BucketDataDocument. Single-operation chunks remain valid.
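
The chunking rule can be sketched roughly as follows. This is not the PR's chunkBucketData(); the function name, the `SizedOp` shape, and measuring size as `data.length` are assumptions for illustration.

```typescript
// Hypothetical op shape: only the fields the chunker needs.
interface SizedOp {
  o: number;
  data: string | null;
}

// Greedy chunking: start a new chunk whenever adding the next op would push
// the current chunk over the threshold. An op larger than the threshold
// therefore ends up alone in its own chunk.
function chunkBySize(ops: SizedOp[], maxBytes: number = 1024 * 1024): SizedOp[][] {
  const chunks: SizedOp[][] = [];
  let current: SizedOp[] = [];
  let currentBytes = 0;
  for (const op of ops) {
    const bytes = op.data === null ? 0 : op.data.length;
    if (current.length > 0 && currentBytes + bytes > maxBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(op);
    currentBytes += bytes;
  }
  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}
```

With a threshold of 10 bytes and ops of 4, 4, 4, 12 and 1 bytes, this sketch yields four chunks: the first two ops together, then each remaining op alone.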

Read path: getBucketDataBatch() queries by _id.o range, then post-filters individual operations within partially overlapping documents. Operations outside (start, checkpoint] are skipped.
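
A minimal sketch of the two-stage filtering described above, under the assumption that a document is a candidate when its op range overlaps (start, checkpoint]. The names `documentOverlaps` and `extractOpsInRange` are hypothetical, not the PR's buildBucketDataQuery() or extractRowsFromDocument().

```typescript
// Hypothetical shapes for illustration only.
interface RangeOp {
  o: number;
}
interface RangeBucketDoc {
  _id: { o: number }; // max op_id in the document
  min_op: number;
  ops: RangeOp[];
}

// Document-level test: could this document contain ops in (start, checkpoint]?
function documentOverlaps(doc: RangeBucketDoc, start: number, checkpoint: number): boolean {
  return doc._id.o > start && doc.min_op <= checkpoint;
}

// Op-level post-filter: keep only ops with start < o <= checkpoint.
function extractOpsInRange(doc: RangeBucketDoc, start: number, checkpoint: number): RangeOp[] {
  return doc.ops.filter((op) => op.o > start && op.o <= checkpoint);
}
```

For a document covering ops 40, 45, 50, 55 and 60 and a request for (45, 55], the document overlaps the range and the post-filter returns ops 50 and 55.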

Compaction: Instead of modifying documents in-place (previously PUT→MOVE, collapse to CLEAR), the compactor now takes a "rebuild from survivors" approach:

  1. Read all documents in a bucket
  2. Load and expand all operations
  3. Filter superseded PUT/REMOVE operations (newest-to-oldest deduplication by table/row_id/source)
  4. Preserve MOVE and CLEAR operations unconditionally
  5. Re-chunk surviving operations by the same 1MB threshold
  6. Replace old documents with new chunked documents in a transaction
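
Steps 3 and 4 above can be sketched as a single newest-to-oldest pass. The shapes and the function name are hypothetical, and the key here uses only table and row_id; the real compactor also considers the source and handles transactions, batching, and maxOpId.

```typescript
type SketchOpType = "PUT" | "REMOVE" | "MOVE" | "CLEAR";

// Hypothetical op shape for illustration only.
interface CompactOp {
  o: number;
  op: SketchOpType;
  table?: string;
  row_id?: string;
}

// Input is sorted ascending by op_id; output is too.
function selectSurvivors(ops: CompactOp[]): CompactOp[] {
  const seen: Record<string, true> = {};
  const survivors: CompactOp[] = [];
  // Walk newest to oldest so the first PUT/REMOVE seen for a row is the latest.
  for (let i = ops.length - 1; i >= 0; i--) {
    const op = ops[i];
    if (op.op === "MOVE" || op.op === "CLEAR") {
      survivors.push(op); // preserved unconditionally
      continue;
    }
    const key = JSON.stringify([op.table, op.row_id]);
    if (seen[key]) {
      continue; // superseded by a newer op for the same row
    }
    seen[key] = true;
    survivors.push(op);
  }
  return survivors.reverse();
}
```

The survivors would then be passed through the same size-based chunking used on the write path before replacing the old documents.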

Checksums: computePartialChecksumsForCollection() uses the pre-computed document-level checksum aggregate for fully-included documents. Only partially-included documents fall back to iterating individual operations.

Glossary

Fully included document: min_op > start. Example: document covers [40, 60], client asks for (30, 55]. Since min_op=40 > 30, no op in this document falls at or below the start of the client's range. The pipeline uses the pre-computed checksum field on the document, so there is no need to iterate individual ops.

Partially included document: min_op <= start. Example: document covers [40, 60], client asks for (45, 55]. Since min_op=40 <= 45, the ops at the beginning of the document, in [40, 45], are outside the range. The pipeline can't use the pre-computed checksum; instead it must filter individual ops in the ops[] array and sum only those with o > start.
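
The distinction can be sketched in plain TypeScript. This mirrors only the lower-bound decision described in the glossary; the real logic runs as a MongoDB aggregation pipeline, and the shapes and function name here are hypothetical.

```typescript
// Hypothetical shapes for illustration only.
interface ChecksumOp {
  o: number;
  checksum: number;
}
interface ChecksumDoc {
  min_op: number;
  checksum: number; // pre-computed sum over ops[]
  ops: ChecksumOp[];
}

// Checksum contribution of one document for ops with o > start.
function checksumContribution(doc: ChecksumDoc, start: number): number {
  if (doc.min_op > start) {
    return doc.checksum; // fully included: use the document-level aggregate
  }
  let sum = 0;
  for (const op of doc.ops) {
    if (op.o > start) {
      sum += op.checksum; // partially included: sum only ops above start
    }
  }
  return sum;
}
```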

3. Edge Case Hardening & Invariant Tests

Comprehensive test suite verifying data integrity invariants under boundary conditions:

Read Filtering Boundaries (storage_sync.test.ts) — 13 test cases covering all combinations of start and checkpoint positions relative to document boundaries:

  • Full range, exact boundaries, mid-document filters, gap-only ranges, zero-width ranges, beyond-all-docs ranges

Compaction Boundaries (storage_compacting.test.ts) — 8 test cases:

  • Superseded ops removed from middle/first/last documents
  • All ops superseded → empty bucket
  • Single surviving op per document
  • Multiple small survivors merged by rechunking
  • Same row_id spanning document boundaries

Glossary

Rechunking is the process of grouping the surviving ops into new documents using chunkBucketData() — the same function used during normal writes. It groups ops by data size (1MB threshold), creating as many new documents as needed.

Invariant Verification Tests (storage_compacting.test.ts) — 19 unit + integration tests:

  1. ops[] ordering preserved after serialization and compaction
  2. Range metadata consistency (_id.o = max_op, min_op = min_op, count = ops.length, checksum = sum(op.checksum), size = sum(data.length))
  3. target_op correctness (max of non-null target_op values)
  4. No overlapping ranges between documents
  5. Post-query filtering correctness (covered by read filtering matrix)
  6. Compaction survivor integrity (PUT/REMOVE deduplication, MOVE/CLEAR preservation)
  7. Empty document cleanup (documents with no surviving ops deleted)
  8. BSON limit safety (large ops split, oversized single op gets own chunk)
  9. Serialization fidelity (null data, empty strings, unicode preserved)
  10. Document _id.o invariant (equals max op in document)
  11. Checksum consistency (aggregation pipeline matches JavaScript addChecksums)
  12. Compaction with maxOpId filtering (ops above limit excluded)
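
Several of these invariants (ordering, range metadata, non-overlapping ranges, _id.o as the max op) can be checked by one small helper. The sketch below is not taken from the test suite; the shapes and the function name are hypothetical, and checksums are summed without 32-bit wrapping.

```typescript
// Hypothetical shapes; field names follow the PR description.
interface InvOp {
  o: number;
  checksum: number;
  data: string | null;
}
interface InvDoc {
  _id: { o: number };
  min_op: number;
  count: number;
  checksum: number;
  size: number;
  ops: InvOp[];
}

// Returns human-readable violations; an empty array means all checks passed.
// Assumes `docs` is sorted ascending by _id.o.
function findInvariantViolations(docs: InvDoc[]): string[] {
  const violations: string[] = [];
  let previousMax: number | null = null;
  for (const doc of docs) {
    const label = `doc ${doc._id.o}`;
    if (doc.ops.length === 0) {
      violations.push(`${label}: empty ops[]`);
      continue;
    }
    let checksum = 0;
    let size = 0;
    for (let i = 0; i < doc.ops.length; i++) {
      const op = doc.ops[i];
      if (i > 0 && op.o <= doc.ops[i - 1].o) {
        violations.push(`${label}: ops[] not ascending`);
      }
      checksum += op.checksum;
      size += op.data === null ? 0 : op.data.length;
    }
    if (doc._id.o !== doc.ops[doc.ops.length - 1].o) violations.push(`${label}: _id.o is not the max op`);
    if (doc.min_op !== doc.ops[0].o) violations.push(`${label}: min_op is not the min op`);
    if (doc.count !== doc.ops.length) violations.push(`${label}: count mismatch`);
    if (doc.checksum !== checksum) violations.push(`${label}: checksum mismatch`);
    if (doc.size !== size) violations.push(`${label}: size mismatch`);
    if (previousMax !== null && doc.min_op <= previousMax) {
      violations.push(`${label}: overlaps previous document`);
    }
    previousMax = doc._id.o;
  }
  return violations;
}
```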

Breaking Changes

MongoDB storage: Existing deployments using the previous single-operation-per-document format are not compatible with this change. This requires a fresh deployment or manual migration (not provided).

V1 storage is unaffected.

Test Results

All existing parameterized tests continue to pass. New edge-case tests pass with no regressions.

# module-mongodb-storage
pnpm --filter='./modules/module-mongodb-storage' test -- --run
→ all pass

Key Files Changed

Detailed description per file

Files Changed

.changeset/

  • wild-pears-sing.md — Breaking changeset for service-core and module-mongodb-storage for the chunked multi-op document format.

modules/module-mongodb-storage/src/storage/

  • MongoBucketStorage.ts — Factory and lifecycle methods for V3 storage, providing direct instantiation without version dispatch.
  • storage-index.ts — Re-exports updated for new shared modules (common/models.ts, bucket-operations/*) and consolidated type names.

modules/module-mongodb-storage/src/storage/implementation/

Core storage layer. The abstract base class and shared infrastructure live here; V1 and V3 specifics are in their respective subdirectories.

  • AbstractMongoSyncBucketStorage.ts — Abstract base class with shared storage logic. V3 implements this directly without callback indirection.
  • createMongoSyncBucketStorage.ts — Factory that instantiates MongoSyncBucketStorageV3 for V3 storage.
  • db.ts — versioned() factory returning the appropriate VersionedPowerSyncMongo per storage version.
  • MongoBucketBatch.ts — Thin base class with the common batch interface and fields. Write-path logic lives in V1 and V3 subclasses.
  • MongoChecksums.ts — Shared checksum infrastructure; imports from common/models.ts.
  • MongoCompactor.ts — Shared compaction base with range-merging scaffolding. Sets target_op during MOVE and CLEAR phases.
  • MongoParameterCompactor.ts — Concrete parameter compactor with default collectionFilter() and deleteFilter() implementations. Used directly by both V1 and V3.
  • MongoPersistedSyncRulesContent.ts — Sync rules persistence using shared VersionedPowerSyncMongo collection accessors.
  • models.ts — Top-level implementation models. Shared types moved to common/models.ts and document-formats/bucket-document-format.ts.

modules/module-mongodb-storage/src/storage/implementation/bucket-operations/

Shared helpers extracted from the write path, compaction pipeline, and read path. All new files.

  • batch-write.ts — Write-path helper for flushing bucket data batches, shared by V1 and V3.
  • checksum-aggregation.ts — Document-level checksum aggregation for the compaction pipeline. Uses the pre-computed checksum field on BucketDataDocument for fully-included documents; falls back to iterating ops[] for partially-included ones.
  • chunking.ts — chunkBucketData() groups ops into documents by a 1MB data-size threshold. Single oversized ops get their own chunk. Used by both the write path and compaction rechunking.
  • compaction-scaffolding.ts — Compaction utilities: loading all ops in a bucket, deduplicating by table/row_id (newest-first), and rebuilding survivor documents.
  • query-builders.ts — Query construction helpers for bucket data reads. Builds the (start, checkpoint] range query using min_op for the upper bound to catch documents that straddle the range boundary.
  • source-record-store-impl.ts — Concrete SourceRecordStore implementation using shared collection accessors.

modules/module-mongodb-storage/src/storage/implementation/collection-access/

  • versioned-collections.ts — Shared collection accessor interface and factory for VersionedPowerSyncMongo. Provides typed access to bucket data, source records, parameter indexes, and source tables.

modules/module-mongodb-storage/src/storage/implementation/common/

Shared types and base classes used across V1 and V3.

  • models.ts — Shared model types: CurrentBucket, RecordedLookup, CurrentDataDocument, BucketParameterDocument, SourceTableDocument, BucketStateDocument.
  • PersistedBatchShared.ts — Shared batch persistence logic for flushing bucket data documents via serializeBucketData().
  • BucketDataDoc.ts, PersistedBatch.ts, SingleBucketStore.ts, VersionedPowerSyncMongoBase.ts — Minor import and type updates.

modules/module-mongodb-storage/src/storage/implementation/document-formats/

The chunked multi-op document format.

  • bucket-document-format.ts — Core format definition. BucketDataDocument stores an ops[] array with aggregated metadata (_id.o, min_op, count, checksum, size, target_op). serializeBucketData() groups ops and computes aggregates. buildBucketDataQuery() constructs range queries with min_op upper bound. extractRowsFromDocument() post-filters individual ops within partially overlapping documents.
  • parameter-lookup.ts — Serialization/deserialization for parameter lookup values stored in bucket documents.

modules/module-mongodb-storage/src/storage/implementation/v1/

V1 (single-op document format) is structurally updated to inline shared logic but has no functional changes.

  • MongoBucketBatchV1.ts — Inlined shared write-path logic. Single-op document format preserved.
  • MongoSyncBucketStorageV1.ts — Inlined shared storage operations.
  • MongoCompactorV1.ts, MongoChecksumsV1.ts, MongoParameterCompactorV1.ts, PersistedBatchV1.ts, SingleBucketStoreV1.ts, VersionedPowerSyncMongoV1.ts, models.ts — Import updates and minor refactoring for shared types.

modules/module-mongodb-storage/src/storage/implementation/v3/

Primary V3 implementation using chunked multi-op documents.

  • MongoSyncBucketStorageV3.ts — Main V3 storage class. Implements all operations directly: read path uses buildBucketDataQuery() with min_op upper bound and extractRowsFromDocument() for post-filtering; write path delegates to MongoBucketBatchV3.
  • MongoBucketBatchV3.ts — V3 write-path batch. Handles multi-op document serialization and chunked writes via shared bucket-operations/ helpers.
  • MongoCompactorV3.ts — Range-merging compaction: reads all ops, deduplicates by table/row_id (newest-first), rechunks survivors by 1MB threshold, replaces old documents in a transaction.
  • MongoChecksumsV3.ts — Document-level checksum aggregation. Fully-included documents use the pre-computed checksum field; partially-included documents iterate ops[].
  • PersistedBatchV3.ts — Thin wrapper delegating to shared PersistedBatchShared.
  • SingleBucketStoreV3.ts — Uses shared document format and generator-based load function for iterating ops within multi-op documents.
  • SourceRecordStoreV3.ts — Uses shared SourceRecordStoreImpl.
  • VersionedPowerSyncMongoV3.ts — Extends shared VersionedPowerSyncMongo directly.
  • models.ts — Re-exports from common/models.ts with V3-specific types kept locally.

Deleted:

  • MongoParameterCompactorV3.ts — Consolidated into shared MongoParameterCompactor.
  • MongoParameterLookupV3.ts — Consolidated into shared document-formats/parameter-lookup.ts.

modules/module-mongodb-storage/src/utils/

  • util.ts — Added utility export for shared storage code.

modules/module-mongodb-storage/test/src/

  • storage_compacting.test.ts — 8 compaction boundary tests (deduplication across document boundaries, empty buckets, single survivors, cross-doc row_ids, rechunking) and 22 invariant/edge-case tests (ops[] ordering, range metadata consistency, target_op correctness, non-overlapping ranges, BSON limit safety, serialization fidelity, checksum consistency, maxOpId filtering).
  • storage_sync.test.ts — 13 read filtering boundary tests exercising (start, checkpoint] semantics with pre-inserted documents. Existing V3 tests updated to use shared types.
  • storage.test.ts — Added compressedBucketStorage flag to V3 test config.
  • __snapshots__/storage.test.ts.snap — New snapshots for V3 storage initialization.
  • __snapshots__/storage_sync.test.ts.snap — Expanded snapshots reflecting multi-op document format.

modules/module-postgres-storage/test/src/

  • storage.test.ts — Added compressedBucketStorage: false to test config.
  • storage_sync.test.ts — Added compressedBucketStorage flag to shared test registration.

packages/service-core-tests/src/tests/

  • register-data-storage-data-tests.ts — Shared data storage tests updated with compressedBucketStorage flag for conditional assertions on multi-op vs single-op document shapes.
  • register-sync-tests.ts — Shared sync test registration with compressedBucketStorage flag for document format assertions.

packages/service-core/src/storage/

  • BucketStorageFactory.ts — Added compressedBucketStorage boolean to TestStorageConfig. Controls whether shared tests assert multi-op document shapes.

| Area | Files |
| --- | --- |
| Factory & routing | createMongoSyncBucketStorage.ts, db.ts |
| MongoDB storage implementation | v3/MongoSyncBucketStorageV3.ts, v3/MongoCompactorV3.ts, v3/MongoChecksumsV3.ts, v3/PersistedBatchV3.ts, v3/MongoBucketBatchV3.ts |
| Shared helpers | bucket-operations/chunking.ts, bucket-operations/batch-write.ts, bucket-operations/checksum-aggregation.ts, bucket-operations/compaction-scaffolding.ts, bucket-operations/query-builders.ts |
| Document format | document-formats/bucket-document-format.ts, document-formats/parameter-lookup.ts |
| Models & types | common/models.ts, common/BucketDataDoc.ts |
| Base classes | AbstractMongoSyncBucketStorage.ts, MongoSyncBucketStorage.ts |
| Tests | test/src/storage_sync.test.ts, test/src/storage_compacting.test.ts |
| Changeset | .changeset/wild-pears-sing.md |

Follow-up Work

  • Benchmark and tune 1MB chunk threshold under production workloads
  • Extract 1MB magic number to shared constant if tuning proves necessary
  • Monitor for edge cases not covered by the test matrix

Renames all class, function, type, and collection accessor names in
the duplicated v5 storage implementation from V3→V5:
- MongoBucketBatchV3 → MongoBucketBatchV5
- MongoChecksumsV3 → MongoChecksumsV5
- MongoCompactorV3 → MongoCompactorV5
- MongoParameterCompactorV3 → MongoParameterCompactorV5
- MongoParameterLookupV3 → MongoParameterLookupV5
- MongoSyncBucketStorageV3 → MongoSyncBucketStorageV5
- PersistedBatchV3 → PersistedBatchV5
- SingleBucketStoreV3 → SingleBucketStoreV5
- SourceRecordStoreV3 → SourceRecordStoreV5
- VersionedPowerSyncMongoV3 → VersionedPowerSyncMongoV5

Also adds compressedBucketStorage to StorageConfig and wires up
MongoSyncBucketStorageV5 selection in createMongoSyncBucketStorage.

This is a pure mechanical rename with no behavior changes.

Change BucketDataDocumentV5 to store arrays of operations per document:
- Add BucketOperationV5 interface with per-op fields including op_id
- Add aggregated fields: min_op, checksum, count, size
- Implement serializeBucketDataV5() to group ops and compute aggregates
- Implement loadBucketDataDocumentV5() as generator yielding from ops array

Add chunking logic in PersistedBatchV5.flushBucketData():
- Group operations by bucket then chunk by 1MB size threshold
- Single-op chunks remain valid for backward compatibility

Update read path in MongoSyncBucketStorageV5 to iterate merged docs.
Update SingleBucketStoreV5 for new generator-based load function.

Overrides compactSingleBucket in MongoCompactorV5 to handle the
compressed bucket storage model:

1. Reads all documents in a bucket sorted by _id.o ascending
2. Loads all ops via loadBucketDataDocumentV5()
3. Filters superseded operations using the same row_id tracking
   logic as v3 (newest-to-oldest pass, keeps only latest PUT/REMOVE
   per row)
4. Re-chunks surviving ops by 1MB data-size threshold
5. Replaces old documents with new chunked docs in a transaction
6. Updates bucket_state with recomputed checksums, counts, and bytes

Unlike v3, v5 does not create MOVE/CLEAR ops during compaction.
Instead, superseded ops are dropped and surviving ops are fully
restructured into new documents.
…egation and activate v5 in test matrix

- Override MongoChecksumsV5.computePartialChecksumsForCollection to use
document-level checksum field instead of expanding ops arrays
- Handle partial ranges correctly by filtering ops when start > min_op
- Fix getBucketDataBatchV5 to respect op-level limits instead of document limits
- Update PowerSyncMongo.versioned to create VersionedPowerSyncMongoV5 for v5
- Add STORAGE_VERSION_5 to SUPPORTED_STORAGE_VERSIONS and STORAGE_VERSION_CONFIG
- Update getMongoStorageConfig to enable compressedBucketStorage for v5
- Fix v3-specific tests to only run on storageVersion == 3

changeset-bot Bot commented Apr 29, 2026


🦋 Changeset detected

Latest commit: 3493dc4

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 12 packages
| Name | Type |
| --- | --- |
| @powersync/service-module-mongodb-storage | Minor |
| @powersync/service-core | Minor |
| @powersync/service-schema | Minor |
| @powersync/service-module-convex | Patch |
| @powersync/service-module-mongodb | Patch |
| @powersync/service-module-mssql | Patch |
| @powersync/service-module-mysql | Patch |
| @powersync/service-module-postgres | Patch |
| @powersync/service-image | Minor |
| @powersync/service-module-core | Patch |
| @powersync/service-module-postgres-storage | Patch |
| test-client | Patch |

@Sleepful Sleepful force-pushed the compressed-bucket-storage branch from f4f82ee to b4d71e3 Compare April 29, 2026 06:02
@Sleepful Sleepful force-pushed the compressed-bucket-storage branch from b4d71e3 to 755fad1 Compare April 29, 2026 06:08
Sleepful added 18 commits May 6, 2026 02:56
…tractMongoSyncBucketStorage and MongoSyncBucketStorageBase → MongoSyncBucketStorage
…ter to MongoParameterCompactor base class

Make collectionFilter() and deleteFilter() concrete in the base class
with the V3/V5 implementation (returns {} and {lookup, _id, key}
respectively). Remove the abstract keyword from the base class.

Delete the now-redundant V3 and V5 parameter compactor subclasses:
- v3/MongoParameterCompactorV3.ts
- v5/MongoParameterCompactorV5.ts

Update MongoSyncBucketStorageV3 and V5 to instantiate MongoParameterCompactor
directly, passing the collection lister callback inline.
…acks interface to separate file

- Create common/MongoSyncBucketStorageCallbacks.ts with the full interface
- Replace inline MongoSyncBucketStorageBaseCallbacks in MongoSyncBucketStorageBase.ts
- Type _versionCallbacks as MongoSyncBucketStorageCallbacks in AbstractMongoSyncBucketStorage
- Update v3 and v5 implementations to import from the new file
- Use 'any' for createCompactor's storage parameter to avoid circular imports

Move getParameterSetsShared, getBucketDataBatchSharedWrapper,
getDataBucketChangesShared, and getParameterBucketChangesShared from
bucket-operations/storage-operations.ts into MongoSyncBucketStorageBase as
private method implementations. Eliminate the context object pattern by
accessing this.callbacks and this.group_id directly. Flatten the
getBucketDataBatchShared -> getBucketDataBatchSharedWrapper chain into a
single getBucketDataBatchImpl method. Delete the now-unused
bucket-operations/storage-operations.ts.

Extract identical types from v3/models.ts and v5/models.ts into a shared
common/models.ts without version suffixes:
- CurrentBucket
- RecordedLookup
- CurrentDataDocument
- BucketParameterDocument
- SourceTableDocument
- BucketStateDocument
- taggedBucketParameterDocumentToTagged

Update v3/models.ts and v5/models.ts to re-export from common/models.ts,
keeping only version-specific exports (BucketDataDocumentV3/V5, etc.).

Update all imports across the codebase to use non-suffixed names from
common/models.ts or version-specific names where appropriate.

Update storage-index.ts to use explicit exports to avoid naming conflicts
with v1/models.ts and models.ts.
Sleepful added 5 commits June 9, 2026 00:12
…cument

Yield doc.target_op ?? null instead of hardcoded null. This fixes
two issues: the CLEAR pass now correctly accumulates target_op from
collapsed MOVEs, and the sync path now surfaces target_op in the
SyncBucketDataChunk response for checkpoint invalidation.

Upstream test used db.sourceRecordsV3() which was renamed to
db.sourceRecords() during the V3 suffix removal refactor.

Replace TRequest generic with concrete FetchPartialBucketChecksumByBucket
in createBucketFilter, buildPartialChecksumPipeline, and
normalizePartialChecksumResults. Call createBucketFilter directly
in buildPartialChecksumPipeline instead of threading it as a parameter.

Drop the unused createFilter parameter from the V3 override of
computePartialChecksumsForCollection. TypeScript allows method
overrides to have fewer parameters than the base, and the V3 body
no longer uses it. Callers still pass createBucketFilter to satisfy
the base class contract, but it is silently stripped.
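
The TypeScript rule mentioned here can be seen in a small standalone sketch. The class and method names are hypothetical stand-ins, not the PR's MongoChecksums classes.

```typescript
// Hypothetical base class: the method takes a filter factory as a second parameter.
class BaseChecksums {
  compute(buckets: string[], createFilter: (bucket: string) => string): number {
    return buckets.map(createFilter).join("").length;
  }
}

// The override declares fewer parameters than the base. This type-checks
// because a function that ignores trailing arguments is assignable to one
// that accepts them.
class DerivedChecksums extends BaseChecksums {
  override compute(buckets: string[]): number {
    return buckets.length;
  }
}

// Callers typed against the base still pass both arguments;
// the extra argument is simply ignored by the override.
const checksumImpl: BaseChecksums = new DerivedChecksums();
const overrideResult = checksumImpl.compute(["a", "b"], (bucket) => bucket + "!");
```

Here `overrideResult` is 2, since the derived method never calls the filter factory.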

The two generics TRequest and TBucketDataDocument on the override
must stay — MongoDB Collection<T> is invariant and V1 callers
pass types narrower than the concrete alternatives.
@Sleepful Sleepful force-pushed the compressed-bucket-storage branch from 7d9c704 to d679a90 Compare June 9, 2026 07:46
@Sleepful Sleepful requested a review from rkistner June 9, 2026 07:57
rkistner and others added 3 commits June 10, 2026 10:16
Move the base class implementation into MongoChecksumsV1 and inline
it into computePartialChecksumsDirectByBucket. V1 was the only consumer
of the base method; V3 had its own override with a different pipeline.

Rename V3's override to computeChecksumsByDefinition (private, not an
override). Drop all generics — TRequest, TBucketDataDocument — and the
vestigial createFilter parameter. Use concrete FetchPartialBucketChecksumByBucket
and BucketDataDocumentBase directly.

Export DEFAULT_OPERATION_BATCH_LIMIT and make storageConfig protected
so V1 can access them after inlining.
Comment thread modules/module-mongodb-storage/test/src/__snapshots__/storage.test.ts.snap Outdated
Comment thread modules/module-mongodb-storage/src/storage/implementation/v3/MongoCompactorV3.ts Outdated
Comment thread modules/module-mongodb-storage/src/storage/implementation/v3/MongoCompactorV3.ts Outdated
Comment thread modules/module-mongodb-storage/src/storage/implementation/v3/MongoCompactorV3.ts Outdated
Comment thread modules/module-mongodb-storage/src/storage/implementation/v3/MongoCompactorV3.ts Outdated
Comment thread modules/module-mongodb-storage/src/storage/implementation/v3/MongoCompactorV3.ts Outdated
Comment thread modules/module-mongodb-storage/src/storage/implementation/v3/MongoCompactorV3.ts Outdated
Comment thread modules/module-mongodb-storage/src/storage/implementation/v3/MongoCompactorV3.ts Outdated
Comment thread packages/service-core-tests/src/tests/register-sync-tests.ts Outdated
Comment thread .changeset/wild-pears-sing.md Outdated
Comment thread modules/module-mongodb-storage/src/storage/implementation/MongoBucketBatch.ts Outdated
Comment thread modules/module-mongodb-storage/src/storage/implementation/MongoBucketBatch.ts Outdated
Sleepful and others added 15 commits June 11, 2026 16:45
Use Collection<BucketDataDocumentV3> directly instead of casting
through BucketDataDocumentBase. Drop the unused BucketDataDocumentBase
import.

Use addChecksums for op-level checksum accumulation (combinedChecksum)
in clearBucketLeading. The CLEAR document's checksum field must be
32-bit wrapped. Convert from number to bigint via BigInt() when
constructing the CLEAR op.

The expectedChecksum (verification comparison) accumulator keeps
bigint += since it must match MongoDB's unwrapped $sum result.

Remove the compressedBucketStorage if/else block that wrapped identical
assertions. Upstream already expects CLEAR ops in the compacting data
checkpoint test for all storage versions. Drop compressedBucketStorage
from registerSyncTests options type and caller sites.
…ount

Pass-through ops (o > maxOpId) were previously included in the
compacted_state checksum, inflating it beyond the compaction horizon.
Add o <= maxOpId guard to totalChecksum, totalOpBytes, and
totalOpCount accumulations.

Use addChecksums for both expectedChecksum accumulation and wrap
verification.checksumSum to signed 32-bit via (Number(bigint & 0xffffffffn) | 0)
before comparison. This keeps expected and actual in the same 32-bit
wrapped domain, matching the pattern in MongoChecksums.ts.
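
To illustrate why both sides need to be in the same 32-bit wrapped domain, here is a minimal sketch. `wrapToInt32` applies the expression quoted in the commit message; `addWrapped` is a hypothetical stand-in for addChecksums and is assumed, not confirmed, to behave the same way. BigInt literals require an ES2020 or later target.

```typescript
// Wrap an unbounded bigint sum (such as a MongoDB $sum result) to signed 32-bit.
function wrapToInt32(sum: bigint): number {
  return Number(sum & 0xffffffffn) | 0;
}

// Hypothetical stand-in for addChecksums: add two numbers and wrap to signed 32-bit.
function addWrapped(a: number, b: number): number {
  return (a + b) | 0;
}

// Accumulate checksums op by op in the wrapped domain.
function sumWrapped(checksums: number[]): number {
  let total = 0;
  for (const c of checksums) {
    total = addWrapped(total, c);
  }
  return total;
}

// Accumulate the same checksums without wrapping, as a bigint.
function sumUnwrapped(checksums: number[]): bigint {
  let total = 0n;
  for (const c of checksums) {
    total += BigInt(c);
  }
  return total;
}
```

Comparing `sumWrapped(values)` against `wrapToInt32(sumUnwrapped(values))` keeps both sides in the same domain, whereas comparing a wrapped number directly against an unwrapped bigint would fail once the sum overflows 32 bits.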
totalOpCount += surviving.filter((op) => op.o <= this.maxOpId).length;

// --- Advance to next batch ---
upperBound = rawBatch[batchCutIndex - 1]._id as typeof upperBound;

Contributor


Non-blocking issue - can be fixed in a future PR:

Suggested change
- upperBound = rawBatch[batchCutIndex - 1]._id as typeof upperBound;
+ upperBound = (newDocs.length > 0 ? newDocs[0]._id : rawBatch[batchCutIndex - 1]._id) as typeof upperBound;

This fixes an issue if chunking splits a batch into smaller ones, which would then re-read some of the same data on the next iteration. This happens because we're filtering on _id.o, which is the upper id in the chunk. If the chunk is split into smaller chunks, the next iteration would re-read the smaller one again.

This is not a blocking issue because the chunk size is currently constant, meaning this shouldn't happen in practice. But it will become an issue if we ever change the chunk size.

Ideally we should also have a test case for this.

Details

Issue found by Claude, summary written by Codex

V3 compactor upper-bound issue

The important detail: V3 bucket-data pagination is based on the document key
_id.o, and _id.o is only the largest op id in that document.

A document can cover a range of ops:

D: min_op = 100, _id.o = 150, ops = [100, 110, 120, 130, 140, 150]

When the compactor scans newest-to-oldest, it uses _id.o as the cursor:

_id: { $gte: lowerBound, $lt: upperBound }

After processing a batch, the current code advances the cursor to the oldest
document read:

upperBound = rawBatch[batchCutIndex - 1]._id;

That document has just been deleted and replaced with newly chunked documents.
If rechunking splits it into smaller documents, the first replacement can have a
smaller _id.o than the deleted document:

C1: min_op = 100, _id.o = 120, ops = [100, 110, 120]
C2: min_op = 130, _id.o = 150, ops = [130, 140, 150]

With upperBound still set to { o: 150 }, the next query matches C1
because 120 < 150. The compactor then re-reads data it already processed.
Since the seen map persists across batches, a re-read live PUT can be converted
to a MOVE pointing at itself. That drops the row data while preserving the op
checksum, so checksums do not reliably catch the issue.

The cursor should be based on the oldest replacement document, not the deleted
document:

upperBound = newDocs.length > 0 ? newDocs[0]._id : rawBatch[batchCutIndex - 1]._id;

newDocs[0]._id.o is still an _id.o cursor, so it matches the query shape.
It also points to the actual rewritten boundary, excluding all replacement
documents from the next scan while keeping older untouched documents eligible.

Using min_op would also avoid re-reading the replacements, but it would make
the cursor jump before the oldest replacement document instead of to its actual
document key. Since the query paginates by _id.o, the replacement document's
_id is the tighter and more direct bound.
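
The D, C1, C2 scenario above can be reproduced with a small simulation. This sketch reduces the compound _id to its `o` component and uses hypothetical names; it only models which documents the next scan would match, not the compactor itself.

```typescript
// Hypothetical document shape: only the range metadata matters here.
interface CursorDoc {
  idO: number; // stands in for _id.o
  minOp: number;
}

// Mirrors the query shape `_id: { $gte: lowerBound, $lt: upperBound }`.
function matchesNextScan(doc: CursorDoc, lowerBound: number, upperBound: number): boolean {
  return doc.idO >= lowerBound && doc.idO < upperBound;
}

// Current behavior: advance to the oldest document that was read (now deleted).
function currentUpperBound(oldestRead: CursorDoc): number {
  return oldestRead.idO;
}

// Suggested behavior: advance to the oldest replacement document, if any.
function suggestedUpperBound(newDocs: CursorDoc[], oldestRead: CursorDoc): number {
  return newDocs.length > 0 ? newDocs[0].idO : oldestRead.idO;
}

// Documents the next scan would match out of a candidate set.
function nextScan(docs: CursorDoc[], upperBound: number): CursorDoc[] {
  return docs.filter((doc) => matchesNextScan(doc, 0, upperBound));
}
```

With D = { idO: 150, minOp: 100 } replaced by C1 = { idO: 120, minOp: 100 } and C2 = { idO: 150, minOp: 130 }, the current cursor (150) makes the next scan match C1 again, while the suggested cursor (120) excludes both replacements and still matches any older, untouched document.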
