Add buffer-managed-state design doc and swap-extent buffer pool prototype #36973
Draft
DAlperin wants to merge 25 commits into
08218b0 to
684d6ba
…stash pager at render
Fold the June 2026 staging measurements into the buffer-managed-state design doc: bounded accumulation at the budget floor, die-young elision observed via spill cancellations, off-worker eviction as the de facto executor answer for the swap-backed store, exact-size extents, size-class coverage, the phase-scoped boundedness finding (seal drain fixed, arrange_core materialization now an open question), and the working-set accounting caveats. Add a forward-looking section mapping the object-store literature (cloud-native tiering, AnyBlob request economics, far-memory interface results, log-structured GC) onto the extent seam, including the persist-convergence question and the EBS-swap intermediate.
Chains shorter than MIN_PAGED_CHAIN_LEN (4 chunks) no longer route their entries through the pager. The rebalancing cascade consumes short chains almost immediately, so paging their chunks only scheduled work that the next merge cancelled; under hydration load this was measured as the spill queue pinned at its cap with cancellation rates of 100-400/s. Singleton pushes and below-threshold merge outputs stay resident; chunks reach the pager once they land in a chain long enough to sit out a few rebalance rounds, and the seal's extract path pages keep/ship buffers as before. Resident overhead is bounded by the chain-stack shape (the youngest chain is under half its predecessor, so sub-threshold chains hold fewer than MIN_PAGED_CHAIN_LEN chunks between them): single-digit MiB per batcher, paid per worker per consumer. The disabled pager is safe on the rehydration side because ColumnPager::take is variant-driven: pooled and paged inputs rehydrate through their own handles regardless of which pager performs the take.
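The routing rule in this commit message can be pictured with a minimal sketch. Only MIN_PAGED_CHAIN_LEN and its value come from the message; the Placement enum and both helper functions are hypothetical names for illustration, and the real batcher decides per chunk inside the merge machinery rather than through a standalone function.

```rust
/// Threshold named in the commit message: chains shorter than this stay resident.
const MIN_PAGED_CHAIN_LEN: usize = 4;

/// Hypothetical stand-in for where a chain's chunks live.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Placement {
    /// Chunks stay in memory and never touch the spill queue.
    Resident,
    /// Chunks are handed to the pager and may be spilled.
    Paged,
}

/// Decide placement from chain length alone (an assumption of this sketch).
fn placement_for_chain(chain_len: usize) -> Placement {
    if chain_len < MIN_PAGED_CHAIN_LEN {
        Placement::Resident
    } else {
        Placement::Paged
    }
}

/// Count the chunks that stay resident for a given chain stack,
/// where each entry is the length (in chunks) of one chain.
fn resident_chunks(chain_lens: &[usize]) -> usize {
    chain_lens
        .iter()
        .copied()
        .filter(|&len| placement_for_chain(len) == Placement::Resident)
        .sum()
}
```

With a chain stack of lengths 16, 6 and 2, only the youngest chain is below the threshold, so this sketch reports two resident chunks and leaves the rest to the pager.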
…ll_worker_count
The prior name was registered in LaunchDarkly with the wrong data type, and LD does not allow changing a parameter's type after creation; a fresh name lets the flag be recreated correctly. Default and semantics unchanged.
8a4675e to
e9cf871
Motivation
Working sets that exceed memory currently spill through mz_ore::pager's two blob backends (kernel swap via MADV_COLD, per-chunk scratch files), each accidentally good at half the workload: swap is lazy and translation-free but kernel-paced (per-4 KiB synchronous faults on worker threads, direct reclaim); files are controllable but pay per-chunk inode churn and eager full-cost serialization, with residency decided irrevocably at pageout time.
This PR proposes a successor architecture and includes a working prototype of its first two layers.
Design doc
doc/developer/design/20260610_buffer_managed_state.md: a buffer-managed architecture in the Umbra/LeanStore/vmcache lineage, adapted to properties those systems don't have: state is immutable once sealed, recreatable from persist (no WAL/manifest/fsync anywhere), and its lifecycle is known to the engine rather than guessed by a cache. Three layers: pooled extent store, stable-address buffer pool with write-behind and lifecycle-driven eviction, and paged sealed arrangement batches. Covers the lgalloc → swap → explicit-pager history, integration with differential's pending Chunk abstraction (TimelyDataflow/differential-dataflow#744), performance estimates against the measured baselines, an eager/lazy backing policy keyed to spine level, and an incremental migration story that coexists with node-level swap indefinitely.
Because production nodes currently provision the whole disk as swap (no scratch filesystem), deployment starts with a swap-backed extent store: extents are anonymous allocations holding lz4-compressed bytes pushed to the swap device with MADV_PAGEOUT. This is the strategy benchmarked in CLU-108 / #36948, generalized into the pool's backing layer.
Prototype
- mz_ore::pool: size-class anonymous VM regions (64 KiB–2 MiB) with stable slot addresses; per-chunk state machine (UnbackedResident/BackedResident/Evicted/Oversize) under a per-chunk mutex with pin guards; swap-backed extents (lz4 at the eviction boundary only; resident bytes are always uncompressed); FIFO + second-chance eviction against a live-tunable resident-bytes budget. The design's two load-bearing properties hold by construction and are test-asserted: chunks freed before eviction never cost a compression or a write (elided_frees), and re-evicting an already-backed chunk does no I/O (evictions_cheap).
- column_pager integration: a PagedColumn::Pooled path and ColumnPager::pooled, so the columnar merge batcher works unchanged above the seam.
- PagedRun (mz_timely_util::columnar::paged_run): a standalone Layer 3 prototype. Sealed sorted runs are stored as eagerly-evicted pool pages plus resident fence keys, with zero-copy seeks via borrowed columnar views over pinned pool memory, prefetching iteration, and a streaming bounded-window merge. Not yet wired to arrangements; it exists to prove the format and the borrow-safety story.
- column_paged_batcher_use_pool (default off) routes the compute batchers through the pool with the existing fraction-derived budget; enable_upsert_paged_spill now follows whichever mechanism (pool or tiered) the last config apply installed, so the storage upsert stash opts in with no storage-side changes. Nine mz_column_pool_* metrics expose the pool's counters, including the elision rate the design's estimates hinge on.
Testing
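The two properties the pool tests assert, elided_frees and evictions_cheap, can be illustrated with a minimal sketch of the per-chunk state machine. The four state names and those two counter names come from the description above; everything else (the Chunk and Stats types, the evictions_full and faults counters, the method names, and treating Oversize as never evicted) is an assumption made for illustration, and the real pool additionally holds a per-chunk mutex, pin guards, and the actual lz4 and madvise work.

```rust
/// Chunk states named in the PR description.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkState {
    /// In memory, no extent written yet.
    UnbackedResident,
    /// In memory, and an extent already holds the same (immutable) bytes.
    BackedResident,
    /// Only the extent holds the bytes.
    Evicted,
    /// Too large for any size class; assumed here to bypass eviction.
    Oversize,
}

/// Counters; `evictions_full` and `faults` are hypothetical names.
#[derive(Debug, Default, PartialEq, Eq)]
struct Stats {
    evictions_full: u64,
    evictions_cheap: u64,
    elided_frees: u64,
    faults: u64,
}

struct Chunk {
    state: ChunkState,
}

impl Chunk {
    fn new() -> Self {
        Chunk { state: ChunkState::UnbackedResident }
    }

    /// Evict the chunk. Only the first eviction pays for compression and a write.
    fn evict(&mut self, stats: &mut Stats) {
        match self.state {
            ChunkState::UnbackedResident => {
                stats.evictions_full += 1; // compress + write would happen here
                self.state = ChunkState::Evicted;
            }
            ChunkState::BackedResident => {
                stats.evictions_cheap += 1; // extent is already valid: no I/O
                self.state = ChunkState::Evicted;
            }
            ChunkState::Evicted | ChunkState::Oversize => {}
        }
    }

    /// Bring an evicted chunk back; the extent is kept, so the chunk stays backed.
    fn fault_in(&mut self, stats: &mut Stats) {
        if self.state == ChunkState::Evicted {
            stats.faults += 1;
            self.state = ChunkState::BackedResident;
        }
    }

    /// Free the chunk. A chunk that was never evicted never paid for a write.
    fn free(self, stats: &mut Stats) {
        if self.state == ChunkState::UnbackedResident {
            stats.elided_frees += 1;
        }
    }
}
```

In this sketch a chunk that is evicted, faulted back in, and evicted again records one full eviction and one cheap one, while a chunk freed straight from UnbackedResident only increments elided_frees.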
Unit tests throughout: 24 for the pool (round trips, evict/fault integrity, a slot-poisoning test proving fault-in reads the extent rather than stale MADV_DONTNEED'd memory, budget enforcement, second chance, dead-data elision, stable addresses, multithreaded smoke), batcher-level round trips through the pool with stats assertions, fault-count-exact seek tests and reference-checked bounded-window merges for PagedRun, and a mechanism-switch routing test for the config seam. The new flag is registered with mzcompose's system-parameter list and parallel-workload.
Status
Draft for discussion alongside the design doc (also up separately as the dov/buffer-managed-state-design branch). The prototype intentionally takes the design's simplest sound options (synchronous on-worker I/O, per-chunk mutexes instead of epochs, owned rehydration), all marked as such in the doc's open questions.
Generated with Claude Code