Add buffer-managed-state design doc and swap-extent buffer pool prototype #36973

Draft
DAlperin wants to merge 25 commits into MaterializeInc:main from DAlperin:dov/swap-pool-prototype

@DAlperin

Member

Motivation

Working sets that exceed memory currently spill through the two blob backends of mz_ore::pager: kernel swap via MADV_COLD and per-chunk scratch files. Each is accidentally good at half the workload. Swap is lazy and translation-free but kernel-paced (per-4 KiB synchronous faults on worker threads, direct reclaim). Files are controllable but pay per-chunk inode churn and eager full-cost serialization, with residency decided irrevocably at pageout time.

This PR proposes a successor architecture and includes a working prototype of its first two layers.

Design doc

doc/developer/design/20260610_buffer_managed_state.md: a buffer-managed architecture in the Umbra/LeanStore/vmcache lineage, adapted to properties those systems don't have — state is immutable once sealed, recreatable from persist (no WAL/manifest/fsync anywhere), and its lifecycle is known to the engine rather than guessed by a cache. Three layers: pooled extent store, stable-address buffer pool with write-behind and lifecycle-driven eviction, and paged sealed arrangement batches. Covers the lgalloc → swap → explicit-pager history, integration with differential's pending Chunk abstraction (TimelyDataflow/differential-dataflow#744), performance estimates against the measured baselines, an eager/lazy backing policy keyed to spine level, and an incremental migration story that coexists with node-level swap indefinitely. Because production nodes currently provision the whole disk as swap (no scratch filesystem), deployment starts with a swap-backed extent store: extents are anonymous allocations holding lz4-compressed bytes pushed to the swap device with MADV_PAGEOUT — the strategy benchmarked in CLU-108 / #36948, generalized into the pool's backing layer.

Prototype

  • mz_ore::pool: size-class anonymous VM regions (64 KiB–2 MiB) with stable slot addresses; per-chunk state machine (UnbackedResident / BackedResident / Evicted / Oversize) under a per-chunk mutex with pin guards; swap-backed extents (lz4 at the eviction boundary only — resident bytes are always uncompressed); FIFO + second-chance eviction against a live-tunable resident-bytes budget. The design's two load-bearing properties hold by construction and are test-asserted: chunks freed before eviction never cost a compression or a write (elided_frees), and re-evicting an already-backed chunk does no I/O (evictions_cheap).
  • column_pager integration: a PagedColumn::Pooled path and ColumnPager::pooled, so the columnar merge batcher works unchanged above the seam.
  • PagedRun (mz_timely_util::columnar::paged_run): a standalone Layer 3 prototype — sealed sorted runs as eagerly-evicted pool pages plus resident fence keys, zero-copy seeks via borrowed columnar views over pinned pool memory, prefetching iteration, and a streaming bounded-window merge. Not yet wired to arrangements; it exists to prove the format and the borrow-safety story.
  • Wiring for staging: new dyncfg column_paged_batcher_use_pool (default off) routes the compute batchers through the pool with the existing fraction-derived budget; enable_upsert_paged_spill now follows whichever mechanism (pool or tiered) the last config apply installed, so the storage upsert stash opts in with no storage-side changes. Nine mz_column_pool_* metrics expose the pool's counters, including the elision rate the design's estimates hinge on.
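
To make the pool bullet above concrete, here is a minimal, hypothetical model of the chunk state machine and the FIFO + second-chance eviction loop. It is a sketch under stated assumptions, not the code in mz_ore::pool: the `Pool`, `Chunk`, and `Stats` types, the `touch` and `set_budget` methods, and the `extent_writes` counter are invented for illustration. Only the state names and the `elided_frees` / `evictions_cheap` counter names come from the description. The model tracks byte counts rather than real memory and leaves out the per-chunk mutex, pin guards, the Oversize state, and lz4.

```rust
use std::collections::VecDeque;

/// Residency states named in the PR description (Oversize omitted here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    UnbackedResident, // in memory, never written to an extent
    BackedResident,   // in memory, and an extent copy already exists
    Evicted,          // only the extent copy exists
}

#[derive(Debug)]
struct Chunk {
    state: State,
    bytes: usize,
    referenced: bool, // second-chance bit, set on access
}

#[derive(Debug, Default)]
struct Stats {
    extent_writes: u64,   // hypothetical: evictions that had to compress and write
    evictions_cheap: u64, // evictions of already-backed chunks (no I/O)
    elided_frees: u64,    // chunks freed before ever being written
}

/// Hypothetical toy pool: accounting only, no real allocations.
struct Pool {
    budget: usize,
    resident: usize,
    chunks: Vec<Option<Chunk>>,
    fifo: VecDeque<usize>,
    stats: Stats,
}

impl Pool {
    fn new(budget: usize) -> Self {
        Pool { budget, resident: 0, chunks: Vec::new(), fifo: VecDeque::new(), stats: Stats::default() }
    }

    /// New chunks start resident and unbacked; nothing is written yet.
    fn alloc(&mut self, bytes: usize) -> usize {
        let id = self.chunks.len();
        self.chunks.push(Some(Chunk { state: State::UnbackedResident, bytes, referenced: false }));
        self.fifo.push_back(id);
        self.resident += bytes;
        self.enforce_budget();
        id
    }

    /// Access a chunk, faulting it back in if it was evicted.
    fn touch(&mut self, id: usize) {
        if let Some(c) = self.chunks[id].as_mut() {
            if c.state == State::Evicted {
                c.state = State::BackedResident; // the extent copy stays valid
                self.resident += c.bytes;
                self.fifo.push_back(id);
            }
            c.referenced = true;
        }
        self.enforce_budget();
    }

    /// Freeing a never-evicted chunk costs no compression and no write.
    fn free(&mut self, id: usize) {
        if let Some(c) = self.chunks[id].take() {
            match c.state {
                State::UnbackedResident => {
                    self.stats.elided_frees += 1;
                    self.resident -= c.bytes;
                }
                State::BackedResident => self.resident -= c.bytes,
                State::Evicted => {}
            }
        }
    }

    /// The budget can change at runtime; shrinking it evicts immediately.
    fn set_budget(&mut self, budget: usize) {
        self.budget = budget;
        self.enforce_budget();
    }

    fn enforce_budget(&mut self) {
        while self.resident > self.budget {
            let Some(id) = self.fifo.pop_front() else { break };
            // Freed chunks leave stale queue entries behind; skip them.
            let Some(c) = self.chunks[id].as_mut() else { continue };
            if c.referenced {
                c.referenced = false; // second chance: requeue once
                self.fifo.push_back(id);
                continue;
            }
            match c.state {
                State::UnbackedResident => self.stats.extent_writes += 1,
                State::BackedResident => self.stats.evictions_cheap += 1,
                State::Evicted => unreachable!("evicted chunks are never queued"),
            }
            c.state = State::Evicted;
            self.resident -= c.bytes;
        }
    }
}
```

In this model a chunk pays for a write only the first time it is evicted. A chunk that is faulted back in returns as BackedResident, so a later eviction is pure bookkeeping, and a chunk freed while still UnbackedResident never reaches the extent store at all.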

Testing

Unit tests throughout: 24 for the pool (round trips, evict/fault integrity, a slot-poisoning test proving fault-in reads the extent rather than stale MADV_DONTNEED'd memory, budget enforcement, second chance, dead-data elision, stable addresses, multithreaded smoke), batcher-level round trips through the pool with stats assertions, fault-count-exact seek tests and reference-checked bounded-window merges for PagedRun, and a mechanism-switch routing test for the config seam. The new flag is registered with mzcompose's system-parameter list and parallel-workload.

Status

Draft for discussion alongside the design doc (also up separately as the dov/buffer-managed-state-design branch). The prototype intentionally takes the design's simplest sound options — synchronous on-worker I/O, per-chunk mutexes instead of epochs, owned rehydration — all marked as such in the doc's open questions.

Generated with Claude Code

DAlperin added 25 commits June 11, 2026 21:41
Fold the June 2026 staging measurements into the buffer-managed-state
design doc: bounded accumulation at the budget floor, die-young elision
observed via spill cancellations, off-worker eviction as the de facto
executor answer for the swap-backed store, exact-size extents,
size-class coverage, the phase-scoped boundedness finding (seal drain
fixed, arrange_core materialization now an open question), and the
working-set accounting caveats.

Add a forward-looking section mapping the object-store literature
(cloud-native tiering, AnyBlob request economics, far-memory interface
results, log-structured GC) onto the extent seam, including the
persist-convergence question and the EBS-swap intermediate.

Chains shorter than MIN_PAGED_CHAIN_LEN (4 chunks) no longer route their
entries through the pager: the rebalancing cascade consumes short chains
almost immediately, so paging their chunks scheduled work the next merge
cancelled — measured under hydration load as the spill queue pinned at
its cap with cancellation rates of 100-400/s. Singleton pushes and
below-threshold merge outputs stay resident; chunks reach the pager once
they land in a chain long enough to sit out a few rebalance rounds, and
the seal's extract path pages keep/ship buffers as before.

Resident overhead is bounded by the chain-stack shape (the youngest
chain is under half its predecessor, so sub-threshold chains hold fewer
than MIN_PAGED_CHAIN_LEN chunks between them): single-digit MiB per
batcher, paid per worker per consumer. The disabled pager is safe on the
rehydration side because ColumnPager::take is variant-driven: pooled and
paged inputs rehydrate through their own handles regardless of which
pager performs the take.
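
The routing rule in this commit message can be sketched as a single threshold check. This is a hypothetical illustration rather than the batcher's code: `Placement` and `placement_for_chain` are invented names, and only the constant's name and its value of 4 chunks come from the message above.

```rust
/// Threshold from the commit message: chains shorter than this stay resident.
const MIN_PAGED_CHAIN_LEN: usize = 4;

/// Hypothetical label for where a chain's chunks live.
#[derive(Debug, PartialEq, Eq)]
enum Placement {
    Resident, // short chain: the next rebalance will likely consume it
    Paged,    // long enough to sit out a few rebalance rounds
}

/// Decide placement from the chain length, measured in chunks.
fn placement_for_chain(chain_len: usize) -> Placement {
    if chain_len < MIN_PAGED_CHAIN_LEN {
        Placement::Resident
    } else {
        Placement::Paged
    }
}
```

Under this rule, singleton pushes and three-chunk merge outputs stay resident, while a chain of four or more chunks is handed to the pager.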
…ll_worker_count

The prior name was registered in LaunchDarkly with the wrong data type,
and LD does not allow changing a parameter's type after creation; a
fresh name lets the flag be recreated correctly. Default and semantics
unchanged.
@DAlperin force-pushed the dov/swap-pool-prototype branch from 8a4675e to e9cf871 on June 12, 2026 02:01