Part of #482. Depends on #486 (compaction admission). Opt-in mode.
Problem
Normal compaction needs free space for the whole output before any input is freed (worst case: Σ inputs, the combined size of all inputs). On small-disk, embedded, or blockchain deployments, "just provision more disk" is not an option, and there may be no headroom for a full compaction even though the data would shrink once compacted. We need a way to compact within a small window, at the cost of speed, without ever risking data loss.
Why opt-in
The default compaction stays simple: a separate output file, atomic commit, no journal, no extra fsyncs. This mode trades write throughput and complexity for a low transient-space peak, so it only makes sense when the disk is genuinely tight. It engages only when admission (#486) reports a normal compaction does not fit, and only when the operator has enabled it.
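The engagement rule above can be sketched as a small decision function. This is an illustration only: `CompactionMode`, `choose_mode`, and both byte parameters are hypothetical names, and the `Deferred` outcome (what happens when a normal compaction does not fit and the operator has not opted in) is an assumption, since the issue does not specify it.

```rust
/// Which path a compaction run takes (hypothetical names, not the project's API).
#[derive(Debug, PartialEq, Eq)]
pub enum CompactionMode {
    /// Separate output file, atomic commit, no journal.
    Normal,
    /// Slice-by-slice compaction with journal and hole punching.
    Tight,
    /// Assumption: nothing runs when it does not fit and tight mode is off.
    Deferred,
}

/// `required_bytes` stands in for the transient space that admission (#486)
/// would estimate for a normal compaction; `free_bytes` is the space available.
pub fn choose_mode(required_bytes: u64, free_bytes: u64, tight_enabled: bool) -> CompactionMode {
    if required_bytes <= free_bytes {
        // A normal compaction fits, so tight mode never engages.
        CompactionMode::Normal
    } else if tight_enabled {
        // Does not fit, and the operator has opted in.
        CompactionMode::Tight
    } else {
        CompactionMode::Deferred
    }
}
```

Note that the opt-in flag alone never selects tight mode: a run that fits always takes the normal path.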
Solution
Keep the output in a separate file and never overwrite inputs. In-place rewriting is unsafe: a multi-input merge has no single file to rewrite, the layout shifts on re-pack, readers may be using the inputs concurrently, and there is no WAL. Instead, reclaim input space incrementally and make the process crash-safe via resume:
Per key-range slice:
Merge the slice from the inputs into the output file; fsync the output (durable up to key K).
Record progress (consumed_key = K, output_offset = O) durably in a compaction journal (reuse the incremental-manifest edit-log machinery + its torn-tail recovery).
Only then punch a hole (fallocate(PUNCH_HOLE)) over the input prefix fully covered up to K, reclaiming physical blocks.
Invariant: an input region is freed only after its data is durable in the output and the progress is journaled. A crash at any point is recoverable — recovery reads the journal and resumes from K; nothing punched is ever lost because it is already in the durable output prefix.
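The ordering of the slice steps and the resume rule can be modelled in memory. The sketch below is a toy under stated assumptions, not the proposed implementation: `Sim`, `Progress`, `Step`, and `run` are hypothetical names, one input stands in for the multi-input merge, a record index stands in for the key K, and "punching" just sets a slot to `None`. On Linux the real call would be `fallocate` with `FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE`.

```rust
/// Progress record as it would be written to the compaction journal.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Progress {
    pub consumed: usize,      // stands in for consumed_key = K
    pub output_offset: usize, // stands in for output_offset = O
}

/// The four points inside a slice at which a crash can be injected.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Step { Merge, Fsync, Journal, Punch }

/// Toy state: one input, one output, one journal slot.
pub struct Sim {
    pub input: Vec<Option<u32>>, // None = block reclaimed by a hole punch
    pub output: Vec<u32>,        // written, possibly not yet durable
    pub durable_len: usize,      // output prefix that survived an fsync
    pub journal: Option<Progress>,
}

impl Sim {
    pub fn new(records: &[u32]) -> Self {
        Sim {
            input: records.iter().copied().map(Some).collect(),
            output: Vec::new(),
            durable_len: 0,
            journal: None,
        }
    }

    /// A crash loses unsynced output; the journal and punched holes persist.
    fn crash(&mut self) -> bool {
        self.output.truncate(self.durable_len);
        false
    }

    /// Runs (or resumes) the compaction in slices of `slice` records.
    /// `crash_at = Some((n, step))` crashes right after `step` of the n-th
    /// slice of this call. Returns true when the run completed.
    pub fn run(&mut self, slice: usize, crash_at: Option<(usize, Step)>) -> bool {
        assert!(slice > 0);
        // Resume: the journal decides where to continue. Output beyond the
        // journaled offset is discarded and re-merged.
        let start = self.journal.unwrap_or(Progress { consumed: 0, output_offset: 0 });
        self.output.truncate(start.output_offset);
        self.durable_len = start.output_offset;
        // Re-punch the journaled prefix (idempotent) in case the crash hit
        // between journaling and punching.
        for block in &mut self.input[..start.consumed] {
            *block = None;
        }
        let mut consumed = start.consumed;
        let mut slice_no = 0usize;
        while consumed < self.input.len() {
            let end = (consumed + slice).min(self.input.len());
            let crash_step = crash_at.filter(|c| c.0 == slice_no).map(|c| c.1);
            // 1. Merge the slice into the output.
            for i in consumed..end {
                let rec = self.input[i].expect("unjournaled input must never be punched");
                self.output.push(rec);
            }
            if crash_step == Some(Step::Merge) { return self.crash(); }
            // 2. fsync the output.
            self.durable_len = self.output.len();
            if crash_step == Some(Step::Fsync) { return self.crash(); }
            // 3. Journal the progress durably.
            self.journal = Some(Progress { consumed: end, output_offset: self.durable_len });
            if crash_step == Some(Step::Journal) { return self.crash(); }
            // 4. Only now reclaim the covered input prefix.
            for block in &mut self.input[..end] {
                *block = None;
            }
            if crash_step == Some(Step::Punch) { return self.crash(); }
            consumed = end;
            slice_no += 1;
        }
        true
    }
}
```

Calling `run` with a crash point and then again with `None` mimics one cell of a kill-at-each-step matrix: the second call should leave `output` equal to the original records. The `expect` in the merge step is the invariant itself; it would fire if a region were ever punched before its progress was journaled.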
Additional constraints:
Reader / snapshot coordination: only punch input regions no longer referenced by any live MVCC snapshot (a reader holding an old snapshot must still see the input). Gate hole-punching on the version/snapshot lifecycle.
Throttle: drive the slice loop through the existing RateLimiter (Config::compaction_rate_limit) so reads degrade but never stop during a slow reclaim.
On successful commit, the journal is GC'd; a failed run leaves a resumable journal, not corruption.
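One possible way to gate hole punching on the snapshot lifecycle is a per-input pin count with a queue of deferred punches. This is a sketch under assumptions: `PunchGate` and its methods are hypothetical, inputs are identified by a plain `u32`, and the issue does not say whether the real gate would be per file, per region, or tied to version objects.

```rust
use std::collections::HashMap;

/// Hypothetical gate between the slice loop and the hole-punch call.
pub struct PunchGate {
    /// Input file id -> number of live snapshots that still reference it.
    pins: HashMap<u32, usize>,
    /// (input id, punch-up-to offset) requests waiting for readers to leave.
    deferred: Vec<(u32, u64)>,
}

impl PunchGate {
    pub fn new() -> Self {
        PunchGate { pins: HashMap::new(), deferred: Vec::new() }
    }

    /// A snapshot that references `input` becomes live.
    pub fn pin(&mut self, input: u32) {
        *self.pins.entry(input).or_insert(0) += 1;
    }

    /// A snapshot is dropped. Returns the deferred punches that are now safe.
    pub fn unpin(&mut self, input: u32) -> Vec<(u32, u64)> {
        if let Some(n) = self.pins.get_mut(&input) {
            *n -= 1;
            if *n == 0 {
                self.pins.remove(&input);
            }
        }
        if self.pins.contains_key(&input) {
            return Vec::new(); // still referenced by another snapshot
        }
        let (ready, waiting): (Vec<_>, Vec<_>) =
            self.deferred.drain(..).partition(|(id, _)| *id == input);
        self.deferred = waiting;
        ready
    }

    /// Called after progress is journaled. Returns the punch if it may run
    /// now; otherwise queues it until the last snapshot on `input` is gone.
    pub fn request_punch(&mut self, input: u32, upto: u64) -> Option<(u32, u64)> {
        if self.pins.contains_key(&input) {
            self.deferred.push((input, upto));
            None
        } else {
            Some((input, upto))
        }
    }
}
```

Deferring rather than blocking keeps the slice loop moving, but it also means space is not reclaimed while a long-lived snapshot holds the input, so the peak-space bound would only hold once such snapshots are released.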
Acceptance criteria
A tree with too little free space for a normal merge can still compact in tight mode and ends up smaller, with peak extra space bounded by the slice + journal window (assert peak via a MemFs byte meter).
Kill-at-each-step crash matrix: after recovery the compaction resumes and produces the same final state as an uninterrupted run; no punched data is lost.
A live MVCC snapshot over an input is never served a punched (zero) region.
Reads continue (throttled, not blocked) throughout a tight compaction.
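The peak-space criterion can be illustrated with a small meter that records the high-water mark of bytes in use. `ByteMeter`, `peak_normal`, and `peak_tight` are hypothetical stand-ins for the MemFs byte meter mentioned above, and the model ignores journal bytes and assumes every punch is granted immediately.

```rust
/// Tracks bytes in use and the highest value ever reached.
pub struct ByteMeter {
    used: u64,
    peak: u64,
}

impl ByteMeter {
    pub fn new(initial: u64) -> Self {
        ByteMeter { used: initial, peak: initial }
    }
    pub fn alloc(&mut self, n: u64) {
        self.used += n;
        self.peak = self.peak.max(self.used);
    }
    pub fn free(&mut self, n: u64) {
        self.used -= n;
    }
    pub fn peak(&self) -> u64 {
        self.peak
    }
}

/// Normal compaction: the whole output exists before any input is freed.
pub fn peak_normal(inputs: &[u64], output: u64) -> u64 {
    let total: u64 = inputs.iter().sum();
    let mut meter = ByteMeter::new(total);
    meter.alloc(output);
    meter.free(total);
    meter.peak()
}

/// Tight mode: per slice, write the slice's output, then punch its input.
/// Each element is (input_bytes, output_bytes) for one slice.
pub fn peak_tight(slices: &[(u64, u64)]) -> u64 {
    let total: u64 = slices.iter().map(|s| s.0).sum();
    let mut meter = ByteMeter::new(total);
    for &(input_bytes, output_bytes) in slices {
        meter.alloc(output_bytes);
        meter.free(input_bytes);
    }
    meter.peak()
}
```

With 1000 bytes of input that compact to 700, `peak_normal(&[400, 600], 700)` reaches 1700 bytes, while ten slices of `(100, 70)` peak at 1070: the extra space is bounded by one slice of output rather than by the whole output.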
Filesystem support: fallocate(PUNCH_HOLE) (a new Fs capability flag + method); skip the mode on backends / filesystems that lack it. SSTs are written NoCoW (feat(fs): Fs trait capability framework - FS-aware integrity + CoW + reflink optimizations #354), so punch-hole reclaims extents on Btrfs.
[…] TightCompactionAvailable = false).
Estimate
5d+ (compaction journal + resume recovery + snapshot-coordinated hole-punching + crash matrix).