Tight-space resumable incremental-reclaim compaction (opt-in) #487

@polaz

Part of #482. Depends on #486 (compaction admission). Opt-in mode.

Problem

A normal compaction needs free space for its entire output before any input is freed; in the worst case that is the sum of the input sizes (Σ inputs). On small-disk / embedded / blockchain deployments, "just provision more disk" is not an option, and there may be no headroom for a full compaction even though the data, once compacted, would shrink. We need a way to compact within a small transient-space window, at the cost of speed, without ever risking data loss.

Why opt-in

The default compaction stays simple: a separate output file, atomic commit, no journal, no extra fsyncs. This mode trades write throughput and complexity for a low transient-space peak, so it only makes sense when the disk is genuinely tight. It engages only when admission (#486) reports a normal compaction does not fit, and only when the operator has enabled it.
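
A minimal sketch of that engagement rule, assuming admission boils down to comparing free space against the worst-case output size. The names `Admission`, `CompactionMode`, and `choose_mode` are hypothetical, and the `Deferred` outcome is an assumption: the issue does not say what happens when a compaction does not fit and tight mode is off.

```rust
/// Hypothetical outcome of the mode decision. `Deferred` is an assumed
/// placeholder for "no compaction can run right now".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CompactionMode {
    Normal,
    Tight,
    Deferred,
}

/// Hypothetical admission inputs; the real #486 check may consider more.
struct Admission {
    free_bytes: u64,
    input_bytes: u64, // worst-case output size = sum of the input sizes
}

/// Tight mode engages only when a normal compaction does not fit, the
/// operator has enabled it, and the backend reports it as available.
fn choose_mode(adm: &Admission, tight_enabled: bool, tight_available: bool) -> CompactionMode {
    if adm.input_bytes <= adm.free_bytes {
        // The default path is taken whenever it fits, regardless of the opt-in.
        CompactionMode::Normal
    } else if tight_enabled && tight_available {
        CompactionMode::Tight
    } else {
        CompactionMode::Deferred
    }
}
```

Checking the "fits" case first keeps the default path independent of the opt-in flag, which is what "byte-for-byte unchanged" asks for.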

Solution

Keep the output in a separate file and never overwrite the inputs. In-place rewriting is unsafe: a multi-input merge has no single file to rewrite, the layout shifts on re-pack, concurrent readers may still be reading the inputs, and there is no WAL. Instead, reclaim input space incrementally and make the process crash-safe via resume:

Per key-range slice:

  1. Merge the slice from the inputs into the output file; fsync the output (durable up to key K).
  2. Record progress (consumed_key = K, output_offset = O) durably in a compaction journal (reuse the incremental-manifest edit-log machinery + its torn-tail recovery).
  3. Only then punch a hole (fallocate(PUNCH_HOLE)) over the input prefix fully covered up to K, reclaiming physical blocks.

Invariant: an input region is freed only after its data is durable in the output and the progress is journaled. A crash at any point is recoverable — recovery reads the journal and resumes from K; nothing punched is ever lost because it is already in the durable output prefix.
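
The ordering of the three steps can be sketched with an in-memory model. Everything here is a hypothetical stand-in, not the project's API: `TightCompaction` models the input as a map (removing a key models a punched hole), the output as a vector, and the journal as a list of `Progress` records. Keys are assumed to be plain integers and a slice is "every unconsumed key up to `upto`".

```rust
use std::collections::BTreeMap;

/// Progress record from step 2 (field names follow the issue text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Progress {
    consumed_key: u64,
    output_offset: usize,
}

/// Hypothetical in-memory stand-in for the input, output file and journal.
#[derive(Debug, Default)]
struct TightCompaction {
    input: BTreeMap<u64, u64>, // key -> value; removal models a punched hole
    output: Vec<(u64, u64)>,   // entries written to the separate output file
    synced_len: usize,         // output prefix made durable by the last fsync
    journal: Vec<Progress>,    // durable progress records
}

impl TightCompaction {
    fn new(entries: &[(u64, u64)]) -> Self {
        TightCompaction {
            input: entries.iter().copied().collect(),
            ..Default::default()
        }
    }

    /// Processes one slice covering every unconsumed key <= `upto`.
    fn compact_slice(&mut self, upto: u64) {
        let start = self.journal.last().map_or(0, |p| p.consumed_key + 1);
        if start > upto {
            return; // slice already consumed
        }
        // Step 1: merge the slice into the output, then "fsync" it.
        let slice: Vec<(u64, u64)> =
            self.input.range(start..=upto).map(|(k, v)| (*k, *v)).collect();
        self.output.extend(slice);
        self.synced_len = self.output.len();

        // Step 2: journal progress only after the output is durable.
        self.journal.push(Progress {
            consumed_key: upto,
            output_offset: self.synced_len,
        });

        // Step 3: only now reclaim the covered input prefix.
        self.punch_up_to(upto);
    }

    /// Enforces the invariant: never free input that is not both durable
    /// in the output and covered by a journaled progress record.
    fn punch_up_to(&mut self, key: u64) {
        let last = self.journal.last().expect("punch requires a journal record");
        assert!(key <= last.consumed_key, "punching past journaled progress");
        assert!(last.output_offset <= self.synced_len, "journal ahead of fsync");
        self.input.retain(|k, _| *k > key);
    }
}
```

Putting the invariant check inside `punch_up_to` rather than in the caller means a reordered slice loop fails loudly instead of silently freeing unprotected data.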

Additional constraints:

  • Reader / snapshot coordination: only punch input regions no longer referenced by any live MVCC snapshot (a reader holding an old snapshot must still see the input). Gate hole-punching on the version/snapshot lifecycle.
  • Throttle: drive the slice loop through the existing RateLimiter (Config::compaction_rate_limit) so reads degrade but never stop during a slow reclaim.
  • Backend support: requires fallocate(PUNCH_HOLE), exposed as a new Fs capability flag + method; skip the mode on backends / filesystems that lack it. SSTs are written NoCoW (#354, Fs trait capability framework), so punch-hole reclaims extents on Btrfs.
  • On successful commit, the journal is GC'd; a failed run leaves a resumable journal, not corruption.
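
The snapshot and backend gates above can be sketched as one decision function. This assumes a deliberately coarse model in which a live snapshot pins the whole input file through a reference count; the real version/snapshot lifecycle may be finer-grained. `FsCaps`, `PunchDecision`, and `decide_punch` are hypothetical names. The rounding reflects documented Linux behaviour: punching a hole deallocates only whole filesystem blocks and zeroes partial ones.

```rust
/// Hypothetical capability flags for a filesystem backend.
#[derive(Debug, Clone, Copy)]
struct FsCaps {
    punch_hole: bool,
    block_size: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum PunchDecision {
    /// Backend lacks punch-hole: tight mode is unavailable.
    Unsupported,
    /// A live snapshot still references the input: defer reclaim.
    PinnedBySnapshot,
    /// No whole block has become reclaimable since the last punch.
    NothingToReclaim,
    /// Safe to punch this byte range of the input file.
    Punch { offset: u64, len: u64 },
}

/// `punched_upto` is the block-aligned end of the previous punch;
/// `journaled_upto` is the input byte offset covered by the journal.
fn decide_punch(
    caps: FsCaps,
    live_snapshot_refs: usize,
    punched_upto: u64,
    journaled_upto: u64,
) -> PunchDecision {
    if !caps.punch_hole || caps.block_size == 0 {
        return PunchDecision::Unsupported;
    }
    if live_snapshot_refs > 0 {
        return PunchDecision::PinnedBySnapshot;
    }
    // Round the end down so only fully covered blocks are released.
    let end = journaled_upto - journaled_upto % caps.block_size;
    if end <= punched_upto {
        return PunchDecision::NothingToReclaim;
    }
    PunchDecision::Punch {
        offset: punched_upto,
        len: end - punched_upto,
    }
}
```

Under this model a deferred punch costs nothing in correctness: the range stays journaled, so the next call after the snapshot is released reclaims everything accumulated in the meantime.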

Acceptance criteria

  • A tree with too little free space for a normal merge can still compact in tight mode and ends up smaller, with peak extra space bounded by the slice + journal window (assert peak via a MemFs byte meter).
  • Kill-at-each-step crash matrix: after recovery the compaction resumes and produces the same final state as an uninterrupted run; no punched data is lost.
  • A live MVCC snapshot over an input is never served a punched (zero) region.
  • Reads continue (throttled, not blocked) throughout a tight compaction.
  • Mode is opt-in and only engages when compaction space admission (#486) reports that a normal compaction does not fit; the default compaction path is byte-for-byte unchanged.
  • Backends without punch-hole fall back cleanly: the mode is unavailable, and storage introspection (#483) reports TightCompactionAvailable = false.
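
The kill-at-each-step criterion could be exercised along these lines. This is a sketch on a hypothetical in-memory `Disk`, not the project's MemFs: each of the three steps is treated as one atomic durable operation, a "kill" is a step budget running out, and recovery is assumed to truncate the output to the journaled offset and redo the last punch. Slice bounds are assumed to be ascending.

```rust
use std::collections::BTreeMap;

/// Hypothetical durable state: input, output, and the latest journal record.
#[derive(Clone, Debug, PartialEq)]
struct Disk {
    input: BTreeMap<u64, u64>,
    output: Vec<(u64, u64)>,
    journal: Option<(u64, usize)>, // (consumed_key, output_offset)
}

/// Recovers from the journal, then runs slices until done or until the
/// step budget is exhausted. Returns false when "killed".
fn run(disk: &mut Disk, slices: &[u64], mut budget: usize) -> bool {
    // Recovery: resume after K, drop unjournaled output, redo the punch
    // in case the crash landed between journaling and punching.
    let (mut start, offset) = match disk.journal {
        Some((k, o)) => (k + 1, o),
        None => (0, 0),
    };
    disk.output.truncate(offset);
    disk.input.retain(|k, _| *k >= start);

    for &upto in slices {
        if upto < start {
            continue; // slice already consumed before the crash
        }
        // Step 1: merge the slice into the output and fsync it.
        if budget == 0 { return false; }
        budget -= 1;
        let slice: Vec<(u64, u64)> =
            disk.input.range(start..=upto).map(|(k, v)| (*k, *v)).collect();
        disk.output.extend(slice);
        // Step 2: journal the progress.
        if budget == 0 { return false; }
        budget -= 1;
        disk.journal = Some((upto, disk.output.len()));
        // Step 3: punch the covered input prefix.
        if budget == 0 { return false; }
        budget -= 1;
        disk.input.retain(|k, _| *k > upto);
        start = upto + 1;
    }
    true
}

/// Kills the run before every step in turn, recovers, and compares the
/// final state with an uninterrupted run.
fn crash_matrix(entries: &[(u64, u64)], slices: &[u64]) -> bool {
    let fresh = Disk {
        input: entries.iter().copied().collect(),
        output: Vec::new(),
        journal: None,
    };
    let mut expected = fresh.clone();
    assert!(run(&mut expected, slices, usize::MAX));
    (0..=slices.len() * 3).all(|kill_at| {
        let mut disk = fresh.clone();
        if !run(&mut disk, slices, kill_at) {
            run(&mut disk, slices, usize::MAX); // restart after the crash
        }
        disk == expected
    })
}
```

A call such as `crash_matrix(&[(1, 10), (2, 20), (3, 30)], &[1, 3])` covers a single kill per run; a fuller matrix would also kill during recovery and model torn (partially written) journal records.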

Estimate

5d+ (compaction journal + resume recovery + snapshot-coordinated hole-punching + crash matrix).

Metadata

Labels: compaction, crash-safety, enhancement
