perf: borrowed in-place over-window scan for dfast + row + btlazy2#422
Conversation
The borrowed (no-copy) one-shot scan was gated to in-window inputs for Dfast; over-window inputs fell back to the owned path, which copies the whole input into the history mirror. On large over-window streams that memmove dominates: ~75% of L1-dcache misses and a 2.16x cycle gap vs C (C matches in place on src). Add a per-position window_low = abs_ip - advertised_window candidate bound to the borrowed dfast scan (owned eviction's history_abs_start already provides this; borrowed-in- window saturates to 0, so both stay byte-identical) and drop the input_len <= window_size gate. Over-window dfast now matches in place, mirroring C's continuous-index + windowLow one-shot behaviour and avoiding the input->mirror copy.
Extend the borrowed (no-copy) one-shot scan to the Row backend (greedy L5, lazy L9-12), mirroring the Dfast over-window fix: over-window inputs were copied into the owned history mirror (the input->mirror memmove that dominates L1 misses + the cycle gap vs C). Add borrowed_input/borrowed_block state + set_borrowed_window/stage_borrowed_block/start_matching_borrowed/ skip_matching_borrowed to RowMatchGenerator; set_borrowed_window zeroes history_abs_start so positions stay absolute input offsets. The candidate lower bound becomes window_low = history_abs_start.max(abs_pos - max_window_size): owned picks history_abs_start (byte-identical), borrowed (history_abs_start = 0) picks the window cap so an over-window in-place scan never emits an unresolvable offset. live_history()/get_last_space()/ current_block_range()/trailing-literals all read the borrowed input in place. borrowed_eligible gates on the resolved backend (Row), not the strategy tag, so HashChain/BT (no borrowed scan yet) stay owned. Driver wires set_borrowed_window/block + start/skip routing for Row. borrowed_oneshot_matches_owned_and_roundtrips now covers L5/L9 over-window byte-identical to owned.
|
Warning Review limit reached
More reviews will be available in 1 hour, 53 minutes, and 40 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughThe PR extends the borrowed (no-copy) one-shot compression path from Simple/Dfast to Row and HashChain/BinaryTree backends. It adds ChangesBorrowed no-copy scan for all backends
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Extend the borrowed (no-copy) over-window scan to the HashChain backend's btlazy2 binary-tree parser (L13-15). Over-window inputs were copied into the owned history mirror; now matched in place via the per-position window_low cap. Key fix: the BT-tree body macros (bt_insert_step_no_rebase_body, bt_insert_and_collect_matches_body) and emit_optimal_plan read the live region via `live_history()` (borrowed-aware, reborrow-then-raw-ptr so the slice holds no borrow while the tree mutates) instead of `&history[history_start..]` directly — the direct read returned the EMPTY borrowed mirror, so the tree found no matches and the encoder over-split (periodic_stream_not_oversplit: l15 184 -> 318994). MatchTable gains the borrowed-window scaffolding (borrowed_input/block, set_borrowed_window zeroing history_abs_start, stage, current_block_range, borrowed-aware live_history/get_last_space) + window_low on the HC chain candidate checks. The OPTIMAL parsers (BtOpt/BtUltra/BtUltra2, L16-22) stay owned: their cost-based DP is sensitive to candidate quality and the borrowed continuous-index scan yields ratio-worse candidates (diverged from owned + fell outside the ffi ratio bound). borrowed_supported is the single gate. borrowed_oneshot_matches_owned_and_roundtrips covers L15 over-window byte-identical to owned; periodic_stream_not_oversplit + 839 tests pass.
|
@coderabbitai review |
✅ Action performedReview finished.
|
The borrowed over-window path matches in place on the caller's input, so the pre-split fingerprint is the first touch of each 128 KiB region: a cache-cold, sampling-strided read with interleaved random writes into the events table (latency-bound, ~3x an ERMS streaming read of the same bytes). The owned path never pays this because its history-mirror copy already warmed the bytes. Restore that warmth without the copy's write half via one bandwidth-bound sequential pass per pre-split window, gated to exactly the conditions under which the splitter reads the block.
|
@coderabbitai full review |
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/encoding/dfast/mod.rs (1)
922-926:⚠️ Potential issue | 🔴 CriticalBorrowed mode missing advertised window bounds in tail probe.
probe_tail_ip0_onlyuseshistory_abs_start(which is0in borrowed mode perscan_sourceline 1192) for candidate validity checks, but doesn't apply thewlowbounds that the fast loop body computes. For over-window borrowed scans, this could emit matches withoffset > advertised_window:
- Line 924:
cand_pos >= history_abs_start→ always true whenhistory_abs_start = 0- Line 964: same issue for short-hash probe
The fast loop (lines 1924–1932) correctly computes
wlow = abs_ip.saturating_sub(advertised_window)for borrowed mode, but this function skips that check entirely. Since this function is called at the end of every block (whenip1exceeds the readable region), an over-window input could emit an unresolvable offset from this path.The fix is to apply the same
wlowbounds logic used in the fast loop:fn probe_tail_ip0_only( &self, current_abs_start: usize, current_len: usize, ip0: usize, literals_start: usize, + borrowed: bool, ) -> Option<MatchCandidate> { ... let abs_ip0 = current_abs_start + ip0; + let advertised_window = self.max_window_size; + let wlow = if borrowed { + abs_ip0.saturating_sub(advertised_window) + } else { + history_abs_start + }; ... // Long-hash probe if idxl0 != DFAST_EMPTY_SLOT { let cand_pos = position_base + (idxl0 as usize) - 1; - if cand_pos >= history_abs_start && cand_pos < abs_ip0 { + if cand_pos >= wlow && cand_pos < abs_ip0 { ... // Short-hash probe if idxs0 != DFAST_EMPTY_SLOT { let cand_pos_s = position_base + (idxs0 as usize) - 1; - if cand_pos_s >= history_abs_start && cand_pos_s < abs_ip0 { + if cand_pos_s >= wlow && cand_pos_s < abs_ip0 {Then update the call site at line 1813 to pass
$borrowed:if let Some(committed) = - $self.probe_tail_ip0_only($current_abs_start, $current_len, ip0, literals_start) + $self.probe_tail_ip0_only($current_abs_start, $current_len, ip0, literals_start, $borrowed)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/dfast/mod.rs` around lines 922 - 926, The probe_tail_ip0_only function fails to enforce advertised window bounds in borrowed mode scans, allowing matches with offset > advertised_window to be emitted. In borrowed mode, history_abs_start is 0, making the check at line 924 (cand_pos >= history_abs_start) always true and unable to filter out-of-window candidates. The same issue exists at line 964 for short-hash probes. Fix this by computing wlow bounds using abs_ip.saturating_sub(advertised_window) and applying it to validate candidates at both probe locations, matching the logic used in the fast loop body. Additionally, update the call site at line 1813 to pass the $borrowed parameter to enable this bounds checking within the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/levels/fastest.rs`:
- Around line 289-298: The debug_assert checking borrowed_supported() is
currently positioned only within the compressed branch, allowing unsupported
borrowed dispatch to proceed unchecked for RLE/raw-fast block paths. Move the
debug_assert macro call that validates state.matcher.borrowed_supported() from
its current location in the compressed branch to the function entry point,
before the branch selection logic. This ensures that debug builds catch every
invalid caller regardless of whether the code takes the compressed, RLE, or
raw-fast path.
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1504-1520: The borrowed_supported() predicate in the match
statement correctly rejects BtOpt, BtUltra, and BtUltra2 for HashChain, but
internal callers at line 1609 can still stage borrowed blocks for these
unsupported configurations because the assertion there does not properly
validate the predicate. Add search-awareness to the borrowed block staging logic
by ensuring that all callsites where borrowed blocks are staged (at lines
1609-1617 and 2942-2954) properly gate their operations with a check against
borrowed_supported(), preventing BtOpt/BtUltra/BtUltra2 from ever staging
borrowed blocks regardless of the BackendTag.
In `@zstd/src/encoding/match_table/storage.rs`:
- Around line 2069-2078: The unsafe slice creation using from_raw_parts trusts
that current_len matches the actual length returned by get_last_space() without
validation, which could lead to undefined behavior if a mismatch occurs. Add a
debug_assert statement before the unsafe block to verify that current_len equals
the length of the space obtained from get_last_space(), ensuring that misuse is
caught in debug builds without incurring runtime cost in release builds.
---
Outside diff comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 922-926: The probe_tail_ip0_only function fails to enforce
advertised window bounds in borrowed mode scans, allowing matches with offset >
advertised_window to be emitted. In borrowed mode, history_abs_start is 0,
making the check at line 924 (cand_pos >= history_abs_start) always true and
unable to filter out-of-window candidates. The same issue exists at line 964 for
short-hash probes. Fix this by computing wlow bounds using
abs_ip.saturating_sub(advertised_window) and applying it to validate candidates
at both probe locations, matching the logic used in the fast loop body.
Additionally, update the call site at line 1813 to pass the $borrowed parameter
to enable this bounds checking within the function.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: c1b00dec-5708-46b3-b901-1586fcaab437
📒 Files selected for processing (8)
zstd/src/encoding/dfast/mod.rszstd/src/encoding/frame_compressor.rszstd/src/encoding/hc/mod.rszstd/src/encoding/levels/fastest.rszstd/src/encoding/match_generator.rszstd/src/encoding/match_table/storage.rszstd/src/encoding/row/mod.rszstd/src/tests/roundtrip_integrity.rs
dfast probe_tail_ip0_only (runs once per block at the iend boundary) validated candidates only against history_abs_start, which is 0 in borrowed mode, so an over-window borrowed scan could emit a match with offset > advertised_window from the tail path (an unresolvable offset). Apply the same wlow = abs_ip - advertised_window bound the fast loop body already uses, threaded via a borrowed flag. Also hardens the borrowed staging invariants: - borrowed_supported() is search-aware: the HashChain backend admits the lazy CHAIN parser and btlazy2 only, never the optimal BT parsers. - set_borrowed_block asserts borrowed_supported() (was a backend-only check); the borrowed BinaryTree dispatch drops the unreachable optimal arms (optimal always takes the owned path via borrowed_eligible). - compress_block_encoded_borrowed checks the invariant at function entry so RLE / raw-fast / compressed paths are all covered. - emit_optimal_plan debug_asserts current_len <= get_last_space().len() before the unchecked from_raw_parts slice. 839 tests pass; optimal levels stay byte-identical (owned path untouched).
|
| Filename | Overview |
|---|---|
| zstd/src/encoding/dfast/mod.rs | Adds borrowed: bool to probe_tail_ip0_only and introduces per-position wlow/wlow0/wlow1 window-low bounds; correctly uses saturating_sub(advertised_window) for borrowed mode and falls back to history_abs_start for owned. |
| zstd/src/encoding/frame_compressor.rs | Simplifies borrowed_eligible to delegate to borrowed_supported(); adds warm_presplit_window correctly gated inside the borrowed-only block loop. |
| zstd/src/encoding/hc/mod.rs | Strengthens candidate bounds with history_abs_start.max(abs_pos.saturating_sub(max_window_size)); the window-low expression is position-invariant but recomputed per candidate iteration. |
| zstd/src/encoding/match_generator.rs | Adds borrowed_supported(), extends borrowed dispatch to Row and HashChain; clear_borrowed_window uses a _ => {} catch-all that would silently skip clearing for future BackendTag variants. |
| zstd/src/encoding/match_table/storage.rs | Adds borrowed_input/borrowed_block fields and the full borrowed-window API with correct ownership, safety docs, and reset-clearing. |
| zstd/src/encoding/row/mod.rs | Mirrors the MatchTable borrowed-window API on RowMatchGenerator; updates greedy/lazy parse macros to use current_block_range() and live_history(); adds window_low cap and borrowed entry points. |
| zstd/src/encoding/levels/fastest.rs | Replaces the hard assert against L16-22 with debug_assert!(borrowed_supported()) to allow btlazy2 and Row through the borrowed path. |
| zstd/src/tests/roundtrip_integrity.rs | Extends borrowed_oneshot_matches_owned_and_roundtrips with L5, L9, and L15 over-window borrowed cases covering all three new backends. |
Reviews (2): Last reviewed commit: "refactor(encode): drop redundant borrowe..." | Re-trigger Greptile
The borrowed BinaryTree arm in start_matching re-staged the block on hc_matcher().table, but set_borrowed_block already staged the same range there for the HashChain backend (and borrowed_pending, which gates this arm, is set only by set_borrowed_block). The re-stage was a no-op overwrite with an identical range; remove it and note where the stage actually happens.
✅ Action performedFull review finished. |
Summary
Extend the borrowed (no-copy) one-shot scan to over-window inputs across all in-place-capable backends: Dfast, Row, and btlazy2 (the binary-tree backend, L13-15). Previously over-window inputs fell back to the owned path, which copies the whole input into the history mirror (
__memmove_avx_unaligned_erms); on large over-window streams that copy dominates L1 misses and the cycle gap vs C.Each backend applies a per-position
window_low = abs_ip - advertised_windowcandidate cap so an over-window in-place scan never emits an unresolvable offset, mirroring upstream zstd's continuous-index + windowLow one-shot behaviour. The cap is byte-identical for owned and borrowed-in-window (it collapses to the existing eviction floor / saturates to 0), so only the new over-window borrowed path changes behaviour.btlazy2 (binary-tree) borrowed
The BT tree is done in-place by upstream zstd via a
windowLowstop in the walk loop (ZSTD_insertBt1zstd_opt.c:489,ZSTD_insertBtAndGetAllMatcheszstd_opt.c:724), not via eviction — our BT walk/insert already carry the same stop, so the tree depth stays bounded regardless of eviction. The BT byte-source (bt_insert_*+emit_optimal_plan) was routed throughlive_history()(reborrow-then-raw-ptr) so the borrowed mirror is read correctly; btlazy2 borrowed is then byte-identical to owned.Pre-split cache-locality fix
The borrowed path matches in place on the caller's input, so the pre-split fingerprint (
optimal_block_size→split_block_by_chunks) becomes the first touch of each 128 KiB region: a cache-cold, sampling-strided read with interleaved random writes into the events table. That latency-bound pattern costs ~3× an ERMS streaming read of the same bytes (measured: pre-split memset/histogram 1.1% owned → 15.3% borrowed, the source of an early +45% regression on btlazy2). The owned path never pays this because its history-mirror copy already warmed the bytes. A single bandwidth-bound sequential warm pass per pre-split window restores that warmth without the copy's write half, gated to exactly the conditions under which the splitter reads the block (pre-split level, full 128 KiB remaining,savings >= 3).Results (i9-9900K,
perf stat -e cycles, over-window fixtures)The win scales inversely with search cost (cheap-search backends gain most; the memmove is a smaller fraction of heavier parsers). For btlazy2 the BT search dominates, so the throughput edge is small, but borrowed also removes the 2× window history-mirror allocation.
The OPTIMAL parsers (btopt/btultra/btultra2, L16-22) stay on the owned path: their cost-based DP is sensitive to candidate quality and the borrowed continuous-index scan produced ratio-worse candidates that fell outside the ffi bound. Tracked separately.
Correctness
borrowed_oneshot_matches_owned_and_roundtripscovers L3 / L5 / L9 / L15 over-window (3 MiB) cases byte-identical to the owned (evicting) baseline + roundtrip.periodic_stream_not_oversplitguards the btlazy2 borrowed byte-source against the BT over-split failure mode.--features dict_builder) + cross_validation green on x86_64 (avx2) and aarch64.Testing
cargo nextest run -p structured-zstd --features dict_builder— 839 passed.Summary by CodeRabbit
New Features
Performance
Bug Fixes