perf(hc): lazy-skip + separate dictMatchState beat C on small-window dict compress#427
Conversation
Split the optimal-parser DP price into a contiguous `node_prices: Box<[u32]>` companion to `opt_nodes_scratch`, maintained in lockstep with every `HcOptimalNode.price` write. The two hot price-set loop reads now read the contiguous array. This is the prerequisite for SIMD-vectorising the inner price-set loop: the 28-byte AoS node stride would otherwise force a strided gather for the per-length price compare; a contiguous u32 array allows a single vector load. Byte-identical (840 tests incl level22 parity + cross_validation). The redundant struct `price` field is retained here and removed in a follow-up once all remaining readers route through the array.
Vectorise the btultra/btultra2 (OPT_LEVEL >= 2) price-set inner loop: each match_len writes a distinct node, so iteration is order-independent and the improvement test reduces to next_cost < node_prices[next] (reset already set beyond-frontier cells to u32::MAX, subsuming next > last_pos). Process the match-length range in 8-lane chunks: scalar next_cost (same add_prices / match_price_from_parts chain, byte-identical), AVX2 unsigned compare against the contiguous node_prices SoA array (min_epu32-derived since AVX2 has no cmplt_epu32), scalar write of the improving lanes. btopt (OPT_LEVEL == 0) keeps the reverse-iterate-then-break scalar form. 840 tests byte-identical.
The standalone vectorised helper penalised the short match-length ranges (trip < 8, the majority by count) with fn-call + non-inlined price-lookup overhead and zero vectorisation benefit, regressing L18 +3.9%. Keep short ranges (< 16) on the inline-scalar simplified-compare path; only divert long ranges to the vectorised helper where the chunked compare can amortise.
Replace the standalone runtime-dispatched price-set helper with per-tier functions threaded through the build_optimal_plan macro ($priceset param), matching the existing $collect kernel pattern: the avx2 wrapper passes priceset_range_nonabort_avx2 (#[target_feature(avx2)] + #[inline] -> folds into the wrapper's avx2 umbrella: no call ABI, no per-chunk runtime feature detection, cached_match_length_price inlines), other tiers pass the inline scalar variant. Eliminates the standalone-fn overhead that regressed the gated version. byte-identical 840 tests (scalar path; avx2 path validated on x86 CI / bench host).
…e-set The price-set inner loop is compute-bound, not write-bound: an improve-ratio probe shows only 1-8% of high-trip comparisons write, so the dominant cost is the per-lane next_cost computation (ml-price cache retrieval), which the compare-only vectorisation left scalar. A hit-ratio probe shows the ml-price cache is 91-98% warm. Since cached_match_length_price indexes the cache by match_len directly, a run of consecutive match lengths maps to contiguous [price,generation] cells: vector-load 8 cells, deinterleave + check all gens == stamp, and fold the cached prices with one broadcast constant (c_base = base_cost + ll0_price + match_price_from_parts(off,0)); cold chunks fall back to the scalar fill. Byte-identical (avx2 path validated on x86).
The cross-lane permutevar8x32 deinterleave was a latency-bound serial chain that lost to pipelined scalar cache loads on high-chunk-count fixtures (low-entropy-1m L18 regressed ~27%). Replace with in-128-lane shuffle_epi32 + unpack*_epi64 and a single cross-lane permute4x64 for the ordered prices; gens stay unordered (only the all-equal stamp test matters). One cross-lane op instead of six.
Replace the macro-based vector price-set body (macro hygiene couldn't see the wrapper's parameters, so only the never-compiled-here avx2 arm "worked") with a generic `priceset_range_vec::<W>` taking the per-tier deinterleave + mask as zero-sized impl Fn closures; #[inline(always)] + monomorphisation folds them into each wrapper's target_feature umbrella (intrinsics inline, no call ABI, no runtime detection). Wire the NEON tier: vld2q_u32 deinterleaves the 4 [price,gen] pairs natively (no shuffle chain), vcltq_u32 + lane-weight vaddvq packs the 4-lane improve mask. byte-identical 846 tests on aarch64 (M1 dispatch -> neon path). AVX2 unchanged (now via the generic body).
…s test SSE4.1 4-lane deinterleave (two 128-bit loads + shuffle_epi32 + unpack_epi64) and min_epu32-based improve mask, wired to the sse42 DP wrapper via the shared generic body. Add a direct unit test asserting every compile-available tier's deinterleave + improve-mask helpers (AVX2 + SSE4.1 on x86, NEON on aarch64) match a scalar reference — the dispatch only fires the best tier per host, so this is the only coverage for the non-selected SIMD tiers.
Add the simd128 4-lane deinterleave (u32x4_shuffle selects price/gen lanes natively) + native unsigned u32x4_lt/u32x4_bitmask improve mask, a simd128 DP wrapper, and route wasm32+simd128 to it in the dispatch (was scalar). cfg-gate the now-dead scalar range / wrapper / generic body on targets where another tier is selected. Both wasm clippy variants (simd128 + scalar) clean under -D warnings; the helper correctness test covers the simd128 lane math on a wasm host.
…ned mirror The greedy/lazy levels (5-12) on small inputs take the borrowed (no-copy) one-shot path, where the resolved window is <= 14 and the matchfinder falls back to HashChain (donor ZSTD_resolveRowMatchFinderMode). In that mode the finder reads the borrowed input via live_history(), but insert_position_no_rebase and fill_hash_chain_positions hashed self.history[history_start..] — the owned mirror, which is empty under the no-copy path. The insert therefore skipped or hashed garbage while the finder hashed the real input, so the chain stayed empty, no matches were found, and a trivially-compressible repeated 4 KiB block ballooned to 2549 bytes (~16x the donor's 154). Read live_history() in both insert paths so insert and find hash the same bytes (donor uses one buffer for both). Owned mode is unchanged: live_history() == history[history_start..] there. Regression test: small_repetitive_compresses_on_borrowed_hashchain_band covers levels 5/6/7/10/12.
…lloc The greedy/lazy levels (5-12) on small dict frames fall back to HashChain (donor matchfinder gate) and restore the primed-dictionary snapshot every frame. Only the Fast backend reused its buffers via clone_from; HC/BT took `snapshot.clone()` + `ensure_tables()`, a full per-frame allocate+copy+drop that made the profile allocation-bound (malloc/sbrk/page-fault) and ran 5-7x the C dict compress. Give MatchTable a manual Clone whose clone_from reuses the history/hash/chain/hash3/dms buffers, and restore HC lazy/greedy (non-BT) snapshots through it — donor reuses the CDict tables in place rather than reallocating. BT (uses_bt) snapshots still drop their live tables, so they keep the realloc path.
Replace the bounds-checked 4-byte slice-equality gates in the hash-chain candidate walk (both the speculative tail gate and the head 4-byte gate) with raw unaligned u32 load+compare, matching upstream zstd ZSTD_HcFindBestMatch's MEM_read32 gate. One load per side, no per-candidate slice bounds-check path. Accepted-candidate set is byte-identical (848 tests pass).
The dict-primed HC lazy chain walk was the dominant compress-dict cost (76% self-time in find_best_match on small-10k-random L7). The per- candidate body did far more than upstream ZSTD_HcFindBestMatch's tight loop: an Option-returning absolute-position decode, and a window-floor recompute (.max()/.saturating_sub()) on every step. Track candidates in history-relative index space instead. The loop invariants (window floor, position->index bias) are hoisted once before the walk, so each step is a single add + range check + 4-byte gate, mirroring the donor's matchIndex>=lowLimit / NEXT_IN_CHAIN body. The absolute position is recovered only on the rare accept path. Accepted- candidate set is byte-identical (848 tests pass).
The lazy parser advanced one byte per no-match position; upstream ZSTD_compressBlock_lazy_generic (zstd_lazy.c:1614) advances step = ((ip - anchor) >> kSearchStrength) + 1, jumping faster over runs with no match. On compressible input the literal run stays short so the step is 1 (unchanged); on incompressible / dict-over-random input the run grows and the parser searches one position per step instead of every byte. Instrumented both sides on small-10k-random L7 dict: find_best_match call count was 9136 vs C's 6107 (1.5x too many). With this change ours is 6103, matching C. Output follows upstream (ratio-equivalent, not byte-identical); 848 tests pass.
Lazy-HC dict frames merged the dictionary into the live hash chain, so each input-position walk dragged dict candidates out of the shared buckets. Upstream keeps the dict in a SEPARATE dictMatchState (ZSTD_HcFindBestMatch dms HC4 loop). Mirror it: ATTACH mode (small / unknown inputs) builds a single-link hash-chain dms over the committed dict (prime_dms_hc) and keeps it out of the live chain. The dms walk is FUSED into hash_chain_candidate and shares the SAME steps/max_chain_steps budget as the live walk (donor decrements one nbAttempts across both), so the live + dms search is a single bounded operation, not two full-budget walks. Large known inputs keep the COPY (merge) path. Ratio-equivalent (not byte-identical); 848 tests pass.
The fused dms walk lived in hash_chain_candidate behind a runtime guard, which bloated the no-dict hot path codegen (z000033 L10 no-dict +6%). Mirror upstream's compile-time dictMode template: make find_best_match / pick_lazy_match / hash_chain_candidate generic over const DICT, and split start_matching_lazy into a dispatcher that picks the monomorph from whether a dms is primed. The DICT=false instance carries NO dms code, so the no-dict path is byte-for-byte the pre-dms hot loop; the DICT=true instance dual-probes the dms (still sharing the one step budget). The dms loop itself stays out-of-line (dms_chain_walk). 848 tests pass.
With the DICT const-generic split the dms_chain_walk is dropped from the no-dict monomorph regardless of inlining (compile-time gate), so #[inline(never)] only added a per-call hop on the dict hot path. Switch to #[inline] so the DICT=true instance inlines the dms walk.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughTwo concurrent improvements: (1) the HashChain lazy matcher gains compile-time ChangesDictionary-aware HashChain matching and borrowed-window correctness
SoA node_prices buffer for SIMD-friendly optimal-parser DP
Sequence Diagram(s)sequenceDiagram
participant PickLazyMatch
participant FindBestMatch
participant HashChainCandidate
participant DmsChainWalk
PickLazyMatch->>FindBestMatch: probe abs_pos + 1
FindBestMatch->>HashChainCandidate: run relative-index chain walk
alt DICT enabled with DMS primed
HashChainCandidate->>DmsChainWalk: continue walk with remaining steps
DmsChainWalk-->>HashChainCandidate: updated best candidate
end
HashChainCandidate-->>FindBestMatch: optional match candidate
FindBestMatch-->>PickLazyMatch: best match for lazy lookahead
sequenceDiagram
participant BuildOptimalPlanImpl
participant BuildOptimalPlanBody
participant PriceSetNonAbortISA
participant FinishOptimalPlan
BuildOptimalPlanImpl->>BuildOptimalPlanBody: select ISA priceset function
BuildOptimalPlanBody->>BuildOptimalPlanBody: init nodes and node_prices
loop DP frontier iteration
alt opt_level == 0
BuildOptimalPlanBody->>BuildOptimalPlanBody: reverse-iterate, sync node_prices
else non-abort tiered
BuildOptimalPlanBody->>PriceSetNonAbortISA: scatter update node_prices range
PriceSetNonAbortISA-->>BuildOptimalPlanBody: updated price lanes
end
end
BuildOptimalPlanBody->>FinishOptimalPlan: hand off nodes and node_prices
FinishOptimalPlan-->>BuildOptimalPlanBody: reclaim into opt_node_prices_scratch
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/bt/mod.rs`:
- Around line 48-54: The new persistent `opt_node_prices_scratch` buffer (a
`Box<[u32]>` allocation) has been added to the struct but is not accounted for
in the `estimated_workspace_bytes()` function. Locate the
`estimated_workspace_bytes()` function (starting around line 120) and update it
to include the memory size of the `opt_node_prices_scratch` buffer in its total
calculation. Add an allocation size computation similar to how other workspace
buffers are already being estimated in that function, ensuring the returned
value reflects the actual steady-state memory usage of the matcher.
In `@zstd/src/encoding/match_generator.rs`:
- Around line 2706-2722: The primed snapshot mechanism does not distinguish
between HashChain attach mode and copy mode, even though the code at line 2706
now produces different snapshot shapes for these modes (attach builds dms, copy
invalidates it). Add an hc_attach/backend-dict-mode bit to the reset_shape
tracking and thread this parameter through all PrimedKey constructors to ensure
the snapshot key captures the HC mode. This prevents an HC snapshot built in one
mode from being incorrectly restored under the other mode. Additionally, add a
regression test for HC boundary behavior similar to the existing Fast
attach/copy test to verify the snapshot restoration works correctly for both HC
modes.
- Around line 3881-3898: The warm-cache vector path and the scalar/cold path in
priceset_next_cost compute pricing using different formulas, creating potential
byte-level divergence between SIMD and scalar paths. Extract the "cached
ml_price to next cost" computation into a dedicated helper function that uses
consistent arithmetic (wrapping addition) matching the vector path's
computation. Replace the vector path's direct wrapping_add computation of c_base
and ml_price with a call to this new helper, and refactor the priceset_next_cost
function to also use this helper instead of separately calling
BtMatcher::add_prices and match_price_from_parts. The same fix must also be
applied at the other affected location in the file (lines 4101-4114) to ensure
both scalar paths use identical pricing semantics.
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 322-324: The regression test for the borrowed HC path is weakened
because it recompresses the data via roundtrip_at_level instead of decompressing
the exact frame produced by compress_to_vec. Replace the roundtrip_at_level call
on line 323 with a decompression of the compressed frame created on line 322 to
ensure the test validates the exact borrowed-path frame and maintains regression
protection regardless of wrapper routing changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: a6362e19-7320-4dab-bac9-1586dfb31b59
📒 Files selected for processing (6)
zstd/src/encoding/bt/mod.rszstd/src/encoding/hc/mod.rszstd/src/encoding/match_generator.rszstd/src/encoding/match_table/storage.rszstd/src/encoding/opt/types.rszstd/src/tests/roundtrip_integrity.rs
- estimated_workspace_bytes(): include the retained SoA node-price Box<[u32]> (HC_OPT_NODE_LEN u32s) that was missing, so the steady-state matcher workspace is no longer under-reported - document why the lazy-HC dict attach/copy mode is intentionally NOT folded into PrimedKey: both modes share identical window geometry (max_window_size + dict limit set before the mode branch), so a cross-mode restore stays decodable; keying it would force re-primes across the cutoff. Upstream attach/copy likewise keep the caller window - add hc_dict_snapshot_reuse_roundtrips: end-to-end regression that a frame compressed after a dict prime-snapshot restore still decodes - borrowed-HC regression now decodes the exact frame it asserts the ratio on (was recompressing via roundtrip_at_level) - note the affine equivalence that keeps the scalar and SoA-vector price-set paths byte-identical, and why the c_base hoist is kept
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 10218-10226: The comment block starting around the Dfast/Row
priming description inaccurately states that these backends prime "one way" when
in fact they have their own attach/copy split behaviors similar to HashChain.
Reword the comment to clarify that Dfast and Row are backend-specific modes with
their own attach/copy cutoff logic that is intentionally excluded from the
Fast-only `fast_attach` snapshot key, to avoid unnecessary re-priming when
different backends resolve to the same LevelParams. Keep the explanation that
the `fast_attach` bit must NOT enter non-Simple snapshot keys to prevent keying
mismatches between unhinted captures and hinted resets.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 68db8195-bcd4-4623-b0aa-cbfbc4fa08c7
📒 Files selected for processing (4)
zstd/src/encoding/bt/mod.rszstd/src/encoding/match_generator.rszstd/src/tests/dict_test.rszstd/src/tests/roundtrip_integrity.rs
- HC_ATTACH_DICT_CUTOFF_LOG was 16 KiB but upstream attachDictSizeCutoffs[ZSTD_lazy/lazy2] is 32 KiB; bump to 2^15 so the lazy hash-chain dict attach/copy boundary matches upstream zstd. Also aligns with the snapshot prefer-copy gate (already 32 KiB for lazy), so the attach band and the no-snapshot band now coincide instead of leaving the 16-32 KiB band attached-without-snapshot - correct the constant doc that misstated the upstream value as 16 KiB - reword the over-key test comment: Dfast/Row/HC each have their own attach/copy regime, all intentionally kept out of the Fast-only fast_attach key (was inaccurately "Dfast/Row prime one way")
The binary-tree (btlazy2/btopt/btultra/btultra2) dict path always built a separate DUBT dms (attach), with no copy path — diverging from upstream, which COPIES the dict into the live tree above a per-strategy cutoff (ZSTD_resetCCtx_byCopyingCDict). Mirror the upstream shape: - attach (small/unknown source) builds the DUBT dms, as before - copy (large known source) merges the dict into the live binary tree via the normal skip_matching path and drops the dms - per-strategy cutoffs match attachDictSizeCutoffs: btlazy2/btopt 32 KiB, btultra/btultra2 8 KiB Unify the attach/copy decision for both the lazy-HC and BT arms behind one hc_dict_attach_mode() helper so skip_matching_for_dictionary_priming and prime_with_dictionary stay in lock-step. The snapshot capture now retains the full live tables whenever the dict was copied in (keyed on dms.is_primed(), the exact decoupled signal) instead of assuming every BT snapshot is dms-only. Output stays a valid frame that upstream decodes (cross-validation tests compress here + decompress via FFI); attach vs copy is a memory/speed trade-off, not a wire change.
|
@coderabbitai re-review |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
zstd/src/encoding/match_generator.rs (2)
2548-2551:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRefresh the snapshot comments for non-Simple attach modes.
These comments are now stale: Dfast/Row/HashChain do have backend-specific attach/copy regimes, and lazy-HC attach can prime
dmsviaprime_dms_hc. Keep the current keying behavior, but reword the rationale sofast_attachis described as Fast-only, while BT decoupling is specificallyuses_bt && dms.is_primed().Suggested wording adjustment
- // backends prime the dictionary the same way regardless, so including - // the bit there would over-key identical resolved shapes. When it - // applies it matches the decision `prime_with_dictionary` makes from the - // same `reset_size_log`. + // Dfast/Row/HashChain have their own attach/copy regimes, but this bit + // models only the Simple/Fast backend's distinct dict-table shape. Those + // other backends are keyed by the resolved matcher geometry instead. + // When this bit applies, it matches the decision `prime_with_dictionary` + // makes from the same `reset_size_log`.- // being primed is the exact "decoupled" signal — it is set only by the BT - // attach prime and dropped in copy mode. + // For BT, `uses_bt && dms.is_primed()` is the exact "decoupled" signal. + // Lazy-HC attach can also prime `dms`, but it intentionally keeps the + // full live-table snapshot path.Also applies to: 2918-2924
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/match_generator.rs` around lines 2548 - 2551, Update the stale comments about attach-vs-copy mode in match_generator.rs at lines 2548-2551 and 2918-2924. The comments incorrectly claim that fast attach-vs-copy is only used in the Simple backend, but Dfast/Row/HashChain backends also have backend-specific attach/copy regimes (and lazy-HC can prime dms via prime_dms_hc). Keep the current keying behavior unchanged, but reword the comments to accurately reflect that fast_attach is Fast-backend-only and clarify that BT decoupling is specifically controlled by the condition uses_bt && dms.is_primed(), not by the Simple/non-Simple backend distinction.
4827-4843:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winKeep
node_pricessynchronized during seed initialization.The SoA sidecar is retained scratch and is compared by the price-set paths, but these seed-frontier resets and the forced seed write update only
nodes. Mirror the writes intonode_pricesto preserve the documented lockstep invariant.Proposed fix
for p in 1..min_match_len.min(frontier_buffer_size) { BtMatcher::reset_opt_node(&mut nodes[p]); + node_prices[p] = u32::MAX; // `initial_litlen` is the litlen carried from priorlet forced_state = HcOptimalNode { price: forced_price, off: candidate.offset as u32, mlen: longest_len as u32, litlen: 0, reps: initial_reps, }; if longest_len < frontier_buffer_size && forced_price < nodes[longest_len].price { nodes[longest_len] = forced_state; + node_prices[longest_len] = forced_price; }Also applies to: 4875-4877
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/match_generator.rs` around lines 4827 - 4843, The seed initialization loop that resets nodes and assigns values to nodes[p].litlen is updating the nodes array but not the parallel node_prices sidecar structure, breaking the documented lockstep invariant between these two arrays. After each update to nodes[p] (specifically when setting nodes[p].litlen with seed_litlen), mirror the corresponding write into node_prices[p] to keep both structures synchronized. Apply the same fix pattern at the other affected locations mentioned (lines 4875-4877) where similar node updates occur without corresponding node_prices updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/hc/mod.rs`:
- Around line 449-462: The condition gating the dictionary walk in the DICT
block is checking for table allocation via table.dms.table().is_some_and()
rather than checking the actual validity of the dictionary state. Replace this
condition to check dms.is_primed() instead, since the validity contract for the
DMS is provided by is_primed(), not table presence. An allocated table can
remain after the dictionary becomes invalidated, so checking only table presence
allows stale dictionary chains and indices to be incorrectly reused.
In `@zstd/src/encoding/match_generator.rs`:
- Around line 6521-6522: The lazy skip increment at the location where `step` is
calculated and `pos` is incremented can cause `pos` to jump past the first
non-searchable-but-hashable tail anchor on long literal runs, preventing the
tail loop from inserting it for the next block. After calculating `pos += step`,
clamp `pos` to not exceed the first non-searchable position while keeping the
searched-position skips unchanged. This preserves the tail insertion contract by
ensuring the tail loop still processes the critical anchor point in subsequent
iterations.
In `@zstd/src/encoding/match_table/storage.rs`:
- Around line 808-815: The cache key validation in the condition block (checking
is_primed(), region_len(), mls, and hash_log) is incomplete and can cause HC and
BT table layouts to be confused. Add a layout/mode discriminator field (e.g.,
layout field) to the DmsDictTables struct to distinguish between HC and BT
layouts. Include this layout field in the cache validation check alongside mls
and hash_log to ensure only tables with matching layouts are reused.
Additionally, set tables.layout to DmsDictLayout::Hc in the prime_dms_hc
function and tables.layout to DmsDictLayout::Bt in the prime_dms_bt function to
properly tag each table with its intended layout.
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Line 321: The regression test on the for loop iterating over compression
levels is missing coverage for levels 8, 9, and 11 within the stated 5–12 band.
Modify the array passed to the for loop to include all integers from 5 through
12, replacing the current sparse list that skips those three levels. This
ensures complete regression coverage across the entire compression level range
being tested.
---
Outside diff comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 2548-2551: Update the stale comments about attach-vs-copy mode in
match_generator.rs at lines 2548-2551 and 2918-2924. The comments incorrectly
claim that fast attach-vs-copy is only used in the Simple backend, but
Dfast/Row/HashChain backends also have backend-specific attach/copy regimes (and
lazy-HC can prime dms via prime_dms_hc). Keep the current keying behavior
unchanged, but reword the comments to accurately reflect that fast_attach is
Fast-backend-only and clarify that BT decoupling is specifically controlled by
the condition uses_bt && dms.is_primed(), not by the Simple/non-Simple backend
distinction.
- Around line 4827-4843: The seed initialization loop that resets nodes and
assigns values to nodes[p].litlen is updating the nodes array but not the
parallel node_prices sidecar structure, breaking the documented lockstep
invariant between these two arrays. After each update to nodes[p] (specifically
when setting nodes[p].litlen with seed_litlen), mirror the corresponding write
into node_prices[p] to keep both structures synchronized. Apply the same fix
pattern at the other affected locations mentioned (lines 4875-4877) where
similar node updates occur without corresponding node_prices updates.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: c11e33a1-14aa-49ad-a651-cdb7333b6e48
📒 Files selected for processing (7)
zstd/src/encoding/bt/mod.rszstd/src/encoding/hc/mod.rszstd/src/encoding/match_generator.rszstd/src/encoding/match_table/storage.rszstd/src/encoding/opt/types.rszstd/src/tests/dict_test.rszstd/src/tests/roundtrip_integrity.rs
CodeRabbit re-review follow-ups on the dict match-finder:
- dms cache key now carries a layout discriminator (DmsDictLayout::{Hc,Bt}).
prime_dms_hc and prime_dms_bt populate hash_table/chain_table with
incompatible meanings (HC single-link chain vs BT DUBT, chain_table len
region vs 2*region); a reused compressor switching level across the
HC<->BT boundary with coincident (region, mls, hash_log) could reuse the
wrong layout. Tag each build and require a layout match to reuse.
- optimal-parser seed keeps the SoA node_prices sidecar in lockstep with
nodes[].price: the seed reset loop and the forced-seed write now mirror
into node_prices so the price-set paths compare against the live baseline.
- gate the lazy-HC dms walk on dms.is_primed() (the canonical validity
contract) instead of probing table presence.
- borrowed-HC roundtrip regression covers the full L5..=12 band.
- refresh stale attach/copy comments: fast_attach is Fast-only; BT
decoupling is uses_bt && dms.is_primed(); document why the lazy skip needs
no tail clamp (HC_MIN_MATCH_LEN == insert width, no missed anchor).
Adds dms_cache_rebuilds_across_hc_bt_layout_switch (fails without the
discriminator).
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/encoding/match_generator.rs (1)
6456-6460:⚠️ Potential issue | 🟡 MinorSwitch the dispatch guard to
dms.is_primed().The comment states "dispatcher: pick the dict-aware monomorph when a separate dms is primed," but the code checks
dms.table().is_some(). Useis_primed()for semantic correctness and consistency with the other dictionary dispatcher at line 2925 that already uses this check.Note:
invalidate()fully clears the table and primed flag, so the concern about preserving tables for cache reuse is unfounded.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@zstd/src/encoding/match_generator.rs` around lines 6456 - 6460, The dispatch guard in the start_matching_lazy_impl call uses dms.table().is_some() to determine which monomorph to invoke, but should instead use dms.is_primed() for semantic correctness. Change the condition from checking if the table is some to checking if the dms is primed, which aligns with the inline comment stating "dispatcher: pick the dict-aware monomorph when a separate dms is primed" and matches the pattern already used at line 2925. This ensures consistency across dictionary dispatchers since invalidate() clears both the table and the primed flag together.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 6456-6460: The dispatch guard in the start_matching_lazy_impl call
uses dms.table().is_some() to determine which monomorph to invoke, but should
instead use dms.is_primed() for semantic correctness. Change the condition from
checking if the table is some to checking if the dms is primed, which aligns
with the inline comment stating "dispatcher: pick the dict-aware monomorph when
a separate dms is primed" and matches the pattern already used at line 2925.
This ensures consistency across dictionary dispatchers since invalidate() clears
both the table and the primed flag together.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 78ba165c-5a38-4937-8195-c1c7bb87659b
📒 Files selected for processing (4)
zstd/src/encoding/hc/mod.rszstd/src/encoding/match_generator.rszstd/src/encoding/match_table/storage.rszstd/src/tests/roundtrip_integrity.rs
The lazy start_matching dispatcher picked the dict-aware monomorph on dms.table().is_some(); its own doc says 'when a separate dms is primed'. Switch to dms.is_primed() so it matches the inner dms-walk guard and the optimal dispatcher. Behaviorally equivalent (invalidate() clears both the table and the primed flag together), but the canonical validity contract.
|
@coderabbitai review |
✅ Action performedReview finished.
|
Finish the AoS->SoA split in the optimal parser: remove the `price` field from `HcOptimalNode` and keep each position's price ONLY in the parallel `node_prices: Box<[u32]>`. Previously price was duplicated (in the 28-byte node struct AND the flat SoA array the SIMD price-set vector-loads), which required a lockstep invariant and was the source of the desync bug class (forced-seed sidecar miss). With one home for each price there is nothing to keep in sync, so that bug class is gone by construction. - frontier reads/writes go through node_prices[i] (node struct loses price) - transient node copies that carried price (forced_end_state, last_stretch) now track it via parallel locals (forced_end_price / last_stretch_price); prev_match's price reuses the pre-overwrite node_pos_price - per node update we now store the price once (was twice), and the AoS struct shrinks 28->24 bytes - reset_opt_node only resets litlen; price reset stays in reset_opt_nodes Pure representation change: compressed output is byte-identical (850 tests incl. cross-validation FFI round-trip, clippy, fmt green). Upstream zstd keeps one `opt[].price` and has no such invariant; this matches that single-source property while keeping our SIMD vector price compare.
|
Summary
Closes the small-window dictionary
compress-dictthroughput gap for the lazy hash-chain levels (L6–L12) and ships the optimal-parser SIMD price-set vectorisation.On
small-10k-randomcompress-dict(i9, flatc_fficontrol) the lazy levels now beat C zstd across the board, where they were ~7–9× behind:No-dict
decodecorpus-z000033L10 lazy: no regression.decompress-dictround-trips correctly at every level.What changed
Dictionary lazy hash-chain (the headline):
ZSTD_compressBlock_lazy_genericadvancesstep = ((ip - anchor) >> kSearchStrength) + 1, skipping faster over matchless runs. Instrumenting both sides showed ourfind_best_matchcall count was 9136 vs C's 6107 on the target frame; with skipping it is 6103.dictMatchState: attach mode (small/unknown inputs) builds a single-link hash-chain dms over the committed dict and keeps it out of the live chain; large known inputs keep the copy/merge path. The dms walk is fused into the chain walk sharing one boundednbAttemptsbudget (not two full-budget walks).DICTmonomorphisation —find_best_match/pick_lazy_match/hash_chain_candidateare generic overconst DICT: bool(upstream's compile-timedictModetemplate). The no-dict instance carries zero dms code; the dict instance inlines it. This keeps the no-dict hot path untouched.MEM_read32-styleu32gates instead of bounds-checked slice equality, history-relative index space with hoisted loop invariants.Optimal-parser SIMD price-set: SoA node-price array + per-CPU-tier vectorised price-set compare (AVX2/BMI2, NEON, SSE4.1, wasm simd128) gated to long ranges, macro-expanded per tier.
Testing
cargo nextest run -p structured-zstd --features dict_builder: 848 passedcargo test --doc: 22 passedcargo clippy --all-targets: cleanc_fficontrol arms for every number aboveOutput is ratio-equivalent to upstream (not byte-identical) on the lazy dict path, by design — it follows upstream's match decisions.
Summary by CodeRabbit