Skip to content

perf(hc): lazy-skip + separate dictMatchState beat C on small-window dict compress#427

Merged
polaz merged 23 commits into
mainfrom
perf/hc-dict-lazyskip-dms
Jun 16, 2026
Merged

perf(hc): lazy-skip + separate dictMatchState beat C on small-window dict compress#427
polaz merged 23 commits into
mainfrom
perf/hc-dict-lazyskip-dms

Conversation

@polaz

@polaz polaz commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

Closes the small-window dictionary compress-dict throughput gap for the lazy hash-chain levels (L6–L12) and ships the optimal-parser SIMD price-set vectorisation.

On small-10k-random compress-dict (i9, flat c_ffi control) the lazy levels now beat C zstd across the board, where they were ~7–9× behind:

level before after vs C
L6 67 µs 0.47×
L7 1025 µs 98 µs 0.69×
L8 98 µs 0.68×
L9 98 µs 0.45×
L10 928 µs 156 µs 0.71×
L11 297 µs 0.87×
L12 1390 µs 297 µs 0.75×

No-dict decodecorpus-z000033 L10 lazy: no regression. decompress-dict round-trips correctly at every level.

What changed

Dictionary lazy hash-chain (the headline):

  • Lazy skipping — the lazy parser advanced one byte per no-match position; upstream ZSTD_compressBlock_lazy_generic advances step = ((ip - anchor) >> kSearchStrength) + 1, skipping faster over matchless runs. Instrumenting both sides showed our find_best_match call count was 9136 vs C's 6107 on the target frame; with skipping it is 6103.
  • Separate dictMatchState — dict frames merged the dictionary into the live chain, so every input-position walk dragged dict candidates out of the shared buckets. Mirror upstream's separate dictMatchState: attach mode (small/unknown inputs) builds a single-link hash-chain dms over the committed dict and keeps it out of the live chain; large known inputs keep the copy/merge path. The dms walk is fused into the chain walk sharing one bounded nbAttempts budget (not two full-budget walks).
  • DICT monomorphisationfind_best_match / pick_lazy_match / hash_chain_candidate are generic over const DICT: bool (upstream's compile-time dictMode template). The no-dict instance carries zero dms code; the dict instance inlines it. This keeps the no-dict hot path untouched.
  • Per-candidate codegen: raw MEM_read32-style u32 gates instead of bounds-checked slice equality, history-relative index space with hoisted loop invariants.
  • Ratio fix: hash the borrowed window (not the empty owned mirror) in the HashChain insert paths.

Optimal-parser SIMD price-set: SoA node-price array + per-CPU-tier vectorised price-set compare (AVX2/BMI2, NEON, SSE4.1, wasm simd128) gated to long ranges, macro-expanded per tier.

Testing

  • cargo nextest run -p structured-zstd --features dict_builder: 848 passed
  • cargo test --doc: 22 passed
  • cargo clippy --all-targets: clean
  • i9 paired before/after with flat c_ffi control arms for every number above

Output is ratio-equivalent to upstream (not byte-identical) on the lazy dict path, by design — it follows upstream's match decisions.

Summary by CodeRabbit

  • New Features
    • Added dictionary-aware behavior across HashChain/Binary Tree tiers, including improved attach/copy handling and snapshot reuse correctness.
  • Bug Fixes
    • Fixed borrowed-history hashing to use the correct live bytes.
    • Prevented incorrect cache reuse across HC↔BT dictionary layouts.
  • Performance Improvements
    • Improved optimal-planning DP with dedicated per-node price tracking and faster non-aborting implementations (including SIMD where available).
    • Refined lazy matching/candidate advancement and dictionary-aware hot-path dispatch.
  • Tests
    • Added regressions for small repetitive inputs and dictionary snapshot reuse roundtrips.

polaz added 17 commits June 16, 2026 12:26
Split the optimal-parser DP price into a contiguous `node_prices: Box<[u32]>`
companion to `opt_nodes_scratch`, maintained in lockstep with every
`HcOptimalNode.price` write. The two hot price-set loop reads now read the
contiguous array. This is the prerequisite for SIMD-vectorising the inner
price-set loop: the 28-byte AoS node stride would otherwise force a strided
gather for the per-length price compare; a contiguous u32 array allows a
single vector load.

Byte-identical (840 tests incl level22 parity + cross_validation). The
redundant struct `price` field is retained here and removed in a follow-up
once all remaining readers route through the array.
Vectorise the btultra/btultra2 (OPT_LEVEL >= 2) price-set inner loop: each
match_len writes a distinct node, so iteration is order-independent and the
improvement test reduces to next_cost < node_prices[next] (reset already set
beyond-frontier cells to u32::MAX, subsuming next > last_pos). Process the
match-length range in 8-lane chunks: scalar next_cost (same add_prices /
match_price_from_parts chain, byte-identical), AVX2 unsigned compare against
the contiguous node_prices SoA array (min_epu32-derived since AVX2 has no
cmplt_epu32), scalar write of the improving lanes. btopt (OPT_LEVEL == 0)
keeps the reverse-iterate-then-break scalar form. 840 tests byte-identical.
The standalone vectorised helper penalised the short match-length ranges
(trip < 8, the majority by count) with fn-call + non-inlined price-lookup
overhead and zero vectorisation benefit, regressing L18 +3.9%. Keep short
ranges (< 16) on the inline-scalar simplified-compare path; only divert
long ranges to the vectorised helper where the chunked compare can amortise.
Replace the standalone runtime-dispatched price-set helper with per-tier
functions threaded through the build_optimal_plan macro ($priceset param),
matching the existing $collect kernel pattern: the avx2 wrapper passes
priceset_range_nonabort_avx2 (#[target_feature(avx2)] + #[inline] -> folds
into the wrapper's avx2 umbrella: no call ABI, no per-chunk runtime feature
detection, cached_match_length_price inlines), other tiers pass the inline
scalar variant. Eliminates the standalone-fn overhead that regressed the
gated version. byte-identical 840 tests (scalar path; avx2 path validated
on x86 CI / bench host).
…e-set

The price-set inner loop is compute-bound, not write-bound: an improve-ratio
probe shows only 1-8% of high-trip comparisons write, so the dominant cost is
the per-lane next_cost computation (ml-price cache retrieval), which the
compare-only vectorisation left scalar. A hit-ratio probe shows the ml-price
cache is 91-98% warm. Since cached_match_length_price indexes the cache by
match_len directly, a run of consecutive match lengths maps to contiguous
[price,generation] cells: vector-load 8 cells, deinterleave + check all gens
== stamp, and fold the cached prices with one broadcast constant (c_base =
base_cost + ll0_price + match_price_from_parts(off,0)); cold chunks fall back
to the scalar fill. Byte-identical (avx2 path validated on x86).
The cross-lane permutevar8x32 deinterleave was a latency-bound serial chain
that lost to pipelined scalar cache loads on high-chunk-count fixtures
(low-entropy-1m L18 regressed ~27%). Replace with in-128-lane shuffle_epi32 +
unpack*_epi64 and a single cross-lane permute4x64 for the ordered prices;
gens stay unordered (only the all-equal stamp test matters). One cross-lane op
instead of six.
Replace the macro-based vector price-set body (macro hygiene couldn't see the
wrapper's parameters, so only the never-compiled-here avx2 arm "worked") with
a generic `priceset_range_vec::<W>` taking the per-tier deinterleave + mask as
zero-sized impl Fn closures; #[inline(always)] + monomorphisation folds them
into each wrapper's target_feature umbrella (intrinsics inline, no call ABI,
no runtime detection). Wire the NEON tier: vld2q_u32 deinterleaves the 4
[price,gen] pairs natively (no shuffle chain), vcltq_u32 + lane-weight
vaddvq packs the 4-lane improve mask. byte-identical 846 tests on aarch64
(M1 dispatch -> neon path). AVX2 unchanged (now via the generic body).
…s test

SSE4.1 4-lane deinterleave (two 128-bit loads + shuffle_epi32 + unpack_epi64)
and min_epu32-based improve mask, wired to the sse42 DP wrapper via the shared
generic body. Add a direct unit test asserting every compile-available tier's
deinterleave + improve-mask helpers (AVX2 + SSE4.1 on x86, NEON on aarch64)
match a scalar reference — the dispatch only fires the best tier per host, so
this is the only coverage for the non-selected SIMD tiers.
Add the simd128 4-lane deinterleave (u32x4_shuffle selects price/gen lanes
natively) + native unsigned u32x4_lt/u32x4_bitmask improve mask, a simd128 DP
wrapper, and route wasm32+simd128 to it in the dispatch (was scalar). cfg-gate
the now-dead scalar range / wrapper / generic body on targets where another
tier is selected. Both wasm clippy variants (simd128 + scalar) clean under
-D warnings; the helper correctness test covers the simd128 lane math on a
wasm host.
…ned mirror

The greedy/lazy levels (5-12) on small inputs take the borrowed (no-copy)
one-shot path, where the resolved window is <= 14 and the matchfinder falls
back to HashChain (donor ZSTD_resolveRowMatchFinderMode). In that mode the
finder reads the borrowed input via live_history(), but insert_position_no_rebase
and fill_hash_chain_positions hashed self.history[history_start..] — the owned
mirror, which is empty under the no-copy path. The insert therefore skipped or
hashed garbage while the finder hashed the real input, so the chain stayed empty,
no matches were found, and a trivially-compressible repeated 4 KiB block ballooned
to 2549 bytes (~16x the donor's 154). Read live_history() in both insert paths so
insert and find hash the same bytes (donor uses one buffer for both). Owned mode
is unchanged: live_history() == history[history_start..] there.

Regression test: small_repetitive_compresses_on_borrowed_hashchain_band covers
levels 5/6/7/10/12.
…lloc

The greedy/lazy levels (5-12) on small dict frames fall back to HashChain
(donor matchfinder gate) and restore the primed-dictionary snapshot every
frame. Only the Fast backend reused its buffers via clone_from; HC/BT took
`snapshot.clone()` + `ensure_tables()`, a full per-frame allocate+copy+drop
that made the profile allocation-bound (malloc/sbrk/page-fault) and ran
5-7x the C dict compress. Give MatchTable a manual Clone whose clone_from
reuses the history/hash/chain/hash3/dms buffers, and restore HC lazy/greedy
(non-BT) snapshots through it — donor reuses the CDict tables in place rather
than reallocating. BT (uses_bt) snapshots still drop their live tables, so
they keep the realloc path.
Replace the bounds-checked 4-byte slice-equality gates in the
hash-chain candidate walk (both the speculative tail gate and the
head 4-byte gate) with raw unaligned u32 load+compare, matching
upstream zstd ZSTD_HcFindBestMatch's MEM_read32 gate. One load per
side, no per-candidate slice bounds-check path. Accepted-candidate set
is byte-identical (848 tests pass).
The dict-primed HC lazy chain walk was the dominant compress-dict cost
(76% self-time in find_best_match on small-10k-random L7). The per-
candidate body did far more than upstream ZSTD_HcFindBestMatch's tight
loop: an Option-returning absolute-position decode, and a window-floor
recompute (.max()/.saturating_sub()) on every step.

Track candidates in history-relative index space instead. The loop
invariants (window floor, position->index bias) are hoisted once before
the walk, so each step is a single add + range check + 4-byte gate,
mirroring the donor's matchIndex>=lowLimit / NEXT_IN_CHAIN body. The
absolute position is recovered only on the rare accept path. Accepted-
candidate set is byte-identical (848 tests pass).
The lazy parser advanced one byte per no-match position; upstream
ZSTD_compressBlock_lazy_generic (zstd_lazy.c:1614) advances
step = ((ip - anchor) >> kSearchStrength) + 1, jumping faster over runs
with no match. On compressible input the literal run stays short so the
step is 1 (unchanged); on incompressible / dict-over-random input the
run grows and the parser searches one position per step instead of every
byte.

Instrumented both sides on small-10k-random L7 dict: find_best_match
call count was 9136 vs C's 6107 (1.5x too many). With this change ours
is 6103, matching C. Output follows upstream (ratio-equivalent, not
byte-identical); 848 tests pass.
Lazy-HC dict frames merged the dictionary into the live hash chain, so
each input-position walk dragged dict candidates out of the shared
buckets. Upstream keeps the dict in a SEPARATE dictMatchState
(ZSTD_HcFindBestMatch dms HC4 loop). Mirror it: ATTACH mode (small /
unknown inputs) builds a single-link hash-chain dms over the committed
dict (prime_dms_hc) and keeps it out of the live chain. The dms walk is
FUSED into hash_chain_candidate and shares the SAME steps/max_chain_steps
budget as the live walk (donor decrements one nbAttempts across both), so
the live + dms search is a single bounded operation, not two full-budget
walks. Large known inputs keep the COPY (merge) path. Ratio-equivalent
(not byte-identical); 848 tests pass.
The fused dms walk lived in hash_chain_candidate behind a runtime guard,
which bloated the no-dict hot path codegen (z000033 L10 no-dict +6%).
Mirror upstream's compile-time dictMode template: make find_best_match /
pick_lazy_match / hash_chain_candidate generic over const DICT, and split
start_matching_lazy into a dispatcher that picks the monomorph from
whether a dms is primed. The DICT=false instance carries NO dms code, so
the no-dict path is byte-for-byte the pre-dms hot loop; the DICT=true
instance dual-probes the dms (still sharing the one step budget). The dms
loop itself stays out-of-line (dms_chain_walk). 848 tests pass.
With the DICT const-generic split the dms_chain_walk is dropped from the
no-dict monomorph regardless of inlining (compile-time gate), so
#[inline(never)] only added a per-call hop on the dict hot path. Switch to
#[inline] so the DICT=true instance inlines the dms walk.
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: e435e57a-d74f-4f49-8d0b-c7ef62a4a53a

📥 Commits

Reviewing files that changed from the base of the PR and between 6bfe828 and 4f685a0.

📒 Files selected for processing (3)
  • zstd/src/encoding/bt/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/opt/types.rs

📝 Walkthrough

Walkthrough

Two concurrent improvements: (1) the HashChain lazy matcher gains compile-time DICT const-generic specialization, an inlined relative-index chain walk with unsafe u32 loads, a new prime_dms_hc dictionary routine, borrowed-window hashing correctness fixes in MatchTable, and lazy skip-step alignment with upstream; (2) the BT optimal parser gains a parallel SoA node_prices: Box<[u32]> buffer threaded through HcOptimalPlanBuffers, BtMatcher, and the build_optimal_plan_impl_body macro, with per-ISA (AVX2/SSE/NEON/wasm/scalar) priceset_range_nonabort_* dispatch functions.

Changes

Dictionary-aware HashChain matching and borrowed-window correctness

Layer / File(s) Summary
MatchTable manual Clone and DmsDictLayout
zstd/src/encoding/match_table/storage.rs
Removes #[derive(Clone)] and implements manual Clone with clone_from() reusing heap allocations. Adds DmsDictLayout enum and layout field to DmsDictTables to distinguish HC vs BT dictionary structures and prevent cross-layout cache reuse.
Borrowed-window hashing fixes in MatchTable
zstd/src/encoding/match_table/storage.rs
Fixes insert_position_no_rebase and fill_hash_chain_positions to derive hash source from live_history() (borrowed window when active) instead of the owned history mirror, with updated unsafe pointer arithmetic for the dense fill loop.
prime_dms_hc: dictionary lazy hash-chain priming
zstd/src/encoding/match_table/storage.rs
Adds prime_dms_hc which builds and caches a single-link per-position dictionary hash-chain keyed by (region, mls, hash_log). Updates prime_dms_bt to require layout == Bt for cache reuse. Includes regression test validating rebuilds across HC↔BT layout switches.
HC hot path: const-generic DICT, inlined walk, unsafe u32 loads
zstd/src/encoding/hc/mod.rs
Generalizes hash_chain_candidate, find_best_match, and pick_lazy_match over const DICT: bool. Inlines the chain walk in relative index space with a raw base pointer, replaces slice equality with unsafe little-endian u32 loads for both speculative gates, and adds dms_chain_walk for the dict branch. Updates all unit test call sites.
match_generator: HC dict attach/copy modes and priming
zstd/src/encoding/match_generator.rs
Adds HC_ATTACH_DICT_CUTOFF_LOG and BT/ultra tier cutoff constants. Introduces hc_dict_attach_mode() to compute attach-vs-copy decision. Extends skip_matching_for_dictionary_priming to use attach mode and select BT/HC skip paths. Refactors prime_with_dictionary to recompute attach mode, set dict limit, and prime per-variant (prime_dms_bt/prime_dms_hc). Refactors restore_primed_dictionary for non-BT snapshots to clone tables in-place. Updates capture_primed_dictionary to detect dict-decoupled snapshots via dms.is_primed().
HC lazy dispatch and run-length skip step
zstd/src/encoding/match_generator.rs
Refactors start_matching_lazy into a dict-aware dispatcher selecting start_matching_lazy_impl::<DICT> based on attached-dict presence. Changes lazy no-match advancement to use run-length-derived skip ((pos - literals_start) >> 8) + 1 instead of unit steps.
Regression tests: dict snapshot reuse and borrowed HashChain
zstd/src/tests/dict_test.rs, zstd/src/tests/roundtrip_integrity.rs
Adds hc_dict_snapshot_reuse_roundtrips to verify dictionary snapshot reuse across independent frames. Adds small_repetitive_compresses_on_borrowed_hashchain_band to validate repeat detection on borrowed HashChain with <25% compression ratio at levels 5–12.

SoA node_prices buffer for SIMD-friendly optimal-parser DP

Layer / File(s) Summary
node_prices field in HcOptimalPlanBuffers and BtMatcher
zstd/src/encoding/opt/types.rs, zstd/src/encoding/bt/mod.rs
Adds node_prices: Box<[u32]> to HcOptimalPlanBuffers and opt_node_prices_scratch to BtMatcher. Initializes buffers in new(), includes in heap_size() and estimated_workspace_bytes(), and updates finish_optimal_plan() and reset_opt_nodes() to maintain the parallel price array with u32::MAX sentinels.
SIMD priceset_range_nonabort_ implementations*
zstd/src/encoding/match_generator.rs
Adds per-ISA priceset implementations for AVX2 (8-lane next_cost < node_price masks), SSE4.1, NEON, and wasm simd128, including deinterleave helpers and correctness/mask validation tests.
build_optimal_plan_impl_body: node_prices SoA + priceset dispatch
zstd/src/encoding/match_generator.rs
Refactors macro to accept $priceset function parameter. Introduces node_prices scratch with HC_OPT_NODE_LEN capacity and u32::MAX init. Threads node_prices through frontier init, resets, forced-seed updates, and all DP transitions. Bifurcates opt-level 0 abort-on-worse-match reverse iteration from non-abort tiered priceset updates. Passes node_prices to finish_optimal_plan.
Cross-ISA dispatch: wasm simd128 route and priceset wiring
zstd/src/encoding/match_generator.rs
Routes wasm+simd128 to new build_optimal_plan_impl_simd128 body. Wires priceset_range_nonabort_{neon,sse,avx2} into per-ISA macro calls with updated cfg-attributes. Adds wasm simd128 DP body plumbing. Updates fast_attach snapshot-keying comment to clarify HashChain attach/copy modeling.

Sequence Diagram(s)

sequenceDiagram
  participant PickLazyMatch
  participant FindBestMatch
  participant HashChainCandidate
  participant DmsChainWalk

  PickLazyMatch->>FindBestMatch: probe abs_pos + 1
  FindBestMatch->>HashChainCandidate: run relative-index chain walk
  alt DICT enabled with DMS primed
    HashChainCandidate->>DmsChainWalk: continue walk with remaining steps
    DmsChainWalk-->>HashChainCandidate: updated best candidate
  end
  HashChainCandidate-->>FindBestMatch: optional match candidate
  FindBestMatch-->>PickLazyMatch: best match for lazy lookahead
Loading
sequenceDiagram
  participant BuildOptimalPlanImpl
  participant BuildOptimalPlanBody
  participant PriceSetNonAbortISA
  participant FinishOptimalPlan

  BuildOptimalPlanImpl->>BuildOptimalPlanBody: select ISA priceset function
  BuildOptimalPlanBody->>BuildOptimalPlanBody: init nodes and node_prices
  loop DP frontier iteration
    alt opt_level == 0
      BuildOptimalPlanBody->>BuildOptimalPlanBody: reverse-iterate, sync node_prices
    else non-abort tiered
      BuildOptimalPlanBody->>PriceSetNonAbortISA: scatter update node_prices range
      PriceSetNonAbortISA-->>BuildOptimalPlanBody: updated price lanes
    end
  end
  BuildOptimalPlanBody->>FinishOptimalPlan: hand off nodes and node_prices
  FinishOptimalPlan-->>BuildOptimalPlanBody: reclaim into opt_node_prices_scratch
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • structured-world/structured-zstd#185: Both PRs refactor the hash_chain_candidate hot path in zstd/src/encoding/hc/mod.rs, including inlined chain walks and the speculative tail/4-byte gating behavior.
  • structured-world/structured-zstd#413: Both PRs modify BT optimal-plan construction in match_generator.rs, specifically build_optimal_plan_impl_body macro hot-loop behavior and memory-resident DP operations.
  • structured-world/structured-zstd#395: Both PRs change BT dictionary match (DMS) priming/cache behavior in zstd/src/encoding/match_table/storage.rs—main PR adds layout-aware cache separation (HC vs BT) and a new prime_dms_hc, while retrieved PR refactors prime_dms_bt to reuse/invalidate DMS tables based on shape.

Poem

🐇 Hoppity-hop through the hash-chain lane,
No more empty mirrors bringing lookup pain!
Node prices stand in a tidy SoA row,
While AVX2 lanes make the DP glow.
Dict-aware or plain, DICT true or false —
This bunny compresses with nary a pause! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes a core performance optimization: lazy-skip and separate dictMatchState improvements enabling dictionary compression to match or beat C zstd, which aligns with the PR's primary objective.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/hc-dict-lazyskip-dms

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 80.82822% with 125 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/encoding/match_generator.rs 77.42% 86 Missing ⚠️
zstd/src/encoding/match_table/storage.rs 76.64% 32 Missing ⚠️
zstd/src/encoding/hc/mod.rs 95.83% 5 Missing ⚠️
zstd/src/encoding/bt/mod.rs 85.71% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/bt/mod.rs`:
- Around line 48-54: The new persistent `opt_node_prices_scratch` buffer (a
`Box<[u32]>` allocation) has been added to the struct but is not accounted for
in the `estimated_workspace_bytes()` function. Locate the
`estimated_workspace_bytes()` function (starting around line 120) and update it
to include the memory size of the `opt_node_prices_scratch` buffer in its total
calculation. Add an allocation size computation similar to how other workspace
buffers are already being estimated in that function, ensuring the returned
value reflects the actual steady-state memory usage of the matcher.

In `@zstd/src/encoding/match_generator.rs`:
- Around line 2706-2722: The primed snapshot mechanism does not distinguish
between HashChain attach mode and copy mode, even though the code at line 2706
now produces different snapshot shapes for these modes (attach builds dms, copy
invalidates it). Add an hc_attach/backend-dict-mode bit to the reset_shape
tracking and thread this parameter through all PrimedKey constructors to ensure
the snapshot key captures the HC mode. This prevents an HC snapshot built in one
mode from being incorrectly restored under the other mode. Additionally, add a
regression test for HC boundary behavior similar to the existing Fast
attach/copy test to verify the snapshot restoration works correctly for both HC
modes.
- Around line 3881-3898: The warm-cache vector path and the scalar/cold path in
priceset_next_cost compute pricing using different formulas, creating potential
byte-level divergence between SIMD and scalar paths. Extract the "cached
ml_price to next cost" computation into a dedicated helper function that uses
consistent arithmetic (wrapping addition) matching the vector path's
computation. Replace the vector path's direct wrapping_add computation of c_base
and ml_price with a call to this new helper, and refactor the priceset_next_cost
function to also use this helper instead of separately calling
BtMatcher::add_prices and match_price_from_parts. The same fix must also be
applied at the other affected location in the file (lines 4101-4114) to ensure
both scalar paths use identical pricing semantics.

In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 322-324: The regression test for the borrowed HC path is weakened
because it recompresses the data via roundtrip_at_level instead of decompressing
the exact frame produced by compress_to_vec. Replace the roundtrip_at_level call
on line 323 with a decompression of the compressed frame created on line 322 to
ensure the test validates the exact borrowed-path frame and maintains regression
protection regardless of wrapper routing changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: a6362e19-7320-4dab-bac9-1586dfb31b59

📥 Commits

Reviewing files that changed from the base of the PR and between cd1b640 and ba8dab6.

📒 Files selected for processing (6)
  • zstd/src/encoding/bt/mod.rs
  • zstd/src/encoding/hc/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/match_table/storage.rs
  • zstd/src/encoding/opt/types.rs
  • zstd/src/tests/roundtrip_integrity.rs

Comment thread zstd/src/encoding/bt/mod.rs
Comment thread zstd/src/encoding/match_generator.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs
Comment thread zstd/src/tests/roundtrip_integrity.rs Outdated
- estimated_workspace_bytes(): include the retained SoA node-price
  Box<[u32]> (HC_OPT_NODE_LEN u32s) that was missing, so the steady-state
  matcher workspace is no longer under-reported
- document why the lazy-HC dict attach/copy mode is intentionally NOT
  folded into PrimedKey: both modes share identical window geometry
  (max_window_size + dict limit set before the mode branch), so a
  cross-mode restore stays decodable; keying it would force re-primes
  across the cutoff. Upstream attach/copy likewise keep the caller window
- add hc_dict_snapshot_reuse_roundtrips: end-to-end regression that a
  frame compressed after a dict prime-snapshot restore still decodes
- borrowed-HC regression now decodes the exact frame it asserts the ratio
  on (was recompressing via roundtrip_at_level)
- note the affine equivalence that keeps the scalar and SoA-vector
  price-set paths byte-identical, and why the c_base hoist is kept

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 10218-10226: The comment block starting around the Dfast/Row
priming description inaccurately states that these backends prime "one way" when
in fact they have their own attach/copy split behaviors similar to HashChain.
Reword the comment to clarify that Dfast and Row are backend-specific modes with
their own attach/copy cutoff logic that is intentionally excluded from the
Fast-only `fast_attach` snapshot key, to avoid unnecessary re-priming when
different backends resolve to the same LevelParams. Keep the explanation that
the `fast_attach` bit must NOT enter non-Simple snapshot keys to prevent keying
mismatches between unhinted captures and hinted resets.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 68db8195-bcd4-4623-b0aa-cbfbc4fa08c7

📥 Commits

Reviewing files that changed from the base of the PR and between ba8dab6 and d6f69f8.

📒 Files selected for processing (4)
  • zstd/src/encoding/bt/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/tests/dict_test.rs
  • zstd/src/tests/roundtrip_integrity.rs

Comment thread zstd/src/encoding/match_generator.rs Outdated
polaz added 2 commits June 16, 2026 17:48
- HC_ATTACH_DICT_CUTOFF_LOG was 16 KiB but upstream
  attachDictSizeCutoffs[ZSTD_lazy/lazy2] is 32 KiB; bump to 2^15 so the
  lazy hash-chain dict attach/copy boundary matches upstream zstd. Also
  aligns with the snapshot prefer-copy gate (already 32 KiB for lazy), so
  the attach band and the no-snapshot band now coincide instead of
  leaving the 16-32 KiB band attached-without-snapshot
- correct the constant doc that misstated the upstream value as 16 KiB
- reword the over-key test comment: Dfast/Row/HC each have their own
  attach/copy regime, all intentionally kept out of the Fast-only
  fast_attach key (was inaccurately "Dfast/Row prime one way")
The binary-tree (btlazy2/btopt/btultra/btultra2) dict path always built a
separate DUBT dms (attach), with no copy path — diverging from upstream,
which COPIES the dict into the live tree above a per-strategy cutoff
(ZSTD_resetCCtx_byCopyingCDict). Mirror the upstream shape:

- attach (small/unknown source) builds the DUBT dms, as before
- copy (large known source) merges the dict into the live binary tree via
  the normal skip_matching path and drops the dms
- per-strategy cutoffs match attachDictSizeCutoffs: btlazy2/btopt 32 KiB,
  btultra/btultra2 8 KiB

Unify the attach/copy decision for both the lazy-HC and BT arms behind one
hc_dict_attach_mode() helper so skip_matching_for_dictionary_priming and
prime_with_dictionary stay in lock-step. The snapshot capture now retains
the full live tables whenever the dict was copied in (keyed on
dms.is_primed(), the exact decoupled signal) instead of assuming every BT
snapshot is dms-only.

Output stays a valid frame that upstream decodes (cross-validation tests
compress here + decompress via FFI); attach vs copy is a memory/speed
trade-off, not a wire change.
@polaz

polaz commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai re-review

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

@polaz Sure, I'll re-review the PR now!

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
zstd/src/encoding/match_generator.rs (2)

2548-2551: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Refresh the snapshot comments for non-Simple attach modes.

These comments are now stale: Dfast/Row/HashChain do have backend-specific attach/copy regimes, and lazy-HC attach can prime dms via prime_dms_hc. Keep the current keying behavior, but reword the rationale so fast_attach is described as Fast-only, while BT decoupling is specifically uses_bt && dms.is_primed().

Suggested wording adjustment
-        // backends prime the dictionary the same way regardless, so including
-        // the bit there would over-key identical resolved shapes. When it
-        // applies it matches the decision `prime_with_dictionary` makes from the
-        // same `reset_size_log`.
+        // Dfast/Row/HashChain have their own attach/copy regimes, but this bit
+        // models only the Simple/Fast backend's distinct dict-table shape. Those
+        // other backends are keyed by the resolved matcher geometry instead.
+        // When this bit applies, it matches the decision `prime_with_dictionary`
+        // makes from the same `reset_size_log`.
-        // being primed is the exact "decoupled" signal — it is set only by the BT
-        // attach prime and dropped in copy mode.
+        // For BT, `uses_bt && dms.is_primed()` is the exact "decoupled" signal.
+        // Lazy-HC attach can also prime `dms`, but it intentionally keeps the
+        // full live-table snapshot path.

Also applies to: 2918-2924

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/match_generator.rs` around lines 2548 - 2551, Update the
stale comments about attach-vs-copy mode in match_generator.rs at lines
2548-2551 and 2918-2924. The comments incorrectly claim that fast attach-vs-copy
is only used in the Simple backend, but Dfast/Row/HashChain backends also have
backend-specific attach/copy regimes (and lazy-HC can prime dms via
prime_dms_hc). Keep the current keying behavior unchanged, but reword the
comments to accurately reflect that fast_attach is Fast-backend-only and clarify
that BT decoupling is specifically controlled by the condition uses_bt &&
dms.is_primed(), not by the Simple/non-Simple backend distinction.

4827-4843: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep node_prices synchronized during seed initialization.

The SoA sidecar is retained scratch and is compared by the price-set paths, but these seed-frontier resets and the forced seed write update only nodes. Mirror the writes into node_prices to preserve the documented lockstep invariant.

Proposed fix
                 for p in 1..min_match_len.min(frontier_buffer_size) {
                     BtMatcher::reset_opt_node(&mut nodes[p]);
+                    node_prices[p] = u32::MAX;
                     // `initial_litlen` is the litlen carried from prior
                     let forced_state = HcOptimalNode {
                         price: forced_price,
                         off: candidate.offset as u32,
                         mlen: longest_len as u32,
                         litlen: 0,
                         reps: initial_reps,
                     };
                     if longest_len < frontier_buffer_size && forced_price < nodes[longest_len].price {
                         nodes[longest_len] = forced_state;
+                        node_prices[longest_len] = forced_price;
                     }

Also applies to: 4875-4877

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/match_generator.rs` around lines 4827 - 4843, The seed
initialization loop that resets nodes and assigns values to nodes[p].litlen is
updating the nodes array but not the parallel node_prices sidecar structure,
breaking the documented lockstep invariant between these two arrays. After each
update to nodes[p] (specifically when setting nodes[p].litlen with seed_litlen),
mirror the corresponding write into node_prices[p] to keep both structures
synchronized. Apply the same fix pattern at the other affected locations
mentioned (lines 4875-4877) where similar node updates occur without
corresponding node_prices updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/encoding/hc/mod.rs`:
- Around line 449-462: The condition gating the dictionary walk in the DICT
block is checking for table allocation via table.dms.table().is_some_and()
rather than checking the actual validity of the dictionary state. Replace this
condition to check dms.is_primed() instead, since the validity contract for the
DMS is provided by is_primed(), not table presence. An allocated table can
remain after the dictionary becomes invalidated, so checking only table presence
allows stale dictionary chains and indices to be incorrectly reused.

In `@zstd/src/encoding/match_generator.rs`:
- Around line 6521-6522: The lazy skip increment at the location where `step` is
calculated and `pos` is incremented can cause `pos` to jump past the first
non-searchable-but-hashable tail anchor on long literal runs, preventing the
tail loop from inserting it for the next block. After calculating `pos += step`,
clamp `pos` to not exceed the first non-searchable position while keeping the
searched-position skips unchanged. This preserves the tail insertion contract by
ensuring the tail loop still processes the critical anchor point in subsequent
iterations.

In `@zstd/src/encoding/match_table/storage.rs`:
- Around line 808-815: The cache key validation in the condition block (checking
is_primed(), region_len(), mls, and hash_log) is incomplete and can cause HC and
BT table layouts to be confused. Add a layout/mode discriminator field (e.g.,
layout field) to the DmsDictTables struct to distinguish between HC and BT
layouts. Include this layout field in the cache validation check alongside mls
and hash_log to ensure only tables with matching layouts are reused.
Additionally, set tables.layout to DmsDictLayout::Hc in the prime_dms_hc
function and tables.layout to DmsDictLayout::Bt in the prime_dms_bt function to
properly tag each table with its intended layout.

In `@zstd/src/tests/roundtrip_integrity.rs`:
- Line 321: The regression test on the for loop iterating over compression
levels is missing coverage for levels 8, 9, and 11 within the stated 5–12 band.
Modify the array passed to the for loop to include all integers from 5 through
12, replacing the current sparse list that skips those three levels. This
ensures complete regression coverage across the entire compression level range
being tested.

---

Outside diff comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 2548-2551: Update the stale comments about attach-vs-copy mode in
match_generator.rs at lines 2548-2551 and 2918-2924. The comments incorrectly
claim that fast attach-vs-copy is only used in the Simple backend, but
Dfast/Row/HashChain backends also have backend-specific attach/copy regimes (and
lazy-HC can prime dms via prime_dms_hc). Keep the current keying behavior
unchanged, but reword the comments to accurately reflect that fast_attach is
Fast-backend-only and clarify that BT decoupling is specifically controlled by
the condition uses_bt && dms.is_primed(), not by the Simple/non-Simple backend
distinction.
- Around line 4827-4843: The seed initialization loop that resets nodes and
assigns values to nodes[p].litlen is updating the nodes array but not the
parallel node_prices sidecar structure, breaking the documented lockstep
invariant between these two arrays. After each update to nodes[p] (specifically
when setting nodes[p].litlen with seed_litlen), mirror the corresponding write
into node_prices[p] to keep both structures synchronized. Apply the same fix
pattern at the other affected locations mentioned (lines 4875-4877) where
similar node updates occur without corresponding node_prices updates.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: c11e33a1-14aa-49ad-a651-cdb7333b6e48

📥 Commits

Reviewing files that changed from the base of the PR and between cd1b640 and 939a16b.

📒 Files selected for processing (7)
  • zstd/src/encoding/bt/mod.rs
  • zstd/src/encoding/hc/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/match_table/storage.rs
  • zstd/src/encoding/opt/types.rs
  • zstd/src/tests/dict_test.rs
  • zstd/src/tests/roundtrip_integrity.rs

Comment thread zstd/src/encoding/hc/mod.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs
Comment thread zstd/src/encoding/match_table/storage.rs
Comment thread zstd/src/tests/roundtrip_integrity.rs Outdated
CodeRabbit re-review follow-ups on the dict match-finder:

- dms cache key now carries a layout discriminator (DmsDictLayout::{Hc,Bt}).
  prime_dms_hc and prime_dms_bt populate hash_table/chain_table with
  incompatible meanings (HC single-link chain vs BT DUBT, chain_table len
  region vs 2*region); a reused compressor switching level across the
  HC<->BT boundary with coincident (region, mls, hash_log) could reuse the
  wrong layout. Tag each build and require a layout match to reuse.
- optimal-parser seed keeps the SoA node_prices sidecar in lockstep with
  nodes[].price: the seed reset loop and the forced-seed write now mirror
  into node_prices so the price-set paths compare against the live baseline.
- gate the lazy-HC dms walk on dms.is_primed() (the canonical validity
  contract) instead of probing table presence.
- borrowed-HC roundtrip regression covers the full L5..=12 band.
- refresh stale attach/copy comments: fast_attach is Fast-only; BT
  decoupling is uses_bt && dms.is_primed(); document why the lazy skip needs
  no tail clamp (HC_MIN_MATCH_LEN == insert width, no missed anchor).

Adds dms_cache_rebuilds_across_hc_bt_layout_switch (fails without the
discriminator).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/encoding/match_generator.rs (1)

6456-6460: ⚠️ Potential issue | 🟡 Minor

Switch the dispatch guard to dms.is_primed().

The comment states "dispatcher: pick the dict-aware monomorph when a separate dms is primed," but the code checks dms.table().is_some(). Use is_primed() for semantic correctness and consistency with the other dictionary dispatcher at line 2925 that already uses this check.

Note: invalidate() fully clears the table and primed flag, so the concern about preserving tables for cache reuse is unfounded.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@zstd/src/encoding/match_generator.rs` around lines 6456 - 6460, The dispatch
guard in the start_matching_lazy_impl call uses dms.table().is_some() to
determine which monomorph to invoke, but should instead use dms.is_primed() for
semantic correctness. Change the condition from checking if the table is some to
checking if the dms is primed, which aligns with the inline comment stating
"dispatcher: pick the dict-aware monomorph when a separate dms is primed" and
matches the pattern already used at line 2925. This ensures consistency across
dictionary dispatchers since invalidate() clears both the table and the primed
flag together.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 6456-6460: The dispatch guard in the start_matching_lazy_impl call
uses dms.table().is_some() to determine which monomorph to invoke, but should
instead use dms.is_primed() for semantic correctness. Change the condition from
checking if the table is some to checking if the dms is primed, which aligns
with the inline comment stating "dispatcher: pick the dict-aware monomorph when
a separate dms is primed" and matches the pattern already used at line 2925.
This ensures consistency across dictionary dispatchers since invalidate() clears
both the table and the primed flag together.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 78ba165c-5a38-4937-8195-c1c7bb87659b

📥 Commits

Reviewing files that changed from the base of the PR and between 939a16b and d9b6365.

📒 Files selected for processing (4)
  • zstd/src/encoding/hc/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/match_table/storage.rs
  • zstd/src/tests/roundtrip_integrity.rs

The lazy start_matching dispatcher picked the dict-aware monomorph on
dms.table().is_some(); its own doc says 'when a separate dms is primed'.
Switch to dms.is_primed() so it matches the inner dms-walk guard and the
optimal dispatcher. Behaviorally equivalent (invalidate() clears both the
table and the primed flag together), but the canonical validity contract.
@polaz

polaz commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Finish the AoS->SoA split in the optimal parser: remove the `price` field
from `HcOptimalNode` and keep each position's price ONLY in the parallel
`node_prices: Box<[u32]>`.

Previously price was duplicated (in the 28-byte node struct AND the flat
SoA array the SIMD price-set vector-loads), which required a lockstep
invariant and was the source of the desync bug class (forced-seed sidecar
miss). With one home for each price there is nothing to keep in sync, so
that bug class is gone by construction.

- frontier reads/writes go through node_prices[i] (node struct loses price)
- transient node copies that carried price (forced_end_state, last_stretch)
  now track it via parallel locals (forced_end_price / last_stretch_price);
  prev_match's price reuses the pre-overwrite node_pos_price
- per node update we now store the price once (was twice), and the AoS
  struct shrinks 28->24 bytes
- reset_opt_node only resets litlen; price reset stays in reset_opt_nodes

Pure representation change: compressed output is byte-identical (850 tests
incl. cross-validation FFI round-trip, clippy, fmt green). Upstream zstd
keeps one `opt[].price` and has no such invariant; this matches that
single-source property while keeping our SIMD vector price compare.
@polaz polaz merged commit 0d46f4e into main Jun 16, 2026
27 of 28 checks passed
@polaz polaz deleted the perf/hc-dict-lazyskip-dms branch June 16, 2026 18:35
@sw-release-bot sw-release-bot Bot mentioned this pull request Jun 16, 2026
@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown

Greptile Summary

  • Adds dictionary-aware lazy hash-chain compression with lazy skipping and a separate dictionary match state for small-window dictionary inputs.
  • Monomorphizes dictionary and non-dictionary match search paths while keeping the no-dictionary hot path separate.
  • Moves optimal-parser node pricing into sidecar storage and adds SIMD/vectorized price-set helpers across supported CPU tiers.
  • Updates dictionary snapshot reuse behavior and regression coverage for borrowed history and dictionary round trips.

Confidence Score: 5/5

The changes are focused on compression matching and pricing paths with dedicated regression coverage and no outstanding code findings.

The implementation is supported by round-trip, dictionary, documentation, and lint testing described for the changed paths, and no blocking issues were identified.

T-Rex T-Rex Logs

What T-Rex did

  • Ran the lazy-dict performance benchmark and observed that the head run is faster for most target levels, though L6-L10 and L12 workloads no longer benefit from the dictionary, yielding roughly a 1.0014 ratio versus the base 0.2652.
  • Executed roundtrip correctness tests for no-dict and dict configurations; both sets passed (3/3 no-dict L10 and 36/36 dict roundtrip cases), with cargo test exit code 0.
  • Compared the opt-price-set baselines and head; the base checkout timed out during the release run, while the head checkout built and released successfully with SIMD tier features enabled.

View all artifacts

T-Rex Ran code and verified through T-Rex

Comments Outside Diff (1)

  1. General comment

    P1 Lazy dictionary HC speedup regresses compression ratio for L6-L10 and L12

    • Bug
      • On the executed small-window dictionary compression workload, head reduces median compression time for L6-L10 and L12, but compressed output grows from 2,716 bytes on base to 10,254 bytes on head for those levels. The input is 10,240 bytes and embeds repeated slices from a 10,240-byte raw-content dictionary, so the base output ratio is 0.2652 while head's output ratio is 1.0014. All outputs decompress correctly, but the PR validation contract requires ratio-equivalent decompressible output; head only preserves the base size at L11.
    • Cause
      • The changed lazy hash-chain dictionary/match-generator path appears to skip or fail to apply dictionary matching for L6-L10 and L12 in this small-window borrowed-dictionary scenario, likely in the new lazy HC dictionary candidate gating / dictMatchState / borrowed-window hashing logic in zstd/src/encoding/hc/mod.rs or zstd/src/encoding/match_generator.rs.
    • Fix
      • Audit the lazy HC dictionary path for levels L6-L10 and L12 to ensure the dictionary is primed and dictionary candidates are considered before falling back to literal/uncompressed-like output. Add a regression benchmark/test similar to the attached harness that asserts both decompression success and ratio-equivalent size for deterministic 10 KiB dictionary-derived input across L6-L12.

    T-Rex Ran code and verified through T-Rex

Reviews (1): Last reviewed commit: "refactor(opt): make node_prices the sing..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant