Commit 8065eb2
perf(decode): straight-loop short path + donor-gated lookahead ring + SeqSymbol repack (#289)
* refactor(fse): add SeqSymbol (8B) + FseEntry trait
Lays the type-level foundation for splitting the FSE decoding table
into a 12-byte HUF-grade `Entry` variant (keeps `symbol` for
huff0) and an 8-byte sequence-section `SeqSymbol` variant
(matches donor `ZSTD_seqSymbol` exactly, drops `symbol`).
Pure additive change: trait + new struct + `impl Default for Entry`.
No callsite touches. `FSETable` / `FSEDecoder` stay non-generic for
this commit; the generic refactor lands next.
647/647 tests pass.
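The size difference between the two entry shapes can be illustrated with a minimal sketch. The `SeqSymbol` field set follows zstd's `ZSTD_seqSymbol` (nextState, nbAdditionalBits, nbBits, baseValue); the `Entry` layout and the `entry_sizes` helper are assumptions made for illustration only, not the crate's actual definitions.

```rust
use std::mem::size_of;

/// Hypothetical 12-byte entry that still carries the decoded `symbol`.
/// The real `Entry` layout is not shown in this commit; this is an assumption.
#[repr(C)]
#[derive(Clone, Copy, Default, Debug)]
pub struct Entry {
    pub base_value: u32,
    pub new_state: u16,
    pub num_additional_bits: u8,
    pub num_bits: u8,
    /// Ninth payload byte; 4-byte alignment pads the struct up to 12 bytes.
    pub symbol: u8,
}

/// 8-byte sequence-section entry with the same field set as `ZSTD_seqSymbol`.
#[repr(C)]
#[derive(Clone, Copy, Default, Debug)]
pub struct SeqSymbol {
    pub new_state: u16,
    pub num_additional_bits: u8,
    pub num_bits: u8,
    pub base_value: u32,
}

/// Returns `(size_of::<Entry>(), size_of::<SeqSymbol>())`.
pub fn entry_sizes() -> (usize, usize) {
    (size_of::<Entry>(), size_of::<SeqSymbol>())
}
```

Under these assumed layouts, dropping `symbol` removes the one byte that pushes the struct past 8 bytes, so the table shrinks by a third.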
* refactor(fse): generic FSETableImpl<E> / FSEDecoderImpl<'t, E>
Parameterize the decoder table and decoder over the entry type via
the new `FseEntry` trait (`num_bits()`, `new_state()`, `from_raw()`).
Type aliases preserve every existing callsite:
    pub type FSETable = FSETableImpl<Entry>; // HUF default
    pub type SeqFSETable = FSETableImpl<SeqSymbol>; // sequence section
    pub type FSEDecoder<'t> = FSEDecoderImpl<'t, Entry>;
HUF-only methods (`decode_symbol`, `enrich_with_packed_seq_meta`,
`enrich_for_offsets`) live on `impl FSETableImpl<Entry>` /
`impl FSEDecoderImpl<'_, Entry>` blocks. State transitions go through
the trait, so the same hot path serves both entry shapes.
`SeqFSETable` is declared but not yet wired into FSEScratch — that
landing requires the build-time meta-population pass, which lands
next.
647/647 tests pass. Clippy clean.
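A rough sketch of the generic shape described above, assuming simplified field sets, a hypothetical `from_raw` signature, and a hypothetical `update_state` method. Only the trait method names, the `Impl` type names, and the aliases come from the commit text.

```rust
/// Entry-shape abstraction; the `from_raw` signature is an assumption.
pub trait FseEntry: Copy + Default {
    fn num_bits(&self) -> u8;
    fn new_state(&self) -> u16;
    fn from_raw(new_state: u16, num_bits: u8, symbol: u8) -> Self;
}

/// Simplified HUF-side entry (keeps `symbol`).
#[derive(Clone, Copy, Default)]
pub struct Entry {
    pub new_state: u16,
    pub num_bits: u8,
    pub symbol: u8,
}

/// Simplified sequence-side entry (no `symbol`).
#[derive(Clone, Copy, Default)]
pub struct SeqSymbol {
    pub new_state: u16,
    pub num_additional_bits: u8,
    pub num_bits: u8,
    pub base_value: u32,
}

impl FseEntry for Entry {
    fn num_bits(&self) -> u8 { self.num_bits }
    fn new_state(&self) -> u16 { self.new_state }
    fn from_raw(new_state: u16, num_bits: u8, symbol: u8) -> Self {
        Entry { new_state, num_bits, symbol }
    }
}

impl FseEntry for SeqSymbol {
    fn num_bits(&self) -> u8 { self.num_bits }
    fn new_state(&self) -> u16 { self.new_state }
    // Meta fields start zeroed; a later enrich pass would populate them.
    fn from_raw(new_state: u16, num_bits: u8, _symbol: u8) -> Self {
        SeqSymbol { new_state, num_bits, ..Default::default() }
    }
}

pub struct FSETableImpl<E: FseEntry> {
    pub decode: Vec<E>,
}

pub struct FSEDecoderImpl<'t, E: FseEntry> {
    pub table: &'t FSETableImpl<E>,
    pub state: usize,
}

pub type FSETable = FSETableImpl<Entry>;
pub type SeqFSETable = FSETableImpl<SeqSymbol>;
pub type FSEDecoder<'t> = FSEDecoderImpl<'t, Entry>;

impl<'t, E: FseEntry> FSEDecoderImpl<'t, E> {
    pub fn new(table: &'t FSETableImpl<E>, state: usize) -> Self {
        Self { table, state }
    }

    /// Shared state transition; `bits` stands in for the value read from the
    /// bitstream (hypothetical method name).
    pub fn update_state(&mut self, bits: u16) {
        let e = self.table.decode[self.state];
        self.state = usize::from(e.new_state()) + usize::from(bits);
    }
}

/// HUF-only surface: exists solely on the `Entry` instantiation.
impl FSEDecoderImpl<'_, Entry> {
    pub fn decode_symbol(&self) -> u8 {
        self.table.decode[self.state].symbol
    }
}
```

In this sketch `decode_symbol` is simply unavailable on a `SeqSymbol` decoder, which is one way to get the compile-time split the commit describes.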
* refactor(fse): migrate sequence-section LL/ML/OF to SeqFSETable (8B entries)
Switch `FSEScratch.literal_lengths` / `match_lengths` / `offsets` and
the predefined-table cache to `SeqFSETable` (= FSETableImpl<SeqSymbol>).
LL / ML / OF entries shrink from 12 bytes (`Entry`) to 8 bytes
(`SeqSymbol`), matching donor `ZSTD_seqSymbol` exactly.
Mechanical scope:
- `AlignedFSETable` wraps `SeqFSETable`.
- `SeqFSEDecoder<'t> = FSEDecoderImpl<'t, SeqSymbol>` for all
per-sequence callsites (`run_pipelined_sequence_loop`,
`decode_one_sequence_inline`, x86-kernel macro).
- `enrich_with_packed_seq_meta` / `enrich_for_offsets` move to
`impl FSETableImpl<SeqSymbol>`, reading the source byte from the
persisted `symbol_spread_buffer` (`SeqSymbol` has no `symbol`).
- `reinit_from` now copies the spread buffer (was only reserving
capacity) so post-`reinit_from` enrich calls observe a valid
per-slot source byte.
- `compute_offsets_long_share` reads `entry.num_additional_bits`
(== offset code for code < 32 after `enrich_for_offsets`) instead
of the dropped `entry.symbol`.
- `decode_sequences_with_rle` no longer calls `decode_symbol()` —
the source byte is unused in the FSE-mode branch; RLE-mode reads
ll_rle / ml_rle / of_rle directly and OF computes `1 << of_rle`
inline.
- Drop Entry-side `enrich_*` (HUF tables never enrich).
HUF (`huff0_decoder`) continues to use `FSETable` (= 12-byte Entry,
keeps `symbol` for the per-state byte lookup).
647/647 tests pass. Clippy clean.
* perf(decode): align steady-state seq loop entry to 64-byte DSB window
Insert explicit `.p2align 6 / nop / .p2align 5 / nop / .p2align 3`
asm directives directly before the pipelined seq-decoder steady-state
loop in `run_pipelined_sequence_loop`. Mirrors donor's
`ZSTD_decompressSequences_body_bmi2_noExt_rawLit` (zstd-pure-rs
zstd_decompress_block.rs:2282-2290).
Targets the Skylake-X DSB pressure diagnosed earlier: the hot loop
(~3000 µops) exceeds the 1536-µop DSB capacity, and 40% of µops were
arriving via the legacy MITE decoder with measurable DSB2MITE switch cost.
647/647 tests pass. Single-asm, no mem/reg clobbers.
* perf(decode): replace pipelined lookahead ring with donor straight loop
Strip the 8-deep `ExecSeq` ring, `shadow_hist`, `prefetch_pos`
arithmetic, and `prefetch_lookahead_match_source` issuing inside the
pipelined seq decoder. The new body is donor `_noExt_rawLit` shape:
decode one sequence, resolve its offset, execute immediately, advance
state, next. Zero per-iteration ring bookkeeping; hardware prefetcher
handles short-distance matches on hot-cache corpora (z000033 fixture).
Applied to both the K-agnostic safe `run_pipelined_sequence_loop` and
the per-tier x86 macro `define_x86_seq_decoder_tier!`'s `$loop_fn`.
DSB align asm padding from R10-A retained ahead of the loop entry.
Outside-of-hot-loop callsites that referenced `ExecSeq`,
`execute_one_sequence_pipelined_resolved`, `ADVANCE_MASK`, and the
lookahead-prefetch surface remain compiled (warnings only); a future
cleanup commit can drop them once the architecture is bench-validated.
647/647 tests pass on aarch64.
* fix(decode): drop stale shadow_hist commit in tier macro tail
* perf(decode): restore lookahead-ring $loop_fn for use_long_pipeline arm
R10-C straight-loop replaced $loop_fn body entirely with donor
`_noExt_rawLit` shape, but $loop_fn is the long-pipeline arm: the
`use_long_pipeline = ddict_is_cold || offsets_long_share >= 7`
gate routes ONLY cold-dict / long-offset frames here. Donor parity
demands the lookahead ring with `prefetch_lookahead_match_source`
in that arm (donor `ZSTD_decompressSequencesLong_body_bmi2_impl`),
not the straight loop.
Restore the original 8-deep ExecSeq ring + shadow_hist +
prefetch_pos arithmetic inside $loop_fn. The else-arm (short-block
iterative fallback) keeps its current straight single-pass-fused
shape — that's where the common warm/short-offset case lands and
where the R10-C win actually came from on z000033.
DSB align asm padding kept ahead of the long-pipeline steady-state
loop entry (R10-A carry-over). Tests 647/647 on aarch64.
* perf(decode): gate lookahead-ring on totalHistorySize > 16 MB (donor parity)
Donor `usePrefetchDecoder` (zstd_decompress_block.rs:3231-3238)
engages the prefetch decoder ONLY when total history size exceeds
16 MB (`> 1<<24`). Smaller frames have history that fits in
L2/L3, hardware prefetch handles short/medium offsets, and the
in-loop `prefetch_lookahead_match_source` issue is pure overhead.
Our previous gate `ddict_is_cold || offsets_long_share >= 7` was
too aggressive: it engaged the ring on 1 MB frames like z000033
where donor takes the straight-loop path. Add the donor's
`total_history > 1<<24` predicate to the long-offset arm.
`ddict_is_cold` is preserved as an unconditional gate (first
block of a freshly-attached dict frame benefits from prefetch
regardless of size — that's the cold-dict DRAM-hiding scenario
the ring was originally for).
Tests 647/647 on aarch64.
* fix(decode): gate MIN_LONG_OFFSET_SHARE / HISTORY_THRESHOLD for i686 build
- MIN_LONG_OFFSET_SHARE: gate the 64-bit value under cfg
(target_pointer_width = "64") so it isn't doubly-defined alongside
the not-64-bit override.
- HISTORY_THRESHOLD_FOR_PREFETCH: define unconditionally so the
use_long_pipeline predicate compiles on 32-bit. 16 MB literal fits
in usize on both pointer widths.
- SeqFSETable rustdoc clarified: from_raw zero-inits meta fields;
LL / ML / OF enrich pass populates them post-build via
enrich_with_packed_seq_meta / enrich_for_offsets.
- enrich_with_packed_seq_meta rustdoc: dropped stale link to the
deleted Entry-side helper.
- scratch.rs comment: Vec<Entry> -> Vec<SeqSymbol>.
- Tier-macro $loop_fn doc: full gate documented (cold_dict OR
(hist > 16MB AND long_share >= MIN_LONG_OFFSET_SHARE)).
* perf(fse): drop dead spread-buffer copy from reinit_from
Predefined-cache reinit_from + dict-init reinit_from feed a source
table whose decode[] is already enriched. The post-reinit decoder
never re-reads symbol_spread_buffer (build-time scratch + enrich
source only). Revert to reserve-capacity-only so dict-init and
Predefined paths skip the unconditional bytes copy that was added
when SeqSymbol enrich gained spread-buffer dependency.
* refactor(fse,decode): tighten FSE internals visibility + dedup long-pipeline gate
- FSEDecoderImpl, FseEntry trait -> pub(crate). Not reachable through any
public type chain; were leaking through `pub use fse_decoder::*` without
being part of the intended public surface.
- FSETableImpl, SeqSymbol stay pub but #[doc(hidden)] — they're reachable
through the public Deref impl on AlignedFSETable (AlignedFSETable ->
SeqFSETable = FSETableImpl<SeqSymbol>) which the public Dictionary type
exposes via its FSEScratch field.
- Extract compute_use_long_pipeline() helper in sequence_section_decoder
as the single source of truth for the MIN_LONG_OFFSET_SHARE / 16-MB
history / num_sequences gate. Both the K-generic dispatcher and the
per-tier x86 macro now call into it, eliminating the duplicate
definitions and preventing future drift between scalar and SIMD
dispatch.
* perf(fse): reuse symbol_probabilities buffer in build_from_probabilities
Replace `self.symbol_probabilities = probs.to_vec()` with
clear() + extend_from_slice(). Avoids fresh heap allocation on
every call and preserves Vec capacity (with_capacity(256) in
the constructor was being thrown away).
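A minimal sketch of the buffer-reuse pattern, assuming a stand-in `ProbBuf` struct and an `i32` element type (both hypothetical; only the field name, the method name, and the `with_capacity(256)` call come from the commit text).

```rust
/// Hypothetical holder for the probabilities buffer.
pub struct ProbBuf {
    pub symbol_probabilities: Vec<i32>,
}

impl ProbBuf {
    pub fn new() -> Self {
        // Capacity reserved once, up front.
        Self { symbol_probabilities: Vec::with_capacity(256) }
    }

    /// Reuses the existing allocation: `clear()` keeps capacity, and
    /// `extend_from_slice` only reallocates if `probs` outgrows it.
    pub fn build_from_probabilities(&mut self, probs: &[i32]) {
        self.symbol_probabilities.clear();
        self.symbol_probabilities.extend_from_slice(probs);
    }
}
```

By contrast, assigning `probs.to_vec()` drops the old buffer and allocates a new one sized exactly to `probs`, so the reserved 256-slot capacity is lost on the first call.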
* fix(fse): widen FSEDecoderImpl visibility to match pub type aliases
The `pub(crate) struct FSEDecoderImpl` was re-exported via two
`pub type` aliases (`FSEDecoder` for HUF weight-stream callers,
`SeqFSEDecoder` for the sequence decoder). Under
`feature = "fuzz_exports"` the `crate::fse` module becomes
`pub mod fse`, so the public aliases would name a more-private
struct — the compiler emits `type FSEDecoderImpl... is more
private than the item FSEDecoder` (`private_interfaces` lint).
Make the struct itself `pub`. External reachability is still
gated by `crate::fse` module visibility:
- default build: `pub(crate) mod fse` keeps everything
crate-internal regardless of the inner `pub`
- `fuzz_exports`: `pub mod fse` exposes the struct, which is
exactly what the fuzz harness needs
No behavioural change; warning-only fix on the fuzz_exports build.
* perf(decode): R12 monolithic per-kernel decoder foundation (#291)
* refactor(decode): AVX2-tier monolithic sequence decoder
Replaces the macro-generated three-function tier (decode_fn / loop_fn /
decode_one_fn) for the AVX2 path with one self-contained
`#[target_feature(enable = "bmi2,avx2")]` function. The body inlines the
entire decode + execute pipeline: outer init, RLE dispatch, FSE state
init, both pipeline arms (8-deep lookahead ring for cold-dict / >16 MB
total-history, straight short-block loop otherwise), per-sequence
decode (peek_bits_triple_bmi2 with `_pext_u64` inline at call site),
per-sequence execute (32-byte AVX2 ymm wildcopy via
exec_sequence_inline_avx2), and post-loop bitstream-tail validation.
Two internal helpers (decode_one_avx2, execute_one_avx2) carry the same
target_feature scope and are `#[inline]` — LLVM collapses them into the
caller in release mode since both are single-callsite hot-path functions
inside the matching target_feature.
Companion monoliths for BMI2 / VBMI2 / Scalar in follow-up commits. The
K-generic dispatcher in sequence_section_decoder still selects the tier
once per call via cached detect_cpu_kernel.
* refactor(decode): BMI2-tier monolithic sequence decoder
Same shape as the AVX2 monolith but pinned to `Bmi2Kernel` and using
the SSE2 16-byte `exec_sequence_inline` for match copy (no AVX2 ymm
widening at this tier). Outer init + RLE dispatch + FSE state init +
both pipeline arms + decode_one + execute_one all inlined into one
`#[target_feature(enable = "bmi2")]` function.
Triple-bit extract goes through `peek_bits_triple_bmi2` (_pext_u64
inline) gated by the cached vendor policy — same as AVX2.
* refactor(decode): VBMI2-tier monolithic sequence decoder
Same shape as the AVX2 monolith but pinned to `Vbmi2Kernel` and
carrying the full VBMI2 + AVX-512 + AVX2 + BMI2 target_feature scope.
Match copy uses `exec_sequence_inline_avx2` since VBMI2 always implies
AVX2+BMI2; zmm 64-byte wildcopy + VPSHUFB short-offset shortcut wait
until WILDCOPY_OVERLENGTH grows to 64.
* refactor(decode): Scalar monolith + drop x86 macro + one-time dispatch
Adds `seq_decoder_scalar.rs` as the Scalar-tier monolith (same shape
as the AVX2 / BMI2 / VBMI2 monoliths, pinned to `ScalarKernel`, no
target_feature). Dispatcher in
`sequence_section_decoder::decode_and_execute_sequences` now routes
the Scalar arm into the new monolith too.
Deletes `seq_decoder_x86_kernel.rs` entirely — the
`define_x86_seq_decoder_tier!` macro is no longer referenced by any
tier file. Each tier (Scalar / BMI2 / AVX2 / VBMI2) is now a fully
self-contained function with the whole decode + execute pipeline
inlined in one place, selected ONCE per call via cached
`detect_cpu_kernel`.
aarch64 NEON / SVE still go through the K-generic
`decode_and_execute_sequences_impl` shared body until their own
monoliths land in a follow-up.
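The "select once via cached `detect_cpu_kernel`" idea might look roughly like the sketch below. The `CpuKernel` enum, the `probe` helper, the `tier_name` function, and the reduced tier set (no VBMI2 / NEON / SVE) are assumptions for illustration; the crate's real detection logic is not shown in this commit.

```rust
use std::sync::OnceLock;

/// Hypothetical, reduced tier set.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CpuKernel {
    Scalar,
    Bmi2,
    Avx2,
}

/// Runtime feature probe on x86 targets.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn probe() -> CpuKernel {
    let bmi2 = is_x86_feature_detected!("bmi2");
    let avx2 = is_x86_feature_detected!("avx2");
    if bmi2 && avx2 {
        CpuKernel::Avx2
    } else if bmi2 {
        CpuKernel::Bmi2
    } else {
        CpuKernel::Scalar
    }
}

/// Everything else falls back to the scalar tier in this sketch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn probe() -> CpuKernel {
    CpuKernel::Scalar
}

/// The probe runs at most once; later calls read the cached value.
pub fn detect_cpu_kernel() -> CpuKernel {
    static KERNEL: OnceLock<CpuKernel> = OnceLock::new();
    *KERNEL.get_or_init(probe)
}

/// One `match` per call picks the self-contained tier body.
pub fn tier_name() -> &'static str {
    match detect_cpu_kernel() {
        CpuKernel::Avx2 => "avx2",
        CpuKernel::Bmi2 => "bmi2",
        CpuKernel::Scalar => "scalar",
    }
}
```

In the real dispatcher each match arm would call the corresponding monolith rather than return a name.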
* perf(decode): macro-expand decode_one + execute_one inside each kernel monolith
Replaces the per-tier `#[inline]` helper functions (`decode_one_avx2`,
`execute_one_avx2`, BMI2/VBMI2/Scalar equivalents) with `macro_rules!`
blocks that expand textually at every callsite inside the monolithic
outer function. This bypasses the LLVM inline cost-model entirely:
macros expand BEFORE LLVM sees the code, so the resulting compiled
function has NO inner CALL boundary regardless of cost-model decisions.
Why this matters: `#[inline]` is a hint, not a command. With
`target_feature` scopes, `Result<>` panic landings, multiple callsites,
and generic `B: BufferBackend` parameterisation, LLVM's heuristic
declines to inline — leaving CALL boundaries that fragment register
allocation and break the µop-cache window on Skylake-X
(DSB ≤ 1536 µops). Combined with the Rust limit that forbids
`#[inline(always)]` + `#[target_feature]` together (rust-lang/rust#145574),
macros are the only mechanism that guarantees the textual inline donor C
achieves via single-source monoliths.
Macro shape: two `macro_rules!` per tier — `decode_one_body!`
(state read + triple-bit extract + advance) and `execute_one_body!`
(literal slice + offset gates + match copy + cold fallback). The execute
body uses labeled-block early-exit (`break 'exec_inner Err(...)`),
stable since 1.65, instead of closure or `?` — closures and `?` recreate
function boundaries.
Affects all four monoliths: Scalar / BMI2 / AVX2 / VBMI2.
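The macro-plus-labeled-block pattern can be sketched as follows. Only the macro name `execute_one_body!` and the `'exec_inner` label come from the commit; the byte-wise copy, the argument list, the `run_sequence` wrapper, and the error strings are simplified assumptions, not the real execute body.

```rust
/// Expands textually at the callsite, so no call boundary is left for the
/// optimizer to decide on. Early exit uses a labeled block instead of `?`.
macro_rules! execute_one_body {
    ($out:ident, $lit:expr, $offset:expr, $match_len:expr) => {
        'exec_inner: {
            // Literal copy.
            $out.extend_from_slice($lit);
            let offset: usize = $offset;
            let match_len: usize = $match_len;
            // Offset gate: must point inside the bytes produced so far.
            if offset == 0 || offset > $out.len() {
                break 'exec_inner Err("offset out of range");
            }
            // Byte-wise match copy; overlapping matches repeat earlier output.
            let start = $out.len() - offset;
            for i in 0..match_len {
                let b = $out[start + i];
                $out.push(b);
            }
            Ok(())
        }
    };
}

/// Hypothetical wrapper standing in for the monolithic outer function.
pub fn run_sequence(
    out: &mut Vec<u8>,
    lit: &[u8],
    offset: usize,
    match_len: usize,
) -> Result<(), &'static str> {
    execute_one_body!(out, lit, offset, match_len)
}
```

For example, starting from `ab` with literal `c`, offset 3 and match length 4, the output becomes `abcabca`; an offset larger than the output so far yields `Err` after the literals have been appended.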
* perf(huff0): drop Vec<Entry> table, unify on packed u16 (donor HUF_DEltX1) (#292)
The HUF decode table was held in two parallel representations:
- `decode: Vec<Entry { symbol: u8, num_bits: u8 }>` for scalar /
weight-stream / single-symbol fallback readers
- `packed_decode: Vec<u16>` (donor `HUF_DEltX1` format: low byte =
symbol, high byte = num_bits) for the hot 4-stream burst path
`build_table_from_weights` filled both with `Vec::fill(value)` over
the same index range — two `__memset_avx2`-driven passes per rank
group. On the z000033 decode flamegraph the HUF table build is
12.53% of self-time, with `build_table_from_weights` 5.02% (largest
single setup cost on a workload with many small blocks).
Drop the `Entry` Vec entirely. Every reader switches to inline
unpack of the packed u16: `packed as u8` for symbol, `(packed >> 8)
as u8` for num_bits. Single AND/SHR on registers — same cost as the
struct-field reads being replaced. Donor `huf_decompress.c` reads
the `HUF_DEltX1` byte-wise from a single u16-wide table, so this is
structural parity rather than a representation invention.
`Entry` struct dropped as unused; no external re-export.
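The packed layout described above (low byte = symbol, high byte = num_bits) can be sketched as below. The helper names `pack_entry`, `unpack_symbol`, `unpack_num_bits`, and `fill_rank` are hypothetical and stand in for the inline expressions the commit uses.

```rust
/// Pack one table entry: low byte = symbol, high byte = num_bits.
#[inline]
pub fn pack_entry(symbol: u8, num_bits: u8) -> u16 {
    u16::from(symbol) | (u16::from(num_bits) << 8)
}

/// `packed as u8` keeps the low byte.
#[inline]
pub fn unpack_symbol(packed: u16) -> u8 {
    packed as u8
}

/// `(packed >> 8) as u8` keeps the high byte.
#[inline]
pub fn unpack_num_bits(packed: u16) -> u8 {
    (packed >> 8) as u8
}

/// A single fill pass over a rank group, instead of one pass per
/// parallel representation.
pub fn fill_rank(table: &mut [u16], start: usize, len: usize, symbol: u8, num_bits: u8) {
    table[start..start + len].fill(pack_entry(symbol, num_bits));
}
```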
* fix(decode): saturating_add for total_history to prevent 32-bit usize wrap
`buffer.window_size + buffer.dict_content.len()` is plain `usize +
usize`. On 32-bit targets (i686, wasm32) a pathological frame
combining a large window with a non-trivial dict could wrap the
sum past `usize::MAX`, silently flipping the
`compute_use_long_pipeline` gate around `HISTORY_THRESHOLD_FOR_PREFETCH`
(16 MiB) in the wrong direction.
Replace with `.saturating_add()` at all 5 sites where
`total_history` feeds into the long-pipeline gate:
- `sequence_section_decoder.rs:366` (K-generic dispatcher)
- `seq_decoder_avx2.rs:258`
- `seq_decoder_bmi2.rs:230`
- `seq_decoder_vbmi2.rs:226`
- `seq_decoder_scalar.rs:176`
Saturation pins overflow to `usize::MAX`, which is comfortably
above the 16 MiB threshold — the gate engages, matching the
non-wrapped intent (large total history → long pipeline pays off).
Verified: `cargo check --target i686-unknown-linux-gnu` clean,
647/647 tests pass on x86_64.
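The wrap can be reproduced on any host by modelling a 32-bit `usize` with `u32`. This is a sketch with hypothetical function names: `wrapping_add` stands in for what plain `+` does in a release build (a debug build would panic instead), and the input values are contrived.

```rust
/// 16 MiB threshold, expressed in the emulated 32-bit width.
pub const HISTORY_THRESHOLD_FOR_PREFETCH: u32 = 1 << 24;

/// Models release-mode `usize + usize` on a 32-bit target.
pub fn total_history_wrapping(window_size: u32, dict_len: u32) -> u32 {
    window_size.wrapping_add(dict_len)
}

/// The fixed form: overflow pins to the maximum value.
pub fn total_history_saturating(window_size: u32, dict_len: u32) -> u32 {
    window_size.saturating_add(dict_len)
}

/// History half of the long-pipeline gate (strict inequality).
pub fn history_exceeds_threshold(total_history: u32) -> bool {
    total_history > HISTORY_THRESHOLD_FOR_PREFETCH
}
```

With `window_size = u32::MAX - 10` and `dict_len = 100`, the wrapping sum is 89 and falls below the threshold, while the saturating sum is `u32::MAX` and stays above it.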
* test(decode): add boundary unit tests for compute_use_long_pipeline
Cover the four-dimensional gate logic:
- num_sequences threshold (ADVANCE*2 = 16)
- ddict_is_cold override (forces engage independent of history/share)
- history threshold (1 << 24 = 16 MiB, `>` not `>=`)
- per-target-pointer-width MIN_LONG_OFFSET_SHARE (7 on 64-bit, 20 on 32-bit)
Six tests covering the boundary cases that would silently regress
if the heuristic is re-tuned without re-reading the spec:
- below 2x ADVANCE never engages
- cold dict at min seq count overrides everything
- history exactly at threshold does NOT engage (strict inequality)
- history one above threshold + share at min DOES engage
- share below min blocks even with large history
- saturating-history fallback (usize::MAX) engages cleanly
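The four-dimensional gate listed above can be sketched as a single predicate. The parameter order, the `u32` share type, and the `>=` comparison on the sequence count are assumptions inferred from the boundary cases in this message; the constant names and values are taken from it.

```rust
/// Ring depth; 2 * ADVANCE = 16 is the sequence-count threshold.
pub const ADVANCE: usize = 8;

/// 16 MiB; the gate uses a strict `>` against this value.
pub const HISTORY_THRESHOLD_FOR_PREFETCH: usize = 1 << 24;

#[cfg(target_pointer_width = "64")]
pub const MIN_LONG_OFFSET_SHARE: u32 = 7;
#[cfg(not(target_pointer_width = "64"))]
pub const MIN_LONG_OFFSET_SHARE: u32 = 20;

/// Hypothetical signature for the shared gate helper.
pub fn compute_use_long_pipeline(
    num_sequences: usize,
    ddict_is_cold: bool,
    total_history: usize,
    offsets_long_share: u32,
) -> bool {
    // Too few sequences: the ring never pays for itself.
    if num_sequences < ADVANCE * 2 {
        return false;
    }
    // Cold dict engages unconditionally; otherwise both the history size
    // and the long-offset share must clear their thresholds.
    ddict_is_cold
        || (total_history > HISTORY_THRESHOLD_FOR_PREFETCH
            && offsets_long_share >= MIN_LONG_OFFSET_SHARE)
}
```

Under this reading, a frame with 16 sequences and exactly 16 MiB of history does not engage the ring, while one more byte of history with the share at the minimum does.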
* test(decode): move MIN_SHARE > 0 invariant into const assertion
`assert!(MIN_SHARE > 0, ...)` at runtime was flagged by
`clippy::assertions_on_constants`. Both `MIN_SHARE` values (7 on
64-bit, 20 on 32-bit) are compile-time constants, so the check
belongs in a `const _: () = assert!(...)` block — caught at
compile time, no runtime cost.
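A small sketch of the compile-time form, with the per-pointer-width values taken from this message. The `min_share` accessor and the assertion message are hypothetical additions.

```rust
#[cfg(target_pointer_width = "64")]
pub const MIN_SHARE: u32 = 7;
#[cfg(not(target_pointer_width = "64"))]
pub const MIN_SHARE: u32 = 20;

// Evaluated by the compiler: a violated invariant fails the build instead
// of a test run, and nothing is emitted into the binary.
const _: () = assert!(MIN_SHARE > 0, "MIN_SHARE must be positive");

/// Hypothetical accessor so callers can read the active value.
pub fn min_share() -> u32 {
    MIN_SHARE
}
```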
* docs(decode): clarify why shared helpers are not dead on x86_64
Expand the comment on the per-kernel sequence-decoder dispatcher to
explicitly document that the shared K-generic helpers
(`decode_and_execute_sequences_impl`, `run_pipelined_sequence_loop`,
`decode_one_sequence_inline`, `execute_one_sequence_pipelined*`)
remain reachable on x86_64 builds even though the per-kernel
monoliths bypass them — aarch64 production + on-target tests are
the live callers. Prevents future review rounds from flagging the
helpers as orphaned.
Strict-warnings build status documented inline so the next reader
doesn't need to re-run cargo to verify.
* fix(decode): allow(dead_code) on shared seq helpers orphaned on x86_64
11 helpers in `sequence_section_decoder.rs` trigger dead_code under
`-D warnings` on x86_64 because the per-kernel monoliths
(`seq_decoder_{bmi2,avx2,vbmi2}`) bypass them entirely:
- `decode_and_execute_sequences_impl`, `run_pipelined_sequence_loop`,
`decode_one_sequence_inline`, `execute_one_sequence_pipelined`,
`execute_one_sequence_pipelined_resolved`: still live on aarch64
(Neon/Sve dispatch arms) and in tests, orphaned on x86_64 production.
- `execute_one_sequence_pipelined_{bmi2,avx2,vbmi2}` and
`execute_one_sequence_pipelined_resolved_{bmi2,avx2,vbmi2}` —
vestigial pre-R12 macro-dispatch helpers with no remaining
callers; deletion deferred to a follow-up.
Apply `#[allow(dead_code)]` per-function. Cross-arch helpers keep
their definitions intact for aarch64 + tests; x86_64 build no longer
warns. Updated the dispatch comment in `mod.rs` to describe the
actual status.
1 parent d3a9f7b commit 8065eb2
10 files changed
Lines changed: 2099 additions & 818 deletions