Skip to content

Commit 5880331

Browse files
authored
perf(fse): kill iterator overhead in build_decoding_table, write decode via set_len (#293)
* fix(fse): use spare_capacity_mut + MaybeUninit instead of premature set_len Copilot flagged the prior P2 change as UB: `set_len(table_size)` before the write loop made `self.decode` logically contain `table_size` initialised entries, but they were uninitialised memory. The error path (`nb > accuracy_log`) returned with that broken state, and the outer `reset()`/Drop would have run on uninit entries — unsound per the Vec contract. Fix: write via `spare_capacity_mut()` (typed `&mut [MaybeUninit<E>]`) and call `set_len(table_size)` only AFTER the loop completes. Error path returns with `decode.len() == 0` (set by the preceding `clear()`), so no uninitialised entry is observable. The hot-loop codegen is unchanged — `MaybeUninit::write` lowers to the same strided store sequence the raw pointer version was emitting; the soundness is now machine-checkable. `E: FseEntry` has no `Drop` so the dropped `MaybeUninit` slice on the error path is a no-op. * perf(decode): drop #[cold] on do_offset_history_repcode for hot workloads (#294) Issue #279 round-1 mispredict diagnosis attributed 15.42% of decoder mispredicts to `do_offset_history_repcode`, with 27.80% landing on the `pushq %rax` function entry — call/ret BTB pressure from the never-inlined boundary. `#[inline(never)]` was dropped in earlier rounds; `#[cold]` was kept to preserve out-of-line layout for low-entropy blocks where the prior «drop both» variant regressed +15.9% on L14. The z000033 L-5 decode flamegraph surfaces this helper at 1.93% self-time despite the `#[cold]` label — the cold-bias attribute itself blocks LLVM from inlining even at hot call sites where the call/ret + BTB cost dominates the body work. The body is small (RULES lookup + 6 branchless cmov) and inlining duplicates a tight scalar sequence into the seq decoder's per-sequence loop. Drop `#[cold]` and let the inline cost-model see the full picture. Cold callers don't pay anything they weren't paying before — their total cost was already dominated by surrounding rare-path work, not this helper. * fix(fse): index slice instead of take() before unsafe set_len `spread.iter().take(table_size)` silently runs fewer iterations if `spread.len() < table_size` — which can only happen if a future refactor breaks the upstream `spread.resize(table_size, 0)` invariant, but the failure mode is severe: the loop body would leave the `[loop_count..table_size)` slots in `decode`'s spare capacity uninitialised, and the post-loop `set_len(table_size)` would then claim those uninitialised entries as initialised — UB. Switch to `spread[..table_size].iter()`. A length mismatch now panics with a clear bounds-check error BEFORE the unsafe set_len runs, surfacing the broken invariant immediately instead of turning it into silent memory unsafety.
1 parent 8065eb2 commit 5880331

2 files changed

Lines changed: 64 additions & 26 deletions

File tree

zstd/src/decoding/sequence_execution.rs

Lines changed: 17 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -84,17 +84,23 @@ pub(crate) fn do_offset_history(offset_value: u32, lit_len: u32, scratch: &mut [
8484
do_offset_history_repcode(offset_value, lit_len, scratch)
8585
}
8686

87-
// `#[cold]` keeps the body out of caller hot layout and biases icache
88-
// against pulling it unless hit. Earlier the helper was also
89-
// `#[inline(never)]`; round-1 findings on issue #279 (branch-mispredict
90-
// diagnostic) attributed 15.42% of decoder mispredicts to this function,
91-
// with 27.80% landing on the `pushq %rax` fn entry — call/ret BTB
92-
// pressure from the never-inlined boundary. Dropping `#[inline(never)]`
93-
// while keeping `#[cold]` lets LLVM inline at hot call sites where the
94-
// boundary cost outweighs body duplication; cold paths (low-entropy
95-
// blocks where the previous "drop both" variant regressed +15.9% on
96-
// L14) keep the out-of-line shape via the cold attribute.
97-
#[cold]
87+
// Previously `#[cold]+#[inline(never)]`; round-1 findings on issue
88+
// #279 attributed 15.42% of decoder mispredicts to this function with
89+
// 27.80% on the `pushq %rax` fn entry — call/ret BTB pressure from
90+
// the never-inlined boundary. `#[inline(never)]` was dropped first,
91+
// keeping `#[cold]` to preserve out-of-line layout for low-entropy
92+
// blocks (the prior «drop both» variant regressed +15.9% on L14).
93+
//
94+
// For high-repcode workloads (z000033 L-5, decode_all flamegraph
95+
// surfaces this helper at 1.93% self-time despite the `#[cold]`
96+
// label), the cold-bias attribute itself blocks LLVM from inlining
97+
// even at the hot call sites where the call/ret + BTB cost still
98+
// dominates the body work. Drop `#[cold]` and let the inline
99+
// cost-model see the full picture — body is small enough (RULES
100+
// lookup + 6 branchless cmov), so duplication into hot callers is
101+
// affordable, and cold callers don't pay anything they weren't
102+
// paying before (their cost was already dominated by the surrounding
103+
// rare-path work, not this helper).
98104
fn do_offset_history_repcode(offset_value: u32, lit_len: u32, scratch: &mut [u32; 3]) -> u32 {
99105
#[derive(Copy, Clone)]
100106
struct Rule {

zstd/src/fse/fse_decoder.rs

Lines changed: 47 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -473,13 +473,15 @@ impl<E: FseEntry> FSETableImpl<E> {
473473
// table AND initialise `symbol_next` for them. Donor:
474474
// `tableDecode[highThreshold--].baseValue = s; symbolNext[s]
475475
// = 1;`.
476+
//
477+
// Index loop (not `iter().enumerate().take()`) — LLVM emits
478+
// a tighter scalar loop without the Iterator::next state
479+
// machine. The enumerate+take iterator chain was visible as
480+
// ~1.8% combined self-time on the decode flamegraph.
481+
let probs = self.symbol_probabilities.as_slice();
476482
let mut negative_idx = table_size;
477-
for (symbol, &prob) in self
478-
.symbol_probabilities
479-
.iter()
480-
.enumerate()
481-
.take(nb_symbols)
482-
{
483+
for symbol in 0..nb_symbols {
484+
let prob = probs[symbol];
483485
if prob == -1 {
484486
negative_idx -= 1;
485487
spread[negative_idx] = symbol as u8;
@@ -493,12 +495,8 @@ impl<E: FseEntry> FSETableImpl<E> {
493495
// build loop's counter reaches `2*prob - 1` over `prob`
494496
// iterations (matching donor `symbolNext[s]++` semantics).
495497
let mut position = 0usize;
496-
for (symbol, &prob) in self
497-
.symbol_probabilities
498-
.iter()
499-
.enumerate()
500-
.take(nb_symbols)
501-
{
498+
for symbol in 0..nb_symbols {
499+
let prob = probs[symbol];
502500
if prob <= 0 {
503501
continue;
504502
}
@@ -532,9 +530,29 @@ impl<E: FseEntry> FSETableImpl<E> {
532530
// shape.
533531
let accuracy_log = self.accuracy_log;
534532
let table_size_u32 = table_size as u32;
533+
// Write entries into `spare_capacity_mut()` (typed as
534+
// `&mut [MaybeUninit<E>]`) and only `set_len` AFTER all
535+
// writes complete. This keeps the per-push bookkeeping out
536+
// of the hot loop (the body becomes a flat strided
537+
// `MaybeUninit::write` sequence) while staying sound: the
538+
// `set_len` call only ever runs when every slot in
539+
// 0..table_size is initialised. The error path returns Err
540+
// with `decode.len() == 0` (the preceding `clear()`),
541+
// exposing zero uninitialised entries.
535542
self.decode.clear();
536543
self.decode.reserve(table_size);
537-
for &symbol in spread.iter().take(table_size) {
544+
let slots = &mut self.decode.spare_capacity_mut()[..table_size];
545+
546+
// Slice index instead of `spread.iter().take(table_size)`:
547+
// if `spread.len() < table_size` (a future refactor breaking
548+
// the upstream `spread.resize(table_size, 0)` invariant), the
549+
// slice indexing panics here BEFORE the unsafe `set_len`
550+
// below would claim uninitialised entries. `take()` would
551+
// silently shorten the loop and leave `slots` half-written,
552+
// which the post-loop `set_len(table_size)` would then expose
553+
// as UB. Indexing surfaces the invariant violation as a
554+
// bounds-check panic instead.
555+
for (state_idx, &symbol) in spread[..table_size].iter().enumerate() {
538556
let next_state = symbol_next[symbol as usize];
539557
// `next_state >= 1` by construction: upstream
540558
// `read_probabilities` / `build_from_probabilities`
@@ -567,6 +585,12 @@ impl<E: FseEntry> FSETableImpl<E> {
567585
// high_bit > accuracy_log + 1 and wrap `nb` to a large
568586
// u8. Reject so the unchecked indexing contract holds.
569587
if nb > accuracy_log {
588+
// `decode.len()` is still 0 (set by `clear()` above) —
589+
// no `set_len` ran, so no uninitialised entry is
590+
// observable to the outer `build_decoding_table`'s
591+
// `reset()` path. The partially-filled `slots` buffer
592+
// is dropped here harmlessly (`MaybeUninit<E>` has no
593+
// Drop).
570594
return Err(FSETableError::TableInvariantViolation {
571595
prob: self.symbol_probabilities[symbol as usize],
572596
symbol,
@@ -585,8 +609,16 @@ impl<E: FseEntry> FSETableImpl<E> {
585609
// formula bug instead of silently producing a
586610
// malformed entry.
587611
let new_state_u32 = (next_state << nb) - table_size_u32;
588-
self.decode
589-
.push(E::from_raw(new_state_u32 as u16, symbol, nb));
612+
slots[state_idx].write(E::from_raw(new_state_u32 as u16, symbol, nb));
613+
}
614+
615+
// SAFETY: the loop above ran `table_size` iterations and
616+
// wrote `slots[state_idx]` for every `state_idx ∈
617+
// [0, table_size)`. `reserve(table_size)` guaranteed capacity.
618+
// Every Err exit happens BEFORE this `set_len`, so we only
619+
// claim initialisation when every slot has been written.
620+
unsafe {
621+
self.decode.set_len(table_size);
590622
}
591623

592624
Ok(())

0 commit comments

Comments
 (0)