perf(fse): kill iterator overhead in build_decoding_table, write decode via set_len (#293)

polaz · web-flow · commit 5880331ea26d · 2026-05-28T20:57:40.000+03:00
* fix(fse): use spare_capacity_mut + MaybeUninit instead of premature set_len

  Copilot flagged the prior P2 change as UB: `set_len(table_size)` before the write loop made `self.decode` claim `table_size` initialised entries while the memory behind them was still uninitialised. The error path (`nb > accuracy_log`) returned with that broken state, and the outer `reset()`/Drop would then have run on uninitialised entries, which is unsound per the Vec contract.

  Fix: write via `spare_capacity_mut()` (typed `&mut [MaybeUninit<E>]`) and call `set_len(table_size)` only AFTER the loop completes. The error path returns with `decode.len() == 0` (set by the preceding `clear()`), so no uninitialised entry is observable.

  The hot-loop codegen is unchanged: `MaybeUninit::write` lowers to the same strided store sequence the raw pointer version was emitting, and the uninitialised state is now carried in the slot type instead of hidden behind a raw pointer. `MaybeUninit<E>` never runs a destructor (and `E: FseEntry` has no `Drop`), so abandoning the partially written slots on the error path is a no-op.

* perf(decode): drop #[cold] on do_offset_history_repcode for hot workloads (#294)

  Issue #279 round-1 mispredict diagnosis attributed 15.42% of decoder mispredicts to `do_offset_history_repcode`, with 27.80% landing on the `pushq %rax` function entry: call/ret BTB pressure from the never-inlined boundary. `#[inline(never)]` was dropped in earlier rounds; `#[cold]` was kept to preserve out-of-line layout for low-entropy blocks, where the prior «drop both» variant regressed +15.9% on L14.

  The z000033 L-5 decode flamegraph shows this helper at 1.93% self-time despite the `#[cold]` label. The cold-bias attribute itself blocks LLVM from inlining even at hot call sites where the call/ret + BTB cost dominates the body work. The body is small (RULES lookup + 6 branchless cmov), so inlining only duplicates a tight scalar sequence into the seq decoder's per-sequence loop.

  Drop `#[cold]` and let the inline cost-model see the full picture. Cold callers don't pay anything they weren't paying before: their total cost was already dominated by surrounding rare-path work, not this helper.

* fix(fse): index slice instead of take() before unsafe set_len

  `spread.iter().take(table_size)` silently runs fewer iterations if `spread.len() < table_size`. That can only happen if a future refactor breaks the upstream `spread.resize(table_size, 0)` invariant, but the failure mode is severe: the loop body would leave the `[loop_count..table_size)` slots in `decode`'s spare capacity uninitialised, and the post-loop `set_len(table_size)` would then claim those uninitialised entries as initialised, which is UB.

  Switch to `spread[..table_size].iter()`. A length mismatch now panics with a clear bounds-check error BEFORE the unsafe `set_len` runs, surfacing the broken invariant immediately instead of turning it into silent memory unsafety.
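The write-then-`set_len` pattern described in the first fix can be sketched in isolation. This is a minimal illustration, not the code from `fse_decoder.rs`: `Entry`, `fill_table`, and the `max_symbol` check are hypothetical stand-ins for `E: FseEntry`, the build loop, and the `nb > accuracy_log` rejection.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for the decoder's table entry type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Entry {
    symbol: u8,
    nb: u8,
}

/// Rebuilds `table` from `spread[..table_size]`.
/// On error the table is left empty, never partially initialised.
fn fill_table(
    table: &mut Vec<Entry>,
    spread: &[u8],
    table_size: usize,
    max_symbol: u8, // assumed validity bound, standing in for the real check
) -> Result<(), u8> {
    table.clear();
    table.reserve(table_size);
    // After `reserve`, the spare capacity holds at least `table_size` slots.
    let slots: &mut [MaybeUninit<Entry>] =
        &mut table.spare_capacity_mut()[..table_size];

    // Slice indexing panics here if `spread` is shorter than `table_size`,
    // so a short input can never reach the `set_len` below.
    for (i, &symbol) in spread[..table_size].iter().enumerate() {
        if symbol > max_symbol {
            // `table.len()` is still 0: no uninitialised entry is exposed.
            return Err(symbol);
        }
        slots[i].write(Entry { symbol, nb: 0 });
    }

    // SAFETY: the loop wrote every slot in 0..table_size, capacity was
    // reserved above, and every early return happens before this point.
    unsafe { table.set_len(table_size) };
    Ok(())
}
```

Calling `fill_table(&mut table, &[1, 2, 3, 4], 4, 10)` leaves four entries in `table`; passing a symbol above `max_symbol` returns `Err` and leaves `table.len() == 0`.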
diff --git a/zstd/src/decoding/sequence_execution.rs b/zstd/src/decoding/sequence_execution.rs
@@ -84,17 +84,23 @@ pub(crate) fn do_offset_history(offset_value: u32, lit_len: u32, scratch: &mut [
     do_offset_history_repcode(offset_value, lit_len, scratch)
 }
 
-// `#[cold]` keeps the body out of caller hot layout and biases icache
-// against pulling it unless hit. Earlier the helper was also
-// `#[inline(never)]`; round-1 findings on issue #279 (branch-mispredict
-// diagnostic) attributed 15.42% of decoder mispredicts to this function,
-// with 27.80% landing on the `pushq %rax` fn entry — call/ret BTB
-// pressure from the never-inlined boundary. Dropping `#[inline(never)]`
-// while keeping `#[cold]` lets LLVM inline at hot call sites where the
-// boundary cost outweighs body duplication; cold paths (low-entropy
-// blocks where the previous "drop both" variant regressed +15.9% on
-// L14) keep the out-of-line shape via the cold attribute.
-#[cold]
+// Previously `#[cold]+#[inline(never)]`; round-1 findings on issue
+// #279 attributed 15.42% of decoder mispredicts to this function with
+// 27.80% on the `pushq %rax` fn entry — call/ret BTB pressure from
+// the never-inlined boundary. `#[inline(never)]` was dropped first,
+// keeping `#[cold]` to preserve out-of-line layout for low-entropy
+// blocks (the prior «drop both» variant regressed +15.9% on L14).
+//
+// On high-repcode workloads (z000033 L-5: the decode_all flamegraph
+// shows this helper at 1.93% self-time despite the `#[cold]`
+// label), the cold-bias attribute itself blocks LLVM from inlining
+// even at the hot call sites where the call/ret + BTB cost still
+// dominates the body work. Drop `#[cold]` and let the inline
+// cost-model see the full picture — body is small enough (RULES
+// lookup + 6 branchless cmov), so duplication into hot callers is
+// affordable, and cold callers don't pay anything they weren't
+// paying before (their cost was already dominated by the surrounding
+// rare-path work, not this helper).
 fn do_offset_history_repcode(offset_value: u32, lit_len: u32, scratch: &mut [u32; 3]) -> u32 {
     #[derive(Copy, Clone)]
     struct Rule {
diff --git a/zstd/src/fse/fse_decoder.rs b/zstd/src/fse/fse_decoder.rs
@@ -473,13 +473,15 @@ impl<E: FseEntry> FSETableImpl<E> {
         // table AND initialise `symbol_next` for them. Donor:
         // `tableDecode[highThreshold--].baseValue = s; symbolNext[s]
         // = 1;`.
+        //
+        // Index loop (not `iter().enumerate().take()`) — LLVM emits
+        // a tighter scalar loop without the Iterator::next state
+        // machine. The enumerate+take iterator chain was visible as
+        // ~1.8% combined self-time on the decode flamegraph.
+        let probs = self.symbol_probabilities.as_slice();
         let mut negative_idx = table_size;
-        for (symbol, &prob) in self
-            .symbol_probabilities
-            .iter()
-            .enumerate()
-            .take(nb_symbols)
-        {
+        for symbol in 0..nb_symbols {
+            let prob = probs[symbol];
             if prob == -1 {
                 negative_idx -= 1;
                 spread[negative_idx] = symbol as u8;
@@ -493,12 +495,8 @@ impl<E: FseEntry> FSETableImpl<E> {
         // build loop's counter reaches `2*prob - 1` over `prob`
         // iterations (matching donor `symbolNext[s]++` semantics).
         let mut position = 0usize;
-        for (symbol, &prob) in self
-            .symbol_probabilities
-            .iter()
-            .enumerate()
-            .take(nb_symbols)
-        {
+        for symbol in 0..nb_symbols {
+            let prob = probs[symbol];
             if prob <= 0 {
                 continue;
             }
@@ -532,9 +530,29 @@ impl<E: FseEntry> FSETableImpl<E> {
         // shape.
         let accuracy_log = self.accuracy_log;
         let table_size_u32 = table_size as u32;
+        // Write entries into `spare_capacity_mut()` (typed as
+        // `&mut [MaybeUninit<E>]`) and only `set_len` AFTER all
+        // writes complete. This keeps the per-push bookkeeping out
+        // of the hot loop (the body becomes a flat strided
+        // `MaybeUninit::write` sequence) while staying sound: the
+        // `set_len` call only ever runs when every slot in
+        // 0..table_size is initialised. The error path returns Err
+        // with `decode.len() == 0` (the preceding `clear()`),
+        // exposing zero uninitialised entries.
         self.decode.clear();
         self.decode.reserve(table_size);
-        for &symbol in spread.iter().take(table_size) {
+        let slots = &mut self.decode.spare_capacity_mut()[..table_size];
+
+        // Slice index instead of `spread.iter().take(table_size)`:
+        // if `spread.len() < table_size` (a future refactor breaking
+        // the upstream `spread.resize(table_size, 0)` invariant), the
+        // slice indexing panics here BEFORE the unsafe `set_len`
+        // below would claim uninitialised entries. `take()` would
+        // silently shorten the loop and leave `slots` half-written,
+        // which the post-loop `set_len(table_size)` would then expose
+        // as UB. Indexing surfaces the invariant violation as a
+        // bounds-check panic instead.
+        for (state_idx, &symbol) in spread[..table_size].iter().enumerate() {
             let next_state = symbol_next[symbol as usize];
             // `next_state >= 1` by construction: upstream
             // `read_probabilities` / `build_from_probabilities`
@@ -567,6 +585,12 @@ impl<E: FseEntry> FSETableImpl<E> {
             // high_bit > accuracy_log + 1 and wrap `nb` to a large
             // u8. Reject so the unchecked indexing contract holds.
             if nb > accuracy_log {
+                // `decode.len()` is still 0 (set by `clear()` above) —
+                // no `set_len` ran, so no uninitialised entry is
+                // observable to the outer `build_decoding_table`'s
+                // `reset()` path. The partially-filled `slots` buffer
+                // is dropped here harmlessly (`MaybeUninit<E>` has no
+                // Drop).
                 return Err(FSETableError::TableInvariantViolation {
                     prob: self.symbol_probabilities[symbol as usize],
                     symbol,
@@ -585,8 +609,16 @@ impl<E: FseEntry> FSETableImpl<E> {
             // formula bug instead of silently producing a
             // malformed entry.
             let new_state_u32 = (next_state << nb) - table_size_u32;
-            self.decode
-                .push(E::from_raw(new_state_u32 as u16, symbol, nb));
+            slots[state_idx].write(E::from_raw(new_state_u32 as u16, symbol, nb));
+        }
+
+        // SAFETY: the loop above ran `table_size` iterations and
+        // wrote `slots[state_idx]` for every `state_idx ∈
+        // [0, table_size)`. `reserve(table_size)` guaranteed capacity.
+        // Every Err exit happens BEFORE this `set_len`, so we only
+        // claim initialisation when every slot has been written.
+        unsafe {
+            self.decode.set_len(table_size);
         }
 
         Ok(())
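
The difference between `take()` and slice indexing that motivates the last hunk can be shown with two small helpers. Both function names are hypothetical and exist only for this illustration; they count how many elements each loop shape would visit.

```rust
/// Number of elements a `take(n)` loop visits: silently fewer than `n`
/// when `data` is shorter than `n`.
fn visited_with_take(data: &[u8], n: usize) -> usize {
    data.iter().take(n).count()
}

/// Number of elements a slice-indexed loop visits: always exactly `n`,
/// because `data[..n]` panics when `data.len() < n`.
fn visited_with_slice(data: &[u8], n: usize) -> usize {
    data[..n].iter().count()
}
```

With a three-element input and `n = 5`, the first helper returns 3 without complaint, while the second panics on the out-of-range slice before any loop body runs. In the patched code that panic fires before the unsafe `set_len`.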