
perf: stabilize SymbolSlab inlining + reduce encode allocations#213

Merged
cberner merged 2 commits into cberner:master from virtuallynathan:perf/exp-stack-encode-decode-foundation on Mar 15, 2026

Conversation

@virtuallynathan (Contributor) commented Mar 9, 2026

Why

  1. Codegen fragility: The SymbolSlab accessor methods (get, get_mut, get_pair_mut) use #[inline] (soft hint). Under lto = "fat" + codegen-units = 1, LLVM outlines them when the impl block grows. Adding a single dead-code method to SymbolSlab causes a ~24% roundtrip regression from binary layout shifts alone.
  2. Redundant copies in encode: SourceBlockEncoder stores source symbols as individual Symbol objects in a Vec<Symbol>, each owning a separate Vec<u8>. Building a SourceBlockEncoder copies every source byte twice: once into the Vec<Symbol>, then again into the intermediate SymbolSlab.

How

Commit 1: stabilize SymbolSlab accessor inlining under LTO

  • Promote get(), get_mut(), and get_pair_mut() from #[inline] to #[inline(always)].
  • These are the PI solver's hottest accessors. physical_index() was already #[inline(always)].
  • 1 file, 3 lines changed.
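As a rough sketch of the change (field names and layout here are illustrative, not the crate's actual `src/symbol_slab.rs` code), the fix is purely the attribute promotion on the hot accessors:

```rust
// Minimal sketch of a SymbolSlab with the promoted inlining attributes.
// The struct layout is a guess for illustration; only the attribute
// change (#[inline] -> #[inline(always)]) mirrors the actual commit.
pub struct SymbolSlab {
    data: Vec<u8>,
    symbol_size: usize,
}

impl SymbolSlab {
    pub fn new(num_symbols: usize, symbol_size: usize) -> Self {
        SymbolSlab {
            data: vec![0; num_symbols * symbol_size],
            symbol_size,
        }
    }

    // Was #[inline]: a soft hint LLVM may ignore once the impl block grows.
    // #[inline(always)] pins the decision so unrelated edits can't shift codegen.
    #[inline(always)]
    pub fn get(&self, i: usize) -> &[u8] {
        &self.data[i * self.symbol_size..(i + 1) * self.symbol_size]
    }

    #[inline(always)]
    pub fn get_mut(&mut self, i: usize) -> &mut [u8] {
        let s = self.symbol_size;
        &mut self.data[i * s..(i + 1) * s]
    }

    // Returns two disjoint mutable symbols; split_at_mut satisfies the
    // borrow checker without unsafe code.
    #[inline(always)]
    pub fn get_pair_mut(&mut self, a: usize, b: usize) -> (&mut [u8], &mut [u8]) {
        assert!(a < b);
        let s = self.symbol_size;
        let (left, right) = self.data.split_at_mut(b * s);
        (&mut left[a * s..(a + 1) * s], &mut right[..s])
    }
}
```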

Commit 2: reduce encode source symbol allocations

  • Add SymbolSlab::from_bytes() to construct a slab directly from a contiguous byte slice with padding.
  • Change create_symbols() to return a SymbolSlab instead of Vec<Symbol>, eliminating the per-symbol heap allocation.
  • In create_d(), bulk-copy source bytes into the intermediate slab via copy_block_from() instead of iterating per-symbol.
  • 2 files, +53/-37 lines.
  • Public API unchanged.
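The shape of `from_bytes()` can be sketched as follows (a hypothetical reconstruction, not the PR's actual code): one allocation and one bulk copy replace the per-symbol `Vec<u8>` allocations, with the final partial symbol zero-padded to `symbol_size`:

```rust
// Hypothetical sketch of SymbolSlab::from_bytes(). The real implementation
// in src/symbol_slab.rs may differ; this only illustrates the
// one-allocation, one-copy construction described above.
pub struct SymbolSlab {
    data: Vec<u8>,
    symbol_size: usize,
}

impl SymbolSlab {
    /// Build a slab directly from a contiguous source block, zero-padding
    /// the tail so every symbol is exactly `symbol_size` bytes long.
    pub fn from_bytes(bytes: &[u8], symbol_size: usize) -> Self {
        let num_symbols = bytes.len().div_ceil(symbol_size);
        let mut data = vec![0u8; num_symbols * symbol_size];
        data[..bytes.len()].copy_from_slice(bytes); // single bulk copy
        SymbolSlab { data, symbol_size }
    }

    pub fn num_symbols(&self) -> usize {
        self.data.len() / self.symbol_size
    }

    pub fn get(&self, i: usize) -> &[u8] {
        &self.data[i * self.symbol_size..(i + 1) * self.symbol_size]
    }
}
```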

Benchmarks (Zen4, EPYC 9654P, symbol_size=1280)

Back-to-back runs with cooldowns between switches. Criterion uses 100 samples per metric.

codec_benchmark (criterion, reliable)

Comparison is this branch (both commits) vs the inline-only baseline, measured back-to-back:

| Metric | inline-only baseline | this branch | Delta |
|---|---|---|---|
| encode 10KB | 15.86 us* | 13.07 us | -17.6% |
| roundtrip 10KB | 14.42 us | 11.49 us | -20.3% |
| roundtrip repair 10KB | 48.41 us | 45.61 us | -5.8% |

* Criterion's cached value used for comparison (reported as -17.6% change). Inline-only and master encode times were comparable (~13 us).

End-to-end vs master (separate back-to-back run):

| Metric | master | this branch | Estimated total delta |
|---|---|---|---|
| encode 10KB | 13.46 us | 13.07 us | -3% |
| roundtrip 10KB | 14.64 us | 11.49 us | -21% |
| roundtrip repair 10KB | 48.93 us | 45.61 us | -7% |

decode_benchmark (5% overhead, single-run harness)

| K | master | this branch | Delta |
|---|---|---|---|
| 10 | 3,122 | 3,084 | -1.2% |
| 100 | 3,906 | 3,906 | 0.0% |
| 250 | 4,192 | 4,353 | +3.8% |
| 500 | 4,883 | 5,207 | +6.6% |
| 1,000 | 4,859 | 5,374 | +10.6% |
| 2,000 | 4,680 | 4,859 | +3.8% |
| 5,000 | 4,156 | 4,173 | +0.4% |
| 10,000 | 3,081 | 3,475 | +12.8% |
| 20,000 | 2,128 | 1,885 | -11.4% |
| 50,000 | 1,270 | 1,405 | +10.6% |

Commit 2 only touches encoder code. Decode variation is from the #[inline(always)] in commit 1 and single-run noise.

decode_benchmark (0% overhead, single-run harness)

| K | master | this branch | Delta |
|---|---|---|---|
| 10 | 3,303 | 2,959 | -10.4% |
| 100 | 4,110 | 4,029 | -2.0% |
| 250 | 3,693 | 4,353 | +17.9% |
| 500 | 4,066 | 4,217 | +3.7% |
| 1,000 | 3,876 | 4,197 | +8.3% |
| 2,000 | 3,983 | 4,250 | +6.7% |
| 5,000 | 3,299 | 3,742 | +13.4% |
| 10,000 | 2,933 | 3,277 | +11.7% |
| 20,000 | 2,128 | 1,961 | -7.8% |
| 50,000 | 1,372 | 1,449 | +5.6% |

Single-run harness. The 0% overhead path does not exercise the PI solver, so variation is noise.

Tests

  • cargo clippy --all --all-targets -- -Dwarnings — clean
  • cargo test --all — 60 passed, 4 ignored
  • cargo build --features benchmarking,serde_support — clean
  • cargo test --features benchmarking — clean

Commit history

  1. 985e073 perf: stabilize SymbolSlab accessor inlining under LTO (src/symbol_slab.rs)
  2. 54a5920 perf: reduce encode source symbol allocations (src/encoder.rs, src/symbol_slab.rs)

Notes

  • Commit 1 is prerequisite for commit 2. Without #[inline(always)] on the accessors, adding from_bytes() to SymbolSlab triggers the codegen fragility described above, causing a ~24% roundtrip regression from layout shifts.
  • The decode-side copy reduction from the original version of this PR was dropped — it showed marginal improvement indistinguishable from noise and added complexity.

@virtuallynathan changed the title from "perf: reduce encoder allocations and trim decoder output copies" to "perf: cut encoder allocation overhead and decoder output copies" on Mar 9, 2026
@virtuallynathan mentioned this pull request on Mar 9, 2026
@virtuallynathan marked this pull request as draft on March 9, 2026 04:41
@virtuallynathan force-pushed the perf/exp-stack-encode-decode-foundation branch from ecb3133 to cecd2d0 on March 9, 2026 05:30
@virtuallynathan changed the title from "perf: cut encoder allocation overhead and decoder output copies" to "perf: reduce encode allocations and stabilize SymbolSlab inlining" on Mar 9, 2026
Promote get(), get_mut(), and get_pair_mut() from #[inline] to
#[inline(always)]. These are the PI solver's hottest accessors (called
millions of times during decode). Under lto=fat + codegen-units=1,
the soft #[inline] hint leaves LLVM free to outline them when
nearby code changes, causing up to 24% performance swings in
unrelated hot paths.
@virtuallynathan (Contributor, Author) commented:

@codex review

@virtuallynathan force-pushed the perf/exp-stack-encode-decode-foundation branch from cecd2d0 to 985e073 on March 9, 2026 05:40
@virtuallynathan changed the title from "perf: reduce encode allocations and stabilize SymbolSlab inlining" to "perf: stabilize SymbolSlab accessor inlining under LTO" on Mar 9, 2026
@chatgpt-codex-connector commented:
Codex Review: Didn't find any major issues. Hooray!


@virtuallynathan marked this pull request as ready for review on March 9, 2026 05:44
@virtuallynathan changed the title from "perf: stabilize SymbolSlab accessor inlining under LTO" to "perf: stabilize SymbolSlab inlining + reduce encode allocations" on Mar 9, 2026
@cberner (Owner) left a comment:
lgtm, but I had a couple minor comments

Comment thread src/encoder.rs Outdated
```rust
let S = num_ldpc_symbols(source_block.len() as u32);
let H = num_hdpc_symbols(source_block.len() as u32);

debug_assert_eq!(source_block.symbol_size(), symbol_size);
```
@cberner (Owner):
Should this be an assert!() rather than debug_assert!()?

Comment thread src/symbol_slab.rs Outdated
Comment on lines +107 to +110
```rust
debug_assert!(
    self.mapping.is_none(),
    "as_bytes called with active mapping"
);
```
@cberner (Owner):
It seems like this should be assert!(). It'd be a pretty serious bug if the mapping is Some, ya?

@virtuallynathan force-pushed the perf/exp-stack-encode-decode-foundation branch from 54a5920 to d11dd86 on March 15, 2026 04:08
@virtuallynathan (Contributor, Author) commented:
Moved asserts to runtime, not just debug builds.
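The promotion the review asked for is mechanical; as a minimal illustration (a standalone sketch, not the crate's actual `as_bytes` code), swapping `debug_assert!` for `assert!` means the invariant is also enforced in release builds:

```rust
// Illustrative only: debug_assert! compiles away in release builds,
// while assert! always runs, so a Some(mapping) reaching this check
// fails loudly in production too rather than silently proceeding.
fn check_no_mapping(mapping: &Option<Vec<u8>>) {
    // Was: debug_assert!(mapping.is_none(), ...);
    assert!(mapping.is_none(), "as_bytes called with active mapping");
}
```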

@cberner merged commit df75b75 into cberner:master on Mar 15, 2026
2 checks passed