perf(huff0): hoist bit-stream state into the encode loop #423

Merged
polaz merged 1 commit into main from perf/huff0-encode-state-hoist on Jun 15, 2026

polaz (Member) commented Jun 15, 2026

Summary

Hoist the Huffman encode loop's bit-stream state into locals so it stays register-resident, matching upstream zstd's HUF_CStream_t shape.

The unrolled encode loop drove add_bits / flush_bits through &mut self, reading and writing bit_container[idx] / bit_pos / cursor in the struct on every symbol. The optimizer could not prove the raw output-buffer writes in flush_bits don't alias those struct fields, so it reloaded the containers from memory per symbol. Upstream keeps them in HUF_CStream_t locals.
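
A simplified sketch of the "before" shape described above: all bit-stream state lives in struct fields and every symbol goes through `&mut self`. `BitWriter`, its methods and `encode_via_methods` are hypothetical names for an LSB-first, single-container model, not the crate's `HufCStream`. The crate's `flush_bits` writes through a raw pointer, which is what the optimizer cannot prove disjoint from the fields; here a `Vec<u8>` stands in for the buffer, so the sketch illustrates the shape, not the reload behavior itself.

```rust
/// Hypothetical "before" shape: state in struct fields, every symbol goes
/// through `&mut self`. LSB-first, single container; not the crate's code.
struct BitWriter {
    container: u64, // pending bits; oldest bits in the low positions
    bit_pos: u32,   // number of valid bits in `container`
    out: Vec<u8>,   // output buffer, with 8 bytes of slack for word writes
    cursor: usize,  // index of the next (possibly partial) output byte
}

impl BitWriter {
    fn new(max_bytes: usize) -> Self {
        BitWriter { container: 0, bit_pos: 0, out: vec![0; max_bytes + 8], cursor: 0 }
    }

    /// Append the low `nb_bits` bits of `value` (read-modify-write of fields).
    #[inline]
    fn add_bits(&mut self, value: u64, nb_bits: u32) {
        debug_assert!(nb_bits <= 12 && self.bit_pos + nb_bits <= 64);
        self.container |= (value & ((1u64 << nb_bits) - 1)) << self.bit_pos;
        self.bit_pos += nb_bits;
    }

    /// Store the whole container as one little-endian word, then advance the
    /// cursor past the complete bytes only; the partial byte is rewritten later.
    #[inline]
    fn flush_bits(&mut self) {
        let nb_bytes = (self.bit_pos >> 3) as usize;
        self.out[self.cursor..self.cursor + 8].copy_from_slice(&self.container.to_le_bytes());
        self.cursor += nb_bytes;
        self.container = if nb_bytes == 8 { 0 } else { self.container >> (8 * nb_bytes) };
        self.bit_pos &= 7;
    }

    /// Flush what is left and return exactly the bytes that carry data.
    fn finish(mut self) -> Vec<u8> {
        self.flush_bits();
        let len = self.cursor + usize::from(self.bit_pos > 0);
        self.out.truncate(len);
        self.out
    }
}

/// Encode (value, nb_bits) pairs, flushing after every 4 symbols
/// (at most 7 + 4 * 12 = 55 pending bits, so the u64 never overflows).
fn encode_via_methods(symbols: &[(u64, u32)]) -> Vec<u8> {
    let mut w = BitWriter::new(symbols.len() * 2);
    for chunk in symbols.chunks(4) {
        for &(value, nb_bits) in chunk {
            w.add_bits(value, nb_bits);
        }
        w.flush_bits();
    }
    w.finish()
}
```

Each `add_bits`/`flush_bits` call reads and writes `self.container`, `self.bit_pos` and `self.cursor`; whether those stay in registers across calls depends on the optimizer proving the buffer store cannot touch them.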

This moves the unrolled loop into a HufCStream::encode_unrolled method that hoists the two bit containers, their bit positions, and the write cursor into locals kept register-resident for the loop, writing them back once at the end. The per-symbol arithmetic mirrors the prior add_bits / flush_bits / zero_index1 / merge_index1 exactly, so the emitted bitstream is byte-identical; only the codegen changes. The now-inlined zero_index1 / merge_index1 are deleted.
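
The hoisted shape described above can be sketched in the same simplified model: copy the container, bit position and cursor into locals, run the loop, and write them back once. `BitState`, `encode_hoisted`, `encode` and `encode_reference` are hypothetical illustrative names; the model uses one container rather than the crate's two, and `encode_reference` is a naive bit-by-bit oracle used to check that the hoisted loop emits identical bytes.

```rust
/// Hypothetical "after" shape: the loop copies state into locals, runs, and
/// writes it back once. LSB-first, single container; not the crate's code.
struct BitState {
    container: u64,
    bit_pos: u32,
    cursor: usize,
}

/// Caller guarantees `out` has 8 bytes of slack past the last data byte.
fn encode_hoisted(state: &mut BitState, out: &mut [u8], symbols: &[(u64, u32)]) {
    // Hoist: locals whose address is never taken cannot be reached through
    // a store into `out`, so they can stay in registers for the whole loop.
    let mut container = state.container;
    let mut bit_pos = state.bit_pos;
    let mut cursor = state.cursor;
    for chunk in symbols.chunks(4) {
        for &(value, nb_bits) in chunk {
            debug_assert!(nb_bits <= 12);
            container |= (value & ((1u64 << nb_bits) - 1)) << bit_pos;
            bit_pos += nb_bits;
        }
        let nb_bytes = (bit_pos >> 3) as usize;
        out[cursor..cursor + 8].copy_from_slice(&container.to_le_bytes());
        cursor += nb_bytes;
        container = if nb_bytes == 8 { 0 } else { container >> (8 * nb_bytes) };
        bit_pos &= 7;
    }
    // Single write-back, so a later call (or a close step) sees the final state.
    state.container = container;
    state.bit_pos = bit_pos;
    state.cursor = cursor;
}

fn encode(symbols: &[(u64, u32)]) -> Vec<u8> {
    let mut out = vec![0u8; symbols.len() * 2 + 8];
    let mut state = BitState { container: 0, bit_pos: 0, cursor: 0 };
    encode_hoisted(&mut state, &mut out, symbols);
    out.truncate(state.cursor + usize::from(state.bit_pos > 0));
    out
}

/// Naive bit-by-bit oracle: the expected LSB-first packing.
fn encode_reference(symbols: &[(u64, u32)]) -> Vec<u8> {
    let mut bits = Vec::new();
    for &(value, nb_bits) in symbols {
        for i in 0..nb_bits {
            bits.push((value >> i) & 1 == 1);
        }
    }
    let mut out = vec![0u8; (bits.len() + 7) / 8];
    for (i, &bit) in bits.iter().enumerate() {
        if bit {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}
```

Because the state is written back at the end, calling `encode_hoisted` on two consecutive slices produces the same bytes as one call over the whole input; that is the property the single write-back has to preserve.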

Results (i9-9900K, perf stat -e cycles, paired)

  • decodecorpus-z000033 level_-7_fast + dict: 72.4G → 70.8G cycles = -2.2%
  • decodecorpus-z000033 level_3 dfast, no dict: 220.8G → 217.8G cycles = -1.4%
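
The percentages in the results list are relative changes of the paired cycle counts. A small helper to reproduce them (the function names are illustrative only):

```rust
/// Relative change of a paired measurement, in percent.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Round to one decimal place, matching the precision reported in the results.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

With 72.4G → 70.8G this gives about -2.21%, and with 220.8G → 217.8G about -1.36%, which round to the reported -2.2% and -1.4%; the absolute savings are 1.6G and 3.0G cycles.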

This is a general win for compressing compressible data wherever the Huffman literal encode is hot (it accounts for ~19% of the fast dict-compress profile); it is not dict-specific.
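
A back-of-envelope reading of the ~19% figure, Amdahl-style. It assumes (not stated above) that the 19% share was measured on the same level_-7_fast + dict run and that the entire -2.2% comes from the Huffman literal encode; under those assumptions the encode itself got roughly `total_saving / share` cheaper. The helper name is hypothetical.

```rust
/// If a phase takes `share` of total cycles and the whole run got
/// `total_saving` cheaper (both as fractions), and all of the saving came
/// from that phase, the phase itself got `total_saving / share` cheaper.
fn implied_phase_saving(total_saving: f64, share: f64) -> f64 {
    assert!(share > 0.0 && total_saving <= share, "saving cannot exceed the phase's share");
    total_saving / share
}
```

For example, `implied_phase_saving(0.022, 0.19)` is about 0.116, i.e. the Huffman encode phase would be roughly 11 to 12% cheaper on that run if the assumptions hold.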

Correctness

  • Byte-identical output: 839 tests (--features dict_builder) + cross_validation green on x86_64 (avx2) and aarch64; i9 last-out-sum identical pre/post on decodecorpus.
  • The deleted merge_index1 unit test is replaced by encode_unrolled_dual_container_size_is_deterministic, which exercises the inlined dual-container merge path through the new method.

Base

Stacked on perf/dfast-speed-microopt (#422); the diff is the single Huffman commit.

coderabbitai Bot commented Jun 15, 2026

Warning

Review limit reached

@polaz, we couldn't start this review because you've reached your PR review rate limit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 5930e191-18c6-4f92-b869-74d092de04c2

📥 Commits

Reviewing files that changed from the base of the PR and between 1ad404a and ec68492.

📒 Files selected for processing (2)
  • zstd/src/huff0/huf_cstream.rs
  • zstd/src/huff0/huff0_encoder.rs

greptile-apps Bot commented Jun 15, 2026

Greptile Summary

This PR moves the Huffman encode loop's mutable bit-stream state into locals inside a new HufCStream::encode_unrolled method, matching the upstream HUF_CStream_t register-resident shape and eliminating a compiler reload per symbol caused by the optimizer's aliasing conservatism between struct-field writes and the raw output-buffer writes in flush_bits.

  • The add0!/add1!/flush0! macros are faithful inline copies of add_bits/flush_bits; the inlined merge exactly replicates merge_index1 (including using the full bp1 value, not the masked nb_bits_1, in wrapping_add). Byte-identical output is maintained.
  • zero_index1 and merge_index1 are deleted; their callers in encode_one_stream_unrolled are replaced by a single bit_c.encode_unrolled(...) delegation.
  • The replacement test encode_unrolled_dual_container_size_is_deterministic exercises the dual-container merge path (phase 3 runs twice with K_UNROLL=4 over 16 symbols) with a deterministic size assertion.
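
The full-versus-masked bp1 detail follows from how upstream zstd packs its code elements, assumed here to mirror the fast path in huf_compress.c: a HUF_CElt keeps the code value in its top nbBits bits and nbBits in its low byte, and the fast add ORs the whole element into the container and adds the whole element to bitPos. Only bitPos's low 8 bits are meaningful; the high bits carry noise, hence `wrapping_add`. The sketch below models that scheme (`make_celt`, `DualState` and its methods are hypothetical names, not the crate's API) so you can check that filling container 1 separately and merging reproduces the sequential single-container state bit for bit, which is what keeps the merged output byte-identical.

```rust
/// Hypothetical packed code element: value in the top `nb_bits` bits,
/// `nb_bits` in the low byte (modeled on upstream's HUF_CElt).
fn make_celt(value: u64, nb_bits: u32) -> u64 {
    assert!((1..=12).contains(&nb_bits));
    let value = value & ((1u64 << nb_bits) - 1);
    (value << (64 - nb_bits)) | nb_bits as u64
}

/// Two containers, like HUF_CStream_t's bitContainer[2] / bitPos[2].
#[derive(Debug, Clone, Copy, PartialEq)]
struct DualState {
    container: [u64; 2],
    bit_pos: [u64; 2], // only the low 8 bits are meaningful
}

impl DualState {
    /// Fast add: shift by the masked length, OR in the whole element (its
    /// low byte lands in low garbage bits that never reach the valid top
    /// region), and add the whole element to bit_pos (noise in high bits).
    fn add(&mut self, idx: usize, celt: u64) {
        self.container[idx] = (self.container[idx] >> (celt & 0xFF)) | celt;
        self.bit_pos[idx] = self.bit_pos[idx].wrapping_add(celt);
    }

    /// Merge container 1 into container 0: shift by the *masked* length of
    /// container 1, OR it in, then add the *full* bit_pos[1], noise and all.
    fn merge_index1(&mut self) {
        self.container[0] = (self.container[0] >> (self.bit_pos[1] & 0xFF)) | self.container[1];
        self.bit_pos[0] = self.bit_pos[0].wrapping_add(self.bit_pos[1]);
    }

    /// Valid bits of container 0, oldest symbol in the lowest positions.
    fn valid_bits(&self) -> u64 {
        let n = self.bit_pos[0] & 0xFF;
        if n == 0 { 0 } else { self.container[0] >> (64 - n) }
    }
}
```

The equality holds because right shifts distribute over OR and compose additively, and because the low-byte length sums never carry into the noisy high bits.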

Confidence Score: 5/5

Safe to merge — the refactor is a pure codegen change with byte-identical bitstream output verified against 839 tests and cross-architecture cross-validation.

The arithmetic in encode_unrolled is a direct mechanical translation of the four methods it inlines (add_bits, flush_bits, zero_index1, merge_index1): the macros mirror add_bits and flush_bits line-for-line, and the inline merge block matches merge_index1 exactly. All six mutable fields are written back after the loop, so close() sees the correct final state. The raw-pointer safety argument is sound.

No files require special attention.

Important Files Changed

  • zstd/src/huff0/huf_cstream.rs: Adds encode_unrolled with hoisted bit-state locals; deletes zero_index1/merge_index1; replaces their unit test. Arithmetic is byte-identical to the deleted methods: the add0!/add1!/flush0! macros are faithful mirrors of add_bits/flush_bits, and the inline merge matches merge_index1 exactly, including the full-value (not masked) bp1 in wrapping_add. All six state fields are written back correctly at the end. Raw-pointer safety argument is sound.
  • zstd/src/huff0/huff0_encoder.rs: encode_one_stream_unrolled is reduced to a single-line delegation to HufCStream::encode_unrolled; all loop logic moved to huf_cstream.rs. Two doc comments still reference the deleted zero_index1/merge_index1 methods (flagged in prior review, not re-posted here).

Reviews (6): Last reviewed commit: "perf(huff0): hoist bit-stream state into..."

Base automatically changed from perf/dfast-speed-microopt to main June 15, 2026 15:38
codecov Bot commented Jun 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


The unrolled Huffman encode loop drove add_bits/flush_bits through
&mut self, reading and writing bit_container[idx] / bit_pos / cursor in
the struct on every symbol. The optimizer could not prove the raw
output-buffer writes in flush_bits don't alias those fields, so it
reloaded the containers from memory per symbol (upstream zstd keeps them
in HUF_CStream_t locals). Move the unrolled loop into a HufCStream method
that hoists the two containers, bit positions, and cursor into locals
kept register-resident for the loop, writing back once at the end. The
per-symbol arithmetic mirrors the prior methods exactly, so the emitted
bitstream is byte-identical (839 tests + cross_validation green); only
the codegen changes.
polaz force-pushed the perf/huff0-encode-state-hoist branch from dc3dd53 to ec68492 on June 15, 2026 16:30
polaz (Member, Author) commented Jun 15, 2026

@coderabbitai review

coderabbitai Bot commented Jun 15, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

polaz merged commit de57b5b into main on Jun 15, 2026
28 checks passed
polaz deleted the perf/huff0-encode-state-hoist branch on June 15, 2026 16:38
sw-release-bot Bot mentioned this pull request Jun 15, 2026