perf(huff0): hoist bit-stream state into the encode loop by polaz · Pull Request #423 · structured-world/structured-zstd

polaz · 2026-06-15T13:34:47Z

Summary

Hoist the Huffman encode loop's bit-stream state into locals so it stays register-resident, matching upstream zstd's HUF_CStream_t shape.

The unrolled encode loop drove add_bits / flush_bits through &mut self, reading and writing bit_container[idx] / bit_pos / cursor in the struct on every symbol. The optimizer could not prove the raw output-buffer writes in flush_bits don't alias those struct fields, so it reloaded the containers from memory per symbol. Upstream keeps them in HUF_CStream_t locals.

This moves the unrolled loop into a HufCStream::encode_unrolled method that hoists the two bit containers, their bit positions, and the write cursor into locals kept register-resident for the loop, writing them back once at the end. The per-symbol arithmetic mirrors the prior add_bits / flush_bits / zero_index1 / merge_index1 exactly, so the emitted bitstream is byte-identical; only the codegen changes. The now-inlined zero_index1 / merge_index1 are deleted.
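The hoisting pattern described above can be sketched on a toy bit writer. This is a minimal illustration, not the crate's `HufCStream`: `BitWriter`, its single LSB-first container, `encode_via_methods`, and `encode_hoisted` are all hypothetical names, and the real code uses two containers and raw-pointer writes. Because it is safe Rust over a `Vec<u8>`, the sketch only shows the shape of the change and the single write-back; it is not expected to reproduce the aliasing-driven reloads.

```rust
/// Hypothetical, simplified bit writer (assumption: one LSB-first container).
struct BitWriter {
    container: u64, // pending bits, oldest bits lowest
    bit_pos: u32,   // number of valid bits in `container`
    cursor: usize,  // next write offset into `out`
    out: Vec<u8>,
}

impl BitWriter {
    fn new(capacity: usize) -> Self {
        Self { container: 0, bit_pos: 0, cursor: 0, out: vec![0; capacity] }
    }

    /// Per-symbol path: reads and writes struct fields on every call.
    fn add_bits(&mut self, value: u64, nb_bits: u32) {
        self.container |= value << self.bit_pos;
        self.bit_pos += nb_bits;
    }

    /// Writes out whole bytes and keeps the 0..=7 leftover bits.
    fn flush_bits(&mut self) {
        let n_bytes = (self.bit_pos / 8) as usize;
        let bytes = self.container.to_le_bytes();
        self.out[self.cursor..self.cursor + n_bytes].copy_from_slice(&bytes[..n_bytes]);
        self.cursor += n_bytes;
        self.container >>= 8 * n_bytes as u32;
        self.bit_pos &= 7;
    }

    /// "Before" shape: the loop drives the state through `&mut self`.
    /// Assumes each group of 4 codes plus leftovers stays below 64 bits.
    fn encode_via_methods(&mut self, codes: &[(u64, u32)]) {
        for chunk in codes.chunks(4) {
            for &(value, nb_bits) in chunk {
                self.add_bits(value, nb_bits);
            }
            self.flush_bits();
        }
    }

    /// "After" shape: state is copied into locals, the same arithmetic runs
    /// on the locals, and the fields are written back once at the end.
    fn encode_hoisted(&mut self, codes: &[(u64, u32)]) {
        let mut container = self.container;
        let mut bit_pos = self.bit_pos;
        let mut cursor = self.cursor;
        for chunk in codes.chunks(4) {
            for &(value, nb_bits) in chunk {
                container |= value << bit_pos;
                bit_pos += nb_bits;
            }
            let n_bytes = (bit_pos / 8) as usize;
            let bytes = container.to_le_bytes();
            self.out[cursor..cursor + n_bytes].copy_from_slice(&bytes[..n_bytes]);
            cursor += n_bytes;
            container >>= 8 * n_bytes as u32;
            bit_pos &= 7;
        }
        // Single write-back so later calls (e.g. a final flush) see the state.
        self.container = container;
        self.bit_pos = bit_pos;
        self.cursor = cursor;
    }
}
```

Feeding the same `(value, nb_bits)` pairs to both loops on fresh writers should leave `out`, `cursor`, `container`, and `bit_pos` equal, which is the byte-identical property the PR relies on.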

Results (i9-9900K, `perf stat -e cycles`, paired)

decodecorpus-z000033 level_-7_fast + dict: 72.4G → 70.8G cycles = -2.2%
decodecorpus-z000033 level_3 dfast, no dict: 220.8G → 217.8G = -1.4%

A general win for any compression of compressible input, where the Huffman literal encode is hot (~19% of the fast dict-compress profile); it is not dict-specific.
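The percentages in the results can be re-derived from the raw cycle counts. The helper below is a hypothetical convenience (`percent_change` is not part of the crate or of `perf`), shown only to make the arithmetic explicit.

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean fewer cycles after the change.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

With the reported counts in G cycles, `percent_change(72.4, 70.8)` is about -2.2 and `percent_change(220.8, 217.8)` is about -1.4, matching the figures quoted above.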

Correctness

Byte-identical output: 839 tests (--features dict_builder) + cross_validation green on x86_64 (avx2) and aarch64; i9 last-out-sum identical pre/post on decodecorpus.
The deleted merge_index1 unit test is replaced by encode_unrolled_dual_container_size_is_deterministic, which exercises the inlined dual-container merge path through the new method.

Base

Stacked on perf/dfast-speed-microopt (#422); the diff is the single Huffman commit.

coderabbitai · 2026-06-15T13:34:54Z

Warning

Review limit reached

@polaz, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 1 hour, 28 minutes, and 38 seconds. Learn how PR review limits work.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 5930e191-18c6-4f92-b869-74d092de04c2

📥 Commits

Reviewing files that changed from the base of the PR and between 1ad404a and ec68492.

📒 Files selected for processing (2)

zstd/src/huff0/huf_cstream.rs
zstd/src/huff0/huff0_encoder.rs

greptile-apps · 2026-06-15T13:40:41Z

Greptile Summary

This PR moves the Huffman encode loop's mutable bit-stream state into locals inside a new HufCStream::encode_unrolled method, matching the upstream HUF_CStream_t register-resident shape and eliminating a compiler reload per symbol caused by the optimizer's aliasing conservatism between struct-field writes and the raw output-buffer writes in flush_bits.

The add0!/add1!/flush0! macros are faithful inline copies of add_bits/flush_bits; the inlined merge exactly replicates merge_index1 (including using the full bp1 value, not the masked nb_bits_1, in wrapping_add). Byte-identical output is maintained.
zero_index1 and merge_index1 are deleted; their callers in encode_one_stream_unrolled are replaced by a single bit_c.encode_unrolled(...) delegation.
The replacement test encode_unrolled_dual_container_size_is_deterministic exercises the dual-container merge path (phase 3 runs twice with K_UNROLL=4 over 16 symbols) with a deterministic size assertion.
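The remark about using the full `bp1` value in `wrapping_add` can be illustrated with a small two-lane model. This is a sketch under stated assumptions, not the crate's code: `Lanes`, `pack`, `add`, `zero_lane1`, and `merge_lane1` are hypothetical names, and the layout (code value in the top bits of a 64-bit element, bit count in its low 8 bits, containers filled from the top) is assumed here in the spirit of the upstream two-container scheme.

```rust
/// Hypothetical two-lane accumulator with top-aligned containers.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct Lanes {
    container: [u64; 2],
    bit_pos: [u64; 2], // only the low 8 bits are read as a bit count
}

/// Assumed element layout: `value` in the top `nb_bits` bits, `nb_bits` in the low 8 bits.
fn pack(value: u64, nb_bits: u32) -> u64 {
    debug_assert!(nb_bits >= 1 && nb_bits <= 12 && value < (1 << nb_bits));
    (value << (64 - nb_bits)) | nb_bits as u64
}

impl Lanes {
    /// Shift older bits down, then place the new code at the top.
    fn add(&mut self, idx: usize, elt: u64) {
        let nb_bits = (elt & 0xFF) as u32;
        self.container[idx] >>= nb_bits;
        self.container[idx] |= elt & !0xFF;
        // The whole element is added; the high bits are ignored noise.
        self.bit_pos[idx] = self.bit_pos[idx].wrapping_add(elt);
    }

    fn zero_lane1(&mut self) {
        self.container[1] = 0;
        self.bit_pos[1] = 0;
    }

    /// Fold lane 1 into lane 0: the shift uses the masked bit count,
    /// while the position update adds the full, unmasked value.
    fn merge_lane1(&mut self) {
        let nb_bits_1 = (self.bit_pos[1] & 0xFF) as u32;
        self.container[0] >>= nb_bits_1;
        self.container[0] |= self.container[1];
        self.bit_pos[0] = self.bit_pos[0].wrapping_add(self.bit_pos[1]);
    }
}
```

In this model, adding four codes to lane 0 one after another leaves the same `container[0]` and `bit_pos[0]` as adding two to lane 0, two to a zeroed lane 1, and merging. Adding only the masked count during the merge would keep the low 8 bits correct but change the high bits of `bit_pos[0]`, so the state would no longer mirror the sequential path exactly.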

Confidence Score: 5/5

Safe to merge — the refactor is a pure codegen change with byte-identical bitstream output verified against 839 tests and cross-architecture cross-validation.

The arithmetic in encode_unrolled is a direct mechanical translation of the four deleted methods: the macros mirror add_bits and flush_bits line-for-line, and the inline merge block matches merge_index1 exactly. All six mutable fields are written back after the loop so close() sees correct final state. The raw-pointer safety argument is sound.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| zstd/src/huff0/huf_cstream.rs | Adds `encode_unrolled` with hoisted bit-state locals; deletes `zero_index1`/`merge_index1`; replaces their unit test. Arithmetic is byte-identical to the deleted methods: the `add0!`/`add1!`/`flush0!` macros are faithful mirrors of `add_bits`/`flush_bits`, and the inline merge matches `merge_index1` exactly, including the full-value (not masked) `bp1` in `wrapping_add`. All six state fields are written back correctly at the end. Raw-pointer safety argument is sound. |
| zstd/src/huff0/huff0_encoder.rs | `encode_one_stream_unrolled` is reduced to a single-line delegation to `HufCStream::encode_unrolled`; all loop logic moved to `huf_cstream.rs`. Two doc comments still reference the deleted `zero_index1`/`merge_index1` methods; this was flagged in a prior review and is not re-posted here. |

Reviews (6): Last reviewed commit: "perf(huff0): hoist bit-stream state into..."

codecov · 2026-06-15T15:42:05Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

The unrolled Huffman encode loop drove add_bits/flush_bits through &mut self, reading and writing bit_container[idx] / bit_pos / cursor in the struct every symbol. The optimizer could not prove the raw output-buffer writes in flush_bits don't alias those fields, so it reloaded the containers from memory per symbol (upstream zstd keeps them in HUF_CStream_t locals). Move the unrolled loop into a HufCStream method that hoists the two containers, bit positions, and cursor into locals kept register-resident for the loop, writing back once at the end. The per-symbol arithmetic mirrors the prior methods exactly, so the emitted bitstream is byte-identical (839 tests + cross_validation green); only the codegen changes.

polaz · 2026-06-15T16:38:03Z

@coderabbitai review

coderabbitai · 2026-06-15T16:38:08Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Base automatically changed from perf/dfast-speed-microopt to main June 15, 2026 15:38

polaz force-pushed the perf/huff0-encode-state-hoist branch from dc3dd53 to ec68492 Compare June 15, 2026 16:30

polaz merged commit de57b5b into main Jun 15, 2026
28 checks passed

polaz deleted the perf/huff0-encode-state-hoist branch June 15, 2026 16:38

sw-release-bot Bot mentioned this pull request Jun 15, 2026

chore: release v0.0.40 #419

Merged
