perf(huff0): hoist bit-stream state into the encode loop#423
Conversation
|
Warning Review limit reached
More reviews will be available in 1 hour, 28 minutes, and 38 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
| Filename | Overview |
|---|---|
| zstd/src/huff0/huf_cstream.rs | Adds encode_unrolled with hoisted bit-state locals; deletes zero_index1/merge_index1; replaces their unit test. Arithmetic is byte-identical to the deleted methods — add0!/add1!/flush0! macros are faithful mirrors of add_bits/flush_bits, and the inline merge matches merge_index1 exactly including the full-value (not masked) bp1 in wrapping_add. All six state fields are written back correctly at the end. Raw-pointer safety argument is sound. |
| zstd/src/huff0/huff0_encoder.rs | encode_one_stream_unrolled is reduced to a single-line delegation to HufCStream::encode_unrolled; all loop logic moved to huf_cstream.rs. Two doc comments still reference the deleted zero_index1/merge_index1 methods — flagged in prior review, not re-posted here. |
Reviews (6): Last reviewed commit: "perf(huff0): hoist bit-stream state into..." | Re-trigger Greptile
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
The unrolled Huffman encode loop drove add_bits/flush_bits through &mut self, reading and writing bit_container[idx] / bit_pos / cursor in the struct every symbol. The optimizer could not prove the raw output-buffer writes in flush_bits don't alias those fields, so it reloaded the containers from memory per symbol (upstream zstd keeps them in HUF_CStream_t locals). Move the unrolled loop into a HufCStream method that hoists the two containers, bit positions, and cursor into locals kept register-resident for the loop, writing back once at the end. The per-symbol arithmetic mirrors the prior methods exactly, so the emitted bitstream is byte-identical (839 tests + cross_validation green); only the codegen changes.
dc3dd53 to
ec68492
Compare
|
@coderabbitai review |
✅ Action performedReview finished.
|
Summary
Hoist the Huffman encode loop's bit-stream state into locals so it stays register-resident, matching upstream zstd's
HUF_CStream_tshape.The unrolled encode loop drove
add_bits/flush_bitsthrough&mut self, reading and writingbit_container[idx]/bit_pos/cursorin the struct on every symbol. The optimizer could not prove the raw output-buffer writes inflush_bitsdon't alias those struct fields, so it reloaded the containers from memory per symbol. Upstream keeps them inHUF_CStream_tlocals.This moves the unrolled loop into a
HufCStream::encode_unrolledmethod that hoists the two bit containers, their bit positions, and the write cursor into locals kept register-resident for the loop, writing them back once at the end. The per-symbol arithmetic mirrors the prioradd_bits/flush_bits/zero_index1/merge_index1exactly, so the emitted bitstream is byte-identical; only the codegen changes. The now-inlinedzero_index1/merge_index1are deleted.Results (i9-9900K,
perf stat -e cycles, paired)A general win for all compressible compression (where the Huffman literal encode is hot: ~19% of the fast dict-compress profile), not dict-specific.
Correctness
--features dict_builder) + cross_validation green on x86_64 (avx2) and aarch64; i9last-out-sumidentical pre/post on decodecorpus.merge_index1unit test is replaced byencode_unrolled_dual_container_size_is_deterministic, which exercises the inlined dual-container merge path through the new method.Base
Stacked on
perf/dfast-speed-microopt(#422); the diff is the single Huffman commit.