This document describes the design and implementation of the Huffman encoder/decoder in this repository, with emphasis on the C reference implementation and Rust compatibility.
The project implements optimal static Huffman coding over 8-bit symbols.
Design goals:
- Bit-for-bit deterministic output for equivalent inputs.
- Simple, auditable data structures over clever machinery.
- Portable on little-endian and big-endian hosts.
- Robust I/O handling for files and pipes.
- C and Rust implementations that interoperate at the file-format level.
This is not an adaptive Huffman implementation and does not optimize for the smallest possible tree serialization.
Compressed files are self-describing:
- 16-byte header.
- 2-byte CRC16 over that header (V2 only).
- Serialized Huffman tree.
- Bit-packed payload.
| Offset | Size | Field | Meaning |
|---|---|---|---|
| 0 | 4 | `magic` | `0xBEEFD00E` (V2) or `0xBEEFD00D` (V1 legacy) |
| 4 | 2 | `permissions` | Unix mode bits from source file |
| 6 | 2 | `tree_size` | Serialized tree size in bytes |
| 8 | 8 | `file_size` | Original plaintext size in bytes |
| 16 | 2 | `header_crc` | CRC-16/CCITT-FALSE over bytes [0..16) (V2 only) |
All multi-byte fields are encoded little-endian.
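
As an illustration of the layout above, the following sketch packs the 16 header bytes with explicit shifts, which yields little-endian output on any host. The names `put_le`, `get_le`, and `pack_header` are hypothetical and are not taken from `header.h`; the repository's own code may instead byte-swap whole fields.

```c
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 16u  /* magic + permissions + tree_size + file_size */

/* Store the low `n` bytes of `v` at `dst`, least-significant byte first. */
static void put_le(uint8_t *dst, uint64_t v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Read `n` little-endian bytes from `src` (n <= 8). */
static uint64_t get_le(const uint8_t *src, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++) {
        v |= (uint64_t)src[i] << (8 * i);
    }
    return v;
}

/* Fill `out` with the header fields at the offsets given in the table. */
static void pack_header(uint8_t out[HEADER_SIZE], uint32_t magic,
                        uint16_t permissions, uint16_t tree_size,
                        uint64_t file_size)
{
    put_le(out + 0, magic, 4);
    put_le(out + 4, permissions, 2);
    put_le(out + 6, tree_size, 2);
    put_le(out + 8, file_size, 8);
}
```

With the V2 magic, the first four bytes on disk come out as `0E D0 EF BE`.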
Post-order encoding:
- Leaf: byte `'L'`, then one symbol byte.
- Interior: serialize left subtree, then right subtree, then byte `'I'`.
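If a tree has $k$ leaves, serialized size is $3k - 1$ bytes: each leaf costs 2 bytes and each of the $k - 1$ interior nodes costs 1 byte.
In symbols:

$$S_{\text{tree}} = 2k + (k - 1) = 3k - 1$$

Decoder validation enforces (for $2 \le k \le 256$):
- $S_{\text{tree}} \ge 5$
- $S_{\text{tree}} \le 767$
- $S_{\text{tree}} \equiv 2 \pmod{3}$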
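
The post-order encoding and the size checks can be sketched as follows. This is a minimal illustration under assumptions: `struct node`, `dump_tree`, and `tree_size_is_valid` are hypothetical names, and the node type is reduced to the fields the dump needs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal node; the repository's node type carries more fields. */
struct node {
    uint8_t symbol;
    bool leaf;
    struct node *left, *right;
};

/* Write the post-order encoding of `t` into `out`; returns bytes written.
 * The caller must provide at least 3k - 1 bytes for a tree with k leaves. */
static size_t dump_tree(const struct node *t, uint8_t *out)
{
    if (t->leaf) {
        out[0] = 'L';
        out[1] = t->symbol;
        return 2;
    }
    size_t n = dump_tree(t->left, out);
    n += dump_tree(t->right, out + n);
    out[n] = 'I';
    return n + 1;
}

/* The three size checks: 5 <= size <= 767 and size = 2 (mod 3). */
static bool tree_size_is_valid(size_t tree_size)
{
    return tree_size >= 5 && tree_size <= 767 && tree_size % 3 == 2;
}
```

A two-leaf tree over `a` and `b` dumps to the 5 bytes `LaLbI`, the smallest size the checks accept.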
A `treeNode` represents both leaf and interior nodes:
- `symbol`: meaningful for leaves.
- `count`: frequency on leaves, subtree sum on interiors.
- `leaf`: node-kind discriminator.
- `left`/`right`: child pointers.
delTree recursively frees nodes in post-order.
A circular buffer of treeNode *, maintained sorted by frequency.
- `dequeue` removes smallest count in O(1) from `tail`.
- `enqueue` insertion-sorts in O(n) by shifting larger entries.
Given at most 256 symbols, this simple approach is fast and easy to audit. Tie behavior is stable for equal counts, which matters for deterministic output.
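
A simplified sketch of the idea, assuming a plain array rather than a circular buffer and one particular stable tie rule (entries with equal counts leave in insertion order). All names here are hypothetical, and which entries get shifted depends on the array orientation: this sketch keeps the smallest count at the tail, so it shifts entries that are smaller than or equal to the new one.

```c
#include <stddef.h>
#include <stdint.h>

#define PQ_CAPACITY 256u

/* Hypothetical queue entry: a count plus an id standing in for a node pointer. */
struct item {
    uint64_t count;
    int id;
};

struct pq {
    struct item slots[PQ_CAPACITY]; /* sorted by count, largest first */
    size_t size;
};

/* O(n) insertion. Returns 0 on success, -1 if the queue is full. */
static int pq_enqueue(struct pq *q, struct item it)
{
    if (q->size == PQ_CAPACITY) {
        return -1;
    }
    size_t i = q->size;
    /* Move entries with count <= it.count one slot toward the tail, so an
     * older entry with the same count stays closer to the tail and is
     * dequeued first. */
    while (i > 0 && q->slots[i - 1].count <= it.count) {
        q->slots[i] = q->slots[i - 1];
        i--;
    }
    q->slots[i] = it;
    q->size++;
    return 0;
}

/* O(1) removal of the smallest count from the tail. Returns -1 if empty. */
static int pq_dequeue(struct pq *q, struct item *out)
{
    if (q->size == 0) {
        return -1;
    }
    *out = q->slots[--q->size];
    return 0;
}
```

Because ties are resolved by insertion order only, two runs over the same histogram merge nodes in the same sequence, which is what deterministic output requires.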
Used during tree deserialization:
- Backing array starts at 256 entries.
- `push` doubles capacity on demand with safe `realloc` handling.
- `pop` returns `NULL` on empty stack.
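
To show why a stack suffices for deserialization, here is a sketch that rebuilds a tree from its post-order dump. It is not the repository's code: it uses a fixed-size index stack and a flat node array instead of a growable pointer stack, and `flat_node` and `load_tree` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 511  /* 256 leaves + 255 interior nodes */
#define MAX_STACK 256  /* at most one pending subtree per leaf */

struct flat_node {
    int left, right;   /* child indices into the node array, -1 on a leaf */
    uint8_t symbol;    /* meaningful on leaves only */
};

/* Rebuild a tree from a post-order dump ('L' + symbol, or 'I').
 * Returns the root index, or -1 if the dump is malformed. */
static int load_tree(const uint8_t *dump, size_t len,
                     struct flat_node nodes[MAX_NODES])
{
    int stack[MAX_STACK];
    size_t top = 0;
    int used = 0;
    size_t i = 0;

    while (i < len) {
        if (used == MAX_NODES) {
            return -1;
        }
        if (dump[i] == 'L') {
            if (i + 1 >= len || top == MAX_STACK) {
                return -1;              /* missing symbol byte or overflow */
            }
            nodes[used] = (struct flat_node){ -1, -1, dump[i + 1] };
            i += 2;
        } else if (dump[i] == 'I') {
            if (top < 2) {
                return -1;              /* interior node needs two children */
            }
            int right = stack[--top];   /* right subtree was pushed last */
            int left = stack[--top];
            nodes[used] = (struct flat_node){ left, right, 0 };
            i += 1;
        } else {
            return -1;                  /* unknown tag byte */
        }
        stack[top++] = used++;
    }
    return top == 1 ? stack[0] : -1;    /* exactly one root must remain */
}
```

A well-formed dump leaves exactly one entry on the stack; anything else indicates a truncated or corrupted tree.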
encode.c performs:
- Read input and build histogram (`uint64_t hist[256]`).
- Ensure at least two symbols by adding phantom stand-ins when needed.
- Build Huffman tree with greedy merge loop.
- Build per-symbol bit codes with DFS (`buildCode`).
- Emit header + CRC + serialized tree.
- Re-read plaintext and emit encoded payload bits.
Payload bits are written least-significant-bit first within each output byte.
`appendCode` and `flushCode` in `code.h` maintain a 1 KB bit buffer and write through `io_write_full`.
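
The LSB-first packing rule can be illustrated with a small in-memory bit writer. This is a sketch under assumptions, not the contents of `code.h`: the struct and the `bw_*` functions are hypothetical, and flushing a full buffer to a file descriptor is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 1 KB bit buffer; bit k of the stream lives in
 * byte k / 8 at bit position k % 8 (least-significant bit first). */
struct bit_writer {
    uint8_t buf[1024];
    size_t nbits;
};

static void bw_init(struct bit_writer *w)
{
    memset(w->buf, 0, sizeof w->buf);
    w->nbits = 0;
}

/* Append the low `len` bits of `code`, bit 0 first.
 * Returns 0 on success, -1 if the bits do not fit. */
static int bw_append(struct bit_writer *w, uint64_t code, unsigned len)
{
    if (len > 64 || w->nbits + len > sizeof w->buf * 8) {
        return -1;
    }
    for (unsigned i = 0; i < len; i++) {
        if ((code >> i) & 1u) {
            w->buf[w->nbits / 8] |= (uint8_t)(1u << (w->nbits % 8));
        }
        w->nbits++;
    }
    return 0;
}

/* Number of bytes occupied, counting a partially filled last byte. */
static size_t bw_bytes(const struct bit_writer *w)
{
    return (w->nbits + 7) / 8;
}
```

Unused bits in the final byte stay zero, which matches the zero-padding the decoder is expected to ignore.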
decode.c performs:
- Exact-read header (`io_read_full`).
- Validate magic, parse metadata, and read CRC for V2.
- On CRC mismatch, continue decode with fallback permissions `0444`.
- Exact-read serialized tree and reconstruct with stack.
- Decode payload bitstream by tree walk until `file_size` symbols are emitted.
Decode stops by symbol count, not by EOF, so zero-padding bits at the end of the payload are ignored safely.
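
A sketch of a count-terminated tree walk over an in-memory payload. Several details are assumptions made for illustration: the flat `dnode` array, the convention that a 0 bit selects the left child and a 1 bit the right child, and the function name `decode_payload`.

```c
#include <stddef.h>
#include <stdint.h>

struct dnode {
    int left, right;   /* child indices, -1 on a leaf */
    uint8_t symbol;    /* meaningful on leaves only */
};

/* Walk the tree bit by bit (LSB first within each byte) and stop once
 * `file_size` symbols have been written to `out`. Returns the number of
 * symbols emitted; a short count means the payload ran out early.
 * Assumes `root` is an interior node, which holds for trees with at least
 * two leaves. */
static uint64_t decode_payload(const struct dnode *nodes, int root,
                               const uint8_t *payload, size_t payload_len,
                               uint8_t *out, uint64_t file_size)
{
    uint64_t emitted = 0;
    int cur = root;

    for (size_t bit = 0; emitted < file_size && bit < payload_len * 8; bit++) {
        unsigned b = (payload[bit / 8] >> (bit % 8)) & 1u;
        cur = b ? nodes[cur].right : nodes[cur].left;
        if (nodes[cur].left < 0) {      /* reached a leaf */
            out[emitted++] = nodes[cur].symbol;
            cur = root;
        }
    }
    return emitted;
}
```

With a two-leaf tree (`a` on the left, `b` on the right), the single payload byte `0x06` and a `file_size` of 4 decode to `abba`; the four padding bits are never consumed.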
src/io.h centralizes retry loops:
- `io_read_full`: keep reading until `len` bytes or failure/EOF.
- `io_write_full`: keep writing until all bytes are written or failure.
These helpers are used for metadata reads and all high-value write paths. This avoids silent short-read/short-write truncation on pipes or congested I/O.
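
The retry pattern looks roughly like the sketch below, written against POSIX `read` and `write`. The function names, signatures, and return conventions are assumptions chosen for illustration and may differ from the helpers in `src/io.h`.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read until `len` bytes have arrived, EOF, or a hard error.
 * Returns the number of bytes read (short only at EOF), or -1 on error. */
static ssize_t read_full_sketch(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n == 0) {
            break;                      /* EOF */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;               /* interrupted, retry */
            }
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Write all `len` bytes or fail. Returns 0 on success, -1 on error. */
static int write_full_sketch(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0 && errno == EINTR) {
            continue;                   /* interrupted, retry */
        }
        if (n <= 0) {
            return -1;                  /* hard error or no progress */
        }
        done += (size_t)n;
    }
    return 0;
}
```

A caller that needs an exact-size read, such as the 16-byte header, treats any return value other than the requested length as a truncated input.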
Runtime helpers in endian.h provide:
- Host endianness detection (`isBig()`).
- Byte-swap helpers (`swap16/32/64`).
Fields are always stored little-endian on disk.
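
One common way to write such helpers is sketched below. The `_sketch` names are hypothetical and the bodies are an assumption about how runtime detection and swapping might be done, not a copy of `endian.h`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Runtime probe: on a big-endian host the first byte of 0x0102 is 0x01. */
static bool is_big_sketch(void)
{
    const uint16_t probe = 0x0102;
    return *(const uint8_t *)&probe == 0x01;
}

static uint16_t swap16_sketch(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

static uint32_t swap32_sketch(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
           ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
}

static uint64_t swap64_sketch(uint64_t x)
{
    return ((uint64_t)swap32_sketch((uint32_t)x) << 32) |
           swap32_sketch((uint32_t)(x >> 32));
}

/* Convert a host-order value to its on-disk (little-endian) representation. */
static uint32_t to_le32_sketch(uint32_t x)
{
    return is_big_sketch() ? swap32_sketch(x) : x;
}
```

Since a byte swap is its own inverse, the same conversion serves for both writing and reading a field.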
CRC variant: CRC-16/CCITT-FALSE
- Polynomial: `0x1021`
- Init: `0xFFFF`
- No reflection
- No XOR-out
CRC protects header integrity only, not the tree or payload.
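
A bitwise sketch of the variant with the parameters listed above. The function name is hypothetical, and `src/crc16.h` may use a table-driven form instead.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection, no XOR-out. */
static uint16_t crc16_ccitt_false(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);  /* feed byte into the top */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

The standard check value for this variant is `0x29B1` over the ASCII string `123456789`, which is a convenient self-test for any implementation.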
Rust (rust/) is kept wire-compatible with C:
- Same header layout and CRC policy.
- Same tree serialization.
- Same LSB-first bit ordering.
- Same V1/V2 decode compatibility behavior.
- Same fallback permissions behavior on V2 CRC mismatch.
The Rust CLI decode path is streaming, mirroring C behavior for large files.
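Let $n$ be the input size in bytes and $k$ the number of distinct symbols ($k \le 256$).
- Histogram: $O(n)$
- Tree build: $O(k^2)$ worst-case due to insertion-sort enqueue
- Code generation: $O(k)$
- Encode payload walk: $O\!\left(n \times \bar{\ell}\right)$
- Decode payload walk: $O\!\left(n \times \bar{\ell}\right)$

Here $\bar{\ell}$ is the average code length in bits per symbol.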
The implementation is designed for robustness, not adversarial hardening:
- Reject malformed magic/tree encodings.
- Refuse to clobber existing output files (`O_CREAT|O_EXCL` / `create_new`).
- Preserve deterministic decode behavior under header CRC mismatch.
- Abort on allocation failures in critical paths.
Out of scope:
- Side-channel resistance.
- Cryptographic integrity/authentication of payload bytes.
- Constant-memory decode under all malformed inputs.
Coverage comes from:
- C build with strict warnings and sanitizer mode.
- Rust unit and compatibility tests.
- Cross-language bit-compatibility checks.
- Mutation + round-trip fuzzer with:
  - random and structured header mutations,
  - stdin/stdout and file-path execution modes,
  - V1 and V2 corpus entries,
  - sanitizer-diagnostic detection in encoder and decoder stderr.
| File | Responsibility |
|---|---|
| `src/encode.c` | C encoder entry point |
| `src/decode.c` | C decoder entry point |
| `src/huffman.c` / `src/huffman.h` | Tree node construction and tree utilities |
| `src/priority.c` / `src/queue.h` | Priority queue implementation/interface |
| `src/stack.c` / `src/stack.h` | Dynamic stack for deserialization |
| `src/code.h` | Code representation and payload bit-buffer logic |
| `src/header.h` | On-disk header definition and magic values |
| `src/endian.h` | Endianness detection and byte swaps |
| `src/crc16.h` | CRC-16/CCITT-FALSE implementation |
| `src/io.h` | Exact read/write helpers |
| `tests/fuzz_huffman.py` | Coverage-oriented black-box fuzzer |
| `rust/src/lib.rs` | Rust format-compatible encoder/decoder core |
| `rust/src/bin/encode.rs` | Rust CLI encoder |
| `rust/src/bin/decode.rs` | Rust CLI decoder |