program.py: route from_bytes_backrefs through deser_auto#776
program.py: route from_bytes_backrefs through deser_auto#776richardkiss wants to merge 15 commits into
Conversation
Implement an interning-based serialization format for CLVM trees that deduplicates atoms and pairs for better compression than backrefs, with comparable deserialization speed and much faster serialization. Format: 6-byte magic prefix (fd ff 32 30 32 36), atom table grouped by length, stack-based instruction stream with dual-cons opcodes. Uses a variable-length signed integer (varint) encoding throughout. Includes: - varint codec (encode/decode, signed, variable-length prefix scheme) - deserializer with DeserializeLimits (max_atom_len, max_input_bytes) - serializer (left-first traversal) and pair-optimized serializer (DP) - serialized_length_serde_2026: compute blob length without deserializing - node_from_bytes_auto: auto-detect classic/backrefs/serde_2026 - InternedTree::node_indices() for serialization support - format specification (docs/serde-2026.md) - benchmarks for serialize and deserialize - 654 lines of tests (41 test cases) Made-with: Cursor
Rust pyo3 bindings: - deser_legacy, deser_backrefs, deser_2026, deser_auto (bytes -> LazyNode) - ser_legacy, ser_backrefs, ser_2026 (LazyNode -> bytes) - clvm_tree_to_lazy_node (Python CLVM tree -> interned Rust LazyNode) Python layer: - clvm_rs.serde module: serialize()/deserialize() with format selection - Program.from_bytes() now auto-detects format via deser_auto - Program.to_bytes_serde_2026() / from_bytes_serde_2026() - Updated .pyi stubs (fix phantom NO_NEG_DIV, add new flags/functions) - 222 lines of Python tests Made-with: Cursor
Roundtrip fuzzer: deserialize arbitrary bytes as classic CLVM, serialize to serde_2026, deserialize back, and compare trees structurally. Uses serialize_2026 (not node_to_bytes) for comparison to handle DAG-vs-tree differences correctly. Made-with: Cursor
Simplify serde_2026 deserialization limits around byte and atom-size budgets, add strict varint decoding, and update docs/tests/fuzzing for the refined API. Made-with: Cursor
Keep the serde_2026 serializer on the left-first strategy for review while preserving the strategy hook for a future context-aware compact implementation. Made-with: Cursor
Update the fuzz target and Python tests to reflect the remaining fast strategy after compact was removed from the PR. Made-with: Cursor
…26 prefix - Add a `decode_varint` fuzz target (`serde-2026-varint`). Exercises the non-strict and strict decoders; checks that any value the non-strict path accepts re-encodes to a canonical (minimal) form which decodes identically in strict mode and is no longer than the original input. - Re-export `decode_varint` / `encode_varint` (with `#[doc(hidden)]`) so the fuzzer can call them through the public crate surface. - Add a unit test for `write_atom_table` that confirms atoms of equal length are coalesced into a single repeated-length group while atoms of different lengths are emitted as separate groups. - Fix `deser_2026`: it now strips `SERDE_2026_MAGIC_PREFIX` before calling the underlying decoder, so it round-trips with `ser_2026` (which always emits the prefix). A `ValueError` is raised if the prefix is missing. Adds a Python test for the symmetry property. Made-with: Cursor
Public serialization API now uses `level: u32` rather than the `Compression` enum. New entry points: - `serialize_2026(a, n)` / `serialize_2026_level(a, n, level)` - `serialize_2026_to_stream(a, n, w)` / `..._level(a, n, level, w)` - `node_to_bytes_serde_2026(a, n)` / `..._level(a, n, level)` Levels above the highest implemented level *saturate* down to it. This means callers can pass `u32::MAX` to mean "best available compression" without recompiling when new levels are added later. `Compression` becomes a `pub(crate)` internal vocabulary mapped from `level` by `compression_for_level`. Python `ser_2026(node, level=K)` no longer raises for `K > 0`; it produces the same bytes as level 0 today (and the best implemented level tomorrow). The "rejects unknown level" test is replaced with a saturation test that asserts levels 0, 1, 7, 1<<20 and u32::MAX all produce the same bytes today. All internal call sites (tests, benches, fuzz target, wheel API) updated. Made-with: Cursor
Co-authored-by: Cursor <cursoragent@cursor.com>
…_count `Vec::with_capacity(instruction_count / 3)` allocated based on the declared instruction-count varint. A tiny blob (under 16 bytes) declaring instruction_count = 2^54 caused malloc to be called for ~24 PB and the process to abort with SIGABRT. `LimitReader` already bounds total bytes consumed, so the loop below cannot drive `pairs` past max_input_bytes/3 entries — geometric growth from `Vec::new()` handles that with a handful of reallocs and removes the abort surface. Adds three regression tests covering the same vector via group_count, the per-group atom count, and instruction_count: each crafts a tiny blob with a huge declared count and asserts deserialize errors instead of aborting. Made-with: Cursor
…t API
Move bounds policy out of clvm_rs. The deserializer now takes (max_atom_len,
strict) directly; total input bytes is bounded by the slice length (or, for
the streamed variant, a caller-supplied `Read::take`). LimitReader and the
DEFAULT_MAX_* constants are gone — those defaults felt consensus-flavored
without actually being consensus, and they belong in the wrapper that
*does* care (chia_rs).
`node_from_bytes_auto` was the only Rust caller of those defaults, so it
goes too. Auto-detect (sniff the magic prefix and dispatch to backrefs or
serde_2026) survives as a Python-only convenience: deser_auto in the wheel
sniffs the prefix in Rust and dispatches. clvm_rs's Python tools keep their
"works on any blob" feel; consensus-aware callers sniff themselves and
supply their own caps.
Wire-up:
- de.rs: drop DeserializeOptions/LimitReader/DEFAULT_MAX_*; new signatures
- mod re-exports: drop DeserializeOptions, node_from_bytes_auto
- debug self-check: hard-codes 1 MiB max atom (matches old default)
- wheel: deser_2026 / deser_auto take max_atom_len + strict; default 1 MiB
- tests: each test/fuzz/bench picks a local MAX_ATOM_LEN; Rust auto tests
removed (function is gone, Python tests cover it)
Made-with: Cursor
The Rust deserializer family was lopsided:
serialize_2026 -> body, no prefix
node_to_bytes_serde_2026 -> body + prefix
deserialize_2026 -> body, no prefix (name didn't say so)
deserialize_2026_from_stream -> body, no prefix (same problem; worse for
streams since you can't
pre-check the prefix)
serialized_length_serde_2026 -> wants prefix
(no node_from_bytes_serde_2026)
This commit:
- Renames `deserialize_2026{,_from_stream}` ->
`deserialize_2026_body{,_from_stream}` so the body-only contract is
visible at the call site.
- Adds `node_from_bytes_serde_2026(a, blob, max_atom_len, strict)` —
strips the magic prefix and delegates to the body deserializer. Pairs
one-to-one with `node_to_bytes_serde_2026`.
- Parametrizes `serialized_length_serde_2026(buf, max_atom_len, strict)`
so it mirrors every header-time validation the body deserializer
performs (closes the Ok-then-Err mismatch on `instruction_count == 0`,
`atom_len > max_atom_len`, and overlong varints under `strict`).
- Wheel: `deser_2026` now calls `node_from_bytes_serde_2026` directly
instead of stripping the prefix in Python-binding glue.
- Self-check, tests, fuzz target, benches: switch to the body name.
Includes a regression test (`test_serialized_length_rejects_what_deserialize_rejects`)
asserting the length helper rejects every header-time condition the body
deserializer rejects.
Made-with: Cursor
Cursor Bugbot caught that test_serialized_length_rejects_what_deserialize_rejects was calling `deserialize_2026_body` with a magic-prefixed blob from the `mk` helper. The body deserializer treats the leading 0xfd as a giant negative varint and rejects every input regardless of the header-time condition under test, making the first assertion vacuously pass. Switch the deserializer call to `node_from_bytes_serde_2026` so the test actually exercises the conditions it claims to (instruction_count == 0, group length == 0, multi-atom group count == 0). Made-with: Cursor
The previous implementation went through CLVMTree.from_bytes ->
deserialize_as_tuples, which is a classic-format-only parser. As a result,
Program.from_bytes_backrefs("...") raised ValueError on any blob that
actually used the 0xfe backref opcode, despite the method's name implying
backref support.
After this change, from_bytes_backrefs:
- still rejects the serde_2026 magic prefix up front
- delegates to the Rust deser_auto binding, which handles both classic
and backrefs formats
Adds a regression test that round-trips a tree-with-shared-subtrees through
the backrefs encoder and Program.from_bytes_backrefs (asserting the blob
contains at least one 0xfe backref byte). The existing test_repr_clvm_tree
case is updated to use Program.from_bytes_with_cursor, which is the actual
CLVMTree-producing entry point.
Reported by Cursor bot on PR #708.
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes Program.from_bytes_backrefs in the Python wheel so it actually supports backrefs-encoded CLVM (i.e., blobs containing 0xfe backref markers), by routing deserialization through the Rust deser_auto binding after continuing to reject the serde_2026 magic prefix.
Changes:
- Update
Program.from_bytes_backrefsto use the Rust auto-deserializer (classic/backrefs) instead of the classic-onlyCLVMTree.from_bytespath. - Update
test_repr_clvm_treeto useProgram.from_bytes_with_cursor, which is the CLVMTree-producing entry point. - Add a regression test ensuring
from_bytes_backrefscan round-trip an actually-backrefs-encoded blob (containing at least one0xfe).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| wheel/python/clvm_rs/program.py | Fixes from_bytes_backrefs to correctly parse backrefs (while still rejecting serde_2026 by prefix). |
| wheel/python/tests/test_serialize.py | Adjusts a CLVMTree-specific test and adds a regression test for true backrefs parsing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def from_bytes_backrefs(cls, blob: bytes) -> Program: | ||
| """Deserialize classic or backrefs format only (rejects serde_2026).""" | ||
| if blob.startswith(SERDE_2026_MAGIC_PREFIX): | ||
| raise ValueError("unexpected serde_2026 format; use from_bytes() for auto-detection") | ||
| obj, cursor = cls.from_bytes_with_cursor(blob, 0) | ||
| return obj | ||
| return cls.wrap(deser_auto(blob)) |
|
|
||
| # Serialize using backrefs format | ||
| blob = serialize(deserialize(bytes(p), "legacy"), "backrefs") | ||
|
|
||
| # Assert the blob contains at least one 0xfe backref opcode | ||
| self.assertIn(0xfe, blob, "backrefs-encoded blob should contain at least one 0xfe byte") | ||
|
|
||
| # Round-trip through from_bytes_backrefs | ||
| p2 = Program.from_bytes_backrefs(blob) | ||
|
|
Coverage Report for CI Build 25349903018Warning No base build found for commit Coverage: 87.971%Details
Uncovered ChangesNo uncovered changes found. Coverage RegressionsRequires a base build to compare against. How to fix this → Coverage Stats
💛 - Coveralls |
Summary
Program.from_bytes_backrefswas added in #708 but its implementationdelegated to
CLVMTree.from_bytes→deserialize_as_tuples, aclassic-format-only parser. The method therefore rejected any blob that
actually contained a
0xfebackref byte, despite its name.This PR routes
from_bytes_backrefsthrough the Rustdeser_autobinding (after the existing serde_2026 prefix rejection), which correctly
handles both classic and backrefs wire formats.
Test plan
the blob contains at least one
0xfebyte) throughProgram.from_bytes_backrefs.test_repr_clvm_treeupdated to useProgram.from_bytes_with_cursor(the actualCLVMTree-producingentry point).
pytest wheel/python/tests/test_serialize.pypasses locally.cargo fmt --checkandcargo clippy -- -D warningsclean.Notes
serde_2026because the broken method only exists on thatbranch (introduced in 2c51d14). Should land before [CHIA-3823] 2026 serde format #708 merges to
main.Made with Cursor
Note
Low Risk
Low risk: small deserialization routing change plus targeted regression tests; behavior is narrowed only by continuing to reject
serde_2026inputs.Overview
Program.from_bytes_backrefsnow routes through the Rustdeser_autobinding (while still rejecting theserde_2026magic prefix), so it can successfully parse both classic and actual backrefs wire format blobs.Tests are updated/extended to reflect this:
test_repr_clvm_treeusesfrom_bytes_with_cursorforCLVMTree-level assertions, and a new regression test round-trips a backrefs-encoded blob and asserts it contains a0xfebackref opcode.Reviewed by Cursor Bugbot for commit df894a9. Bugbot is set up for automated code reviews on this repo. Configure here.