Skip to content

program.py: route from_bytes_backrefs through deser_auto#776

Open
richardkiss wants to merge 15 commits into
mainfrom
fix-program-py-from-bytes-backrefs
Open

program.py: route from_bytes_backrefs through deser_auto#776
richardkiss wants to merge 15 commits into
mainfrom
fix-program-py-from-bytes-backrefs

Conversation

@richardkiss

@richardkiss richardkiss commented May 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Program.from_bytes_backrefs was added in #708 but its implementation
delegated to CLVMTree.from_bytesdeserialize_as_tuples, a
classic-format-only parser. The method therefore rejected any blob that
actually contained a 0xfe backref byte, despite its name.

This PR routes from_bytes_backrefs through the Rust deser_auto
binding (after the existing serde_2026 prefix rejection), which correctly
handles both classic and backrefs wire formats.

Test plan

  • New regression test that round-trips a backrefs-encoded blob (asserts
    the blob contains at least one 0xfe byte) through
    Program.from_bytes_backrefs.
  • Existing test_repr_clvm_tree updated to use
    Program.from_bytes_with_cursor (the actual CLVMTree-producing
    entry point).
  • pytest wheel/python/tests/test_serialize.py passes locally.
  • cargo fmt --check and cargo clippy -- -D warnings clean.

Notes

Made with Cursor


Note

Low Risk
Low risk: small deserialization routing change plus targeted regression tests; behavior is narrowed only by continuing to reject serde_2026 inputs.

Overview
Program.from_bytes_backrefs now routes through the Rust deser_auto binding (while still rejecting the serde_2026 magic prefix), so it can successfully parse both classic and actual backrefs wire format blobs.

Tests are updated/extended to reflect this: test_repr_clvm_tree uses from_bytes_with_cursor for CLVMTree-level assertions, and a new regression test round-trips a backrefs-encoded blob and asserts it contains a 0xfe backref opcode.

Reviewed by Cursor Bugbot for commit df894a9. Bugbot is set up for automated code reviews on this repo. Configure here.

richardkiss and others added 15 commits April 30, 2026 11:45
Implement an interning-based serialization format for CLVM trees that
deduplicates atoms and pairs for better compression than backrefs, with
comparable deserialization speed and much faster serialization.

Format: 6-byte magic prefix (fd ff 32 30 32 36), atom table grouped by
length, stack-based instruction stream with dual-cons opcodes. Uses a
variable-length signed integer (varint) encoding throughout.

Includes:
- varint codec (encode/decode, signed, variable-length prefix scheme)
- deserializer with DeserializeLimits (max_atom_len, max_input_bytes)
- serializer (left-first traversal) and pair-optimized serializer (DP)
- serialized_length_serde_2026: compute blob length without deserializing
- node_from_bytes_auto: auto-detect classic/backrefs/serde_2026
- InternedTree::node_indices() for serialization support
- format specification (docs/serde-2026.md)
- benchmarks for serialize and deserialize
- 654 lines of tests (41 test cases)

Made-with: Cursor
Rust pyo3 bindings:
- deser_legacy, deser_backrefs, deser_2026, deser_auto (bytes -> LazyNode)
- ser_legacy, ser_backrefs, ser_2026 (LazyNode -> bytes)
- clvm_tree_to_lazy_node (Python CLVM tree -> interned Rust LazyNode)

Python layer:
- clvm_rs.serde module: serialize()/deserialize() with format selection
- Program.from_bytes() now auto-detects format via deser_auto
- Program.to_bytes_serde_2026() / from_bytes_serde_2026()
- Updated .pyi stubs (fix phantom NO_NEG_DIV, add new flags/functions)
- 222 lines of Python tests

Made-with: Cursor
Roundtrip fuzzer: deserialize arbitrary bytes as classic CLVM, serialize
to serde_2026, deserialize back, and compare trees structurally. Uses
serialize_2026 (not node_to_bytes) for comparison to handle DAG-vs-tree
differences correctly.

Made-with: Cursor
Simplify serde_2026 deserialization limits around byte and atom-size budgets, add strict varint decoding, and update docs/tests/fuzzing for the refined API.

Made-with: Cursor
Keep the serde_2026 serializer on the left-first strategy for review while preserving the strategy hook for a future context-aware compact implementation.

Made-with: Cursor
Update the fuzz target and Python tests to reflect the remaining fast strategy after compact was removed from the PR.

Made-with: Cursor
…26 prefix

- Add a `decode_varint` fuzz target (`serde-2026-varint`). Exercises the
  non-strict and strict decoders; checks that any value the non-strict path
  accepts re-encodes to a canonical (minimal) form which decodes identically
  in strict mode and is no longer than the original input.
- Re-export `decode_varint` / `encode_varint` (with `#[doc(hidden)]`) so the
  fuzzer can call them through the public crate surface.
- Add a unit test for `write_atom_table` that confirms atoms of equal length
  are coalesced into a single repeated-length group while atoms of different
  lengths are emitted as separate groups.
- Fix `deser_2026`: it now strips `SERDE_2026_MAGIC_PREFIX` before calling
  the underlying decoder, so it round-trips with `ser_2026` (which always
  emits the prefix). A `ValueError` is raised if the prefix is missing.
  Adds a Python test for the symmetry property.

Made-with: Cursor
Public serialization API now uses `level: u32` rather than the `Compression`
enum. New entry points:

- `serialize_2026(a, n)` / `serialize_2026_level(a, n, level)`
- `serialize_2026_to_stream(a, n, w)` / `..._level(a, n, level, w)`
- `node_to_bytes_serde_2026(a, n)` / `..._level(a, n, level)`

Levels above the highest implemented level *saturate* down to it. This means
callers can pass `u32::MAX` to mean "best available compression" without
recompiling when new levels are added later. `Compression` becomes a
`pub(crate)` internal vocabulary mapped from `level` by `compression_for_level`.

Python `ser_2026(node, level=K)` no longer raises for `K > 0`; it produces the
same bytes as level 0 today (and the best implemented level tomorrow). The
"rejects unknown level" test is replaced with a saturation test that asserts
levels 0, 1, 7, 1<<20 and u32::MAX all produce the same bytes today.

All internal call sites (tests, benches, fuzz target, wheel API) updated.

Made-with: Cursor
Co-authored-by: Cursor <cursoragent@cursor.com>
…_count

`Vec::with_capacity(instruction_count / 3)` allocated based on the declared
instruction-count varint. A tiny blob (under 16 bytes) declaring
instruction_count = 2^54 caused malloc to be called for ~24 PB and the
process to abort with SIGABRT. `LimitReader` already bounds total bytes
consumed, so the loop below cannot drive `pairs` past max_input_bytes/3
entries — geometric growth from `Vec::new()` handles that with a handful of
reallocs and removes the abort surface.

Adds three regression tests covering the same vector via group_count, the
per-group atom count, and instruction_count: each crafts a tiny blob with a
huge declared count and asserts deserialize errors instead of aborting.

Made-with: Cursor
…t API

Move bounds policy out of clvm_rs. The deserializer now takes (max_atom_len,
strict) directly; total input bytes is bounded by the slice length (or, for
the streamed variant, a caller-supplied `Read::take`). LimitReader and the
DEFAULT_MAX_* constants are gone — those defaults felt consensus-flavored
without actually being consensus, and they belong in the wrapper that
*does* care (chia_rs).

`node_from_bytes_auto` was the only Rust caller of those defaults, so it
goes too. Auto-detect (sniff the magic prefix and dispatch to backrefs or
serde_2026) survives as a Python-only convenience: deser_auto in the wheel
sniffs the prefix in Rust and dispatches. clvm_rs's Python tools keep their
"works on any blob" feel; consensus-aware callers sniff themselves and
supply their own caps.

Wire-up:
  - de.rs: drop DeserializeOptions/LimitReader/DEFAULT_MAX_*; new signatures
  - mod re-exports: drop DeserializeOptions, node_from_bytes_auto
  - debug self-check: hard-codes 1 MiB max atom (matches old default)
  - wheel: deser_2026 / deser_auto take max_atom_len + strict; default 1 MiB
  - tests: each test/fuzz/bench picks a local MAX_ATOM_LEN; Rust auto tests
    removed (function is gone, Python tests cover it)

Made-with: Cursor
The Rust deserializer family was lopsided:

  serialize_2026          -> body, no prefix
  node_to_bytes_serde_2026 -> body + prefix
  deserialize_2026         -> body, no prefix    (name didn't say so)
  deserialize_2026_from_stream -> body, no prefix (same problem; worse for
                                                   streams since you can't
                                                   pre-check the prefix)
  serialized_length_serde_2026 -> wants prefix
  (no node_from_bytes_serde_2026)

This commit:

  - Renames `deserialize_2026{,_from_stream}` ->
    `deserialize_2026_body{,_from_stream}` so the body-only contract is
    visible at the call site.
  - Adds `node_from_bytes_serde_2026(a, blob, max_atom_len, strict)` —
    strips the magic prefix and delegates to the body deserializer. Pairs
    one-to-one with `node_to_bytes_serde_2026`.
  - Parametrizes `serialized_length_serde_2026(buf, max_atom_len, strict)`
    so it mirrors every header-time validation the body deserializer
    performs (closes the Ok-then-Err mismatch on `instruction_count == 0`,
    `atom_len > max_atom_len`, and overlong varints under `strict`).
  - Wheel: `deser_2026` now calls `node_from_bytes_serde_2026` directly
    instead of stripping the prefix in Python-binding glue.
  - Self-check, tests, fuzz target, benches: switch to the body name.

Includes a regression test (`test_serialized_length_rejects_what_deserialize_rejects`)
asserting the length helper rejects every header-time condition the body
deserializer rejects.

Made-with: Cursor
Cursor Bugbot caught that test_serialized_length_rejects_what_deserialize_rejects
was calling `deserialize_2026_body` with a magic-prefixed blob from the `mk`
helper. The body deserializer treats the leading 0xfd as a giant negative
varint and rejects every input regardless of the header-time condition under
test, making the first assertion vacuously pass.

Switch the deserializer call to `node_from_bytes_serde_2026` so the test
actually exercises the conditions it claims to (instruction_count == 0,
group length == 0, multi-atom group count == 0).

Made-with: Cursor
The previous implementation went through CLVMTree.from_bytes ->
deserialize_as_tuples, which is a classic-format-only parser. As a result,
Program.from_bytes_backrefs("...") raised ValueError on any blob that
actually used the 0xfe backref opcode, despite the method's name implying
backref support.

After this change, from_bytes_backrefs:
  - still rejects the serde_2026 magic prefix up front
  - delegates to the Rust deser_auto binding, which handles both classic
    and backrefs formats

Adds a regression test that round-trips a tree-with-shared-subtrees through
the backrefs encoder and Program.from_bytes_backrefs (asserting the blob
contains at least one 0xfe backref byte). The existing test_repr_clvm_tree
case is updated to use Program.from_bytes_with_cursor, which is the actual
CLVMTree-producing entry point.

Reported by Cursor bot on PR #708.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 4, 2026 23:49

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes Program.from_bytes_backrefs in the Python wheel so it actually supports backrefs-encoded CLVM (i.e., blobs containing 0xfe backref markers), by routing deserialization through the Rust deser_auto binding after continuing to reject the serde_2026 magic prefix.

Changes:

  • Update Program.from_bytes_backrefs to use the Rust auto-deserializer (classic/backrefs) instead of the classic-only CLVMTree.from_bytes path.
  • Update test_repr_clvm_tree to use Program.from_bytes_with_cursor, which is the CLVMTree-producing entry point.
  • Add a regression test ensuring from_bytes_backrefs can round-trip an actually-backrefs-encoded blob (containing at least one 0xfe).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
wheel/python/clvm_rs/program.py Fixes from_bytes_backrefs to correctly parse backrefs (while still rejecting serde_2026 by prefix).
wheel/python/tests/test_serialize.py Adjusts a CLVMTree-specific test and adds a regression test for true backrefs parsing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 50 to +54
def from_bytes_backrefs(cls, blob: bytes) -> Program:
"""Deserialize classic or backrefs format only (rejects serde_2026)."""
if blob.startswith(SERDE_2026_MAGIC_PREFIX):
raise ValueError("unexpected serde_2026 format; use from_bytes() for auto-detection")
obj, cursor = cls.from_bytes_with_cursor(blob, 0)
return obj
return cls.wrap(deser_auto(blob))
Comment on lines +187 to +196

# Serialize using backrefs format
blob = serialize(deserialize(bytes(p), "legacy"), "backrefs")

# Assert the blob contains at least one 0xfe backref opcode
self.assertIn(0xfe, blob, "backrefs-encoded blob should contain at least one 0xfe byte")

# Round-trip through from_bytes_backrefs
p2 = Program.from_bytes_backrefs(blob)

@coveralls-official

Copy link
Copy Markdown

Coverage Report for CI Build 25349903018

Warning

No base build found for commit 6ac4265 on serde_2026.
Coverage changes can't be calculated without a base build.
If a base build is processing, this comment will update automatically when it completes.

Coverage: 87.971%

Details

  • Patch coverage: 1 of 1 lines across 1 file are fully covered (100%).

Uncovered Changes

No uncovered changes found.

Coverage Regressions

Requires a base build to compare against. How to fix this →


Coverage Stats

Coverage Status
Relevant Lines: 9394
Covered Lines: 8264
Line Coverage: 87.97%
Coverage Strength: 28088109.57 hits per line

💛 - Coveralls

Base automatically changed from serde_2026 to main May 19, 2026 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants