
Add textproto (text format) encoding and decoding #6

Merged — iainmcgin merged 6 commits into main from textproto-support, Mar 30, 2026
Conversation

@iainmcgin
Collaborator

The protobuf text format is the human-readable debug representation — useful
for config files, golden-file tests, and logging. This implements the full
format with 883/883 conformance suite coverage.

Runtime (feature = "text", zero deps, no_std + alloc)

  • buffa::text module: TextFormat trait, TextEncoder, TextDecoder,
    encode_to_string / decode_from_str conveniences
  • Tokenizer handles all grammar quirks: both { } and < > delimiters,
    adjacent string concatenation, #-comments, [pkg.ext] bracket names,
    hex/octal/float number forms, nan/inf, C++-style bool literals
  • Decoder returns Cow<'a, str> for strings (zero-copy when no escapes)
  • \u escapes are BMP-only (surrogates rejected per spec; use \U for non-BMP)
  • Heuristic nested-message detection for unknown LD fields (speculative
    wire-format parse, C++ text_format.cc:2926 / Java TextFormat.java:87)
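The Cow-returning string decode mentioned above can be sketched as follows. This is a minimal standalone illustration, not buffa's actual decoder: the `unescape` helper here is hypothetical and handles only a couple of escape forms.

```rust
use std::borrow::Cow;

// Borrow the input when it contains no backslash escapes (zero-copy);
// allocate a rewritten String only when an escape must be expanded.
fn unescape(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        return Cow::Borrowed(raw); // zero-copy fast path
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('t') => out.push('\t'),
                Some(other) => out.push(other), // simplified: \" \\ etc.
                None => {}
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}

fn main() {
    // Unescaped input borrows; escaped input allocates.
    assert!(matches!(unescape("plain"), Cow::Borrowed(_)));
    assert!(matches!(unescape(r"a\nb"), Cow::Owned(_)));
    assert_eq!(unescape(r"a\nb"), "a\nb");
}
```

The payoff is that the common case (most textproto strings contain no escapes) never touches the allocator.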

TypeRegistry — unified type/extension registry

  • JsonRegistry → TypeRegistry with feature-split entry types: JsonAnyEntry,
    TextAnyEntry, JsonExtEntry, TextExtEntry each under their own cfg gate
  • json and text features can now be enabled independently (previously json
    pulled in text because unified entry structs referenced TextEncoder)
  • Codegen emits one register_types() per file covering whichever formats
    were enabled — lines inside are emitted per generate_json/generate_text,
    no #[cfg] in generated code
  • Any expansion ([type.googleapis.com/...] { ... }) and extension brackets
    ([pkg.ext] { ... }) both consult TypeRegistry
  • Deprecated aliases: JsonRegistry, set_json_registry, AnyTypeEntry,
    ExtensionRegistryEntry

Codegen

  • generate_text(bool) config option and buffa-build builder method
  • impl TextFormat emission in impl_text.rs, skip-gated for Any (hand-written
    in buffa-types) and MessageSet
  • Group-field naming: GroupLikeType { ... } (message type name) and
    groupliketype { ... } (lowercase/field name) both accepted on parse;
    type name emitted on encode
  • Separate __FOO_JSON_ANY / __FOO_TEXT_ANY consts per codegen flag
  • emit_register_fn config (default true) — gen_wkt_types sets false since
    seven include!'d WKT files would collide on register_types

Conformance

  • Text suite: 0 → 883 passing (was entirely skipped), zero expected failures
  • Binary+JSON unchanged at 5549/5529/2797
  • known_failures_text.txt added (empty — passes the full suite)
  • print_unknown_fields request field parsed and threaded to TextEncoder

Docs & examples

  • guide.md "Text format (textproto)" section
  • README/CHANGELOG/CONTRIBUTING/migration-from-protobuf.md updated
  • addressbook example gains a dump command (encode_to_string_pretty)
  • Taskfile: build-examples, example-addressbook, example-envelope,
    example-logging tasks (USER_WORKING_DIR for file-path preservation)

iainmcgin marked this pull request as ready for review March 23, 2026 19:25
iainmcgin requested a review from asacamano March 23, 2026 19:25
asacamano previously approved these changes Mar 24, 2026

@asacamano left a comment

Happy for this to be submitted and then anything you want addressed in later PRs - or happy to re-review.

let bytes = m.encode_to_vec();
// Freshly-encoded → re-decode cannot fail short of an encoder bug.
let inner = UnknownFields::decode_from_slice(&bytes)
.expect("BUG: re-decoding freshly-encoded message bytes failed");

This feels reachable - and it might be nice to return a ParseError rather than an panic on the expect()

Collaborator Author

[claude code] Fixed in ed6363d — added ParseErrorKind::Internal(&'static str) (non-breaking, enum is #[non_exhaustive]) and converted the .expect() to return it with line:col of 0:0. Error message is "internal error (this is a buffa bug): re-decoding freshly-encoded group message bytes failed" so the it's-not-your-input signal is preserved without the panic.
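The pattern described here, adding an Internal variant to a #[non_exhaustive] error enum so a "can't happen" path returns an error instead of panicking, looks roughly like this. Names are simplified for illustration; this is not the actual buffa code.

```rust
// A #[non_exhaustive] enum may gain variants without a semver break,
// which is what makes adding Internal non-breaking.
#[derive(Debug, PartialEq)]
#[non_exhaustive]
enum ParseErrorKind {
    UnexpectedToken,
    Internal(&'static str),
}

// Hypothetical stand-in for the re-decode that previously used .expect().
fn redecode(bytes: &[u8]) -> Result<Vec<u8>, ParseErrorKind> {
    if bytes.is_empty() {
        // Unreachable absent an encoder bug, but surfaced as Err, not panic.
        return Err(ParseErrorKind::Internal(
            "re-decoding freshly-encoded bytes failed",
        ));
    }
    Ok(bytes.to_vec())
}

fn main() {
    assert!(redecode(&[1, 2]).is_ok());
    assert_eq!(
        redecode(&[]),
        Err(ParseErrorKind::Internal(
            "re-decoding freshly-encoded bytes failed"
        ))
    );
}
```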

/// separators.
#[derive(Clone, Copy, PartialEq, Eq)]
enum LastKind {
Bof,

nit - the others are clear, but "Bof" is an acronym (which I didn't immediately guess because I read it in another context). A translation to "beginning of file" would be helpful.

Collaborator Author

[claude code] Fixed in ed6363d — added /// Beginning of file — no token emitted yet. on the variant.


match self.last_kind {
LastKind::Bof => {
// Top-level: expect EOF or a field name.

There's an interesting edge case here with a byte order mark [0xEF, 0xBB, 0xBF]. It seems like Rust's conversion methods want to be data preserving, so they translate the BOM to valid UTF-8, which the proto parser then rejects as an invalid first character of the file.

It might be nice here to add a specific error message for "this UTF-8 is not valid protobuf because it contains a Byte Order Mark".

Collaborator Author

[claude code] Fixed in 8d39b7a — went with silent skip rather than a specific error, since a BOM is semantically meaningless at file-start (same as most JSON parsers). Tokenizer::new checks input.starts_with('\u{FEFF}') and initializes cursor = 3 if so. A BOM-only file correctly tokenizes as empty.
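The BOM-skip described above can be sketched like this, assuming a byte-cursor tokenizer of the kind the comment describes (the struct here is a simplified stand-in, not buffa's tokenizer):

```rust
// U+FEFF encodes to the 3 bytes EF BB BF in UTF-8, so skipping it means
// starting the byte cursor at offset 3.
struct Tokenizer<'a> {
    input: &'a str,
    cursor: usize,
}

impl<'a> Tokenizer<'a> {
    fn new(input: &'a str) -> Self {
        let cursor = if input.starts_with('\u{FEFF}') { 3 } else { 0 };
        Tokenizer { input, cursor }
    }

    // What the lexer would actually see after construction.
    fn remaining(&self) -> &str {
        &self.input[self.cursor..]
    }
}

fn main() {
    assert_eq!(Tokenizer::new("\u{FEFF}a: 1").remaining(), "a: 1");
    assert_eq!(Tokenizer::new("a: 1").remaining(), "a: 1");
    // A BOM-only file correctly tokenizes as empty.
    assert_eq!(Tokenizer::new("\u{FEFF}").remaining(), "");
}
```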

unescape_str(tok.raw).map_err(|why| {
self.err_at(
&tok,
if why == "invalid UTF-8" {

nit - could this be an error instead of a string?

Collaborator Author

[claude code] Fixed in ed6363d — added enum UnescapeError { InvalidUtf8, BadEscape(&'static str) }. unescape/unescape_str/push_utf8 all return it now; this call site matches instead of string-comparing. read_bytes destructures with let UnescapeError::BadEscape(why) = e else { unreachable!() } since byte-level unescape never returns InvalidUtf8.
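The two-variant error described above lets callers match on structure instead of comparing error strings. A minimal sketch of the shape (the `unescape_bytes` function and its failure condition are hypothetical, standing in for the real byte-level unescape):

```rust
// Callers match on the variant rather than string-comparing "invalid UTF-8".
#[derive(Debug, PartialEq)]
enum UnescapeError {
    InvalidUtf8,
    BadEscape(&'static str),
}

// Byte-level unescape can only fail with BadEscape; InvalidUtf8 is
// reserved for the str-producing path.
fn unescape_bytes(raw: &[u8]) -> Result<Vec<u8>, UnescapeError> {
    if raw.ends_with(b"\\") {
        return Err(UnescapeError::BadEscape("trailing backslash"));
    }
    Ok(raw.to_vec())
}

fn main() {
    assert!(unescape_bytes(b"ok").is_ok());
    match unescape_bytes(b"bad\\") {
        Err(UnescapeError::BadEscape(why)) => assert_eq!(why, "trailing backslash"),
        other => panic!("unexpected: {:?}", other),
    }
}
```

The `let ... else { unreachable!() }` destructuring mentioned in the reply works because the byte path's return type statically admits both variants even though only one can occur.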

name: &str,
extendee: &str,
) -> Result<Vec<crate::UnknownField>, ParseError> {
let full = name

Does this also need to strip whitespace, e.g. [ foo ] vs [foo]?

Collaborator Author

[claude code] Fixed in 8d39b7a — added .map(str::trim) after the bracket-strip here and in read_any_expansion (same pattern at line 569). Token.raw is a slice of the input so it preserves the whitespace; the tokenizer's internal skip at line 959 only affects the cursor, not raw.
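The bracket-strip-then-trim pattern can be sketched as below. The `bracket_name` helper is hypothetical; it only illustrates why trimming after stripping the brackets makes `[ pkg.ext ]` resolve the same as `[pkg.ext]`:

```rust
// Strip the surrounding brackets, then trim interior whitespace so the
// registry lookup sees a clean name either way.
fn bracket_name(raw: &str) -> Option<&str> {
    raw.strip_prefix('[')?
        .strip_suffix(']')
        .map(str::trim)
}

fn main() {
    assert_eq!(bracket_name("[pkg.ext]"), Some("pkg.ext"));
    assert_eq!(bracket_name("[ pkg.ext ]"), Some("pkg.ext"));
    assert_eq!(bracket_name("pkg.ext"), None); // not a bracket name
}
```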

}
// After a closing quote: skip whitespace, then check for another quote.
while let Some(&c) = s.first() {
if matches!(c, b' ' | b'\t' | b'\r' | b'\n') {

https://protobuf.dev/reference/protobuf/textformat-spec/#whitespace also wants vertical tab and form feed to be treated as whitespace.

Collaborator Author

[claude code] Fixed in 8d39b7a — extracted is_textproto_ws(c: u8) -> bool in token.rs and replaced all 5 sites that had the incomplete set (this one, string.rs:184, token.rs:271/915/959). Test added: "a:\x0B1\x0Cb: 2" tokenizes correctly.
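The extracted predicate is small enough to reproduce in full. This sketch matches the WHITESPACE set from the textformat spec (space, tab, CR, LF, vertical tab 0x0B, form feed 0x0C); the signature follows the one named in the reply:

```rust
// The textformat spec's WHITESPACE set, including the two bytes the
// original five call sites were missing.
fn is_textproto_ws(c: u8) -> bool {
    matches!(c, b' ' | b'\t' | b'\r' | b'\n' | 0x0B | 0x0C)
}

fn main() {
    assert!(is_textproto_ws(b'\x0B')); // vertical tab
    assert!(is_textproto_ws(b'\x0C')); // form feed
    assert!(is_textproto_ws(b' '));
    assert!(is_textproto_ws(b'\n'));
    assert!(!is_textproto_ws(b'a'));
}
```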

/// `-`. Returns the byte length, or 0 if no identifier is present.
///
/// Reference: protobuf-go `parseIdent`.
fn parse_ident(s: &[u8], allow_neg: bool) -> usize {

Here's a rabbit hole.

Should foo: -42 and foo: -inf and foo: - 42 and foo: - inf all work the same way?

https://protobuf.dev/reference/protobuf/textformat-spec/#value seems to be unclear. But my instinct is that if "- 42" is a number, "- inf" should also be a number.

Collaborator Author

[claude code] Fixed in 8d39b7a — your instinct was right. lex_number (line 763) already calls consume_ws after -, so - 42 works; but parse_ident (this function, the -inf/-nan path) didn't. Added the same consume_ws call after consuming the sign. Decoder side (read_f64 line 252) adds a .trim_start() after strip_prefix('-') since the raw span now contains the whitespace. Added "f: - inf" and "f: -\tinf" to read_f64_table.
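The sign-then-whitespace handling described above can be sketched as a standalone parser. `parse_signed_float` is a hypothetical simplification (real lexing happens token-by-token), but it shows the behavior the fix converges on: consume the sign, skip whitespace, then read the literal.

```rust
// Accept optional whitespace between a leading '-' and the literal,
// so "- 42" and "- inf" parse the same as "-42" and "-inf".
fn parse_signed_float(s: &str) -> Option<f64> {
    let (neg, rest) = match s.strip_prefix('-') {
        Some(rest) => (true, rest.trim_start()),
        None => (false, s.trim_start()),
    };
    let v = match rest {
        "inf" => f64::INFINITY,
        _ => rest.parse::<f64>().ok()?,
    };
    Some(if neg { -v } else { v })
}

fn main() {
    assert_eq!(parse_signed_float("-42"), Some(-42.0));
    assert_eq!(parse_signed_float("- 42"), Some(-42.0));
    assert_eq!(parse_signed_float("- inf"), Some(f64::NEG_INFINITY));
    assert_eq!(parse_signed_float("-\tinf"), Some(f64::NEG_INFINITY));
}
```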


/// Encode a message as a multi-line textproto string with 2-space indent.
///
/// One field per line; nested messages are indented. No trailing newline.

Is the no-trailing-newline behavior part of the conformance test?

https://protobuf.dev/reference/protobuf/textformat-spec/#working-with-the-format-programmatically suggests following the behavior of https://github.com/protocolbuffers/txtpbfmt which does emit a trailing newline.

Collaborator Author

[claude code] Fixed in ed6363d — encode_to_string_pretty appends \n when the output is non-empty. Single-line encode_to_string unchanged (it's a value, not a file). Not conformance-relevant — the conformance runner doesn't use the pretty form.

/// For `Name` tokens: was this name followed by a `:` separator?
/// The textproto grammar makes `:` optional before messages but required
/// before scalars.
pub has_separator: bool,

Nit: is this only used in tests?

Collaborator Author

[claude code] Not addressed yet — it's written at 7 production sites in the tokenizer but the only read in buffa/src/text/ is the test at line 1111. Before touching it I'd want to grep the impl_text.rs-generated code (the decoder-side merge_text impls) to confirm nothing there reads it. If it's genuinely dead: drop the field entirely, one fewer word on every Token. Deferred.

// serialize_extensions dedup loop.
let mut seen: alloc::vec::Vec<u32> = alloc::vec::Vec::new();
for uf in fields.iter() {
if seen.contains(&uf.number) {

With a bit of extra code, this could be less than O(n^2). Probably not worth it, but maybe leaving a comment saying "most proto messages are small enough and the extensions are small enough" that we decided to leave it. Or we could just get Claude to do the work and pre-optimize it for large complex messages. Your call :)

Collaborator Author

[claude code] Fixed in ed6363d — went with the cheap fix: Vec<u32> + contains + push → BTreeSet<u32> + insert (returns false on duplicate, so the if !seen.insert(...) form collapses two operations to one). BTreeSet because this crate is no_std+alloc.
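The dedup change can be illustrated with a small standalone example (the `first_occurrences` function is a stand-in for the serialize_extensions loop, not the actual code):

```rust
use std::collections::BTreeSet;

// BTreeSet::insert returns false on a duplicate, collapsing the old
// contains-then-push pair into one O(log n) operation.
// (BTreeSet rather than HashSet because the crate is no_std + alloc.)
fn first_occurrences(numbers: &[u32]) -> Vec<u32> {
    let mut seen: BTreeSet<u32> = BTreeSet::new();
    let mut out = Vec::new();
    for &n in numbers {
        if seen.insert(n) {
            out.push(n); // first time this field number appears
        }
    }
    out
}

fn main() {
    assert_eq!(first_occurrences(&[5, 3, 5, 7, 3]), vec![5, 3, 7]);
}
```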

pub fn line_col(&self, pos: usize) -> (u32, u32) {
let pos = pos.min(self.input.len());
let before = &self.input[..pos];
let line = before.bytes().filter(|&b| b == b'\n').count() as u32 + 1;

Any concern about really large files and u32 limits?

Collaborator Author

[claude code] Documented in ed6363d — added a doc-comment line noting the ~4 GiB bound and that this is error-reporting only. Didn't add saturation logic since a textproto that large isn't a realistic input.

Conflicts were all in buffa-types/src/generated/ — main added Self
emission (PR #15) and proto source comments (PR #7), our branch added
TextFormat impls. Resolved by regenerating with the merged codegen
source, which applies all three.

No conflicts in hand-written source — the codegen changes (message.rs
FieldInfo.struct_field_type, impl_text.rs, etc.) merged automatically.

Addresses four review comments on the tokenizer/decoder:

- WHITESPACE per spec includes vertical tab and form feed, which five
  call sites were missing. Extracted is_textproto_ws() helper; all
  whitespace-skipping sites now call it.

- `- inf` (whitespace between sign and literal) is valid per the
  grammar (`FLOAT = [ "-" ] , FLOAT_LIT` where `,` means optional
  whitespace). lex_number already handled `- 42`; parse_ident now
  matches for the literal path. Decoder side trims after stripping
  the sign.

- `[ pkg.ext ]` (whitespace inside extension/Any brackets) is valid
  per the grammar. Token.raw preserves input bytes, so the two
  bracket-strip sites in the decoder now .trim() before registry
  lookup.

- A leading UTF-8 BOM (U+FEFF) is silently skipped. Some editors
  prepend one; without this it fails with a cryptic 'expected field
  name' at 1:1.

Addresses the remaining non-spec review comments:

- UnescapeError enum replaces string-compare error dispatch in the
  decoder. unescape/unescape_str/push_utf8 now return a two-variant
  enum (InvalidUtf8 | BadEscape(&str)); callers match instead of
  string-comparing "invalid UTF-8".

- ParseErrorKind::Internal added. group_merge_text's re-decode of
  freshly-encoded bytes now returns this instead of panicking. Still
  unreachable absent an encoder bug, but a library shouldn't panic
  from that either.

- Extension-field dedup uses BTreeSet instead of Vec.contains,
  removing the quadratic lookup for extension-heavy messages.

- encode_to_string_pretty appends a trailing newline when non-empty,
  matching txtpbfmt and POSIX text-file conventions. Single-line
  encode_to_string is unchanged.

- Bof enum variant documented; line_col u32 bounds noted in doc.
@iainmcgin
Collaborator Author

[claude code] Merge conflicts resolved + review follow-ups applied in 3 commits:

7ee6310 — merge resolution. All 7 conflicts were in buffa-types/src/generated/ where main added Self-emission (#15) + proto source comments (#7) while this branch added TextFormat impls. No conflicts in hand-written source — the codegen changes merged automatically. Resolved by regenerating with the merged codegen, which applies all three.

8d39b7a — spec-compliance parsing gaps (+79/−11, 3 files):

  • \v/\f whitespace: extracted is_textproto_ws(), replaced 5 sites (spec)
  • - inf with whitespace-after-sign: parse_ident now matches lex_number's existing - 42 handling; decoder trims after sign-strip
  • [ pkg.ext ] whitespace: .trim() after bracket-strip in read_extension and read_any_expansion
  • Leading UTF-8 BOM silently skipped in Tokenizer::new

ed6363d — error-handling & polish (+79/−35, 8 files):

  • UnescapeError enum replaces &'static str error-return; decoder matches instead of string-comparing "invalid UTF-8"
  • ParseErrorKind::Internal added; group_merge_text's re-decode .expect() converted (unreachable absent encoder bug, but a library shouldn't panic)
  • Extension dedup: Vec::contains → BTreeSet::insert
  • encode_to_string_pretty appends trailing \n (matches txtpbfmt); single-line encode_to_string unchanged; conformance unaffected (doesn't use pretty)
  • Bof variant documented; line_col u32 bounds noted

Not addressed: the name_kind question (token.rs:87) — it's written at 7 production sites but I couldn't find a production reader in the text module. Needs a grep across impl_text.rs-generated code to confirm whether it's dead before touching it.

task lint clean, 1390 workspace tests pass.

All four known_failures*.txt files are empty on this branch: binary,
JSON, and text all pass in std, no_std, and via-view modes. Updated
the Why-buffa blurb to reflect all three formats and dropped the
per-suite 883/883 count from the text bullet since it's now covered
by the top-line claim.
iainmcgin requested a review from asacamano March 30, 2026 19:13
iainmcgin enabled auto-merge (squash) March 30, 2026 19:14
iainmcgin merged commit cd002f6 into main Mar 30, 2026
7 checks passed
iainmcgin deleted the textproto-support branch March 30, 2026 20:17
github-actions bot locked and limited conversation to collaborators Mar 30, 2026