Add textproto (text format) encoding and decoding #6
Conversation
The protobuf text format is the human-readable debug representation — useful
for config files, golden-file tests, and logging. This implements the full
format with 883/883 conformance suite coverage.
Runtime (feature = "text", zero deps, no_std + alloc)
- buffa::text module: TextFormat trait, TextEncoder, TextDecoder,
encode_to_string / decode_from_str conveniences
- Tokenizer handles all grammar quirks: both { } and < > delimiters,
adjacent string concatenation, #-comments, [pkg.ext] bracket names,
hex/octal/float number forms, nan/inf, C++-style bool literals
- Decoder returns Cow<'a, str> for strings (zero-copy when no escapes)
- \u escapes are BMP-only (surrogates rejected per spec; use \U for non-BMP)
- Heuristic nested-message detection for unknown LD fields (speculative
wire-format parse, C++ text_format.cc:2926 / Java TextFormat.java:87)
TypeRegistry — unified type/extension registry
- JsonRegistry → TypeRegistry with feature-split entry types: JsonAnyEntry,
TextAnyEntry, JsonExtEntry, TextExtEntry each under their own cfg gate
- json and text features are now independently enableable (previously json
pulled in text because unified entry structs referenced TextEncoder)
- Codegen emits one register_types() per file covering whichever formats
were enabled — lines inside are emitted per generate_json/generate_text,
no #[cfg] in generated code
- Any expansion ([type.googleapis.com/...] { ... }) and extension brackets
([pkg.ext] { ... }) both consult TypeRegistry
- Deprecated aliases: JsonRegistry, set_json_registry, AnyTypeEntry,
ExtensionRegistryEntry
Codegen
- generate_text(bool) config option and buffa-build builder method
- impl TextFormat emission in impl_text.rs, skip-gated for Any (hand-written
in buffa-types) and MessageSet
- Group-field naming: GroupLikeType { ... } (message type name) and
groupliketype { ... } (lowercase/field name) both accepted on parse;
type name emitted on encode
- Separate __FOO_JSON_ANY / __FOO_TEXT_ANY consts per codegen flag
- emit_register_fn config (default true) — gen_wkt_types sets false since
seven include!'d WKT files would collide on register_types
Conformance
- Text suite: 0 → 883 passing (was entirely skipped), zero expected failures
- Binary+JSON unchanged at 5549/5529/2797
- known_failures_text.txt added (empty — passes the full suite)
- print_unknown_fields request field parsed and threaded to TextEncoder
Docs & examples
- guide.md "Text format (textproto)" section
- README/CHANGELOG/CONTRIBUTING/migration-from-protobuf.md updated
- addressbook example gains a dump command (encode_to_string_pretty)
- Taskfile: build-examples, example-addressbook, example-envelope,
example-logging tasks (USER_WORKING_DIR for file-path preservation)
asacamano left a comment:
Happy for this to be submitted, with anything you want addressed in later PRs - or happy to re-review.
buffa/src/type_registry.rs
Outdated
let bytes = m.encode_to_vec();
// Freshly-encoded → re-decode cannot fail short of an encoder bug.
let inner = UnknownFields::decode_from_slice(&bytes)
    .expect("BUG: re-decoding freshly-encoded message bytes failed");
This feels reachable - and it might be nice to return a ParseError rather than an panic on the expect()
[claude code] Fixed in ed6363d — added ParseErrorKind::Internal(&'static str) (non-breaking, enum is #[non_exhaustive]) and converted the .expect() to return it with line:col of 0:0. Error message is "internal error (this is a buffa bug): re-decoding freshly-encoded group message bytes failed" so the it's-not-your-input signal is preserved without the panic.
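For illustration, a standalone sketch of the pattern described in that reply. The shapes are assumed (the real buffa definitions have more variants and fields); the key point is that `#[non_exhaustive]` lets a variant be added without a breaking change, and the former panic becomes a returned error at position 0:0:

```rust
// Sketch only — assumed shapes, not the actual buffa types.
#[non_exhaustive]
#[derive(Debug, PartialEq)]
enum ParseErrorKind {
    /// A bug in this library, never a problem with the caller's input.
    Internal(&'static str),
}

#[derive(Debug, PartialEq)]
struct ParseError {
    kind: ParseErrorKind,
    line: u32,
    col: u32,
}

/// What the `.expect()` call site turns into: surface an internal
/// error instead of panicking.
fn internal_error(msg: &'static str) -> ParseError {
    ParseError { kind: ParseErrorKind::Internal(msg), line: 0, col: 0 }
}
```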
/// separators.
#[derive(Clone, Copy, PartialEq, Eq)]
enum LastKind {
    Bof,
nit - the others are clear, but "Bof" is an acronym (which I didn't immediately guess because I've read it in another context). A translation to "beginning of file" would be helpful.
[claude code] Fixed in ed6363d — added /// Beginning of file — no token emitted yet. on the variant.
match self.last_kind {
    LastKind::Bof => {
        // Top-level: expect EOF or a field name.
There's an interesting edge case here where a byte order mark [0xEF, 0xBB, 0xBF] appears at the start of the file. It seems like Rust's conversion methods want to be data-preserving, so they translate the BOM to valid UTF-8, which the proto parser then rejects as the invalid first character of the file.
It might be nice here to add a specific error message for "this UTF-8 is not valid protobuf because it contains a Byte Order Mark".
[claude code] Fixed in 8d39b7a — went with silent skip rather than a specific error, since a BOM is semantically meaningless at file-start (same as most JSON parsers). Tokenizer::new checks input.starts_with('\u{FEFF}') and initializes cursor = 3 if so. A BOM-only file correctly tokenizes as empty.
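For illustration, a minimal standalone sketch of the silent-skip approach described above (field and method names are assumptions; the real `Tokenizer` carries more state):

```rust
/// Sketch: a leading U+FEFF is semantically meaningless at file start,
/// so the tokenizer starts its cursor past it. U+FEFF encodes as the
/// three bytes EF BB BF in UTF-8, hence cursor = 3.
struct Tokenizer<'a> {
    input: &'a str,
    cursor: usize,
}

impl<'a> Tokenizer<'a> {
    fn new(input: &'a str) -> Self {
        let cursor = if input.starts_with('\u{FEFF}') { 3 } else { 0 };
        Tokenizer { input, cursor }
    }

    /// A BOM-only file correctly tokenizes as empty.
    fn at_eof(&self) -> bool {
        self.input[self.cursor..].is_empty()
    }
}
```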
buffa/src/text/decoder.rs
Outdated
unescape_str(tok.raw).map_err(|why| {
    self.err_at(
        &tok,
        if why == "invalid UTF-8" {
nit - could this be an error instead of a string?
[claude code] Fixed in ed6363d — added enum UnescapeError { InvalidUtf8, BadEscape(&'static str) }. unescape/unescape_str/push_utf8 all return it now; this call site matches instead of string-comparing. read_bytes destructures with let UnescapeError::BadEscape(why) = e else { unreachable!() } since byte-level unescape never returns InvalidUtf8.
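For illustration, a standalone sketch of the two-variant error described in that reply, with an assumed caller showing the match-instead-of-string-compare dispatch:

```rust
/// Sketch of the enum described above; variant names match the reply,
/// the surrounding API is assumed.
#[derive(Debug, PartialEq)]
enum UnescapeError {
    InvalidUtf8,
    BadEscape(&'static str),
}

/// Illustrative caller: dispatch on the variant rather than comparing
/// error strings. Byte-level unescape only ever produces BadEscape,
/// so callers there can destructure the BadEscape payload directly.
fn describe(e: &UnescapeError) -> &'static str {
    match e {
        UnescapeError::InvalidUtf8 => "invalid UTF-8",
        UnescapeError::BadEscape(why) => why,
    }
}
```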
name: &str,
extendee: &str,
) -> Result<Vec<crate::UnknownField>, ParseError> {
    let full = name
Does this also need to strip whitespace, e.g. [ foo ] vs [foo]?
[claude code] Fixed in 8d39b7a — added .map(str::trim) after the bracket-strip here and in read_any_expansion (same pattern at line 569). Token.raw is a slice of the input so it preserves the whitespace; the tokenizer's internal skip at line 959 only affects the cursor, not raw.
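For illustration, the bracket-strip-then-trim pattern described above as a standalone function (the name `bracket_name` is an assumption, not the actual buffa helper):

```rust
/// Sketch: `[ pkg.ext ]` and `[pkg.ext]` resolve to the same registry
/// key. The token's raw slice preserves the input bytes including any
/// interior whitespace, so trim after stripping the brackets.
fn bracket_name(raw: &str) -> Option<&str> {
    raw.strip_prefix('[')
        .and_then(|s| s.strip_suffix(']'))
        .map(str::trim)
}
```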
buffa/src/text/string.rs
Outdated
}
// After a closing quote: skip whitespace, then check for another quote.
while let Some(&c) = s.first() {
    if matches!(c, b' ' | b'\t' | b'\r' | b'\n') {
https://protobuf.dev/reference/protobuf/textformat-spec/#whitespace also wants vertical tab and form feed to be treated as whitespace.
[claude code] Fixed in 8d39b7a — extracted is_textproto_ws(c: u8) -> bool in token.rs and replaced all 5 sites that had the incomplete set (this one, string.rs:184, token.rs:271/915/959). Test added: "a:\x0B1\x0Cb: 2" tokenizes correctly.
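For illustration, a sketch of that helper with the spec-complete whitespace set (the signature matches the reply; the body is an assumption consistent with the textformat spec's WHITESPACE production):

```rust
/// Whitespace per the textproto spec: space, horizontal tab, newline,
/// carriage return, vertical tab (0x0B), and form feed (0x0C).
/// The last two were the ones missing at the five call sites.
fn is_textproto_ws(c: u8) -> bool {
    matches!(c, b' ' | b'\t' | b'\r' | b'\n' | 0x0B | 0x0C)
}
```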
/// `-`. Returns the byte length, or 0 if no identifier is present.
///
/// Reference: protobuf-go `parseIdent`.
fn parse_ident(s: &[u8], allow_neg: bool) -> usize {
Here's a rabbit hole.
Should foo: -42, foo: -inf, foo: - 42, and foo: - inf all work the same way?
https://protobuf.dev/reference/protobuf/textformat-spec/#value seems to be unclear, but my instinct is that if "- 42" is a number, "- inf" should also be a number.
[claude code] Fixed in 8d39b7a — your instinct was right. lex_number (line 763) already calls consume_ws after -, so - 42 works; but parse_ident (this function, the -inf/-nan path) didn't. Added the same consume_ws call after consuming the sign. Decoder side (read_f64 line 252) adds a .trim_start() after strip_prefix('-') since the raw span now contains the whitespace. Added "f: - inf" and "f: -\tinf" to read_f64_table.
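For illustration, a standalone sketch of the decoder-side behavior described above for the `inf` literal path only (function name and scope are assumptions; the real code also covers `nan`, numbers, and more):

```rust
/// Sketch: whitespace may appear between the sign and the literal,
/// so "- inf" and "-\tinf" parse the same as "-inf".
fn parse_signed_inf(s: &str) -> Option<f64> {
    let (neg, rest) = match s.strip_prefix('-') {
        // Trim after stripping the sign, mirroring the read_f64 fix.
        Some(rest) => (true, rest.trim_start()),
        None => (false, s),
    };
    match rest {
        "inf" => Some(if neg { f64::NEG_INFINITY } else { f64::INFINITY }),
        _ => None, // numbers, nan, etc. handled elsewhere in the real code
    }
}
```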
buffa/src/text/mod.rs
Outdated
/// Encode a message as a multi-line textproto string with 2-space indent.
///
/// One field per line; nested messages are indented. No trailing newline.
Is no-trailing-newline part of the conformance test?
https://protobuf.dev/reference/protobuf/textformat-spec/#working-with-the-format-programmatically suggests following the behavior of https://github.com/protocolbuffers/txtpbfmt, which does emit a trailing newline.
[claude code] Fixed in ed6363d — encode_to_string_pretty appends \n when the output is non-empty. Single-line encode_to_string unchanged (it's a value, not a file). Not conformance-relevant — the conformance runner doesn't use the pretty form.
/// For `Name` tokens: was this name followed by a `:` separator?
/// The textproto grammar makes `:` optional before messages but required
/// before scalars.
pub has_separator: bool,
[claude code] Not addressed yet — it's written at 7 production sites in the tokenizer but the only read in buffa/src/text/ is the test at line 1111. Before touching it I'd want to grep the impl_text.rs-generated code (the decoder-side merge_text impls) to confirm nothing there reads it. If it's genuinely dead: drop the field entirely, one fewer word on every Token. Deferred.
buffa/src/text/encoder.rs
Outdated
// serialize_extensions dedup loop.
let mut seen: alloc::vec::Vec<u32> = alloc::vec::Vec::new();
for uf in fields.iter() {
    if seen.contains(&uf.number) {
With a bit of extra code, this could be less than O(n^2). Probably not worth it, but maybe leaving a comment saying "most proto messages are small enough and the extensions are small enough" that we decided to leave it. Or we could just get Claude to do the work and pre-optimize it for large complex messages. Your call :)
[claude code] Fixed in ed6363d — went with the cheap fix: Vec<u32> + contains + push → BTreeSet<u32> + insert (returns false on duplicate, so the if !seen.insert(...) form collapses two operations to one). BTreeSet because this crate is no_std+alloc.
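For illustration, a standalone sketch of the dedup shape described above (the surrounding function is an assumption; the real loop iterates unknown extension fields):

```rust
use std::collections::BTreeSet; // the real code uses alloc::collections::BTreeSet for no_std

/// Sketch: BTreeSet::insert returns false on a duplicate, so the
/// contains-then-push pair collapses into one O(log n) call and the
/// quadratic Vec::contains scan disappears.
fn first_occurrences(numbers: &[u32]) -> Vec<u32> {
    let mut seen = BTreeSet::new();
    let mut out = Vec::new();
    for &n in numbers {
        if seen.insert(n) {
            out.push(n); // first occurrence only, input order preserved
        }
    }
    out
}
```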
pub fn line_col(&self, pos: usize) -> (u32, u32) {
    let pos = pos.min(self.input.len());
    let before = &self.input[..pos];
    let line = before.bytes().filter(|&b| b == b'\n').count() as u32 + 1;
Any concern about really large files and u32 limits?
[claude code] Documented in ed6363d — added a doc-comment line noting the ~4 GiB bound and that this is error-reporting only. Didn't add saturation logic since a textproto that large isn't a realistic input.
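For illustration, a free-function sketch of the computation in the diff above, completing the column half that the excerpt cuts off (the column expression is an assumption consistent with the shown line count):

```rust
/// Sketch: count newlines before `pos` for the 1-based line number;
/// bytes since the last newline for the 1-based column. Both fit u32,
/// so inputs past ~4 GiB would be misreported — acceptable for
/// error reporting only, as the doc comment notes.
fn line_col(input: &str, pos: usize) -> (u32, u32) {
    let pos = pos.min(input.len());
    let before = &input[..pos];
    let line = before.bytes().filter(|&b| b == b'\n').count() as u32 + 1;
    let line_start = before.rfind('\n').map(|i| i + 1).unwrap_or(0);
    let col = (pos - line_start) as u32 + 1;
    (line, col)
}
```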
[claude code] Merge conflicts resolved + review follow-ups applied in 3 commits:

Conflicts were all in buffa-types/src/generated/ — main added Self emission (PR #15) and proto source comments (PR #7), our branch added TextFormat impls. Resolved by regenerating with the merged codegen source, which applies all three. No conflicts in hand-written source — the codegen changes (message.rs FieldInfo.struct_field_type, impl_text.rs, etc.) merged automatically.

Addresses four review comments on the tokenizer/decoder:
- WHITESPACE per spec includes vertical tab and form feed, which five call sites were missing. Extracted is_textproto_ws() helper; all whitespace-skipping sites now call it.
- `- inf` (whitespace between sign and literal) is valid per the grammar (`FLOAT = [ "-" ] , FLOAT_LIT` where `,` means optional whitespace). lex_number already handled `- 42`; parse_ident now matches for the literal path. Decoder side trims after stripping the sign.
- `[ pkg.ext ]` (whitespace inside extension/Any brackets) is valid per the grammar. Token.raw preserves input bytes, so the two bracket-strip sites in the decoder now .trim() before registry lookup.
- A leading UTF-8 BOM (U+FEFF) is silently skipped. Some editors prepend one; without this it fails with a cryptic 'expected field name' at 1:1.

Addresses the remaining non-spec review comments:
- UnescapeError enum replaces string-compare error dispatch in the decoder. unescape/unescape_str/push_utf8 now return a two-variant enum (InvalidUtf8 | BadEscape(&str)); callers match instead of string-comparing "invalid UTF-8".
- ParseErrorKind::Internal added. group_merge_text's re-decode of freshly-encoded bytes now returns this instead of panicking. Still unreachable absent an encoder bug, but a library shouldn't panic from that either.
- Extension-field dedup uses BTreeSet instead of Vec.contains, removing the quadratic lookup for extension-heavy messages.
- encode_to_string_pretty appends a trailing newline when non-empty, matching txtpbfmt and POSIX text-file conventions. Single-line encode_to_string is unchanged.
- Bof enum variant documented; line_col u32 bounds noted in doc.

Not addressed: the
All four known_failures*.txt files are empty on this branch: binary, JSON, and text all pass in std, no_std, and via-view modes. Updated the Why-buffa blurb to reflect all three formats and dropped the per-suite 883/883 count from the text bullet since it's now covered by the top-line claim.