
Add textproto (text format) encoding and decoding #6

Merged — iainmcgin merged 6 commits into main from textproto-support, Mar 30, 2026
Conversation

@iainmcgin
Collaborator

The protobuf text format is the human-readable debug representation — useful
for config files, golden-file tests, and logging. This implements the full
format with 883/883 conformance suite coverage.

Runtime (feature = "text", zero deps, no_std + alloc)

  • buffa::text module: TextFormat trait, TextEncoder, TextDecoder,
    encode_to_string / decode_from_str conveniences
  • Tokenizer handles all grammar quirks: both { } and < > delimiters,
    adjacent string concatenation, #-comments, [pkg.ext] bracket names,
    hex/octal/float number forms, nan/inf, C++-style bool literals
  • Decoder returns Cow<'a, str> for strings (zero-copy when no escapes)
  • \u escapes are BMP-only (surrogates rejected per spec; use \U for non-BMP)
  • Heuristic nested-message detection for unknown LD fields (speculative
    wire-format parse, C++ text_format.cc:2926 / Java TextFormat.java:87)
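The Cow-returning string decode mentioned above can be sketched as follows. This is a minimal standalone illustration, not buffa's actual decoder: the `unescape` helper here is hypothetical and handles only a couple of escape forms.

```rust
use std::borrow::Cow;

// Borrow the input when it contains no backslash escapes (zero-copy);
// allocate a rewritten String only when an escape must be expanded.
fn unescape(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        return Cow::Borrowed(raw); // zero-copy fast path
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('t') => out.push('\t'),
                Some(other) => out.push(other), // simplified: \" \\ etc.
                None => {}
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}

fn main() {
    // Unescaped input borrows; escaped input allocates.
    assert!(matches!(unescape("plain"), Cow::Borrowed(_)));
    assert!(matches!(unescape(r"a\nb"), Cow::Owned(_)));
    assert_eq!(unescape(r"a\nb"), "a\nb");
}
```

The payoff is that the common case (most textproto strings contain no escapes) never touches the allocator.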

TypeRegistry — unified type/extension registry

  • JsonRegistry → TypeRegistry with feature-split entry types: JsonAnyEntry,
    TextAnyEntry, JsonExtEntry, TextExtEntry each under their own cfg gate
  • json and text features can now be enabled independently (previously json
    pulled in text because unified entry structs referenced TextEncoder)
  • Codegen emits one register_types() per file covering whichever formats
    were enabled — lines inside are emitted per generate_json/generate_text,
    no #[cfg] in generated code
  • Any expansion ([type.googleapis.com/...] { ... }) and extension brackets
    ([pkg.ext] { ... }) both consult TypeRegistry
  • Deprecated aliases: JsonRegistry, set_json_registry, AnyTypeEntry,
    ExtensionRegistryEntry

Codegen

  • generate_text(bool) config option and buffa-build builder method
  • impl TextFormat emission in impl_text.rs, skip-gated for Any (hand-written
    in buffa-types) and MessageSet
  • Group-field naming: GroupLikeType { ... } (message type name) and
    groupliketype { ... } (lowercase/field name) both accepted on parse;
    type name emitted on encode
  • Separate __FOO_JSON_ANY / __FOO_TEXT_ANY consts per codegen flag
  • emit_register_fn config (default true) — gen_wkt_types sets false since
    seven include!'d WKT files would collide on register_types

Conformance

  • Text suite: 0 → 883 passing (was entirely skipped), zero expected failures
  • Binary+JSON unchanged at 5549/5529/2797
  • known_failures_text.txt added (empty — passes the full suite)
  • print_unknown_fields request field parsed and threaded to TextEncoder

Docs & examples

  • guide.md "Text format (textproto)" section
  • README/CHANGELOG/CONTRIBUTING/migration-from-protobuf.md updated
  • addressbook example gains a dump command (encode_to_string_pretty)
  • Taskfile: build-examples, example-addressbook, example-envelope,
    example-logging tasks (USER_WORKING_DIR for file-path preservation)

iainmcgin marked this pull request as ready for review March 23, 2026 19:25
iainmcgin requested a review from asacamano March 23, 2026 19:25
asacamano previously approved these changes Mar 24, 2026

@asacamano left a comment

Happy for this to be submitted and then anything you want addressed in later PRs - or happy to re-review.

let bytes = m.encode_to_vec();
// Freshly-encoded → re-decode cannot fail short of an encoder bug.
let inner = UnknownFields::decode_from_slice(&bytes)
.expect("BUG: re-decoding freshly-encoded message bytes failed");

This feels reachable - and it might be nice to return a ParseError rather than an panic on the expect()

Collaborator Author

[claude code] Fixed in ed6363d — added ParseErrorKind::Internal(&'static str) (non-breaking, enum is #[non_exhaustive]) and converted the .expect() to return it with line:col of 0:0. Error message is "internal error (this is a buffa bug): re-decoding freshly-encoded group message bytes failed" so the it's-not-your-input signal is preserved without the panic.
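The pattern described here, adding an Internal variant to a #[non_exhaustive] error enum so a "can't happen" path returns an error instead of panicking, looks roughly like this. Names are simplified for illustration; this is not the actual buffa code.

```rust
// A #[non_exhaustive] enum may gain variants without a semver break,
// which is what makes adding Internal non-breaking.
#[derive(Debug, PartialEq)]
#[non_exhaustive]
enum ParseErrorKind {
    UnexpectedToken,
    Internal(&'static str),
}

// Hypothetical stand-in for the re-decode that previously used .expect().
fn redecode(bytes: &[u8]) -> Result<Vec<u8>, ParseErrorKind> {
    if bytes.is_empty() {
        // Unreachable absent an encoder bug, but surfaced as Err, not panic.
        return Err(ParseErrorKind::Internal(
            "re-decoding freshly-encoded bytes failed",
        ));
    }
    Ok(bytes.to_vec())
}

fn main() {
    assert!(redecode(&[1, 2]).is_ok());
    assert_eq!(
        redecode(&[]),
        Err(ParseErrorKind::Internal(
            "re-decoding freshly-encoded bytes failed"
        ))
    );
}
```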

/// separators.
#[derive(Clone, Copy, PartialEq, Eq)]
enum LastKind {
Bof,

nit - the others are clear, but "Bof" is an acronym (which I didn't immediately guess because I read it in another context). A translation to "beginning of file" would be helpful.

Collaborator Author

[claude code] Fixed in ed6363d — added /// Beginning of file — no token emitted yet. on the variant.


match self.last_kind {
LastKind::Bof => {
// Top-level: expect EOF or a field name.

There's an interesting edge case here with a byte order mark [0xEF, 0xBB, 0xBF]. It seems like Rust's conversion methods want to be data preserving, so they translate the BOM to valid UTF-8, which the proto parser then rejects as an invalid first character of the file.

It might be nice here to add a specific error message for "this UTF-8 is not valid protobuf because it contains a Byte Order Mark".

Collaborator Author

[claude code] Fixed in 8d39b7a — went with silent skip rather than a specific error, since a BOM is semantically meaningless at file-start (same as most JSON parsers). Tokenizer::new checks input.starts_with('\u{FEFF}') and initializes cursor = 3 if so. A BOM-only file correctly tokenizes as empty.
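The BOM-skip described above can be sketched like this, assuming a byte-cursor tokenizer of the kind the comment describes (the struct here is a simplified stand-in, not buffa's tokenizer):

```rust
// U+FEFF encodes to the 3 bytes EF BB BF in UTF-8, so skipping it means
// starting the byte cursor at offset 3.
struct Tokenizer<'a> {
    input: &'a str,
    cursor: usize,
}

impl<'a> Tokenizer<'a> {
    fn new(input: &'a str) -> Self {
        let cursor = if input.starts_with('\u{FEFF}') { 3 } else { 0 };
        Tokenizer { input, cursor }
    }

    // What the lexer would actually see after construction.
    fn remaining(&self) -> &str {
        &self.input[self.cursor..]
    }
}

fn main() {
    assert_eq!(Tokenizer::new("\u{FEFF}a: 1").remaining(), "a: 1");
    assert_eq!(Tokenizer::new("a: 1").remaining(), "a: 1");
    // A BOM-only file correctly tokenizes as empty.
    assert_eq!(Tokenizer::new("\u{FEFF}").remaining(), "");
}
```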

unescape_str(tok.raw).map_err(|why| {
self.err_at(
&tok,
if why == "invalid UTF-8" {

nit - could this be an error instead of a string?

Collaborator Author

[claude code] Fixed in ed6363d — added enum UnescapeError { InvalidUtf8, BadEscape(&'static str) }. unescape/unescape_str/push_utf8 all return it now; this call site matches instead of string-comparing. read_bytes destructures with let UnescapeError::BadEscape(why) = e else { unreachable!() } since byte-level unescape never returns InvalidUtf8.
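The two-variant error described above lets callers match on structure instead of comparing error strings. A minimal sketch of the shape (the `unescape_bytes` function and its failure condition are hypothetical, standing in for the real byte-level unescape):

```rust
// Callers match on the variant rather than string-comparing "invalid UTF-8".
#[derive(Debug, PartialEq)]
enum UnescapeError {
    InvalidUtf8,
    BadEscape(&'static str),
}

// Byte-level unescape can only fail with BadEscape; InvalidUtf8 is
// reserved for the str-producing path.
fn unescape_bytes(raw: &[u8]) -> Result<Vec<u8>, UnescapeError> {
    if raw.ends_with(b"\\") {
        return Err(UnescapeError::BadEscape("trailing backslash"));
    }
    Ok(raw.to_vec())
}

fn main() {
    assert!(unescape_bytes(b"ok").is_ok());
    match unescape_bytes(b"bad\\") {
        Err(UnescapeError::BadEscape(why)) => assert_eq!(why, "trailing backslash"),
        other => panic!("unexpected: {:?}", other),
    }
}
```

The `let ... else { unreachable!() }` destructuring mentioned in the reply works because the byte path's return type statically admits both variants even though only one can occur.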

name: &str,
extendee: &str,
) -> Result<Vec<crate::UnknownField>, ParseError> {
let full = name

Does this also need to strip whitespace, e.g. [ foo ] vs [foo]?

Collaborator Author

[claude code] Fixed in 8d39b7a — added .map(str::trim) after the bracket-strip here and in read_any_expansion (same pattern at line 569). Token.raw is a slice of the input so it preserves the whitespace; the tokenizer's internal skip at line 959 only affects the cursor, not raw.
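The bracket-strip-then-trim pattern can be sketched as below. The `bracket_name` helper is hypothetical; it only illustrates why trimming after stripping the brackets makes `[ pkg.ext ]` resolve the same as `[pkg.ext]`:

```rust
// Strip the surrounding brackets, then trim interior whitespace so the
// registry lookup sees a clean name either way.
fn bracket_name(raw: &str) -> Option<&str> {
    raw.strip_prefix('[')?
        .strip_suffix(']')
        .map(str::trim)
}

fn main() {
    assert_eq!(bracket_name("[pkg.ext]"), Some("pkg.ext"));
    assert_eq!(bracket_name("[ pkg.ext ]"), Some("pkg.ext"));
    assert_eq!(bracket_name("pkg.ext"), None); // not a bracket name
}
```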

}
// After a closing quote: skip whitespace, then check for another quote.
while let Some(&c) = s.first() {
if matches!(c, b' ' | b'\t' | b'\r' | b'\n') {

https://protobuf.dev/reference/protobuf/textformat-spec/#whitespace also wants vertical tab and form feed to be treated as whitespace.

Collaborator Author

[claude code] Fixed in 8d39b7a — extracted is_textproto_ws(c: u8) -> bool in token.rs and replaced all 5 sites that had the incomplete set (this one, string.rs:184, token.rs:271/915/959). Test added: "a:\x0B1\x0Cb: 2" tokenizes correctly.
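The extracted predicate is small enough to reproduce in full. This sketch matches the WHITESPACE set from the textformat spec (space, tab, CR, LF, vertical tab 0x0B, form feed 0x0C); the signature follows the one named in the reply:

```rust
// The textformat spec's WHITESPACE set, including the two bytes the
// original five call sites were missing.
fn is_textproto_ws(c: u8) -> bool {
    matches!(c, b' ' | b'\t' | b'\r' | b'\n' | 0x0B | 0x0C)
}

fn main() {
    assert!(is_textproto_ws(b'\x0B')); // vertical tab
    assert!(is_textproto_ws(b'\x0C')); // form feed
    assert!(is_textproto_ws(b' '));
    assert!(is_textproto_ws(b'\n'));
    assert!(!is_textproto_ws(b'a'));
}
```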

/// `-`. Returns the byte length, or 0 if no identifier is present.
///
/// Reference: protobuf-go `parseIdent`.
fn parse_ident(s: &[u8], allow_neg: bool) -> usize {

Here's a rabbit hole.

Should foo: -42 and foo: -inf and foo: - 42 and foo: - inf all work the same way?

https://protobuf.dev/reference/protobuf/textformat-spec/#value seems to be unclear. But my instinct is that if "- 42" is a number, "- inf" should also be a number.

Collaborator Author

[claude code] Fixed in 8d39b7a — your instinct was right. lex_number (line 763) already calls consume_ws after -, so - 42 works; but parse_ident (this function, the -inf/-nan path) didn't. Added the same consume_ws call after consuming the sign. Decoder side (read_f64 line 252) adds a .trim_start() after strip_prefix('-') since the raw span now contains the whitespace. Added "f: - inf" and "f: -\tinf" to read_f64_table.
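The sign-then-whitespace handling described above can be sketched as a standalone parser. `parse_signed_float` is a hypothetical simplification (real lexing happens token-by-token), but it shows the behavior the fix converges on: consume the sign, skip whitespace, then read the literal.

```rust
// Accept optional whitespace between a leading '-' and the literal,
// so "- 42" and "- inf" parse the same as "-42" and "-inf".
fn parse_signed_float(s: &str) -> Option<f64> {
    let (neg, rest) = match s.strip_prefix('-') {
        Some(rest) => (true, rest.trim_start()),
        None => (false, s.trim_start()),
    };
    let v = match rest {
        "inf" => f64::INFINITY,
        _ => rest.parse::<f64>().ok()?,
    };
    Some(if neg { -v } else { v })
}

fn main() {
    assert_eq!(parse_signed_float("-42"), Some(-42.0));
    assert_eq!(parse_signed_float("- 42"), Some(-42.0));
    assert_eq!(parse_signed_float("- inf"), Some(f64::NEG_INFINITY));
    assert_eq!(parse_signed_float("-\tinf"), Some(f64::NEG_INFINITY));
}
```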


/// Encode a message as a multi-line textproto string with 2-space indent.
///
/// One field per line; nested messages are indented. No trailing newline.

Is the no-trailing-newline behavior part of the conformance test?

https://protobuf.dev/reference/protobuf/textformat-spec/#working-with-the-format-programmatically suggests following the behavior of https://github.com/protocolbuffers/txtpbfmt which does emit a trailing newline.

Collaborator Author

[claude code] Fixed in ed6363d — encode_to_string_pretty appends \n when the output is non-empty. Single-line encode_to_string unchanged (it's a value, not a file). Not conformance-relevant — the conformance runner doesn't use the pretty form.

/// For `Name` tokens: was this name followed by a `:` separator?
/// The textproto grammar makes `:` optional before messages but required
/// before scalars.
pub has_separator: bool,

Nit: is this only used in tests?

Collaborator Author

[claude code] Not addressed yet — it's written at 7 production sites in the tokenizer but the only read in buffa/src/text/ is the test at line 1111. Before touching it I'd want to grep the impl_text.rs-generated code (the decoder-side merge_text impls) to confirm nothing there reads it. If it's genuinely dead: drop the field entirely, one fewer word on every Token. Deferred.

// serialize_extensions dedup loop.
let mut seen: alloc::vec::Vec<u32> = alloc::vec::Vec::new();
for uf in fields.iter() {
if seen.contains(&uf.number) {

With a bit of extra code, this could be less than O(n^2). Probably not worth it, but maybe leaving a comment saying "most proto messages are small enough and the extensions are small enough" that we decided to leave it. Or we could just get Claude to do the work and pre-optimize it for large complex messages. Your call :)

Collaborator Author

[claude code] Fixed in ed6363d — went with the cheap fix: Vec<u32> + contains + push → BTreeSet<u32> + insert (returns false on duplicate, so the if !seen.insert(...) form collapses two operations to one). BTreeSet because this crate is no_std+alloc.
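The dedup change can be illustrated with a small standalone example (the `first_occurrences` function is a stand-in for the serialize_extensions loop, not the actual code):

```rust
use std::collections::BTreeSet;

// BTreeSet::insert returns false on a duplicate, collapsing the old
// contains-then-push pair into one O(log n) operation.
// (BTreeSet rather than HashSet because the crate is no_std + alloc.)
fn first_occurrences(numbers: &[u32]) -> Vec<u32> {
    let mut seen: BTreeSet<u32> = BTreeSet::new();
    let mut out = Vec::new();
    for &n in numbers {
        if seen.insert(n) {
            out.push(n); // first time this field number appears
        }
    }
    out
}

fn main() {
    assert_eq!(first_occurrences(&[5, 3, 5, 7, 3]), vec![5, 3, 7]);
}
```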

pub fn line_col(&self, pos: usize) -> (u32, u32) {
let pos = pos.min(self.input.len());
let before = &self.input[..pos];
let line = before.bytes().filter(|&b| b == b'\n').count() as u32 + 1;

Any concern about really large files and u32 limits?

Collaborator Author

[claude code] Documented in ed6363d — added a doc-comment line noting the ~4 GiB bound and that this is error-reporting only. Didn't add saturation logic since a textproto that large isn't a realistic input.

Conflicts were all in buffa-types/src/generated/ — main added Self
emission (PR #15) and proto source comments (PR #7), our branch added
TextFormat impls. Resolved by regenerating with the merged codegen
source, which applies all three.

No conflicts in hand-written source — the codegen changes (message.rs
FieldInfo.struct_field_type, impl_text.rs, etc.) merged automatically.

Addresses four review comments on the tokenizer/decoder:

- WHITESPACE per spec includes vertical tab and form feed, which five
  call sites were missing. Extracted is_textproto_ws() helper; all
  whitespace-skipping sites now call it.

- `- inf` (whitespace between sign and literal) is valid per the
  grammar (`FLOAT = [ "-" ] , FLOAT_LIT` where `,` means optional
  whitespace). lex_number already handled `- 42`; parse_ident now
  matches for the literal path. Decoder side trims after stripping
  the sign.

- `[ pkg.ext ]` (whitespace inside extension/Any brackets) is valid
  per the grammar. Token.raw preserves input bytes, so the two
  bracket-strip sites in the decoder now .trim() before registry
  lookup.

- A leading UTF-8 BOM (U+FEFF) is silently skipped. Some editors
  prepend one; without this it fails with a cryptic 'expected field
  name' at 1:1.

Addresses the remaining non-spec review comments:

- UnescapeError enum replaces string-compare error dispatch in the
  decoder. unescape/unescape_str/push_utf8 now return a two-variant
  enum (InvalidUtf8 | BadEscape(&str)); callers match instead of
  string-comparing "invalid UTF-8".

- ParseErrorKind::Internal added. group_merge_text's re-decode of
  freshly-encoded bytes now returns this instead of panicking. Still
  unreachable absent an encoder bug, but a library shouldn't panic
  from that either.

- Extension-field dedup uses BTreeSet instead of Vec.contains,
  removing the quadratic lookup for extension-heavy messages.

- encode_to_string_pretty appends a trailing newline when non-empty,
  matching txtpbfmt and POSIX text-file conventions. Single-line
  encode_to_string is unchanged.

- Bof enum variant documented; line_col u32 bounds noted in doc.
@iainmcgin
Collaborator Author

[claude code] Merge conflicts resolved + review follow-ups applied in 3 commits:

7ee6310 — merge resolution. All 7 conflicts were in buffa-types/src/generated/ where main added Self-emission (#15) + proto source comments (#7) while this branch added TextFormat impls. No conflicts in hand-written source — the codegen changes merged automatically. Resolved by regenerating with the merged codegen, which applies all three.

8d39b7a — spec-compliance parsing gaps (+79/−11, 3 files):

  • \v/\f whitespace: extracted is_textproto_ws(), replaced 5 sites (spec)
  • - inf with whitespace-after-sign: parse_ident now matches lex_number's existing - 42 handling; decoder trims after sign-strip
  • [ pkg.ext ] whitespace: .trim() after bracket-strip in read_extension and read_any_expansion
  • Leading UTF-8 BOM silently skipped in Tokenizer::new

ed6363d — error-handling & polish (+79/−35, 8 files):

  • UnescapeError enum replaces &'static str error-return; decoder matches instead of string-comparing "invalid UTF-8"
  • ParseErrorKind::Internal added; group_merge_text's re-decode .expect() converted (unreachable absent encoder bug, but a library shouldn't panic)
  • Extension dedup: Vec::contains → BTreeSet::insert
  • encode_to_string_pretty appends trailing \n (matches txtpbfmt); single-line encode_to_string unchanged; conformance unaffected (doesn't use pretty)
  • Bof variant documented; line_col u32 bounds noted

Not addressed: the name_kind question (token.rs:87) — it's written at 7 production sites but I couldn't find a production reader in the text module. Needs a grep across impl_text.rs-generated code to confirm whether it's dead before touching it.

task lint clean, 1390 workspace tests pass.

All four known_failures*.txt files are empty on this branch: binary,
JSON, and text all pass in std, no_std, and via-view modes. Updated
the Why-buffa blurb to reflect all three formats and dropped the
per-suite 883/883 count from the text bullet since it's now covered
by the top-line claim.
iainmcgin requested a review from asacamano March 30, 2026 19:13
iainmcgin enabled auto-merge (squash) March 30, 2026 19:14
iainmcgin merged commit cd002f6 into main Mar 30, 2026
7 checks passed
iainmcgin deleted the textproto-support branch March 30, 2026 20:17
github-actions bot locked and limited conversation to collaborators Mar 30, 2026