fix: AwkwardForth VM and libawkward correctness issues #4105

Open
henryiii wants to merge 3 commits into main from henryiii/fix-forth-vm

@henryiii

Member

🤖 AI text below 🤖

This PR fixes a batch of correctness and memory-safety issues in the C++ libawkward sources (AwkwardForth VM, the JSON-schema reader, the layout builders, and a couple of small utilities), found by an automated multi-agent code review (Claude Code).

All fixes are in awkward-cpp/src/libawkward/ (plus the ForthMachine.h member declarations):

  1. io/json.cpp nulls_for_optiontype — the switch fell through from case FillIndexedOptionArray: into case KeyTableHeader:, double-executing push_stack and leaving the instruction stack unbalanced for nullable records. This produced a spurious JSON schema mismatch when a record typed ["object","null"] had a missing option-type key, while the equivalent non-nullable record null-filled it correctly. Added break; so each case pushes once, matching the single pop_stack. Nullable records now null-fill missing keys.

  2. forth/ForthMachine.cpp (Nbit read, ~3068): the mask expression uint64_t mask = (1 << bit_width) - 1; shifted an int literal; is_nbit allows widths up to 64, so this was UB / wrong for N >= 31. Now (bit_width >= 64) ? ~0ULL : ((uint64_t)1 << bit_width) - 1.

  3. forth/ForthMachine.cpp constructor / ForthMachine.h — the seven scratch arrays were raw new[], and a user-reachable std::invalid_argument from tokenize/compile on bad AwkwardForth source leaked them (~50 KB per failed compile). Converted to std::unique_ptr<T[]>; destructor defaulted.

  4. forth/ForthInputBuffer.cpp read/peek_byte — read sizes derive from stack-supplied item counts and could overflow negative. Now reject num_bytes < 0 || next < pos_ || next > length_ (and the analogous check in peek_byte).

  5. forth/ForthInputBuffer.cpp read_textfloat — the integral-mantissa loop had no digit cap, causing signed-int64 overflow UB on long literals. Now accumulates in int64_t for the first 18 digits, then continues in double.

  6. forth/ForthMachine.cpp is_integer — used std::stoul (caught only invalid_argument, accepted trailing junk like 123abc, handled negatives only by wraparound). Now uses std::stoll, also catches out_of_range, and requires full-string consumption via the pos out-parameter.

  7. forth/ForthMachine.cpp (enum/enumonly error path) — could read tokenized[pos + 1] with pos + 1 == stop (out-of-bounds vector read) when no s" strings followed. Now reads the already-validated keyword token at pos - 1 with a bounds guard.

  8. forth/ForthMachine.cpp stack_at — off-by-one: stack_at(0) read one past the top. Fixed to stack_depth_ - 1 - from_top. (Part of the public C++ API; no in-repo callers.)

  9. forth/ForthOutputBuffer.cpp maybe_resize: the loop reservation = ceil(reservation * resize_) never terminated when the buffer was constructed with output_initial_size=0 or resize_ <= 1.0 (both reachable from Python). Now guarantees strict growth each iteration.

  10. builder/RecordBuilder.cpp clear — cleared keys_/pointers_ and set length_ = -1 while keeping contents_, so subsequent form()/to_buffers() read keys_[i] out of bounds and __len__ raised ValueError. Now keeps the record structure consistent and resets length_ to 0, mirroring TupleBuilder::clear(). Reachable via _ext.ArrayBuilder.clear().

  11. util.cpp dtype_to_format: in #if defined _MSC_VER || defined INTPTR_MAX == INT32_MAX, the second operand parsed as (defined INTPTR_MAX) == INT32_MAX (always false) for the uint32/uint64 cases, unlike the correct int32/int64 lines. Dropped the stray defined.
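
For readers who want to see the shape of fixes 2, 6, and 11 in isolation, here is a minimal standalone sketch. It is not the code from this PR: nbit_mask and parse_integer are hypothetical free functions (the real logic lives inside ForthMachine.cpp), base 10 parsing is an assumption, and the two constexpr flags exist only to make the preprocessor behavior observable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for fix 2: the PR inlines this expression at the
// Nbit read site rather than calling a function.
inline uint64_t nbit_mask(int64_t bit_width) {
  assert(bit_width >= 0 && bit_width <= 64);
  // Shifting a 64-bit value by 64 is undefined, so width 64 is special-cased.
  return (bit_width >= 64) ? ~0ULL : ((uint64_t)1 << bit_width) - 1;
}

// Hypothetical free-function version of is_integer (fix 6).
// Assumption: tokens are parsed in base 10.
inline bool parse_integer(const std::string& word, int64_t& value) {
  try {
    std::size_t pos = 0;
    long long parsed = std::stoll(word, &pos, 10);
    if (pos != word.size()) {
      return false;  // trailing junk such as "123abc"
    }
    value = parsed;
    return true;
  }
  catch (const std::invalid_argument&) {
    return false;  // no leading digits at all
  }
  catch (const std::out_of_range&) {
    return false;  // literal does not fit in long long
  }
}

// Fix 11: `defined` binds tighter than `==`, so this test compares
// 0 or 1 against INT32_MAX and can never be true.
#if defined INTPTR_MAX == INT32_MAX
constexpr bool stray_defined_branch_taken = true;
#else
constexpr bool stray_defined_branch_taken = false;
#endif

// Without the stray `defined`, the macro values themselves are compared.
#if INTPTR_MAX == INT32_MAX
constexpr bool intptr_is_32_bit = true;
#else
constexpr bool intptr_is_32_bit = false;
#endif
```

With these definitions, nbit_mask(32) yields 0xFFFFFFFF and nbit_mask(64) yields all ones, parse_integer rejects "123abc" and a 20-digit literal, and stray_defined_branch_taken is false on every platform.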

Tests

New regression test file covering the behaviorally-verifiable fixes: from_json with a nullable-record schema + missing option-type keys (#1), ArrayBuilder clear-then-reuse (#10), ForthMachine64(output_initial_size=0) (#9), and Nbit reads with N >= 32 (#2). The full existing suite passes.
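
The output_initial_size=0 case (#9) comes down to a growth loop that must make progress on every iteration. The patched maybe_resize is not shown in this description, so the following is only one possible way to guarantee strict growth; grow_reservation and its parameters are hypothetical names, not the library's API.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the growth loop in maybe_resize.
// Assumption: `resize` is a finite factor; NaN or huge values are not handled.
inline int64_t grow_reservation(int64_t reservation, int64_t needed, double resize) {
  assert(reservation >= 0 && needed >= 0);
  while (reservation < needed) {
    int64_t next = (int64_t)std::ceil((double)reservation * resize);
    if (next <= reservation) {
      // ceil(0 * r) == 0 and ceil(n * r) <= n for r <= 1.0: force progress.
      next = reservation + 1;
    }
    reservation = next;
  }
  return reservation;
}
```

Starting from 0 with a factor of 1.5 and a target of 10, the sequence is 0, 1, 2, 3, 5, 8, 12, so the loop terminates; with a factor of 1.0 it degrades to growing one element at a time instead of spinning forever.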

Notes

  • No package versions bumped here. Since this changes C++, the awkward-cpp version bump (in awkward-cpp/pyproject.toml and the pin in pyproject.toml) and an awkward-cpp release happen at release time, before the next awkward release.
  • AI assistance: these fixes were found and implemented with Claude Code as part of an automated multi-agent review, then verified locally.

🤖 Generated with Claude Code

@codecov

codecov Bot commented Jun 10, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.32%. Comparing base (712dac0) to head (3fe4a4b).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

henryiii added 3 commits June 11, 2026 15:41
Several correctness and safety fixes in awkward-cpp/src/libawkward:

- io/json.cpp (nulls_for_optiontype): stop the switch falling through from
  FillIndexedOptionArray into KeyTableHeader, which double-pushed the
  instruction stack and left it unbalanced for nullable records, producing a
  spurious "JSON schema mismatch" when a nullable record had a missing
  option-type key. Nullable records now null-fill missing keys like
  non-nullable records.
- forth/ForthMachine.cpp (Nbit read): build the bit mask with a 64-bit
  literal and special-case width 64, fixing UB/wrong masks for N >= 31
  (is_nbit allows up to 64).
- forth/ForthMachine.cpp (constructor): hold the seven scratch buffers in
  std::unique_ptr<T[]> so a std::invalid_argument thrown by tokenize/compile
  on bad AwkwardForth source no longer leaks them; destructor defaulted.
- forth/ForthInputBuffer.cpp + ForthMachine.cpp read macros: reject negative
  and overflowed read sizes in read()/peek_byte() (sizes derive from
  stack-supplied item counts).
- forth/ForthInputBuffer.cpp (read_textfloat): cap integral-mantissa
  accumulation at 18 digits and continue in double, avoiding signed int64
  overflow UB on long literals.
- forth/ForthMachine.cpp (is_integer): use std::stoll, catch out_of_range,
  and require full-string consumption so trailing junk and overlong literals
  are rejected.
- forth/ForthMachine.cpp (enum/enumonly): bounds-check the keyword lookup on
  the error path so it no longer reads one past the token vector.
- forth/ForthMachine.cpp (stack_at): fix off-by-one so stack_at(0) returns
  the top element instead of one past it.
- forth/ForthOutputBuffer.cpp (maybe_resize): guarantee strict growth so an
  initial size of 0 or a resize factor <= 1.0 no longer loops forever.
- builder/RecordBuilder.cpp (clear): keep the record structure (contents_,
  keys_, pointers_, keys_size_) consistent and reset length_ to 0, mirroring
  TupleBuilder::clear(); previously clear() left keys_ out of sync with
  contents_ (out-of-bounds reads) and a length of -1.
- util.cpp (dtype_to_format): drop the stray "defined" in the uint32/uint64
  preprocessor checks so they map to the correct buffer-format chars on
  32-bit non-MSVC platforms.

Assisted-by: ClaudeCode:claude-fable-5

Covers the behaviorally-verifiable fixes: from_json with a nullable-record
schema and missing option-type keys, ArrayBuilder clear-then-reuse,
ForthMachine64 with output_initial_size=0 / resize factor 1.0, and Nbit reads
with N >= 32.

Assisted-by: ClaudeCode:claude-fable-5

Drop the non-nullable from_json guard (already passed pre-fix), the
duplicate output-resize test (same maybe_resize bug as the
initial-size-zero case), and two interior nbit-mask parametrize cases.
Condense the verbose test docstring-comments to match the repo's test
style, and tighten the RecordBuilder::clear() comment.

Assisted-by: ClaudeCode:claude-opus-4-8
@henryiii henryiii force-pushed the henryiii/fix-forth-vm branch from c84efa9 to 3fe4a4b Compare June 11, 2026 19:47