fix: AwkwardForth VM and libawkward correctness issues#4105
Open
henryiii wants to merge 3 commits into
Open
Conversation
Several correctness and safety fixes in awkward-cpp/src/libawkward: - io/json.cpp (nulls_for_optiontype): stop the switch falling through from FillIndexedOptionArray into KeyTableHeader, which double-pushed the instruction stack and left it unbalanced for nullable records, producing a spurious "JSON schema mismatch" when a nullable record had a missing option-type key. Nullable records now null-fill missing keys like non-nullable records. - forth/ForthMachine.cpp (Nbit read): build the bit mask with a 64-bit literal and special-case width 64, fixing UB/wrong masks for N >= 31 (is_nbit allows up to 64). - forth/ForthMachine.cpp (constructor): hold the seven scratch buffers in std::unique_ptr<T[]> so a std::invalid_argument thrown by tokenize/compile on bad AwkwardForth source no longer leaks them; destructor defaulted. - forth/ForthInputBuffer.cpp + ForthMachine.cpp read macros: reject negative and overflowed read sizes in read()/peek_byte() (sizes derive from stack-supplied item counts). - forth/ForthInputBuffer.cpp (read_textfloat): cap integral-mantissa accumulation at 18 digits and continue in double, avoiding signed int64 overflow UB on long literals. - forth/ForthMachine.cpp (is_integer): use std::stoll, catch out_of_range, and require full-string consumption so trailing junk and overlong literals are rejected. - forth/ForthMachine.cpp (enum/enumonly): bounds-check the keyword lookup on the error path so it no longer reads one past the token vector. - forth/ForthMachine.cpp (stack_at): fix off-by-one so stack_at(0) returns the top element instead of one past it. - forth/ForthOutputBuffer.cpp (maybe_resize): guarantee strict growth so an initial size of 0 or a resize factor <= 1.0 no longer loops forever. - builder/RecordBuilder.cpp (clear): keep the record structure (contents_, keys_, pointers_, keys_size_) consistent and reset length_ to 0, mirroring TupleBuilder::clear(); previously clear() left keys_ out of sync with contents_ (out-of-bounds reads) and a length of -1. - util.cpp (dtype_to_format): drop the stray "defined" in the uint32/uint64 preprocessor checks so they map to the correct buffer-format chars on 32-bit non-MSVC platforms. Assisted-by: ClaudeCode:claude-fable-5
Covers the behaviorally-verifiable fixes: from_json with a nullable-record schema and missing option-type keys, ArrayBuilder clear-then-reuse, ForthMachine64 with output_initial_size=0 / resize factor 1.0, and Nbit reads with N >= 32. Assisted-by: ClaudeCode:claude-fable-5
Drop the non-nullable from_json guard (already passed pre-fix), the duplicate output-resize test (same maybe_resize bug as the initial-size-zero case), and two interior nbit-mask parametrize cases. Condense the verbose test docstring-comments to match the repo's test style, and tighten the RecordBuilder::clear() comment. Assisted-by: ClaudeCode:claude-opus-4-8
c84efa9 to
3fe4a4b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🤖 AI text below 🤖
This PR fixes a batch of correctness and memory-safety issues in the C++
libawkwardsources (AwkwardForth VM, the JSON-schema reader, the layout builders, and a couple of small utilities), found by an automated multi-agent code review (Claude Code).All fixes are in
awkward-cpp/src/libawkward/(plus theForthMachine.hmember declarations):io/json.cppnulls_for_optiontype— theswitchfell through fromcase FillIndexedOptionArray:intocase KeyTableHeader:, double-executingpush_stackand leaving the instruction stack unbalanced for nullable records. This produced a spuriousJSON schema mismatchwhen a record typed["object","null"]had a missing option-type key, while the equivalent non-nullable record null-filled it correctly. Addedbreak;so each case pushes once, matching the singlepop_stack. Nullable records now null-fill missing keys.forth/ForthMachine.cpp(Nbit read, ~3068) —uint64_t mask = (1 << bit_width) - 1;shifted anintliteral;is_nbitallows widths up to 64, so this was UB / wrong forN >= 31. Now(bit_width >= 64) ? ~0ULL : ((uint64_t)1 << bit_width) - 1.forth/ForthMachine.cppconstructor /ForthMachine.h— the seven scratch arrays were rawnew[], and a user-reachablestd::invalid_argumentfromtokenize/compileon bad AwkwardForth source leaked them (~50 KB per failed compile). Converted tostd::unique_ptr<T[]>; destructor defaulted.forth/ForthInputBuffer.cppread/peek_byte— read sizes derive from stack-supplied item counts and could overflow negative. Now rejectnum_bytes < 0 || next < pos_ || next > length_(and the analogous check inpeek_byte).forth/ForthInputBuffer.cppread_textfloat— the integral-mantissa loop had no digit cap, causing signed-int64 overflow UB on long literals. Now accumulates inint64_tfor the first 18 digits, then continues indouble.forth/ForthMachine.cppis_integer— usedstd::stoul(caught onlyinvalid_argument, accepted trailing junk like123abc, handled negatives only by wraparound). Now usesstd::stoll, also catchesout_of_range, and requires full-string consumption via theposout-parameter.forth/ForthMachine.cpp(enum/enumonly error path) — could readtokenized[pos + 1]withpos + 1 == stop(out-of-bounds vector read) when nos"strings followed. Now reads the already-validated keyword token atpos - 1with a bounds guard.forth/ForthMachine.cppstack_at— off-by-one:stack_at(0)read one past the top. Fixed tostack_depth_ - 1 - from_top. (Part of the public C++ API; no in-repo callers.)forth/ForthOutputBuffer.cppmaybe_resize—reservation = ceil(reservation * resize_)never terminated when constructed withoutput_initial_size=0orresize_ <= 1.0(both reachable from Python). Now guarantees strict growth each iteration.builder/RecordBuilder.cppclear— clearedkeys_/pointers_and setlength_ = -1while keepingcontents_, so subsequentform()/to_buffers()readkeys_[i]out of bounds and__len__raisedValueError. Now keeps the record structure consistent and resetslength_to 0, mirroringTupleBuilder::clear(). Reachable via_ext.ArrayBuilder.clear().util.cppdtype_to_format—#if defined _MSC_VER || defined INTPTR_MAX == INT32_MAXparsed as(defined INTPTR_MAX) == INT32_MAX(always false) for the uint32/uint64 cases, unlike the correct int32/int64 lines. Dropped the straydefined.Tests
New regression test file covering the behaviorally-verifiable fixes:
from_jsonwith a nullable-record schema + missing option-type keys (#1),ArrayBuilderclear-then-reuse (#10),ForthMachine64(output_initial_size=0)(#9), and Nbit reads withN >= 32(#2). The full existing suite passes.Notes
awkward-cppversion bump (inawkward-cpp/pyproject.tomland the pin inpyproject.toml) and anawkward-cpprelease happen at release time, before the nextawkwardrelease.🤖 Generated with Claude Code