fix: wrong results from string sort/unique kernels#4091
Open
henryiii wants to merge 3 commits into
Open
Conversation
The sort_asstrings kernel iterated the per-string character copy with a uint8_t counter initialised from an int64_t byte offset, so once the cumulative offset exceeded 255 the counter wrapped and strings were built from the wrong bytes. The loop now uses an int64_t index bounded directly by the stop offset. The unique_strings kernel compacts strings toward the front of the buffer in place, but compared each candidate against the previously kept string at its original input offset, a region the compaction may have already overwritten. It now compares against the kept string's location in the compacted output, so adjacent duplicates are correctly removed. Assisted-by: ClaudeCode:claude-opus-4.8
Cover sorting and uniqueing of string arrays whose cumulative byte length exceeds 255 (exercising the former uint8 wraparound) and adjacent duplicate strings of differing lengths (exercising the in-place compaction comparison). Assisted-by: ClaudeCode:claude-opus-4.8
Removed three redundant test cases: - test_unique_strings_over_255_cumulative_chars (unique bug is about in-place compaction, not uint8 overflow; covered by adjacent-duplicates test) - test_unique_strings_short_then_long_duplicates (near-duplicate of adjacent-duplicates test) - test_sort_and_unique_mixed_lengths (covered by the two remaining tests) Assisted-by: ClaudeCode:claude-sonnet-4-6
78df4ef to
fbf07b1
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🤖 AI text below 🤖
Two CPU kernels backing string `sort`/`unique` (axis=-1) produced wrong results. Both bugs were reproduced before the fix and verified resolved after rebuilding `awkward-cpp`.
Fixes
`awkward-cpp/src/cpu-kernels/awkward_NumpyArray_sort_asstrings_uint8.cpp` — the per-string character-copy loop used `for (uint8_t i = (uint8_t)start; ...)` where `start`/`stop` are `int64_t` byte offsets. Once a cumulative offset exceeded 255 the `uint8_t` counter wrapped, so strings were assembled from the wrong bytes. The loop now uses an `int64_t` index bounded directly by `stop` (`for (int64_t i = start; i < stop; i++)`), dropping the redundant `slen` tracker.
`awkward-cpp/src/cpu-kernels/awkward_NumpyArray_unique_strings_uint8.cpp` — the kernel compacts kept strings toward the front of `toptr` in place, but compared each candidate against the previously kept string via its original input offset (`start = offsets[i]`), a region the leftward compaction may already have overwritten. It now tracks the kept string's position/length in the compacted output and compares against that copy, so adjacent duplicates are removed correctly.
Tests
Two regression tests in `tests/test_4091_string_sort_unique_kernels.py`, one per bug:
Notes
This fix came from an automated multi-agent code review and was implemented with AI assistance (Claude Code).