fix: wrong results from string sort/unique kernels #4091

Open: henryiii wants to merge 3 commits into main from henryiii/fix-string-kernels

@henryiii henryiii commented Jun 10, 2026

Member

🤖 AI text below 🤖

Two CPU kernels backing string `sort`/`unique` (axis=-1) produced wrong results. Both bugs were reproduced before the fix and verified resolved after rebuilding `awkward-cpp`.

Fixes

  • `awkward-cpp/src/cpu-kernels/awkward_NumpyArray_sort_asstrings_uint8.cpp` — the per-string character-copy loop used `for (uint8_t i = (uint8_t)start; ...)` where `start`/`stop` are `int64_t` byte offsets. Once a cumulative offset exceeded 255 the `uint8_t` counter wrapped, so strings were assembled from the wrong bytes. The loop now uses an `int64_t` index bounded directly by `stop` (`for (int64_t i = start; i < stop; i++)`), dropping the redundant `slen` tracker.

  • `awkward-cpp/src/cpu-kernels/awkward_NumpyArray_unique_strings_uint8.cpp` — the kernel compacts kept strings toward the front of `toptr` in place, but compared each candidate against the previously kept string via its original input offset (`start = offsets[i]`), a region the leftward compaction may already have overwritten. It now tracks the kept string's position/length in the compacted output and compares against that copy, so adjacent duplicates are removed correctly.
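For reviewers who want to see the first fix in isolation, here is a minimal sketch of the `uint8_t` wraparound described for the sort kernel. It is a simplified model, not the kernel source: `copy_narrow` and `copy_wide` are hypothetical helper names, and the separate `copied` counter is an assumption standing in for the old `slen` tracker.

```cpp
#include <cstdint>
#include <string>

// Hypothetical model of the old loop: the read index is a uint8_t seeded
// from an int64_t byte offset, so any offset above 255 is truncated.
// `copied` stands in for the old length tracker that ended the loop.
std::string copy_narrow(const std::string& buf, int64_t start, int64_t stop) {
  std::string out;
  int64_t copied = start;
  for (uint8_t i = static_cast<uint8_t>(start); copied < stop; i++, copied++) {
    out += buf[i];  // reads from (start mod 256), not from start
  }
  return out;
}

// Model of the fixed loop: an int64_t index bounded directly by `stop`.
std::string copy_wide(const std::string& buf, int64_t start, int64_t stop) {
  std::string out;
  for (int64_t i = start; i < stop; i++) {
    out += buf[i];
  }
  return out;
}
```

With a buffer of 256 `'a'` bytes followed by `"wxyz"`, `copy_narrow(buf, 256, 260)` reads from index 0 and returns `"aaaa"`, while `copy_wide(buf, 256, 260)` returns `"wxyz"`. For offsets below 256 the two agree, which is why short inputs did not expose the bug.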

Tests

Two regression tests in `tests/test_4091_string_sort_unique_kernels.py`, one per bug:

  • `test_sort_strings_over_255_cumulative_chars`: 200 four-char strings (800 cumulative bytes) shuffled and sorted; fails without the `uint8_t`→`int64_t` fix.
  • `test_unique_strings_adjacent_duplicates_of_differing_lengths`: deduplicates `["aa","aa","bcde","bcde"]`; without the compaction fix this yielded `["aa","bcde","bcde"]` (the duplicate survived).
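The in-place compaction problem behind the second test can be sketched as below. This is a simplified model written for illustration, not the `awkward-cpp` kernel: `unique_adjacent` and its `compare_against_output` flag are hypothetical, and the function takes the packed bytes by value so the caller's buffer is untouched.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of adjacent-duplicate removal over a packed byte buffer.
// compare_against_output = false: compare against the kept string's original
//   input offset (the old behaviour described in this PR).
// compare_against_output = true: compare against the kept string's position
//   in the compacted output (the fixed behaviour).
std::vector<std::string> unique_adjacent(std::string buf,
                                         const std::vector<int64_t>& offsets,
                                         bool compare_against_output) {
  std::vector<int64_t> out_offsets{0};
  int64_t write = 0;       // next free byte in the compacted output
  int64_t kept_start = 0;  // where the last kept string is read from
  int64_t kept_len = -1;   // length of the last kept string (-1: none yet)

  for (std::size_t i = 0; i + 1 < offsets.size(); i++) {
    const int64_t start = offsets[i];
    const int64_t stop = offsets[i + 1];
    bool differ = (stop - start != kept_len);
    for (int64_t k = 0; !differ && k < stop - start; k++) {
      differ = buf[kept_start + k] != buf[start + k];
    }
    if (differ) {
      // Record where the kept copy lives before advancing `write`.
      kept_start = compare_against_output ? write : start;
      kept_len = stop - start;
      for (int64_t j = start; j < stop; j++) {
        buf[write++] = buf[j];  // compact toward the front, in place
      }
      out_offsets.push_back(write);
    }
  }

  std::vector<std::string> result;
  for (std::size_t i = 0; i + 1 < out_offsets.size(); i++) {
    result.push_back(buf.substr(out_offsets[i], out_offsets[i + 1] - out_offsets[i]));
  }
  return result;
}
```

For the packed bytes `"aaaabcdebcde"` with offsets `{0, 2, 4, 8, 12}`, the old-style comparison returns `{"aa", "bcde", "bcde"}`: copying the first `"bcde"` down to bytes 2..6 overwrites bytes 4 and 5 of its own original location, so the second `"bcde"` is compared against `"dede"`. Comparing against the compacted copy returns `{"aa", "bcde"}`.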

Notes

  • This changes `awkward-cpp` C++ only. No package versions were bumped here; the `awkward-cpp` version bump and release will be handled by maintainers at release time.
  • The `kernel-specification.yml` entries for both kernels contain only placeholder definitions (no Python reference implementation), and there is no associated `kernel-test-data.json`, so there was nothing to update there.

This fix came from an automated multi-agent code review and was implemented with AI assistance (Claude Code).

codecov Bot commented Jun 11, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.32%. Comparing base (712dac0) to head (fbf07b1).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

henryiii added 3 commits June 11, 2026 15:40
The sort_asstrings kernel iterated the per-string character copy with a
uint8_t counter initialised from an int64_t byte offset, so once the
cumulative offset exceeded 255 the counter wrapped and strings were built
from the wrong bytes. The loop now uses an int64_t index bounded directly
by the stop offset.

The unique_strings kernel compacts strings toward the front of the buffer
in place, but compared each candidate against the previously kept string
at its original input offset, a region the compaction may have already
overwritten. It now compares against the kept string's location in the
compacted output, so adjacent duplicates are correctly removed.

Assisted-by: ClaudeCode:claude-opus-4.8

Cover sorting and uniquing of string arrays whose cumulative byte length
exceeds 255 (exercising the former uint8 wraparound) and adjacent
duplicate strings of differing lengths (exercising the in-place
compaction comparison).

Assisted-by: ClaudeCode:claude-opus-4.8

Removed three redundant test cases:
- test_unique_strings_over_255_cumulative_chars (unique bug is about
  in-place compaction, not uint8 overflow; covered by adjacent-duplicates test)
- test_unique_strings_short_then_long_duplicates (near-duplicate of
  adjacent-duplicates test)
- test_sort_and_unique_mixed_lengths (covered by the two remaining tests)

Assisted-by: ClaudeCode:claude-sonnet-4-6
henryiii force-pushed the henryiii/fix-string-kernels branch from 78df4ef to fbf07b1 on June 11, 2026 19:42