fix: wrong results from string sort/unique kernels by henryiii · Pull Request #4091 · scikit-hep/awkward

henryiii · 2026-06-10T20:26:19Z

🤖 AI text below 🤖

Two CPU kernels backing string `sort`/`unique` (axis=-1) produced wrong results. Both bugs were reproduced before the fix and verified resolved after rebuilding `awkward-cpp`.

Fixes

`awkward-cpp/src/cpu-kernels/awkward_NumpyArray_sort_asstrings_uint8.cpp` — the per-string character-copy loop used `for (uint8_t i = (uint8_t)start; ...)` where `start`/`stop` are `int64_t` byte offsets. Once a cumulative offset exceeded 255 the `uint8_t` counter wrapped, so strings were assembled from the wrong bytes. The loop now uses an `int64_t` index bounded directly by `stop` (`for (int64_t i = start; i < stop; i++)`), dropping the redundant `slen` tracker.
`awkward-cpp/src/cpu-kernels/awkward_NumpyArray_unique_strings_uint8.cpp` — the kernel compacts kept strings toward the front of `toptr` in place, but compared each candidate against the previously kept string via its original input offset (`start = offsets[i]`), a region the leftward compaction may already have overwritten. It now tracks the kept string's position/length in the compacted output and compares against that copy, so adjacent duplicates are removed correctly.
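
The `uint8_t` wraparound in the sort kernel can be reproduced in isolation. The following is a minimal sketch, not the kernel source: `copy_narrow`, `copy_wide`, and `make_buffer` are hypothetical helpers that model only the loop-index type, and the way `slen` bounds the old loop is an assumption based on the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of the old loop: the index is a uint8_t initialised
// from an int64_t byte offset, so it is reduced modulo 256.
std::string copy_narrow(const std::vector<uint8_t>& buf, int64_t start, int64_t stop) {
  std::string out;
  int64_t slen = start;  // assumed: separate tracker that bounds the loop
  for (uint8_t i = (uint8_t)start; slen < stop; i++) {
    slen++;
    out += (char)buf[i];  // reads from the wrapped position
  }
  return out;
}

// Hypothetical model of the fixed loop: an int64_t index bounded by stop.
std::string copy_wide(const std::vector<uint8_t>& buf, int64_t start, int64_t stop) {
  std::string out;
  for (int64_t i = start; i < stop; i++) {
    out += (char)buf[i];
  }
  return out;
}

// Illustrative buffer: 256 bytes of 'a' followed by "wxyz" at offsets 256..259.
std::vector<uint8_t> make_buffer() {
  std::vector<uint8_t> buf(256, 'a');
  for (char c : std::string("wxyz")) {
    buf.push_back((uint8_t)c);
  }
  return buf;
}
```

With this buffer, `copy_narrow(buf, 256, 260)` starts at index 0 because `(uint8_t)256` is 0, and so returns bytes from the front of the buffer, while `copy_wide(buf, 256, 260)` returns the intended `"wxyz"`.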

Tests

Two regression tests in `tests/test_4091_string_sort_unique_kernels.py`, one per bug:

`test_sort_strings_over_255_cumulative_chars`: 200 four-char strings (800 cumulative bytes) shuffled and sorted; fails without the `uint8_t`→`int64_t` fix.
`test_unique_strings_adjacent_duplicates_of_differing_lengths`: `["aa","aa","bcde","bcde"]` deduplicated; without the compaction fix this yielded `["aa","bcde","bcde"]` (the second `"bcde"` survived).
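
The in-place compaction problem can be modelled on the same `["aa","aa","bcde","bcde"]` input. This is a simplified sketch of the pattern, not the kernel itself: `unique_adjacent`, `to_strings`, `bytes_of`, and the `compare_in_output` flag are hypothetical names, and the model assumes the first offset is 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Convert a string to a flat byte buffer.
std::vector<uint8_t> bytes_of(const std::string& s) {
  return std::vector<uint8_t>(s.begin(), s.end());
}

// Split a flat byte buffer into strings using n+1 offsets.
std::vector<std::string> to_strings(const std::vector<uint8_t>& buf,
                                    const std::vector<int64_t>& offsets) {
  std::vector<std::string> out;
  for (size_t i = 0; i + 1 < offsets.size(); i++) {
    out.emplace_back(buf.begin() + offsets[i], buf.begin() + offsets[i + 1]);
  }
  return out;
}

// Hypothetical model of in-place removal of adjacent duplicate strings.
// compare_in_output == false: compare against the kept string's original
//   input offset, which compaction may already have overwritten.
// compare_in_output == true: compare against the kept string's copy in
//   the compacted output.
std::vector<std::string> unique_adjacent(std::vector<uint8_t> buf,
                                         const std::vector<int64_t>& offsets,
                                         bool compare_in_output) {
  std::vector<int64_t> outoffsets{0};
  int64_t write = 0;      // next free byte in the compacted output
  int64_t kept_pos = 0;   // where the last kept string is read from
  int64_t kept_len = -1;  // -1 means nothing has been kept yet
  for (size_t i = 0; i + 1 < offsets.size(); i++) {
    int64_t start = offsets[i];
    int64_t len = offsets[i + 1] - start;
    bool differ = (len != kept_len);
    for (int64_t k = 0; !differ && k < len; k++) {
      differ = buf[kept_pos + k] != buf[start + k];
    }
    if (differ) {
      kept_pos = compare_in_output ? write : start;
      kept_len = len;
      for (int64_t k = 0; k < len; k++) {
        buf[write++] = buf[start + k];  // compact toward the front
      }
      outoffsets.push_back(write);
    }
  }
  return to_strings(buf, outoffsets);
}
```

Calling `unique_adjacent(bytes_of("aaaabcdebcde"), {0, 2, 4, 8, 12}, false)` keeps the second `"bcde"`, because copying the first `"bcde"` leftward overwrites part of the region at its original offset. Passing `true` compares against the compacted copy and removes the duplicate.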

Notes

This changes `awkward-cpp` C++ only. No package versions were bumped here; the `awkward-cpp` version bump and release will be handled by maintainers at release time.
The `kernel-specification.yml` entries for both kernels contain only placeholder definitions (no Python reference implementation), and there is no associated `kernel-test-data.json`, so there was nothing to update there.

This fix came from an automated multi-agent code review and was implemented with AI assistance (Claude Code).

codecov · 2026-06-11T00:37:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.32%. Comparing base (712dac0) to head (fbf07b1).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

The sort_asstrings kernel iterated the per-string character copy with a uint8_t counter initialised from an int64_t byte offset, so once the cumulative offset exceeded 255 the counter wrapped and strings were built from the wrong bytes. The loop now uses an int64_t index bounded directly by the stop offset.

The unique_strings kernel compacts strings toward the front of the buffer in place, but compared each candidate against the previously kept string at its original input offset, a region the compaction may have already overwritten. It now compares against the kept string's location in the compacted output, so adjacent duplicates are correctly removed.

Assisted-by: ClaudeCode:claude-opus-4.8

Cover sorting and deduplication of string arrays whose cumulative byte length exceeds 255 (exercising the former uint8 wraparound) and adjacent duplicate strings of differing lengths (exercising the in-place compaction comparison).

Assisted-by: ClaudeCode:claude-opus-4.8

Removed three redundant test cases:
- test_unique_strings_over_255_cumulative_chars (unique bug is about in-place compaction, not uint8 overflow; covered by adjacent-duplicates test)
- test_unique_strings_short_then_long_duplicates (near-duplicate of adjacent-duplicates test)
- test_sort_and_unique_mixed_lengths (covered by the two remaining tests)

Assisted-by: ClaudeCode:claude-sonnet-4-6

henryiii mentioned this pull request Jun 10, 2026

Claude Fable AI review for Awkward #4085

Open

henryiii added 3 commits June 11, 2026 15:40

henryiii force-pushed the henryiii/fix-string-kernels branch from 78df4ef to fbf07b1 Compare June 11, 2026 19:42

henryiii wants to merge 3 commits into main from henryiii/fix-string-kernels
