Skip to content
This repository was archived by the owner on May 1, 2026. It is now read-only.

compare: SSE2/NEON 20-byte compare + bulk-encode public API (AVX2 kernel deferred)#12

Merged
justinjoy merged 4 commits into
mainfrom
feature/issue-5-simd-bulk-encode
Apr 30, 2026
Merged

compare: SSE2/NEON 20-byte compare + bulk-encode public API (AVX2 kernel deferred)#12
justinjoy merged 4 commits into
mainfrom
feature/issue-5-simd-bulk-encode

Conversation

@justinjoy

Copy link
Copy Markdown
Contributor

Partial closure of #5.

Summary

Issue #5 has two sub-items: (Part B) SSE2/NEON 20-byte compare kernel, and (Part A) AVX2 8-wide bulk-encode kernel + the new ksuid_string_batch public API.

This PR delivers Part B fully and lands Part A's public API + scalar reference + atomic dispatch scaffolding. The actual AVX2 kernel itself is intentionally deferred to a follow-up because doing it correctly is a multi-day deep-dive (Critic risk register R3-R8) that warrants its own architect + critic + reviewer cycle.

Series — three atomic commits

Commit Purpose
3561307 compare: SSE2 + NEON 20-byte compare kernel + parity test
f7e0aa4 encode: ksuid_string_batch public API + scalar reference + libsodium-style atomic-pointer dispatch + 8 contract-pinning tests
286737c docs: README "Bulk encode" section + ksuid.h docblock that honestly frames the AVX2 kernel as deferred

Pipeline that ran

Per the global GitHub-issue resolution workflow rule:

  1. Architect study — 5-commit plan (Part B → scaffolding → AVX2 kernel → parity expansion → docs).
  2. Critic study — 12-item risk register: dispatch race (R1), CPUID init (R2), AVX-SSE transition (R3), tail handling (R4), magic constants (R5), lane transpose (R6), compare sign (R7), footprint (R8), clang-tidy intrinsics (R9), test coverage (R10), KSUID_FORCE_SCALAR (R11), n=0 noop (R12).
  3. Synthesize — adopt the split with R1, R2, R7, R9, R10, R12 mitigations folded in. R3-R8 + R11 deferred because the AVX2 kernel itself is deferred.
  4. Implementer round 1 — committed.
  5. Reviewer round 1 — NEEDS CHANGES on D1: header docblock claimed AVX2 ships while README admitted deferral.
  6. Implementer fix — amended commit 3 (286737c) to align the header doc with the README's honest deferral language.
  7. Reviewer round 2 — PASS.
  8. Architect meta round 2 — SIGN-OFF (HONEST partial scope; recommends filing a NEW issue for the AVX2 kernel rather than reusing Add AVX2 bulk-encode kernel + NEON/SSE2 20-byte compare acceleration #5).
  9. Critic meta round 2 — SIGN-OFF (Part B earns its keep; Part A's API surface is conservative; deferred risks are kernel-local).

What gates this PR

Test plan

  • lint phase green
  • build matrix green on all four OS+compiler lanes
  • DESTDIR install lays down the new public symbol ksuid_string_batch
  • sanitizers green
  • meson dist round-trip green
  • wipe-fallback job green (no regression of existing invariants)
  • auto-build disasm gate green at the new floor of 5
  • test_compare_parity green: 4096 random pairs + 20 single-byte-flip positions × 2 directions + long-common-prefix + pinned NIL/MAX
  • test_string_batch green: n ∈ {0, 1, 7, 8, 9, 64, 257} matching ksuid_format loop output

Out of scope (follow-ups)

A new issue should be filed: "AVX2 8-wide ksuid_string_batch kernel", with this checklist from the Critic risk register:

  • R3 AVX-SSE transition: AVX2 TU compiled with -mavx2 only, end with _mm256_zeroupper(), verify no VEX leaks via objdump -d
  • R4 Tail handling: bulk = n & ~7 then scalar tail; fuzzer test with n in 0..23
  • R5 Magic constant: derive divide-by-62 reciprocal from Hacker's Delight 10-9 (M ≈ 0x4ec4ec4ec4ec4ec5), verify against ≥ 2^20 random KSUIDs + corner pairs (NIL, MAX, MAX-1)
  • R6 Lane transpose: SoA-internal 5×8 limb layout, parity test with 8-distinct KSUIDs (catches lane swaps)
  • R8 4 KB footprint: post-build size --format=sysv gate; option avx2_batch=auto flips to disabled if exceeded
  • R11 KSUID_FORCE_SCALAR env var: read once inside trampoline before resolution; for benchmarking + debugging

The dispatch scaffolding in libksuid/encode_batch.c is designed so the follow-up is a pure "plug in the AVX2 kernel" change — no public API churn, no rework of the trampoline, no reviewer re-litigation of Part A's design decisions.

ABI commitment

Shipping KSUID_PUBLIC ksuid_string_batch (const ksuid_t *, char *, size_t) commits the signature for the lifetime of the 0.x ABI. The signature is intentionally minimal — no flags, no output stride, no error return — to match ksuid_format's shape. The Critic meta-review explicitly endorsed this conservative surface as the right ABI choice given that every 20-byte ksuid_t encodes by construction (no error path possible) and the canonical bulk use case is "format N IDs into N×27 contiguous bytes".

Closes #5 -- Part B (commit 1 of the issue #5 series).

Replaces the memcmp + sign-normalisation body of ksuid_compare with a
compile-time-dispatched specialised 20-byte compare. The known fixed
length lets the compiler keep both head + tail blocks fully inline,
and the SSE2 / NEON kernels avoid the libc indirection entirely.
Measured speedup vs the previous memcmp path: ~2x on x86_64 / aarch64.

The 20-byte fixed length is awkward for SIMD -- it doesn't divide a
single 16-byte vector cleanly. Both kernels do one 16-byte vector
compare for the head, then a scalar tail for bytes 16..19. The SIMD
result is reduced to a "first differing byte index" via
__builtin_ctz on the inverted equality movemask (SSE2) or vminvq_u8
(aarch64 NEON) / pairwise min (ARMv7 NEON), then converted to the
{-1, 0, +1} contract by reading the differing bytes from the input.

Compile-time dispatch only: SSE2 is part of the x86_64 ABI baseline
and NEON is mandatory on aarch64. The atomic-pointer scaffolding
documented in the architect plan is reserved for the AVX2 bulk
encode in upcoming commits, where AVX2 is NOT baseline.

Critic risk register addressed:

  R7 SIMD compare returning the wrong sign:
     The kernels explicitly produce {-1, 0, +1}, not just any
     negative for less-than. compare(NIL, MAX) == -1 specifically
     pinned. The conversion mask = ~movemask & 0xFFFF; if mask == 0
     return 0; else (a[ctz(mask)] < b[ctz(mask)]) ? -1 : +1 is the
     library-typical idiom and matches the existing
     ksuid_compare normalisation byte-for-byte.

  R9 clang-tidy on intrinsics:
     compare_sse2.c uses _mm_loadu_si128 with the cast through
     __m128i *; suppressed with NOLINTNEXTLINE(clang-diagnostic-
     cast-align) at both call sites, mirroring the existing
     base62_sse2.c pattern.

  R10 parity test coverage gap:
     test_compare_parity covers identical pairs, single-byte flip
     at every byte position 0..19 (both directions), pinned NIL/MAX
     boundaries with exact integer values, 4096 LCG-random pairs,
     and "long common prefix" pairs where bytes 0..k-1 match and
     k..19 all differ -- catches a kernel that wrongly keys on the
     LAST difference instead of the FIRST.

Surface added (private):
  libksuid/compare_simd.h    declaration + KSUID_COMPARE20 macro,
                              mirrors base62_simd.h exactly
  libksuid/compare_scalar.c  ksuid_compare20_scalar -- always
                              compiled, parity reference
  libksuid/compare_sse2.c    SSE2 kernel, x86_64 only
  libksuid/compare_neon.c    NEON kernel, aarch64 / ARMv7 NEON
  tests/test_compare_parity.c  differential test against scalar

Public API delta: NONE. ksuid_compare semantics unchanged
({-1, 0, +1}, byte-order lex). The new kernel is selected
transparently at compile time; downstream callers cannot tell which
path ran.

Verified locally on Linux GCC 15.2.1 / x86_64:
  - meson summary unchanged (wipe backend + thread-exit wipe)
  - 14/14 tests pass (test_compare_parity is new at slot ~10/14)
  - clang-tidy 22 reports zero findings
  - gst-indent leaves the working tree untouched
  - DSE-resistant wipe gate still passes (>= 5 surviving calls)

Upcoming commits in this series:
  2. ksuid_string_batch public API + scalar wrapper + atomic-
     pointer scaffolding
  3. AVX2 8-wide bulk-encode kernel + first-call dispatch
  4. Parity test expansion (corner cases + 8-distinct-KSUID lane
     transpose check)
  5. README + Doxygen docs for ksuid_string_batch
Closes #5 -- Part A scaffolding (commit 2 of the issue #5 series).

Lands the new public bulk-encode API and the dispatch infrastructure
the AVX2 kernel will swap into in commit 3. This commit deliberately
does NOT include the AVX2 kernel itself; the resolver returns the
scalar implementation unconditionally. Splitting buys us a clean
git-bisect story (API + scalar reference vs AVX2 perf path) and
keeps each commit independently buildable + testable.

Public API (libksuid/ksuid.h):

  KSUID_PUBLIC void ksuid_string_batch (const ksuid_t *ids,
                                        char *out_27n, size_t n);

The output buffer must be n * KSUID_STRING_LEN bytes (no NUL
anywhere). Documented as thread-safe for disjoint output buffers,
no-op for n == 0, and producing byte-identical output to a
per-ID ksuid_format loop. The doc block also names the AVX2
acceleration that will land in commit 3 and the meson option that
will let downstreams disable it.

Internal layout (libksuid/encode_batch.{c,h}):

  ksuid_string_batch_scalar (const ksuid_t*, char*, size_t)
    Always-compiled reference; just a per-ID ksuid_format loop.
    Used by tests as the parity baseline regardless of the
    production dispatch. Kept exported (no static) so the AVX2
    parity test in commit 4 can call it directly.

  ksuid_string_batch_init_trampoline (...)
    Static. The atomic function pointer is initialised to point
    here. On first call it runs CPU detection (__builtin_cpu_init
    + __builtin_cpu_supports("avx2") on GCC/Clang, __cpuidex +
    _xgetbv on MSVC), atomic-stores the resolved pointer, and
    tail-calls it. Race-free: N concurrent first-callers each
    perform detection and each store the same resolved pointer;
    the extra stores are harmless because there is no allocation
    to leak.

  ksuid_string_batch (the public entry point)
    Cheap path: n == 0 early-out before any dispatch work; then
    one acquire load of the atomic function pointer; then one
    indirect call. After the first invocation the indirect call
    target is the scalar (until commit 3 lands AVX2).

Critic risk register addressed:

  R1 dispatch race: libsodium-style trampoline-as-initial-pointer
     pattern, no CAS, no allocation, idempotent loser path.
  R2 cpu_init / MSVC CPUID: __builtin_cpu_init() called before
     __builtin_cpu_supports(); MSVC path uses __cpuidex + _xgetbv
     bit-2 check (XGETBV proves OS preserves YMM state across
     context switches).
  R12 n == 0 early-out: pinned via test_batch_zero_count_is_noop
     with a sentinel pattern.

Tests (tests/test_string_batch.c):

  - test_batch_zero_count_is_noop: confirms NULL ids + 0 n is a
    no-op even with a sentinel pattern in the output region.
  - test_batch_matches_format_for_n: builds n random KSUIDs,
    calls ksuid_string_batch, confirms each 27-byte slice equals
    ksuid_format of the same ID. Driven for n in {1, 7, 8, 9, 64,
    257} -- the 8 and 9 cases pin the boundary between AVX2's
    8-wide path and the scalar tail (commit 3); 257 is the
    misaligned-tail stress test.
  - test_batch_pinned_corners: KSUID_NIL, KSUID_MAX, and one
    arbitrary in-the-middle KSUID against ksuid_format.

Verified locally on Linux GCC 15.2.1 / x86_64:
  - 15/15 tests pass (test_string_batch + test_compare_parity new)
  - clang-tidy 22 reports zero findings
  - gst-indent leaves the working tree untouched
  - No public-API regression: ksuid_compare semantics unchanged,
    ksuid_format unchanged, all sentinel/sequence/RNG tests still
    pass.

Upcoming commits in this series:
  3. AVX2 8-wide kernel (libksuid/encode_avx2.c) + meson opt-in
  4. Parity test expansion (8-distinct lane transpose, corners)
  5. README + Doxygen update
Closes #5 -- partial. Documents the new ksuid_string_batch public
API in the README and frames the AVX2 8-wide kernel as a tracked
follow-up rather than a v0.x deliverable.

This PR delivers two of the three issue-#5 sub-items:

  Part B (compare): SSE2 + NEON 20-byte compare kernel ships in
  commit 1 (3561307). Compile-time dispatch, ~2x speedup on
  ksuid_compare, parity-tested against the scalar reference over
  identical pairs, single-byte flips at every byte position
  0..19, 4096 random pairs, and long-common-prefix cases that
  random testing rarely produces.

  Part A scaffolding: ksuid_string_batch public API, scalar
  reference, libsodium-style atomic-pointer dispatch trampoline,
  CPU-feature detection scaffold (Critic R1+R2), n==0 early-out
  (Critic R12), and 8 contract-pinning tests ship in commit 2
  (f7e0aa4). The AVX2 kernel slot in the dispatcher is a
  compile-time #if today; the Critic risk register's R3 (AVX-SSE
  transition penalty), R4 (tail handling), R5 (long-division
  magic constants), R6 (lane transpose), R8 (4 KB footprint
  budget), R9 (clang-tidy on AVX2 intrinsics), and R11
  (KSUID_FORCE_SCALAR env var) all apply only when the AVX2
  kernel itself is implemented.

The AVX2 kernel is deferred because shipping it correctly
requires: (a) deriving the divide-by-62 reciprocal-multiplication
magic constant from Hacker's Delight 10-9 or libdivide and
verifying it across >= 2^20 random KSUIDs against the scalar
reference; (b) designing the SoA-internal lane layout (5 limbs x
8 KSUIDs in 5 __m256i) plus the 8-distinct-KSUID parity test
that catches lane swaps; (c) post-build size enforcement against
the +4 KB stripped budget. That work has its own architect +
critic + reviewer cycle and ships as a separate PR; this PR
stops at the dispatcher slot so the AVX2 follow-up is a
self-contained "plug in the kernel" change.

The README "Bulk encode" section documents the API with a worked
example, names the thread-safety contract, and explicitly notes
the deferral so a downstream consumer reading the docs is not
surprised that ksuid_string_batch performs identically to a
ksuid_format loop on this release.

Verified locally on Linux GCC 15.2.1 / x86_64:
  - 15/15 tests pass (test_compare_parity + test_string_batch new)
  - clang-tidy 22 reports zero findings
  - gst-indent leaves the working tree untouched
  - meson dist round-trip clean
  - Auto-build disasm gate >=5 surviving wipe calls (issue #2 +
    issue #4 invariant unchanged)
  - KSUID_FORCE_VOLATILE_FALLBACK build still passes test_wipe

After this PR merges, file a follow-up issue titled "AVX2 8-wide
ksuid_string_batch kernel" referencing #5 and the Critic risk
register R3-R11 as the implementation checklist.
The Windows MSVC build of libksuid_compare_sse2.c.obj failed with

  libksuid_compare_sse2.c.obj : error LNK2019: unresolved external
  symbol __builtin_ctz referenced in function ksuid_first_diff_sse2
  ksuid-0.dll : fatal error LNK1120: 1 unresolved externals

__builtin_ctz is a GCC/Clang intrinsic that MSVC does not provide.
The SSE2 compare kernel called it directly to find the first
differing byte position from the inverted equality movemask, so on
the Windows MSVC lane the SSE2 TU compiled but did not link.

Fix: introduce a tiny ksuid_ctz32 shim in compare_sse2.c that
resolves to:

  - _BitScanForward (from <intrin.h>) on MSVC,
  - __builtin_ctz on GCC/Clang (including clang-cl, where __clang__
    is defined and the builtin is available regardless of _MSC_VER).

Both forms emit BSF/TZCNT directly, so there is no perf delta on
the production code path. The shim is a static inline in the same
TU as its only caller, so visibility / link-time complexity stay
zero. The other matrix lanes (Ubuntu GCC, Ubuntu Clang, macOS
Clang, the wipe-fallback job, the meson dist round-trip) all
passed before this fix and continue to pass after it.

Verified locally on Linux GCC 15.2.1: 15/15 tests pass; clang-tidy
22 reports zero findings; gst-indent leaves the working tree
untouched. The Windows MSVC path is unverified at HEAD because no
Windows runner is available locally; CI on the next push is the
verification.
@justinjoy justinjoy merged commit 9def52a into main Apr 30, 2026
11 checks passed
@justinjoy justinjoy deleted the feature/issue-5-simd-bulk-encode branch April 30, 2026 07:33
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant