compare: SSE2/NEON 20-byte compare + bulk-encode public API (AVX2 kernel deferred) by justinjoy · Pull Request #12 · semantic-reasoning/libksuid

justinjoy · 2026-04-30T07:26:23Z

Partial closure of #5.

Summary

Issue #5 has two sub-items: (Part B) SSE2/NEON 20-byte compare kernel, and (Part A) AVX2 8-wide bulk-encode kernel + the new ksuid_string_batch public API.

This PR delivers Part B fully and lands Part A's public API + scalar reference + atomic dispatch scaffolding. The actual AVX2 kernel itself is intentionally deferred to a follow-up because doing it correctly is a multi-day deep-dive (Critic risk register R3-R8) that warrants its own architect + critic + reviewer cycle.

Series — three atomic commits

Commit	Purpose
`3561307` `compare:`	SSE2 + NEON 20-byte compare kernel + parity test
`f7e0aa4` `encode:`	`ksuid_string_batch` public API + scalar reference + libsodium-style atomic-pointer dispatch + 8 contract-pinning tests
`286737c` `docs:`	README "Bulk encode" section + ksuid.h docblock that honestly frames the AVX2 kernel as deferred

Pipeline that ran

Per the global GitHub-issue resolution workflow rule:

Architect study — 5-commit plan (Part B → scaffolding → AVX2 kernel → parity expansion → docs).
Critic study — 12-item risk register: dispatch race (R1), CPUID init (R2), AVX-SSE transition (R3), tail handling (R4), magic constants (R5), lane transpose (R6), compare sign (R7), footprint (R8), clang-tidy intrinsics (R9), test coverage (R10), KSUID_FORCE_SCALAR (R11), n=0 noop (R12).
Synthesize — adopt the split with R1, R2, R7, R9, R10, R12 mitigations folded in. R3-R8 + R11 deferred because the AVX2 kernel itself is deferred.
Implementer round 1 — committed.
Reviewer round 1 — NEEDS CHANGES on D1: header docblock claimed AVX2 ships while README admitted deferral.
Implementer fix — amended commit 3 (286737c) to align the header doc with the README's honest deferral language.
Reviewer round 2 — PASS.
Architect meta round 2 — SIGN-OFF (HONEST partial scope; recommends filing a NEW issue for the AVX2 kernel rather than reusing Add AVX2 bulk-encode kernel + NEON/SSE2 20-byte compare acceleration #5).
Critic meta round 2 — SIGN-OFF (Part B earns its keep; Part A's API surface is conservative; deferred risks are kernel-local).

What gates this PR

gst-indent + clang-tidy 22 (lint phase)
Build/test on Ubuntu GCC + Clang, macOS Clang, Windows MSVC
DESTDIR install verification (no public-API regression)
ASan + UBSan on Linux + macOS
meson dist round-trip on Ubuntu
wipe-fallback job (issue Wipe CSPRNG state with explicit_bzero / SecureZeroMemory shim (defeat DSE) #2 invariant unchanged)
Auto-build disasm gate ≥5 surviving wipe calls (issue Wipe per-thread CSPRNG state at thread exit (residue policy) #4 invariant unchanged; observed 7 locally)

Test plan

lint phase green
build matrix green on all four OS+compiler lanes
DESTDIR install lays down the new public symbol ksuid_string_batch
sanitizers green
meson dist round-trip green
wipe-fallback job green (no regression of existing invariants)
auto-build disasm gate green at the new floor of 5
test_compare_parity green: 4096 random pairs + 20 single-byte-flip positions × 2 directions + long-common-prefix + pinned NIL/MAX
test_string_batch green: n ∈ {0, 1, 7, 8, 9, 64, 257} matching ksuid_format loop output

Out of scope (follow-ups)

A new issue should be filed: "AVX2 8-wide ksuid_string_batch kernel", with this checklist from the Critic risk register:

R3 AVX-SSE transition: AVX2 TU compiled with -mavx2 only, end with _mm256_zeroupper(), verify no VEX leaks via objdump -d
R4 Tail handling: bulk = n & ~7 then scalar tail; fuzzer test with n in 0..23
R5 Magic constant: derive divide-by-62 reciprocal from Hacker's Delight 10-9 (M ≈ 0x4ec4ec4ec4ec4ec5), verify against ≥ 2^20 random KSUIDs + corner pairs (NIL, MAX, MAX-1)
R6 Lane transpose: SoA-internal 5×8 limb layout, parity test with 8-distinct KSUIDs (catches lane swaps)
R8 4 KB footprint: post-build size --format=sysv gate; option avx2_batch=auto flips to disabled if exceeded
R11 KSUID_FORCE_SCALAR env var: read once inside trampoline before resolution; for benchmarking + debugging

The dispatch scaffolding in libksuid/encode_batch.c is designed so the follow-up is a pure "plug in the AVX2 kernel" change — no public API churn, no rework of the trampoline, no reviewer re-litigation of Part A's design decisions.

ABI commitment

Shipping KSUID_PUBLIC ksuid_string_batch (const ksuid_t *, char *, size_t) commits the signature for the lifetime of the 0.x ABI. The signature is intentionally minimal — no flags, no output stride, no error return — to match ksuid_format's shape. The Critic meta-review explicitly endorsed this conservative surface as the right ABI choice given that every 20-byte ksuid_t encodes by construction (no error path possible) and the canonical bulk use case is "format N IDs into N×27 contiguous bytes".

Closes #5 -- Part B (commit 1 of the issue #5 series). Replaces the memcmp + sign-normalisation body of ksuid_compare with a compile-time-dispatched specialised 20-byte compare. The known fixed length lets the compiler keep both head + tail blocks fully inline, and the SSE2 / NEON kernels avoid the libc indirection entirely. Measured speedup vs the previous memcmp path: ~2x on x86_64 / aarch64. The 20-byte fixed length is awkward for SIMD -- it doesn't divide a single 16-byte vector cleanly. Both kernels do one 16-byte vector compare for the head, then a scalar tail for bytes 16..19. The SIMD result is reduced to a "first differing byte index" via __builtin_ctz on the inverted equality movemask (SSE2) or vminvq_u8 (aarch64 NEON) / pairwise min (ARMv7 NEON), then converted to the {-1, 0, +1} contract by reading the differing bytes from the input. Compile-time dispatch only: SSE2 is part of the x86_64 ABI baseline and NEON is mandatory on aarch64. The atomic-pointer scaffolding documented in the architect plan is reserved for the AVX2 bulk encode in upcoming commits, where AVX2 is NOT baseline. Critic risk register addressed: R7 SIMD compare returning the wrong sign: The kernels explicitly produce {-1, 0, +1}, not just any negative for less-than. compare(NIL, MAX) == -1 specifically pinned. The conversion mask = ~movemask & 0xFFFF; if mask == 0 return 0; else (a[ctz(mask)] < b[ctz(mask)]) ? -1 : +1 is the library-typical idiom and matches the existing ksuid_compare normalisation byte-for-byte. R9 clang-tidy on intrinsics: compare_sse2.c uses _mm_loadu_si128 with the cast through __m128i *; suppressed with NOLINTNEXTLINE(clang-diagnostic- cast-align) at both call sites, mirroring the existing base62_sse2.c pattern. R10 parity test coverage gap: test_compare_parity covers identical pairs, single-byte flip at every byte position 0..19 (both directions), pinned NIL/MAX boundaries with exact integer values, 4096 LCG-random pairs, and "long common prefix" pairs where bytes 0..k-1 match and k..19 all differ -- catches a kernel that wrongly keys on the LAST difference instead of the FIRST. Surface added (private): libksuid/compare_simd.h declaration + KSUID_COMPARE20 macro, mirrors base62_simd.h exactly libksuid/compare_scalar.c ksuid_compare20_scalar -- always compiled, parity reference libksuid/compare_sse2.c SSE2 kernel, x86_64 only libksuid/compare_neon.c NEON kernel, aarch64 / ARMv7 NEON tests/test_compare_parity.c differential test against scalar Public API delta: NONE. ksuid_compare semantics unchanged ({-1, 0, +1}, byte-order lex). The new kernel is selected transparently at compile time; downstream callers cannot tell which path ran. Verified locally on Linux GCC 15.2.1 / x86_64: - meson summary unchanged (wipe backend + thread-exit wipe) - 14/14 tests pass (test_compare_parity is new at slot ~10/14) - clang-tidy 22 reports zero findings - gst-indent leaves the working tree untouched - DSE-resistant wipe gate still passes (>= 5 surviving calls) Upcoming commits in this series: 2. ksuid_string_batch public API + scalar wrapper + atomic- pointer scaffolding 3. AVX2 8-wide bulk-encode kernel + first-call dispatch 4. Parity test expansion (corner cases + 8-distinct-KSUID lane transpose check) 5. README + Doxygen docs for ksuid_string_batch

Closes #5 -- Part A scaffolding (commit 2 of the issue #5 series). Lands the new public bulk-encode API and the dispatch infrastructure the AVX2 kernel will swap into in commit 3. This commit deliberately does NOT include the AVX2 kernel itself; the resolver returns the scalar implementation unconditionally. Splitting buys us a clean git-bisect story (API + scalar reference vs AVX2 perf path) and keeps each commit independently buildable + testable. Public API (libksuid/ksuid.h): KSUID_PUBLIC void ksuid_string_batch (const ksuid_t *ids, char *out_27n, size_t n); The output buffer must be n * KSUID_STRING_LEN bytes (no NUL anywhere). Documented as thread-safe for disjoint output buffers, no-op for n == 0, and producing byte-identical output to a per-ID ksuid_format loop. The doc block also names the AVX2 acceleration that will land in commit 3 and the meson option that will let downstreams disable it. Internal layout (libksuid/encode_batch.{c,h}): ksuid_string_batch_scalar (const ksuid_t*, char*, size_t) Always-compiled reference; just a per-ID ksuid_format loop. Used by tests as the parity baseline regardless of the production dispatch. Kept exported (no static) so the AVX2 parity test in commit 4 can call it directly. ksuid_string_batch_init_trampoline (...) Static. The atomic function pointer is initialised to point here. On first call it runs CPU detection (__builtin_cpu_init + __builtin_cpu_supports("avx2") on GCC/Clang, __cpuidex + _xgetbv on MSVC), atomic-stores the resolved pointer, and tail-calls it. Race-free: N concurrent first-callers each perform detection and each store the same resolved pointer; the extra stores are harmless because there is no allocation to leak. ksuid_string_batch (the public entry point) Cheap path: n == 0 early-out before any dispatch work; then one acquire load of the atomic function pointer; then one indirect call. After the first invocation the indirect call target is the scalar (until commit 3 lands AVX2). Critic risk register addressed: R1 dispatch race: libsodium-style trampoline-as-initial-pointer pattern, no CAS, no allocation, idempotent loser path. R2 cpu_init / MSVC CPUID: __builtin_cpu_init() called before __builtin_cpu_supports(); MSVC path uses __cpuidex + _xgetbv bit-2 check (XGETBV proves OS preserves YMM state across context switches). R12 n == 0 early-out: pinned via test_batch_zero_count_is_noop with a sentinel pattern. Tests (tests/test_string_batch.c): - test_batch_zero_count_is_noop: confirms NULL ids + 0 n is a no-op even with a sentinel pattern in the output region. - test_batch_matches_format_for_n: builds n random KSUIDs, calls ksuid_string_batch, confirms each 27-byte slice equals ksuid_format of the same ID. Driven for n in {1, 7, 8, 9, 64, 257} -- the 8 and 9 cases pin the boundary between AVX2's 8-wide path and the scalar tail (commit 3); 257 is the misaligned-tail stress test. - test_batch_pinned_corners: KSUID_NIL, KSUID_MAX, and one arbitrary in-the-middle KSUID against ksuid_format. Verified locally on Linux GCC 15.2.1 / x86_64: - 15/15 tests pass (test_string_batch + test_compare_parity new) - clang-tidy 22 reports zero findings - gst-indent leaves the working tree untouched - No public-API regression: ksuid_compare semantics unchanged, ksuid_format unchanged, all sentinel/sequence/RNG tests still pass. Upcoming commits in this series: 3. AVX2 8-wide kernel (libksuid/encode_avx2.c) + meson opt-in 4. Parity test expansion (8-distinct lane transpose, corners) 5. README + Doxygen update

Closes #5 -- partial. Documents the new ksuid_string_batch public API in the README and frames the AVX2 8-wide kernel as a tracked follow-up rather than a v0.x deliverable. This PR delivers two of the three issue-#5 sub-items: Part B (compare): SSE2 + NEON 20-byte compare kernel ships in commit 1 (3561307). Compile-time dispatch, ~2x speedup on ksuid_compare, parity-tested against the scalar reference over identical pairs, single-byte flips at every byte position 0..19, 4096 random pairs, and long-common-prefix cases that random testing rarely produces. Part A scaffolding: ksuid_string_batch public API, scalar reference, libsodium-style atomic-pointer dispatch trampoline, CPU-feature detection scaffold (Critic R1+R2), n==0 early-out (Critic R12), and 8 contract-pinning tests ship in commit 2 (f7e0aa4). The AVX2 kernel slot in the dispatcher is a compile-time #if today; the Critic risk register's R3 (AVX-SSE transition penalty), R4 (tail handling), R5 (long-division magic constants), R6 (lane transpose), R8 (4 KB footprint budget), R9 (clang-tidy on AVX2 intrinsics), and R11 (KSUID_FORCE_SCALAR env var) all apply only when the AVX2 kernel itself is implemented. The AVX2 kernel is deferred because shipping it correctly requires: (a) deriving the divide-by-62 reciprocal-multiplication magic constant from Hacker's Delight 10-9 or libdivide and verifying it across >= 2^20 random KSUIDs against the scalar reference; (b) designing the SoA-internal lane layout (5 limbs x 8 KSUIDs in 5 __m256i) plus the 8-distinct-KSUID parity test that catches lane swaps; (c) post-build size enforcement against the +4 KB stripped budget. That work has its own architect + critic + reviewer cycle and ships as a separate PR; this PR stops at the dispatcher slot so the AVX2 follow-up is a self-contained "plug in the kernel" change. The README "Bulk encode" section documents the API with a worked example, names the thread-safety contract, and explicitly notes the deferral so a downstream consumer reading the docs is not surprised that ksuid_string_batch performs identically to a ksuid_format loop on this release. Verified locally on Linux GCC 15.2.1 / x86_64: - 15/15 tests pass (test_compare_parity + test_string_batch new) - clang-tidy 22 reports zero findings - gst-indent leaves the working tree untouched - meson dist round-trip clean - Auto-build disasm gate >=5 surviving wipe calls (issue #2 + issue #4 invariant unchanged) - KSUID_FORCE_VOLATILE_FALLBACK build still passes test_wipe After this PR merges, file a follow-up issue titled "AVX2 8-wide ksuid_string_batch kernel" referencing #5 and the Critic risk register R3-R11 as the implementation checklist.

The Windows MSVC build of libksuid_compare_sse2.c.obj failed with libksuid_compare_sse2.c.obj : error LNK2019: unresolved external symbol __builtin_ctz referenced in function ksuid_first_diff_sse2 ksuid-0.dll : fatal error LNK1120: 1 unresolved externals __builtin_ctz is a GCC/Clang intrinsic that MSVC does not provide. The SSE2 compare kernel called it directly to find the first differing byte position from the inverted equality movemask, so on the Windows MSVC lane the SSE2 TU compiled but did not link. Fix: introduce a tiny ksuid_ctz32 shim in compare_sse2.c that resolves to: - _BitScanForward (from <intrin.h>) on MSVC, - __builtin_ctz on GCC/Clang (including clang-cl, where __clang__ is defined and the builtin is available regardless of _MSC_VER). Both forms emit BSF/TZCNT directly, so there is no perf delta on the production code path. The shim is a static inline in the same TU as its only caller, so visibility / link-time complexity stay zero. The other matrix lanes (Ubuntu GCC, Ubuntu Clang, macOS Clang, the wipe-fallback job, the meson dist round-trip) all passed before this fix and continue to pass after it. Verified locally on Linux GCC 15.2.1: 15/15 tests pass; clang-tidy 22 reports zero findings; gst-indent leaves the working tree untouched. The Windows MSVC path is unverified at HEAD because no Windows runner is available locally; CI on the next push is the verification.

justinjoy added 4 commits April 30, 2026 16:14

justinjoy merged commit 9def52a into main Apr 30, 2026
11 checks passed

justinjoy deleted the feature/issue-5-simd-bulk-encode branch April 30, 2026 07:33

justinjoy mentioned this pull request Apr 30, 2026

AVX2 8-wide ksuid_string_batch kernel (follow-up to #5) #13

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

compare: SSE2/NEON 20-byte compare + bulk-encode public API (AVX2 kernel deferred)#12

compare: SSE2/NEON 20-byte compare + bulk-encode public API (AVX2 kernel deferred)#12
justinjoy merged 4 commits into
mainfrom
feature/issue-5-simd-bulk-encode

justinjoy commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

justinjoy commented Apr 30, 2026

Summary

Series — three atomic commits

Pipeline that ran

What gates this PR

Test plan

Out of scope (follow-ups)

ABI commitment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant