Background
Currently, some SIMD-optimized paths depend on compile-time ISA selection, compiler-specific target attributes, or native build flags. This can tightly couple the generated binary to the build machine or to a specific compiler/CPU feature set.
For package distribution, especially Python wheels and prebuilt binaries, it would be better to support runtime SIMD dispatch: build multiple ISA-specific implementations into one binary, detect CPU capabilities at runtime, and select the best available implementation automatically.
Goal
Add runtime SIMD dispatch for performance-critical SIMD code paths.
Expected behavior:
- One binary / wheel can run on different x86 CPU environments.
- At runtime, select the fastest supported implementation, for example:
  - AVX512 when available
  - AVX2 otherwise
  - portable fallback when needed
- Avoid requiring users to build with -march=native.
- Avoid exposing ISA-specific intrinsics in public headers where possible.
- Keep ISA-specific implementations isolated in dedicated translation units.
- Preserve performance close to the current native / compile-time optimized build.
- Keep recall and indexing/search behavior unchanged.
- Make the design extensible for future compiler/platform support, such as MSVC or ARM NEON.
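The expected behavior above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names L2Fn, Isa, detect_isa, select_l2, and l2_distance are hypothetical, and the AVX2/AVX512 kernels are scalar stand-ins so the sketch compiles on any machine. In a real build each kernel would live in its own translation unit compiled with the matching ISA flags. Detection here assumes GCC or Clang on x86 via __builtin_cpu_supports; other compilers (for example MSVC) would need a different detection path.

```cpp
#include <cstddef>

// Hypothetical kernel signature: squared L2 distance between two float vectors.
using L2Fn = float (*)(const float*, const float*, std::size_t);

// Portable fallback kernel.
static float l2_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Stand-ins for ISA-specific kernels. In a real build these would be defined
// in separate translation units compiled with AVX2 / AVX512 flags.
static float l2_avx2(const float* a, const float* b, std::size_t n) {
    return l2_scalar(a, b, n);
}
static float l2_avx512(const float* a, const float* b, std::size_t n) {
    return l2_scalar(a, b, n);
}

enum class Isa { Scalar, Avx2, Avx512 };

// Query the CPU at runtime (assumption: GCC/Clang builtins on x86).
// On any other compiler or architecture this falls back to Scalar.
static Isa detect_isa() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return Isa::Avx512;
    if (__builtin_cpu_supports("avx2")) return Isa::Avx2;
#endif
    return Isa::Scalar;
}

// Map a detected ISA level to the corresponding kernel.
static L2Fn select_l2(Isa isa) {
    switch (isa) {
        case Isa::Avx512: return l2_avx512;
        case Isa::Avx2:   return l2_avx2;
        case Isa::Scalar: return l2_scalar;
    }
    return l2_scalar;
}

// Public entry point: no intrinsics in the signature. The kernel is resolved
// once on first call (thread-safe static initialization) and then reused.
float l2_distance(const float* a, const float* b, std::size_t n) {
    static const L2Fn fn = select_l2(detect_isa());
    return fn(a, b, n);
}
```

Callers only see l2_distance, so the public header would not need to include any intrinsics headers, and the selection cost is paid once rather than per call.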
Advantages
Better binary compatibility
A single prebuilt package can support more machines without requiring users to rebuild from source.
Easier Python wheel distribution
Python wheels can be built once per platform instead of per CPU generation. This is especially useful for users who install from pip and expect the package to work out of the box.
Safer runtime behavior
Runtime CPU feature checks can prevent accidentally executing unsupported instructions, avoiding illegal-instruction crashes on older CPUs.
Better maintainability
Keeping SIMD implementations behind dispatch interfaces makes it clearer which code is ISA-specific and which code is shared logic.
Easier future extension
The same dispatch structure can later support additional backends, such as:
- AVX512 variants
- AVX2
- ARM NEON
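One way to keep the structure open to new backends is a preference-ordered table of candidates, where adding a backend means adding one row. The sketch below is only an illustration under assumptions: Backend, kBackends, pick_backend, and the has_* helpers are hypothetical names, every row reuses the portable kernel as a stand-in, and the x86 checks assume the GCC/Clang builtin __builtin_cpu_supports.

```cpp
#include <cstddef>

using DotFn = float (*)(const float*, const float*, std::size_t);

// One row per backend: a name, a runtime support check, and a kernel.
struct Backend {
    const char* name;
    bool (*supported)();
    DotFn dot;
};

// Portable kernel, also used as the stand-in for every backend in this sketch.
static float dot_portable(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

static bool has_avx512() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;
#endif
}

static bool has_avx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

static bool has_neon() {
#if defined(__ARM_NEON)
    return true;  // compile-time check: the target is built with NEON enabled
#else
    return false;
#endif
}

static bool always() { return true; }

// Ordered from most to least preferred. The last row must always be supported.
static const Backend kBackends[] = {
    {"avx512",   has_avx512, dot_portable},
    {"avx2",     has_avx2,   dot_portable},
    {"neon",     has_neon,   dot_portable},
    {"portable", always,     dot_portable},
};

// Return the first backend whose support check passes.
const Backend& pick_backend() {
    const std::size_t count = sizeof(kBackends) / sizeof(kBackends[0]);
    for (std::size_t i = 0; i < count; ++i) {
        if (kBackends[i].supported()) return kBackends[i];
    }
    return kBackends[count - 1];
}
```

With this layout, supporting another AVX512 variant or a new platform would mean adding a support check and a row, without touching the callers.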
Performance remains competitive
The intent is not to replace SIMD optimization with generic scalar code. The dispatch layer should preserve optimized kernels and only add a small runtime selection cost, ideally resolved once and reused.
The main success criteria are:
- Same results / recall as before
- No significant performance regression on AVX2 and AVX512 machines
- One binary can run correctly across supported CPU environments
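The "same results" criterion could be checked by comparing every dispatched kernel against the portable reference on the same inputs. The sketch below is a hypothetical harness, not an existing test in the project: l2_reference, l2_lanes8, and matches_reference are made-up names, and l2_lanes8 is a scalar stand-in that only mimics the summation order of an 8-lane vector kernel. Because SIMD kernels typically reorder floating-point additions, the comparison uses a relative tolerance rather than exact equality; the tolerance value is an assumption to be tuned.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using L2Fn = float (*)(const float*, const float*, std::size_t);

// Reference kernel: strictly sequential accumulation.
float l2_reference(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Stand-in for a SIMD kernel: accumulates in 8 independent lanes and then
// reduces, which changes the order of additions like a vector kernel would.
float l2_lanes8(const float* a, const float* b, std::size_t n) {
    float lanes[8] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        for (std::size_t k = 0; k < 8; ++k) {
            const float d = a[i + k] - b[i + k];
            lanes[k] += d * d;
        }
    }
    float sum = 0.0f;
    for (float v : lanes) sum += v;
    for (; i < n; ++i) {  // scalar tail for the remaining elements
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Compare a kernel against the reference on deterministic inputs of length n.
bool matches_reference(L2Fn kernel, std::size_t n, float rel_tol) {
    std::vector<float> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 0.001f * static_cast<float>(i % 97);
        b[i] = 0.002f * static_cast<float>(i % 89);
    }
    const float ref = l2_reference(a.data(), b.data(), n);
    const float got = kernel(a.data(), b.data(), n);
    return std::fabs(got - ref) <= rel_tol * std::fmax(1.0f, std::fabs(ref));
}
```

Running such a check for each backend that the current machine supports, across several vector lengths including ones that are not a multiple of the lane count, would cover both the vector body and the tail handling.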