Proposal: Add runtime SIMD dispatch to support one binary across different CPU environments #57

## Background

Currently, some SIMD-optimized paths depend on compile-time ISA selection, compiler-specific target attributes, or native build flags. This can couple the generated binary tightly to the build machine or to a specific compiler and CPU feature set.

For package distribution, especially Python wheels and prebuilt binaries, it would be better to support runtime SIMD dispatch: build multiple ISA-specific implementations into one binary, detect CPU capabilities at runtime, and select the best available implementation automatically.

## Goal

Add runtime SIMD dispatch for performance-critical SIMD code paths.

Expected behavior:

- One binary / wheel can run on different x86 CPU environments.
- At runtime, select the fastest supported implementation, for example:
  - AVX512 when available
  - AVX2 otherwise
  - portable fallback when needed
- Avoid requiring users to build with `-march=native`.
- Avoid exposing ISA-specific intrinsics in public headers where possible.
- Keep ISA-specific implementations isolated in dedicated translation units.
- Preserve performance close to the current native / compile-time optimized build.
- Keep recall and indexing/search behavior unchanged.
- Make the design extensible for future compiler/platform support, such as MSVC or ARM NEON.
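The selection order above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `dot`, `dot_scalar`, `dot_avx2`, `select_dot`, and the `SKETCH_X86_DISPATCH` macro are hypothetical names, and it assumes GCC or Clang on x86-64, where `__builtin_cpu_supports` is available. The `target("avx2")` attribute is used only to keep the sketch in one file; the proposal itself prefers dedicated translation units compiled with their own ISA flags. An AVX512 kernel would slot in ahead of the AVX2 check in the same way.

```cpp
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define SKETCH_X86_DISPATCH 1  // hypothetical macro, used only in this sketch
#endif

// Hypothetical kernel signature: dot product of two float arrays.
using DotFn = float (*)(const float*, const float*, std::size_t);

// Portable fallback, compiled with the baseline ISA.
static float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

#ifdef SKETCH_X86_DISPATCH
// AVX2 kernel. The attribute lets this one function use AVX2 without
// building the whole file with -mavx2.
__attribute__((target("avx2")))
static float dot_avx2(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (float v : lanes) sum += v;            // horizontal reduction
    for (; i < n; ++i) sum += a[i] * b[i];     // scalar tail
    return sum;
}
#endif

// Picks the best kernel the running CPU supports.
static DotFn select_dot() {
#ifdef SKETCH_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return dot_avx2;
#endif
    return dot_scalar;
}

// Public entry point: no intrinsics appear in its signature.
float dot(const float* a, const float* b, std::size_t n) {
    static const DotFn impl = select_dot();  // resolved on first call
    return impl(a, b, n);
}
```

Callers only see `dot(a, b, n)`; which kernel runs depends on the machine the binary is executed on, not the machine it was built on.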

## Advantages

### Better binary compatibility

A single prebuilt package can support more machines without requiring users to rebuild from source.

### Easier Python wheel distribution

Python wheels can be built once per platform instead of per CPU generation. This is especially useful for users who install from pip and expect the package to work out of the box.

### Safer runtime behavior

Runtime CPU feature checks can prevent accidentally executing unsupported instructions, avoiding illegal-instruction crashes on older CPUs.
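As a rough sketch of such a check, the code below separates feature detection from backend choice so the choice can be tested without the matching hardware. `Backend`, `CpuFeatures`, `detect_cpu_features`, and `choose_backend` are hypothetical names, and the detection assumes GCC or Clang on x86-64; other compilers and platforms fall through to the scalar backend here. A real AVX512 kernel may also need to check further subsets (for example `avx512bw`) depending on the instructions it uses.

```cpp
// Hypothetical backend identifiers, ordered from slowest to fastest.
enum class Backend { Scalar, Avx2, Avx512 };

// Hypothetical summary of the features the dispatcher cares about.
struct CpuFeatures {
    bool avx2;
    bool avx512f;
};

// Queries the running CPU. On unsupported compilers or architectures
// every flag stays false, so the portable fallback is selected.
CpuFeatures detect_cpu_features() {
    CpuFeatures f{false, false};
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    __builtin_cpu_init();
    f.avx2 = __builtin_cpu_supports("avx2") != 0;
    f.avx512f = __builtin_cpu_supports("avx512f") != 0;
#endif
    return f;
}

// Pure selection logic: never returns a backend the CPU did not report.
Backend choose_backend(const CpuFeatures& f) {
    if (f.avx512f) return Backend::Avx512;
    if (f.avx2) return Backend::Avx2;
    return Backend::Scalar;
}

const char* backend_name(Backend b) {
    switch (b) {
        case Backend::Avx512: return "avx512";
        case Backend::Avx2:   return "avx2";
        default:              return "scalar";
    }
}
```

Calling `backend_name(choose_backend(detect_cpu_features()))` at startup gives a string that could be logged to show which path a given machine took.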

### Better maintainability

Keeping SIMD implementations behind dispatch interfaces makes it clearer which code is ISA-specific and which code is shared logic.

### Easier future extension

The same dispatch structure can later support additional backends, such as:

- AVX512 variants
- AVX2
- ARM NEON

### Performance remains competitive

The intent is not to replace SIMD optimization with generic scalar code. The dispatch layer should preserve optimized kernels and only add a small runtime selection cost, ideally resolved once and reused.
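One way to keep the selection cost to a single resolution is a table of kernel pointers built on first use. The sketch below is an assumption about how that might look, not the project's design: `KernelTable`, `kernels`, `resolve_kernels`, and the counter are hypothetical, and the resolver returns scalar kernels only so the example runs anywhere.

```cpp
#include <cstddef>

// Hypothetical kernel signatures; the names are illustrative only.
using DotFn = float (*)(const float*, const float*, std::size_t);
using L2Fn  = float (*)(const float*, const float*, std::size_t);

// All kernels for one backend, selected together.
struct KernelTable {
    const char* name;
    DotFn dot;
    L2Fn l2_sqr;
};

static float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

static float l2_sqr_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Instrumentation for this sketch only: counts resolver invocations.
static int g_resolve_count = 0;

static KernelTable resolve_kernels() {
    ++g_resolve_count;
    // A real resolver would test CPU features here and return the
    // AVX512 or AVX2 table when the running CPU supports it.
    return KernelTable{"scalar", dot_scalar, l2_sqr_scalar};
}

// Function-local static: initialized once, thread-safe since C++11.
const KernelTable& kernels() {
    static const KernelTable table = resolve_kernels();
    return table;
}

int resolve_count() { return g_resolve_count; }
```

After the first call, each `kernels().dot(...)` costs an initialization-guard check plus an indirect call, which is small next to a kernel that processes whole vectors.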

The main success criteria are:

- Same results / recall as before
- No significant performance regression on AVX2 and AVX512 machines
- One binary can run correctly across supported CPU environments
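The "same results" criterion could be checked with a parity test between the fallback and each optimized kernel. The sketch below is illustrative only: `dot_reference`, `dot_lanes8`, and `dot_relative_diff` are hypothetical names, and `dot_lanes8` is a scalar stand-in that mimics the eight-lane summation order of an AVX2 kernel so the example runs on any CPU. Because SIMD kernels sum in a different order, floating-point results may differ in the last bits, so the comparison uses a relative tolerance rather than exact equality.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference kernel: plain left-to-right accumulation.
float dot_reference(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Stand-in for a SIMD kernel: eight partial sums, then a reduction,
// which reproduces the summation order of an 8-wide vector loop.
float dot_lanes8(const float* a, const float* b, std::size_t n) {
    float lanes[8] = {0.0f};
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (std::size_t k = 0; k < 8; ++k) lanes[k] += a[i + k] * b[i + k];
    float sum = 0.0f;
    for (float v : lanes) sum += v;
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
}

// Relative difference between the two kernels on deterministic inputs.
float dot_relative_diff(std::size_t n) {
    std::vector<float> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 0.001f * static_cast<float>(i % 97);
        b[i] = 0.002f * static_cast<float>(i % 89);
    }
    const float ref = dot_reference(a.data(), b.data(), n);
    const float got = dot_lanes8(a.data(), b.data(), n);
    const float denom = std::fabs(ref) > 0.0f ? std::fabs(ref) : 1.0f;
    return std::fabs(ref - got) / denom;
}
```

A test suite could run such a comparison for every backend the host CPU supports, alongside recall checks on a fixed dataset.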
