[SLP] -slp-inst-count-check default-on regresses AMDGPU codegen; add TTI hook for per-target opt-out

### Summary

The `-slp-inst-count-check` option (introduced by 8ac9461e48e0, PR https://github.com/llvm/llvm-project/pull/190414) defaults to `true` on upstream `main` as of filing and causes a codegen regression on AMDGPU `gfx94x` and `gfx950` targets. Per @alexey-bataev's review on https://github.com/llvm/llvm-project/pull/199536, the requested direction is to expose the heuristic through a TTI hook so each target can opt out. This issue tracks adding that hook in the SLP vectorizer. Tuning the AMDGPU side of that hook is out of scope for this issue and will be handled separately by the AMDGPU maintainers.

### Background

`getTreeCost()` in `llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp` rejects size-2 vector trees whose lowered vector instruction count exceeds the scalar count, gated by the hidden option `-slp-inst-count-check` (default `true`). The use site carries a `FIXME` left by the original author asking for the check to be removed once a correct fractional model has landed for all targets. The heuristic counts instructions as integers via `getNumScalarInsts()` / `getNumVectorInsts()`.
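
As a rough illustration of the gate described above, the following standalone C++ sketch models the comparison. It is not the code in `SLPVectorizer.cpp`: the `TreeCountsSketch` struct, its field names, and `rejectedByInstCountCheck` are hypothetical, and "size-2" is interpreted here as a two-element tree.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-tree numbers SLP derives. In LLVM the
// counts come from getNumScalarInsts() / getNumVectorInsts(); the struct and
// field names below are assumptions made for this sketch.
struct TreeCountsSketch {
  unsigned Size;           // number of elements in the tree (assumed meaning)
  unsigned NumScalarInsts; // scalar instructions the tree would replace
  unsigned NumVectorInsts; // vector instructions after lowering
};

// Models the gate: with the option enabled, a size-2 tree is rejected when
// the vector form needs strictly more instructions than the scalar form.
bool rejectedByInstCountCheck(const TreeCountsSketch &T,
                              bool InstCountCheck = true) {
  if (!InstCountCheck || T.Size != 2)
    return false;
  return T.NumVectorInsts > T.NumScalarInsts;
}

// Usage example: call from main() to exercise the model.
inline void demoInstCountCheck() {
  TreeCountsSketch MoreVectorInsts{2, 4, 5};
  TreeCountsSketch FewerVectorInsts{2, 4, 3};
  assert(rejectedByInstCountCheck(MoreVectorInsts));
  assert(!rejectedByInstCountCheck(FewerVectorInsts));
  // With the option off, the gate never fires.
  assert(!rejectedByInstCountCheck(MoreVectorInsts, /*InstCountCheck=*/false));
}
```

Both counts are plain integers in this model, which is the property the `FIXME` refers to.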

### Observed effect on AMDGPU

With `-slp-inst-count-check=true`, SLP admits a 2-element vector tree that, after lowering, produces measurable codegen regressions on AMDGPU `gfx94x` and `gfx950` in real workloads. The mechanism visible on the minimal reproducer below is an inner-loop 32-bit carrier move being widened, post-codegen, to a 64-bit vector move (e.g. `v_mov_b32_e32` becomes `v_mov_b64` after `llc -mcpu=gfx950`), with no reduction in the total move count.

A self-contained IR reproducer (rotating chain of five `i32` phis, gfx950 target, with two RUN lines: default behavior vs explicit `-slp-inst-count-check=true`) is included in https://github.com/llvm/llvm-project/pull/199536 as `llvm/test/Transforms/SLPVectorizer/AMDGPU/inst-count-heuristic.ll`. The reproducer pins the gfx950 case at the IR level only; the downstream asm widening is observable via `llc -mcpu=gfx950` but is not asserted by the test itself.

Beyond the public reproducer, the same option toggle drives throughput regressions across several other AMDGPU workload classes (numbers below are relative to baseline on the same build, heuristic-off vs heuristic-on):

- A generic device memory bandwidth kernel: ~7% throughput regression.
- A device-to-device memory copy throughput kernel: ~9% regression.
- Pseudo-random number generation kernels of the xorshift/xorwow family: regressions of ~10%, ~14%, and ~16%, depending on the output distribution.

The minimal IR reduction in #199536 corresponds to the PRNG case. The other workload classes reproduce the same option-driven sensitivity on the same targets; the precise downstream mechanism on each may differ from the move-widening described above. Reductions of those other cases to upstream-shareable IR can be provided on request.

### What this issue asks for

Add a TTI hook (shape at SLP-maintainer discretion) that lets a target influence the size-2 inst-count check in `getTreeCost()`. Default behavior on targets that do not implement the hook must remain unchanged.
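
One possible shape for such a hook, sketched as standalone C++ rather than a patch against LLVM. Everything here is hypothetical: `TargetHooksSketch` is a miniature stand-in for a TTI-style interface, `enableSLPInstCountCheck` is an invented hook name, and the rule that the option and the hook must both be enabled is only one of several reasonable ways to combine them.

```cpp
#include <cassert>

// Hypothetical miniature of a TTI-style interface; not the real
// TargetTransformInfo class.
struct TargetHooksSketch {
  virtual ~TargetHooksSketch() = default;
  // Hypothetical hook. The default returns true so that targets which do
  // not override it keep the current behavior.
  virtual bool enableSLPInstCountCheck() const { return true; }
};

// A hypothetical target that opts out of the check.
struct OptOutTargetSketch : TargetHooksSketch {
  bool enableSLPInstCountCheck() const override { return false; }
};

// Models the call site: the size-2 check only runs when both the command-line
// option and the target hook allow it (an assumption of this sketch).
bool shouldRejectSize2Tree(const TargetHooksSketch &TTI, bool OptionEnabled,
                           unsigned NumScalarInsts, unsigned NumVectorInsts) {
  if (!OptionEnabled || !TTI.enableSLPInstCountCheck())
    return false;
  return NumVectorInsts > NumScalarInsts;
}

// Usage example: call from main() to exercise the model.
inline void demoHookGate() {
  TargetHooksSketch DefaultTarget;
  OptOutTargetSketch OptOutTarget;
  assert(shouldRejectSize2Tree(DefaultTarget, true, 4, 5));
  assert(!shouldRejectSize2Tree(OptOutTarget, true, 4, 5));
}
```

Whether an explicit `-slp-inst-count-check` on the command line should override the target's answer is left open here; that choice belongs with the hook's final design.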

### Scope

In scope:

- A TTI virtual function (or equivalent) gating the inst-count check on a per-target basis.
- The call site in `getTreeCost()` updated to consult it.
- An SLP-level test of the new mechanism (target-agnostic, e.g. via a test target that opts out).

Out of scope:

- AMDGPU TTI implementation that overrides the hook. That work will be carried by the AMDGPU maintainers in a separate PR once the hook lands.
- Replacing the integer inst-count check with a fractional-cost formulation. The `InstructionCost` class already supports fractional costing (noted in the discussion on https://github.com/llvm/llvm-project/pull/199536), though adoption in SLP and backends may still be incomplete. It is the natural long-term destination, but that work is independent of, and larger than, the per-target hook requested here.
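
To illustrate why an integer count and a fractional cost can disagree, here is a toy C++ comparison. It does not use LLVM's `InstructionCost`; the `WeightedInstSketch` type and the weights are made up for the example and do not describe any real target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction with a relative cost in arbitrary units. A real
// model would take weights from the target's cost tables.
struct WeightedInstSketch {
  double Weight;
};

// Integer view: every instruction counts as one.
std::size_t integerInstCount(const std::vector<WeightedInstSketch> &Insts) {
  return Insts.size();
}

// Fractional view: sum of per-instruction weights.
double fractionalCostSketch(const std::vector<WeightedInstSketch> &Insts) {
  double Sum = 0.0;
  for (const WeightedInstSketch &I : Insts)
    Sum += I.Weight;
  return Sum;
}

// Usage example: call from main() to exercise the comparison.
inline void demoCountVersusCost() {
  // Made-up numbers: two full-cost scalar instructions versus three
  // half-cost vector instructions.
  std::vector<WeightedInstSketch> Scalar = {{1.0}, {1.0}};
  std::vector<WeightedInstSketch> Vector = {{0.5}, {0.5}, {0.5}};
  // The integer count says the vector form is worse (3 > 2) ...
  assert(integerInstCount(Vector) > integerInstCount(Scalar));
  // ... while the weighted sum says it is cheaper (1.5 < 2.0).
  assert(fractionalCostSketch(Vector) < fractionalCostSketch(Scalar));
}
```

The two views can also disagree in the opposite direction, with equal counts but a higher weighted cost for the vector form.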

### Related

- PR https://github.com/llvm/llvm-project/pull/199536 is an interim default flip until the hook exists; this issue tracks only the TTI-hook work in SLP. Once the hook lands, the AMDGPU TTI override will be filed separately and #199536 will be revisited (either closed in favour of the hook or reframed accordingly).
- Original PR https://github.com/llvm/llvm-project/pull/190414 (introduction of the heuristic).
- `FIXME` comment in `getTreeCost()` adjacent to the `SLPInstCountCheck` use site.


[SLP] -slp-inst-count-check default-on regresses AMDGPU codegen; add TTI hook for per-target opt-out #199662
