Skip to content

[SLP] -slp-inst-count-check default-on regresses AMDGPU codegen; add TTI hook for per-target opt-out #199662

@michaelselehov

Description

@michaelselehov

Summary

The -slp-inst-count-check option (introduced by 8ac9461, PR #190414), defaulted to true on upstream main as of filing, causes a codegen regression on AMDGPU gfx94x and gfx950 targets. Per @alexey-bataev's review on #199536, the requested direction is to expose the heuristic through a TTI hook so each target can opt out. This issue tracks adding that hook in the SLP-vectorizer. Tuning the AMDGPU side of that hook is out of scope for this issue and will be handled separately by the AMDGPU maintainers.

Background

getTreeCost() in llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp rejects size-2 vector trees whose lowered vector instruction count exceeds the scalar count, gated by the hidden option -slp-inst-count-check (default true). The use site carries a FIXME left by the original author asking for the check to be removed once a correct fractional model is landed for all targets. The heuristic counts instructions as integers via getNumScalarInsts() / getNumVectorInsts().

Observed effect on AMDGPU

With -slp-inst-count-check=true SLP admits a 2-element vector tree that, after lowering, produces measurable codegen regressions on AMDGPU gfx94x and gfx950 in real workloads. The mechanism visible on the minimal reproducer below is an inner-loop 32-bit carrier move being widened, post-codegen, to a 64-bit vector move (e.g. v_mov_b32_e32 becomes v_mov_b64 after llc -mcpu=gfx950), with no reduction in the total move count.

A self-contained IR reproducer (rotating chain of five i32 phis, gfx950 target, with two RUN lines: default behavior vs explicit -slp-inst-count-check=true) is included in #199536 as llvm/test/Transforms/SLPVectorizer/AMDGPU/inst-count-heuristic.ll. The reproducer pins the gfx950 case at the IR level only; the downstream asm widening is observable via llc -mcpu=gfx950 but is not asserted by the test itself.

Beyond the public reproducer, the same option toggle drives throughput regressions across several other AMDGPU workload classes (numbers below are relative to baseline on the same build, heuristic-off vs heuristic-on):

  • A generic device memory bandwidth kernel: ~7% throughput regression.
  • A device-to-device memory copy throughput kernel: ~9% regression.
  • Pseudo-random number generation kernels of the xorshift/xorwow family, across several output distributions: ~10%, ~14%, ~16% regressions on different distributions.

The minimal IR reduction in #199536 corresponds to the PRNG case. The other workload classes reproduce the same option-driven sensitivity on the same targets; the precise downstream mechanism on each may differ from the move-widening described above. Reductions of those other cases to upstream-shareable IR can be provided on request.

What this issue asks for

Add a TTI hook (shape at SLP-maintainer discretion) that lets a target influence the size-2 inst-count check in getTreeCost(). Default behavior on targets that do not implement the hook must remain unchanged.

Scope

In scope:

  • A TTI virtual function (or equivalent) gating the inst-count check on a per-target basis.
  • The call site in getTreeCost() updated to consult it.
  • An SLP-level test of the new mechanism (target-agnostic, e.g. via a test target that opts out).

Out of scope:

  • AMDGPU TTI implementation that overrides the hook. That work will be carried by the AMDGPU maintainers in a separate PR once the hook lands.
  • Replacing the integer inst-count check with a fractional-cost formulation. The InstructionCost class already supports fractional costing (noted in the discussion on [SLP] Disable -slp-inst-count-check by default #199536), though adoption in SLP and backends may still be incomplete. It is the natural long-term destination, but that work is independent of and larger than the per-target hook asked here.

Related

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions