You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The -slp-inst-count-check option (introduced by 8ac9461, PR #190414), defaulted to true on upstream main as of filing, causes a codegen regression on AMDGPU gfx94x and gfx950 targets. Per @alexey-bataev's review on #199536, the requested direction is to expose the heuristic through a TTI hook so each target can opt out. This issue tracks adding that hook in the SLP-vectorizer. Tuning the AMDGPU side of that hook is out of scope for this issue and will be handled separately by the AMDGPU maintainers.
Background
getTreeCost() in llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp rejects size-2 vector trees whose lowered vector instruction count exceeds the scalar count, gated by the hidden option -slp-inst-count-check (default true). The use site carries a FIXME left by the original author asking for the check to be removed once a correct fractional model is landed for all targets. The heuristic counts instructions as integers via getNumScalarInsts() / getNumVectorInsts().
Observed effect on AMDGPU
With -slp-inst-count-check=true SLP admits a 2-element vector tree that, after lowering, produces measurable codegen regressions on AMDGPU gfx94x and gfx950 in real workloads. The mechanism visible on the minimal reproducer below is an inner-loop 32-bit carrier move being widened, post-codegen, to a 64-bit vector move (e.g. v_mov_b32_e32 becomes v_mov_b64 after llc -mcpu=gfx950), with no reduction in the total move count.
A self-contained IR reproducer (rotating chain of five i32 phis, gfx950 target, with two RUN lines: default behavior vs explicit -slp-inst-count-check=true) is included in #199536 as llvm/test/Transforms/SLPVectorizer/AMDGPU/inst-count-heuristic.ll. The reproducer pins the gfx950 case at the IR level only; the downstream asm widening is observable via llc -mcpu=gfx950 but is not asserted by the test itself.
Beyond the public reproducer, the same option toggle drives throughput regressions across several other AMDGPU workload classes (numbers below are relative to baseline on the same build, heuristic-off vs heuristic-on):
A generic device memory bandwidth kernel: ~7% throughput regression.
A device-to-device memory copy throughput kernel: ~9% regression.
Pseudo-random number generation kernels of the xorshift/xorwow family, across several output distributions: ~10%, ~14%, ~16% regressions on different distributions.
The minimal IR reduction in #199536 corresponds to the PRNG case. The other workload classes reproduce the same option-driven sensitivity on the same targets; the precise downstream mechanism on each may differ from the move-widening described above. Reductions of those other cases to upstream-shareable IR can be provided on request.
What this issue asks for
Add a TTI hook (shape at SLP-maintainer discretion) that lets a target influence the size-2 inst-count check in getTreeCost(). Default behavior on targets that do not implement the hook must remain unchanged.
Scope
In scope:
A TTI virtual function (or equivalent) gating the inst-count check on a per-target basis.
The call site in getTreeCost() updated to consult it.
An SLP-level test of the new mechanism (target-agnostic, e.g. via a test target that opts out).
Out of scope:
AMDGPU TTI implementation that overrides the hook. That work will be carried by the AMDGPU maintainers in a separate PR once the hook lands.
Replacing the integer inst-count check with a fractional-cost formulation. The InstructionCost class already supports fractional costing (noted in the discussion on [SLP] Disable -slp-inst-count-check by default #199536), though adoption in SLP and backends may still be incomplete. It is the natural long-term destination, but that work is independent of and larger than the per-target hook asked here.
Summary
The
-slp-inst-count-checkoption (introduced by 8ac9461, PR #190414), defaulted totrueon upstreammainas of filing, causes a codegen regression on AMDGPUgfx94xandgfx950targets. Per @alexey-bataev's review on #199536, the requested direction is to expose the heuristic through a TTI hook so each target can opt out. This issue tracks adding that hook in the SLP-vectorizer. Tuning the AMDGPU side of that hook is out of scope for this issue and will be handled separately by the AMDGPU maintainers.Background
getTreeCost()inllvm/lib/Transforms/Vectorize/SLPVectorizer.cpprejects size-2 vector trees whose lowered vector instruction count exceeds the scalar count, gated by the hidden option-slp-inst-count-check(defaulttrue). The use site carries aFIXMEleft by the original author asking for the check to be removed once a correct fractional model is landed for all targets. The heuristic counts instructions as integers viagetNumScalarInsts()/getNumVectorInsts().Observed effect on AMDGPU
With
-slp-inst-count-check=trueSLP admits a 2-element vector tree that, after lowering, produces measurable codegen regressions on AMDGPUgfx94xandgfx950in real workloads. The mechanism visible on the minimal reproducer below is an inner-loop 32-bit carrier move being widened, post-codegen, to a 64-bit vector move (e.g.v_mov_b32_e32becomesv_mov_b64afterllc -mcpu=gfx950), with no reduction in the total move count.A self-contained IR reproducer (rotating chain of five
i32phis, gfx950 target, with two RUN lines: default behavior vs explicit-slp-inst-count-check=true) is included in #199536 asllvm/test/Transforms/SLPVectorizer/AMDGPU/inst-count-heuristic.ll. The reproducer pins the gfx950 case at the IR level only; the downstream asm widening is observable viallc -mcpu=gfx950but is not asserted by the test itself.Beyond the public reproducer, the same option toggle drives throughput regressions across several other AMDGPU workload classes (numbers below are relative to baseline on the same build, heuristic-off vs heuristic-on):
The minimal IR reduction in #199536 corresponds to the PRNG case. The other workload classes reproduce the same option-driven sensitivity on the same targets; the precise downstream mechanism on each may differ from the move-widening described above. Reductions of those other cases to upstream-shareable IR can be provided on request.
What this issue asks for
Add a TTI hook (shape at SLP-maintainer discretion) that lets a target influence the size-2 inst-count check in
getTreeCost(). Default behavior on targets that do not implement the hook must remain unchanged.Scope
In scope:
getTreeCost()updated to consult it.Out of scope:
InstructionCostclass already supports fractional costing (noted in the discussion on [SLP] Disable -slp-inst-count-check by default #199536), though adoption in SLP and backends may still be incomplete. It is the natural long-term destination, but that work is independent of and larger than the per-target hook asked here.Related
FIXMEcomment ingetTreeCost()adjacent to theSLPInstCountCheckuse site.