[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2754
Draft
yushangdi wants to merge 1 commit into
Draft
[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2754yushangdi wants to merge 1 commit into
yushangdi wants to merge 1 commit into
Conversation
6f5a27a to
9c7a9c0
Compare
This was referenced Jun 11, 2026
ea8db74 to
d3fee78
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
…pills
The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:
before after
dynamic LDL+STL (spills) 409,104 0
static spill sites 111 0
kernel duration 57.8 us 52.6 us
tensor-pipe active 71.6 % 81.2 %
do_bench TFLOP/s 1915-1942 2100-2120
The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The reference CUTLASS kernel
(epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same
whole-tile copy only because its epilogue is a lean helper; in Helion's
fatter persistent-loop context the copy is what overflows the register
budget.
The fix removes the hoist machinery entirely: the rowvec scale falls back
to the generic per-subtile ttr_aux_subtile.load() placed after the
c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69
placement), which spills zero at every measured bk. At bk=64 - the only
configuration where the hoist had been measured as a win - removal is
perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference
at both bk=64 and bk=128.
With this fix the fused-scale epilogue is free again relative to the plain
fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's
behavior (2334 vs 2345).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
9c7a9c0 to
68f7f50
Compare
d3fee78 to
1510d86
Compare
68f7f50 to
c18f8b4
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
…pills
The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:
before after
dynamic LDL+STL (spills) 409,104 0
static spill sites 111 0
kernel duration 57.8 us 52.6 us
tensor-pipe active 71.6 % 81.2 %
do_bench TFLOP/s 1915-1942 2100-2120
The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The reference CUTLASS kernel
(epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same
whole-tile copy only because its epilogue is a lean helper; in Helion's
fatter persistent-loop context the copy is what overflows the register
budget.
The fix removes the hoist machinery entirely: the rowvec scale falls back
to the generic per-subtile ttr_aux_subtile.load() placed after the
c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69
placement), which spills zero at every measured bk. At bk=64 - the only
configuration where the hoist had been measured as a win - removal is
perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference
at both bk=64 and bk=128.
With this fix the fused-scale epilogue is free again relative to the plain
fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's
behavior (2334 vs 2345).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
…pills
The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:
before after
dynamic LDL+STL (spills) 409,104 0
static spill sites 111 0
kernel duration 57.8 us 52.6 us
tensor-pipe active 71.6 % 81.2 %
do_bench TFLOP/s 1915-1942 2100-2120
The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The reference CUTLASS kernel
(epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same
whole-tile copy only because its epilogue is a lean helper; in Helion's
fatter persistent-loop context the copy is what overflows the register
budget.
The fix removes the hoist machinery entirely: the rowvec scale falls back
to the generic per-subtile ttr_aux_subtile.load() placed after the
c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69
placement), which spills zero at every measured bk. At bk=64 - the only
configuration where the hoist had been measured as a win - removal is
perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference
at both bk=64 and bk=128.
With this fix the fused-scale epilogue is free again relative to the plain
fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's
behavior (2334 vs 2345).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
c18f8b4 to
d92ad46
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
… the TMA producer
Two related fixes that together take the fused fp8 scaled_mm kernel from
74.2 us / 1509-1711 TFLOP/s to 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) on B200, closing in on the CUTLASS
standalone (48.7 us / 2317-2334). relerr=0.0000 vs the fp32 reference at
bk=64 and bk=128 for both plain and scaled matmul.
1. Remove the fp8 whole-tile rowvec register hoist (epilogue spills)
--------------------------------------------------------------------
The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:
before after
dynamic LDL+STL (spills) 409,104 0
static spill sites 111 0
tensor-pipe active 71.6 % 81.2 %
The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The fix removes the hoist machinery
entirely: the rowvec scale falls back to the generic per-subtile
ttr_aux_subtile.load() placed after the c_pipeline acquire / acc
consumer_wait / T2R prefix (the cycle-69 placement), which spills zero at
every measured bk. At bk=64 - the only configuration where the hoist had
been measured as a win - removal is perf-neutral (1711 vs 1712 TFLOP/s).
2. Unified rolled TMA producer + hoisted K-loop predicates
----------------------------------------------------------
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, same config:
before after standalone
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound.
Scope and fallbacks:
* The unified producer requires static-full tiles + the role-local
separate producer loop. Edge/K-tail families and the bridge
diagnostics that skip individual acquire/advance edges keep the
unrolled-prologue form.
* _build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims keep
the old call shape where the gated form is still used.
* The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order.
Test updates: tests that pinned the old per-subtile hoist, per-stage
prefetch markers, inline cluster-leader predicates, and the producer-if's
full-tile gate now pin the unified/hoisted forms. Full HELION_BACKEND=cute
sweep: test_cute_lowerings.py + test_cute_backend.py 479 passed, other cute
test files 85 passed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
d92ad46 to
e01571a
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
…gen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
before after standalone
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.
Scope and fallbacks:
* The unified producer requires static-full tiles + the role-local
separate producer loop. Edge/K-tail families and the bridge
diagnostics that skip individual acquire/advance edges keep the
unrolled-prologue form.
* _build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims keep
the old call shape where the gated form is still used.
* The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order.
Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
e01571a to
9c18723
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
…gen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
before after standalone
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.
Scope and fallbacks:
* The unified producer requires static-full tiles + the role-local
separate producer loop. Edge/K-tail families and the bridge
diagnostics that skip individual acquire/advance edges keep the
unrolled-prologue form.
* _build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims keep
the old call shape where the gated form is still used.
* The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order.
Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
9c18723 to
3f57c00
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
…gen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
before after standalone
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.
Scope and fallbacks:
* The unified producer requires static-full tiles + the role-local
separate producer loop. Edge/K-tail families and the bridge
diagnostics that skip individual acquire/advance edges keep the
unrolled-prologue form.
* _build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims keep
the old call shape where the gated form is still used.
* The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order.
Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
3f57c00 to
05d14ef
Compare
60d5b4c to
8d9123c
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
…gen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
before after standalone
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.
Scope and fallbacks:
* The unified producer requires static-full tiles + the role-local
separate producer loop. Edge/K-tail families and the bridge
diagnostics that skip individual acquire/advance edges keep the
unrolled-prologue form.
* _build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims keep
the old call shape where the gated form is still used.
* The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order.
Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
05d14ef to
d853568
Compare
…gen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
before after standalone
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.
Scope and fallbacks:
* The unified producer requires static-full tiles + the role-local
separate producer loop. Edge/K-tail families and the bridge
diagnostics that skip individual acquire/advance edges keep the
unrolled-prologue form.
* _build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims keep
the old call shape where the gated form is still used.
* The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order.
Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
d853568 to
3541d8d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked PRs:
[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.
Scope and fallbacks:
separate producer loop. Edge/K-tail families and the bridge
diagnostics that skip individual acquire/advance edges keep the
unrolled-prologue form.
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims keep
the old call shape where the gated form is still used.
structurally impossible in the unified form: the blocking acquire
fills every stage in order.
Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.
Co-Authored-By: Claude Fable 5 noreply@anthropic.com