[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05 by yushangdi · Pull Request #2754 · pytorch/helion

yushangdi · 2026-06-11T02:29:22Z

Stacked PRs:

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                          before        after       standalone

warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.

Scope and fallbacks:

The unified producer requires static-full tiles + the role-local
separate producer loop. Edge/K-tail families and the bridge
diagnostics that skip individual acquire/advance edges keep the
unrolled-prologue form.
_build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims keep
the old call shape where the gated form is still used.
The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order.

Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.

Co-Authored-By: Claude Fable 5 noreply@anthropic.com

…pills The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile fragment (all subtiles' fp32 values per thread) into a register tensor in the per-output-tile setup, keeping it live across the whole subtile loop on top of the accumulator fragments and TMA-store state. At bk=128 (the CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the epilogue warps to the 255-register cap and into local memory. NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200, block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local: before after dynamic LDL+STL (spills) 409,104 0 static spill sites 111 0 kernel duration 57.8 us 52.6 us tensor-pipe active 71.6 % 81.2 % do_bench TFLOP/s 1915-1942 2100-2120 The top stall lines were all spill traffic in the epilogue: the acc TMEM read (1148 long-scoreboard samples), the whole-tile autovec_copy itself (238), and the acc*sa*sb chain (263). The reference CUTLASS kernel (epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same whole-tile copy only because its epilogue is a lean helper; in Helion's fatter persistent-loop context the copy is what overflows the register budget. The fix removes the hoist machinery entirely: the rowvec scale falls back to the generic per-subtile ttr_aux_subtile.load() placed after the c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69 placement), which spills zero at every measured bk. At bk=64 - the only configuration where the hoist had been measured as a win - removal is perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference at both bk=64 and bk=128. With this fix the fused-scale epilogue is free again relative to the plain fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's behavior (2334 vs 2345). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2754, branch: yushangdi/stack/34

… the TMA producer Two related fixes that together take the fused fp8 scaled_mm kernel from 74.2 us / 1509-1711 TFLOP/s to 52.6 us / 2100-2132 TFLOP/s device-side (batched back-to-back: 2359-2381) on B200, closing in on the CUTLASS standalone (48.7 us / 2317-2334). relerr=0.0000 vs the fp32 reference at bk=64 and bk=128 for both plain and scaled matmul. 1. Remove the fp8 whole-tile rowvec register hoist (epilogue spills) -------------------------------------------------------------------- The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile fragment (all subtiles' fp32 values per thread) into a register tensor in the per-output-tile setup, keeping it live across the whole subtile loop on top of the accumulator fragments and TMA-store state. At bk=128 (the CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the epilogue warps to the 255-register cap and into local memory. NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200, block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local: before after dynamic LDL+STL (spills) 409,104 0 static spill sites 111 0 tensor-pipe active 71.6 % 81.2 % The top stall lines were all spill traffic in the epilogue: the acc TMEM read (1148 long-scoreboard samples), the whole-tile autovec_copy itself (238), and the acc*sa*sb chain (263). The fix removes the hoist machinery entirely: the rowvec scale falls back to the generic per-subtile ttr_aux_subtile.load() placed after the c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69 placement), which spills zero at every measured bk. At bk=64 - the only configuration where the hoist had been measured as a win - removal is perf-neutral (1711 vs 1712 TFLOP/s). 2. Unified rolled TMA producer + hoisted K-loop predicates ---------------------------------------------------------- Static-full role-local tcgen05 configs now emit the CUTLASS reference producer shape: ONE rolled K loop in the TMA-load warp whose blocking producer_acquire fills the AB ring - pipelining comes from the mbarrier handshake, not from codegen structure. This replaces, per work tile, the fully unrolled ab_stages-deep warm-up prologue (8 guarded acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true per-stage predicate) and the separate steady-state producer loop with per-iteration full-tile predicate recomputation. The MMA-exec K loop drops its per-iteration full-tile predicate the same way (static-full means every K tile is full), and all per-K-iter owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the inline form cost an S2UR+IMAD pair per use and was the single largest instruction-count delta vs the reference (+508k dynamic IMAD). NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, same config: before after standalone warp instructions / SMSP 5820 4711 4866 static UTMALDG sites 24 2 2 dynamic SYNCS (barriers) 533k 507k 470k Helion now executes FEWER warp-instructions per SMSP than the CUTLASS standalone; the remaining ~8% duration gap is no longer instruction-count-bound. Scope and fallbacks: * The unified producer requires static-full tiles + the role-local separate producer loop. Edge/K-tail families and the bridge diagnostics that skip individual acquire/advance edges keep the unrolled-prologue form. * _build_kloop_pipeline_consumer_if/_release_if become list-returning _stmts builders so the ungated static-full forms can emit sibling statements without an artificial wrapper; single-statement shims keep the old call shape where the gated form is still used. * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is structurally impossible in the unified form: the blocking acquire fills every stage in order. Test updates: tests that pinned the old per-subtile hoist, per-stage prefetch markers, inline cluster-leader predicates, and the producer-if's full-tile gate now pin the unified/hoisted forms. Full HELION_BACKEND=cute sweep: test_cute_lowerings.py + test_cute_backend.py 479 passed, other cute test files 85 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2754, branch: yushangdi/stack/34

…gen05 Static-full role-local tcgen05 configs now emit the CUTLASS reference producer shape: ONE rolled K loop in the TMA-load warp whose blocking producer_acquire fills the AB ring - pipelining comes from the mbarrier handshake, not from codegen structure. This replaces, per work tile, the fully unrolled ab_stages-deep warm-up prologue (8 guarded acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true per-stage predicate) and the separate steady-state producer loop with per-iteration full-tile predicate recomputation. The MMA-exec K loop drops its per-iteration full-tile predicate the same way (static-full means every K tile is full), and all per-K-iter owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the inline form cost an S2UR+IMAD pair per use and was the single largest instruction-count delta vs the reference (+508k dynamic IMAD). NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200, block_sizes=[256,128,128], ab_stages=8, cluster_m=2: before after standalone warp instructions / SMSP 5820 4711 4866 static UTMALDG sites 24 2 2 dynamic SYNCS (barriers) 533k 507k 470k Helion now executes FEWER warp-instructions per SMSP than the CUTLASS standalone; the remaining duration gap is no longer instruction-count-bound. Scope and fallbacks: * The unified producer requires static-full tiles + the role-local separate producer loop. Edge/K-tail families and the bridge diagnostics that skip individual acquire/advance edges keep the unrolled-prologue form. * _build_kloop_pipeline_consumer_if/_release_if become list-returning _stmts builders so the ungated static-full forms can emit sibling statements without an artificial wrapper; single-statement shims keep the old call shape where the gated form is still used. * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is structurally impossible in the unified form: the blocking acquire fills every stage in order. Test updates: tests that pinned the old per-stage prefetch markers, inline cluster-leader predicates, and the producer-if's full-tile gate now pin the unified/hoisted forms. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2754, branch: yushangdi/stack/34

yushangdi force-pushed the yushangdi/stack/34 branch from 6f5a27a to 9c7a9c0 Compare June 11, 2026 02:29

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 11, 2026

yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 02:35

yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 02:36

yushangdi mentioned this pull request Jun 11, 2026

[cute] Record measured-good B200 config for the fp8 scaled_mm example #2756

Draft

yushangdi force-pushed the yushangdi/stack/27 branch from ea8db74 to d3fee78 Compare June 11, 2026 17:08

yushangdi force-pushed the yushangdi/stack/34 branch from 9c7a9c0 to 68f7f50 Compare June 11, 2026 17:10

yushangdi force-pushed the yushangdi/stack/27 branch from d3fee78 to 1510d86 Compare June 11, 2026 20:49

yushangdi force-pushed the yushangdi/stack/34 branch from 68f7f50 to c18f8b4 Compare June 11, 2026 20:57

yushangdi force-pushed the yushangdi/stack/34 branch from c18f8b4 to d92ad46 Compare June 11, 2026 21:00

yushangdi changed the base branch from yushangdi/stack/27 to yushangdi/stack/26 June 11, 2026 21:02

yushangdi changed the base branch from yushangdi/stack/26 to main June 11, 2026 21:17

yushangdi force-pushed the yushangdi/stack/34 branch from d92ad46 to e01571a Compare June 11, 2026 21:17

yushangdi changed the title ~~[cute] Remove fp8 whole-tile rowvec register hoist to kill epilogue spills~~ [cute] Close the fp8 scaled_mm perf gap: kill epilogue spills + unify the TMA producer Jun 11, 2026

yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 21:18

yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 22:00

yushangdi force-pushed the yushangdi/stack/34 branch from e01571a to 9c18723 Compare June 11, 2026 22:00

yushangdi changed the title ~~[cute] Close the fp8 scaled_mm perf gap: kill epilogue spills + unify the TMA producer~~ [cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05 Jun 11, 2026

yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 22:00

yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 22:16

yushangdi force-pushed the yushangdi/stack/34 branch from 9c18723 to 3f57c00 Compare June 11, 2026 22:16

yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 22:16

yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 22:58

yushangdi force-pushed the yushangdi/stack/34 branch from 3f57c00 to 05d14ef Compare June 11, 2026 22:58

yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 22:58

yushangdi force-pushed the yushangdi/stack/27 branch from 60d5b4c to 8d9123c Compare June 11, 2026 23:12

yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 23:17

yushangdi force-pushed the yushangdi/stack/34 branch from 05d14ef to d853568 Compare June 11, 2026 23:17

yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 23:17

yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 23:41

yushangdi force-pushed the yushangdi/stack/34 branch from d853568 to 3541d8d Compare June 11, 2026 23:41

yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 23:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2754

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2754
yushangdi wants to merge 1 commit into
yushangdi/stack/27from
yushangdi/stack/34

yushangdi commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yushangdi commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!