Skip to content

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2754

Draft
yushangdi wants to merge 1 commit into
yushangdi/stack/27from
yushangdi/stack/34
Draft

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2754
yushangdi wants to merge 1 commit into
yushangdi/stack/27from
yushangdi/stack/34

Conversation

@yushangdi

@yushangdi yushangdi commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Stacked PRs:


[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                          before        after       standalone

warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.

Scope and fallbacks:

  • The unified producer requires static-full tiles + the role-local
    separate producer loop. Edge/K-tail families and the bridge
    diagnostics that skip individual acquire/advance edges keep the
    unrolled-prologue form.
  • _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims keep
    the old call shape where the gated form is still used.
  • The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order.

Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.

Co-Authored-By: Claude Fable 5 noreply@anthropic.com

@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from 6f5a27a to 9c7a9c0 Compare June 11, 2026 02:29
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 11, 2026
@yushangdi yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 02:35
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 02:36
@yushangdi yushangdi force-pushed the yushangdi/stack/27 branch from ea8db74 to d3fee78 Compare June 11, 2026 17:08
yushangdi added a commit that referenced this pull request Jun 11, 2026
…pills

The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:

                              before        after
  dynamic LDL+STL (spills)    409,104       0
  static spill sites          111           0
  kernel duration             57.8 us       52.6 us
  tensor-pipe active          71.6 %        81.2 %
  do_bench TFLOP/s            1915-1942     2100-2120

The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The reference CUTLASS kernel
(epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same
whole-tile copy only because its epilogue is a lean helper; in Helion's
fatter persistent-loop context the copy is what overflows the register
budget.

The fix removes the hoist machinery entirely: the rowvec scale falls back
to the generic per-subtile ttr_aux_subtile.load() placed after the
c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69
placement), which spills zero at every measured bk. At bk=64 - the only
configuration where the hoist had been measured as a win - removal is
perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference
at both bk=64 and bk=128.

With this fix the fused-scale epilogue is free again relative to the plain
fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's
behavior (2334 vs 2345).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from 9c7a9c0 to 68f7f50 Compare June 11, 2026 17:10
@yushangdi yushangdi force-pushed the yushangdi/stack/27 branch from d3fee78 to 1510d86 Compare June 11, 2026 20:49
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from 68f7f50 to c18f8b4 Compare June 11, 2026 20:57
yushangdi added a commit that referenced this pull request Jun 11, 2026
…pills

The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:

                              before        after
  dynamic LDL+STL (spills)    409,104       0
  static spill sites          111           0
  kernel duration             57.8 us       52.6 us
  tensor-pipe active          71.6 %        81.2 %
  do_bench TFLOP/s            1915-1942     2100-2120

The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The reference CUTLASS kernel
(epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same
whole-tile copy only because its epilogue is a lean helper; in Helion's
fatter persistent-loop context the copy is what overflows the register
budget.

The fix removes the hoist machinery entirely: the rowvec scale falls back
to the generic per-subtile ttr_aux_subtile.load() placed after the
c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69
placement), which spills zero at every measured bk. At bk=64 - the only
configuration where the hoist had been measured as a win - removal is
perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference
at both bk=64 and bk=128.

With this fix the fused-scale epilogue is free again relative to the plain
fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's
behavior (2334 vs 2345).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
yushangdi added a commit that referenced this pull request Jun 11, 2026
…pills

The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:

                              before        after
  dynamic LDL+STL (spills)    409,104       0
  static spill sites          111           0
  kernel duration             57.8 us       52.6 us
  tensor-pipe active          71.6 %        81.2 %
  do_bench TFLOP/s            1915-1942     2100-2120

The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The reference CUTLASS kernel
(epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same
whole-tile copy only because its epilogue is a lean helper; in Helion's
fatter persistent-loop context the copy is what overflows the register
budget.

The fix removes the hoist machinery entirely: the rowvec scale falls back
to the generic per-subtile ttr_aux_subtile.load() placed after the
c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69
placement), which spills zero at every measured bk. At bk=64 - the only
configuration where the hoist had been measured as a win - removal is
perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference
at both bk=64 and bk=128.

With this fix the fused-scale epilogue is free again relative to the plain
fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's
behavior (2334 vs 2345).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from c18f8b4 to d92ad46 Compare June 11, 2026 21:00
@yushangdi yushangdi changed the base branch from yushangdi/stack/27 to yushangdi/stack/26 June 11, 2026 21:02
@yushangdi yushangdi changed the base branch from yushangdi/stack/26 to main June 11, 2026 21:17
yushangdi added a commit that referenced this pull request Jun 11, 2026
… the TMA producer

Two related fixes that together take the fused fp8 scaled_mm kernel from
74.2 us / 1509-1711 TFLOP/s to 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) on B200, closing in on the CUTLASS
standalone (48.7 us / 2317-2334). relerr=0.0000 vs the fp32 reference at
bk=64 and bk=128 for both plain and scaled matmul.

1. Remove the fp8 whole-tile rowvec register hoist (epilogue spills)
--------------------------------------------------------------------
The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:

                              before        after
  dynamic LDL+STL (spills)    409,104       0
  static spill sites          111           0
  tensor-pipe active          71.6 %        81.2 %

The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The fix removes the hoist machinery
entirely: the rowvec scale falls back to the generic per-subtile
ttr_aux_subtile.load() placed after the c_pipeline acquire / acc
consumer_wait / T2R prefix (the cycle-69 placement), which spills zero at
every measured bk. At bk=64 - the only configuration where the hoist had
been measured as a win - removal is perf-neutral (1711 vs 1712 TFLOP/s).

2. Unified rolled TMA producer + hoisted K-loop predicates
----------------------------------------------------------
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, same config:

                              before        after       standalone
  warp instructions / SMSP    5820          4711        4866
  static UTMALDG sites        24            2           2
  dynamic SYNCS (barriers)    533k          507k        470k

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound.

Scope and fallbacks:
  * The unified producer requires static-full tiles + the role-local
    separate producer loop. Edge/K-tail families and the bridge
    diagnostics that skip individual acquire/advance edges keep the
    unrolled-prologue form.
  * _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims keep
    the old call shape where the gated form is still used.
  * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order.

Test updates: tests that pinned the old per-subtile hoist, per-stage
prefetch markers, inline cluster-leader predicates, and the producer-if's
full-tile gate now pin the unified/hoisted forms. Full HELION_BACKEND=cute
sweep: test_cute_lowerings.py + test_cute_backend.py 479 passed, other cute
test files 85 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from d92ad46 to e01571a Compare June 11, 2026 21:17
@yushangdi yushangdi changed the title [cute] Remove fp8 whole-tile rowvec register hoist to kill epilogue spills [cute] Close the fp8 scaled_mm perf gap: kill epilogue spills + unify the TMA producer Jun 11, 2026
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 21:18
@yushangdi yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 22:00
yushangdi added a commit that referenced this pull request Jun 11, 2026
…gen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                              before        after       standalone
  warp instructions / SMSP    5820          4711        4866
  static UTMALDG sites        24            2           2
  dynamic SYNCS (barriers)    533k          507k        470k

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.

Scope and fallbacks:
  * The unified producer requires static-full tiles + the role-local
    separate producer loop. Edge/K-tail families and the bridge
    diagnostics that skip individual acquire/advance edges keep the
    unrolled-prologue form.
  * _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims keep
    the old call shape where the gated form is still used.
  * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order.

Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from e01571a to 9c18723 Compare June 11, 2026 22:00
@yushangdi yushangdi changed the title [cute] Close the fp8 scaled_mm perf gap: kill epilogue spills + unify the TMA producer [cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05 Jun 11, 2026
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 22:00
@yushangdi yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 22:16
yushangdi added a commit that referenced this pull request Jun 11, 2026
…gen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                              before        after       standalone
  warp instructions / SMSP    5820          4711        4866
  static UTMALDG sites        24            2           2
  dynamic SYNCS (barriers)    533k          507k        470k

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.

Scope and fallbacks:
  * The unified producer requires static-full tiles + the role-local
    separate producer loop. Edge/K-tail families and the bridge
    diagnostics that skip individual acquire/advance edges keep the
    unrolled-prologue form.
  * _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims keep
    the old call shape where the gated form is still used.
  * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order.

Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from 9c18723 to 3f57c00 Compare June 11, 2026 22:16
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 22:16
@yushangdi yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 22:58
yushangdi added a commit that referenced this pull request Jun 11, 2026
…gen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                              before        after       standalone
  warp instructions / SMSP    5820          4711        4866
  static UTMALDG sites        24            2           2
  dynamic SYNCS (barriers)    533k          507k        470k

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.

Scope and fallbacks:
  * The unified producer requires static-full tiles + the role-local
    separate producer loop. Edge/K-tail families and the bridge
    diagnostics that skip individual acquire/advance edges keep the
    unrolled-prologue form.
  * _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims keep
    the old call shape where the gated form is still used.
  * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order.

Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from 3f57c00 to 05d14ef Compare June 11, 2026 22:58
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 22:58
@yushangdi yushangdi force-pushed the yushangdi/stack/27 branch from 60d5b4c to 8d9123c Compare June 11, 2026 23:12
@yushangdi yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 23:17
yushangdi added a commit that referenced this pull request Jun 11, 2026
…gen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                              before        after       standalone
  warp instructions / SMSP    5820          4711        4866
  static UTMALDG sites        24            2           2
  dynamic SYNCS (barriers)    533k          507k        470k

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.

Scope and fallbacks:
  * The unified producer requires static-full tiles + the role-local
    separate producer loop. Edge/K-tail families and the bridge
    diagnostics that skip individual acquire/advance edges keep the
    unrolled-prologue form.
  * _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims keep
    the old call shape where the gated form is still used.
  * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order.

Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from 05d14ef to d853568 Compare June 11, 2026 23:17
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 23:17
…gen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile, the
fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each re-computing a compile-time-true
per-stage predicate) and the separate steady-state producer loop with
per-iteration full-tile predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full), and all per-K-iter owner
gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of
re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the
inline form cost an S2UR+IMAD pair per use and was the single largest
instruction-count delta vs the reference (+508k dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                              before        after       standalone
  warp instructions / SMSP    5820          4711        4866
  static UTMALDG sites        24            2           2
  dynamic SYNCS (barriers)    533k          507k        470k

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining duration gap is no longer
instruction-count-bound.

Scope and fallbacks:
  * The unified producer requires static-full tiles + the role-local
    separate producer loop. Edge/K-tail families and the bridge
    diagnostics that skip individual acquire/advance edges keep the
    unrolled-prologue form.
  * _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims keep
    the old call shape where the gated form is still used.
  * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order.

Test updates: tests that pinned the old per-stage prefetch markers,
inline cluster-leader predicates, and the producer-if's full-tile gate
now pin the unified/hoisted forms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
@yushangdi yushangdi changed the base branch from yushangdi/stack/27 to main June 11, 2026 23:41
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from d853568 to 3541d8d Compare June 11, 2026 23:41
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/27 June 11, 2026 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant