Skip to content

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2755

Draft
yushangdi wants to merge 16 commits into
yushangdi/stack/34from
yushangdi/stack/35
Draft

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2755
yushangdi wants to merge 16 commits into
yushangdi/stack/34from
yushangdi/stack/35

Conversation

@yushangdi

@yushangdi yushangdi commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Stacked PRs:


[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile:

  • the fully unrolled ab_stages-deep warm-up prologue (8 guarded
    acquire/copy/commit blocks at ab=8, each with a re-computed
    per-stage "k_tile fits in K" predicate that is a compile-time truth
    under static shapes), and
  • the separate steady-state producer loop issuing TMA at
    k_tile + ab_stages with per-iteration full-tile/next-full-tile
    predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full, so consumer wait / release
run unconditionally under their two-CTA owner gates), and all per-K-iter
owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead
of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate -
the inline form cost an S2UR+IMAD pair per use inside the K loop and was
the single largest instruction-count delta vs the reference (+508k
dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                          before        after       standalone

warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
kernel duration 52.2 us 51.9 us 48.0 us
tensor-pipe active 81.6 % 81.5 % 84.2 %

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound. Combined with the epilogue spill fix, the fused
scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) vs the standalone's 48.7 us /
2317-2334, up from 74.2 us / 1509-1711 at the start of this
investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale.

Scope and fallbacks:

  • The unified producer requires static-full tiles + the role-local
    separate producer loop (the producer warp owns its K loop, so a
    blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail
    families and the bridge diagnostics that skip individual
    acquire/advance edges keep the unrolled-prologue form.
  • _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims
    keep the old call shape where the gated form is still used.
  • The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order. The regression test now pins the
    unified codegen markers and still verifies end-to-end correctness.

Test updates: three tests pinned old codegen shapes (per-stage prefetch
markers, inline cluster-leader predicates in the K loop, the
producer-if's full-tile gate) and now pin the unified/hoisted forms.
Full HELION_BACKEND=cute sweep: test_cute_lowerings.py +
test_cute_backend.py 479 passed, other cute test files 85 passed.

Co-Authored-By: Claude Fable 5 noreply@anthropic.com

yushangdi added a commit that referenced this pull request Jun 11, 2026
…gen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile:

  * the fully unrolled ab_stages-deep warm-up prologue (8 guarded
    acquire/copy/commit blocks at ab=8, each with a re-computed
    per-stage "k_tile fits in K" predicate that is a compile-time truth
    under static shapes), and
  * the separate steady-state producer loop issuing TMA at
    k_tile + ab_stages with per-iteration full-tile/next-full-tile
    predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full, so consumer wait / release
run unconditionally under their two-CTA owner gates), and all per-K-iter
owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead
of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate -
the inline form cost an S2UR+IMAD pair per use inside the K loop and was
the single largest instruction-count delta vs the reference (+508k
dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                              before        after       standalone
  warp instructions / SMSP    5820          4711        4866
  static UTMALDG sites        24            2           2
  dynamic SYNCS (barriers)    533k          507k        470k
  kernel duration             52.2 us       51.9 us     48.0 us
  tensor-pipe active          81.6 %        81.5 %      84.2 %

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound. Combined with the epilogue spill fix, the fused
scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) vs the standalone's 48.7 us /
2317-2334, up from 74.2 us / 1509-1711 at the start of this
investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale.

Scope and fallbacks:
  * The unified producer requires static-full tiles + the role-local
    separate producer loop (the producer warp owns its K loop, so a
    blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail
    families and the bridge diagnostics that skip individual
    acquire/advance edges keep the unrolled-prologue form.
  * _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims
    keep the old call shape where the gated form is still used.
  * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order. The regression test now pins the
    unified codegen markers and still verifies end-to-end correctness.

Test updates: three tests pinned old codegen shapes (per-stage prefetch
markers, inline cluster-leader predicates in the K loop, the
producer-if's full-tile gate) and now pin the unified/hoisted forms.
Full HELION_BACKEND=cute sweep: test_cute_lowerings.py +
test_cute_backend.py 479 passed, other cute test files 85 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2755, branch: yushangdi/stack/35
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from 6f5a27a to 9c7a9c0 Compare June 11, 2026 02:29
@yushangdi yushangdi force-pushed the yushangdi/stack/35 branch from 2bb4805 to 138744f Compare June 11, 2026 02:29
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 11, 2026
@yushangdi yushangdi changed the base branch from yushangdi/stack/34 to main June 11, 2026 02:35
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/34 June 11, 2026 02:36
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from 9c7a9c0 to 68f7f50 Compare June 11, 2026 17:10
yushangdi added a commit that referenced this pull request Jun 11, 2026
…gen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile:

  * the fully unrolled ab_stages-deep warm-up prologue (8 guarded
    acquire/copy/commit blocks at ab=8, each with a re-computed
    per-stage "k_tile fits in K" predicate that is a compile-time truth
    under static shapes), and
  * the separate steady-state producer loop issuing TMA at
    k_tile + ab_stages with per-iteration full-tile/next-full-tile
    predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full, so consumer wait / release
run unconditionally under their two-CTA owner gates), and all per-K-iter
owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead
of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate -
the inline form cost an S2UR+IMAD pair per use inside the K loop and was
the single largest instruction-count delta vs the reference (+508k
dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                              before        after       standalone
  warp instructions / SMSP    5820          4711        4866
  static UTMALDG sites        24            2           2
  dynamic SYNCS (barriers)    533k          507k        470k
  kernel duration             52.2 us       51.9 us     48.0 us
  tensor-pipe active          81.6 %        81.5 %      84.2 %

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound. Combined with the epilogue spill fix, the fused
scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) vs the standalone's 48.7 us /
2317-2334, up from 74.2 us / 1509-1711 at the start of this
investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale.

Scope and fallbacks:
  * The unified producer requires static-full tiles + the role-local
    separate producer loop (the producer warp owns its K loop, so a
    blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail
    families and the bridge diagnostics that skip individual
    acquire/advance edges keep the unrolled-prologue form.
  * _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims
    keep the old call shape where the gated form is still used.
  * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order. The regression test now pins the
    unified codegen markers and still verifies end-to-end correctness.

Test updates: three tests pinned old codegen shapes (per-stage prefetch
markers, inline cluster-leader predicates in the K loop, the
producer-if's full-tile gate) and now pin the unified/hoisted forms.
Full HELION_BACKEND=cute sweep: test_cute_lowerings.py +
test_cute_backend.py 479 passed, other cute test files 85 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2755, branch: yushangdi/stack/35
@yushangdi yushangdi force-pushed the yushangdi/stack/35 branch from 138744f to 2237a51 Compare June 11, 2026 17:10
Backports ONLY the program_id.py change from #2696 (the
fused-scale memory_ops.py/cute_fx_walk.py and the deep-AB-staging
tcgen05_config.py pieces are intentionally omitted -- see below).

Adds cutlass.Float8E4M3FN / Float8E5M2 to the persistent role-local
pure-call whitelist in Tcgen05PersistentProgramIDs. Without this, the
role-local CtaGroup.TWO (cluster_m=2) persistent scheduler refuses to emit
for fp8 operands: full role-local body extraction fails and codegen raises
"tcgen05 fully role-local codegen would discard observable shared
statement(s)", so the autotuner falls back to a slow cluster_m=1
non-persistent kernel.

This is the single change (on top of the fp8 enablement commit) that
unblocks the fast persistent role-local path for fp8 -- verified by
bisection: the enablement commit alone fails to compile the CtaGroup.TWO
fp8 config; adding just these two whitelist lines makes it emit and run.

Benchmark (B200, CUDA 13.2, fp8 e4m3 rowwise scaled_mm, row-major B,
m=512 k=2048 n=2048, CtaGroup.TWO cluster_m=2 ab_stages=3 role-local
config, triton do_bench, correctness rel_err<=0.002 vs torch._scaled_mm):

  without this change : config fails to compile (InternalError);
                        autotuner falls back to ~5 TFLOP/s
  with this change    : 38.5 us   111.7 TFLOP/s   (role-local engaged)

i.e. ~20x over the non-persistent fallback the enablement commit is
otherwise stuck on. (For reference on the same shape: torch._scaled_mm
~376 TFLOP/s, standalone CUTLASS CuTe scaled_mm ~331 TFLOP/s -- closing
that remaining gap needs K-major B support, which is separate.)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2739, branch: yushangdi/stack/24
fp8 scaled_mm feeds B column-major (K-contiguous) -- the layout
torch._scaled_mm requires and the standalone CUTLASS CuTe kernel uses
(b_major="k"). Helion previously gated the tcgen05 TMA path on
rhs.is_contiguous() (row-major only) and hardcoded B as MN-major, so a
column-major B operand was forced onto the slow non-TMA fallback
(~5 TFLOP/s), and even feeding row-major B used the slower MN-major MMA
layout.

This makes B's major mode layout-aware:
- cute_mma.py: classify a 2D operand as TMA-eligible when contiguous in
  EITHER axis (row -> MN-major, col -> K-major) instead of requiring
  row-major is_contiguous(); compute tcgen05_b_k_major from rhs stride and
  thread it through _make_tiled_mma_setup / _make_tcgen05_layout_plan_setup
  and onto the ab_tma wrapper plan.
- _tcgen05_tiled_mma_expr / strategies.tcgen05_smem_layout_expr: emit
  OperandMajorMode.K and make_smem_layout_b(is_k_major=True) for B when
  K-major (delegating to CuTe's helper, mirroring the K-major A path).
- runtime/__init__.py: host wrapper emits the K-major B tiled_mma +
  SMEM layout and marks the B TMA view leading_dim=1 (K axis) for K-major
  vs leading_dim=0 (N axis) for MN-major.
- test_cute_backend.py: test_matmul_mma_tcgen05_fp8_col_major_b.

All paths default to the prior MN-major behavior when B is row-major, so
existing golden/codegen tests are byte-identical (full cute suite: 93
passed). K-major B is correct (rel_err 0.0000 vs torch._scaled_mm).

Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096,
CtaGroup.TWO cluster_m=2 ab_stages=3 role-local, do_bench, 10s warmup):

  row-major B (MN-major, old forced layout) :  577 TFLOP/s   31% of aten
  col-major B (K-major, this change)        : 1143 TFLOP/s   62% of aten
  torch._scaled_mm                          : 1857 TFLOP/s

i.e. ~2x from using the tensor-core-native K-major B layout. The
remaining gap to torch._scaled_mm is deep AB staging (ab_stages>3), a
separate lever.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2740, branch: yushangdi/stack/25
With K-major B in place, the fp8 tcgen05 kernel was still capped at
ab_stages=3 (the bf16-tuned limit), so its software pipeline was too
shallow to hide K-loop TMA latency on compute-bound shapes (4096^3:
1143 TFLOP/s, 62% of torch._scaled_mm). 1-byte fp8 operands fit a much
deeper AB pipeline than 2-byte bf16 in the same SMEM budget.

Backports the tcgen05_config.py deep-staging logic from
#2696:
- max_ab_stages_that_fit(): largest ab_stages whose AB SMEM fits the
  per-CTA budget (mirrors CUTLASS _compute_stages).
- _validate_target1_ab_stage_envelope(): admit ab_stages>3 for fp8 as
  long as the AB SMEM fits, instead of hard-failing at 3.
- optional_fragments(): widen the ab_stages search cap to 12 for fp8
  (1-byte operands) on BOTH the search and validation surfaces, so the
  autotuner can sample deep pipelines and a frozen deep-staged fp8 config
  also passes normalize().

Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, col-major B,
CtaGroup.TWO cluster_m=2 role-local, do_bench, 10s warmup):

  ab_stages= 3 (prev cap) : 1143 TFLOP/s   62% of aten
  ab_stages= 6            : 1494 TFLOP/s   80%
  ab_stages= 8            : 1613 TFLOP/s   86%   <- sweet spot
  ab_stages=10/12         : ~1600 TFLOP/s  (SMEM-bound plateau)
  torch._scaled_mm        : 1867 TFLOP/s

So K-major B (prior commit) + deep AB staging together take fp8 scaled_mm
from ~31% to ~86% of torch._scaled_mm on the 4096^3 compute-bound shape.
Correctness rel_err 0.0000 throughout; full cute suite: 93 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2741, branch: yushangdi/stack/26
The fp8 scaled_mm epilogue loaded both rowwise scales per subtile AFTER
the accumulator consumer_wait, exposing their GMEM latency. The standalone
CUTLASS kernel (cute_scaled_mm) instead reads the whole rowvec scale into
registers ONCE before the acc wait so its latency overlaps the MMA, and
reads the per-row colvec scale as a single scalar.

Backports the remaining fused-scale pieces from #2696:
- memory_ops.py: register-hoist a rowvec epilogue aux (rowwise scale_b[n])
  into a register tensor in per-tile setup (before the subtile loop / acc
  consumer_wait), fp8-gated; read per-subtile from registers instead of a
  fresh GMEM load.
- cute_fx_walk.py: classify a stride-(1,0) (M,N) aux as a per-row column
  vector ("broadcast", 2) so scale_a is read as a scalar per subtile
  (tTR_gAux[(0,0,0,s)]) instead of a redundant N-wide vector load.

Generated epilogue now matches the standalone: autovec_copy hoist before
the acc consumer_wait, scalar colvec read.

Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, col-major B,
ab_stages=8, do_bench, 10s warmup):

  before (deep staging only) : 1616 TFLOP/s   86% of aten
  after (this change)        : 1650 TFLOP/s   88% of aten
  torch._scaled_mm           : 1878 TFLOP/s

Full cute suite: 93 passed; rel_err 0.0000.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2742, branch: yushangdi/stack/27
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from 68f7f50 to c18f8b4 Compare June 11, 2026 20:57
…pills

The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:

                              before        after
  dynamic LDL+STL (spills)    409,104       0
  static spill sites          111           0
  kernel duration             57.8 us       52.6 us
  tensor-pipe active          71.6 %        81.2 %
  do_bench TFLOP/s            1915-1942     2100-2120

The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The reference CUTLASS kernel
(epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same
whole-tile copy only because its epilogue is a lean helper; in Helion's
fatter persistent-loop context the copy is what overflows the register
budget.

The fix removes the hoist machinery entirely: the rowvec scale falls back
to the generic per-subtile ttr_aux_subtile.load() placed after the
c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69
placement), which spills zero at every measured bk. At bk=64 - the only
configuration where the hoist had been measured as a win - removal is
perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference
at both bk=64 and bk=128.

With this fix the fused-scale epilogue is free again relative to the plain
fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's
behavior (2334 vs 2345).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2754, branch: yushangdi/stack/34
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch from c18f8b4 to d92ad46 Compare June 11, 2026 21:00
Every Helion kernel call recomputes the bound-kernel cache key, which
includes the device's compute capability via target_device_capability().
That function called torch.cuda.is_available() plus
torch.cuda.get_device_capability() on every single kernel launch,
costing ~2.8us/call of pure Python/CUDA-runtime overhead -- the single
largest line item in Helion-side dispatch.

A physical device's compute capability cannot change within a process,
so the lookup is memoized per device index, following the existing
is_hip / _is_hip split in this file: the public
target_device_capability is a thin wrapper, and the @functools.cache
lives on a private _target_device_capability(index) helper. A concrete
device index goes straight to the cache, so the steady-state hot path
makes no torch.cuda calls at all. device=None means the *current*
device (which moves with torch.cuda.set_device); it is resolved to a
concrete index per call before the cached lookup, so it is never frozen
under a single key. The availability check lives inside the cached
helper (like _is_hip), so an unavailable runtime caches None and a warm
device skips the ~2us is_available() probe entirely.

Tests that simulate other architectures now patch the helion seam
(target_device_capability / its get_target_device_capability alias in
config_spec) instead of torch.cuda.get_device_capability, exactly as
tests mock is_hip rather than the cached _is_hip. Two such tests are
updated: test_config_api's sm90-vs-sm100 cache-key test and the direct
CuTe arch-gate test.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args |   before |    after |  delta
  -------+----------+----------+-------
       2 | 24.14 us | 18.78 us |   -22%
       8 | 33.05 us | 27.77 us |   -16%
      16 | 43.56 us | 36.90 us |   -15%
```

Benchmark script (used for this and the follow-up launch-overhead
commits; wall-clock around the Python call deliberately captures
CPU-side dispatch cost, which CUDA-event timing excludes -- the kernel
is tiny, so the loop is CPU-bound and measures per-call issue cost):

```python
import os, sys
os.environ.setdefault("HELION_AUTOTUNE_EFFORT", "none")
import time, torch, helion, helion.language as hl

# Write kernels to a real module file: helion needs
# inspect.getsource, so exec()'d kernels don't work.
mod_src = ["import torch, helion, helion.language as hl\n"]
for n_args in (2, 8, 16):
    src_args = ", ".join(f"t{i}" for i in range(n_args))
    mod_src.append(f'''
@helion.kernel(config=helion.Config(block_sizes=[128], num_warps=4, num_stages=2))
def k{n_args}({src_args}):
    out = torch.empty_like(t0)
    for tile in hl.tile(t0.size(0)):
        out[tile] = {' + '.join(f't{i}[tile]' for i in range(n_args))}
    return out
''')
open("/tmp/bench_kernels.py", "w").write("\n".join(mod_src))
sys.path.insert(0, "/tmp")
import bench_kernels

def bench(n_args, n=20000):
    k = getattr(bench_kernels, f"k{n_args}")
    ts = [torch.randn(4096, device="cuda") for _ in range(n_args)]
    for _ in range(50):  # warmup: compile + prime launcher
        k(*ts)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        k(*ts)
    dt = (time.perf_counter() - t0) / n * 1e6
    torch.cuda.synchronize()
    return dt

for n_args in (2, 8, 16):
    print(f"n_args={n_args:3d}: {bench(n_args):6.2f} us/call")
```

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2746, branch: yushangdi/stack/28
Every Helion kernel call recomputes the bound-kernel cache key, which
includes the device's compute capability via target_device_capability().
That function called torch.cuda.is_available() plus
torch.cuda.get_device_capability() on every single kernel launch,
costing ~2.8us/call of pure Python/CUDA-runtime overhead -- the single
largest line item in Helion-side dispatch.

A physical device's compute capability cannot change within a process,
so the lookup is memoized per device index, following the existing
is_hip / _is_hip split in this file: the public
target_device_capability is a thin wrapper, and the @functools.cache
lives on a private _target_device_capability(index) helper. A concrete
device index goes straight to the cache, so the steady-state hot path
makes no torch.cuda calls at all. device=None means the *current*
device (which moves with torch.cuda.set_device); it is resolved to a
concrete index per call before the cached lookup, so it is never frozen
under a single key. The availability check lives inside the cached
helper (like _is_hip), so an unavailable runtime caches None and a warm
device skips the ~2us is_available() probe entirely.

Tests that simulate other architectures now patch the helion seam
(target_device_capability / its get_target_device_capability alias in
config_spec) instead of torch.cuda.get_device_capability, exactly as
tests mock is_hip rather than the cached _is_hip. Two such tests are
updated: test_config_api's sm90-vs-sm100 cache-key test and the direct
CuTe arch-gate test.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args |   before |    after |  delta
  -------+----------+----------+-------
       2 | 24.14 us | 18.78 us |   -22%
       8 | 33.05 us | 27.77 us |   -16%
      16 | 43.56 us | 36.90 us |   -15%
```

Benchmark script (used for this and the follow-up launch-overhead
commits; wall-clock around the Python call deliberately captures
CPU-side dispatch cost, which CUDA-event timing excludes -- the kernel
is tiny, so the loop is CPU-bound and measures per-call issue cost):

```python
import os, sys
os.environ.setdefault("HELION_AUTOTUNE_EFFORT", "none")
import time, torch, helion, helion.language as hl

mod_src = ["import torch, helion, helion.language as hl\n"]
for n_args in (2, 8, 16):
    src_args = ", ".join(f"t{i}" for i in range(n_args))
    mod_src.append(f'''
@helion.kernel(config=helion.Config(block_sizes=[128], num_warps=4, num_stages=2))
def k{n_args}({src_args}):
    out = torch.empty_like(t0)
    for tile in hl.tile(t0.size(0)):
        out[tile] = {' + '.join(f't{i}[tile]' for i in range(n_args))}
    return out
''')
open("/tmp/bench_kernels.py", "w").write("\n".join(mod_src))
sys.path.insert(0, "/tmp")
import bench_kernels

def bench(n_args, n=20000):
    k = getattr(bench_kernels, f"k{n_args}")
    ts = [torch.randn(4096, device="cuda") for _ in range(n_args)]
    for _ in range(50):  # warmup: compile + prime launcher
        k(*ts)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        k(*ts)
    dt = (time.perf_counter() - t0) / n * 1e6
    torch.cuda.synchronize()
    return dt

for n_args in (2, 8, 16):
    print(f"n_args={n_args:3d}: {bench(n_args):6.2f} us/call")
```

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2746, branch: yushangdi/stack/28
Kernel.bind runs on every kernel call and the cache *hit* is the
steady state, so the per-call lookup key should be as cheap to build
as possible.

_get_bound_kernel_cache_key constructed a frozen-dataclass
BoundKernelInMemoryCacheKey on every call: a lazy
`from ..autotuner.base_cache import ...` import, a dataclass __init__,
two frozen-field object.__setattr__ overrides, and a generated
__hash__ that re-walks the fields. The in-memory _bound_kernels dict
only needs *some* hashable key, so the per-call path now uses the
equivalent plain (signature, extra_results) tuple. In isolation this
drops _get_bound_kernel_cache_key from ~0.93us to ~0.22us per call.

The dataclass form is still produced by _create_bound_kernel_cache_key
for the autotuner caches (LocalAutotuneCache / AOTAutotuneCache
subclass it into LooseAutotuneCacheKey); only the in-memory dict
switches to tuple keys. On the compile (cache-miss) path the dataclass
key is built once and unpacked into its (specialization_key,
extra_results) tuple form so the in-memory dict and the autotuner
caches stay keyed on the same value.

Also drops one extra-results tuple allocation when a kernel has no
hl.specialize() extras (the common case): the empty extra_fns list
short-circuits to a shared () literal instead of tuple([]).

Safety: cache-key *contents* are unchanged -- same signature tuple,
same extra results, same specialization axes (dtype, shape bucket,
device type+capability, ConstExpr values, key= fn, hl.specialize
extras); the tuple is just the dataclass's two fields in order, with
identical hash/equality. Verified by test_misc, test_config_api,
test_cache, test_specialize (132 passed), plus
dtype/shape/ConstExpr/specialize rebinding spot-checks.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args | baseline | prev commit | this commit
  -------+----------+-------------+------------
       2 | 24.14 us |    18.78 us |    16.41 us
       8 | 33.05 us |    27.77 us |    24.78 us
      16 | 43.56 us |    36.90 us |    35.24 us
```

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2747, branch: yushangdi/stack/29
measure("Kernel.bind") wraps the whole body of Kernel.bind, which runs
on every kernel call. Even when HELION_MEASURE_COMPILE_TIME is unset
(the default) and measure() returns a shared no-op nullcontext,
entering and exiting the `with` block still costs ~115ns/call -- the
context-manager protocol itself, not the measurement -- on the
steady-state cache-hit dispatch path.

Gate it: bind() checks _compile_time.is_enabled() (a plain module-flag
read, ~5ns) and only enters the `with measure(...)` block when
measurement is actually on; otherwise it calls the extracted
_bind_impl directly. The measured region is unchanged -- when enabled,
the entire bind body (key computation, cache lookup, and any compile)
runs under the same "Kernel.bind" scope as before, and every bind()
call is still counted, including cache hits. Only the
nothing-to-measure fast path skips the context manager.

No behavior change: with HELION_MEASURE_COMPILE_TIME=1 the timing
report is identical to before (same scope, same call counts); with it
unset there was never any data recorded, only the protocol overhead
that is now avoided. Verified by test_compile_time (metric still
records when enabled), plus test_misc / test_config_api / test_cache /
test_specialize.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args | baseline | prev commit | this commit
  -------+----------+-------------+------------
       2 | 24.14 us |    18.04 us |    18.01 us
       8 | 33.05 us |    26.11 us |    25.57 us
      16 | 43.56 us |    36.53 us |    36.41 us
```

(The per-call win is ~0.4us isolated to bind(); at the end-to-end
scale here it is within run-to-run noise.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2752, branch: yushangdi/stack/33
_tensor_key wraps every size and stride element in _hashable_dim to
normalize torch.SymInt into a hashable (shape_env_id, expr) pair. That
wrap exists only for symbolic shapes, which appear on FakeTensors
during tracing -- yet concrete tensors paid for it on every kernel
call: two Python-level loops (sizes + strides) with an isinstance
check per dimension, plus rebuilding each as a fresh tuple. For the
default static_shapes=True path this was the bulk of per-tensor key
extraction (~0.6us each, x number of tensor args, every call).

Specialization-extractor dispatch is by exact type
(_specialization_extractors.get(type(obj))), so the torch.Tensor and
torch.nn.Parameter entries are only ever hit by objects whose
sizes/strides are guaranteed concrete ints. Route those to a new
_concrete_tensor_key that uses obj.size() (a torch.Size) and
obj.stride() directly as key components. Both are tuple subclasses:
they hash and compare identically to plain int tuples, so keys
produced by the fast path and the SymInt path for the same concrete
shape are interchangeable -- no cache invalidation, and the
specialization axes (dtype, shape, stride, _dynamo_static_indices,
int32/int64 index width, dynamic-shape buckets) are bit-identical.

Anything that can carry SymInts keeps the safe path:
 * FakeTensor has its own dispatch entry -> _tensor_key (unchanged)
 * torch.Tensor *subclasses* (e.g. the JAX-export adapter) miss the
   exact-type dict and hit the isinstance fallback in
   _specialization_key, which now routes to _tensor_key explicitly

(This is the same optimization as the Python-only part of upstream
PR #2611, independently arrived at from profiling.)

New test/test_tensor_key_fast_path.py pins down the key equivalence
(fast vs wrapped key hash/compare equal under static and dynamic
shapes, incl. strided tensors), the dispatch routing (concrete ->
_concrete_tensor_key; FakeTensor and the subclass fallback ->
_tensor_key), that a torch.Tensor subclass takes the SymInt-safe path
and still shares a BoundKernel with a plain tensor of the same shape,
and that bind() caching/distinguishing is unchanged. Also verified
against test_misc, test_specialize, test_torch_compile (the
torch.compile suite exercises the FakeTensor/SymInt path).

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args | baseline | prev commit | this commit
  -------+----------+-------------+------------
       2 | 24.14 us |    18.01 us |    17.25 us
       8 | 33.05 us |    25.57 us |    24.50 us
      16 | 43.56 us |    36.41 us |    32.16 us
```

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2748, branch: yushangdi/stack/30
Every Helion kernel launch went through Triton's full JITFunction.run
pipeline (~9.3us): per-call device + stream proxy resolution, the
argument binder, compute_cache_key, kernel-cache lookup, the
used_global_vals walk, launch_metadata construction (even with no
profiler attached), and the kwargs-dict munging around all of it. For
a Helion kernel almost all of that is redundant: BoundKernel has
already specialized on dtype/shape/stride/device, so the only
Triton-level specialization left at launch time is pointer alignment
and binary-affecting knob state.

This ports the _FastLauncher design from upstream PR #2565 (plus the
set_config function-clone fix from PR #2635) onto main:

 * helion/runtime/_fast_launcher.py -- default_launcher moves here
   unchanged (still re-exported from helion.runtime), joined by
   _FastLauncher: a multi-spec launcher primed on first call. The hot
   path computes a tiny spec key inline -- an alignment bitmask over
   the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/
   stages-hook knob state -- dict-looks-up the compiled binary for
   that spec, and jumps straight into Triton's C launcher
   (CompiledKernel.run). Spec misses compile through Triton's full
   pipeline once and are cached, so call sites alternating aligned/
   unaligned tensors stay on the fast path for both.

 * BoundKernel.set_config clones the PyCodeCache'd host function
   (PyCodeCache keys on source hash, so two BoundKernels can share one
   function object) and re-points its _launcher kwdefault at a
   _FastLauncher with the config's num_warps/num_stages/etc. baked in.
   Explicit _launcher= callers (the autotune trial harness) override
   the kwdefault naturally.

 * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg
   values out of launcher_keyword_args so codegen strings and the
   launcher closure share one source of truth.

Safety / correctness guards, each pinned by a test in
test/test_fast_launcher.py:
 * Alignment is part of the spec key, so an unaligned tensor after an
   aligned prime gets its own correctly-compiled binary -- never the
   vectorized aligned binary (which would fault), and never a clone
   (which would silently drop writes to output args).
 * used_global_vals snapshot per spec entry; any mutation falls back
   to JITFunction.run so Triton's own RuntimeError surfaces instead of
   silently launching a stale binary.
 * torch.compile tracing routes through default_launcher so Dynamo's
   triton_kernel_wrapper_mutation HOP rules apply.
 * Multi-device guard: a current-device change after priming falls
   back to Triton's per-device dispatch.
 * launch_enter/exit hooks are re-read per call (a profiler attached
   after priming still fires; launch_metadata is built only when a
   hook will consume it); pre_run_hooks fire inline; flipping
   knobs.runtime.debug lands on a new spec entry and recompiles.
 * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall
   back to default_launcher permanently.

Verified: test_fast_launcher + test_misc (51 passed), and full runs of
test_torch_compile (244), test_examples (96), test_autotuner (122),
test_indexing, test_loops, test_grid, test_ref_eager, test_specialize,
test_config_api, test_cache earlier on this branch.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args | baseline | prev commit | this commit | total
  -------+----------+-------------+-------------+------
       2 | 24.14 us |    17.25 us |    13.63 us |  -44%
       8 | 33.05 us |    24.50 us |    18.99 us |  -43%
      16 | 43.56 us |    32.16 us |    25.73 us |  -41%
```

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2749, branch: yushangdi/stack/31
Every Helion kernel call recomputes the bound-kernel cache key, which
includes the device's compute capability via target_device_capability().
That function called torch.cuda.is_available() plus
torch.cuda.get_device_capability() on every single kernel launch,
costing ~2.8us/call of pure Python/CUDA-runtime overhead -- the single
largest line item in Helion-side dispatch.

A physical device's compute capability cannot change within a process,
so cache it per device index in a module-level dict, and cache the
is_available() result once it returns True (an unavailable runtime has
no GPU launches to speed up, so False keeps re-querying).

Safety: tests simulate other architectures by patching
torch.cuda.get_device_capability / torch.cuda.is_available (e.g.
test_config_api.py's sm90-vs-sm100 cache-key test). The fast path
engages only while both functions are identity-equal to the originals
captured at import time; patched functions route to the original
uncached path, so the cache can never be poisoned by, nor serve stale
values to, arch-simulation tests.

An index-less torch.device("cuda") refers to the *current* device,
which can change between calls -- it is resolved per call via
torch.cuda.current_device() (cheap) before hitting the per-index cache.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args |   before |    after |  delta
  -------+----------+----------+-------
       2 | 24.14 us | 18.78 us |   -22%
       8 | 33.05 us | 27.77 us |   -16%
      16 | 43.56 us | 36.90 us |   -15%
```

Benchmark script (used for this and the follow-up launch-overhead
commits; wall-clock around the Python call deliberately captures
CPU-side dispatch cost, which CUDA-event timing excludes -- the kernel
is tiny, so the loop is CPU-bound and measures per-call issue cost):

```python
import os, sys
os.environ.setdefault("HELION_AUTOTUNE_EFFORT", "none")
import time, torch, helion, helion.language as hl

mod_src = ["import torch, helion, helion.language as hl\n"]
for n_args in (2, 8, 16):
    src_args = ", ".join(f"t{i}" for i in range(n_args))
    mod_src.append(f'''
@helion.kernel(config=helion.Config(block_sizes=[128], num_warps=4, num_stages=2))
def k{n_args}({src_args}):
    out = torch.empty_like(t0)
    for tile in hl.tile(t0.size(0)):
        out[tile] = {' + '.join(f't{i}[tile]' for i in range(n_args))}
    return out
''')
open("/tmp/bench_kernels.py", "w").write("\n".join(mod_src))
sys.path.insert(0, "/tmp")
import bench_kernels

def bench(n_args, n=20000):
    k = getattr(bench_kernels, f"k{n_args}")
    ts = [torch.randn(4096, device="cuda") for _ in range(n_args)]
    for _ in range(50):  # warmup: compile + prime launcher
        k(*ts)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        k(*ts)
    dt = (time.perf_counter() - t0) / n * 1e6
    torch.cuda.synchronize()
    return dt

for n_args in (2, 8, 16):
    print(f"n_args={n_args:3d}: {bench(n_args):6.2f} us/call")
```

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2746, branch: yushangdi/stack/28
yushangdi and others added 4 commits June 11, 2026 21:10
measure() wrapped the entire body of Kernel.bind, which runs on every
kernel call. When HELION_MEASURE_COMPILE_TIME is unset (the default)
measure() returns a shared no-op context manager, but entering and
exiting it still costs ~0.4-0.5us/call: two extra frames plus a module
global read, on the steady-state cache-hit path.

Compile-time tracking only has meaningful data on the cache-*miss*
branch, where bind actually compiles and specializes a new
BoundKernel. The hit path is a specialization-key computation plus a
dict lookup -- nanoseconds of work that the tracker was never meant to
account for. Move the `with measure("Kernel.bind"):` block to wrap
only the miss branch; the hit path returns directly with no context
manager.

No behavior change: the same code runs under the same measurement
scope on the miss path, and HELION_MEASURE_COMPILE_TIME=1 still
attributes all compile/specialize time to "Kernel.bind". Verified by
test_misc, test_config_api, test_cache, test_specialize.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args | baseline | prev commit | this commit
  -------+----------+-------------+------------
       2 | 24.14 us |    16.41 us |    16.30 us
       8 | 33.05 us |    24.78 us |    24.39 us
      16 | 43.56 us |    35.24 us |    34.84 us
```

(The per-call win is ~0.4us isolated to bind(); at the end-to-end
scale here it is partly absorbed into run-to-run noise.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2751, branch: yushangdi/stack/32
measure("Kernel.bind") wraps the whole body of Kernel.bind, which runs
on every kernel call. Even when HELION_MEASURE_COMPILE_TIME is unset
(the default) and measure() returns a shared no-op nullcontext,
entering and exiting the `with` block still costs ~115ns/call -- the
context-manager protocol itself, not the measurement -- on the
steady-state cache-hit dispatch path.

Gate it: bind() checks _compile_time.is_enabled() (a plain module-flag
read, ~5ns) and only enters the `with measure(...)` block when
measurement is actually on; otherwise it calls the extracted
_bind_impl directly. The measured region is unchanged -- when enabled,
the entire bind body (key computation, cache lookup, and any compile)
runs under the same "Kernel.bind" scope as before, and every bind()
call is still counted, including cache hits. Only the
nothing-to-measure fast path skips the context manager.

No behavior change: with HELION_MEASURE_COMPILE_TIME=1 the timing
report is identical to before (same scope, same call counts); with it
unset there was never any data recorded, only the protocol overhead
that is now avoided. Verified by test_compile_time (metric still
records when enabled), plus test_misc / test_config_api / test_cache /
test_specialize.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args | baseline | prev commit | this commit
  -------+----------+-------------+------------
       2 | 24.14 us |    18.04 us |    18.01 us
       8 | 33.05 us |    26.11 us |    25.57 us
      16 | 43.56 us |    36.53 us |    36.41 us
```

(The per-call win is ~0.4us isolated to bind(); at the end-to-end
scale here it is within run-to-run noise.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2752, branch: yushangdi/stack/33
fp8 scaled_mm feeds B column-major (K-contiguous) -- the layout
torch._scaled_mm requires and the standalone CUTLASS CuTe kernel uses
(b_major="k"). Helion previously gated the tcgen05 TMA path on
rhs.is_contiguous() (row-major only) and hardcoded B as MN-major, so a
column-major B operand was forced onto the slow non-TMA fallback
(~5 TFLOP/s), and even feeding row-major B used the slower MN-major MMA
layout.

This makes B's major mode layout-aware:
- cute_mma.py: classify a 2D operand as TMA-eligible when contiguous in
  EITHER axis (row -> MN-major, col -> K-major) instead of requiring
  row-major is_contiguous(); compute tcgen05_b_k_major from rhs stride and
  thread it through _make_tiled_mma_setup / _make_tcgen05_layout_plan_setup
  and onto the ab_tma wrapper plan.
- _tcgen05_tiled_mma_expr / strategies.tcgen05_smem_layout_expr: emit
  OperandMajorMode.K and make_smem_layout_b(is_k_major=True) for B when
  K-major (delegating to CuTe's helper, mirroring the K-major A path).
- runtime/__init__.py: host wrapper emits the K-major B tiled_mma +
  SMEM layout and marks the B TMA view leading_dim=1 (K axis) for K-major
  vs leading_dim=0 (N axis) for MN-major.
- test_cute_backend.py: test_matmul_mma_tcgen05_fp8_col_major_b.

All paths default to the prior MN-major behavior when B is row-major, so
existing golden/codegen tests are byte-identical (full cute suite: 93
passed). K-major B is correct (rel_err 0.0000 vs torch._scaled_mm).

Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096,
CtaGroup.TWO cluster_m=2 ab_stages=3 role-local, do_bench, 10s warmup):

  row-major B (MN-major, old forced layout) :  577 TFLOP/s   31% of aten
  col-major B (K-major, this change)        : 1143 TFLOP/s   62% of aten
  torch._scaled_mm                          : 1857 TFLOP/s

i.e. ~2x from using the tensor-core-native K-major B layout. The
remaining gap to torch._scaled_mm is deep AB staging (ab_stages>3), a
separate lever.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2740, branch: yushangdi/stack/25
…gen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile:

  * the fully unrolled ab_stages-deep warm-up prologue (8 guarded
    acquire/copy/commit blocks at ab=8, each with a re-computed
    per-stage "k_tile fits in K" predicate that is a compile-time truth
    under static shapes), and
  * the separate steady-state producer loop issuing TMA at
    k_tile + ab_stages with per-iteration full-tile/next-full-tile
    predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full, so consumer wait / release
run unconditionally under their two-CTA owner gates), and all per-K-iter
owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead
of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate -
the inline form cost an S2UR+IMAD pair per use inside the K loop and was
the single largest instruction-count delta vs the reference (+508k
dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                              before        after       standalone
  warp instructions / SMSP    5820          4711        4866
  static UTMALDG sites        24            2           2
  dynamic SYNCS (barriers)    533k          507k        470k
  kernel duration             52.2 us       51.9 us     48.0 us
  tensor-pipe active          81.6 %        81.5 %      84.2 %

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound. Combined with the epilogue spill fix, the fused
scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) vs the standalone's 48.7 us /
2317-2334, up from 74.2 us / 1509-1711 at the start of this
investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale.

Scope and fallbacks:
  * The unified producer requires static-full tiles + the role-local
    separate producer loop (the producer warp owns its K loop, so a
    blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail
    families and the bridge diagnostics that skip individual
    acquire/advance edges keep the unrolled-prologue form.
  * _build_kloop_pipeline_consumer_if/_release_if become list-returning
    _stmts builders so the ungated static-full forms can emit sibling
    statements without an artificial wrapper; single-statement shims
    keep the old call shape where the gated form is still used.
  * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
    structurally impossible in the unified form: the blocking acquire
    fills every stage in order. The regression test now pins the
    unified codegen markers and still verifies end-to-end correctness.

Test updates: three tests pinned old codegen shapes (per-stage prefetch
markers, inline cluster-leader predicates in the K loop, the
producer-if's full-tile gate) and now pin the unified/hoisted forms.
Full HELION_BACKEND=cute sweep: test_cute_lowerings.py +
test_cute_backend.py 479 passed, other cute test files 85 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

stack-info: PR: #2755, branch: yushangdi/stack/35
@yushangdi yushangdi force-pushed the yushangdi/stack/35 branch from 2237a51 to f407dea Compare June 11, 2026 21:11
@yushangdi yushangdi force-pushed the yushangdi/stack/34 branch 6 times, most recently from d853568 to 3541d8d Compare June 11, 2026 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant