[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05 by yushangdi · Pull Request #2755 · pytorch/helion

yushangdi · 2026-06-11T02:29:24Z

Stacked PRs:

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05

Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile:

the fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each with a re-computed
per-stage "k_tile fits in K" predicate that is a compile-time truth
under static shapes), and
the separate steady-state producer loop issuing TMA at
k_tile + ab_stages with per-iteration full-tile/next-full-tile
predicate recomputation.

The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full, so consumer wait / release
run unconditionally under their two-CTA owner gates), and all per-K-iter
owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead
of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate -
the inline form cost an S2UR+IMAD pair per use inside the K loop and was
the single largest instruction-count delta vs the reference (+508k
dynamic IMAD).

NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:

                          before        after       standalone

warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
kernel duration 52.2 us 51.9 us 48.0 us
tensor-pipe active 81.6 % 81.5 % 84.2 %

Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound. Combined with the epilogue spill fix, the fused
scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) vs the standalone's 48.7 us /
2317-2334, up from 74.2 us / 1509-1711 at the start of this
investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale.

Scope and fallbacks:

The unified producer requires static-full tiles + the role-local
separate producer loop (the producer warp owns its K loop, so a
blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail
families and the bridge diagnostics that skip individual
acquire/advance edges keep the unrolled-prologue form.
_build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims
keep the old call shape where the gated form is still used.
The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order. The regression test now pins the
unified codegen markers and still verifies end-to-end correctness.

Test updates: three tests pinned old codegen shapes (per-stage prefetch
markers, inline cluster-leader predicates in the K loop, the
producer-if's full-tile gate) and now pin the unified/hoisted forms.
Full HELION_BACKEND=cute sweep: test_cute_lowerings.py +
test_cute_backend.py 479 passed, other cute test files 85 passed.

Co-Authored-By: Claude Fable 5 noreply@anthropic.com

…gen05 Static-full role-local tcgen05 configs now emit the CUTLASS reference producer shape: ONE rolled K loop in the TMA-load warp whose blocking producer_acquire fills the AB ring - pipelining comes from the mbarrier handshake, not from codegen structure. This replaces, per work tile: * the fully unrolled ab_stages-deep warm-up prologue (8 guarded acquire/copy/commit blocks at ab=8, each with a re-computed per-stage "k_tile fits in K" predicate that is a compile-time truth under static shapes), and * the separate steady-state producer loop issuing TMA at k_tile + ab_stages with per-iteration full-tile/next-full-tile predicate recomputation. The MMA-exec K loop drops its per-iteration full-tile predicate the same way (static-full means every K tile is full, so consumer wait / release run unconditionally under their two-CTA owner gates), and all per-K-iter owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the inline form cost an S2UR+IMAD pair per use inside the K loop and was the single largest instruction-count delta vs the reference (+508k dynamic IMAD). NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200, block_sizes=[256,128,128], ab_stages=8, cluster_m=2: before after standalone warp instructions / SMSP 5820 4711 4866 static UTMALDG sites 24 2 2 dynamic SYNCS (barriers) 533k 507k 470k kernel duration 52.2 us 51.9 us 48.0 us tensor-pipe active 81.6 % 81.5 % 84.2 % Helion now executes FEWER warp-instructions per SMSP than the CUTLASS standalone; the remaining ~8% duration gap is no longer instruction-count-bound. Combined with the epilogue spill fix, the fused scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side (batched back-to-back: 2359-2381) vs the standalone's 48.7 us / 2317-2334, up from 74.2 us / 1509-1711 at the start of this investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale. Scope and fallbacks: * The unified producer requires static-full tiles + the role-local separate producer loop (the producer warp owns its K loop, so a blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail families and the bridge diagnostics that skip individual acquire/advance edges keep the unrolled-prologue form. * _build_kloop_pipeline_consumer_if/_release_if become list-returning _stmts builders so the ungated static-full forms can emit sibling statements without an artificial wrapper; single-statement shims keep the old call shape where the gated form is still used. * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is structurally impossible in the unified form: the blocking acquire fills every stage in order. The regression test now pins the unified codegen markers and still verifies end-to-end correctness. Test updates: three tests pinned old codegen shapes (per-stage prefetch markers, inline cluster-leader predicates in the K loop, the producer-if's full-tile gate) and now pin the unified/hoisted forms. Full HELION_BACKEND=cute sweep: test_cute_lowerings.py + test_cute_backend.py 479 passed, other cute test files 85 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2755, branch: yushangdi/stack/35

Backports ONLY the program_id.py change from #2696 (the fused-scale memory_ops.py/cute_fx_walk.py and the deep-AB-staging tcgen05_config.py pieces are intentionally omitted -- see below). Adds cutlass.Float8E4M3FN / Float8E5M2 to the persistent role-local pure-call whitelist in Tcgen05PersistentProgramIDs. Without this, the role-local CtaGroup.TWO (cluster_m=2) persistent scheduler refuses to emit for fp8 operands: full role-local body extraction fails and codegen raises "tcgen05 fully role-local codegen would discard observable shared statement(s)", so the autotuner falls back to a slow cluster_m=1 non-persistent kernel. This is the single change (on top of the fp8 enablement commit) that unblocks the fast persistent role-local path for fp8 -- verified by bisection: the enablement commit alone fails to compile the CtaGroup.TWO fp8 config; adding just these two whitelist lines makes it emit and run. Benchmark (B200, CUDA 13.2, fp8 e4m3 rowwise scaled_mm, row-major B, m=512 k=2048 n=2048, CtaGroup.TWO cluster_m=2 ab_stages=3 role-local config, triton do_bench, correctness rel_err<=0.002 vs torch._scaled_mm): without this change : config fails to compile (InternalError); autotuner falls back to ~5 TFLOP/s with this change : 38.5 us 111.7 TFLOP/s (role-local engaged) i.e. ~20x over the non-persistent fallback the enablement commit is otherwise stuck on. (For reference on the same shape: torch._scaled_mm ~376 TFLOP/s, standalone CUTLASS CuTe scaled_mm ~331 TFLOP/s -- closing that remaining gap needs K-major B support, which is separate.) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2739, branch: yushangdi/stack/24

fp8 scaled_mm feeds B column-major (K-contiguous) -- the layout torch._scaled_mm requires and the standalone CUTLASS CuTe kernel uses (b_major="k"). Helion previously gated the tcgen05 TMA path on rhs.is_contiguous() (row-major only) and hardcoded B as MN-major, so a column-major B operand was forced onto the slow non-TMA fallback (~5 TFLOP/s), and even feeding row-major B used the slower MN-major MMA layout. This makes B's major mode layout-aware: - cute_mma.py: classify a 2D operand as TMA-eligible when contiguous in EITHER axis (row -> MN-major, col -> K-major) instead of requiring row-major is_contiguous(); compute tcgen05_b_k_major from rhs stride and thread it through _make_tiled_mma_setup / _make_tcgen05_layout_plan_setup and onto the ab_tma wrapper plan. - _tcgen05_tiled_mma_expr / strategies.tcgen05_smem_layout_expr: emit OperandMajorMode.K and make_smem_layout_b(is_k_major=True) for B when K-major (delegating to CuTe's helper, mirroring the K-major A path). - runtime/__init__.py: host wrapper emits the K-major B tiled_mma + SMEM layout and marks the B TMA view leading_dim=1 (K axis) for K-major vs leading_dim=0 (N axis) for MN-major. - test_cute_backend.py: test_matmul_mma_tcgen05_fp8_col_major_b. All paths default to the prior MN-major behavior when B is row-major, so existing golden/codegen tests are byte-identical (full cute suite: 93 passed). K-major B is correct (rel_err 0.0000 vs torch._scaled_mm). Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, CtaGroup.TWO cluster_m=2 ab_stages=3 role-local, do_bench, 10s warmup): row-major B (MN-major, old forced layout) : 577 TFLOP/s 31% of aten col-major B (K-major, this change) : 1143 TFLOP/s 62% of aten torch._scaled_mm : 1857 TFLOP/s i.e. ~2x from using the tensor-core-native K-major B layout. The remaining gap to torch._scaled_mm is deep AB staging (ab_stages>3), a separate lever. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2740, branch: yushangdi/stack/25

With K-major B in place, the fp8 tcgen05 kernel was still capped at ab_stages=3 (the bf16-tuned limit), so its software pipeline was too shallow to hide K-loop TMA latency on compute-bound shapes (4096^3: 1143 TFLOP/s, 62% of torch._scaled_mm). 1-byte fp8 operands fit a much deeper AB pipeline than 2-byte bf16 in the same SMEM budget. Backports the tcgen05_config.py deep-staging logic from #2696: - max_ab_stages_that_fit(): largest ab_stages whose AB SMEM fits the per-CTA budget (mirrors CUTLASS _compute_stages). - _validate_target1_ab_stage_envelope(): admit ab_stages>3 for fp8 as long as the AB SMEM fits, instead of hard-failing at 3. - optional_fragments(): widen the ab_stages search cap to 12 for fp8 (1-byte operands) on BOTH the search and validation surfaces, so the autotuner can sample deep pipelines and a frozen deep-staged fp8 config also passes normalize(). Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, col-major B, CtaGroup.TWO cluster_m=2 role-local, do_bench, 10s warmup): ab_stages= 3 (prev cap) : 1143 TFLOP/s 62% of aten ab_stages= 6 : 1494 TFLOP/s 80% ab_stages= 8 : 1613 TFLOP/s 86% <- sweet spot ab_stages=10/12 : ~1600 TFLOP/s (SMEM-bound plateau) torch._scaled_mm : 1867 TFLOP/s So K-major B (prior commit) + deep AB staging together take fp8 scaled_mm from ~31% to ~86% of torch._scaled_mm on the 4096^3 compute-bound shape. Correctness rel_err 0.0000 throughout; full cute suite: 93 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2741, branch: yushangdi/stack/26

The fp8 scaled_mm epilogue loaded both rowwise scales per subtile AFTER the accumulator consumer_wait, exposing their GMEM latency. The standalone CUTLASS kernel (cute_scaled_mm) instead reads the whole rowvec scale into registers ONCE before the acc wait so its latency overlaps the MMA, and reads the per-row colvec scale as a single scalar. Backports the remaining fused-scale pieces from #2696: - memory_ops.py: register-hoist a rowvec epilogue aux (rowwise scale_b[n]) into a register tensor in per-tile setup (before the subtile loop / acc consumer_wait), fp8-gated; read per-subtile from registers instead of a fresh GMEM load. - cute_fx_walk.py: classify a stride-(1,0) (M,N) aux as a per-row column vector ("broadcast", 2) so scale_a is read as a scalar per subtile (tTR_gAux[(0,0,0,s)]) instead of a redundant N-wide vector load. Generated epilogue now matches the standalone: autovec_copy hoist before the acc consumer_wait, scalar colvec read. Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, col-major B, ab_stages=8, do_bench, 10s warmup): before (deep staging only) : 1616 TFLOP/s 86% of aten after (this change) : 1650 TFLOP/s 88% of aten torch._scaled_mm : 1878 TFLOP/s Full cute suite: 93 passed; rel_err 0.0000. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2742, branch: yushangdi/stack/27

…pills The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile fragment (all subtiles' fp32 values per thread) into a register tensor in the per-output-tile setup, keeping it live across the whole subtile loop on top of the accumulator fragments and TMA-store state. At bk=128 (the CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the epilogue warps to the 255-register cap and into local memory. NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200, block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local: before after dynamic LDL+STL (spills) 409,104 0 static spill sites 111 0 kernel duration 57.8 us 52.6 us tensor-pipe active 71.6 % 81.2 % do_bench TFLOP/s 1915-1942 2100-2120 The top stall lines were all spill traffic in the epilogue: the acc TMEM read (1148 long-scoreboard samples), the whole-tile autovec_copy itself (238), and the acc*sa*sb chain (263). The reference CUTLASS kernel (epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same whole-tile copy only because its epilogue is a lean helper; in Helion's fatter persistent-loop context the copy is what overflows the register budget. The fix removes the hoist machinery entirely: the rowvec scale falls back to the generic per-subtile ttr_aux_subtile.load() placed after the c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69 placement), which spills zero at every measured bk. At bk=64 - the only configuration where the hoist had been measured as a win - removal is perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference at both bk=64 and bk=128. With this fix the fused-scale epilogue is free again relative to the plain fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's behavior (2334 vs 2345). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2754, branch: yushangdi/stack/34

Every Helion kernel call recomputes the bound-kernel cache key, which includes the device's compute capability via target_device_capability(). That function called torch.cuda.is_available() plus torch.cuda.get_device_capability() on every single kernel launch, costing ~2.8us/call of pure Python/CUDA-runtime overhead -- the single largest line item in Helion-side dispatch. A physical device's compute capability cannot change within a process, so the lookup is memoized per device index, following the existing is_hip / _is_hip split in this file: the public target_device_capability is a thin wrapper, and the @functools.cache lives on a private _target_device_capability(index) helper. A concrete device index goes straight to the cache, so the steady-state hot path makes no torch.cuda calls at all. device=None means the *current* device (which moves with torch.cuda.set_device); it is resolved to a concrete index per call before the cached lookup, so it is never frozen under a single key. The availability check lives inside the cached helper (like _is_hip), so an unavailable runtime caches None and a warm device skips the ~2us is_available() probe entirely. Tests that simulate other architectures now patch the helion seam (target_device_capability / its get_target_device_capability alias in config_spec) instead of torch.cuda.get_device_capability, exactly as tests mock is_hip rather than the cached _is_hip. Two such tests are updated: test_config_api's sm90-vs-sm100 cache-key test and the direct CuTe arch-gate test. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | before | after | delta -------+----------+----------+------- 2 | 24.14 us | 18.78 us | -22% 8 | 33.05 us | 27.77 us | -16% 16 | 43.56 us | 36.90 us | -15% ``` Benchmark script (used for this and the follow-up launch-overhead commits; wall-clock around the Python call deliberately captures CPU-side dispatch cost, which CUDA-event timing excludes -- the kernel is tiny, so the loop is CPU-bound and measures per-call issue cost): ```python import os, sys os.environ.setdefault("HELION_AUTOTUNE_EFFORT", "none") import time, torch, helion, helion.language as hl # Write kernels to a real module file: helion needs # inspect.getsource, so exec()'d kernels don't work. mod_src = ["import torch, helion, helion.language as hl\n"] for n_args in (2, 8, 16): src_args = ", ".join(f"t{i}" for i in range(n_args)) mod_src.append(f''' @helion.kernel(config=helion.Config(block_sizes=[128], num_warps=4, num_stages=2)) def k{n_args}({src_args}): out = torch.empty_like(t0) for tile in hl.tile(t0.size(0)): out[tile] = {' + '.join(f't{i}[tile]' for i in range(n_args))} return out ''') open("/tmp/bench_kernels.py", "w").write("\n".join(mod_src)) sys.path.insert(0, "/tmp") import bench_kernels def bench(n_args, n=20000): k = getattr(bench_kernels, f"k{n_args}") ts = [torch.randn(4096, device="cuda") for _ in range(n_args)] for _ in range(50): # warmup: compile + prime launcher k(*ts) torch.cuda.synchronize() t0 = time.perf_counter() for _ in range(n): k(*ts) dt = (time.perf_counter() - t0) / n * 1e6 torch.cuda.synchronize() return dt for n_args in (2, 8, 16): print(f"n_args={n_args:3d}: {bench(n_args):6.2f} us/call") ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2746, branch: yushangdi/stack/28

Every Helion kernel call recomputes the bound-kernel cache key, which includes the device's compute capability via target_device_capability(). That function called torch.cuda.is_available() plus torch.cuda.get_device_capability() on every single kernel launch, costing ~2.8us/call of pure Python/CUDA-runtime overhead -- the single largest line item in Helion-side dispatch. A physical device's compute capability cannot change within a process, so the lookup is memoized per device index, following the existing is_hip / _is_hip split in this file: the public target_device_capability is a thin wrapper, and the @functools.cache lives on a private _target_device_capability(index) helper. A concrete device index goes straight to the cache, so the steady-state hot path makes no torch.cuda calls at all. device=None means the *current* device (which moves with torch.cuda.set_device); it is resolved to a concrete index per call before the cached lookup, so it is never frozen under a single key. The availability check lives inside the cached helper (like _is_hip), so an unavailable runtime caches None and a warm device skips the ~2us is_available() probe entirely. Tests that simulate other architectures now patch the helion seam (target_device_capability / its get_target_device_capability alias in config_spec) instead of torch.cuda.get_device_capability, exactly as tests mock is_hip rather than the cached _is_hip. Two such tests are updated: test_config_api's sm90-vs-sm100 cache-key test and the direct CuTe arch-gate test. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | before | after | delta -------+----------+----------+------- 2 | 24.14 us | 18.78 us | -22% 8 | 33.05 us | 27.77 us | -16% 16 | 43.56 us | 36.90 us | -15% ``` Benchmark script (used for this and the follow-up launch-overhead commits; wall-clock around the Python call deliberately captures CPU-side dispatch cost, which CUDA-event timing excludes -- the kernel is tiny, so the loop is CPU-bound and measures per-call issue cost): ```python import os, sys os.environ.setdefault("HELION_AUTOTUNE_EFFORT", "none") import time, torch, helion, helion.language as hl mod_src = ["import torch, helion, helion.language as hl\n"] for n_args in (2, 8, 16): src_args = ", ".join(f"t{i}" for i in range(n_args)) mod_src.append(f''' @helion.kernel(config=helion.Config(block_sizes=[128], num_warps=4, num_stages=2)) def k{n_args}({src_args}): out = torch.empty_like(t0) for tile in hl.tile(t0.size(0)): out[tile] = {' + '.join(f't{i}[tile]' for i in range(n_args))} return out ''') open("/tmp/bench_kernels.py", "w").write("\n".join(mod_src)) sys.path.insert(0, "/tmp") import bench_kernels def bench(n_args, n=20000): k = getattr(bench_kernels, f"k{n_args}") ts = [torch.randn(4096, device="cuda") for _ in range(n_args)] for _ in range(50): # warmup: compile + prime launcher k(*ts) torch.cuda.synchronize() t0 = time.perf_counter() for _ in range(n): k(*ts) dt = (time.perf_counter() - t0) / n * 1e6 torch.cuda.synchronize() return dt for n_args in (2, 8, 16): print(f"n_args={n_args:3d}: {bench(n_args):6.2f} us/call") ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2746, branch: yushangdi/stack/28

Kernel.bind runs on every kernel call and the cache *hit* is the steady state, so the per-call lookup key should be as cheap to build as possible. _get_bound_kernel_cache_key constructed a frozen-dataclass BoundKernelInMemoryCacheKey on every call: a lazy `from ..autotuner.base_cache import ...` import, a dataclass __init__, two frozen-field object.__setattr__ overrides, and a generated __hash__ that re-walks the fields. The in-memory _bound_kernels dict only needs *some* hashable key, so the per-call path now uses the equivalent plain (signature, extra_results) tuple. In isolation this drops _get_bound_kernel_cache_key from ~0.93us to ~0.22us per call. The dataclass form is still produced by _create_bound_kernel_cache_key for the autotuner caches (LocalAutotuneCache / AOTAutotuneCache subclass it into LooseAutotuneCacheKey); only the in-memory dict switches to tuple keys. On the compile (cache-miss) path the dataclass key is built once and unpacked into its (specialization_key, extra_results) tuple form so the in-memory dict and the autotuner caches stay keyed on the same value. Also drops one extra-results tuple allocation when a kernel has no hl.specialize() extras (the common case): the empty extra_fns list short-circuits to a shared () literal instead of tuple([]). Safety: cache-key *contents* are unchanged -- same signature tuple, same extra results, same specialization axes (dtype, shape bucket, device type+capability, ConstExpr values, key= fn, hl.specialize extras); the tuple is just the dataclass's two fields in order, with identical hash/equality. Verified by test_misc, test_config_api, test_cache, test_specialize (132 passed), plus dtype/shape/ConstExpr/specialize rebinding spot-checks. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit -------+----------+-------------+------------ 2 | 24.14 us | 18.78 us | 16.41 us 8 | 33.05 us | 27.77 us | 24.78 us 16 | 43.56 us | 36.90 us | 35.24 us ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2747, branch: yushangdi/stack/29

measure("Kernel.bind") wraps the whole body of Kernel.bind, which runs on every kernel call. Even when HELION_MEASURE_COMPILE_TIME is unset (the default) and measure() returns a shared no-op nullcontext, entering and exiting the `with` block still costs ~115ns/call -- the context-manager protocol itself, not the measurement -- on the steady-state cache-hit dispatch path. Gate it: bind() checks _compile_time.is_enabled() (a plain module-flag read, ~5ns) and only enters the `with measure(...)` block when measurement is actually on; otherwise it calls the extracted _bind_impl directly. The measured region is unchanged -- when enabled, the entire bind body (key computation, cache lookup, and any compile) runs under the same "Kernel.bind" scope as before, and every bind() call is still counted, including cache hits. Only the nothing-to-measure fast path skips the context manager. No behavior change: with HELION_MEASURE_COMPILE_TIME=1 the timing report is identical to before (same scope, same call counts); with it unset there was never any data recorded, only the protocol overhead that is now avoided. Verified by test_compile_time (metric still records when enabled), plus test_misc / test_config_api / test_cache / test_specialize. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit -------+----------+-------------+------------ 2 | 24.14 us | 18.04 us | 18.01 us 8 | 33.05 us | 26.11 us | 25.57 us 16 | 43.56 us | 36.53 us | 36.41 us ``` (The per-call win is ~0.4us isolated to bind(); at the end-to-end scale here it is within run-to-run noise.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2752, branch: yushangdi/stack/33

_tensor_key wraps every size and stride element in _hashable_dim to normalize torch.SymInt into a hashable (shape_env_id, expr) pair. That wrap exists only for symbolic shapes, which appear on FakeTensors during tracing -- yet concrete tensors paid for it on every kernel call: two Python-level loops (sizes + strides) with an isinstance check per dimension, plus rebuilding each as a fresh tuple. For the default static_shapes=True path this was the bulk of per-tensor key extraction (~0.6us each, x number of tensor args, every call). Specialization-extractor dispatch is by exact type (_specialization_extractors.get(type(obj))), so the torch.Tensor and torch.nn.Parameter entries are only ever hit by objects whose sizes/strides are guaranteed concrete ints. Route those to a new _concrete_tensor_key that uses obj.size() (a torch.Size) and obj.stride() directly as key components. Both are tuple subclasses: they hash and compare identically to plain int tuples, so keys produced by the fast path and the SymInt path for the same concrete shape are interchangeable -- no cache invalidation, and the specialization axes (dtype, shape, stride, _dynamo_static_indices, int32/int64 index width, dynamic-shape buckets) are bit-identical. Anything that can carry SymInts keeps the safe path: * FakeTensor has its own dispatch entry -> _tensor_key (unchanged) * torch.Tensor *subclasses* (e.g. the JAX-export adapter) miss the exact-type dict and hit the isinstance fallback in _specialization_key, which now routes to _tensor_key explicitly (This is the same optimization as the Python-only part of upstream PR #2611, independently arrived at from profiling.) New test/test_tensor_key_fast_path.py pins down the key equivalence (fast vs wrapped key hash/compare equal under static and dynamic shapes, incl. strided tensors), the dispatch routing (concrete -> _concrete_tensor_key; FakeTensor and the subclass fallback -> _tensor_key), that a torch.Tensor subclass takes the SymInt-safe path and still shares a BoundKernel with a plain tensor of the same shape, and that bind() caching/distinguishing is unchanged. Also verified against test_misc, test_specialize, test_torch_compile (the torch.compile suite exercises the FakeTensor/SymInt path). Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit -------+----------+-------------+------------ 2 | 24.14 us | 18.01 us | 17.25 us 8 | 33.05 us | 25.57 us | 24.50 us 16 | 43.56 us | 36.41 us | 32.16 us ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2748, branch: yushangdi/stack/30

Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31

Every Helion kernel call recomputes the bound-kernel cache key, which includes the device's compute capability via target_device_capability(). That function called torch.cuda.is_available() plus torch.cuda.get_device_capability() on every single kernel launch, costing ~2.8us/call of pure Python/CUDA-runtime overhead -- the single largest line item in Helion-side dispatch. A physical device's compute capability cannot change within a process, so cache it per device index in a module-level dict, and cache the is_available() result once it returns True (an unavailable runtime has no GPU launches to speed up, so False keeps re-querying). Safety: tests simulate other architectures by patching torch.cuda.get_device_capability / torch.cuda.is_available (e.g. test_config_api.py's sm90-vs-sm100 cache-key test). The fast path engages only while both functions are identity-equal to the originals captured at import time; patched functions route to the original uncached path, so the cache can never be poisoned by, nor serve stale values to, arch-simulation tests. An index-less torch.device("cuda") refers to the *current* device, which can change between calls -- it is resolved per call via torch.cuda.current_device() (cheap) before hitting the per-index cache. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | before | after | delta -------+----------+----------+------- 2 | 24.14 us | 18.78 us | -22% 8 | 33.05 us | 27.77 us | -16% 16 | 43.56 us | 36.90 us | -15% ``` Benchmark script (used for this and the follow-up launch-overhead commits; wall-clock around the Python call deliberately captures CPU-side dispatch cost, which CUDA-event timing excludes -- the kernel is tiny, so the loop is CPU-bound and measures per-call issue cost): ```python import os, sys os.environ.setdefault("HELION_AUTOTUNE_EFFORT", "none") import time, torch, helion, helion.language as hl mod_src = ["import torch, helion, helion.language as hl\n"] for n_args in (2, 8, 16): src_args = ", ".join(f"t{i}" for i in range(n_args)) mod_src.append(f''' @helion.kernel(config=helion.Config(block_sizes=[128], num_warps=4, num_stages=2)) def k{n_args}({src_args}): out = torch.empty_like(t0) for tile in hl.tile(t0.size(0)): out[tile] = {' + '.join(f't{i}[tile]' for i in range(n_args))} return out ''') open("/tmp/bench_kernels.py", "w").write("\n".join(mod_src)) sys.path.insert(0, "/tmp") import bench_kernels def bench(n_args, n=20000): k = getattr(bench_kernels, f"k{n_args}") ts = [torch.randn(4096, device="cuda") for _ in range(n_args)] for _ in range(50): # warmup: compile + prime launcher k(*ts) torch.cuda.synchronize() t0 = time.perf_counter() for _ in range(n): k(*ts) dt = (time.perf_counter() - t0) / n * 1e6 torch.cuda.synchronize() return dt for n_args in (2, 8, 16): print(f"n_args={n_args:3d}: {bench(n_args):6.2f} us/call") ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2746, branch: yushangdi/stack/28

measure() wrapped the entire body of Kernel.bind, which runs on every kernel call. When HELION_MEASURE_COMPILE_TIME is unset (the default) measure() returns a shared no-op context manager, but entering and exiting it still costs ~0.4-0.5us/call: two extra frames plus a module global read, on the steady-state cache-hit path. Compile-time tracking only has meaningful data on the cache-*miss* branch, where bind actually compiles and specializes a new BoundKernel. The hit path is a specialization-key computation plus a dict lookup -- nanoseconds of work that the tracker was never meant to account for. Move the `with measure("Kernel.bind"):` block to wrap only the miss branch; the hit path returns directly with no context manager. No behavior change: the same code runs under the same measurement scope on the miss path, and HELION_MEASURE_COMPILE_TIME=1 still attributes all compile/specialize time to "Kernel.bind". Verified by test_misc, test_config_api, test_cache, test_specialize. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit -------+----------+-------------+------------ 2 | 24.14 us | 16.41 us | 16.30 us 8 | 33.05 us | 24.78 us | 24.39 us 16 | 43.56 us | 35.24 us | 34.84 us ``` (The per-call win is ~0.4us isolated to bind(); at the end-to-end scale here it is partly absorbed into run-to-run noise.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2751, branch: yushangdi/stack/32

measure("Kernel.bind") wraps the whole body of Kernel.bind, which runs on every kernel call. Even when HELION_MEASURE_COMPILE_TIME is unset (the default) and measure() returns a shared no-op nullcontext, entering and exiting the `with` block still costs ~115ns/call -- the context-manager protocol itself, not the measurement -- on the steady-state cache-hit dispatch path. Gate it: bind() checks _compile_time.is_enabled() (a plain module-flag read, ~5ns) and only enters the `with measure(...)` block when measurement is actually on; otherwise it calls the extracted _bind_impl directly. The measured region is unchanged -- when enabled, the entire bind body (key computation, cache lookup, and any compile) runs under the same "Kernel.bind" scope as before, and every bind() call is still counted, including cache hits. Only the nothing-to-measure fast path skips the context manager. No behavior change: with HELION_MEASURE_COMPILE_TIME=1 the timing report is identical to before (same scope, same call counts); with it unset there was never any data recorded, only the protocol overhead that is now avoided. Verified by test_compile_time (metric still records when enabled), plus test_misc / test_config_api / test_cache / test_specialize. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit -------+----------+-------------+------------ 2 | 24.14 us | 18.04 us | 18.01 us 8 | 33.05 us | 26.11 us | 25.57 us 16 | 43.56 us | 36.53 us | 36.41 us ``` (The per-call win is ~0.4us isolated to bind(); at the end-to-end scale here it is within run-to-run noise.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2752, branch: yushangdi/stack/33

fp8 scaled_mm feeds B column-major (K-contiguous) -- the layout torch._scaled_mm requires and the standalone CUTLASS CuTe kernel uses (b_major="k"). Helion previously gated the tcgen05 TMA path on rhs.is_contiguous() (row-major only) and hardcoded B as MN-major, so a column-major B operand was forced onto the slow non-TMA fallback (~5 TFLOP/s), and even feeding row-major B used the slower MN-major MMA layout. This makes B's major mode layout-aware: - cute_mma.py: classify a 2D operand as TMA-eligible when contiguous in EITHER axis (row -> MN-major, col -> K-major) instead of requiring row-major is_contiguous(); compute tcgen05_b_k_major from rhs stride and thread it through _make_tiled_mma_setup / _make_tcgen05_layout_plan_setup and onto the ab_tma wrapper plan. - _tcgen05_tiled_mma_expr / strategies.tcgen05_smem_layout_expr: emit OperandMajorMode.K and make_smem_layout_b(is_k_major=True) for B when K-major (delegating to CuTe's helper, mirroring the K-major A path). - runtime/__init__.py: host wrapper emits the K-major B tiled_mma + SMEM layout and marks the B TMA view leading_dim=1 (K axis) for K-major vs leading_dim=0 (N axis) for MN-major. - test_cute_backend.py: test_matmul_mma_tcgen05_fp8_col_major_b. All paths default to the prior MN-major behavior when B is row-major, so existing golden/codegen tests are byte-identical (full cute suite: 93 passed). K-major B is correct (rel_err 0.0000 vs torch._scaled_mm). Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, CtaGroup.TWO cluster_m=2 ab_stages=3 role-local, do_bench, 10s warmup): row-major B (MN-major, old forced layout) : 577 TFLOP/s 31% of aten col-major B (K-major, this change) : 1143 TFLOP/s 62% of aten torch._scaled_mm : 1857 TFLOP/s i.e. ~2x from using the tensor-core-native K-major B layout. The remaining gap to torch._scaled_mm is deep AB staging (ab_stages>3), a separate lever. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2740, branch: yushangdi/stack/25

…gen05 Static-full role-local tcgen05 configs now emit the CUTLASS reference producer shape: ONE rolled K loop in the TMA-load warp whose blocking producer_acquire fills the AB ring - pipelining comes from the mbarrier handshake, not from codegen structure. This replaces, per work tile: * the fully unrolled ab_stages-deep warm-up prologue (8 guarded acquire/copy/commit blocks at ab=8, each with a re-computed per-stage "k_tile fits in K" predicate that is a compile-time truth under static shapes), and * the separate steady-state producer loop issuing TMA at k_tile + ab_stages with per-iteration full-tile/next-full-tile predicate recomputation. The MMA-exec K loop drops its per-iteration full-tile predicate the same way (static-full means every K tile is full, so consumer wait / release run unconditionally under their two-CTA owner gates), and all per-K-iter owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate - the inline form cost an S2UR+IMAD pair per use inside the K loop and was the single largest instruction-count delta vs the reference (+508k dynamic IMAD). NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200, block_sizes=[256,128,128], ab_stages=8, cluster_m=2: before after standalone warp instructions / SMSP 5820 4711 4866 static UTMALDG sites 24 2 2 dynamic SYNCS (barriers) 533k 507k 470k kernel duration 52.2 us 51.9 us 48.0 us tensor-pipe active 81.6 % 81.5 % 84.2 % Helion now executes FEWER warp-instructions per SMSP than the CUTLASS standalone; the remaining ~8% duration gap is no longer instruction-count-bound. Combined with the epilogue spill fix, the fused scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side (batched back-to-back: 2359-2381) vs the standalone's 48.7 us / 2317-2334, up from 74.2 us / 1509-1711 at the start of this investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale. Scope and fallbacks: * The unified producer requires static-full tiles + the role-local separate producer loop (the producer warp owns its K loop, so a blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail families and the bridge diagnostics that skip individual acquire/advance edges keep the unrolled-prologue form. * _build_kloop_pipeline_consumer_if/_release_if become list-returning _stmts builders so the ungated static-full forms can emit sibling statements without an artificial wrapper; single-statement shims keep the old call shape where the gated form is still used. * The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is structurally impossible in the unified form: the blocking acquire fills every stage in order. The regression test now pins the unified codegen markers and still verifies end-to-end correctness. Test updates: three tests pinned old codegen shapes (per-stage prefetch markers, inline cluster-leader predicates in the K loop, the producer-if's full-tile gate) and now pin the unified/hoisted forms. Full HELION_BACKEND=cute sweep: test_cute_lowerings.py + test_cute_backend.py 479 passed, other cute test files 85 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2755, branch: yushangdi/stack/35

yushangdi force-pushed the yushangdi/stack/34 branch from 6f5a27a to 9c7a9c0 Compare June 11, 2026 02:29

yushangdi force-pushed the yushangdi/stack/35 branch from 2bb4805 to 138744f Compare June 11, 2026 02:29

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 11, 2026

yushangdi changed the base branch from yushangdi/stack/34 to main June 11, 2026 02:35

yushangdi changed the base branch from main to yushangdi/stack/34 June 11, 2026 02:36

yushangdi mentioned this pull request Jun 11, 2026

[cute] Record measured-good B200 config for the fp8 scaled_mm example #2756

Draft

yushangdi force-pushed the yushangdi/stack/34 branch from 9c7a9c0 to 68f7f50 Compare June 11, 2026 17:10

yushangdi force-pushed the yushangdi/stack/35 branch from 138744f to 2237a51 Compare June 11, 2026 17:10

yushangdi added 4 commits June 11, 2026 20:21

yushangdi force-pushed the yushangdi/stack/34 branch from 68f7f50 to c18f8b4 Compare June 11, 2026 20:57

yushangdi force-pushed the yushangdi/stack/34 branch from c18f8b4 to d92ad46 Compare June 11, 2026 21:00

yushangdi added 7 commits June 11, 2026 21:10

yushangdi and others added 4 commits June 11, 2026 21:10

yushangdi force-pushed the yushangdi/stack/35 branch from 2237a51 to f407dea Compare June 11, 2026 21:11

yushangdi force-pushed the yushangdi/stack/34 branch 6 times, most recently from d853568 to 3541d8d Compare June 11, 2026 23:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2755

[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2755
yushangdi wants to merge 16 commits into
yushangdi/stack/34from
yushangdi/stack/35

yushangdi commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yushangdi commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!