[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2755
Draft
yushangdi wants to merge 16 commits into
Draft
[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05#2755yushangdi wants to merge 16 commits into
yushangdi wants to merge 16 commits into
Conversation
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
…gen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile:
* the fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each with a re-computed
per-stage "k_tile fits in K" predicate that is a compile-time truth
under static shapes), and
* the separate steady-state producer loop issuing TMA at
k_tile + ab_stages with per-iteration full-tile/next-full-tile
predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full, so consumer wait / release
run unconditionally under their two-CTA owner gates), and all per-K-iter
owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead
of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate -
the inline form cost an S2UR+IMAD pair per use inside the K loop and was
the single largest instruction-count delta vs the reference (+508k
dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
before after standalone
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
kernel duration 52.2 us 51.9 us 48.0 us
tensor-pipe active 81.6 % 81.5 % 84.2 %
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound. Combined with the epilogue spill fix, the fused
scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) vs the standalone's 48.7 us /
2317-2334, up from 74.2 us / 1509-1711 at the start of this
investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale.
Scope and fallbacks:
* The unified producer requires static-full tiles + the role-local
separate producer loop (the producer warp owns its K loop, so a
blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail
families and the bridge diagnostics that skip individual
acquire/advance edges keep the unrolled-prologue form.
* _build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims
keep the old call shape where the gated form is still used.
* The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order. The regression test now pins the
unified codegen markers and still verifies end-to-end correctness.
Test updates: three tests pinned old codegen shapes (per-stage prefetch
markers, inline cluster-leader predicates in the K loop, the
producer-if's full-tile gate) and now pin the unified/hoisted forms.
Full HELION_BACKEND=cute sweep: test_cute_lowerings.py +
test_cute_backend.py 479 passed, other cute test files 85 passed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2755, branch: yushangdi/stack/35
6f5a27a to
9c7a9c0
Compare
2bb4805 to
138744f
Compare
This was referenced Jun 11, 2026
9c7a9c0 to
68f7f50
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
…gen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile:
* the fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each with a re-computed
per-stage "k_tile fits in K" predicate that is a compile-time truth
under static shapes), and
* the separate steady-state producer loop issuing TMA at
k_tile + ab_stages with per-iteration full-tile/next-full-tile
predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full, so consumer wait / release
run unconditionally under their two-CTA owner gates), and all per-K-iter
owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead
of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate -
the inline form cost an S2UR+IMAD pair per use inside the K loop and was
the single largest instruction-count delta vs the reference (+508k
dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
before after standalone
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
kernel duration 52.2 us 51.9 us 48.0 us
tensor-pipe active 81.6 % 81.5 % 84.2 %
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound. Combined with the epilogue spill fix, the fused
scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) vs the standalone's 48.7 us /
2317-2334, up from 74.2 us / 1509-1711 at the start of this
investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale.
Scope and fallbacks:
* The unified producer requires static-full tiles + the role-local
separate producer loop (the producer warp owns its K loop, so a
blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail
families and the bridge diagnostics that skip individual
acquire/advance edges keep the unrolled-prologue form.
* _build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims
keep the old call shape where the gated form is still used.
* The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order. The regression test now pins the
unified codegen markers and still verifies end-to-end correctness.
Test updates: three tests pinned old codegen shapes (per-stage prefetch
markers, inline cluster-leader predicates in the K loop, the
producer-if's full-tile gate) and now pin the unified/hoisted forms.
Full HELION_BACKEND=cute sweep: test_cute_lowerings.py +
test_cute_backend.py 479 passed, other cute test files 85 passed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2755, branch: yushangdi/stack/35
138744f to
2237a51
Compare
Backports ONLY the program_id.py change from #2696 (the fused-scale memory_ops.py/cute_fx_walk.py and the deep-AB-staging tcgen05_config.py pieces are intentionally omitted -- see below). Adds cutlass.Float8E4M3FN / Float8E5M2 to the persistent role-local pure-call whitelist in Tcgen05PersistentProgramIDs. Without this, the role-local CtaGroup.TWO (cluster_m=2) persistent scheduler refuses to emit for fp8 operands: full role-local body extraction fails and codegen raises "tcgen05 fully role-local codegen would discard observable shared statement(s)", so the autotuner falls back to a slow cluster_m=1 non-persistent kernel. This is the single change (on top of the fp8 enablement commit) that unblocks the fast persistent role-local path for fp8 -- verified by bisection: the enablement commit alone fails to compile the CtaGroup.TWO fp8 config; adding just these two whitelist lines makes it emit and run. Benchmark (B200, CUDA 13.2, fp8 e4m3 rowwise scaled_mm, row-major B, m=512 k=2048 n=2048, CtaGroup.TWO cluster_m=2 ab_stages=3 role-local config, triton do_bench, correctness rel_err<=0.002 vs torch._scaled_mm): without this change : config fails to compile (InternalError); autotuner falls back to ~5 TFLOP/s with this change : 38.5 us 111.7 TFLOP/s (role-local engaged) i.e. ~20x over the non-persistent fallback the enablement commit is otherwise stuck on. (For reference on the same shape: torch._scaled_mm ~376 TFLOP/s, standalone CUTLASS CuTe scaled_mm ~331 TFLOP/s -- closing that remaining gap needs K-major B support, which is separate.) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2739, branch: yushangdi/stack/24
fp8 scaled_mm feeds B column-major (K-contiguous) -- the layout torch._scaled_mm requires and the standalone CUTLASS CuTe kernel uses (b_major="k"). Helion previously gated the tcgen05 TMA path on rhs.is_contiguous() (row-major only) and hardcoded B as MN-major, so a column-major B operand was forced onto the slow non-TMA fallback (~5 TFLOP/s), and even feeding row-major B used the slower MN-major MMA layout. This makes B's major mode layout-aware: - cute_mma.py: classify a 2D operand as TMA-eligible when contiguous in EITHER axis (row -> MN-major, col -> K-major) instead of requiring row-major is_contiguous(); compute tcgen05_b_k_major from rhs stride and thread it through _make_tiled_mma_setup / _make_tcgen05_layout_plan_setup and onto the ab_tma wrapper plan. - _tcgen05_tiled_mma_expr / strategies.tcgen05_smem_layout_expr: emit OperandMajorMode.K and make_smem_layout_b(is_k_major=True) for B when K-major (delegating to CuTe's helper, mirroring the K-major A path). - runtime/__init__.py: host wrapper emits the K-major B tiled_mma + SMEM layout and marks the B TMA view leading_dim=1 (K axis) for K-major vs leading_dim=0 (N axis) for MN-major. - test_cute_backend.py: test_matmul_mma_tcgen05_fp8_col_major_b. All paths default to the prior MN-major behavior when B is row-major, so existing golden/codegen tests are byte-identical (full cute suite: 93 passed). K-major B is correct (rel_err 0.0000 vs torch._scaled_mm). Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, CtaGroup.TWO cluster_m=2 ab_stages=3 role-local, do_bench, 10s warmup): row-major B (MN-major, old forced layout) : 577 TFLOP/s 31% of aten col-major B (K-major, this change) : 1143 TFLOP/s 62% of aten torch._scaled_mm : 1857 TFLOP/s i.e. ~2x from using the tensor-core-native K-major B layout. The remaining gap to torch._scaled_mm is deep AB staging (ab_stages>3), a separate lever. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2740, branch: yushangdi/stack/25
With K-major B in place, the fp8 tcgen05 kernel was still capped at ab_stages=3 (the bf16-tuned limit), so its software pipeline was too shallow to hide K-loop TMA latency on compute-bound shapes (4096^3: 1143 TFLOP/s, 62% of torch._scaled_mm). 1-byte fp8 operands fit a much deeper AB pipeline than 2-byte bf16 in the same SMEM budget. Backports the tcgen05_config.py deep-staging logic from #2696: - max_ab_stages_that_fit(): largest ab_stages whose AB SMEM fits the per-CTA budget (mirrors CUTLASS _compute_stages). - _validate_target1_ab_stage_envelope(): admit ab_stages>3 for fp8 as long as the AB SMEM fits, instead of hard-failing at 3. - optional_fragments(): widen the ab_stages search cap to 12 for fp8 (1-byte operands) on BOTH the search and validation surfaces, so the autotuner can sample deep pipelines and a frozen deep-staged fp8 config also passes normalize(). Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, col-major B, CtaGroup.TWO cluster_m=2 role-local, do_bench, 10s warmup): ab_stages= 3 (prev cap) : 1143 TFLOP/s 62% of aten ab_stages= 6 : 1494 TFLOP/s 80% ab_stages= 8 : 1613 TFLOP/s 86% <- sweet spot ab_stages=10/12 : ~1600 TFLOP/s (SMEM-bound plateau) torch._scaled_mm : 1867 TFLOP/s So K-major B (prior commit) + deep AB staging together take fp8 scaled_mm from ~31% to ~86% of torch._scaled_mm on the 4096^3 compute-bound shape. Correctness rel_err 0.0000 throughout; full cute suite: 93 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2741, branch: yushangdi/stack/26
The fp8 scaled_mm epilogue loaded both rowwise scales per subtile AFTER the accumulator consumer_wait, exposing their GMEM latency. The standalone CUTLASS kernel (cute_scaled_mm) instead reads the whole rowvec scale into registers ONCE before the acc wait so its latency overlaps the MMA, and reads the per-row colvec scale as a single scalar. Backports the remaining fused-scale pieces from #2696: - memory_ops.py: register-hoist a rowvec epilogue aux (rowwise scale_b[n]) into a register tensor in per-tile setup (before the subtile loop / acc consumer_wait), fp8-gated; read per-subtile from registers instead of a fresh GMEM load. - cute_fx_walk.py: classify a stride-(1,0) (M,N) aux as a per-row column vector ("broadcast", 2) so scale_a is read as a scalar per subtile (tTR_gAux[(0,0,0,s)]) instead of a redundant N-wide vector load. Generated epilogue now matches the standalone: autovec_copy hoist before the acc consumer_wait, scalar colvec read. Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, col-major B, ab_stages=8, do_bench, 10s warmup): before (deep staging only) : 1616 TFLOP/s 86% of aten after (this change) : 1650 TFLOP/s 88% of aten torch._scaled_mm : 1878 TFLOP/s Full cute suite: 93 passed; rel_err 0.0000. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2742, branch: yushangdi/stack/27
68f7f50 to
c18f8b4
Compare
…pills
The fp8-gated "rowvec register hoist" copied the ENTIRE column-scale tile
fragment (all subtiles' fp32 values per thread) into a register tensor in
the per-output-tile setup, keeping it live across the whole subtile loop on
top of the accumulator fragments and TMA-store state. At bk=128 (the
CUTLASS-matched fp8 K tile, 8 subtiles/tile) that working set pushed the
epilogue warps to the 255-register cap and into local memory.
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 rowwise scaled_mm, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2, persistent role-local:
before after
dynamic LDL+STL (spills) 409,104 0
static spill sites 111 0
kernel duration 57.8 us 52.6 us
tensor-pipe active 71.6 % 81.2 %
do_bench TFLOP/s 1915-1942 2100-2120
The top stall lines were all spill traffic in the epilogue: the acc TMEM
read (1148 long-scoreboard samples), the whole-tile autovec_copy itself
(238), and the acc*sa*sb chain (263). The reference CUTLASS kernel
(epilogue_tma_store_scaled, 176 regs, zero spills) gets away with the same
whole-tile copy only because its epilogue is a lean helper; in Helion's
fatter persistent-loop context the copy is what overflows the register
budget.
The fix removes the hoist machinery entirely: the rowvec scale falls back
to the generic per-subtile ttr_aux_subtile.load() placed after the
c_pipeline acquire / acc consumer_wait / T2R prefix (the cycle-69
placement), which spills zero at every measured bk. At bk=64 - the only
configuration where the hoist had been measured as a win - removal is
perf-neutral (1711 vs 1712 TFLOP/s). relerr=0.0000 vs the fp32 reference
at both bk=64 and bk=128.
With this fix the fused-scale epilogue is free again relative to the plain
fp8 matmul (2100 vs 2105 TFLOP/s), matching the standalone kernel's
behavior (2334 vs 2345).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2754, branch: yushangdi/stack/34
c18f8b4 to
d92ad46
Compare
Every Helion kernel call recomputes the bound-kernel cache key, which
includes the device's compute capability via target_device_capability().
That function called torch.cuda.is_available() plus
torch.cuda.get_device_capability() on every single kernel launch,
costing ~2.8us/call of pure Python/CUDA-runtime overhead -- the single
largest line item in Helion-side dispatch.
A physical device's compute capability cannot change within a process,
so the lookup is memoized per device index, following the existing
is_hip / _is_hip split in this file: the public
target_device_capability is a thin wrapper, and the @functools.cache
lives on a private _target_device_capability(index) helper. A concrete
device index goes straight to the cache, so the steady-state hot path
makes no torch.cuda calls at all. device=None means the *current*
device (which moves with torch.cuda.set_device); it is resolved to a
concrete index per call before the cached lookup, so it is never frozen
under a single key. The availability check lives inside the cached
helper (like _is_hip), so an unavailable runtime caches None and a warm
device skips the ~2us is_available() probe entirely.
Tests that simulate other architectures now patch the helion seam
(target_device_capability / its get_target_device_capability alias in
config_spec) instead of torch.cuda.get_device_capability, exactly as
tests mock is_hip rather than the cached _is_hip. Two such tests are
updated: test_config_api's sm90-vs-sm100 cache-key test and the direct
CuTe arch-gate test.
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
```
n_args | before | after | delta
-------+----------+----------+-------
2 | 24.14 us | 18.78 us | -22%
8 | 33.05 us | 27.77 us | -16%
16 | 43.56 us | 36.90 us | -15%
```
Benchmark script (used for this and the follow-up launch-overhead
commits; wall-clock around the Python call deliberately captures
CPU-side dispatch cost, which CUDA-event timing excludes -- the kernel
is tiny, so the loop is CPU-bound and measures per-call issue cost):
```python
import os, sys
os.environ.setdefault("HELION_AUTOTUNE_EFFORT", "none")
import time, torch, helion, helion.language as hl
# Write kernels to a real module file: helion needs
# inspect.getsource, so exec()'d kernels don't work.
mod_src = ["import torch, helion, helion.language as hl\n"]
for n_args in (2, 8, 16):
src_args = ", ".join(f"t{i}" for i in range(n_args))
mod_src.append(f'''
@helion.kernel(config=helion.Config(block_sizes=[128], num_warps=4, num_stages=2))
def k{n_args}({src_args}):
out = torch.empty_like(t0)
for tile in hl.tile(t0.size(0)):
out[tile] = {' + '.join(f't{i}[tile]' for i in range(n_args))}
return out
''')
open("/tmp/bench_kernels.py", "w").write("\n".join(mod_src))
sys.path.insert(0, "/tmp")
import bench_kernels
def bench(n_args, n=20000):
k = getattr(bench_kernels, f"k{n_args}")
ts = [torch.randn(4096, device="cuda") for _ in range(n_args)]
for _ in range(50): # warmup: compile + prime launcher
k(*ts)
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n):
k(*ts)
dt = (time.perf_counter() - t0) / n * 1e6
torch.cuda.synchronize()
return dt
for n_args in (2, 8, 16):
print(f"n_args={n_args:3d}: {bench(n_args):6.2f} us/call")
```
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stack-info: PR: #2746, branch: yushangdi/stack/28
Every Helion kernel call recomputes the bound-kernel cache key, which
includes the device's compute capability via target_device_capability().
That function called torch.cuda.is_available() plus
torch.cuda.get_device_capability() on every single kernel launch,
costing ~2.8us/call of pure Python/CUDA-runtime overhead -- the single
largest line item in Helion-side dispatch.
A physical device's compute capability cannot change within a process,
so the lookup is memoized per device index, following the existing
is_hip / _is_hip split in this file: the public
target_device_capability is a thin wrapper, and the @functools.cache
lives on a private _target_device_capability(index) helper. A concrete
device index goes straight to the cache, so the steady-state hot path
makes no torch.cuda calls at all. device=None means the *current*
device (which moves with torch.cuda.set_device); it is resolved to a
concrete index per call before the cached lookup, so it is never frozen
under a single key. The availability check lives inside the cached
helper (like _is_hip), so an unavailable runtime caches None and a warm
device skips the ~2us is_available() probe entirely.
Tests that simulate other architectures now patch the helion seam
(target_device_capability / its get_target_device_capability alias in
config_spec) instead of torch.cuda.get_device_capability, exactly as
tests mock is_hip rather than the cached _is_hip. Two such tests are
updated: test_config_api's sm90-vs-sm100 cache-key test and the direct
CuTe arch-gate test.
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
```
n_args | before | after | delta
-------+----------+----------+-------
2 | 24.14 us | 18.78 us | -22%
8 | 33.05 us | 27.77 us | -16%
16 | 43.56 us | 36.90 us | -15%
```
Benchmark script (used for this and the follow-up launch-overhead
commits; wall-clock around the Python call deliberately captures
CPU-side dispatch cost, which CUDA-event timing excludes -- the kernel
is tiny, so the loop is CPU-bound and measures per-call issue cost):
```python
import os, sys
os.environ.setdefault("HELION_AUTOTUNE_EFFORT", "none")
import time, torch, helion, helion.language as hl
mod_src = ["import torch, helion, helion.language as hl\n"]
for n_args in (2, 8, 16):
src_args = ", ".join(f"t{i}" for i in range(n_args))
mod_src.append(f'''
@helion.kernel(config=helion.Config(block_sizes=[128], num_warps=4, num_stages=2))
def k{n_args}({src_args}):
out = torch.empty_like(t0)
for tile in hl.tile(t0.size(0)):
out[tile] = {' + '.join(f't{i}[tile]' for i in range(n_args))}
return out
''')
open("/tmp/bench_kernels.py", "w").write("\n".join(mod_src))
sys.path.insert(0, "/tmp")
import bench_kernels
def bench(n_args, n=20000):
k = getattr(bench_kernels, f"k{n_args}")
ts = [torch.randn(4096, device="cuda") for _ in range(n_args)]
for _ in range(50): # warmup: compile + prime launcher
k(*ts)
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n):
k(*ts)
dt = (time.perf_counter() - t0) / n * 1e6
torch.cuda.synchronize()
return dt
for n_args in (2, 8, 16):
print(f"n_args={n_args:3d}: {bench(n_args):6.2f} us/call")
```
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stack-info: PR: #2746, branch: yushangdi/stack/28
Kernel.bind runs on every kernel call and the cache *hit* is the
steady state, so the per-call lookup key should be as cheap to build
as possible.
_get_bound_kernel_cache_key constructed a frozen-dataclass
BoundKernelInMemoryCacheKey on every call: a lazy
`from ..autotuner.base_cache import ...` import, a dataclass __init__,
two frozen-field object.__setattr__ overrides, and a generated
__hash__ that re-walks the fields. The in-memory _bound_kernels dict
only needs *some* hashable key, so the per-call path now uses the
equivalent plain (signature, extra_results) tuple. In isolation this
drops _get_bound_kernel_cache_key from ~0.93us to ~0.22us per call.
The dataclass form is still produced by _create_bound_kernel_cache_key
for the autotuner caches (LocalAutotuneCache / AOTAutotuneCache
subclass it into LooseAutotuneCacheKey); only the in-memory dict
switches to tuple keys. On the compile (cache-miss) path the dataclass
key is built once and unpacked into its (specialization_key,
extra_results) tuple form so the in-memory dict and the autotuner
caches stay keyed on the same value.
Also drops one extra-results tuple allocation when a kernel has no
hl.specialize() extras (the common case): the empty extra_fns list
short-circuits to a shared () literal instead of tuple([]).
Safety: cache-key *contents* are unchanged -- same signature tuple,
same extra results, same specialization axes (dtype, shape bucket,
device type+capability, ConstExpr values, key= fn, hl.specialize
extras); the tuple is just the dataclass's two fields in order, with
identical hash/equality. Verified by test_misc, test_config_api,
test_cache, test_specialize (132 passed), plus
dtype/shape/ConstExpr/specialize rebinding spot-checks.
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
```
n_args | baseline | prev commit | this commit
-------+----------+-------------+------------
2 | 24.14 us | 18.78 us | 16.41 us
8 | 33.05 us | 27.77 us | 24.78 us
16 | 43.56 us | 36.90 us | 35.24 us
```
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stack-info: PR: #2747, branch: yushangdi/stack/29
measure("Kernel.bind") wraps the whole body of Kernel.bind, which runs
on every kernel call. Even when HELION_MEASURE_COMPILE_TIME is unset
(the default) and measure() returns a shared no-op nullcontext,
entering and exiting the `with` block still costs ~115ns/call -- the
context-manager protocol itself, not the measurement -- on the
steady-state cache-hit dispatch path.
Gate it: bind() checks _compile_time.is_enabled() (a plain module-flag
read, ~5ns) and only enters the `with measure(...)` block when
measurement is actually on; otherwise it calls the extracted
_bind_impl directly. The measured region is unchanged -- when enabled,
the entire bind body (key computation, cache lookup, and any compile)
runs under the same "Kernel.bind" scope as before, and every bind()
call is still counted, including cache hits. Only the
nothing-to-measure fast path skips the context manager.
No behavior change: with HELION_MEASURE_COMPILE_TIME=1 the timing
report is identical to before (same scope, same call counts); with it
unset there was never any data recorded, only the protocol overhead
that is now avoided. Verified by test_compile_time (metric still
records when enabled), plus test_misc / test_config_api / test_cache /
test_specialize.
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
```
n_args | baseline | prev commit | this commit
-------+----------+-------------+------------
2 | 24.14 us | 18.04 us | 18.01 us
8 | 33.05 us | 26.11 us | 25.57 us
16 | 43.56 us | 36.53 us | 36.41 us
```
(The per-call win is ~0.4us isolated to bind(); at the end-to-end
scale here it is within run-to-run noise.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stack-info: PR: #2752, branch: yushangdi/stack/33
_tensor_key wraps every size and stride element in _hashable_dim to normalize torch.SymInt into a hashable (shape_env_id, expr) pair. That wrap exists only for symbolic shapes, which appear on FakeTensors during tracing -- yet concrete tensors paid for it on every kernel call: two Python-level loops (sizes + strides) with an isinstance check per dimension, plus rebuilding each as a fresh tuple. For the default static_shapes=True path this was the bulk of per-tensor key extraction (~0.6us each, x number of tensor args, every call). Specialization-extractor dispatch is by exact type (_specialization_extractors.get(type(obj))), so the torch.Tensor and torch.nn.Parameter entries are only ever hit by objects whose sizes/strides are guaranteed concrete ints. Route those to a new _concrete_tensor_key that uses obj.size() (a torch.Size) and obj.stride() directly as key components. Both are tuple subclasses: they hash and compare identically to plain int tuples, so keys produced by the fast path and the SymInt path for the same concrete shape are interchangeable -- no cache invalidation, and the specialization axes (dtype, shape, stride, _dynamo_static_indices, int32/int64 index width, dynamic-shape buckets) are bit-identical. Anything that can carry SymInts keeps the safe path: * FakeTensor has its own dispatch entry -> _tensor_key (unchanged) * torch.Tensor *subclasses* (e.g. the JAX-export adapter) miss the exact-type dict and hit the isinstance fallback in _specialization_key, which now routes to _tensor_key explicitly (This is the same optimization as the Python-only part of upstream PR #2611, independently arrived at from profiling.) New test/test_tensor_key_fast_path.py pins down the key equivalence (fast vs wrapped key hash/compare equal under static and dynamic shapes, incl. strided tensors), the dispatch routing (concrete -> _concrete_tensor_key; FakeTensor and the subclass fallback -> _tensor_key), that a torch.Tensor subclass takes the SymInt-safe path and still shares a BoundKernel with a plain tensor of the same shape, and that bind() caching/distinguishing is unchanged. Also verified against test_misc, test_specialize, test_torch_compile (the torch.compile suite exercises the FakeTensor/SymInt path). Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit -------+----------+-------------+------------ 2 | 24.14 us | 18.01 us | 17.25 us 8 | 33.05 us | 25.57 us | 24.50 us 16 | 43.56 us | 36.41 us | 32.16 us ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2748, branch: yushangdi/stack/30
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
Every Helion kernel call recomputes the bound-kernel cache key, which
includes the device's compute capability via target_device_capability().
That function called torch.cuda.is_available() plus
torch.cuda.get_device_capability() on every single kernel launch,
costing ~2.8us/call of pure Python/CUDA-runtime overhead -- the single
largest line item in Helion-side dispatch.
A physical device's compute capability cannot change within a process,
so cache it per device index in a module-level dict, and cache the
is_available() result once it returns True (an unavailable runtime has
no GPU launches to speed up, so False keeps re-querying).
Safety: tests simulate other architectures by patching
torch.cuda.get_device_capability / torch.cuda.is_available (e.g.
test_config_api.py's sm90-vs-sm100 cache-key test). The fast path
engages only while both functions are identity-equal to the originals
captured at import time; patched functions route to the original
uncached path, so the cache can never be poisoned by, nor serve stale
values to, arch-simulation tests.
An index-less torch.device("cuda") refers to the *current* device,
which can change between calls -- it is resolved per call via
torch.cuda.current_device() (cheap) before hitting the per-index cache.
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
```
n_args | before | after | delta
-------+----------+----------+-------
2 | 24.14 us | 18.78 us | -22%
8 | 33.05 us | 27.77 us | -16%
16 | 43.56 us | 36.90 us | -15%
```
Benchmark script (used for this and the follow-up launch-overhead
commits; wall-clock around the Python call deliberately captures
CPU-side dispatch cost, which CUDA-event timing excludes -- the kernel
is tiny, so the loop is CPU-bound and measures per-call issue cost):
```python
import os, sys
os.environ.setdefault("HELION_AUTOTUNE_EFFORT", "none")
import time, torch, helion, helion.language as hl
mod_src = ["import torch, helion, helion.language as hl\n"]
for n_args in (2, 8, 16):
src_args = ", ".join(f"t{i}" for i in range(n_args))
mod_src.append(f'''
@helion.kernel(config=helion.Config(block_sizes=[128], num_warps=4, num_stages=2))
def k{n_args}({src_args}):
out = torch.empty_like(t0)
for tile in hl.tile(t0.size(0)):
out[tile] = {' + '.join(f't{i}[tile]' for i in range(n_args))}
return out
''')
open("/tmp/bench_kernels.py", "w").write("\n".join(mod_src))
sys.path.insert(0, "/tmp")
import bench_kernels
def bench(n_args, n=20000):
k = getattr(bench_kernels, f"k{n_args}")
ts = [torch.randn(4096, device="cuda") for _ in range(n_args)]
for _ in range(50): # warmup: compile + prime launcher
k(*ts)
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n):
k(*ts)
dt = (time.perf_counter() - t0) / n * 1e6
torch.cuda.synchronize()
return dt
for n_args in (2, 8, 16):
print(f"n_args={n_args:3d}: {bench(n_args):6.2f} us/call")
```
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stack-info: PR: #2746, branch: yushangdi/stack/28
measure() wrapped the entire body of Kernel.bind, which runs on every
kernel call. When HELION_MEASURE_COMPILE_TIME is unset (the default)
measure() returns a shared no-op context manager, but entering and
exiting it still costs ~0.4-0.5us/call: two extra frames plus a module
global read, on the steady-state cache-hit path.
Compile-time tracking only has meaningful data on the cache-*miss*
branch, where bind actually compiles and specializes a new
BoundKernel. The hit path is a specialization-key computation plus a
dict lookup -- nanoseconds of work that the tracker was never meant to
account for. Move the `with measure("Kernel.bind"):` block to wrap
only the miss branch; the hit path returns directly with no context
manager.
No behavior change: the same code runs under the same measurement
scope on the miss path, and HELION_MEASURE_COMPILE_TIME=1 still
attributes all compile/specialize time to "Kernel.bind". Verified by
test_misc, test_config_api, test_cache, test_specialize.
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
```
n_args | baseline | prev commit | this commit
-------+----------+-------------+------------
2 | 24.14 us | 16.41 us | 16.30 us
8 | 33.05 us | 24.78 us | 24.39 us
16 | 43.56 us | 35.24 us | 34.84 us
```
(The per-call win is ~0.4us isolated to bind(); at the end-to-end
scale here it is partly absorbed into run-to-run noise.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stack-info: PR: #2751, branch: yushangdi/stack/32
measure("Kernel.bind") wraps the whole body of Kernel.bind, which runs
on every kernel call. Even when HELION_MEASURE_COMPILE_TIME is unset
(the default) and measure() returns a shared no-op nullcontext,
entering and exiting the `with` block still costs ~115ns/call -- the
context-manager protocol itself, not the measurement -- on the
steady-state cache-hit dispatch path.
Gate it: bind() checks _compile_time.is_enabled() (a plain module-flag
read, ~5ns) and only enters the `with measure(...)` block when
measurement is actually on; otherwise it calls the extracted
_bind_impl directly. The measured region is unchanged -- when enabled,
the entire bind body (key computation, cache lookup, and any compile)
runs under the same "Kernel.bind" scope as before, and every bind()
call is still counted, including cache hits. Only the
nothing-to-measure fast path skips the context manager.
No behavior change: with HELION_MEASURE_COMPILE_TIME=1 the timing
report is identical to before (same scope, same call counts); with it
unset there was never any data recorded, only the protocol overhead
that is now avoided. Verified by test_compile_time (metric still
records when enabled), plus test_misc / test_config_api / test_cache /
test_specialize.
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
```
n_args | baseline | prev commit | this commit
-------+----------+-------------+------------
2 | 24.14 us | 18.04 us | 18.01 us
8 | 33.05 us | 26.11 us | 25.57 us
16 | 43.56 us | 36.53 us | 36.41 us
```
(The per-call win is ~0.4us isolated to bind(); at the end-to-end
scale here it is within run-to-run noise.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stack-info: PR: #2752, branch: yushangdi/stack/33
fp8 scaled_mm feeds B column-major (K-contiguous) -- the layout torch._scaled_mm requires and the standalone CUTLASS CuTe kernel uses (b_major="k"). Helion previously gated the tcgen05 TMA path on rhs.is_contiguous() (row-major only) and hardcoded B as MN-major, so a column-major B operand was forced onto the slow non-TMA fallback (~5 TFLOP/s), and even feeding row-major B used the slower MN-major MMA layout. This makes B's major mode layout-aware: - cute_mma.py: classify a 2D operand as TMA-eligible when contiguous in EITHER axis (row -> MN-major, col -> K-major) instead of requiring row-major is_contiguous(); compute tcgen05_b_k_major from rhs stride and thread it through _make_tiled_mma_setup / _make_tcgen05_layout_plan_setup and onto the ab_tma wrapper plan. - _tcgen05_tiled_mma_expr / strategies.tcgen05_smem_layout_expr: emit OperandMajorMode.K and make_smem_layout_b(is_k_major=True) for B when K-major (delegating to CuTe's helper, mirroring the K-major A path). - runtime/__init__.py: host wrapper emits the K-major B tiled_mma + SMEM layout and marks the B TMA view leading_dim=1 (K axis) for K-major vs leading_dim=0 (N axis) for MN-major. - test_cute_backend.py: test_matmul_mma_tcgen05_fp8_col_major_b. All paths default to the prior MN-major behavior when B is row-major, so existing golden/codegen tests are byte-identical (full cute suite: 93 passed). K-major B is correct (rel_err 0.0000 vs torch._scaled_mm). Benchmark (B200, CUDA 13.2, fp8 e4m3 scaled_mm, m=k=n=4096, CtaGroup.TWO cluster_m=2 ab_stages=3 role-local, do_bench, 10s warmup): row-major B (MN-major, old forced layout) : 577 TFLOP/s 31% of aten col-major B (K-major, this change) : 1143 TFLOP/s 62% of aten torch._scaled_mm : 1857 TFLOP/s i.e. ~2x from using the tensor-core-native K-major B layout. The remaining gap to torch._scaled_mm is deep AB staging (ab_stages>3), a separate lever. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> stack-info: PR: #2740, branch: yushangdi/stack/25
…gen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile:
* the fully unrolled ab_stages-deep warm-up prologue (8 guarded
acquire/copy/commit blocks at ab=8, each with a re-computed
per-stage "k_tile fits in K" predicate that is a compile-time truth
under static shapes), and
* the separate steady-state producer loop issuing TMA at
k_tile + ab_stages with per-iteration full-tile/next-full-tile
predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full, so consumer wait / release
run unconditionally under their two-CTA owner gates), and all per-K-iter
owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead
of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate -
the inline form cost an S2UR+IMAD pair per use inside the K loop and was
the single largest instruction-count delta vs the reference (+508k
dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
before after standalone
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
kernel duration 52.2 us 51.9 us 48.0 us
tensor-pipe active 81.6 % 81.5 % 84.2 %
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound. Combined with the epilogue spill fix, the fused
scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) vs the standalone's 48.7 us /
2317-2334, up from 74.2 us / 1509-1711 at the start of this
investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale.
Scope and fallbacks:
* The unified producer requires static-full tiles + the role-local
separate producer loop (the producer warp owns its K loop, so a
blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail
families and the bridge diagnostics that skip individual
acquire/advance edges keep the unrolled-prologue form.
* _build_kloop_pipeline_consumer_if/_release_if become list-returning
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims
keep the old call shape where the gated form is still used.
* The ab_stages=3 stage-gap deadlock (cute_plan.md 6.9.1) is
structurally impossible in the unified form: the blocking acquire
fills every stage in order. The regression test now pins the
unified codegen markers and still verifies end-to-end correctness.
Test updates: three tests pinned old codegen shapes (per-stage prefetch
markers, inline cluster-leader predicates in the K loop, the
producer-if's full-tile gate) and now pin the unified/hoisted forms.
Full HELION_BACKEND=cute sweep: test_cute_lowerings.py +
test_cute_backend.py 479 passed, other cute test files 85 passed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
stack-info: PR: #2755, branch: yushangdi/stack/35
2237a51 to
f407dea
Compare
d853568 to
3541d8d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked PRs:
[cute] Unified rolled TMA producer + hoisted K-loop predicates for tcgen05
Static-full role-local tcgen05 configs now emit the CUTLASS reference
producer shape: ONE rolled K loop in the TMA-load warp whose blocking
producer_acquire fills the AB ring - pipelining comes from the mbarrier
handshake, not from codegen structure. This replaces, per work tile:
acquire/copy/commit blocks at ab=8, each with a re-computed
per-stage "k_tile fits in K" predicate that is a compile-time truth
under static shapes), and
k_tile + ab_stages with per-iteration full-tile/next-full-tile
predicate recomputation.
The MMA-exec K loop drops its per-iteration full-tile predicate the same
way (static-full means every K tile is full, so consumer wait / release
run unconditionally under their two-CTA owner gates), and all per-K-iter
owner gates now reuse a hoisted tcgen05_cta_rank_in_cluster local instead
of re-deriving make_warp_uniform(block_idx_in_cluster()) at every gate -
the inline form cost an S2UR+IMAD pair per use inside the K loop and was
the single largest instruction-count delta vs the reference (+508k
dynamic IMAD).
NCU on 4096x4096x4096 fp8 e4m3 -> bf16 plain matmul, B200,
block_sizes=[256,128,128], ab_stages=8, cluster_m=2:
warp instructions / SMSP 5820 4711 4866
static UTMALDG sites 24 2 2
dynamic SYNCS (barriers) 533k 507k 470k
kernel duration 52.2 us 51.9 us 48.0 us
tensor-pipe active 81.6 % 81.5 % 84.2 %
Helion now executes FEWER warp-instructions per SMSP than the CUTLASS
standalone; the remaining ~8% duration gap is no longer
instruction-count-bound. Combined with the epilogue spill fix, the fused
scaled_mm kernel measures 52.6 us / 2100-2132 TFLOP/s device-side
(batched back-to-back: 2359-2381) vs the standalone's 48.7 us /
2317-2334, up from 74.2 us / 1509-1711 at the start of this
investigation. relerr=0.0000 at bk=64 and bk=128 for plain and scale.
Scope and fallbacks:
separate producer loop (the producer warp owns its K loop, so a
blocking acquire cannot stall MMA/epilogue warps). Edge/K-tail
families and the bridge diagnostics that skip individual
acquire/advance edges keep the unrolled-prologue form.
_stmts builders so the ungated static-full forms can emit sibling
statements without an artificial wrapper; single-statement shims
keep the old call shape where the gated form is still used.
structurally impossible in the unified form: the blocking acquire
fills every stage in order. The regression test now pins the
unified codegen markers and still verifies end-to-end correctness.
Test updates: three tests pinned old codegen shapes (per-stage prefetch
markers, inline cluster-leader predicates in the K loop, the
producer-if's full-tile gate) and now pin the unified/hoisted forms.
Full HELION_BACKEND=cute sweep: test_cute_lowerings.py +
test_cute_backend.py 479 passed, other cute test files 85 passed.
Co-Authored-By: Claude Fable 5 noreply@anthropic.com