
Install a per-spec fast launcher that bypasses Triton's JITFunction.run #2749

Open
yushangdi wants to merge 1 commit into yushangdi/stack/37 from yushangdi/stack/31


@yushangdi yushangdi commented Jun 10, 2026


Stacked PRs:


Install a per-spec fast launcher that bypasses Triton's JITFunction.run

Every Helion kernel launch went through Triton's full JITFunction.run
pipeline (~9.3us): per-call device + stream proxy resolution, the
argument binder, compute_cache_key, kernel-cache lookup, the
used_global_vals walk, launch_metadata construction (even with no
profiler attached), and the kwargs-dict munging around all of it. For
a Helion kernel almost all of that is redundant: BoundKernel has
already specialized on dtype/shape/stride/device, so the only
Triton-level specialization left at launch time is pointer alignment
and binary-affecting knob state.

This ports the _FastLauncher design from upstream PR #2565 (plus the
set_config function-clone fix from PR #2635) onto main:

  • helion/runtime/_fast_launcher.py -- default_launcher moves here
    unchanged (still re-exported from helion.runtime), joined by
    _FastLauncher: a multi-spec launcher primed on first call. The hot
    path computes a tiny spec key inline -- an alignment bitmask over
    the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/
    stages-hook knob state -- dict-looks-up the compiled binary for
    that spec, and jumps straight into Triton's C launcher
    (CompiledKernel.run). Spec misses compile through Triton's full
    pipeline once and are cached, so call sites alternating aligned/
    unaligned tensors stay on the fast path for both.

  • BoundKernel.set_config clones the PyCodeCache'd host function
    (PyCodeCache keys on source hash, so two BoundKernels can share one
    function object) and re-points its _launcher kwdefault at a
    _FastLauncher with the config's num_warps/num_stages/etc. baked in.
    Explicit _launcher= callers (the autotune trial harness) override
    the kwdefault naturally.

  • TritonBackend.launcher_runtime_kwargs factors the runtime kwarg
    values out of launcher_keyword_args so codegen strings and the
    launcher closure share one source of truth.
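
The set_config clone step can be pictured with plain Python functions. This is a minimal sketch, not Helion's code: it assumes the host function takes `_launcher` as a keyword-only argument, and `clone_with_launcher`, `host_fn`, and `make_fast_launcher` are hypothetical names made up for illustration.

```python
import types


def clone_with_launcher(fn, launcher):
    """Copy `fn` and re-point the copy's keyword-only `_launcher` default.

    The original function object is left untouched, so another owner that
    shares it (for example via a source-hash cache) keeps its own default.
    """
    clone = types.FunctionType(
        fn.__code__, fn.__globals__, fn.__name__, fn.__defaults__, fn.__closure__
    )
    clone.__kwdefaults__ = {**(fn.__kwdefaults__ or {}), "_launcher": launcher}
    clone.__dict__.update(fn.__dict__)
    clone.__qualname__ = fn.__qualname__
    return clone


def default_launcher(kernel, grid, *args):
    # Stand-in for the slow, fully general launch path.
    return ("default", args)


def host_fn(x, y, *, _launcher=default_launcher):
    # Stand-in for a generated host function that launches one kernel.
    return _launcher("kernel", (1,), x, y)


def make_fast_launcher(num_warps):
    # Hypothetical factory: bakes a config value into the launcher closure.
    def fast(kernel, grid, *args):
        return ("fast", num_warps, args)

    return fast


# Two "bound kernels" share host_fn but get independent launcher defaults.
fn_a = clone_with_launcher(host_fn, make_fast_launcher(4))
fn_b = clone_with_launcher(host_fn, make_fast_launcher(8))
```

Because the launcher lives in the clone's `__kwdefaults__`, a caller that passes `_launcher=` explicitly still wins, which matches the override behavior described for the autotune trial harness.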

Safety / correctness guards, each pinned by a test in
test/test_fast_launcher.py:

  • Alignment is part of the spec key, so an unaligned tensor after an
    aligned prime gets its own correctly-compiled binary -- never the
    vectorized aligned binary (which would fault), and never a clone
    (which would silently drop writes to output args).
  • used_global_vals is snapshotted per spec entry; any mutation falls
    back to JITFunction.run so Triton's own RuntimeError surfaces
    instead of a stale binary launching silently.
  • torch.compile tracing routes through default_launcher so Dynamo's
    triton_kernel_wrapper_mutation HOP rules apply.
  • Multi-device guard: a current-device change after priming falls
    back to Triton's per-device dispatch.
  • launch_enter/exit hooks are re-read per call (a profiler attached
    after priming still fires; launch_metadata is built only when a
    hook will consume it); pre_run_hooks fire inline; flipping
    knobs.runtime.debug lands on a new spec entry and recompiles.
  • Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall
    back to default_launcher permanently.
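
The spec-key guard and the permanent fallback can be sketched in pure Python. This is an illustration under stated assumptions, not the _FastLauncher implementation: `SpecCachingLauncher`, `FakeTensor`, and the `compile_fn`/`slow_path`/`read_knobs` callables are hypothetical, and setting a mask bit for each 16-byte-aligned argument is an assumed encoding.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Stand-in for a tensor; only data_ptr() matters for the spec key."""

    ptr: int

    def data_ptr(self) -> int:
        return self.ptr


class SpecCachingLauncher:
    def __init__(self, compile_fn, slow_path, read_knobs):
        self._compile = compile_fn    # key -> callable "binary" (assumed)
        self._slow = slow_path        # fully general launch path (assumed)
        self._read_knobs = read_knobs # returns a hashable knob snapshot
        self._binaries = {}           # spec key -> compiled binary
        self._disabled = False

    def spec_key(self, args):
        # Bit i is set when argument i is a tensor whose pointer is
        # 16-byte aligned, i.e. (data_ptr() & 15) == 0.
        mask = 0
        for i, arg in enumerate(args):
            if hasattr(arg, "data_ptr") and (arg.data_ptr() & 15) == 0:
                mask |= 1 << i
        return (mask, self._read_knobs())

    def __call__(self, *args):
        if self._disabled:
            return self._slow(*args)
        key = self.spec_key(args)
        binary = self._binaries.get(key)
        if binary is None:
            try:
                binary = self._compile(key)  # spec miss: compile once
            except Exception:
                self._disabled = True        # fall back permanently
                return self._slow(*args)
            self._binaries[key] = binary
        return binary(*args)
```

With this shape, a call site that alternates aligned and unaligned tensors ends up with two cached entries and never reuses the aligned binary for an unaligned pointer; changing a knob value produces a new key and therefore a fresh compile.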

Verified: test_fast_launcher + test_misc (51 passed), and full runs of
test_torch_compile (244), test_examples (96), test_autotuner (122),
test_indexing, test_loops, test_grid, test_ref_eager, test_specialize,
test_config_api, test_cache earlier on this branch.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

  n_args | baseline | prev commit | this commit | total
  -------+----------+-------------+-------------+------
       2 | 24.14 us |    17.25 us |    13.63 us |  -44%
       8 | 33.05 us |    24.50 us |    18.99 us |  -43%
      16 | 43.56 us |    32.16 us |    25.73 us |  -41%
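
For reference, a per-call wall-time measurement of this kind could look like the sketch below. It is not the script behind the table: `time_per_call_us`, `percent_change`, and the warmup count are assumptions, and for a real GPU kernel the device would also need to be synchronized before each clock read.

```python
import time


def time_per_call_us(fn, iters=20_000, warmup=1_000):
    """Average wall time per call of `fn`, in microseconds."""
    for _ in range(warmup):  # reach steady state before timing
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e6


def percent_change(baseline_us, new_us):
    """Relative change versus the baseline, as a signed percentage."""
    return (new_us - baseline_us) / baseline_us * 100.0


# Recomputing the "total" column from the table's first and fourth columns.
for baseline, current in [(24.14, 13.63), (33.05, 18.99), (43.56, 25.73)]:
    print(f"{percent_change(baseline, current):.0f}%")
```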

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@yushangdi yushangdi force-pushed the yushangdi/stack/30 branch from 79b4c3b to c9e5d81 Compare June 10, 2026 23:39
yushangdi added a commit that referenced this pull request Jun 10, 2026
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

  n_args | baseline | prev commit | this commit | total
  -------+----------+-------------+-------------+------
       2 | 24.14 us |    15.53 us |    12.19 us |  -49%
       8 | 33.05 us |    22.12 us |    17.87 us |  -46%
      16 | 43.56 us |    30.62 us |    24.65 us |  -43%

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2749, branch: yushangdi/stack/31
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 1e8127a to 1f95d7d Compare June 10, 2026 23:39
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 10, 2026
@yushangdi yushangdi changed the base branch from yushangdi/stack/30 to main June 10, 2026 23:43
yushangdi added a commit that referenced this pull request Jun 10, 2026
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 1f95d7d to d9b27b0 Compare June 10, 2026 23:43
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/30 June 10, 2026 23:43
@yushangdi yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 00:37
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from d9b27b0 to 7091d83 Compare June 11, 2026 00:37
yushangdi added a commit that referenced this pull request Jun 11, 2026
yushangdi added a commit that referenced this pull request Jun 11, 2026
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 7091d83 to dc28a61 Compare June 11, 2026 00:37
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 00:37
@yushangdi yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 00:45
yushangdi added a commit that referenced this pull request Jun 11, 2026
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from dc28a61 to 1b8f941 Compare June 11, 2026 00:45
yushangdi added a commit that referenced this pull request Jun 11, 2026
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 1b8f941 to dc37fce Compare June 11, 2026 00:45
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 00:45
@yushangdi yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 01:15
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from dc37fce to fdf25d2 Compare June 11, 2026 01:16
yushangdi added a commit that referenced this pull request Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run
pipeline (~9.3us): per-call device + stream proxy resolution, the
argument binder, compute_cache_key, kernel-cache lookup, the
used_global_vals walk, launch_metadata construction (even with no
profiler attached), and the kwargs-dict munging around all of it. For
a Helion kernel almost all of that is redundant: BoundKernel has
already specialized on dtype/shape/stride/device, so the only
Triton-level specialization left at launch time is pointer alignment
and binary-affecting knob state.

This ports the _FastLauncher design from upstream PR #2565 (plus the
set_config function-clone fix from PR #2635) onto main:

 * helion/runtime/_fast_launcher.py -- default_launcher moves here
   unchanged (still re-exported from helion.runtime), joined by
   _FastLauncher: a multi-spec launcher primed on first call. The hot
   path computes a tiny spec key inline -- an alignment bitmask over
   the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/
   stages-hook knob state -- dict-looks-up the compiled binary for
   that spec, and jumps straight into Triton's C launcher
   (CompiledKernel.run). Spec misses compile through Triton's full
   pipeline once and are cached, so call sites alternating aligned/
   unaligned tensors stay on the fast path for both.

 * BoundKernel.set_config clones the PyCodeCache'd host function
   (PyCodeCache keys on source hash, so two BoundKernels can share one
   function object) and re-points its _launcher kwdefault at a
   _FastLauncher with the config's num_warps/num_stages/etc. baked in.
   Explicit _launcher= callers (the autotune trial harness) override
   the kwdefault naturally.

 * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg
   values out of launcher_keyword_args so codegen strings and the
   launcher closure share one source of truth.

Safety / correctness guards, each pinned by a test in
test/test_fast_launcher.py:
 * Alignment is part of the spec key, so an unaligned tensor after an
   aligned prime gets its own correctly-compiled binary -- never the
   vectorized aligned binary (which would fault), and never a clone
   (which would silently drop writes to output args).
 * used_global_vals snapshot per spec entry; any mutation falls back
   to JITFunction.run so Triton's own RuntimeError surfaces instead of
   silently launching a stale binary.
 * torch.compile tracing routes through default_launcher so Dynamo's
   triton_kernel_wrapper_mutation HOP rules apply.
 * Multi-device guard: a current-device change after priming falls
   back to Triton's per-device dispatch.
 * launch_enter/exit hooks are re-read per call (a profiler attached
   after priming still fires; launch_metadata is built only when a
   hook will consume it); pre_run_hooks fire inline; flipping
   knobs.runtime.debug lands on a new spec entry and recompiles.
 * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall
   back to default_launcher permanently.
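
The fallback behavior in the last guard and the used_global_vals guard can
be pictured with the following minimal sketch. It is an assumption-laden
model rather than the real launcher: GuardedLauncher, prime, slow, and
live_globals are hypothetical names, the slow callable stands in for
default_launcher / JITFunction.run, and reading the environment variable at
construction time is a simplification chosen for the example.

```python
import os
from typing import Any, Callable, Dict, Optional


class GuardedLauncher:
    """Use the fast path only while every guard holds; otherwise call slow."""

    def __init__(
        self,
        prime: Callable[[], Callable[..., Any]],
        slow: Callable[..., Any],
        live_globals: Dict[str, Any],
    ) -> None:
        self._prime = prime                # builds the fast callable (may raise)
        self._slow = slow                  # stand-in for the default launcher
        self._live_globals = live_globals  # re-read on every call
        self._fast: Optional[Callable[..., Any]] = None
        self._snapshot: Optional[Dict[str, Any]] = None
        # Assumption: the opt-out variable is checked once, at construction.
        self._disabled = os.environ.get("HELION_SKIP_FAST_LAUNCHER") == "1"

    def __call__(self, *args: Any) -> Any:
        if self._disabled:
            return self._slow(*args)
        if self._fast is None:
            try:
                self._fast = self._prime()
            except Exception:
                self._disabled = True      # permanent fallback after a failure
                return self._slow(*args)
            self._snapshot = dict(self._live_globals)
        if self._snapshot != self._live_globals:
            # A global changed after priming: defer to the slow path so that
            # it can report the problem instead of reusing a stale binary.
            return self._slow(*args)
        return self._fast(*args)
```

The point of the design is that each guard degrades to the slower, fully
checked path instead of guessing, so a failed prime or a mutated global costs
time but never launches a binary compiled under different assumptions.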

Verified: test_fast_launcher + test_misc (51 passed), and full runs of
test_torch_compile (244), test_examples (96), test_autotuner (122),
test_indexing, test_loops, test_grid, test_ref_eager, test_specialize,
test_config_api, test_cache earlier on this branch.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args | baseline | prev commit | this commit | total vs. baseline
  -------+----------+-------------+-------------+-------------------
       2 | 24.14 us |    15.53 us |    12.19 us |  -49%
       8 | 33.05 us |    22.12 us |    17.87 us |  -46%
      16 | 43.56 us |    30.62 us |    24.65 us |  -43%
```

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2749, branch: yushangdi/stack/31
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 01:16
@yushangdi yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 01:55
yushangdi added a commit that referenced this pull request Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run
pipeline (~9.3us): per-call device + stream proxy resolution, the
argument binder, compute_cache_key, kernel-cache lookup, the
used_global_vals walk, launch_metadata construction (even with no
profiler attached), and the kwargs-dict munging around all of it. For
a Helion kernel almost all of that is redundant: BoundKernel has
already specialized on dtype/shape/stride/device, so the only
Triton-level specialization left at launch time is pointer alignment
and binary-affecting knob state.

This ports the _FastLauncher design from upstream PR #2565 (plus the
set_config function-clone fix from PR #2635) onto main:

 * helion/runtime/_fast_launcher.py -- default_launcher moves here
   unchanged (still re-exported from helion.runtime), joined by
   _FastLauncher: a multi-spec launcher primed on first call. The hot
   path computes a tiny spec key inline -- an alignment bitmask over
   the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/
   stages-hook knob state -- dict-looks-up the compiled binary for
   that spec, and jumps straight into Triton's C launcher
   (CompiledKernel.run). Spec misses compile through Triton's full
   pipeline once and are cached, so call sites alternating aligned/
   unaligned tensors stay on the fast path for both.

 * BoundKernel.set_config clones the PyCodeCache'd host function
   (PyCodeCache keys on source hash, so two BoundKernels can share one
   function object) and re-points its _launcher kwdefault at a
   _FastLauncher with the config's num_warps/num_stages/etc. baked in.
   Explicit _launcher= callers (the autotune trial harness) override
   the kwdefault naturally.

 * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg
   values out of launcher_keyword_args so codegen strings and the
   launcher closure share one source of truth.

Safety / correctness guards, each pinned by a test in
test/test_fast_launcher.py:
 * Alignment is part of the spec key, so an unaligned tensor after an
   aligned prime gets its own correctly-compiled binary -- never the
   vectorized aligned binary (which would fault), and never a clone
   (which would silently drop writes to output args).
 * used_global_vals snapshot per spec entry; any mutation falls back
   to JITFunction.run so Triton's own RuntimeError surfaces instead of
   silently launching a stale binary.
 * torch.compile tracing routes through default_launcher so Dynamo's
   triton_kernel_wrapper_mutation HOP rules apply.
 * Multi-device guard: a current-device change after priming falls
   back to Triton's per-device dispatch.
 * launch_enter/exit hooks are re-read per call (a profiler attached
   after priming still fires; launch_metadata is built only when a
   hook will consume it); pre_run_hooks fire inline; flipping
   knobs.runtime.debug lands on a new spec entry and recompiles.
 * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall
   back to default_launcher permanently.

Verified: test_fast_launcher + test_misc (51 passed), and full runs of
test_torch_compile (244), test_examples (96), test_autotuner (122),
test_indexing, test_loops, test_grid, test_ref_eager, test_specialize,
test_config_api, test_cache earlier on this branch.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args | baseline | prev commit | this commit | total vs. baseline
  -------+----------+-------------+-------------+-------------------
       2 | 24.14 us |    17.25 us |    13.63 us |  -44%
       8 | 33.05 us |    24.50 us |    18.99 us |  -43%
      16 | 43.56 us |    32.16 us |    25.73 us |  -41%
```

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2749, branch: yushangdi/stack/31
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from fdf25d2 to 7ea9e8b Compare June 11, 2026 01:55
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 01:55
@yushangdi yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 02:02
yushangdi added a commit that referenced this pull request Jun 11, 2026
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 7ea9e8b to 0ce7eba Compare June 11, 2026 02:02
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 02:02
@yushangdi yushangdi force-pushed the yushangdi/stack/30 branch from f9c2756 to 4a78314 Compare June 11, 2026 17:09
yushangdi added a commit that referenced this pull request Jun 11, 2026
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 0ce7eba to 719414a Compare June 11, 2026 17:09
@yushangdi yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 17:52
yushangdi added a commit that referenced this pull request Jun 11, 2026
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 719414a to 600e739 Compare June 11, 2026 17:52
yushangdi added a commit that referenced this pull request Jun 11, 2026
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 600e739 to 43439b5 Compare June 11, 2026 17:52
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/37 June 11, 2026 17:52
@yushangdi yushangdi changed the base branch from yushangdi/stack/37 to main June 11, 2026 18:47
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 43439b5 to 35ccee7 Compare June 11, 2026 18:47
yushangdi added a commit that referenced this pull request Jun 11, 2026
yushangdi added a commit that referenced this pull request Jun 11, 2026
yushangdi added a commit that referenced this pull request Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run
pipeline (~9.3us): per-call device + stream proxy resolution, the
argument binder, compute_cache_key, kernel-cache lookup, the
used_global_vals walk, launch_metadata construction (even with no
profiler attached), and the kwargs-dict munging around all of it. For
a Helion kernel almost all of that is redundant: BoundKernel has
already specialized on dtype/shape/stride/device, so the only
Triton-level specialization left at launch time is pointer alignment
and binary-affecting knob state.

This ports the _FastLauncher design from upstream PR #2565 (plus the
set_config function-clone fix from PR #2635) onto main:

 * helion/runtime/_fast_launcher.py -- default_launcher moves here
   unchanged (still re-exported from helion.runtime), joined by
   _FastLauncher: a multi-spec launcher primed on first call. The hot
   path computes a tiny spec key inline -- an alignment bitmask over
   the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/
   stages-hook knob state -- dict-looks-up the compiled binary for
   that spec, and jumps straight into Triton's C launcher
   (CompiledKernel.run). Spec misses compile through Triton's full
   pipeline once and are cached, so call sites alternating aligned/
   unaligned tensors stay on the fast path for both.

 * BoundKernel.set_config clones the PyCodeCache'd host function
   (PyCodeCache keys on source hash, so two BoundKernels can share one
   function object) and re-points its _launcher kwdefault at a
   _FastLauncher with the config's num_warps/num_stages/etc. baked in.
   Explicit _launcher= callers (the autotune trial harness) override
   the kwdefault naturally.

 * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg
   values out of launcher_keyword_args so codegen strings and the
   launcher closure share one source of truth.
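
A minimal sketch of the spec-key lookup described in the first bullet, under stated assumptions: `spec_key`, `SpecCache`, and `FakeTensor` are hypothetical names for illustration, not the code in helion/runtime/_fast_launcher.py, and the bit polarity (bit set when aligned) is an arbitrary choice. Only the `data_ptr() & 15` test and the dict-per-spec idea come from the description above.

```python
from typing import Any, Callable, Dict, Hashable, Tuple

ALIGN_MASK = 15  # low 4 address bits; zero means the pointer is 16-byte aligned

SpecKey = Tuple[int, Tuple[Hashable, ...]]


def spec_key(args: Tuple[Any, ...], knob_state: Tuple[Hashable, ...]) -> SpecKey:
    """Build a cheap key: one bit per tensor-like arg, set when it is aligned."""
    mask = 0
    for index, arg in enumerate(args):
        data_ptr = getattr(arg, "data_ptr", None)  # non-tensor args have none
        if data_ptr is not None and (data_ptr() & ALIGN_MASK) == 0:
            mask |= 1 << index
    return mask, knob_state


class SpecCache:
    """Hypothetical stand-in for the launcher's per-spec binary table."""

    def __init__(self, compile_fn: Callable[[SpecKey], Any]) -> None:
        self._compile_fn = compile_fn
        self._binaries: Dict[SpecKey, Any] = {}
        self.compile_count = 0

    def lookup(self, args: Tuple[Any, ...], knob_state: Tuple[Hashable, ...]) -> Any:
        key = spec_key(args, knob_state)
        binary = self._binaries.get(key)
        if binary is None:
            # Spec miss: compile once through the slow path, then cache.
            binary = self._compile_fn(key)
            self._binaries[key] = binary
            self.compile_count += 1
        return binary


class FakeTensor:
    """Test double exposing only data_ptr(), like a torch.Tensor does."""

    def __init__(self, address: int) -> None:
        self._address = address

    def data_ptr(self) -> int:
        return self._address
```

With this shape, a call site that alternates aligned and unaligned tensors compiles twice in total and then hits the dict on every later call.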

Safety / correctness guards, each pinned by a test in
test/test_fast_launcher.py:
 * Alignment is part of the spec key, so an unaligned tensor after an
   aligned prime gets its own correctly-compiled binary -- never the
   vectorized aligned binary (which would fault), and never a clone
   (which would silently drop writes to output args).
 * used_global_vals is snapshotted per spec entry; any mutation falls back
   to JITFunction.run so Triton's own RuntimeError surfaces instead of
   silently launching a stale binary.
 * torch.compile tracing routes through default_launcher so Dynamo's
   triton_kernel_wrapper_mutation HOP rules apply.
 * Multi-device guard: a current-device change after priming falls
   back to Triton's per-device dispatch.
 * launch_enter/exit hooks are re-read per call (a profiler attached
   after priming still fires; launch_metadata is built only when a
   hook will consume it); pre_run_hooks fire inline; flipping
   knobs.runtime.debug lands on a new spec entry and recompiles.
 * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall
   back to default_launcher permanently.
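
The fallback behaviour of two of these guards can be sketched as follows. This is an illustration only: `GuardedLauncher` and its constructor parameters are hypothetical, the snapshot is a plain dict copy rather than Triton's used_global_vals structure, and the environment variable is read once at construction, which may differ from the real launcher.

```python
import os
from typing import Any, Callable, Dict, Optional


class GuardedLauncher:
    """Hypothetical sketch: a fast path guarded by a snapshot of global values."""

    def __init__(
        self,
        fast: Callable[..., Any],
        slow: Callable[..., Any],
        read_globals: Callable[[], Dict[str, Any]],
        prime: Callable[[], None] = lambda: None,
    ) -> None:
        self._fast = fast
        self._slow = slow
        self._read_globals = read_globals
        self._prime = prime
        self._snapshot: Optional[Dict[str, Any]] = None
        # Opt-out switch named in the guard list above.
        self._disabled = os.environ.get("HELION_SKIP_FAST_LAUNCHER") == "1"

    def __call__(self, *args: Any) -> Any:
        if self._disabled:
            return self._slow(*args)
        if self._snapshot is None:
            try:
                self._prime()  # stands in for the first-call compile
                self._snapshot = dict(self._read_globals())
            except Exception:
                self._disabled = True  # priming failed: fall back permanently
                return self._slow(*args)
        if self._read_globals() != self._snapshot:
            # A global changed after priming: defer to the slow path so it
            # can report the problem instead of launching a stale binary.
            return self._slow(*args)
        return self._fast(*args)
```

Here `slow` plays the role of default_launcher / JITFunction.run, and `fast` plays the role of the direct jump into the compiled binary.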

Verified: test_fast_launcher + test_misc (51 passed), and full runs of
test_torch_compile (244), test_examples (96), test_autotuner (122),
test_indexing, test_loops, test_grid, test_ref_eager, test_specialize,
test_config_api, test_cache earlier on this branch.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

```
  n_args | baseline | prev commit | this commit | total
  -------+----------+-------------+-------------+------
       2 | 24.14 us |    17.25 us |    13.63 us |  -44%
       8 | 33.05 us |    24.50 us |    18.99 us |  -43%
      16 | 43.56 us |    32.16 us |    25.73 us |  -41%
```
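
The "total" column appears to be the relative change from baseline to this commit, rounded to a whole percent. That reading is an assumption, but it reproduces the three figures in the table, as this small check shows (`total_change_percent` is a helper named here for illustration):

```python
# (baseline us, this-commit us) per n_args, copied from the table above.
rows = {2: (24.14, 13.63), 8: (33.05, 18.99), 16: (43.56, 25.73)}


def total_change_percent(baseline_us: float, current_us: float) -> int:
    """Relative change from baseline to current, rounded to a whole percent."""
    return round((current_us - baseline_us) / baseline_us * 100)


for n_args, (baseline_us, current_us) in rows.items():
    print(f"{n_args:>2} args: {total_change_percent(baseline_us, current_us)}%")
```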

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2749, branch: yushangdi/stack/31
@yushangdi yushangdi force-pushed the yushangdi/stack/31 branch from 35ccee7 to aff901a Compare June 11, 2026 22:03
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/37 June 11, 2026 22:03
@yushangdi yushangdi marked this pull request as ready for review June 11, 2026 22:11
