Install a per-spec fast launcher that bypasses Triton's JITFunction.run by yushangdi · Pull Request #2749 · pytorch/helion

yushangdi · 2026-06-10T23:39:14Z

Install a per-spec fast launcher that bypasses Triton's JITFunction.run

Every Helion kernel launch went through Triton's full JITFunction.run
pipeline (~9.3us): per-call device + stream proxy resolution, the
argument binder, compute_cache_key, kernel-cache lookup, the
used_global_vals walk, launch_metadata construction (even with no
profiler attached), and the kwargs-dict munging around all of it. For
a Helion kernel almost all of that is redundant: BoundKernel has
already specialized on dtype/shape/stride/device, so the only
Triton-level specialization left at launch time is pointer alignment
and binary-affecting knob state.

This ports the _FastLauncher design from upstream PR #2565 (plus the
set_config function-clone fix from PR #2635) onto main:

* helion/runtime/_fast_launcher.py -- default_launcher moves here
  unchanged (still re-exported from helion.runtime), joined by
  _FastLauncher: a multi-spec launcher primed on first call. The hot
  path computes a tiny spec key inline -- an alignment bitmask over
  the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/
  stages-hook knob state -- dict-looks-up the compiled binary for
  that spec, and jumps straight into Triton's C launcher
  (CompiledKernel.run). Spec misses compile through Triton's full
  pipeline once and are cached, so call sites alternating aligned/
  unaligned tensors stay on the fast path for both.
* BoundKernel.set_config clones the PyCodeCache'd host function
  (PyCodeCache keys on source hash, so two BoundKernels can share one
  function object) and re-points its _launcher kwdefault at a
  _FastLauncher with the config's num_warps/num_stages/etc. baked in.
  Explicit _launcher= callers (the autotune trial harness) override
  the kwdefault naturally.
* TritonBackend.launcher_runtime_kwargs factors the runtime kwarg
  values out of launcher_keyword_args so codegen strings and the
  launcher closure share one source of truth.
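
To illustrate why set_config clones the host function before re-pointing
its _launcher kwdefault, here is a minimal sketch in plain Python. It does
not use Helion or Triton: host_fn, slow_launcher, make_fast_launcher, and
install_launcher are hypothetical stand-ins, and only the
types.FunctionType / __kwdefaults__ mechanics are real.

```python
import types


def clone_function(fn):
    """Copy fn so the clone owns an independent __kwdefaults__ dict."""
    clone = types.FunctionType(
        fn.__code__, fn.__globals__, fn.__name__, fn.__defaults__, fn.__closure__
    )
    clone.__kwdefaults__ = dict(fn.__kwdefaults__ or {})
    clone.__dict__.update(fn.__dict__)
    return clone


def slow_launcher(kernel, grid, *args, **kwargs):
    # Stand-in for the general-purpose launcher.
    return ("slow", kernel, grid, args, kwargs)


def make_fast_launcher(**baked):
    # Hypothetical stand-in for a launcher with config values baked in.
    def fast_launcher(kernel, grid, *args, **kwargs):
        return ("fast", kernel, grid, args, {**baked, **kwargs})

    return fast_launcher


def host_fn(x, *, _launcher=slow_launcher):
    # Stand-in for a cached host function shared by two bound kernels.
    return _launcher("my_kernel", (1,), x)


def install_launcher(fn, **config):
    # Mutating fn.__kwdefaults__ directly would leak one config into every
    # sharer of fn, so the default is re-pointed on a clone instead.
    clone = clone_function(fn)
    clone.__kwdefaults__["_launcher"] = make_fast_launcher(**config)
    return clone


a = install_launcher(host_fn, num_warps=4)
b = install_launcher(host_fn, num_warps=8)
```

In this sketch a and b each carry their own baked-in config, host_fn keeps
its original default, and a caller that passes _launcher= explicitly still
overrides the kwdefault.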

Safety / correctness guards, each pinned by a test in
test/test_fast_launcher.py:

* Alignment is part of the spec key, so an unaligned tensor after an
  aligned prime gets its own correctly-compiled binary -- never the
  vectorized aligned binary (which would fault), and never a clone
  (which would silently drop writes to output args).
* used_global_vals snapshot per spec entry; any mutation falls back
  to JITFunction.run so Triton's own RuntimeError surfaces instead of
  silently launching a stale binary.
* torch.compile tracing routes through default_launcher so Dynamo's
  triton_kernel_wrapper_mutation HOP rules apply.
* Multi-device guard: a current-device change after priming falls
  back to Triton's per-device dispatch.
* launch_enter/exit hooks are re-read per call (a profiler attached
  after priming still fires; launch_metadata is built only when a
  hook will consume it); pre_run_hooks fire inline; flipping
  knobs.runtime.debug lands on a new spec entry and recompiles.
* Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall
  back to default_launcher permanently.
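
The alignment and fallback guards above can be sketched in plain Python.
This is an illustration under stated assumptions, not the code in
_fast_launcher.py: FakeTensor, SpecLauncher, compile_fn, and slow_path are
hypothetical names, and the key is reduced to the alignment bitmask plus a
single debug flag.

```python
import os


class FakeTensor:
    """Hypothetical stand-in for a tensor; only data_ptr() is modeled."""

    def __init__(self, ptr):
        self._ptr = ptr

    def data_ptr(self):
        return self._ptr


def alignment_mask(args):
    """Bit i is set when tensor argument i is 16-byte aligned."""
    mask = 0
    for i, arg in enumerate(args):
        if hasattr(arg, "data_ptr") and (arg.data_ptr() & 15) == 0:
            mask |= 1 << i
    return mask


class SpecLauncher:
    def __init__(self, compile_fn, slow_path, debug=False, disabled=None):
        if disabled is None:
            # Opt-out switch named in the PR description.
            disabled = os.environ.get("HELION_SKIP_FAST_LAUNCHER") == "1"
        self.compile_fn = compile_fn  # full pipeline, run once per spec
        self.slow_path = slow_path    # general launcher used as fallback
        self.debug = debug
        self.disabled = disabled
        self.binaries = {}            # spec key -> compiled binary

    def __call__(self, *args):
        if self.disabled:
            return self.slow_path(*args)
        key = (alignment_mask(args), self.debug)
        binary = self.binaries.get(key)
        if binary is None:
            try:
                binary = self.binaries[key] = self.compile_fn(key)
            except Exception:
                # Any compile failure disables the fast path for good.
                self.disabled = True
                return self.slow_path(*args)
        return binary(*args)


compiled = []


def compile_fn(key):
    compiled.append(key)
    return lambda *args: ("fast", key)


def slow_path(*args):
    return ("slow",)


def bad_compile(key):
    raise RuntimeError("compile failed")


launcher = SpecLauncher(compile_fn, slow_path, disabled=False)
aligned, unaligned = FakeTensor(0x1000), FakeTensor(0x1004)
first = launcher(aligned, aligned)
second = launcher(aligned, unaligned)
third = launcher(aligned, aligned)

broken = SpecLauncher(bad_compile, slow_path, disabled=False)
fallback = broken(aligned)
```

Because the aligned/aligned and aligned/unaligned calls produce different
keys, each gets its own entry, and the third call reuses the first entry
without recompiling.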

Verified: test_fast_launcher + test_misc (51 passed), and full runs of
test_torch_compile (244), test_examples (96), test_autotuner (122),
test_indexing, test_loops, test_grid, test_ref_eager, test_specialize,
test_config_api, test_cache earlier on this branch.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

  n_args | baseline | prev commit | this commit | total
  -------+----------+-------------+-------------+------
       2 | 24.14 us |    17.25 us |    13.63 us |  -44%
       8 | 33.05 us |    24.50 us |    18.99 us |  -43%
      16 | 43.56 us |    32.16 us |    25.73 us |  -41%
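
The total column appears to be the change from baseline to this commit,
rounded to a whole percent. That reading is an assumption, not something
the description states. A quick check of the arithmetic under that
assumption:

```python
# (baseline_us, this_commit_us) per n_args, copied from the table above.
rows = {2: (24.14, 13.63), 8: (33.05, 18.99), 16: (43.56, 25.73)}


def total_change(baseline_us, current_us):
    """Relative change versus baseline, as a whole percent."""
    return round((current_us - baseline_us) / baseline_us * 100)


totals = {n: total_change(base, cur) for n, (base, cur) in rows.items()}
```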

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2749, branch: yushangdi/stack/31


yushangdi force-pushed the yushangdi/stack/30 branch from 79b4c3b to c9e5d81 Compare June 10, 2026 23:39

yushangdi force-pushed the yushangdi/stack/31 branch from 1e8127a to 1f95d7d Compare June 10, 2026 23:39

This was referenced Jun 10, 2026

Cache CUDA device capability lookups on the kernel dispatch hot path #2746

Merged

Use plain tuple keys for the in-memory bound-kernel cache #2747

Closed

Add a SymInt-free tensor specialization key for exact torch.Tensor args #2748

Merged

meta-cla Bot added the CLA Signed label Jun 10, 2026

yushangdi changed the base branch from yushangdi/stack/30 to main June 10, 2026 23:43

yushangdi force-pushed the yushangdi/stack/31 branch from 1f95d7d to d9b27b0 Compare June 10, 2026 23:43

yushangdi changed the base branch from main to yushangdi/stack/30 June 10, 2026 23:43

yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 00:37

yushangdi force-pushed the yushangdi/stack/31 branch from d9b27b0 to 7091d83 Compare June 11, 2026 00:37

yushangdi force-pushed the yushangdi/stack/31 branch from 7091d83 to dc28a61 Compare June 11, 2026 00:37

yushangdi mentioned this pull request Jun 11, 2026

Move measure("Kernel.bind") off the cache-hit dispatch path #2751

Draft

yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 00:37

yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 00:45

yushangdi force-pushed the yushangdi/stack/31 branch from dc28a61 to 1b8f941 Compare June 11, 2026 00:45

yushangdi force-pushed the yushangdi/stack/31 branch from 1b8f941 to dc37fce Compare June 11, 2026 00:45

yushangdi mentioned this pull request Jun 11, 2026

Skip the measure("Kernel.bind") context manager when measurement is off #2752

Draft

yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 00:45

yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 01:15

yushangdi force-pushed the yushangdi/stack/31 branch from dc37fce to fdf25d2 Compare June 11, 2026 01:16

yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 01:16

yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 01:55

yushangdi force-pushed the yushangdi/stack/31 branch from fdf25d2 to 7ea9e8b Compare June 11, 2026 01:55

yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 01:55

yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 02:02

yushangdi force-pushed the yushangdi/stack/31 branch from 7ea9e8b to 0ce7eba Compare June 11, 2026 02:02

yushangdi changed the base branch from main to yushangdi/stack/30 June 11, 2026 02:02

yushangdi force-pushed the yushangdi/stack/30 branch from f9c2756 to 4a78314 Compare June 11, 2026 17:09

yushangdi force-pushed the yushangdi/stack/31 branch from 0ce7eba to 719414a Compare June 11, 2026 17:09

yushangdi changed the base branch from yushangdi/stack/30 to main June 11, 2026 17:52

yushangdi force-pushed the yushangdi/stack/31 branch from 719414a to 600e739 Compare June 11, 2026 17:52

yushangdi force-pushed the yushangdi/stack/31 branch from 600e739 to 43439b5 Compare June 11, 2026 17:52

yushangdi mentioned this pull request Jun 11, 2026

Add a SymInt-free tensor specialization key for exact torch.Tensor args #2759

Open

yushangdi changed the base branch from main to yushangdi/stack/37 June 11, 2026 17:52

yushangdi changed the base branch from yushangdi/stack/37 to main June 11, 2026 18:47

yushangdi force-pushed the yushangdi/stack/31 branch from 43439b5 to 35ccee7 Compare June 11, 2026 18:47

yushangdi force-pushed the yushangdi/stack/31 branch from 35ccee7 to aff901a Compare June 11, 2026 22:03

yushangdi changed the base branch from main to yushangdi/stack/37 June 11, 2026 22:03

yushangdi marked this pull request as ready for review June 11, 2026 22:11

yushangdi wants to merge 1 commit into yushangdi/stack/37 from yushangdi/stack/31
