Install a per-spec fast launcher that bypasses Triton's JITFunction.run#2749
Open
yushangdi wants to merge 1 commit into
Open
Install a per-spec fast launcher that bypasses Triton's JITFunction.run#2749yushangdi wants to merge 1 commit into
yushangdi wants to merge 1 commit into
Conversation
79b4c3b to
c9e5d81
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 10, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 15.53 us | 12.19 us | -49% 8 | 33.05 us | 22.12 us | 17.87 us | -46% 16 | 43.56 us | 30.62 us | 24.65 us | -43% Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
1e8127a to
1f95d7d
Compare
This was referenced Jun 10, 2026
yushangdi
added a commit
that referenced
this pull request
Jun 10, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 15.53 us | 12.19 us | -49% 8 | 33.05 us | 22.12 us | 17.87 us | -46% 16 | 43.56 us | 30.62 us | 24.65 us | -43% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
1f95d7d to
d9b27b0
Compare
d9b27b0 to
7091d83
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 15.53 us | 12.19 us | -49% 8 | 33.05 us | 22.12 us | 17.87 us | -46% 16 | 43.56 us | 30.62 us | 24.65 us | -43% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 15.53 us | 12.19 us | -49% 8 | 33.05 us | 22.12 us | 17.87 us | -46% 16 | 43.56 us | 30.62 us | 24.65 us | -43% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
7091d83 to
dc28a61
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 15.53 us | 12.19 us | -49% 8 | 33.05 us | 22.12 us | 17.87 us | -46% 16 | 43.56 us | 30.62 us | 24.65 us | -43% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
dc28a61 to
1b8f941
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 15.53 us | 12.19 us | -49% 8 | 33.05 us | 22.12 us | 17.87 us | -46% 16 | 43.56 us | 30.62 us | 24.65 us | -43% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
1b8f941 to
dc37fce
Compare
dc37fce to
fdf25d2
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 15.53 us | 12.19 us | -49% 8 | 33.05 us | 22.12 us | 17.87 us | -46% 16 | 43.56 us | 30.62 us | 24.65 us | -43% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
fdf25d2 to
7ea9e8b
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
7ea9e8b to
0ce7eba
Compare
f9c2756 to
4a78314
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
0ce7eba to
719414a
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
719414a to
600e739
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
600e739 to
43439b5
Compare
43439b5 to
35ccee7
Compare
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
yushangdi
added a commit
that referenced
this pull request
Jun 11, 2026
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
Every Helion kernel launch went through Triton's full JITFunction.run pipeline (~9.3us): per-call device + stream proxy resolution, the argument binder, compute_cache_key, kernel-cache lookup, the used_global_vals walk, launch_metadata construction (even with no profiler attached), and the kwargs-dict munging around all of it. For a Helion kernel almost all of that is redundant: BoundKernel has already specialized on dtype/shape/stride/device, so the only Triton-level specialization left at launch time is pointer alignment and binary-affecting knob state. This ports the _FastLauncher design from upstream PR #2565 (plus the set_config function-clone fix from PR #2635) onto main: * helion/runtime/_fast_launcher.py -- default_launcher moves here unchanged (still re-exported from helion.runtime), joined by _FastLauncher: a multi-spec launcher primed on first call. The hot path computes a tiny spec key inline -- an alignment bitmask over the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/ stages-hook knob state -- dict-looks-up the compiled binary for that spec, and jumps straight into Triton's C launcher (CompiledKernel.run). Spec misses compile through Triton's full pipeline once and are cached, so call sites alternating aligned/ unaligned tensors stay on the fast path for both. * BoundKernel.set_config clones the PyCodeCache'd host function (PyCodeCache keys on source hash, so two BoundKernels can share one function object) and re-points its _launcher kwdefault at a _FastLauncher with the config's num_warps/num_stages/etc. baked in. Explicit _launcher= callers (the autotune trial harness) override the kwdefault naturally. * TritonBackend.launcher_runtime_kwargs factors the runtime kwarg values out of launcher_keyword_args so codegen strings and the launcher closure share one source of truth. Safety / correctness guards, each pinned by a test in test/test_fast_launcher.py: * Alignment is part of the spec key, so an unaligned tensor after an aligned prime gets its own correctly-compiled binary -- never the vectorized aligned binary (which would fault), and never a clone (which would silently drop writes to output args). * used_global_vals snapshot per spec entry; any mutation falls back to JITFunction.run so Triton's own RuntimeError surfaces instead of silently launching a stale binary. * torch.compile tracing routes through default_launcher so Dynamo's triton_kernel_wrapper_mutation HOP rules apply. * Multi-device guard: a current-device change after priming falls back to Triton's per-device dispatch. * launch_enter/exit hooks are re-read per call (a profiler attached after priming still fires; launch_metadata is built only when a hook will consume it); pre_run_hooks fire inline; flipping knobs.runtime.debug lands on a new spec entry and recompiles. * Any priming/compile failure, and HELION_SKIP_FAST_LAUNCHER=1, fall back to default_launcher permanently. Verified: test_fast_launcher + test_misc (51 passed), and full runs of test_torch_compile (244), test_examples (96), test_autotuner (122), test_indexing, test_loops, test_grid, test_ref_eager, test_specialize, test_config_api, test_cache earlier on this branch. Benchmark (B200, end-to-end wall time per call, add-style kernel, N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters): ``` n_args | baseline | prev commit | this commit | total -------+----------+-------------+-------------+------ 2 | 24.14 us | 17.25 us | 13.63 us | -44% 8 | 33.05 us | 24.50 us | 18.99 us | -43% 16 | 43.56 us | 32.16 us | 25.73 us | -41% ``` Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> stack-info: PR: #2749, branch: yushangdi/stack/31
35ccee7 to
aff901a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked PRs:
Install a per-spec fast launcher that bypasses Triton's JITFunction.run
Every Helion kernel launch went through Triton's full JITFunction.run
pipeline (~9.3us): per-call device + stream proxy resolution, the
argument binder, compute_cache_key, kernel-cache lookup, the
used_global_vals walk, launch_metadata construction (even with no
profiler attached), and the kwargs-dict munging around all of it. For
a Helion kernel almost all of that is redundant: BoundKernel has
already specialized on dtype/shape/stride/device, so the only
Triton-level specialization left at launch time is pointer alignment
and binary-affecting knob state.
This ports the _FastLauncher design from upstream PR #2565 (plus the
set_config function-clone fix from PR #2635) onto main:
helion/runtime/_fast_launcher.py -- default_launcher moves here
unchanged (still re-exported from helion.runtime), joined by
_FastLauncher: a multi-spec launcher primed on first call. The hot
path computes a tiny spec key inline -- an alignment bitmask over
the tensor args (data_ptr() & 15) plus debug/instrumentation_mode/
stages-hook knob state -- dict-looks-up the compiled binary for
that spec, and jumps straight into Triton's C launcher
(CompiledKernel.run). Spec misses compile through Triton's full
pipeline once and are cached, so call sites alternating aligned/
unaligned tensors stay on the fast path for both.
BoundKernel.set_config clones the PyCodeCache'd host function
(PyCodeCache keys on source hash, so two BoundKernels can share one
function object) and re-points its _launcher kwdefault at a
_FastLauncher with the config's num_warps/num_stages/etc. baked in.
Explicit _launcher= callers (the autotune trial harness) override
the kwdefault naturally.
TritonBackend.launcher_runtime_kwargs factors the runtime kwarg
values out of launcher_keyword_args so codegen strings and the
launcher closure share one source of truth.
Safety / correctness guards, each pinned by a test in
test/test_fast_launcher.py:
aligned prime gets its own correctly-compiled binary -- never the
vectorized aligned binary (which would fault), and never a clone
(which would silently drop writes to output args).
to JITFunction.run so Triton's own RuntimeError surfaces instead of
silently launching a stale binary.
triton_kernel_wrapper_mutation HOP rules apply.
back to Triton's per-device dispatch.
after priming still fires; launch_metadata is built only when a
hook will consume it); pre_run_hooks fire inline; flipping
knobs.runtime.debug lands on a new spec entry and recompiles.
back to default_launcher permanently.
Verified: test_fast_launcher + test_misc (51 passed), and full runs of
test_torch_compile (244), test_examples (96), test_autotuner (122),
test_indexing, test_loops, test_grid, test_ref_eager, test_specialize,
test_config_api, test_cache earlier on this branch.
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com