Use plain tuple keys for the in-memory bound-kernel cache by yushangdi · Pull Request #2747 · pytorch/helion

yushangdi · 2026-06-10T23:39:12Z

Use plain tuple keys for the in-memory bound-kernel cache

Kernel.bind runs on every kernel call and the cache hit is the
steady state, so the per-call lookup key should be as cheap to build
as possible.

_get_bound_kernel_cache_key constructed a frozen-dataclass
BoundKernelInMemoryCacheKey on every call: a lazy
`from ..autotuner.base_cache import ...` import, a dataclass __init__,
two frozen-field object.__setattr__ overrides, and a generated
__hash__ that re-walks the fields. The in-memory _bound_kernels dict
only needs *some* hashable key, so the per-call path now uses the
equivalent plain (signature, extra_results) tuple. In isolation this
drops _get_bound_kernel_cache_key from ~0.93us to ~0.22us per call.

The dataclass form is still produced by _create_bound_kernel_cache_key
for the autotuner caches (LocalAutotuneCache / AOTAutotuneCache
subclass it into LooseAutotuneCacheKey); only the in-memory dict
switches to tuple keys. On the compile (cache-miss) path the dataclass
key is built once and unpacked into its (specialization_key,
extra_results) tuple form so the in-memory dict and the autotuner
caches stay keyed on the same value.
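
The miss-path handling described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the key class is a simplified stand-in, and `bind`, `bound_kernels`, and `autotune_cache` are hypothetical names, not Helion's actual code.

```python
import dataclasses
from collections.abc import Hashable


@dataclasses.dataclass(frozen=True)
class BoundKernelInMemoryCacheKey:
    # Simplified stand-in; the field names follow the PR description.
    specialization_key: tuple[Hashable, ...]
    extra_results: tuple[Hashable, ...]


bound_kernels: dict[tuple, str] = {}  # in-memory cache, keyed by plain tuples
autotune_cache: dict[BoundKernelInMemoryCacheKey, str] = {}  # hypothetical stand-in


def bind(signature: tuple[Hashable, ...], extra_results: tuple[Hashable, ...]) -> str:
    # Hit path (steady state): only a plain tuple is built.
    hit = bound_kernels.get((signature, extra_results))
    if hit is not None:
        return hit
    # Miss path: build the dataclass key once for the autotuner-side cache...
    key = BoundKernelInMemoryCacheKey(signature, extra_results)
    bound = f"compiled:{len(bound_kernels)}"  # placeholder for real compilation
    autotune_cache[key] = bound
    # ...and unpack its two fields, in order, for the in-memory dict.
    bound_kernels[(key.specialization_key, key.extra_results)] = bound
    return bound
```

The dataclass is constructed only when a kernel is compiled, so its cost is paid once per specialization rather than once per call.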

Also drops one extra-results tuple allocation when a kernel has no
hl.specialize() extras (the common case): the empty extra_fns list
short-circuits to a shared () literal instead of tuple([]).
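
As a rough sketch of that short-circuit (the helper name `extra_results` and the `_EMPTY` constant are hypothetical, and the real code may be structured differently):

```python
from collections.abc import Callable, Hashable, Sequence

_EMPTY: tuple[Hashable, ...] = ()  # shared empty tuple for the common case


def extra_results(
    extra_fns: Sequence[Callable[[tuple], Hashable]], args: tuple
) -> tuple[Hashable, ...]:
    # Common case: no hl.specialize()-style extras, so build nothing at all.
    if not extra_fns:
        return _EMPTY
    # Otherwise evaluate each extra function against the call arguments.
    return tuple(fn(args) for fn in extra_fns)
```

The saving is small, but it sits on a path that runs on every kernel call.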

Safety: cache-key contents are unchanged -- same signature tuple,
same extra results, same specialization axes (dtype, shape bucket,
device type+capability, ConstExpr values, key= fn, hl.specialize
extras); the tuple is just the dataclass's two fields in order, with
identical hash/equality. Verified by test_misc, test_config_api,
test_cache, test_specialize (132 passed), plus
dtype/shape/ConstExpr/specialize rebinding spot-checks.
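
The equality claim can be spot-checked with a small sketch like the one below. `Key` is a simplified stand-in for the real key class, and the sample signatures are made up for illustration.

```python
import dataclasses
from collections.abc import Hashable


@dataclasses.dataclass(frozen=True)
class Key:  # hypothetical stand-in for BoundKernelInMemoryCacheKey
    specialization_key: tuple[Hashable, ...]
    extra_results: tuple[Hashable, ...]


def tuple_form(key: Key) -> tuple:
    # The dataclass's two fields, in declaration order.
    return (key.specialization_key, key.extra_results)


def equality_agrees(pairs: list[tuple]) -> bool:
    # Two dataclass keys should be equal exactly when their tuple forms are.
    return all(
        (Key(*left) == Key(*right))
        == (tuple_form(Key(*left)) == tuple_form(Key(*right)))
        for left in pairs
        for right in pairs
    )


SIG_A = (("float32", 1), "cuda", (10, 0))
SIG_B = (("float16", 1), "cuda", (10, 0))
PAIRS = [(SIG_A, ()), (SIG_A, ("x",)), (SIG_B, ())]
```

One caveat worth keeping in mind: a dataclass instance never compares equal to a plain tuple, so a single dict should not mix the two key forms.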

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

  n_args | baseline | prev commit | this commit
  -------+----------+-------------+------------
       2 | 24.14 us |    18.78 us |    16.41 us
       8 | 33.05 us |    27.77 us |    24.78 us
      16 | 43.56 us |    36.90 us |    35.24 us

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stack-info: PR: #2747, branch: yushangdi/stack/29

oulgen

I assume the "win" here comes from getting rid of

helion/helion/autotuner/base_cache.py, line 104 in 68aa3ee:

    return hashlib.sha256(repr(self).encode("utf-8")).hexdigest()

and instead doing a non-stable comparison. You don't need to remove all the abstraction to get better perf.

yushangdi · 2026-06-11T01:28:18Z

@oulgen actually, the win is from dataclass construction: 245 ns vs. 28 ns for a tuple. Adding slots to the dataclass didn't help. See the benchmarks below.

The in-memory _bound_kernels dict uses the frozen dataclass's auto-generated __hash__/__eq__, never stable_hash().
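
A small way to see this, assuming a simplified stand-in key class with a call counter added purely for illustration:

```python
import dataclasses
import hashlib
from collections.abc import Hashable

calls = {"stable_hash": 0}  # illustration-only counter, not part of the real class


@dataclasses.dataclass(frozen=True)
class Key:  # hypothetical stand-in for BoundKernelInMemoryCacheKey
    specialization_key: tuple[Hashable, ...]
    extra_results: tuple[Hashable, ...]

    def stable_hash(self) -> str:
        calls["stable_hash"] += 1
        return hashlib.sha256(repr(self).encode("utf-8")).hexdigest()


# dict insertion and lookup go through __hash__ and __eq__ only.
cache = {Key(("float32", "cuda"), ()): "bound"}
found = cache.get(Key(("float32", "cuda"), ()))
```

After the lookup the counter is still zero: the dict never touches `stable_hash()`, so removing the SHA-256 from that method would not change the cost of the hit path.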

```
dev@gpu-dev-aba9482d:~/helion$ python /tmp/bench_cachekey.py
each timed over 2,000,000 iterations

full per-call hit path (construct key + dict lookup):
  OLD  dataclass key                            :   491.7 ns
  NEW  tuple key                                :    85.5 ns

key construction alone:
  BoundKernelInMemoryCacheKey(sig, extra)       :   243.7 ns
  (sig, extra)                                  :    28.5 ns

stable_hash() SHA-256 -- only on disk/remote caches, never in bind():
  k.stable_hash()                               :  1464.6 ns
```
dev@gpu-dev-aba9482d:~/helion$ cat /tmp/bench_cachekey.py

```python
"""Demonstrate that the bind() cache-key win comes from avoiding per-call
frozen-dataclass construction, NOT from dropping stable_hash()'s SHA-256.

Run: python /tmp/bench_cachekey.py
"""
import dataclasses
import hashlib
import timeit
from typing import Hashable

N = 2_000_000


# The real in-memory cache key (helion/autotuner/base_cache.py).
@dataclasses.dataclass(frozen=True)
class BoundKernelInMemoryCacheKey:
    specialization_key: tuple[Hashable, ...]
    extra_results: tuple[Hashable, ...]

    def stable_hash(self) -> str:  # only used by disk/remote caches, NOT bind()
        return hashlib.sha256(repr(self).encode("utf-8")).hexdigest()


# A representative bind() signature: dtype/shape/stride bits + device + caps.
sig = ((1, 2, 3), "cuda", (10, 0), 0, ("float32",))
extra: tuple[Hashable, ...] = ()


def bench(label: str, fn) -> None:
    ns = timeit.timeit(fn, number=N) / N * 1e9
    print(f"{label:48s}: {ns:7.1f} ns")


print(f"each timed over {N:,} iterations\n")

# ---- what bind() does on every cache HIT: build key + dict.get ----
d_dataclass = {BoundKernelInMemoryCacheKey(sig, extra): "bound"}
d_tuple = {(sig, extra): "bound"}

print("full per-call hit path (construct key + dict lookup):")
bench("  OLD  dataclass key", lambda: d_dataclass.get(BoundKernelInMemoryCacheKey(sig, extra)))
bench("  NEW  tuple key", lambda: d_tuple.get((sig, extra)))

# ---- isolate just the key construction (the actual win) ----
print("\nkey construction alone:")
bench("  BoundKernelInMemoryCacheKey(sig, extra)", lambda: BoundKernelInMemoryCacheKey(sig, extra))
bench("  (sig, extra)", lambda: (sig, extra))

# ---- prove stable_hash() is off the path (and would dwarf everything) ----
print("\nstable_hash() SHA-256 -- only on disk/remote caches, never in bind():")
k = BoundKernelInMemoryCacheKey(sig, extra)
bench("  k.stable_hash()", lambda: k.stable_hash())
```

oulgen

Hmm, this feels weird. Let's figure out an alternative way; I am not a fan of removing these key abstractions. It feels like an easy way to shoot ourselves in the foot later.

yushangdi force-pushed the yushangdi/stack/28 branch from e3eb97a to 43254bb Compare June 10, 2026 23:39

yushangdi force-pushed the yushangdi/stack/29 branch from f72bff3 to 7703235 Compare June 10, 2026 23:39

yushangdi mentioned this pull request Jun 10, 2026

Cache CUDA device capability lookups on the kernel dispatch hot path #2746

Merged

meta-cla Bot added the CLA Signed label Jun 10, 2026

This was referenced Jun 10, 2026

Add a SymInt-free tensor specialization key for exact torch.Tensor args #2748

Merged

Install a per-spec fast launcher that bypasses Triton's JITFunction.run #2749

Open

yushangdi changed the base branch from yushangdi/stack/28 to main June 10, 2026 23:43

yushangdi force-pushed the yushangdi/stack/29 branch from 7703235 to d739597 Compare June 10, 2026 23:43

yushangdi changed the base branch from main to yushangdi/stack/28 June 10, 2026 23:43

yushangdi changed the base branch from yushangdi/stack/28 to main June 11, 2026 00:37

yushangdi force-pushed the yushangdi/stack/29 branch from d739597 to 09aea3f Compare June 11, 2026 00:37

yushangdi changed the title ~~Use plain tuple cache keys and skip measure() on the bind hot path~~ Use plain tuple keys for the in-memory bound-kernel cache Jun 11, 2026

yushangdi changed the base branch from main to yushangdi/stack/28 June 11, 2026 00:37

yushangdi mentioned this pull request Jun 11, 2026

Move measure("Kernel.bind") off the cache-hit dispatch path #2751

Draft

yushangdi changed the base branch from yushangdi/stack/28 to main June 11, 2026 00:45

yushangdi force-pushed the yushangdi/stack/29 branch from 09aea3f to 27da975 Compare June 11, 2026 00:45

yushangdi changed the base branch from main to yushangdi/stack/28 June 11, 2026 00:45

yushangdi mentioned this pull request Jun 11, 2026

Skip the measure("Kernel.bind") context manager when measurement is off #2752

Draft

yushangdi requested a review from oulgen June 11, 2026 00:49

yushangdi marked this pull request as ready for review June 11, 2026 00:49

oulgen requested changes Jun 11, 2026

yushangdi marked this pull request as draft June 11, 2026 01:14

yushangdi changed the base branch from yushangdi/stack/28 to main June 11, 2026 01:14

yushangdi force-pushed the yushangdi/stack/29 branch from 27da975 to 4803740 Compare June 11, 2026 01:16

yushangdi changed the base branch from main to yushangdi/stack/28 June 11, 2026 01:16

yushangdi marked this pull request as ready for review June 11, 2026 01:28

yushangdi requested a review from oulgen June 11, 2026 01:29

oulgen requested changes Jun 11, 2026

yushangdi closed this Jun 11, 2026

yushangdi wants to merge 1 commit into yushangdi/stack/28 from yushangdi/stack/29