
Use plain tuple keys for the in-memory bound-kernel cache#2747

Closed
yushangdi wants to merge 1 commit into
yushangdi/stack/28from
yushangdi/stack/29

@yushangdi yushangdi commented Jun 10, 2026

Use plain tuple keys for the in-memory bound-kernel cache

Kernel.bind runs on every kernel call and the cache hit is the
steady state, so the per-call lookup key should be as cheap to build
as possible.

_get_bound_kernel_cache_key constructed a frozen-dataclass
BoundKernelInMemoryCacheKey on every call: a lazy
`from ..autotuner.base_cache import ...` import, a dataclass `__init__`,
two frozen-field `object.__setattr__` overrides, and a generated
`__hash__` that re-walks the fields. The in-memory _bound_kernels dict
only needs some hashable key, so the per-call path now uses the
equivalent plain (signature, extra_results) tuple. In isolation this
drops _get_bound_kernel_cache_key from ~0.93us to ~0.22us per call.
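
A minimal sketch of the hit path described above, assuming a heavily simplified cache. `TinyKernel`, its `bind` method, and `signature_of` are hypothetical stand-ins, not Helion's actual code; only the `(signature, extra_results)` tuple shape and the `_bound_kernels` name come from the description.

```python
from typing import Callable, Hashable


def signature_of(args: tuple[object, ...]) -> tuple[Hashable, ...]:
    """Hypothetical stand-in for the real specialization signature."""
    return tuple(type(a).__name__ for a in args)


class TinyKernel:
    """Illustrative only: caches one 'bound' object per tuple key."""

    def __init__(self, compile_fn: Callable[[tuple[Hashable, ...]], str]) -> None:
        self._compile_fn = compile_fn
        self._bound_kernels: dict[tuple[Hashable, ...], str] = {}
        self.compile_count = 0

    def bind(self, args: tuple[object, ...]) -> str:
        # Hot path: build a plain tuple key and do a single dict lookup.
        key = (signature_of(args), ())
        bound = self._bound_kernels.get(key)
        if bound is None:
            # Cold path: compile once, then store under the same tuple key.
            self.compile_count += 1
            bound = self._compile_fn(key)
            self._bound_kernels[key] = bound
        return bound


kernel = TinyKernel(lambda key: f"bound{key[0]}")
first = kernel.bind((1, 2.0))
second = kernel.bind((3, 4.0))   # same types -> same key -> cache hit
third = kernel.bind(("a", 2.0))  # different types -> new key -> compile
```

In this sketch the steady-state call allocates only the key tuple before the lookup, which is the property the change is after.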

The dataclass form is still produced by _create_bound_kernel_cache_key
for the autotuner caches (LocalAutotuneCache / AOTAutotuneCache
subclass it into LooseAutotuneCacheKey); only the in-memory dict
switches to tuple keys. On the compile (cache-miss) path the dataclass
key is built once and unpacked into its (specialization_key,
extra_results) tuple form so the in-memory dict and the autotuner
caches stay keyed on the same value.
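
The cache-miss path could look roughly like the sketch below. The dataclass mirrors the two fields named in the description; `as_in_memory_key` and the sample values are hypothetical, introduced only for illustration.

```python
import dataclasses
from typing import Hashable


@dataclasses.dataclass(frozen=True)
class BoundKernelInMemoryCacheKey:
    specialization_key: tuple[Hashable, ...]
    extra_results: tuple[Hashable, ...]


def as_in_memory_key(
    key: BoundKernelInMemoryCacheKey,
) -> tuple[tuple[Hashable, ...], tuple[Hashable, ...]]:
    """Hypothetical helper: unpack the dataclass into its two fields, in order."""
    return (key.specialization_key, key.extra_results)


# Cache miss: build the dataclass key once (the autotuner caches want this form)...
sig = (("float32", (4096,)), "cuda")
full_key = BoundKernelInMemoryCacheKey(sig, ())

# ...and store the bound kernel under the unpacked tuple.
bound_kernels = {as_in_memory_key(full_key): "bound-kernel"}

# Later cache hits rebuild only the cheap tuple and still find the entry.
hit = bound_kernels.get((sig, ()))
```

Unpacking field by field, rather than with `dataclasses.astuple`, avoids the recursive copy that `astuple` performs on nested values.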

Also drops one extra-results tuple allocation when a kernel has no
hl.specialize() extras (the common case): the empty extra_fns list
short-circuits to a shared () literal instead of tuple([]).

Safety: cache-key contents are unchanged -- same signature tuple,
same extra results, same specialization axes (dtype, shape bucket,
device type+capability, ConstExpr values, key= fn, hl.specialize
extras); the tuple is just the dataclass's two fields in order, with
identical hash/equality. Verified by test_misc, test_config_api,
test_cache, test_specialize (132 passed), plus
dtype/shape/ConstExpr/specialize rebinding spot-checks.
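
One way to spot-check the equivalence claim is to compare the two key forms over a small grid of field values. This is a sketch with made-up sample values, not the test suite referenced above.

```python
import dataclasses
import itertools
from typing import Hashable


@dataclasses.dataclass(frozen=True)
class BoundKernelInMemoryCacheKey:
    specialization_key: tuple[Hashable, ...]
    extra_results: tuple[Hashable, ...]


# Made-up sample field values covering a few distinct specializations.
signatures = [("float32", 4096), ("float16", 4096), ("float32", 8192)]
extras = [(), (True,), (False,)]
pairs = list(itertools.product(signatures, extras))

# Two keys should compare equal in one form exactly when they do in the other.
mismatches = [
    (a, b)
    for a, b in itertools.product(pairs, repeat=2)
    if (BoundKernelInMemoryCacheKey(*a) == BoundKernelInMemoryCacheKey(*b)) != (a == b)
]

# Both forms should therefore produce the same number of distinct dict entries.
dataclass_entries = {BoundKernelInMemoryCacheKey(*p): None for p in pairs}
tuple_entries = {p: None for p in pairs}
```

Note that a dataclass instance never compares equal to a bare tuple, so the two forms are interchangeable only as long as a given dict uses one form consistently.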

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):

  n_args | baseline | prev commit | this commit
  -------+----------+-------------+------------
       2 | 24.14 us |    18.78 us |    16.41 us
       8 | 33.05 us |    27.77 us |    24.78 us
      16 | 43.56 us |    36.90 us |    35.24 us
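
For reading the table, the relative savings can be computed directly from the listed numbers. The snippet only restates the table's values; `percent_saved` is a helper introduced here.

```python
# End-to-end microseconds per call, copied from the table above:
# n_args -> (baseline, prev commit, this commit)
timings_us = {
    2: (24.14, 18.78, 16.41),
    8: (33.05, 27.77, 24.78),
    16: (43.56, 36.90, 35.24),
}


def percent_saved(before: float, after: float) -> float:
    """Relative reduction in wall time, as a percentage."""
    return (before - after) / before * 100.0


for n_args, (baseline, prev, this) in timings_us.items():
    print(
        f"n_args={n_args:2d}: "
        f"{percent_saved(prev, this):5.1f}% vs prev commit, "
        f"{percent_saved(baseline, this):5.1f}% vs baseline"
    )
```

Against the previous commit this works out to roughly 12.6%, 10.8%, and 4.5% for 2, 8, and 16 arguments.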

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@yushangdi yushangdi force-pushed the yushangdi/stack/28 branch from e3eb97a to 43254bb Compare June 10, 2026 23:39
@yushangdi yushangdi force-pushed the yushangdi/stack/29 branch from f72bff3 to 7703235 Compare June 10, 2026 23:39
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 10, 2026
@yushangdi yushangdi changed the base branch from yushangdi/stack/28 to main June 10, 2026 23:43
@yushangdi yushangdi force-pushed the yushangdi/stack/29 branch from 7703235 to d739597 Compare June 10, 2026 23:43
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/28 June 10, 2026 23:43
@yushangdi yushangdi changed the base branch from yushangdi/stack/28 to main June 11, 2026 00:37
@yushangdi yushangdi force-pushed the yushangdi/stack/29 branch from d739597 to 09aea3f Compare June 11, 2026 00:37
@yushangdi yushangdi changed the title Use plain tuple cache keys and skip measure() on the bind hot path Use plain tuple keys for the in-memory bound-kernel cache Jun 11, 2026
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/28 June 11, 2026 00:37
@yushangdi yushangdi changed the base branch from yushangdi/stack/28 to main June 11, 2026 00:45
@yushangdi yushangdi force-pushed the yushangdi/stack/29 branch from 09aea3f to 27da975 Compare June 11, 2026 00:45
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/28 June 11, 2026 00:45
@yushangdi yushangdi requested a review from oulgen June 11, 2026 00:49
@yushangdi yushangdi marked this pull request as ready for review June 11, 2026 00:49

@oulgen oulgen left a comment

I assume the "win" here comes from getting rid of

`return hashlib.sha256(repr(self).encode("utf-8")).hexdigest()`

and instead doing a non-stable comparison. You don't need to remove all the abstraction to get better perf.

@yushangdi yushangdi marked this pull request as draft June 11, 2026 01:14
@yushangdi yushangdi changed the base branch from yushangdi/stack/28 to main June 11, 2026 01:14
@yushangdi yushangdi force-pushed the yushangdi/stack/29 branch from 27da975 to 4803740 Compare June 11, 2026 01:16
@yushangdi yushangdi changed the base branch from main to yushangdi/stack/28 June 11, 2026 01:16
@yushangdi

> I assume the "win" here comes from getting rid of
>
> `return hashlib.sha256(repr(self).encode("utf-8")).hexdigest()`
>
> and instead doing a non-stable comparison. You don't need to remove all the abstraction to get better perf.

@oulgen Actually, the win comes from avoiding dataclass construction: 245 ns for the dataclass vs 28 ns for a tuple. Adding slots to the dataclass didn't help. See the benchmarks below.

The in-memory _bound_kernels dict uses the frozen dataclass's auto-generated `__hash__`/`__eq__`, never `stable_hash`.

dev@gpu-dev-aba9482d:~/helion$ python /tmp/bench_cachekey.py
each timed over 2,000,000 iterations

full per-call hit path (construct key + dict lookup):
  OLD  dataclass key                            :   491.7 ns
  NEW  tuple key                                :    85.5 ns

key construction alone:
  BoundKernelInMemoryCacheKey(sig, extra)       :   243.7 ns
  (sig, extra)                                  :    28.5 ns

stable_hash() SHA-256 -- only on disk/remote caches, never in bind():
  k.stable_hash()                               :  1464.6 ns
dev@gpu-dev-aba9482d:~/helion$ cat /tmp/bench_cachekey.py
"""Demonstrate that the bind() cache-key win comes from avoiding per-call
frozen-dataclass construction, NOT from dropping stable_hash()'s SHA-256.

Run: python /tmp/bench_cachekey.py
"""
import dataclasses
import hashlib
import timeit
from typing import Hashable

N = 2_000_000


# The real in-memory cache key (helion/autotuner/base_cache.py).
@dataclasses.dataclass(frozen=True)
class BoundKernelInMemoryCacheKey:
    specialization_key: tuple[Hashable, ...]
    extra_results: tuple[Hashable, ...]

    def stable_hash(self) -> str:  # only used by disk/remote caches, NOT bind()
        return hashlib.sha256(repr(self).encode("utf-8")).hexdigest()


# A representative bind() signature: dtype/shape/stride bits + device + caps.
sig = ((1, 2, 3), "cuda", (10, 0), 0, ("float32",))
extra: tuple[Hashable, ...] = ()


def bench(label: str, fn) -> None:
    ns = timeit.timeit(fn, number=N) / N * 1e9
    print(f"{label:48s}: {ns:7.1f} ns")


print(f"each timed over {N:,} iterations\n")

# ---- what bind() does on every cache HIT: build key + dict.get ----
d_dataclass = {BoundKernelInMemoryCacheKey(sig, extra): "bound"}
d_tuple = {(sig, extra): "bound"}

print("full per-call hit path (construct key + dict lookup):")
bench("  OLD  dataclass key", lambda: d_dataclass.get(BoundKernelInMemoryCacheKey(sig, extra)))
bench("  NEW  tuple key", lambda: d_tuple.get((sig, extra)))

# ---- isolate just the key construction (the actual win) ----
print("\nkey construction alone:")
bench("  BoundKernelInMemoryCacheKey(sig, extra)", lambda: BoundKernelInMemoryCacheKey(sig, extra))
bench("  (sig, extra)", lambda: (sig, extra))

# ---- prove stable_hash() is off the path (and would dwarf everything) ----
print("\nstable_hash() SHA-256 -- only on disk/remote caches, never in bind():")
k = BoundKernelInMemoryCacheKey(sig, extra)
bench("  k.stable_hash()", lambda: k.stable_hash())

@yushangdi yushangdi marked this pull request as ready for review June 11, 2026 01:28
@yushangdi yushangdi requested a review from oulgen June 11, 2026 01:29

@oulgen oulgen left a comment

Hmm, this feels weird. Let's figure out an alternative way; I am not a fan of removing these key abstractions. It feels like an easy way to shoot ourselves in the foot later.

@yushangdi yushangdi closed this Jun 11, 2026
yushangdi added a commit that referenced this pull request Jun 11, 2026
yushangdi added a commit that referenced this pull request Jun 11, 2026
yushangdi added a commit that referenced this pull request Jun 11, 2026