Use plain tuple keys for the in-memory bound-kernel cache #2747
yushangdi wants to merge 1 commit into
oulgen
left a comment
I assume the "win" here comes from getting rid of helion/helion/autotuner/base_cache.py, line 104 in 68aa3ee
Kernel.bind runs on every kernel call and the cache *hit* is the
steady state, so the per-call lookup key should be as cheap to build
as possible.
_get_bound_kernel_cache_key constructed a frozen-dataclass
BoundKernelInMemoryCacheKey on every call: a lazy
`from ..autotuner.base_cache import ...` import, a dataclass __init__,
two frozen-field object.__setattr__ overrides, and a generated
__hash__ that re-walks the fields. The in-memory _bound_kernels dict
only needs *some* hashable key, so the per-call path now uses the
equivalent plain (signature, extra_results) tuple. In isolation this
drops _get_bound_kernel_cache_key from ~0.93us to ~0.22us per call.
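
For illustration, the construction-cost gap described above can be reproduced with a small micro-benchmark. This is only a sketch: `InMemoryKey` is a hypothetical stand-in for BoundKernelInMemoryCacheKey, the signature contents are made up, and the absolute numbers will differ from the ones quoted here depending on machine and Python version.

```python
import timeit
from dataclasses import dataclass


@dataclass(frozen=True)
class InMemoryKey:  # hypothetical stand-in, not the real Helion class
    specialization_key: tuple
    extra_results: tuple


# Made-up signature contents; the real key holds dtype/shape/device info.
signature = (("float32", 2), ("float32", 2))
extras = ()


def dataclass_key():
    # Runs the generated __init__, which uses object.__setattr__ per field.
    return InMemoryKey(signature, extras)


def tuple_key():
    # A single tuple allocation, no Python-level __init__.
    return (signature, extras)


n = 200_000
t_dc = timeit.timeit(dataclass_key, number=n) / n
t_tuple = timeit.timeit(tuple_key, number=n) / n
print(f"dataclass: {t_dc * 1e9:.0f} ns/call, tuple: {t_tuple * 1e9:.0f} ns/call")
```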
The dataclass form is still produced by _create_bound_kernel_cache_key
for the autotuner caches (LocalAutotuneCache / AOTAutotuneCache
subclass it into LooseAutotuneCacheKey); only the in-memory dict
switches to tuple keys. On the compile (cache-miss) path the dataclass
key is built once and unpacked into its (specialization_key,
extra_results) tuple form so the in-memory dict and the autotuner
caches stay keyed on the same value.
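
A minimal sketch of the hit/miss split described above, assuming a toy kernel class. `ToyKernel`, `_fast_key`, `_create_key`, and the string stored as the "bound kernel" are all hypothetical; only the idea of building the dataclass key once on a miss and storing under its tuple form comes from this commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InMemoryKey:  # hypothetical stand-in for the dataclass key
    specialization_key: tuple
    extra_results: tuple


class ToyKernel:
    def __init__(self):
        self._bound_kernels = {}  # keyed on plain tuples
        self.compile_count = 0

    def _fast_key(self, signature, extras):
        # Hot path: plain tuple, no dataclass construction.
        return (signature, extras)

    def _create_key(self, signature, extras):
        # Miss path: full dataclass key, as the autotuner caches would use.
        return InMemoryKey(signature, extras)

    def bind(self, signature, extras=()):
        bound = self._bound_kernels.get(self._fast_key(signature, extras))
        if bound is None:
            full_key = self._create_key(signature, extras)
            self.compile_count += 1
            bound = f"compiled for {full_key.specialization_key}"
            # Unpack the dataclass into its two fields so both caches
            # stay keyed on the same value.
            tuple_key = (full_key.specialization_key, full_key.extra_results)
            self._bound_kernels[tuple_key] = bound
        return bound
```

Calling `bind` twice with the same signature compiles once; the second call is served from the dict through the tuple key alone.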
Also drops one extra-results tuple allocation when a kernel has no
hl.specialize() extras (the common case): the empty extra_fns list
short-circuits to a shared () literal instead of tuple([]).
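
The short-circuit might look roughly like the following. `collect_extra_results` and its arguments are hypothetical names used for illustration, not the actual Helion helper.

```python
def collect_extra_results(extra_fns, args):
    """Return the results of the extra specialization functions as a tuple."""
    if not extra_fns:
        # Common case: no extras, so skip the list build and tuple() call.
        return ()
    return tuple([fn(*args) for fn in extra_fns])
```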
Safety: cache-key *contents* are unchanged -- same signature tuple,
same extra results, same specialization axes (dtype, shape bucket,
device type+capability, ConstExpr values, key= fn, hl.specialize
extras); the tuple is just the dataclass's two fields in order, with
identical hash/equality. Verified by test_misc, test_config_api,
test_cache, test_specialize (132 passed), plus
dtype/shape/ConstExpr/specialize rebinding spot-checks.
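
As a rough sanity check of the equivalence claim, one can compare how the two key forms distinguish a small set of inputs. This sketch uses a hypothetical `InMemoryKey` dataclass and made-up signature values; it is not one of the tests listed above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class InMemoryKey:  # hypothetical stand-in for the dataclass key
    specialization_key: tuple
    extra_results: tuple


# Made-up axes: dtype, rank, and one specialize-style extra value.
signatures = [(("float32", 2),), (("float16", 2),), (("float32", 3),)]
extras_options = [(), (4096,)]
cases = list(product(signatures, extras_options))

# Any two inputs compare equal as tuples exactly when they compare
# equal as dataclass keys.
for (s1, e1), (s2, e2) in product(cases, repeat=2):
    dc_equal = InMemoryKey(s1, e1) == InMemoryKey(s2, e2)
    tuple_equal = (s1, e1) == (s2, e2)
    assert dc_equal == tuple_equal

# Both dict flavors therefore end up with the same number of entries.
dc_cache = {InMemoryKey(s, e): i for i, (s, e) in enumerate(cases)}
tuple_cache = {(s, e): i for i, (s, e) in enumerate(cases)}
```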
Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
```
n_args | baseline | prev commit | this commit
-------+----------+-------------+------------
2 | 24.14 us | 18.78 us | 16.41 us
8 | 33.05 us | 27.77 us | 24.78 us
16 | 43.56 us | 36.90 us | 35.24 us
```
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stack-info: PR: #2747, branch: yushangdi/stack/29
@oulgen actually, the win is from dataclass construction: 245 ns vs. 28 ns for a tuple. Adding slots to the dataclass didn't help. See benchmarks below. The in-memory |
oulgen
left a comment
Hmm, this feels weird. Let's figure out an alternative way; I am not a fan of removing these key abstractions. It feels like an easy way to shoot ourselves in the foot later.