Commit 09aea3f
Use plain tuple keys for the in-memory bound-kernel cache

Kernel.bind runs on every kernel call and the cache *hit* is the
steady state, so the per-call lookup key should be as cheap to build
as possible.

_get_bound_kernel_cache_key constructed a frozen-dataclass
BoundKernelInMemoryCacheKey on every call: a lazy
`from ..autotuner.base_cache import ...` import, a dataclass __init__,
two frozen-field object.__setattr__ overrides, and a generated
__hash__ that re-walks the fields. The in-memory _bound_kernels dict
only needs *some* hashable key, so the per-call path now uses the
equivalent plain (signature, extra_results) tuple. In isolation this
drops _get_bound_kernel_cache_key from ~0.93us to ~0.22us per call.
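
As a rough illustration of the two key shapes, here is a minimal sketch.
DataclassKey, make_dataclass_key, make_tuple_key and time_key_builders are
hypothetical stand-ins, not the actual Helion code, and the signature values
are made up. Timings from it will differ from the numbers quoted above.

```python
import timeit
from dataclasses import dataclass
from typing import Hashable, Tuple


@dataclass(frozen=True)
class DataclassKey:
    """Hypothetical stand-in for BoundKernelInMemoryCacheKey."""

    specialization_key: Tuple[Hashable, ...]
    extra_results: Tuple[Hashable, ...]


def make_dataclass_key(signature, extra_results):
    # Old per-call path: dataclass __init__ plus frozen-field setattr calls.
    return DataclassKey(signature, extra_results)


def make_tuple_key(signature, extra_results):
    # New per-call path: one two-element tuple allocation.
    return (signature, extra_results)


# Made-up specialization values, for illustration only.
signature = ("float32", 2, "cuda")
extras = ()

# The in-memory dict only needs a hashable key, so the tuple is enough.
cache = {make_tuple_key(signature, extras): "bound-kernel"}
assert cache[make_tuple_key(signature, extras)] == "bound-kernel"


def time_key_builders(number=200_000):
    """Return (dataclass_seconds, tuple_seconds) for `number` key builds."""
    t_dc = timeit.timeit(lambda: make_dataclass_key(signature, extras), number=number)
    t_tp = timeit.timeit(lambda: make_tuple_key(signature, extras), number=number)
    return t_dc, t_tp
```

Calling time_key_builders() gives a machine-local comparison of the two
builders; it does not include the lazy import cost mentioned above.
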
The dataclass form is still produced by _create_bound_kernel_cache_key
for the autotuner caches (LocalAutotuneCache / AOTAutotuneCache
subclass it into LooseAutotuneCacheKey); only the in-memory dict
switches to tuple keys. On the compile (cache-miss) path the dataclass
key is built once and unpacked into its (specialization_key,
extra_results) tuple form so the in-memory dict and the autotuner
caches stay keyed on the same value.
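
The hit/miss split described above can be sketched as follows. This is an
assumption-laden outline, not the real Kernel.bind: FullKey, fake_compile,
bound_kernels and compile_calls are hypothetical names, and the compile step
is a placeholder.

```python
from dataclasses import dataclass
from typing import Hashable, Tuple


@dataclass(frozen=True)
class FullKey:
    """Hypothetical stand-in for the dataclass key used by the autotuner caches."""

    specialization_key: Tuple[Hashable, ...]
    extra_results: Tuple[Hashable, ...]


bound_kernels = {}  # in-memory dict, keyed on plain tuples
compile_calls = []  # records each simulated compile


def fake_compile(full_key):
    # Placeholder for the expensive compile/autotune step.
    compile_calls.append(full_key)
    return ("compiled", full_key)


def bind(signature, extra_results):
    # Hit path: a plain tuple lookup, no dataclass construction.
    hit = bound_kernels.get((signature, extra_results))
    if hit is not None:
        return hit
    # Miss path: build the dataclass once for the autotuner side, then
    # store the result under the tuple made from its two fields.
    full_key = FullKey(signature, extra_results)
    bound = fake_compile(full_key)
    bound_kernels[(full_key.specialization_key, full_key.extra_results)] = bound
    return bound
```

Because the stored tuple is derived from the dataclass fields, a later lookup
with (signature, extra_results) finds the same entry.
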
Also drops one extra-results tuple allocation when a kernel has no
hl.specialize() extras (the common case): the empty extra_fns list
short-circuits to a shared () literal instead of tuple([]).
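
A minimal sketch of that short-circuit, assuming each extra function takes
the call arguments and returns a hashable value. compute_extra_results is a
hypothetical helper name, not the function in the patch.

```python
def compute_extra_results(extra_fns, args):
    """Return the tuple of extra specialization results for one call."""
    if not extra_fns:
        # Common case: no extras, so return the empty tuple literal
        # without building an intermediate list or generator.
        return ()
    return tuple(fn(*args) for fn in extra_fns)


# No extras registered: the result is the empty tuple.
no_extras = compute_extra_results([], (4096, "float32"))

# One made-up extra that specializes on whether the size is a multiple of 16.
with_extras = compute_extra_results([lambda n, dtype: n % 16 == 0], (4096, "float32"))
```
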
Safety: cache-key *contents* are unchanged -- same signature tuple,
same extra results, same specialization axes (dtype, shape bucket,
device type+capability, ConstExpr values, key= fn, hl.specialize
extras); the tuple is just the dataclass's two fields in order, with
identical hash/equality. Verified by test_misc, test_config_api,
test_cache, test_specialize (132 passed), plus
dtype/shape/ConstExpr/specialize rebinding spot-checks.

Benchmark (B200, end-to-end wall time per call, add-style kernel,
N=4096 fp32, HELION_AUTOTUNE_EFFORT=none, steady state, 20k iters):
```
n_args | baseline | prev commit | this commit
-------+----------+-------------+------------
     2 | 24.14 us |    18.78 us |    16.41 us
     8 | 33.05 us |    27.77 us |    24.78 us
    16 | 43.56 us |    36.90 us |    35.24 us
```
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stack-info: PR: #2747, branch: yushangdi/stack/291 parent 26cd6c6 commit 09aea3f
1 file changed
Lines changed: 13 additions & 7 deletions