
[cute] Persist autotune winner from memory instead of recompiling #2768

Open
fulvius31 wants to merge 2 commits into pytorch:main from fulvius31:cute-cache


@fulvius31

Collaborator

The on-disk cache key deliberately excludes CUTE_DSL_CACHE_DIR, so the winner's still-in-memory ir_module can be re-persisted directly into the real cache dir under the exact filename a later process looks up.
Replaces finalize's evict-and-recompile: saves one cute.compile per autotune, keeps the winner hot in memory, and persists the artifact even when the tuning process never launches the kernel afterward.
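To make the description concrete, here is a minimal sketch of the finalize step it describes. It is not the PR's code: Launcher, persist_compiled, and finalize below are hypothetical stand-ins, the "IR module" is just a string, and the <cache_key>.mlir filename scheme is an assumption. It shows the two properties the description relies on: the target directory is read from CUTE_DSL_CACHE_DIR at call time, and every launcher is written out from memory without invoking a compiler.

```python
import os
import tempfile
from pathlib import Path


class Launcher:
    """Hypothetical stand-in for a compiled launcher (not the PR's class)."""

    def __init__(self, cache_key: str, ir_module: str) -> None:
        self._cache_key = cache_key
        self.ir_module = ir_module  # the still-in-memory compiled module

    def persist_compiled(self) -> Path:
        # The directory is read at call time, so the artifact lands wherever
        # CUTE_DSL_CACHE_DIR points when finalize runs.
        cache_dir = Path(os.environ["CUTE_DSL_CACHE_DIR"])
        cache_dir.mkdir(parents=True, exist_ok=True)
        path = cache_dir / f"{self._cache_key}.mlir"
        path.write_text(self.ir_module)
        return path


def finalize(launchers: dict[str, Launcher], real_dir: str) -> list[Path]:
    # Honour a user-provided dir; otherwise fall back to the real one.
    os.environ.setdefault("CUTE_DSL_CACHE_DIR", real_dir)
    # Re-persist from memory: no eviction, no second compile.
    return [launcher.persist_compiled() for launcher in launchers.values()]


# Usage: persist a single autotune winner into a fresh directory.
os.environ.pop("CUTE_DSL_CACHE_DIR", None)
real_dir = tempfile.mkdtemp()
launchers = {"winner": Launcher("abc123", "module {}")}
paths = finalize(launchers, real_dir)
```

The launchers dict is left untouched, which mirrors the "keeps the winner hot in memory" point.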

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Jun 11, 2026
fulvius31 marked this pull request as draft on June 11, 2026 21:24
@fulvius31

Collaborator Author

@claude review these changes

@claude

claude Bot commented Jun 11, 2026


Claude finished @fulvius31's task in 2m 37s.


Review: Persist autotune winner from memory instead of recompiling

  • Read the PR diff
  • Analyze finalize_ephemeral_cache change in backend.py
  • Analyze persist_compiled in runtime/__init__.py
  • Review the new test

Overall this is a clean, well-reasoned change. Replacing the evict-and-recompile dance with a direct re-persist from memory is the right move: it saves one cute.compile per autotune, keeps the winner hot, and (the genuinely nice part) guarantees the artifact exists even if the tuning process never launches the kernel again. The docstrings carry the reasoning well, and the env-var handoff between the ephemeral context and finalize_ephemeral_cache is correct.

Correctness: the env-var handoff checks out

The flow is sound:

  • _ephemeral_cute_cache restores (or pops) CUTE_DSL_CACHE_DIR on exit (backend.py:3304-3308).
  • finalize_ephemeral_cache then calls setup_compile_cache_dir(device_index) (backend.py:3347), which sets the var only if it is absent, so a user-provided dir is honored and the real per-device dir is used otherwise.
  • persist_compiled (via _cache_file_paths) reads the var at call time (runtime/__init__.py:2542), so the artifact lands in the restored/real dir. The device index defaults to 0, which matches setup_compile_cache_dir and the test's /0 subdir.
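The handoff in the three bullets above can be sketched with plain os.environ calls. This is an illustration under stated assumptions, not the backend code: ephemeral_cache, setup_cache_dir, and current_cache_dir are hypothetical analogues of _ephemeral_cute_cache, setup_compile_cache_dir, and the call-time read in persist_compiled, and the <base>/<device_index> layout is inferred from the test's /0 subdir.

```python
import contextlib
import os
import tempfile
from collections.abc import Iterator

ENV = "CUTE_DSL_CACHE_DIR"


@contextlib.contextmanager
def ephemeral_cache() -> Iterator[str]:
    """Hypothetical analogue of _ephemeral_cute_cache: point the cache at a
    throwaway dir during autotuning, then restore (or pop) the var on exit."""
    previous = os.environ.get(ENV)
    with tempfile.TemporaryDirectory() as tmp:
        os.environ[ENV] = tmp
        try:
            yield tmp
        finally:
            if previous is None:
                os.environ.pop(ENV, None)
            else:
                os.environ[ENV] = previous


def setup_cache_dir(base: str, device_index: int = 0) -> str:
    """Hypothetical analogue of setup_compile_cache_dir: set the var only if
    absent, so a user-provided dir wins over the per-device default."""
    os.environ.setdefault(ENV, os.path.join(base, str(device_index)))
    return os.environ[ENV]


def current_cache_dir() -> str:
    # Read at call time, not captured earlier, as the review describes.
    return os.environ[ENV]
```

Because the read happens at call time, whatever finalize leaves in the environment after the context exits is the directory the artifact goes to.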

The central correctness assumption is that the on-disk key excludes CUTE_DSL_CACHE_DIR, so an artifact compiled under the ephemeral dir is byte-identical to the one a later process expects. That is the linchpin. It is documented (backend.py:3316-3318), but it stays safe only as long as nothing in the compiled IR encodes the cache path. Worth keeping in mind if CuTe's caching ever changes.
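A toy model of why the re-persist is safe: if the filename is derived only from the IR and compile options, switching CUTE_DSL_CACHE_DIR changes the directory but not the filename. The SHA-256 key below is a hypothetical scheme, not CuTe's actual cache key; it only illustrates the property the review calls the linchpin. If the IR text itself embedded the cache path, the two filenames would diverge.

```python
import hashlib
import os


def cache_key(ir_text: str, compile_options: str) -> str:
    """Hypothetical key: hashes only the IR and the compile options.
    CUTE_DSL_CACHE_DIR deliberately does not feed into it."""
    h = hashlib.sha256()
    h.update(ir_text.encode())
    h.update(b"\0")  # separator so ("ab", "c") and ("a", "bc") differ
    h.update(compile_options.encode())
    return h.hexdigest()


def artifact_path(ir_text: str, compile_options: str) -> str:
    # Only the directory depends on the env var; the filename does not.
    return os.path.join(
        os.environ["CUTE_DSL_CACHE_DIR"],
        cache_key(ir_text, compile_options) + ".mlir",
    )
```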

Main feedback: the test's "no recompile" assertion is tautological

```python
from unittest.mock import patch

with patch.object(cute, "compile", side_effect=AssertionError("winner should not recompile")):
    result = kernel(*args_a)
```

This runs in the same process right after autotune. The winning launcher's _compiled is already populated in memory (set during autotuning in __call__, runtime/__init__.py:2525), and the new finalize deliberately keeps launchers intact. So kernel(*args_a) returns the in-memory _compiled and never reaches _reload_from_disk or cute.compile, regardless of whether the disk persist succeeded. The assertion passes trivially and does not exercise the disk-reload path it appears to be testing.
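The tautology is easy to reproduce outside the real code base. In this sketch, cute is a SimpleNamespace stand-in and Launcher is a hypothetical class that, like the launcher described above, returns its in-memory _compiled when it has one. The patched launch succeeds even though nothing was ever persisted.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for the cute module; only compile() is modelled.
cute = SimpleNamespace(compile=lambda ir: f"compiled({ir})")


class Launcher:
    """Hypothetical launcher that keeps its compiled kernel in memory."""

    def __init__(self, ir: str) -> None:
        self.ir = ir
        self._compiled = None

    def __call__(self) -> str:
        if self._compiled is None:
            # Only a cold launcher gets here (disk reload elided, then compile).
            self._compiled = cute.compile(self.ir)
        return self._compiled


winner = Launcher("module {}")
winner()  # autotuning already populated _compiled in memory

# Nothing was ever written to disk, yet the guarded launch passes:
# the warm launcher returns _compiled and never touches cute.compile.
with patch.object(cute, "compile", side_effect=AssertionError("winner should not recompile")):
    result = winner()
```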

The valuable half of the test is solid: artifact.is_file(), sidecar.is_file(), and entries == 2 genuinely verify that the winner (and only the winner) is persisted to the real dir at finalize time. To make the no-recompile claim meaningful, force the disk-reload path, e.g. clear the in-memory launchers (or bind a fresh kernel against the same cache dir) before the guarded launch, so that a successful call must have come from the persisted artifact.
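One way to follow that suggestion, sketched with the same kind of hypothetical stand-ins (Launcher, guarded_fresh_launch, and the <key>.mlir layout are assumptions, not the PR's API): persist the warm winner, then launch a fresh launcher that has no in-memory _compiled under the compile guard. The guarded call can now succeed only by reading the artifact from disk, and it raises if the artifact is missing.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for the cute module.
cute = SimpleNamespace(compile=lambda ir: f"compiled({ir})")


class Launcher:
    """Hypothetical launcher: memory first, then disk, then cute.compile."""

    def __init__(self, key: str, ir: str, cache_dir: Path) -> None:
        self.key, self.ir, self.cache_dir = key, ir, cache_dir
        self._compiled = None

    def _reload_from_disk(self):
        path = self.cache_dir / f"{self.key}.mlir"
        return path.read_text() if path.is_file() else None

    def __call__(self) -> str:
        if self._compiled is None:
            self._compiled = self._reload_from_disk() or cute.compile(self.ir)
        return self._compiled

    def persist(self) -> None:
        (self.cache_dir / f"{self.key}.mlir").write_text(self._compiled)


def guarded_fresh_launch(key: str, ir: str, cache_dir: Path) -> str:
    # A fresh launcher has no in-memory _compiled, so a successful call
    # under the patch proves the persisted artifact was used.
    fresh = Launcher(key, ir, cache_dir)
    with patch.object(cute, "compile", side_effect=AssertionError("winner should not recompile")):
        return fresh()


cache_dir = Path(tempfile.mkdtemp())
winner = Launcher("abc123", "module {}", cache_dir)
winner()          # autotune: compile once, keep it in memory
winner.persist()  # finalize: write the in-memory artifact to disk
print(guarded_fresh_launch("abc123", "module {}", cache_dir))  # compiled(module {})
```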

Minor

  • Silent persist failure. persist_compiled's call to _persist_to_disk is wrapped in suppress(Exception) (runtime/__init__.py:2548) with no logging. If the finalize persist fails, the winner silently isn't cached, and a later process recompiles with no diagnostic. This is pre-existing behavior for the launch path, but finalize is now the only place the artifact gets written, so a log.debug on failure would make a missed cache observable. Not blocking.

  • entries == 2 is slightly brittle: it assumes exactly the .mlir + .json pair and nothing else (dotfiles are already filtered). Fine for now, but it will need a touch-up if CuTe ever emits an extra sidecar.

  • launchers.values() loop persists every launcher on the winner's module. Correct (each distinct shape persists under its own _cache_key), just noting it's intentional and not winner-launcher-only.
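For the silent-failure point, a minimal sketch of the suggested change, assuming the persist step can be wrapped as a zero-argument callable (persist_with_debug_log is a hypothetical helper, not an existing function): the failure stays non-fatal, as with suppress(Exception), but it is logged at debug level with the traceback.

```python
import logging
from typing import Callable

log = logging.getLogger(__name__)


def persist_with_debug_log(persist: Callable[[], None], cache_key: str) -> bool:
    """Hypothetical helper: still non-fatal, but a missed cache is observable."""
    try:
        persist()
    except Exception:
        # exc_info=True attaches the traceback to the debug record.
        log.debug("failed to persist CuTe artifact for %s", cache_key, exc_info=True)
        return False
    return True
```

Returning a bool is an extra assumption here; it only makes the outcome easy to assert in a test.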

Nice cleanup overall. The only thing I'd genuinely push on before merge is tightening the test so the reload path is actually covered.

fulvius31 marked this pull request as ready for review on June 11, 2026 22:32
fulvius31 requested review from jansel and oulgen on June 11, 2026 22:32