Collect kernel artifacts and append-mode autotune telemetry with run_id #2737

Draft
IshanAryendu wants to merge 4 commits into pytorch:main from IshanAryendu:collect-kernel-artifacts

IshanAryendu (Contributor) commented Jun 10, 2026

Collect kernel artifacts from real autotuning runs in CI

This PR turns Helion's autotune telemetry into a reliable, joinable dataset that can be collected from CI and used to build a cost-model / kernel-artifact corpus. It builds on the existing kernel_id / sample_id / per-config CSV and identity sidecar.

Changes:

  1. Convert the autotune-log sink to append mode, so that all kernels and input shapes of a CI job are captured in a single file instead of each run overwriting the previous one.
  2. Change the sidecar from a single truncating .meta.json to an appended .meta.jsonl that collects one record per autotune run.
  3. Add run_id, a content-derived foreign key that uniquely identifies one autotune invocation. It is stamped on every CSV row and meta record, so per-config measurements can be attributed to the exact shape/dtype/hardware they were measured on.
  4. Record the decorator string (@helion.kernel(config=…)) as a structured CSV column.
  5. Wire the benchmark CI workflow to emit these artifacts per kernel and upload them with the existing benchmark artifact.
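
As a rough illustration of how a consumer might use run_id as the join key between the CSV and the .meta.jsonl sidecar, here is a minimal sketch using only the standard library. Only the names run_id and kernel_id come from this PR; the other columns (config, latency_ms), the shapes field, and all sample values are hypothetical and stand in for whatever the real files contain.

```python
import csv
import io
import json

# Hypothetical sample data. Only run_id and kernel_id are names from the PR;
# every other column, field, and value is invented for illustration.
CSV_TEXT = """run_id,kernel_id,config,latency_ms
r1,k1,block=64,0.42
r1,k1,block=128,0.37
r2,k1,block=64,0.91
"""
META_JSONL = """{"run_id": "r1", "kernel_id": "k1", "shapes": [[1024, 1024]]}
{"run_id": "r2", "kernel_id": "k1", "shapes": [[4096, 4096]]}
"""


def join_rows_to_runs(csv_text, meta_jsonl):
    """Attach each CSV row to the meta record that shares its run_id."""
    # One meta record per autotune run, indexed by run_id.
    meta_by_run = {}
    for line in meta_jsonl.splitlines():
        if line.strip():
            record = json.loads(line)
            meta_by_run[record["run_id"]] = record

    joined = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        meta = meta_by_run.get(row["run_id"])
        if meta is None:
            continue  # row without a matching run record; skipped in this sketch
        joined.append({**row, "shapes": meta["shapes"]})
    return joined


joined = join_rows_to_runs(CSV_TEXT, META_JSONL)
for row in joined:
    print(row["run_id"], row["config"], row["latency_ms"], row["shapes"])
```

Many CSV rows map to one meta record, so the join is many-to-one: both rows for r1 pick up the same shapes, while the r2 row picks up a different one even though kernel_id is identical.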

meta-cla Bot added the CLA Signed label Jun 10, 2026
…he decorator already computed for sample_id and record it per row, completing the kernel-artifact set (source, input shapes, decorator).
IshanAryendu force-pushed the collect-kernel-artifacts branch from 6db0d5c to 44334ef June 10, 2026 17:46
@IshanAryendu

Contributor Author

@claude

claude Bot commented Jun 10, 2026


Claude finished @IshanAryendu's task in 3m 47s


Code review

  • Read the diff vs origin/main
  • Review changed files for bugs / correctness / style
  • Post review feedback

Reviewed the full diff (helion/autotuner/{metrics,logger,base_search,benchmark_provider}.py, the benchmark workflow, docs, and tests). Overall this is a clean, well-documented change: the run_id join key, append-mode sidecars, and the decorator artifact column all hang together, and the test coverage (test_run_id_is_derived_and_distinguishes_shapes, test_sink_appends_across_runs_at_same_base_path) is good — it exercises both the derive-vs-pinned run_id paths and the cross-run accumulation case. No correctness bugs that would break the CI collection path.

A few things worth considering:

1. CSV append mode is fragile to schema drift across versions

AutotuneLogSink.open() (helion/autotuner/logger.py:323) writes the header only when the file is new/empty:

write_header = not self.csv_path.exists() or self.csv_path.stat().st_size == 0
self._csv_file = self.csv_path.open("a", ...)

If a .csv written by an older Helion (the pre-PR 9-column schema, no run_id/decorator) already exists at the base path, the new 11-column rows get appended under the stale header with no indication, silently misaligning every column. In CI this is safe (fresh TEST_REPORTS_DIR each run), but for a persistent HELION_AUTOTUNE_LOG it produces a corrupt file. Consider validating the existing header against the expected columns when not writing one, and either erroring or rotating on mismatch.
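
One possible shape for that header check is sketched below. It is not the PR's code: needs_header and EXPECTED_COLUMNS are hypothetical names, and the column list is a stand-in rather than the real 11-column schema. The sketch takes the "error on mismatch" option; rotating the stale file aside would be the alternative.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical stand-in for the sink's real column list.
EXPECTED_COLUMNS = ["run_id", "kernel_id", "decorator", "config", "latency_ms"]


def needs_header(path, expected):
    """Return True if a header must be written; raise if an existing one differs."""
    if not path.exists() or path.stat().st_size == 0:
        return True
    with path.open(newline="") as f:
        existing = next(csv.reader(f), [])
    if existing != list(expected):
        raise ValueError(
            f"{path} has header {existing!r}, expected {list(expected)!r}; "
            "refusing to append rows under a mismatched schema"
        )
    return False


# Small demonstration in a throwaway directory.
tmp_dir = Path(tempfile.mkdtemp())
log_path = tmp_dir / "autotune.csv"

fresh_result = needs_header(log_path, EXPECTED_COLUMNS)  # file does not exist yet

log_path.write_text("kernel_id,config,latency_ms\n")  # simulate an older schema
try:
    needs_header(log_path, EXPECTED_COLUMNS)
    stale_detected = False
except ValueError:
    stale_detected = True

log_path.write_text(",".join(EXPECTED_COLUMNS) + "\n")  # matching header
matching_result = needs_header(log_path, EXPECTED_COLUMNS)
```

Failing loudly keeps the old file intact and leaves the decision (delete, rotate, or migrate) to the user, at the cost of aborting an autotune run that points at a stale path.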

2. Concurrent writers to a shared base path can interleave / double-header

Append mode assumes a single writer. If two processes autotune into the same HELION_AUTOTUNE_LOG, the write_header check is a TOCTOU race (both can see an empty file and both write a header), and CSV/JSONL line writes from different processes can interleave. The benchmark workflow sidesteps this with a per-kernel path (autotune-$kernel) in a sequential loop, so it's fine as used — but it's an implicit constraint of the design. A one-line note in the autotune_log docs that the base path must not be shared by concurrent processes would save a future debugging session.
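
For context only, the race could also be closed in code rather than in documentation. The sketch below is an assumption-laden illustration, not something this PR does: it uses POSIX advisory locking via fcntl.flock (unavailable on Windows), and append_line_locked is a hypothetical helper. The key point is that the "is the file empty?" check happens while the lock is held, so two writers cannot both decide to write a header.

```python
import fcntl  # POSIX-only; this sketch does not cover Windows
import os
import tempfile


def append_line_locked(path, line, header):
    """Append one line, writing the header first if the file is still empty.

    The emptiness check runs under an exclusive advisory lock, which removes
    the check-then-write race between cooperating processes.
    """
    with open(path, "a", newline="") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            if os.fstat(f.fileno()).st_size == 0:
                f.write(header + "\n")
            f.write(line + "\n")
            f.flush()  # make the data visible before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


# Demonstration with a single process; values are invented.
demo_path = os.path.join(tempfile.mkdtemp(), "autotune.csv")
append_line_locked(demo_path, "r1,0.42", header="run_id,latency_ms")
append_line_locked(demo_path, "r2,0.91", header="run_id,latency_ms")

with open(demo_path) as f:
    lines = f.read().splitlines()
print(lines)
```

Advisory locks only protect writers that all take the lock, so this would not help if some other tool appends to the same file without it. A per-process output path, as the benchmark workflow already uses, avoids the problem without any locking.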

3. Stale comment: join key is run_id, not kernel_id

The comment in open() (helion/autotuner/logger.py:317-318) says:

# ... CSV rows join back to
# these records via kernel_id.

But the rest of the PR (and record() at logger.py:382-384) correctly describes the precise join as run_id (many-to-one), with kernel_id being the coarser cross-shape grouping key. Worth aligning this comment to avoid contradicting the new model.

4. Minor: run_id for an empty/unidentified kernel is a fixed nonzero hash

With default/empty KernelMetadata, __post_init__ still computes run_id = sha256("\x00\x00\x00") — a constant non-empty value, while kernel_id stays "". Through the sink that's masked (run_id is only emitted when metadata is present), so it's harmless, but it means "unidentified" runs all collapse to one shared run_id rather than an empty one. If that matters for downstream dedup, you may want to short-circuit to "" when kernel_id is empty. test_default_metadata_has_empty_identity currently doesn't assert on run_id, so this behavior is untested either way.
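
To make the collapse concrete, here is a small sketch. It assumes, based only on the sha256("\x00\x00\x00") observation above, that run_id is a SHA-256 over four NUL-joined string fields; the function names, the field order, and the use of a full hex digest are hypothetical and may not match the actual __post_init__.

```python
import hashlib


def derive_run_id_unguarded(kernel_id, shapes, dtypes, hardware):
    """Assumed derivation: SHA-256 over NUL-joined fields (hypothetical)."""
    payload = "\x00".join((kernel_id, shapes, dtypes, hardware))
    return hashlib.sha256(payload.encode()).hexdigest()


def derive_run_id(kernel_id, shapes, dtypes, hardware):
    """Same derivation, but short-circuits to "" for an unidentified kernel."""
    if not kernel_id:
        return ""
    return derive_run_id_unguarded(kernel_id, shapes, dtypes, hardware)


# With all-empty fields the unguarded version hashes just the three separators,
# so every unidentified run gets the same non-empty value.
empty_unguarded = derive_run_id_unguarded("", "", "", "")
empty_guarded = derive_run_id("", "", "", "")

# Identified runs are unaffected by the guard and still differ per shape.
run_a = derive_run_id("k1", "1024x1024", "float16", "gpu0")
run_b = derive_run_id("k1", "4096x4096", "float16", "gpu0")
```

With the guard, an empty kernel_id yields an empty run_id, which downstream code can treat as "no identity" instead of as one shared run.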

None of these block the PR for its CI use case (#1/#2 are guarded by the per-kernel fresh-dir setup); #3 is a trivial doc fix and #1 is the most worth hardening if autotune_log is ever pointed at a long-lived path.

IshanAryendu changed the title from "Collect kernel artifacts and append-mode autotune telemetry with run_idCollect kernel artifacts" to "Collect kernel artifacts and append-mode autotune telemetry with run_id" Jun 11, 2026
IshanAryendu marked this pull request as draft June 11, 2026 23:39