autotuner: record perf stats in .meta.jsonl by IshanAryendu · Pull Request #2824 · pytorch/helion

IshanAryendu · 2026-06-18T23:32:08Z

Depends on #2809

Summary

Record per-config perf distribution (min/median/mean/p90/std/n_samples) in .meta.jsonl for cost-model dataset
Extend do_bench and do_bench_generic with return_mode='stats'
Propagate through BenchmarkJob IPC as dict, BenchmarkResult perf_stats field, and AutotuneLogSink configs map
CSV unchanged for back-compat
Fix benchmark_isolated tuple handling for subprocess stats return
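
To make the recorded distribution concrete, here is a minimal sketch of how a downstream consumer might read one run record and keep only configs with usable stats. The record layout (a `configs` map keyed by config_id, each entry carrying a `perf_stats` object) follows the description in this PR, but the exact field nesting, the sample ids, and the all-null shape for failed configs are assumptions made for illustration.

```python
import json

# Hypothetical one-line run record; the real .meta.jsonl schema may differ.
sample_line = json.dumps(
    {
        "run_id": "run-0",
        "configs": {
            "cfg-a": {
                "perf_stats": {
                    "min": 0.40, "median": 0.42, "mean": 0.43,
                    "p90": 0.47, "std": 0.02, "n_samples": 20,
                }
            },
            # Assumed shape for a config whose benchmark failed: all fields null.
            "cfg-b": {
                "perf_stats": {
                    "min": None, "median": None, "mean": None,
                    "p90": None, "std": None, "n_samples": None,
                }
            },
        },
    }
)


def usable_rows(jsonl_line: str) -> list:
    """Return (config_id, median, n_samples) for configs with real stats."""
    record = json.loads(jsonl_line)
    rows = []
    for config_id, entry in record.get("configs", {}).items():
        stats = entry.get("perf_stats") or {}
        if stats.get("median") is None:
            continue  # skip failed or unmeasured configs
        rows.append((config_id, stats["median"], stats["n_samples"]))
    return rows


rows = usable_rows(sample_line)
```

Filtering on a null median keeps failed configs in the file, so they remain joinable by config_id, while excluding them from any regression target.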

Test plan

pytest test/test_kernel_metadata.py -q -v
pytest test/test_kernel_metadata.py::TestAutotuneLogSink -q -v
pytest test/test_autotuner.py -k "log_sink or restricted" -q
pytest test/test_llm_autotuner.py -k LFBO -q
pytest test/test_benchmarking.py -q

…d-heuristic curriculum (pytorch#2760)

- jsd: declare the dX accumulator fp32 to fix a bf16/fp16 ControlFlowTensorMismatch. dX was _input.dtype, but the beta==1 branch folds the fp32 intermediate_loss into it, so the carried dtype mismatches at the branch merge (type propagation visits every beta branch). True no-op at fp32.
- welford: use the true valid-column count for the divisor instead of the constexpr tile width (which over-counts last-tile padding); fixes wrong results at non-divisor N.
- jsd / kl_div: document the fp16 wide-V NaN (out of scope; not fixed here).

The new sibling/probe kernels (argmax, groupnorm, l2_norm, log_softmax, logsumexp, row_max) are intentionally excluded from this PR and kept on the lab branch for a later standalone PR.
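
The welford divisor issue can be illustrated in plain Python. This is a sketch of the general failure mode, not the Helion kernel: the function name, the flag, and the tile-combine loop are hypothetical, and padding lanes are assumed to contribute zero to each tile's sum.

```python
import statistics


def tiled_welford(row, tile, divisor_is_tile_width):
    """Combine per-tile (count, mean, M2) summaries across a row.

    With divisor_is_tile_width=True the last, partially filled tile is
    counted as if it held `tile` elements, which over-counts padding.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for start in range(0, len(row), tile):
        chunk = row[start:start + tile]  # valid columns only
        n_b = tile if divisor_is_tile_width else len(chunk)
        mean_b = sum(chunk) / n_b
        m2_b = sum((x - mean_b) ** 2 for x in chunk)
        # Standard pairwise (Chan et al.) merge of two partial summaries.
        delta = mean_b - mean
        total = count + n_b
        mean += delta * n_b / total
        m2 += m2_b + delta * delta * count * n_b / total
        count = total
    return mean, m2 / count


row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # N=6 is not a multiple of tile=4
good_mean, good_var = tiled_welford(row, tile=4, divisor_is_tile_width=False)
bad_mean, _ = tiled_welford(row, tile=4, divisor_is_tile_width=True)
```

When N is a multiple of the tile width the two variants agree, which is why the error only shows up at non-divisor N.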

…nriched MemoryOpFact (pytorch#2761)

Add the reduction fact layer the Triton reduction seed heuristic will read, built in the compiler fact pass. No heuristic yet; that lands in the next PR. These facts are populated but not consumed here.

- helion/autotuner/config_spec.py: ReductionFact, AccumulatorFact, enriched MemoryOpFact (per-op provenance), and the fact storage fields on ConfigSpec.
- helion/_compiler/device_ir.py: a 3-phase fact build (roll reductions, collect enriched memory-op facts after rolling, derive reduction/accumulator facts) that replaces bespoke per-config graph walks.
- full_width_output / input_load_itemsize key on the reduction AXIS via a new MemoryOpFact.subscript_block_ids (block-id read from the index subscript, resolved reduction-agnostically) rather than an inner_extent==size_hint size-match. This is faithful for user-tiled (T2) reductions whose reduction block is reduction=False, and immune to a non-reduction dim coincidentally equal to the reduction extent. Byte-identical seed configs across the curriculum (no size coincidences existed).
- test/test_memory_op_facts.py: fact coverage incl. the indexing-slot invariant.
- test/test_barrier.py: update the rdim / config_spec fakes for the new fact build.

…orch#2762)

Add the Triton inner-reduction seed heuristic, reading the reduction facts from the previous PR.

- TritonReductionTileHeuristic (T1: rollable rdim; sum, rms_norm, layer_norm, softmax-row, cross_entropy) and TritonReductionUserTileHeuristic (T2: user-tiled; softmax_two_pass, kl_div, jsd, welford/groupnorm). Persistent-vs-looped element/byte spill caps, the rnumel num_warps ramp, Band-B R_BLOCK + Band-C combine footprint caps, the narrow-row w1 occupancy gate, the M_BLOCK>1 apply-loop stream cap, per-slot stream/re-read eviction, off-sm90 conservative fallback. This is the pruned generalizable core; the over-fit dtype tail is intentionally deferred.
- helion/_compiler/autotuner_heuristics/__init__.py: register the two heuristics.
- test/test_autotuner_heuristics.py, test/test_best_available.py: heuristic decisions + reduction-loop config round-trip coverage.

These can result in a test passing without actually running the test. Without this check, we end up running the decorator function as the test, NOT the test body. In doing so we get the following warning:

```
DeprecationWarning: It is deprecated to return a value that is not None from a test case (<bound method skipIfFn.<locals>.decorator of <test.test_foo.TestFoo testMethod=test_bar _with_post_reduction_op>>)
  return self.run(*args, **kwds)
```
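
The failure mode described in this commit message can be reproduced with a small stand-in. The sketch below is hypothetical: `skip_if_fn`, its `strict` flag, and the `Case` class are invented for illustration and are not the repository's `skipIfFn` or its actual check. It shows how applying a decorator factory without calling it installs the inner `decorator` as the test method, so the body never runs and a non-None value is returned.

```python
import functools
import inspect
import unittest


def skip_if_fn(cond_fn, reason="", *, strict=False):
    """Hypothetical decorator factory: skip the test when cond_fn() is true."""

    def decorator(test_fn):
        # One possible guard: refuse anything that is not a plain function,
        # e.g. the test-case instance passed in when `decorator` is misused.
        if strict and not inspect.isfunction(test_fn):
            raise TypeError("skip_if_fn must be called before decorating a test")

        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            if cond_fn():
                raise unittest.SkipTest(reason)
            return test_fn(*args, **kwargs)

        return wrapper

    return decorator


ran = []


class Case:
    @skip_if_fn  # misuse: the factory receives the test function as cond_fn
    def test_misused(self):
        ran.append("misused")

    @skip_if_fn(lambda: False, "never skipped")
    def test_correct(self):
        ran.append("correct")


case = Case()
misused_result = case.test_misused()  # actually calls decorator(case)
correct_result = case.test_correct()

guarded = skip_if_fn(lambda: False, strict=True)
try:
    guarded(case)  # what a runner would do with the misused method
    guard_error = None
except TypeError as exc:
    guard_error = exc
```

After running this, `ran` contains only `"correct"`: the misused test returned a wrapper function instead of executing its body, which is the situation the deprecation warning reports.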

Collect a lean, opt-in, joinable autotuning dataset for training an autotuner cost model. During an autotune run, each benchmarked config and its result are recorded so a downstream model can learn (config, context) → perf.

This revision reworks the original approach in response to review: it removes the config-minimization path, moves per-config data out of the CSV into a per-run JSON sidecar, replaces the positional row index with a content-addressed config_id, and gates the whole dataset behind an explicit flag. Supersedes the initial implementation (69b7af1) with three focused follow-up commits.

Reviewer feedback addressed

1. Logger plumbing: register_config returns None when no sink; metadata + collect_dataset threaded into the sink.
2. When/how data is written: per-run record written once at run end as a single JSON line (default=str); CSV header is the exact lean set; row counter removed.
3. Search logic & settings: collect only when logging on + dataset flag on + not a restricted search; one-time warning if flag set without a log path; config_defaults removed; new autotune_dataset setting.

What changed?

- C1 moved the config into .meta.jsonl as a configs map (keyed by config_id), and moved the meta write from log-open to end_run (one JSON line per run).
- C2 run_id hashes all codegen/perf-affecting settings (backend, dot_precision, fast_math, static_shapes, index_dtype, allow_warp_specialize, triton_do_not_specialize, pallas_interpret, debug_dtype_asserts, persistent_reserved_sms).
- C3 replaced the run_id dataclass field + __post_init__ + _compute_run_id with a single @functools.cached_property.
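
A content-addressed config_id and a lazily computed run_id could look roughly like the sketch below. Everything here is an assumption for illustration: the hash function, the 16-character truncation, the canonical-JSON encoding, and the `RunContext` class are not taken from the PR, which only states that config_id is content-addressed and that run_id is a `functools.cached_property` hashing the perf-affecting settings.

```python
from __future__ import annotations

import functools
import hashlib
import json
from dataclasses import dataclass


def config_id(config: dict) -> str:
    """Hash a canonical JSON encoding so key order does not change the id."""
    canonical = json.dumps(config, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass
class RunContext:  # hypothetical container, not a Helion class
    kernel_name: str
    settings: dict

    @functools.cached_property
    def run_id(self) -> str:
        # Computed on first access and then cached on the instance.
        payload = json.dumps(
            {"kernel": self.kernel_name, "settings": self.settings},
            sort_keys=True,
            default=str,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


ctx = RunContext("softmax", {"backend": "triton", "static_shapes": True})
cfg = {"block_sizes": [64, 32], "num_warps": 4}

# One JSON line per run, with configs keyed by their content hash.
run_line = json.dumps(
    {"run_id": ctx.run_id, "configs": {config_id(cfg): {"config": cfg}}},
    default=str,
)
```

Because the id depends only on content, the same config benchmarked in two runs gets the same key, which is what makes records joinable across runs.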

Copilot

Pull request overview

This PR extends Helion’s autotuner benchmarking/logging pipeline to capture and persist per-config performance distribution statistics (min/median/mean/p90/std/n_samples) into the .meta.jsonl dataset sidecar, while keeping the CSV format unchanged for backwards compatibility.

Changes:

Add a PerfStats summary type and return_mode="stats" support through do_bench / do_bench_generic.
Propagate perf stats through subprocess benchmarking IPC, BenchmarkResult, and into .meta.jsonl via AutotuneLogEntry / AutotuneLogSink.
Add/adjust tests to validate the schema and stats computation, and to account for new return shapes in benchmarking paths.
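
As a rough illustration of the IPC step, a NamedTuple of stats can be flattened to a plain dict for transport and rebuilt on the other side. The helper names `to_ipc` and `from_ipc`, the payload keys, and the `PerfStats` definition below are assumptions made for this sketch; only the field names come from the PR description.

```python
from __future__ import annotations

import json
import math
from typing import NamedTuple, Optional


class PerfStats(NamedTuple):
    # Field names follow the PR summary; the real definition may differ.
    min: float
    median: float
    mean: float
    p90: float
    std: float
    n_samples: int


def to_ipc(perf: float, stats: Optional[PerfStats]) -> dict:
    """Flatten the result into builtin types that any serializer can carry."""
    return {
        "perf": perf,
        "perf_stats": None if stats is None else stats._asdict(),
    }


def from_ipc(payload: dict) -> tuple:
    """Rebuild (perf, PerfStats | None) from the transported dict."""
    raw = payload["perf_stats"]
    return float(payload["perf"]), None if raw is None else PerfStats(**raw)


stats = PerfStats(1.0, 1.5, 1.6, 2.0, 0.3, 10)
# JSON stands in for whatever serialization the worker actually uses.
wire = json.loads(json.dumps(to_ipc(1.5, stats)))
perf, restored = from_ipc(wire)
failed = from_ipc(to_ipc(math.inf, None))
```

Sending a dict rather than the NamedTuple itself keeps the worker protocol independent of where the stats class is defined.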

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `helion/autotuner/benchmarking.py` | Introduces `PerfStats`, computes stats, and adds `"stats"` return_mode to bench APIs. |
| `helion/autotuner/benchmark_provider.py` | Switches autotune benchmarking to request stats, threads them through subprocess/in-process flows, and records to log sink. |
| `helion/autotuner/benchmark_job.py` | Extends benchmark worker job to support return_mode propagation and stats dict return over IPC. |
| `helion/autotuner/logger.py` | Adds `perf_stats` to log entries and persists per-config perf stats in `.meta.jsonl` configs map. |
| `test/test_benchmarking.py` | Adds unit tests for `_compute_perf_stats` and stats-mode fallback behavior. |
| `test/test_kernel_metadata.py` | Validates `.meta.jsonl` schema includes `perf_stats` with expected subfields and null template on failure. |
| `test/test_autotuner.py` | Updates mocks/assertions to reflect benchmark function returning `(perf, stats)` and do_bench stats-mode. |
| `test/test_benchmark_worker.py` | Updates subprocess benchmarking integration test shim for new tuple return shape. |
| `test/test_debug_utils.py` | Updates autotune error-path test mock to return `PerfStats` under stats-mode. |

+def _compute_perf_stats(times: list[float]) -> PerfStats:
+    n = len(times)
+    if n == 0:
+        return PerfStats(0.0, 0.0, 0.0, 0.0, 0.0, 0)
+    sorted_times = sorted(times)
+    min_val = sorted_times[0]
+    median_val = statistics.median(sorted_times)
+    mean_val = statistics.mean(sorted_times)
+    p90_val = float(np.percentile(sorted_times, 90))
+    try:
+        std_val = statistics.stdev(sorted_times) if n > 1 else 0.0
+    except statistics.StatisticsError:
+        std_val = 0.0
+    return PerfStats(min_val, median_val, mean_val, p90_val, std_val, n)
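
For readers who want to run the hunk above in isolation, here is a self-contained sketch. The `PerfStats` definition is assumed from the field list in the PR summary, and the pure-Python `percentile_linear` helper is a stand-in for `np.percentile` (it uses the same linear-interpolation rule as NumPy's default) so that the example has no third-party dependency.

```python
import statistics
from typing import NamedTuple


class PerfStats(NamedTuple):
    # Assumed definition; field order matches the positional calls in the diff.
    min: float
    median: float
    mean: float
    p90: float
    std: float
    n_samples: int


def percentile_linear(sorted_vals, q):
    """Percentile with linear interpolation between closest ranks."""
    rank = (len(sorted_vals) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (rank - lo)


def compute_perf_stats(times):
    n = len(times)
    if n == 0:
        # No samples: return an all-zero summary, as the diff does.
        return PerfStats(0.0, 0.0, 0.0, 0.0, 0.0, 0)
    s = sorted(times)
    # Sample standard deviation needs at least two points.
    std = statistics.stdev(s) if n > 1 else 0.0
    return PerfStats(
        s[0],
        statistics.median(s),
        statistics.mean(s),
        percentile_linear(s, 90),
        std,
        n,
    )


stats = compute_perf_stats([5.0, 1.0, 4.0, 2.0, 3.0])
```

Note that an empty input yields zeros rather than nulls, so a consumer should check `n_samples` before trusting the other fields.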


+    def __call__(self) -> float | dict[str, object]:
        # Subprocess inherits parent stderr; capture so Triton runtime


Introduce PerfStats NamedTuple in the benchmarking layer with min, median, mean, p90, std, and n_samples. Extend do_bench and do_bench_generic with return_mode='stats'. Propagate through BenchmarkJob IPC as dict, BenchmarkResult perf_stats field, and AutotuneLogSink configs map. CSV unchanged for back-compat. Fixes benchmark_isolated tuple handling for subprocess stats return.

Test plan:

pytest test/test_kernel_metadata.py -q
pytest test/test_kernel_metadata.py::TestAutotuneLogSink -q -v
pytest test/test_autotuner.py -k 'log_sink or restricted' -q
pytest test/test_llm_autotuner.py -k LFBO -q

… refactor

TestSubprocessBenchmarkIntegration.test_autotune_continues_when_subprocess_reports_inf patches _benchmark_function_subprocess to simulate an inf return, but after the perf stats refactor that method returns tuple[float, PerfStats | None], not float. The old patch returned math.inf, which caused an unpack error and a fallback to the in-process path, which then raised TritonError on bad configs instead of skipping them. Update the patch to return (math.inf, None), matching the new signature.
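
The unpack error behind this test fix can be reproduced with a small stand-in. The `Provider` class and its `benchmark` method below are hypothetical; only the patched method name and the two return shapes come from the commit message.

```python
import math
from unittest import mock


class Provider:  # hypothetical stand-in for the benchmark provider
    def _benchmark_function_subprocess(self, config):
        return 1.0, None

    def benchmark(self, config):
        # The caller now expects a (perf, stats) pair.
        perf, stats = self._benchmark_function_subprocess(config)
        return perf, stats


provider = Provider()

# Old-style patch: a bare float cannot be unpacked into two names.
with mock.patch.object(
    Provider, "_benchmark_function_subprocess", return_value=math.inf
):
    try:
        provider.benchmark({})
        old_patch_error = None
    except TypeError as exc:
        old_patch_error = exc

# Updated patch: matches the tuple-returning signature.
with mock.patch.object(
    Provider, "_benchmark_function_subprocess", return_value=(math.inf, None)
):
    result = provider.benchmark({})
```

Mocks that return the wrong shape fail far from the patch site, so it helps to keep the mocked return value next to a comment naming the signature it mirrors.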

yushangdi and others added 20 commits June 16, 2026 11:39

Add regression test for scalar-arg Triton specialization (pytorch#2793)

7bab48d

[cute] Pre-wait rowvec aux register hoist for the bm=128 2-CTA family (…

53ea26f

…pytorch#2779)

Fix pyrefly type errors in distributed examples (pytorch#2801)

680625f

[cute] Persist autotune winner from memory instead of recompiling (py…

c2577aa

…torch#2768)

[cute] Reuse reduction thread axis for free hl.arange in sibling bran…

1ec5f1a

…ches (pytorch#2786)

[Pallas] Add pallas_loop_type = compact_worklist (pytorch#2782)

1755f30

Co-authored-by: eche <eche@devvm32174.atn0.facebook.com> Co-authored-by: Ethan Che <eche@meta.com>

[examples] Add a simpler concat implementation (pytorch#2766)

3aeaf5d

[test] Add partial slice indexing test for [:n] and [n:] patterns (py…

c149ea9

…torch#2799)

[Pallas] Implement is_row_map_axis legality gate for jagged carry (py…

468b000

…torch#2718)

[cute] Add CuteFp8GemmSkinnyMHeuristic autotuner seed for skinny-M FP…

6d86748

…8 GEMM (pytorch#2808)

Fix pyrefly errors from torch SymInt._sympy_() -> sympy.Basic (pytorc…

3b32b20

…h#2813) Co-authored-by: Ethan Che <eche@meta.com>

[cute] Clear L2 in the generic autotune benchmark loops, which is use…

591c299

…d by cute backend (pytorch#2815)

[chore](deps): Bump actions/checkout from 6 to 7 (pytorch#2811)

813bda0

Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[autotune] Fix pyrefly missing-attribute on L2 cache-clear calls (pyt…

f5e562b

…orch#2820) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Pin pytorch version in benchmark-tpu to existing ci version instead o…

6facd85

…f nightly (pytorch#2818) Co-authored-by: Ethan Che <eche@meta.com>

[cute] Sample CUDA stream per launch to fix empty-graph capture (pyto…

62e9166

…rch#2817)

meta-cla Bot added the CLA Signed label Jun 18, 2026

ethche and others added 3 commits June 18, 2026 20:45

[Pallas] Defer tile load masks past transposes onto the sublane axis (p…

e036705

…ytorch#2812) Co-authored-by: Ethan Che <eche@meta.com>

[chore] Upgrade pyrefly to 1.1.1 (pytorch#2825)

98f5ee7

IshanAryendu force-pushed the iaryendu/autotune-perf-stats branch from 3e26887 to 9493216 Compare June 19, 2026 22:15

IshanAryendu requested a review from Copilot June 19, 2026 22:16

Copilot started reviewing on behalf of IshanAryendu June 19, 2026 22:16

Copilot AI reviewed Jun 19, 2026

IshanAryendu added 2 commits June 19, 2026 15:57

IshanAryendu force-pushed the iaryendu/autotune-perf-stats branch from 9493216 to f13902e Compare June 19, 2026 23:03

IshanAryendu wants to merge 25 commits into pytorch:ig-ir-graph-triton-code from IshanAryendu:iaryendu/autotune-perf-stats
