
autotuner: record perf stats in .meta.jsonl #2824

Draft
IshanAryendu wants to merge 25 commits into pytorch:ig-ir-graph-triton-code from IshanAryendu:iaryendu/autotune-perf-stats

Conversation

@IshanAryendu

Contributor

Depends on #2809

Summary

  • Record the per-config perf distribution (min/median/mean/p90/std/n_samples) in .meta.jsonl for the cost-model dataset
  • Extend do_bench and do_bench_generic with return_mode='stats'
  • Propagate through BenchmarkJob IPC as dict, BenchmarkResult perf_stats field, and AutotuneLogSink configs map
  • CSV unchanged for back-compat
  • Fix benchmark_isolated tuple handling for subprocess stats return

Test plan

  • pytest test/test_kernel_metadata.py -q -v
  • pytest test/test_kernel_metadata.py::TestAutotuneLogSink -q -v
  • pytest test/test_autotuner.py -k "log_sink or restricted" -q
  • pytest test/test_llm_autotuner.py -k LFBO -q
  • pytest test/test_benchmarking.py -q

yushangdi and others added 20 commits June 16, 2026 11:39
Co-authored-by: eche <eche@devvm32174.atn0.facebook.com>
Co-authored-by: Ethan Che <eche@meta.com>
…d-heuristic curriculum (pytorch#2760)

- jsd: declare the dX accumulator fp32 to fix a bf16/fp16 ControlFlowTensorMismatch.
  dX was _input.dtype, but the beta==1 branch folds the fp32 intermediate_loss into it,
  so the carried dtype mismatches at the branch merge (type propagation visits every
  beta branch). True no-op at fp32.
- welford: use the true valid-column count for the divisor instead of the constexpr tile
  width (which over-counts last-tile padding) -- fixes wrong results at non-divisor N.
- jsd / kl_div: document the fp16 wide-V NaN (out of scope; not fixed here).
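
The welford divisor fix can be illustrated outside the kernel. The following is a minimal pure-Python sketch, not the Helion code: `tiled_welford` and its parameters are hypothetical names, and the per-tile merge uses the standard parallel (Chan) update. It shows how dividing by the tile width instead of the valid-column count skews the result when N is not a multiple of the tile width.

```python
def tiled_welford(row, tile, use_valid_count):
    """Merge per-tile (count, mean, M2) statistics across a row.

    use_valid_count=False mimics the bug: the constexpr tile width is used as
    the divisor, so padding lanes in the last tile are counted as samples.
    """
    count, mean, m2 = 0.0, 0.0, 0.0
    for start in range(0, len(row), tile):
        valid = row[start:start + tile]        # in-bounds lanes only
        tile_sum = sum(valid)                  # padding lanes contribute 0
        n_b = float(len(valid) if use_valid_count else tile)
        mean_b = tile_sum / n_b
        m2_b = sum((x - mean_b) ** 2 for x in valid)  # padding masked out
        # Parallel merge of (count, mean, m2) with (n_b, mean_b, m2_b).
        delta = mean_b - mean
        new_count = count + n_b
        mean += delta * n_b / new_count
        m2 += m2_b + delta * delta * count * n_b / new_count
        count = new_count
    return mean, m2 / count


row = [float(i) for i in range(10)]            # N=10 is not a multiple of 4
good_mean, good_var = tiled_welford(row, tile=4, use_valid_count=True)
bad_mean, _ = tiled_welford(row, tile=4, use_valid_count=False)
```

With a divisor N (for example N=8 and tile=4) both variants agree, which matches the commit's note that the bug only shows at non-divisor N.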

The new sibling/probe kernels (argmax, groupnorm, l2_norm, log_softmax, logsumexp,
row_max) are intentionally excluded from this PR and kept on the lab branch for a later
standalone PR.
…nriched MemoryOpFact (pytorch#2761)

Add the reduction fact layer the Triton reduction seed heuristic will read, built in
the compiler fact pass. No heuristic yet — that lands in the next PR; these facts are
populated but not consumed here.

- helion/autotuner/config_spec.py: ReductionFact, AccumulatorFact, enriched
  MemoryOpFact (per-op provenance), and the fact storage fields on ConfigSpec.
- helion/_compiler/device_ir.py: a 3-phase fact build (roll reductions, collect
  enriched memory-op facts after rolling, derive reduction/accumulator facts) that
  replaces bespoke per-config graph walks.
- full_width_output / input_load_itemsize key on the reduction AXIS via a new
  MemoryOpFact.subscript_block_ids (block-id read from the index subscript, resolved
  reduction-agnostically) rather than an inner_extent==size_hint size-match — faithful
  for user-tiled (T2) reductions whose reduction block is reduction=False, and immune to
  a non-reduction dim coincidentally equal to the reduction extent. Byte-identical seed
  configs across the curriculum (no size coincidences existed).
- test/test_memory_op_facts.py: fact coverage incl. the indexing-slot invariant.
- test/test_barrier.py: update the rdim / config_spec fakes for the new fact build.
…orch#2762)

Add the Triton inner-reduction seed heuristic, reading the reduction facts from the
previous PR.

- TritonReductionTileHeuristic (T1: rollable rdim — sum, rms_norm, layer_norm,
  softmax-row, cross_entropy) and TritonReductionUserTileHeuristic (T2: user-tiled —
  softmax_two_pass, kl_div, jsd, welford/groupnorm). Persistent-vs-looped element/byte
  spill caps, the rnumel num_warps ramp, Band-B R_BLOCK + Band-C combine footprint
  caps, the narrow-row w1 occupancy gate, the M_BLOCK>1 apply-loop stream cap,
  per-slot stream/re-read eviction, off-sm90 conservative fallback. This is the pruned
  generalizable core — the over-fit dtype tail is intentionally deferred.
- helion/_compiler/autotuner_heuristics/__init__.py: register the two heuristics.
- test/test_autotuner_heuristics.py, test/test_best_available.py: heuristic decisions
  + reduction-loop config round-trip coverage.
These can result in a test passing without actually running the test. Without this check, we end up running the decorator function as the test, NOT the test body. In doing so we get the following warning:

```
DeprecationWarning: It is deprecated to return a value that is not None from a test case (<bound method skipIfFn.<locals>.decorator of <test.test_foo.TestFoo testMethod=test_bar_with_post_reduction_op>>)
    return self.run(*args, **kwds)
```
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…orch#2820)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…f nightly (pytorch#2818)

Co-authored-by: Ethan Che <eche@meta.com>
@meta-cla meta-cla Bot added the CLA Signed label Jun 18, 2026
ethche and others added 3 commits June 18, 2026 20:45
Collect a lean, opt-in, joinable autotuning dataset for training an autotuner cost model. During an autotune run, each benchmarked config and its result are recorded so a downstream model can learn (config, context) → perf. This revision reworks the original approach in response to review: it removes the config-minimization path, moves per-config data out of the CSV into a per-run JSON sidecar, replaces the positional row index with a content-addressed config_id, and gates the whole dataset behind an explicit flag.
  
Supersedes the initial implementation (69b7af1) with three focused follow-up commits.

Reviewer feedback addressed

  1. Logger plumbing — register_config returns None when no sink; metadata + collect_dataset threaded into the sink.
  2. When/how data is written — per-run record written once at run end as a single JSON line (default=str); CSV header is the exact lean set; row counter removed.
  3. Search logic & settings — collect only when logging on + dataset flag on + not a restricted search; one-time warning if flag set without a log path; config_defaults removed; new autotune_dataset setting.
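
As a rough illustration of the second item, the sketch below writes one JSON line per run with `default=str` and keys the configs map by a content-addressed id. The hashing scheme (SHA-256 over sorted-key JSON, truncated to 16 hex characters), the `RunRecorder` class, and the record field names are assumptions for illustration, not the PR's actual implementation.

```python
import hashlib
import io
import json
from pathlib import PurePosixPath


def config_id(config: dict) -> str:
    """Content-addressed id: hash of the canonical JSON form (assumed scheme)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class RunRecorder:
    """Hypothetical stand-in for the dataset path of the log sink."""

    def __init__(self, metadata: dict) -> None:
        self.metadata = metadata
        self.configs: dict[str, dict] = {}

    def register_config(self, config: dict) -> str:
        cid = config_id(config)
        self.configs[cid] = {"config": config}
        return cid

    def end_run(self, stream) -> None:
        # One JSON line per run; default=str stringifies non-JSON values.
        record = {"metadata": self.metadata, "configs": self.configs}
        stream.write(json.dumps(record, default=str) + "\n")


recorder = RunRecorder({"kernel": "softmax", "source": PurePosixPath("examples/softmax.py")})
first = recorder.register_config({"block_size": 64, "num_warps": 4})
second = recorder.register_config({"num_warps": 4, "block_size": 64})  # same content
buffer = io.StringIO()
recorder.end_run(buffer)
```

Because the id depends only on content, the two registrations above collapse into one entry, which is what makes the record joinable without a positional row index.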

What changed?

-   C1 moved the config into .meta.jsonl as a configs map (keyed by config_id), and moved the meta write from log-open to end_run (one JSON line per run).
-   C2 run_id hashes all codegen/perf-affecting settings (backend, dot_precision, fast_math, static_shapes, index_dtype, allow_warp_specialize, triton_do_not_specialize, pallas_interpret, debug_dtype_asserts, persistent_reserved_sms)
-   C3 Replaced the run_id dataclass field + __post_init__ + _compute_run_id with a single @functools.cached_property
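
The C3 change can be pictured as follows. This is a minimal sketch under assumptions: `RunSettings` is a hypothetical class carrying a subset of the settings named in C2, the default values are placeholders, and the SHA-256 digest is an illustrative choice rather than the PR's hash.

```python
import dataclasses
import functools
import hashlib
import json


@dataclasses.dataclass
class RunSettings:
    # Subset of the settings listed in C2; defaults are placeholders.
    backend: str = "triton"
    dot_precision: str = "tf32"
    fast_math: bool = False
    static_shapes: bool = True
    index_dtype: str = "int32"

    @functools.cached_property
    def run_id(self) -> str:
        """Computed lazily on first access, then cached on the instance."""
        payload = json.dumps(dataclasses.asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


baseline = RunSettings()
fast = RunSettings(fast_math=True)
field_names = [f.name for f in dataclasses.fields(RunSettings)]
```

Since `run_id` is not a dataclass field, it stays out of `asdict` and of the hash input, and no `__post_init__` is needed to populate it.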

Copilot AI left a comment

Pull request overview

This PR extends Helion’s autotuner benchmarking/logging pipeline to capture and persist per-config performance distribution statistics (min/median/mean/p90/std/n_samples) into the .meta.jsonl dataset sidecar, while keeping the CSV format unchanged for backwards compatibility.

Changes:

  • Add a PerfStats summary type and return_mode="stats" support through do_bench / do_bench_generic.
  • Propagate perf stats through subprocess benchmarking IPC, BenchmarkResult, and into .meta.jsonl via AutotuneLogEntry / AutotuneLogSink.
  • Add/adjust tests to validate the schema and stats computation, and to account for new return shapes in benchmarking paths.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| helion/autotuner/benchmarking.py | Introduces PerfStats, computes stats, and adds "stats" return_mode to bench APIs. |
| helion/autotuner/benchmark_provider.py | Switches autotune benchmarking to request stats, threads them through subprocess/in-process flows, and records to log sink. |
| helion/autotuner/benchmark_job.py | Extends benchmark worker job to support return_mode propagation and stats dict return over IPC. |
| helion/autotuner/logger.py | Adds perf_stats to log entries and persists per-config perf stats in .meta.jsonl configs map. |
| test/test_benchmarking.py | Adds unit tests for _compute_perf_stats and stats-mode fallback behavior. |
| test/test_kernel_metadata.py | Validates .meta.jsonl schema includes perf_stats with expected subfields and null template on failure. |
| test/test_autotuner.py | Updates mocks/assertions to reflect benchmark function returning (perf, stats) and do_bench stats-mode. |
| test/test_benchmark_worker.py | Updates subprocess benchmarking integration test shim for new tuple return shape. |
| test/test_debug_utils.py | Updates autotune error-path test mock to return PerfStats under stats-mode. |


Comment on lines +55 to +68
def _compute_perf_stats(times: list[float]) -> PerfStats:
    n = len(times)
    if n == 0:
        return PerfStats(0.0, 0.0, 0.0, 0.0, 0.0, 0)
    sorted_times = sorted(times)
    min_val = sorted_times[0]
    median_val = statistics.median(sorted_times)
    mean_val = statistics.mean(sorted_times)
    p90_val = float(np.percentile(sorted_times, 90))
    try:
        std_val = statistics.stdev(sorted_times) if n > 1 else 0.0
    except statistics.StatisticsError:
        std_val = 0.0
    return PerfStats(min_val, median_val, mean_val, p90_val, std_val, n)
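
For readers who want to run the snippet above in isolation, here is a self-contained sketch. It assumes `PerfStats` is a NamedTuple whose fields are named after the statistics listed in the summary (the exact field names are an assumption), and it swaps `np.percentile` for a hand-written linear-interpolation percentile so that only the standard library is needed. `compute_perf_stats` and `percentile_linear` are illustrative names.

```python
import statistics
from typing import NamedTuple


class PerfStats(NamedTuple):
    # Field names are assumed from the min/median/mean/p90/std/n_samples list.
    min: float
    median: float
    mean: float
    p90: float
    std: float
    n_samples: int


def percentile_linear(sorted_vals: list[float], q: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = lo + 1 if lo + 1 < len(sorted_vals) else lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def compute_perf_stats(times: list[float]) -> PerfStats:
    n = len(times)
    if n == 0:
        return PerfStats(0.0, 0.0, 0.0, 0.0, 0.0, 0)
    ordered = sorted(times)
    # Sample standard deviation needs at least two points.
    std = statistics.stdev(ordered) if n > 1 else 0.0
    return PerfStats(
        ordered[0],
        statistics.median(ordered),
        statistics.mean(ordered),
        percentile_linear(ordered, 90),
        std,
        n,
    )


stats = compute_perf_stats([5.0, 1.0, 4.0, 2.0, 3.0])
```

The `n > 1` guard already covers the case in which `statistics.stdev` would raise, so this sketch omits the `try`/`except` from the quoted code.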
Comment on lines +30 to 31
def __call__(self) -> float | dict[str, object]:
    # Subprocess inherits parent stderr; capture so Triton runtime
Introduce a PerfStats NamedTuple in the benchmarking layer with min, median, mean, p90, std, and n_samples fields.
Extend do_bench and do_bench_generic with return_mode='stats'.
Propagate through BenchmarkJob IPC as dict, BenchmarkResult perf_stats field, and AutotuneLogSink configs map.
CSV unchanged for back-compat. Fixes benchmark_isolated tuple handling for subprocess stats return.
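
One way to picture the IPC step is a NamedTuple flattened to a plain dict in the worker and rebuilt in the parent. This is a sketch under assumptions: the payload keys `"perf"` and `"stats"`, the helper names, and the `PerfStats` field names are hypothetical, and `pickle` stands in for whatever transport the benchmark worker actually uses.

```python
import pickle
from typing import NamedTuple, Optional, Union


class PerfStats(NamedTuple):
    # Assumed field names, mirroring the list in the commit message.
    min: float
    median: float
    mean: float
    p90: float
    std: float
    n_samples: int


def pack_job_result(perf: float, stats: Optional[PerfStats]) -> dict:
    """Worker side: send builtin types only, so the payload is a plain dict."""
    return {"perf": perf, "stats": dict(stats._asdict()) if stats else None}


def unpack_job_result(result: Union[float, dict]) -> tuple:
    """Parent side: accept either the legacy float or the stats dict."""
    if isinstance(result, dict):
        raw = result.get("stats")
        return float(result["perf"]), PerfStats(**raw) if raw is not None else None
    return float(result), None


sent = PerfStats(1.0, 1.2, 1.3, 1.9, 0.4, 8)
wire = pickle.dumps(pack_job_result(sent.median, sent))  # simulated IPC hop
perf, received = unpack_job_result(pickle.loads(wire))
legacy_perf, legacy_stats = unpack_job_result(2.5)
```

Accepting both shapes on the parent side keeps callers that still return a bare float working, which matches the `float | dict[str, object]` return annotation quoted in the review.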

Test plan:
pytest test/test_kernel_metadata.py -q
pytest test/test_kernel_metadata.py::TestAutotuneLogSink -q -v
pytest test/test_autotuner.py -k 'log_sink or restricted' -q
pytest test/test_llm_autotuner.py -k LFBO -q
… refactor

TestSubprocessBenchmarkIntegration.test_autotune_continues_when_subprocess_reports_inf patches _benchmark_function_subprocess to simulate an inf return, but after the perf-stats refactor that method returns tuple[float, PerfStats | None], not float. The old patch returned math.inf, which caused an unpack error and a fallback to the in-process path; that path then raised TritonError on bad configs instead of skipping them.

Update patch to return (math.inf, None) matching new signature.
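
The mismatch is easy to reproduce with `unittest.mock`. In this sketch `Provider` is a hypothetical stand-in for the real benchmark provider; only the method name `_benchmark_function_subprocess` and the two return shapes come from the commit message.

```python
import math
from unittest import mock


class Provider:
    """Hypothetical stand-in; the real provider lives in helion/autotuner."""

    def _benchmark_function_subprocess(self, config):
        return 1.0, None  # new shape: (perf, stats or None)

    def benchmark(self, config):
        # The caller unpacks a 2-tuple, so a bare float breaks here.
        perf, stats = self._benchmark_function_subprocess(config)
        return perf, stats


provider = Provider()
old_patch_error = None

# Old patch: a bare float cannot be unpacked into (perf, stats).
with mock.patch.object(Provider, "_benchmark_function_subprocess", return_value=math.inf):
    try:
        provider.benchmark({})
    except TypeError as exc:
        old_patch_error = exc

# Updated patch: matches the new tuple signature.
with mock.patch.object(
    Provider, "_benchmark_function_subprocess", return_value=(math.inf, None)
):
    perf, stats = provider.benchmark({})
```

In the real test the unpack failure was swallowed by a fallback path rather than surfacing as a `TypeError`, which is why the symptom was a `TritonError` further downstream.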
@IshanAryendu IshanAryendu force-pushed the iaryendu/autotune-perf-stats branch from 9493216 to f13902e Compare June 19, 2026 23:03


10 participants