autotuner: record perf stats in .meta.jsonl #2824
Draft
IshanAryendu wants to merge 25 commits into
Co-authored-by: eche <eche@devvm32174.atn0.facebook.com> Co-authored-by: Ethan Che <eche@meta.com>
…d-heuristic curriculum (pytorch#2760)
- jsd: declare the dX accumulator fp32 to fix a bf16/fp16 ControlFlowTensorMismatch. dX was _input.dtype, but the beta==1 branch folds the fp32 intermediate_loss into it, so the carried dtype mismatches at the branch merge (type propagation visits every beta branch). True no-op at fp32.
- welford: use the true valid-column count for the divisor instead of the constexpr tile width (which over-counts last-tile padding). This fixes wrong results at non-divisor N.
- jsd / kl_div: document the fp16 wide-V NaN (out of scope; not fixed here).

The new sibling/probe kernels (argmax, groupnorm, l2_norm, log_softmax, logsumexp, row_max) are intentionally excluded from this PR and kept on the lab branch for a later standalone PR.
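The welford divisor issue can be illustrated outside Helion. The following is a minimal pure-Python sketch, not the kernel code: it assumes a row of N values processed in fixed-width tiles whose out-of-bounds lanes are masked to zero, and the function name `tiled_mean_var` and its `use_valid_count` switch are hypothetical.

```python
def tiled_mean_var(row, block, use_valid_count):
    """Combine per-tile (count, mean, M2) partials, chunked-Welford style."""
    count, mean, m2 = 0, 0.0, 0.0
    for start in range(0, len(row), block):
        tile = row[start:start + block]
        valid = len(tile)  # true number of in-bounds columns in this tile
        # Masked (padding) lanes add 0 to the sum, so only the divisor differs.
        n_b = valid if use_valid_count else block
        mean_b = sum(tile) / n_b
        m2_b = sum((x - mean_b) ** 2 for x in tile)
        # Standard pairwise combine of two (count, mean, M2) partials.
        delta = mean_b - mean
        total = count + n_b
        mean += delta * n_b / total
        m2 += m2_b + delta * delta * count * n_b / total
        count = total
    return mean, m2 / count  # population variance


row = [float(i) for i in range(1, 11)]  # N = 10 is not a multiple of 4

good_mean, good_var = tiled_mean_var(row, block=4, use_valid_count=True)
bad_mean, bad_var = tiled_mean_var(row, block=4, use_valid_count=False)

# When N divides evenly into tiles, both divisors agree.
even_good = tiled_mean_var(row[:8], block=4, use_valid_count=True)
even_bad = tiled_mean_var(row[:8], block=4, use_valid_count=False)
```

With the tile width as divisor, the last tile is treated as if it held four samples instead of two, which drags the combined mean toward zero. The two variants only diverge when N is not a multiple of the tile width.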
…nriched MemoryOpFact (pytorch#2761)
Add the reduction fact layer the Triton reduction seed heuristic will read, built in the compiler fact pass. No heuristic yet (that lands in the next PR); these facts are populated but not consumed here.
- helion/autotuner/config_spec.py: ReductionFact, AccumulatorFact, enriched MemoryOpFact (per-op provenance), and the fact storage fields on ConfigSpec.
- helion/_compiler/device_ir.py: a 3-phase fact build (roll reductions, collect enriched memory-op facts after rolling, derive reduction/accumulator facts) that replaces bespoke per-config graph walks.
- full_width_output / input_load_itemsize key on the reduction AXIS via a new MemoryOpFact.subscript_block_ids (block-id read from the index subscript, resolved reduction-agnostically) rather than an inner_extent==size_hint size-match. This is faithful for user-tiled (T2) reductions whose reduction block is reduction=False, and immune to a non-reduction dim coincidentally equal to the reduction extent. Byte-identical seed configs across the curriculum (no size coincidences existed).
- test/test_memory_op_facts.py: fact coverage incl. the indexing-slot invariant.
- test/test_barrier.py: update the rdim / config_spec fakes for the new fact build.
…orch#2762)
Add the Triton inner-reduction seed heuristic, reading the reduction facts from the previous PR.
- TritonReductionTileHeuristic (T1, rollable rdim: sum, rms_norm, layer_norm, softmax-row, cross_entropy) and TritonReductionUserTileHeuristic (T2, user-tiled: softmax_two_pass, kl_div, jsd, welford/groupnorm). Persistent-vs-looped element/byte spill caps, the rnumel num_warps ramp, Band-B R_BLOCK + Band-C combine footprint caps, the narrow-row w1 occupancy gate, the M_BLOCK>1 apply-loop stream cap, per-slot stream/re-read eviction, off-sm90 conservative fallback. This is the pruned generalizable core; the over-fit dtype tail is intentionally deferred.
- helion/_compiler/autotuner_heuristics/__init__.py: register the two heuristics.
- test/test_autotuner_heuristics.py, test/test_best_available.py: heuristic decisions + reduction-loop config round-trip coverage.
…h#2813) Co-authored-by: Ethan Che <eche@meta.com>
…d by cute backend (pytorch#2815)
These can result in a test passing without actually running the test. Without this check, we end up running the decorator function as the test, NOT the test body. In doing so we get the following warning:
```
DeprecationWarning: It is deprecated to return a value that is not None from a test case (<bound method skipIfFn.<locals>.decorator of <test.test_foo.TestFoo testMethod=test_bar_with_post_reduction_op>>)
return self.run(*args, **kwds)
```
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…orch#2820) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…f nightly (pytorch#2818) Co-authored-by: Ethan Che <eche@meta.com>
…ytorch#2812) Co-authored-by: Ethan Che <eche@meta.com>
Collect a lean, opt-in, joinable autotuning dataset for training an autotuner cost model. During an autotune run, each benchmarked config and its result are recorded so a downstream model can learn (config, context) → perf.

This revision reworks the original approach in response to review: it removes the config-minimization path, moves per-config data out of the CSV into a per-run JSON sidecar, replaces the positional row index with a content-addressed config_id, and gates the whole dataset behind an explicit flag. Supersedes the initial implementation (69b7af1) with three focused follow-up commits.

Reviewer feedback addressed
1. Logger plumbing: register_config returns None when no sink; metadata + collect_dataset threaded into the sink.
2. When/how data is written: per-run record written once at run end as a single JSON line (default=str); CSV header is the exact lean set; row counter removed.
3. Search logic & settings: collect only when logging on + dataset flag on + not a restricted search; one-time warning if flag set without a log path; config_defaults removed; new autotune_dataset setting.

What changed?
- C1 moved the config into .meta.jsonl as a configs map (keyed by config_id), and moved the meta write from log-open to end_run (one JSON line per run).
- C2 run_id hashes all codegen/perf-affecting settings (backend, dot_precision, fast_math, static_shapes, index_dtype, allow_warp_specialize, triton_do_not_specialize, pallas_interpret, debug_dtype_asserts, persistent_reserved_sms).
- C3 replaced the run_id dataclass field + __post_init__ + _compute_run_id with a single @functools.cached_property.
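The three changes above (content-addressed config_id, a run_id derived lazily from settings, one JSON line per run) can be sketched together. This is an illustrative sketch only: the SHA-256 choice, the 16-character truncation, the `RunContext` class, and the record field names other than `configs` are assumptions, not the PR's actual implementation.

```python
import functools
import hashlib
import json
from dataclasses import dataclass, field


def _digest(payload: dict) -> str:
    # Canonical JSON (sorted keys) so equal content hashes equally.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def config_id(config: dict) -> str:
    """Content-addressed id: depends on the config's content, not its row position."""
    return _digest(config)


@dataclass
class RunContext:  # hypothetical container for codegen/perf-affecting settings
    settings: dict = field(default_factory=dict)

    @functools.cached_property
    def run_id(self) -> str:
        # Computed on first access and cached on the instance.
        return _digest(self.settings)


def end_run_line(ctx: RunContext, results: list) -> str:
    """Build the single JSON line written once at the end of a run."""
    record = {
        "run_id": ctx.run_id,
        "configs": {config_id(cfg): {"config": cfg, "perf": perf} for cfg, perf in results},
    }
    return json.dumps(record, default=str)  # default=str covers non-JSON values


ctx = RunContext({"backend": "triton", "static_shapes": True})
line = end_run_line(ctx, [({"block_size": 64, "num_warps": 4}, 0.12)])
parsed = json.loads(line)
```

Because the id is derived from content, the same config benchmarked in two runs gets the same key, which is what makes records joinable across files.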
3e26887 to 9493216
Pull request overview
This PR extends Helion’s autotuner benchmarking/logging pipeline to capture and persist per-config performance distribution statistics (min/median/mean/p90/std/n_samples) into the .meta.jsonl dataset sidecar, while keeping the CSV format unchanged for backwards compatibility.
Changes:
- Add a `PerfStats` summary type and `return_mode="stats"` support through `do_bench`/`do_bench_generic`.
- Propagate perf stats through subprocess benchmarking IPC, `BenchmarkResult`, and into `.meta.jsonl` via `AutotuneLogEntry`/`AutotuneLogSink`.
- Add/adjust tests to validate the schema and stats computation, and to account for new return shapes in benchmarking paths.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| helion/autotuner/benchmarking.py | Introduces PerfStats, computes stats, and adds "stats" return_mode to bench APIs. |
| helion/autotuner/benchmark_provider.py | Switches autotune benchmarking to request stats, threads them through subprocess/in-process flows, and records to log sink. |
| helion/autotuner/benchmark_job.py | Extends benchmark worker job to support return_mode propagation and stats dict return over IPC. |
| helion/autotuner/logger.py | Adds perf_stats to log entries and persists per-config perf stats in .meta.jsonl configs map. |
| test/test_benchmarking.py | Adds unit tests for _compute_perf_stats and stats-mode fallback behavior. |
| test/test_kernel_metadata.py | Validates .meta.jsonl schema includes perf_stats with expected subfields and null template on failure. |
| test/test_autotuner.py | Updates mocks/assertions to reflect benchmark function returning (perf, stats) and do_bench stats-mode. |
| test/test_benchmark_worker.py | Updates subprocess benchmarking integration test shim for new tuple return shape. |
| test/test_debug_utils.py | Updates autotune error-path test mock to return PerfStats under stats-mode. |
Comment on lines +55 to +68

    def _compute_perf_stats(times: list[float]) -> PerfStats:
        n = len(times)
        if n == 0:
            return PerfStats(0.0, 0.0, 0.0, 0.0, 0.0, 0)
        sorted_times = sorted(times)
        min_val = sorted_times[0]
        median_val = statistics.median(sorted_times)
        mean_val = statistics.mean(sorted_times)
        p90_val = float(np.percentile(sorted_times, 90))
        try:
            std_val = statistics.stdev(sorted_times) if n > 1 else 0.0
        except statistics.StatisticsError:
            std_val = 0.0
        return PerfStats(min_val, median_val, mean_val, p90_val, std_val, n)
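To make the summary above concrete, here is a self-contained sketch that mirrors the quoted function on a small sample. Two things are assumptions: the `PerfStats` field names (taken from the min/median/mean/p90/std/n_samples list in the overview) and the hand-rolled `percentile_linear`, which stands in for `np.percentile` using linear interpolation between closest ranks so the example needs only the standard library.

```python
import math
import statistics
from typing import NamedTuple


class PerfStats(NamedTuple):  # field names assumed from the PR overview
    min: float
    median: float
    mean: float
    p90: float
    std: float
    n_samples: int


def percentile_linear(sorted_vals: list, q: float) -> float:
    """Linear interpolation between the two closest ranks."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize(times: list) -> PerfStats:
    n = len(times)
    if n == 0:
        return PerfStats(0.0, 0.0, 0.0, 0.0, 0.0, 0)
    s = sorted(times)
    # Sample standard deviation needs at least two points.
    std = statistics.stdev(s) if n > 1 else 0.0
    return PerfStats(s[0], statistics.median(s), statistics.mean(s),
                     percentile_linear(s, 90), std, n)


stats = summarize([3.0, 1.0, 5.0, 2.0, 4.0])
single = summarize([0.25])
empty = summarize([])
```

For the five-sample input the 90th percentile falls between the fourth and fifth sorted values, and the standard deviation is the sample (n - 1) form, matching `statistics.stdev`.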
Comment on lines +30 to 31

    def __call__(self) -> float | dict[str, object]:
        # Subprocess inherits parent stderr; capture so Triton runtime
Introduce a PerfStats NamedTuple in the benchmarking layer with min, median, mean, p90, std, and n_samples. Extend do_bench and do_bench_generic with return_mode='stats'. Propagate the stats through BenchmarkJob IPC as a dict, the BenchmarkResult perf_stats field, and the AutotuneLogSink configs map. The CSV is unchanged for back-compat. Fixes benchmark_isolated tuple handling for the subprocess stats return.

Test plan:
- pytest test/test_kernel_metadata.py -q
- pytest test/test_kernel_metadata.py::TestAutotuneLogSink -q -v
- pytest test/test_autotuner.py -k 'log_sink or restricted' -q
- pytest test/test_llm_autotuner.py -k LFBO -q
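The propagation path in this commit (NamedTuple, then a plain dict over IPC, then a per-config entry in the configs map) might look roughly like the sketch below. The helper names `to_ipc`, `from_ipc`, and `configs_entry` are hypothetical, the field names are assumed from the commit message, and the all-null entry for a failed config is modeled on the "null template on failure" wording in the review summary.

```python
import json
from typing import NamedTuple, Optional


class PerfStats(NamedTuple):  # field names assumed from the commit message
    min: float
    median: float
    mean: float
    p90: float
    std: float
    n_samples: int


def to_ipc(stats: PerfStats) -> dict:
    # A plain dict avoids requiring the receiver to import the class to unpickle.
    return stats._asdict()


def from_ipc(payload: dict) -> PerfStats:
    return PerfStats(**payload)


# Template with every subfield present but null, for configs that failed.
NULL_STATS = dict.fromkeys(PerfStats._fields)


def configs_entry(stats: Optional[PerfStats]) -> dict:
    """Hypothetical per-config entry for the .meta.jsonl configs map."""
    return {"perf_stats": to_ipc(stats) if stats is not None else dict(NULL_STATS)}


original = PerfStats(0.10, 0.11, 0.12, 0.15, 0.02, 20)
round_tripped = from_ipc(to_ipc(original))
ok_json = json.dumps(configs_entry(original))
failed_json = json.dumps(configs_entry(None))
```

Keeping the same keys in the failure case means downstream readers can rely on a fixed schema and only need to handle null values.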
… refactor
TestSubprocessBenchmarkIntegration.test_autotune_continues_when_subprocess_reports_inf patches _benchmark_function_subprocess to simulate an inf return, but after the perf stats refactor that method returns tuple[float, PerfStats | None], not float. The old patch returned math.inf, which caused an unpack error and a fallback to the in-process path; that path then raised TritonError on bad configs instead of skipping them. Update the patch to return (math.inf, None), matching the new signature.
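The mismatch between the stale patch and the new return shape can be reproduced in isolation. In this sketch the `Provider` class and its `benchmark` method are hypothetical stand-ins for the real provider; only the method name `_benchmark_function_subprocess` and the `(math.inf, None)` shape come from the commit message, and the fallback is reduced to a marker string.

```python
import math
from unittest import mock


class Provider:  # hypothetical stand-in for the benchmark provider
    def _benchmark_function_subprocess(self, config):
        return 1.0, None  # new shape: (perf, stats-or-None)

    def benchmark(self, config):
        try:
            perf, _stats = self._benchmark_function_subprocess(config)
        except TypeError:
            # A bare float cannot be unpacked into two names.
            return "fell back to in-process path"
        return perf


provider = Provider()

# Stale patch: returns a bare float, as the method did before the refactor.
with mock.patch.object(Provider, "_benchmark_function_subprocess",
                       return_value=math.inf):
    stale_result = provider.benchmark({"num_warps": 4})

# Updated patch: returns the tuple the caller now expects.
with mock.patch.object(Provider, "_benchmark_function_subprocess",
                       return_value=(math.inf, None)):
    fixed_result = provider.benchmark({"num_warps": 4})
```

The stale mock never reaches the "inf means skip this config" branch because unpacking fails first, which is why the test exercised a different path than it intended.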
9493216 to f13902e
Depends on #2809
Summary
`.meta.jsonl` for cost-model dataset
Test plan