[autotuner] Triton reduction seed heuristic (generalizable core)#2762

Open
calebmkim wants to merge 1 commit into calebmkim/stack/2 from calebmkim/stack/3
Conversation

@calebmkim calebmkim commented Jun 11, 2026

Contributor

Stacked PRs:


[autotuner] Triton reduction seed heuristic (generalizable core)

Add the Triton inner-reduction seed heuristic, reading the reduction facts from the
previous PR.

  • TritonReductionTileHeuristic (T1: rollable rdim — sum, rms_norm, layer_norm,
    softmax-row, cross_entropy) and TritonReductionUserTileHeuristic (T2: user-tiled —
    softmax_two_pass, kl_div, jsd, welford/groupnorm). Persistent-vs-looped element/byte
    spill caps, the rnumel num_warps ramp, Band-B R_BLOCK + Band-C combine footprint
    caps, the narrow-row w1 occupancy gate, the M_BLOCK>1 apply-loop stream cap,
    per-slot stream/re-read eviction, off-sm90 conservative fallback. This is the pruned
    generalizable core — the over-fit dtype tail is intentionally deferred.
  • helion/_compiler/autotuner_heuristics/__init__.py: register the two heuristics.
  • test/test_autotuner_heuristics.py, test/test_best_available.py: heuristic decisions
    + reduction-loop config round-trip coverage.
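
To make the persistent-vs-looped decision and the rnumel num_warps ramp described above more concrete, here is a minimal Python sketch. It is not the code in this PR: the names `ReductionSeed` and `seed_reduction_config`, the cap constants, and the one-warp-per-512-elements ramp are all hypothetical values chosen for illustration, and the real heuristics may weigh many more inputs (the Band-B/Band-C caps, occupancy gate, eviction policy, and so on).

```python
from dataclasses import dataclass


def next_power_of_2(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


@dataclass(frozen=True)
class ReductionSeed:
    persistent: bool  # True: whole row in one block; False: looped over R_BLOCK
    r_block: int      # reduction-dimension block size (power of two)
    num_warps: int    # warps per program (power of two)


# Hypothetical caps, NOT taken from the PR. They only illustrate the idea of
# an element cap plus a byte cap guarding against register spills.
MAX_PERSISTENT_ELEMS = 16384
MAX_PERSISTENT_BYTES = 32768
LOOPED_R_BLOCK = 4096


def seed_reduction_config(rnumel: int, itemsize: int) -> ReductionSeed:
    """Pick a seed config for a row reduction of `rnumel` elements."""
    padded = next_power_of_2(rnumel)
    # Persistent only if the padded row fits under both the element and byte caps.
    persistent = (
        padded <= MAX_PERSISTENT_ELEMS
        and padded * itemsize <= MAX_PERSISTENT_BYTES
    )
    r_block = padded if persistent else min(padded, LOOPED_R_BLOCK)
    # Assumed ramp: one warp per 512 elements of the block, clamped to [1, 16].
    num_warps = max(1, min(16, r_block // 512))
    return ReductionSeed(persistent, r_block, num_warps)


if __name__ == "__main__":
    for rnumel, itemsize in [(100, 4), (1000, 2), (8192, 4), (16384, 4)]:
        print(rnumel, itemsize, seed_reduction_config(rnumel, itemsize))
```

Under these assumed caps, a 16384-element fp32 row passes the element cap but fails the byte cap, so the sketch falls back to a looped reduction with a smaller `r_block`. An autotuner would treat such a result as a starting point for search rather than a final configuration.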

calebmkim pushed a commit that referenced this pull request Jun 11, 2026
Add the Triton inner-reduction seed heuristic, reading the reduction facts from the
previous PR.

- TritonReductionTileHeuristic (T1: rollable rdim — sum, rms_norm, layer_norm,
  softmax-row, cross_entropy) and TritonReductionUserTileHeuristic (T2: user-tiled —
  softmax_two_pass, kl_div, jsd, welford/groupnorm). Persistent-vs-looped element/byte
  spill caps, the rnumel num_warps ramp, Band-B R_BLOCK + Band-C combine footprint
  caps, the narrow-row w1 occupancy gate, the M_BLOCK>1 apply-loop stream cap,
  per-slot stream/re-read eviction, off-sm90 conservative fallback. This is the pruned
  generalizable core — the over-fit dtype tail is intentionally deferred.
- helion/_compiler/autotuner_heuristics/__init__.py: register the two heuristics.
- test/test_autotuner_heuristics.py, test/test_best_available.py: heuristic decisions
  + reduction-loop config round-trip coverage.

stack-info: PR: #2762, branch: calebmkim/stack/3
@calebmkim calebmkim force-pushed the calebmkim/stack/2 branch from 5041cf6 to 124d450 Compare June 11, 2026 18:04
@calebmkim calebmkim force-pushed the calebmkim/stack/3 branch from 8c3bb63 to 4a766eb Compare June 11, 2026 18:04
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 11, 2026
calebmkim pushed a commit that referenced this pull request Jun 11, 2026
@calebmkim calebmkim force-pushed the calebmkim/stack/3 branch from 4a766eb to 4a1c32d Compare June 11, 2026 19:58
@calebmkim calebmkim force-pushed the calebmkim/stack/2 branch from 124d450 to 0e1b911 Compare June 11, 2026 20:02
@calebmkim calebmkim changed the base branch from calebmkim/stack/2 to main June 11, 2026 23:21
calebmkim pushed a commit that referenced this pull request Jun 11, 2026
@calebmkim calebmkim force-pushed the calebmkim/stack/3 branch from 4a1c32d to bbfa3cd Compare June 11, 2026 23:22
@calebmkim calebmkim changed the base branch from main to calebmkim/stack/2 June 11, 2026 23:22
@calebmkim calebmkim marked this pull request as ready for review June 11, 2026 23:27
@calebmkim calebmkim marked this pull request as draft June 12, 2026 00:10
@calebmkim calebmkim changed the base branch from calebmkim/stack/2 to main June 12, 2026 00:10
calebmkim pushed a commit that referenced this pull request Jun 12, 2026
@calebmkim calebmkim force-pushed the calebmkim/stack/3 branch from bbfa3cd to d6ad115 Compare June 12, 2026 00:10
@calebmkim calebmkim changed the base branch from main to calebmkim/stack/2 June 12, 2026 00:10
@calebmkim calebmkim marked this pull request as ready for review June 12, 2026 00:10
@calebmkim calebmkim marked this pull request as draft June 12, 2026 00:39
@calebmkim calebmkim changed the base branch from calebmkim/stack/2 to main June 12, 2026 00:39
@calebmkim calebmkim force-pushed the calebmkim/stack/3 branch from d6ad115 to 4a7c335 Compare June 12, 2026 00:39
@calebmkim calebmkim changed the base branch from main to calebmkim/stack/2 June 12, 2026 00:39
@calebmkim calebmkim marked this pull request as ready for review June 12, 2026 00:39