
[Loom] Harden AMDGPU benchmark planning diagnostics#131

Merged
benvanik merged 17 commits into main from users/benvanik/loom/benchmark-planning-diagnostics
Jun 23, 2026


@benvanik

Collaborator

This branch tightens the feedback loop between Loom's AMDGPU codegen, its report surfaces, and the benchmark/check tooling agents use to understand failures. The common theme is turning ambiguous backend behavior into structured, attributable evidence while also fixing several codegen contracts that were making that evidence noisy: scheduler pressure cliffs, storage lease release, native fragment layout coverage, configured targetless AMDGPU lowering, benchmark planning diagnostics, and compact report identity.

The AMDGPU scheduler now treats pressure cliffs as a first-class scheduling concern before local stall hiding. Resource-stall scheduling still tries to hide latency, but not at the cost of stepping over a known pressure cliff that will make the allocation or occupancy story worse. The focused schedule regression keeps that policy visible in the low scheduler, and the checked benchmark coverage gives us a larger end-to-end shape where the policy matters.

Allocation can now release storage leases through the coalescing path in the same spirit as the normal append path. That closes a real asymmetry where structurally reusable storage remained live simply because the value moved through a coalesced route, which inflated pressure and could make otherwise valid kernels look worse than their WYSIWYG schedule implied.

The AMDGPU report surface gets more complete and more machine-readable. Packet JSON checks cover AMDGPU packet structures, structural values with null locations are represented cleanly, and compile reports now retain register-class pressure summaries independently from detailed pressure rows. That means summary-mode target_resources can report nonzero scalar/vector pressure peaks without forcing detailed row retention, while details mode still exposes the full pressure row surface. Compact iree-benchmark-loom snapshots also now attach the representative benchmark and case names to completed physical work_items[], matching planned work items and logical benchmark rows so agents no longer need positional joins to explain completed work.

Benchmark/test tooling now reports planning and compile-time problems closer to the row that caused them. Check/testbench planning issues flow through structured tool output, benchmark compile rejections stay local to the affected work item instead of poisoning the whole run, and the shared testbench issue-report helper gives both iree-test-loom and iree-benchmark-loom the same diagnostic vocabulary for unsupported cases. This keeps large benchmark sweeps useful even when one dynamic launch/configuration path is not executable yet.

Configured targetless AMDGPU kernels are covered directly. The compiler can use the invocation target for HAL roots without requiring a source-level target attribute, including kernels whose launch geometry comes from configured values. That keeps authored kernels generic while still letting invocation-time target/config selection specialize them.

AMDGPU native fragment layout coverage now includes the RDNA3 wave64 half-fragment LDS roundtrip. This guards the lane/register layout that fragment-result lowering relies on and gives us a checked benchmark-style fixture for a class of layout bugs that can otherwise survive all-ones reductions and fail only under layout-sensitive data.

The HAL AMDGPU kernarg materialization path no longer ties kernarg loads unnecessarily. The low HAL materialization fixture now covers the actual lowering behavior without adding a broad target-specific C++ IR blob, and the change keeps the compiler focused on the kernel ABI contract rather than test-only root assumptions.

The branch also cleans up a few pieces of supporting infrastructure. loomc C bindings use clearer ownership signaling around released artifacts/modules/targets, the status bridge exemption keeps clang-tidy focused on real status misuse rather than intentional ownership transfer, and Bazel-to-CMake conversion now handles location arguments without expanding one-line Loom check suites into noisy CMake churn.

benvanik added 17 commits June 19, 2026 02:29
Resource-stall scheduling now treats modeled pressure cliff crossings as
a stronger ordering signal than immediate stall-cycle avoidance. This
prevents the scheduler from hiding short latency by issuing independent
producers that move a function into a worse occupancy class when a ready
consumer can keep pressure under the cliff.

AMDGPU loom-check emit providers now pass the same occupancy
pressure-cliff model used by production HAL emission, so resource-stall
assembly and occupancy fixtures exercise the backend scheduling contract
instead of a cliff-free approximation. A focused scheduler regression
covers the add-vs-load choice at a synthetic cliff.
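The ordering policy described above can be sketched as a two-level sort key. This is a minimal illustration, not Loom's scheduler: the `Candidate` type, its fields, and `pick_next` are hypothetical names, and pressure is modeled as a single integer rather than per register class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    pressure_delta: int  # change in live registers if issued now (assumed model)
    stall_cycles: int    # cycles the pipeline would stall if issued now


def pick_next(ready, live_pressure, cliff):
    """Pick a ready candidate, ranking cliff crossings above stall cycles."""
    def key(c):
        # A crossing moves pressure from at-or-below the cliff to above it.
        crosses = live_pressure <= cliff < live_pressure + c.pressure_delta
        # False sorts before True, so non-crossing candidates always win.
        return (crosses, c.stall_cycles)
    return min(ready, key=key)


# Synthetic add-vs-load choice: the load hides latency but crosses the cliff,
# the add stalls but frees a register.
ready = [
    Candidate("load", pressure_delta=+1, stall_cycles=0),
    Candidate("add", pressure_delta=-1, stall_cycles=4),
]
chosen = pick_next(ready, live_pressure=64, cliff=64)
```

With pressure sitting exactly at the cliff, the sketch chooses the stalling `add`; with the cliff far away it falls back to plain stall avoidance and chooses the `load`.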

AMDGPU loom-check can now emit packet JSON from the same scheduled and
allocated frame used by native assembly emission. This gives tests and
investigations access to backend placement after AMDGPU-specific
scheduling, occupancy pressure cliffs, affinities, storage leases, and
spill materialization have been applied.

The shared packet JSON formatter now represents values outside the
allocation domain with a null location instead of asserting. Full
backend frames can contain structural values such as storage
reservations, and the JSON schema already has a nullable location for
exactly that state.
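As a rough illustration of the null-location behavior, the sketch below serializes a value whose location may be absent. The field names (`name`, `location`, `class`, `index`) and the helper `value_to_json` are assumptions made for the example, not the actual packet JSON schema.

```python
import json


def value_to_json(name, location):
    """Format one value row; `location` is None for structural values."""
    if location is None:
        # Outside the allocation domain: emit null instead of asserting.
        return {"name": name, "location": None}
    register_class, index = location
    return {"name": name, "location": {"class": register_class, "index": index}}


rows = [
    value_to_json("%sum", ("vgpr", 3)),
    value_to_json("%reservation", None),  # e.g. a storage reservation
]
text = json.dumps(rows)
```

Python's `json.dumps` writes `None` as `null`, so consumers can distinguish "not allocated" from any real register location.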

Coalesced placement now probes storage lease conflicts with the same
release-capable policy used by ordinary interval allocation. The append
path already records the required release actions when the assignment is
committed, so rejecting those locations during coalescing created a
stricter and higher-pressure allocation state than the allocator could
otherwise legalize.

This keeps concat reservation and source-placement decisions from
jumping over target-visible memory-result leases that can be released
before the coalesced assignment. The AMDGPU occupancy fixture models the
pressure cliff directly with a wide live VGPR span, an older LDS result
lease, and a later concat that should reuse the first releasable slot.
An existing resource assembly fixture now also shows the expected
effect: fewer descriptor copies after a releasable memory-result wait.
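The difference between a strict probe and a release-capable probe can be sketched as follows. The `Lease` type, the integer program points, and `first_usable_slot` are hypothetical; the real allocator tracks far more state than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    slot: int
    # Earliest program point at which the lease may be released,
    # or None if it can never be released (assumed model).
    release_point: Optional[int]


def first_usable_slot(slots, leases, assign_point, allow_release):
    """Return (slot, leases_to_release) for the first slot that is usable."""
    for slot in slots:
        held = [lease for lease in leases if lease.slot == slot]
        if not held:
            return slot, []
        releasable = all(
            lease.release_point is not None and lease.release_point <= assign_point
            for lease in held
        )
        if allow_release and releasable:
            # Reuse the slot and record the release actions to commit.
            return slot, held
    return None, []


leases = [Lease(slot=0, release_point=5)]
strict = first_usable_slot([0, 1, 2], leases, assign_point=10, allow_release=False)
relaxed = first_usable_slot([0, 1, 2], leases, assign_point=10, allow_release=True)
```

In this toy case the strict probe skips to slot 1, while the release-capable probe reuses slot 0 and returns the lease that must be released first, which is the lower-pressure outcome the commit describes.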

The Loom C binding uses ABI-equivalent public and private bridge helpers
to transfer status ownership across the loomc and IREE status types.
Teach the borrowed-status checker those helper names so callers can use
the direct transfer path without tripping observer diagnostics.

Use null-safe IREE stream releases directly in loomc path writers and
rename the target-selection factory to match the create/release
refcount lifecycle. This keeps the public C binding clean under the
runtime ownership checkers.

Thread compile-time config bindings through the shared HAL actual
provider so correctness tests and benchmarks materialize config
declarations before sample constants and target lowering.

Link iree-test-loom target artifact providers through the same
target-artifact configuration used by the benchmark tool. Actual HAL
execution now reaches device selection with an explicit provider
registry instead of failing before the selected device can be named.

Add an AMDGPU loom-compile execution case that exercises a targetless
HAL root with config-driven launch dimensions and a scalar global-buffer
store.

This keeps config materialization, invocation target selection, HAL
wrapper emission, and HSACO artifact emission on the same tested path
used by CLI users.

AMDGPU HAL binding materialization used a same-type tied result on final
s_load_dwordx2 kernarg loads to encourage reuse of the kernarg pointer
SGPRs. That tie is not part of the descriptor packet contract, so paths
that serialize prepared-low modules through low text asm reject the op
even though native emission can otherwise proceed.

Remove the load flag and the branch structure that selected it. The
S_LOAD descriptors already carry early-clobber constraints for placement
hazards, and storage reuse should be decided by allocation/coalescing
instead of by adding descriptor-undeclared ties in HAL ABI
materialization.

Update the HAL materialization golden to the untied load shape so
prepared-low text emission exercises the descriptor packet contract
directly.

Route benchmark and execution-suite ARGS through the existing Bazel
location conversion path so generated CMake receives source-tree paths
instead of raw $(location ...) tokens.

This lets data-backed benchmarks and HAL-style integration tests share a
single Bazel declaration while keeping generated CMake runnable.
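For readers unfamiliar with the token being converted, the sketch below rewrites `$(location ...)` arguments into source-tree paths. It is not the Bazel-to-CMake converter itself: `convert_locations`, the `${PROJECT_SOURCE_DIR}` prefix, and the example package path are all assumptions, and only `//pkg:name` and `:name` label forms are handled.

```python
import re

# Matches $(location <label>) and captures the label.
_LOCATION = re.compile(r"\$\(location\s+([^)]+)\)")


def convert_locations(arg, current_package):
    """Replace $(location ...) tokens with source-tree paths (hypothetical form)."""
    def replace(match):
        label = match.group(1).strip()
        if label.startswith("//"):
            package, _, name = label[2:].partition(":")
        else:
            package, name = current_package, label.lstrip(":")
        return "${PROJECT_SOURCE_DIR}/" + f"{package}/{name}"
    return _LOCATION.sub(replace, arg)


converted = convert_locations("--input=$(location :data.bin)", "tests/loom")
```

Arguments without a location token pass through unchanged, so the same ARGS list can mix plain flags and data-file references.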

Add a checked benchmark that roundtrips an RDNA3 wave64 f16 fragment
through LDS and compares it against the direct f32 fragment store path.

The underlying layout fix already exists on the rebased branch. This
coverage keeps the non-uniform half-fragment LDS order visible in the
integration suite so future layout changes cannot silently reintroduce
row-block permutation.

Testbench planning already records structured issues when a check.case
cannot produce executable samples, but the CLI tools collapsed those
cases into generic sample-count or benchmark-selection failures. Add a
small shared reporter for those issue records and thread it through
iree-test-loom and iree-benchmark-loom report output.

Selected iree-test-loom cases now keep producing the normal loom.test.v0
report with planning issue fields before exiting with failure. Selected
iree-benchmark-loom benchmarks now emit planning failure rows with the
same issue object before the summary row, leaving the generic
zero-sample work-plan error for cases without planner evidence.

HAL actual compilation can reject a candidate for product reasons that
should not abort a benchmark run. Treat unresolved dynamic workgroup
counts after sample constants are applied as compile rejections instead
of infrastructure status failures so mixed once/per-sample work plans
can still report the remaining samples.

Propagate the optional rejection message through benchmark results,
compile rows, and compile-report artifacts while preserving the existing
stable stage/kind fields. Add a HAL actual-provider regression for the
dynamic launch-geometry path that verifies the provider records a
rejected candidate without requiring a real device.
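The distinction between a compile rejection and an infrastructure failure can be sketched like this. `CompileRejection`, `compile_item`, `run_plan`, and the row fields are hypothetical stand-ins; the real provider reports through IREE status values and structured rows rather than Python exceptions.

```python
class CompileRejection(Exception):
    """The candidate is not executable; the run itself is healthy."""


def compile_item(item):
    # Assumed rule for illustration: a workgroup count still unknown after
    # sample constants are applied is a rejection, not a crash.
    if item["workgroup_count"] is None:
        raise CompileRejection("unresolved dynamic workgroup count")
    return f"artifact:{item['name']}"


def run_plan(items):
    """Compile every work item, keeping rejections local to their own row."""
    rows = []
    for item in items:
        try:
            rows.append({"name": item["name"], "status": "compiled",
                         "artifact": compile_item(item)})
        except CompileRejection as rejection:
            rows.append({"name": item["name"], "status": "rejected",
                         "message": str(rejection)})
    return rows


rows = run_plan([
    {"name": "static", "workgroup_count": 8},
    {"name": "dynamic", "workgroup_count": None},
    {"name": "static2", "workgroup_count": 4},
])
```

Only `CompileRejection` is caught, so any other exception still aborts the run, mirroring the idea that genuine infrastructure failures should not be downgraded to per-row rejections.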

Compact benchmark snapshots now carry the representative benchmark and
case names on completed physical work items, matching the planned
work-item and logical benchmark rows. Agents consuming snapshot output
no longer need positional joins to explain which benchmark produced a
completed work item.
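A consumer of such a snapshot could then group completed work by name instead of by position. In the sketch below only `work_items` comes from the description; the `id`, `benchmark`, `case`, and `state` field names and the sample values are assumptions for illustration.

```python
import json
from collections import defaultdict

# Hypothetical snapshot fragment, not real tool output.
SNAPSHOT = json.loads("""
{
  "work_items": [
    {"id": 0, "benchmark": "matmul", "case": "f16_64x64", "state": "completed"},
    {"id": 1, "benchmark": "matmul", "case": "f16_128x128", "state": "completed"},
    {"id": 2, "benchmark": "reduce", "case": "f32_rows", "state": "completed"}
  ]
}
""")


def items_by_case(snapshot):
    """Group work item ids by (benchmark, case) using the names on each item."""
    grouped = defaultdict(list)
    for item in snapshot["work_items"]:
        grouped[(item["benchmark"], item["case"])].append(item["id"])
    return dict(grouped)


grouped = items_by_case(SNAPSHOT)
```

Because each item carries its own names, reordering or filtering `work_items` does not change which benchmark a completed item is attributed to.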

Compile reports now retain register-class pressure summaries
independently from detailed pressure rows. Summary-mode target resources
can populate scalar and vector pressure peaks without forcing detailed
row retention, while details mode continues to emit the full pressure
row surface.

Destructive packet forms cannot be selected only by checking whether the tied operand itself is live after the consuming op. A tied operand can be a slice or other storage-relation result whose source value remains live, and rewriting the packet in place can clobber that source storage before its later use.

Introduce a generic StorageRelation trait and a low-owned relation query for tied results, copies, slices, concats, and branch payloads. Placement analysis now consumes the shared relation rows instead of re-walking exact structural op families, and operand-form selection rejects destructive rewrites when any relation source remains dynamically live after the consume point.
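The liveness rule can be sketched as a walk over relation sources. This is a simplification under stated assumptions: `can_rewrite_in_place` is a hypothetical name, values are plain strings, and liveness is approximated by a static last-use position rather than the dynamic liveness the commit refers to.

```python
def can_rewrite_in_place(tied_operand, consume_point, relation_sources, last_use):
    """Allow a destructive form only if no related storage is used later.

    relation_sources maps a value to the values whose storage it aliases
    (slice source, copy source, concat inputs, ...).
    last_use maps a value to the position of its final use.
    """
    pending = [tied_operand]
    seen = set()
    while pending:
        value = pending.pop()
        if value in seen:
            continue
        seen.add(value)
        if last_use.get(value, -1) > consume_point:
            return False  # rewriting in place would clobber a later use
        pending.extend(relation_sources.get(value, ()))
    return True


# %slice dies at the consuming op, but the %wide value it aliases is used later.
relation_sources = {"%slice": ["%wide"]}
last_use = {"%slice": 10, "%wide": 20}
naive_ok = last_use["%slice"] <= 10
safe_ok = can_rewrite_in_place("%slice", 10, relation_sources, last_use)
```

Checking only the tied operand reports the rewrite as safe, while following the relation to `%wide` rejects it.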

Feedback failure branching expects an EXEC-width SGPRx2 lane mask. Wave32 compare producers only define the low half of that shape, and stale high-half bits can make the scalar nonzero check enter the cold failure path while EXEC narrowing activates no lanes.

Move the ASAN wave32 zero-extension into the shared feedback helper and use it for TSAN failure masks as well. The branch splitter stays a trusted-input primitive, while producers canonicalize masks once according to the selected wave size before comparison or EXEC narrowing.
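The stale-bits hazard is easy to reproduce with plain integers. The sketch below models the 64-bit SGPR pair as a Python int; `canonicalize_lane_mask` is a hypothetical name for the zero-extension step, not the shared feedback helper itself.

```python
LOW32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def canonicalize_lane_mask(mask, wave_size):
    """Zero-extend a lane mask to 64 bits according to the wave size."""
    if wave_size == 32:
        return mask & LOW32  # the high half is undefined in wave32, so clear it
    if wave_size == 64:
        return mask & MASK64
    raise ValueError(f"unsupported wave size: {wave_size}")


# Wave32 compare found no failing lanes, but the high SGPR holds stale bits.
raw = 0xDEADBEEF_00000000
takes_failure_path_raw = raw != 0
takes_failure_path_canonical = canonicalize_lane_mask(raw, 32) != 0
```

Without canonicalization the scalar nonzero check sees a "failure" that no active lane produced; after zero-extension the mask is zero and the cold path is skipped.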

Source memory planning no longer peels producer ops to recognize
workitem and workgroup coordinate values. Kernel topology operations
publish coordinate-domain facts, and memory dynamic terms now classify
coordinate source and dimension from the fact table instead of exact op
shape.

This lets non-local SSA forms, such as CFG-forwarded block arguments,
preserve coordinate provenance for memory planning. The same workgroup
topology facts also feed target-legalize diagnostics so narrowed
workgroup.id assumptions report the structured topology value kind
rather than falling through as anonymous scalar values.
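One way to picture the fact-table approach is a lookup keyed by value rather than by producer op. In this sketch the fact shape `(source, dimension)`, the value names, and `forward_block_arg_fact` are all hypothetical; it only shows why a block argument can keep coordinate provenance when every incoming value agrees.

```python
def forward_block_arg_fact(facts, incoming_values):
    """Return the shared coordinate fact of all incoming values, if any."""
    incoming = {facts.get(value) for value in incoming_values}
    # Keep the fact only when every predecessor forwards the same one.
    return incoming.pop() if len(incoming) == 1 else None


# Facts published by topology ops: value -> (coordinate source, dimension).
facts = {
    "%tid_x": ("workitem", 0),
    "%tid_x_copy": ("workitem", 0),
    "%gid_y": ("workgroup", 1),
}

same = forward_block_arg_fact(facts, ["%tid_x", "%tid_x_copy"])
mixed = forward_block_arg_fact(facts, ["%tid_x", "%gid_y"])
```

A planner that peels producer ops would see only a block argument here, whereas the table lookup still classifies the value as workitem dimension 0.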

Vector fragment facts were previously raw extension payloads without a
type-domain join. Loop summary meet/widen dropped them when an init
accumulator and a matrix result differed only by native-storage proof,
so the next vector.mma saw an ordinary vector instead of an init/result
fragment.

Register vector as a value fact domain and teach its meet/widen path to
preserve generic equal vector extensions and compatible accumulator
fragment contracts. The join canonicalizes init/result roles and only
keeps native-storage when both incoming facts prove it.
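The join rule can be sketched as a small meet function. `FragmentFact`, its fields, and the canonical `"init/result"` role string are assumptions for the example; the real domain carries generic vector extensions as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FragmentFact:
    shape: tuple
    role: str             # "init" or "result" (assumed encoding)
    native_storage: bool  # True only when native storage is proven


def meet(a: FragmentFact, b: FragmentFact) -> Optional[FragmentFact]:
    """Join two fragment facts, or return None if they are incompatible."""
    if a.shape != b.shape:
        return None
    # Differing init/result roles collapse to one canonical role.
    role = a.role if a.role == b.role else "init/result"
    # Native storage survives only if both incoming facts prove it.
    return FragmentFact(a.shape, role, a.native_storage and b.native_storage)


init = FragmentFact((16, 16), "init", native_storage=False)
result = FragmentFact((16, 16), "result", native_storage=True)
joined = meet(init, result)
```

The joined fact is still a fragment, so the next loop iteration keeps the accumulator contract, but it no longer claims native storage because only one side proved it.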

Add source-low coverage for a non-unrolled BF16 WMMA loop carrying the
f32 accumulator through CFG, proving the selected packet remains in the
loop body.
@benvanik benvanik merged commit 863d29e into main Jun 23, 2026
22 checks passed
@benvanik benvanik deleted the users/benvanik/loom/benchmark-planning-diagnostics branch June 23, 2026 03:04


1 participant