[Loom] Harden AMDGPU benchmark planning diagnostics #131
Merged
benvanik merged 17 commits intoJun 23, 2026
Resource-stall scheduling now treats modeled pressure cliff crossings as a stronger ordering signal than immediate stall-cycle avoidance. This prevents the scheduler from hiding short latency by issuing independent producers that move a function into a worse occupancy class when a ready consumer can keep pressure under the cliff. AMDGPU loom-check emit providers now pass the same occupancy pressure-cliff model used by production HAL emission, so resource-stall assembly and occupancy fixtures exercise the backend scheduling contract instead of a cliff-free approximation. A focused scheduler regression covers the add-vs-load choice at a synthetic cliff.
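The ordering described above can be pictured as a two-level sort key: whether a candidate crosses the modeled cliff comes first, and stall cycles only break ties. This is a minimal sketch, not the scheduler's code; `Candidate`, `pick_next`, and the numeric values are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    stall_cycles: int    # cycles the issue slot would wait if this op is picked
    pressure_delta: int  # modeled change in live registers after issuing


def pick_next(ready, live_pressure, cliff):
    """Prefer candidates that stay at or under the cliff, then fewer stalls."""
    def key(candidate):
        crosses = live_pressure + candidate.pressure_delta > cliff
        # False sorts before True, so cliff-safe candidates always win.
        return (crosses, candidate.stall_cycles)
    return min(ready, key=key)


ready = [
    Candidate("load", stall_cycles=0, pressure_delta=+4),  # hides latency
    Candidate("add", stall_cycles=2, pressure_delta=-1),   # ready consumer
]
near_cliff = pick_next(ready, live_pressure=62, cliff=64)
far_from_cliff = pick_next(ready, live_pressure=40, cliff=64)
print(near_cliff.name, far_from_cliff.name)
```

Near the cliff the consumer is chosen even though it stalls; away from the cliff the usual stall-hiding choice still applies.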
AMDGPU loom-check can now emit packet JSON from the same scheduled and allocated frame used by native assembly emission. This gives tests and investigations access to backend placement after AMDGPU-specific scheduling, occupancy pressure cliffs, affinities, storage leases, and spill materialization have been applied. The shared packet JSON formatter now represents values outside the allocation domain with a null location instead of asserting. Full backend frames can contain structural values such as storage reservations, and the JSON schema already has a nullable location for exactly that state.
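As a rough illustration of the null-location behavior, the sketch below serializes one allocated value and one structural value. The field names (`value`, `location`, `class`, `index`) and the helper `packet_value_to_json` are assumptions made for this example and are not taken from the actual packet JSON schema.

```python
import json


def packet_value_to_json(name, location):
    """Return a JSON-ready row; location is None for structural values."""
    return {"value": name, "location": location}


rows = [
    # A value inside the allocation domain carries a concrete location.
    packet_value_to_json("%sum", {"class": "vgpr", "index": 3}),
    # A structural value (for example a storage reservation) has none, so
    # it is emitted as JSON null rather than triggering an assertion.
    packet_value_to_json("%reservation", None),
]
text = json.dumps(rows)
print(text)
```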
Coalesced placement now probes storage lease conflicts with the same release-capable policy used by ordinary interval allocation. The append path already records the required release actions when the assignment is committed, so rejecting those locations during coalescing created a stricter and higher-pressure allocation state than the allocator could otherwise legalize. This keeps concat reservation and source-placement decisions from jumping over target-visible memory-result leases that can be released before the coalesced assignment. The AMDGPU occupancy fixture models the pressure cliff directly with a wide live VGPR span, an older LDS result lease, and a later concat that should reuse the first releasable slot. An existing resource assembly fixture now also shows the expected effect: fewer descriptor copies after a releasable memory-result wait.
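The difference between the stricter probe and the release-capable probe can be sketched as follows. `Lease`, `probe`, and `first_slot` are hypothetical names, and the slot contents loosely mirror the fixture described above; none of this is the allocator's real interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: str
    releasable: bool  # True if the lease can be released before assignment


def probe(lease: Optional[Lease], release_capable: bool) -> bool:
    """Report whether a slot is usable under the given probing policy."""
    if lease is None:
        return True
    return release_capable and lease.releasable


def first_slot(slots, release_capable):
    """Return the index of the first usable slot, or None."""
    for index, lease in enumerate(slots):
        if probe(lease, release_capable):
            return index
    return None


slots = [
    Lease("older_lds_result", releasable=True),
    Lease("wide_live_vgpr_span", releasable=False),
    None,
]
strict = first_slot(slots, release_capable=False)
relaxed = first_slot(slots, release_capable=True)
print(strict, relaxed)
```

Under the strict policy the placement jumps past the releasable lease to a later slot; under the release-capable policy it reuses the first slot.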
The Loom C binding uses ABI-equivalent public and private bridge helpers to transfer status ownership across the loomc and IREE status types. Teach the borrowed-status checker those helper names so callers can use the direct transfer path without tripping observer diagnostics.
Use null-safe IREE stream releases directly in loomc path writers and rename the target-selection factory to match the create/release refcount lifecycle. This keeps the public C binding clean under the runtime ownership checkers.
Thread compile-time config bindings through the shared HAL actual provider so correctness tests and benchmarks materialize config declarations before sample constants and target lowering. Link iree-test-loom target artifact providers through the same target-artifact configuration used by the benchmark tool. Actual HAL execution now reaches device selection with an explicit provider registry instead of failing before the selected device can be named.
Add an AMDGPU loom-compile execution case that exercises a targetless HAL root with config-driven launch dimensions and a scalar global-buffer store. This keeps config materialization, invocation target selection, HAL wrapper emission, and HSACO artifact emission on the same tested path used by CLI users.
AMDGPU HAL binding materialization used a same-type tied result on final s_load_dwordx2 kernarg loads to encourage reuse of the kernarg pointer SGPRs. That tie is not part of the descriptor packet contract, so paths that serialize prepared-low modules through low text asm reject the op even though native emission can otherwise proceed. Remove the load flag and the branch structure that selected it. The S_LOAD descriptors already carry early-clobber constraints for placement hazards, and storage reuse should be decided by allocation/coalescing instead of by adding descriptor-undeclared ties in HAL ABI materialization. Update the HAL materialization golden to the untied load shape so prepared-low text emission exercises the descriptor packet contract directly.
Route benchmark and execution-suite ARGS through the existing Bazel location conversion path so generated CMake receives source-tree paths instead of raw $(location ...) tokens. This lets data-backed benchmarks and HAL-style integration tests share a single Bazel declaration while keeping generated CMake runnable.
Add a checked benchmark that roundtrips an RDNA3 wave64 f16 fragment through LDS and compares it against the direct f32 fragment store path. The underlying layout fix already exists on the rebased branch. This coverage keeps the non-uniform half-fragment LDS order visible in the integration suite so future layout changes cannot silently reintroduce row-block permutation.
Testbench planning already records structured issues when a check.case cannot produce executable samples, but the CLI tools collapsed those cases into generic sample-count or benchmark-selection failures. Add a small shared reporter for those issue records and thread it through iree-test-loom and iree-benchmark-loom report output. Selected iree-test-loom cases now keep producing the normal loom.test.v0 report with planning issue fields before exiting with failure. Selected iree-benchmark-loom benchmarks now emit planning failure rows with the same issue object before the summary row, leaving the generic zero-sample work-plan error for cases without planner evidence.
HAL actual compilation can reject a candidate for product reasons that should not abort a benchmark run. Treat unresolved dynamic workgroup counts after sample constants are applied as compile rejections instead of infrastructure status failures so mixed once/per-sample work plans can still report the remaining samples. Propagate the optional rejection message through benchmark results, compile rows, and compile-report artifacts while preserving the existing stable stage/kind fields. Add a HAL actual-provider regression for the dynamic launch-geometry path that verifies the provider records a rejected candidate without requiring a real device.
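One way to picture the rejection path is a per-sample result record that carries an optional message, so a rejected sample does not stop the remaining ones. `CompileResult`, `compile_sample`, and `run_plan` are hypothetical names for this sketch, and the message text is invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CompileResult:
    sample: str
    status: str                    # "ok" or "rejected"
    message: Optional[str] = None  # optional rejection message


def compile_sample(sample: str, workgroup_count: Optional[Tuple[int, int, int]]):
    # An unresolved dynamic count is a product-level rejection, not an
    # infrastructure failure, so it is returned instead of raised.
    if workgroup_count is None:
        return CompileResult(sample, "rejected",
                             "unresolved dynamic workgroup count")
    return CompileResult(sample, "ok")


def run_plan(samples):
    return [compile_sample(name, count) for name, count in samples]


results = run_plan([("s0", (4, 1, 1)), ("s1", None), ("s2", (8, 1, 1))])
for row in results:
    print(row.sample, row.status, row.message)
```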
Compact benchmark snapshots now carry the representative benchmark and case names on completed physical work items, matching the planned work-item and logical benchmark rows. Agents consuming snapshot output no longer need positional joins to explain which benchmark produced a completed work item. Compile reports now retain register-class pressure summaries independently from detailed pressure rows. Summary-mode target resources can populate scalar and vector pressure peaks without forcing detailed row retention, while details mode continues to emit the full pressure row surface.
Destructive packet forms cannot be selected only by checking whether the tied operand itself is live after the consuming op. A tied operand can be a slice or other storage-relation result whose source value remains live, and rewriting the packet in place can clobber that source storage before its later use. Introduce a generic StorageRelation trait and a low-owned relation query for tied results, copies, slices, concats, and branch payloads. Placement analysis now consumes the shared relation rows instead of re-walking exact structural op families, and operand-form selection rejects destructive rewrites when any relation source remains dynamically live after the consume point.
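The liveness rule can be sketched as a walk over relation sources: an in-place rewrite is rejected if the tied operand or anything it aliases is still used after the consume point. The function name, the `last_use` map, and the `relation_sources` map are assumptions for illustration, not the StorageRelation trait's actual shape.

```python
def can_rewrite_in_place(tied_operand, consume_point, last_use, relation_sources):
    """Return False if the operand or any relation source outlives the consumer.

    last_use maps a value to the program point of its final use.
    relation_sources maps a value to the values whose storage it shares.
    """
    pending = [tied_operand]
    seen = set()
    while pending:
        value = pending.pop()
        if value in seen:
            continue
        seen.add(value)
        if last_use.get(value, -1) > consume_point:
            return False
        pending.extend(relation_sources.get(value, ()))
    return True


relation_sources = {"%slice": ["%vec"]}
# %slice dies at the consumer (point 5) but its source %vec is used at 9.
blocked = can_rewrite_in_place("%slice", 5, {"%slice": 5, "%vec": 9},
                               relation_sources)
# If %vec is dead by then, the destructive form is safe in this model.
allowed = can_rewrite_in_place("%slice", 5, {"%slice": 5, "%vec": 3},
                               relation_sources)
print(blocked, allowed)
```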
Feedback failure branching expects an EXEC-width SGPRx2 lane mask. Wave32 compare producers only define the low half of that shape, and stale high-half bits can make the scalar nonzero check enter the cold failure path while EXEC narrowing activates no lanes. Move the ASAN wave32 zero-extension into the shared feedback helper and use it for TSAN failure masks as well. The branch splitter stays a trusted-input primitive, while producers canonicalize masks once according to the selected wave size before comparison or EXEC narrowing.
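The stale-high-half problem is easy to reproduce with plain integers. The sketch below models the SGPR pair as a 64-bit integer; `canonicalize_lane_mask` is a hypothetical stand-in for the shared feedback helper, not its real name or signature.

```python
def canonicalize_lane_mask(raw_sgpr_pair: int, wave_size: int) -> int:
    """Zero-extend a wave32 mask so only defined lanes remain set."""
    if wave_size == 32:
        return raw_sgpr_pair & 0xFFFF_FFFF
    if wave_size == 64:
        return raw_sgpr_pair & 0xFFFF_FFFF_FFFF_FFFF
    raise ValueError(f"unsupported wave size: {wave_size}")


# The wave32 compare defined only the low half (no failing lanes), but the
# high half still holds stale bits from an earlier value.
stale = 0xDEAD_BEEF_0000_0000
takes_cold_path_without_fix = stale != 0
takes_cold_path_with_fix = canonicalize_lane_mask(stale, 32) != 0
print(takes_cold_path_without_fix, takes_cold_path_with_fix)
```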
Source memory planning no longer peels producer ops to recognize workitem and workgroup coordinate values. Kernel topology operations publish coordinate-domain facts, and memory dynamic terms now classify coordinate source and dimension from the fact table instead of exact op shape. This lets non-local SSA forms, such as CFG-forwarded block arguments, preserve coordinate provenance for memory planning. The same workgroup topology facts also feed target-legalize diagnostics so narrowed workgroup.id assumptions report the structured topology value kind rather than falling through as anonymous scalar values.
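A toy version of fact-table classification shows why a block argument can keep its coordinate provenance without inspecting the defining op. The table layout, the `forward_fact` helper, and the rule that all incoming values must agree are assumptions for this sketch.

```python
def forward_fact(facts, block_arg, incoming_values):
    """Give a block argument the coordinate fact shared by all its inputs."""
    incoming = {facts.get(value) for value in incoming_values}
    if len(incoming) == 1 and None not in incoming:
        facts[block_arg] = next(iter(incoming))
    return facts.get(block_arg)


# Facts are (coordinate source, dimension) pairs published by topology ops.
facts = {"%tid_a": ("workitem", 0), "%tid_b": ("workitem", 0)}

# Both predecessors forward a workitem x-coordinate, so the fact survives.
kept = forward_fact(facts, "%arg0", ["%tid_a", "%tid_b"])
# One predecessor forwards a value with no fact, so nothing is claimed.
dropped = forward_fact(facts, "%arg1", ["%tid_a", "%const"])
print(kept, dropped)
```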
Vector fragment facts were previously raw extension payloads without a type-domain join. Loop summary meet/widen dropped them when an init accumulator and a matrix result differed only by native-storage proof, so the next vector.mma saw an ordinary vector instead of an init/result fragment. Register vector as a value fact domain and teach its meet/widen path to preserve generic equal vector extensions and compatible accumulator fragment contracts. The join canonicalizes init/result roles and only keeps native-storage when both incoming facts prove it. Add source-low coverage for a non-unrolled BF16 WMMA loop carrying the f32 accumulator through CFG, proving the selected packet remains in the loop body.
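The join rule for the native-storage proof can be sketched as a logical AND over the incoming facts, with the role canonicalized instead of compared. `FragmentFact`, `meet`, and the canonical role string are hypothetical; the real fact domain carries more detail than this.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FragmentFact:
    shape: Tuple[int, int]
    element: str
    role: str             # "init" or "result" on input
    native_storage: bool  # True only if native storage was proven


def meet(a: FragmentFact, b: FragmentFact) -> Optional[FragmentFact]:
    """Join two facts; keep native_storage only when both sides prove it."""
    if (a.shape, a.element) != (b.shape, b.element):
        return None  # incompatible contracts: no fragment fact survives
    return FragmentFact(a.shape, a.element, role="accumulator",
                        native_storage=a.native_storage and b.native_storage)


init = FragmentFact((16, 16), "f32", "init", native_storage=False)
result = FragmentFact((16, 16), "f32", "result", native_storage=True)
joined = meet(init, result)
print(joined)
```

The joined fact still identifies an accumulator fragment, so a later consumer in the loop body does not see an ordinary vector in this model.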
This branch tightens the feedback loop between Loom's AMDGPU codegen, its report surfaces, and the benchmark/check tooling agents use to understand failures. The common theme is turning ambiguous backend behavior into structured, attributable evidence while also fixing several codegen contracts that were making that evidence noisy: scheduler pressure cliffs, storage lease release, native fragment layout coverage, configured targetless AMDGPU lowering, benchmark planning diagnostics, and compact report identity.
The AMDGPU scheduler now treats pressure cliffs as a first-class scheduling concern before local stall hiding. Resource-stall scheduling still tries to hide latency, but not at the cost of stepping over a known pressure cliff that will make the allocation or occupancy story worse. The focused schedule regression keeps that policy visible in the low scheduler, and the checked benchmark coverage gives us a larger end-to-end shape where the policy matters.
Allocation can now release storage leases through the coalescing path in the same spirit as the normal append path. That closes a real asymmetry where structurally reusable storage remained live simply because the value moved through a coalesced route, which inflated pressure and could make otherwise valid kernels look worse than their WYSIWYG schedule implied.
The AMDGPU report surface gets more complete and more machine-readable. Packet JSON checks cover AMDGPU packet structures, structural values with null locations are represented cleanly, and compile reports now retain register-class pressure summaries independently from detailed pressure rows. That means summary-mode target_resources can report nonzero scalar/vector pressure peaks without forcing detailed row retention, while details mode still exposes the full pressure row surface. Compact iree-benchmark-loom snapshots also now attach the representative benchmark and case names to completed physical work_items[], matching planned work items and logical benchmark rows so agents no longer need positional joins to explain completed work.
Benchmark/test tooling now reports planning and compile-time problems closer to the row that caused them. Check/testbench planning issues flow through structured tool output, benchmark compile rejections stay local to the affected work item instead of poisoning the whole run, and the shared testbench issue-report helper gives both iree-test-loom and iree-benchmark-loom the same diagnostic vocabulary for unsupported cases. This keeps large benchmark sweeps useful even when one dynamic launch/configuration path is not executable yet.
Configured targetless AMDGPU kernels are covered directly. The compiler can use the invocation target for HAL roots without requiring a source-level target attribute, including kernels whose launch geometry comes from configured values. That keeps authored kernels generic while still letting invocation-time target/config selection specialize them.
AMDGPU native fragment layout coverage now includes the RDNA3 wave64 half-fragment LDS roundtrip. This guards the lane/register layout that fragment-result lowering relies on and gives us a checked benchmark-style fixture for a class of layout bugs that can otherwise survive all-ones reductions and fail only under layout-sensitive data.
The HAL AMDGPU kernarg materialization path no longer ties kernarg loads unnecessarily. The low HAL materialization fixture now covers the actual lowering behavior without adding a broad target-specific C++ IR blob, and the change keeps the compiler focused on the kernel ABI contract rather than test-only root assumptions.
The branch also cleans up a few pieces of supporting infrastructure. The loomc C bindings use clearer ownership signaling around released artifacts/modules/targets, the status bridge exemption keeps clang-tidy focused on real status misuse rather than intentional ownership transfer, and Bazel-to-CMake conversion now handles location arguments without expanding one-line Loom check suites into noisy CMake churn.