Skip to content

amdgpu: make the AQL queue doorbell kind-aware#56

Open
zjgarvey wants to merge 4 commits into
mainfrom
users/zjgarvey/fix/aql-doorbell-kind-aware
Open

amdgpu: make the AQL queue doorbell kind-aware#56
zjgarvey wants to merge 4 commits into
mainfrom
users/zjgarvey/fix/aql-doorbell-kind-aware

Conversation

@zjgarvey

@zjgarvey zjgarvey commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Nightly ROCm on MI300X returns a USER-kind queue doorbell signal (not DOORBELL-kind), so the cached hardware_doorbell_ptr union slot is a signal value, not a writable MMIO address; the raw atomic store to it faulted — every H2D copy / kernel dispatch SEGV'd. Keep the raw-MMIO fast path for DOORBELL-kind; fall back to hsa_signal_store_screlease (system-scope release) for USER/interrupt-kind.

Review

Agent review confirmed the HSA release-ordering correctness and that the fast path is byte-for-byte preserved, and caught a [blocker]: the aql_ring_initialize signature change left a second caller (pm4_dispatch_live_test.cc) on the old form — fixed (so the tests-enabled build compiles). Doc comment on the now-conditionally-NULL doorbell field corrected.

Test

Paired hip-cts: iree-org/hip-cts users/zjgarvey/aql-doorbell-kind-test (32 H2D/D2H roundtrips + data integrity ring the doorbell repeatedly). Verified on MI300X (98 assertions).

🤖 Generated with Claude Code

Nightly ROCm on MI300X returns a USER-kind doorbell signal (not DOORBELL-kind),
so the cached hardware_doorbell_ptr is a signal value rather than a writable
MMIO address; the raw atomic store to it faulted, SEGV-ing every H2D copy and
kernel dispatch. Keep the raw-MMIO fast path for DOORBELL-kind doorbells; fall
back to hsa_signal_store_screlease for USER/interrupt-kind.

Peeled from the nightly-bring-up branch for isolated review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread runtime/src/iree/hal/drivers/amdgpu/util/aql_ring.h Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/util/aql_ring.h Outdated
…ibhsa to struct top

Addresses benvanik's review comments. Also clang-format and generalize
the doorbell-kind comments away from local bring-up context.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@zjgarvey zjgarvey marked this pull request as ready for review June 11, 2026 17:32
@zjgarvey zjgarvey requested review from AWoloszyn and benvanik June 11, 2026 17:34
zjgarvey and others added 2 commits June 11, 2026 13:41
Adds runtime/src/iree/hal/drivers/amdgpu/util/aql_doorbell.md: what the AQL
doorbell is, why nightly ROCm/MI300X hands back a USER-kind doorbell signal
(the union slot is a signal value, not an MMIO pointer) so the raw store
faulted, and how the kind-aware ring fixes it. Every claim cites the code.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: zjgarvey <zjgarvey@gmail.com>
This file was a local markdown file I was using to gain context, and not meant to be committed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants