(written by codex at my request)
Summary
Please consider adding a public ROCR/HSA AMD extension API for creating and
destroying runtime-owned SDMA queues whose rings can be written directly by an
application or runtime layer.
The request is not for another async copy API. hsa_amd_memory_async_copy,
hsa_amd_memory_async_copy_on_engine, and the batch copy APIs are useful when
ROCR owns copy scheduling and packet emission. The missing surface is for
clients that need to directly emit SDMA packets while still relying on ROCR for
the hard resource-management parts: KFD process VM acquisition, FMM-backed queue
control allocation, GPU mapping, queue creation/destruction, and dGPU doorbell
mapping.
The goal is to avoid forcing advanced clients to either:
- tunnel all work through hsa_amd_memory_async_copy*, losing control over
  packet sequences and cross-engine synchronization, or
- fork/copy a large chunk of libhsakmt/FMM code just to obtain a valid SDMA
  queue resource.
Existing API Surface
The current AMD extension surface already has most of the surrounding concepts:
- hsa_amd_sdma_engine_id_t names SDMA engines as one-hot masks.
- HSA_AMD_AGENT_INFO_NUM_SDMA_ENG and HSA_AMD_AGENT_INFO_NUM_SDMA_XGMI_ENG
  expose engine counts.
- hsa_amd_memory_async_copy_on_engine lets callers request a specific SDMA
  engine for runtime-managed copies.
- hsa_amd_memory_copy_engine_status and hsa_amd_memory_get_preferred_copy_engine
  expose availability and preferred routing information.
- hsa_amd_counted_queue_acquire / hsa_amd_counted_queue_release already
  establish an AMD extension pattern for queue-resource acquisition beyond
  core hsa_queue_create.
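Because the engine enum is one-hot, clients can derive engine masks arithmetically. A tiny sketch, assuming the enum keeps the documented one-hot pattern (HSA_AMD_SDMA_ENGINE_0 == 1u << 0, HSA_AMD_SDMA_ENGINE_1 == 1u << 1, and so on):

```c
#include <stdint.h>

/* One-hot mask for the N-th SDMA engine, matching the assumed
 * hsa_amd_sdma_engine_id_t pattern (HSA_AMD_SDMA_ENGINE_0 == 1u << 0). */
static inline uint32_t sdma_engine_mask(uint32_t engine_index) {
  return 1u << engine_index;
}

/* Mask covering the first num_engines engines, e.g. a count obtained from
 * HSA_AMD_AGENT_INFO_NUM_SDMA_ENG. */
static inline uint32_t sdma_all_engines_mask(uint32_t num_engines) {
  return num_engines >= 32 ? ~0u : (1u << num_engines) - 1u;
}
```

Helpers like these pair naturally with the engine counts above when building the engine_id_mask proposed later in this issue.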
However, none of those APIs return a direct SDMA ring, RPTR/WPTR storage, or
doorbell mapping to the caller. The low-level KMT API has the resource shape
needed for this (hsaKmtCreateQueueExt returning HsaQueueResource), but in
modern TheRock-style deployments the relevant KMT functions may be linked into
libhsa-runtime64.so.1 as local symbols rather than exported dynamic symbols.
The install may also ship libhsakmt.a, but linking a second static KMT copy
into a process that already loaded libhsa-runtime64.so.1 risks creating two
independent KMT/FMM global states.
The API should therefore be an exported ROCR/HSA AMD extension that reuses the
runtime's existing KMT/FMM state instead of exposing raw KFD ioctls or requiring
clients to link their own thunk instance.
API Placement
The preferred placement is hsa_ext_amd.h as an AMD extension next to the
existing SDMA copy-engine APIs and counted-queue APIs. That keeps the surface in
the place where clients already discover AMD-specific queue/resource behavior
without changing the core HSA queue contract.
Two alternatives are possible, but less attractive:
- Extending hsa_queue_create with a new queue type would blur the meaning of
  hsa_queue_t, whose visible fields and callbacks are built around AQL queue
  semantics rather than raw SDMA packet rings.
- Extending hsa_amd_counted_queue_acquire could fit the existing
  resource-pool idea, but counted queues currently return hsa_queue_t* and
  are documented around shared AQL hardware queues. A direct SDMA queue has a
  different ownership contract: callers need explicit ring, pointer, doorbell,
  engine, and packet-property information.
So the cleanest shape is a new AMD extension resource handle with create,
destroy, and get-info entry points. ROCR can still implement that handle using
the same internal engine policy and resource pooling used by
hsa_amd_memory_async_copy*.
Preferred API Shape
The most useful v1 API would be a runtime-owned resource API:
```c
typedef struct hsa_amd_sdma_queue_s {
  uint64_t handle;
} hsa_amd_sdma_queue_t;

typedef enum hsa_amd_sdma_queue_type_s {
  HSA_AMD_SDMA_QUEUE_TYPE_AUTO = 0,
  HSA_AMD_SDMA_QUEUE_TYPE_SDMA = 1,
  HSA_AMD_SDMA_QUEUE_TYPE_XGMI = 2,
} hsa_amd_sdma_queue_type_t;

typedef enum hsa_amd_sdma_queue_create_flag_s {
  HSA_AMD_SDMA_QUEUE_CREATE_NONE = 0,
  HSA_AMD_SDMA_QUEUE_CREATE_FORCE_ENGINE = 1u << 0,
} hsa_amd_sdma_queue_create_flag_t;

#define HSA_AMD_SDMA_QUEUE_CREATE_DESC_VERSION 1

typedef struct hsa_amd_sdma_queue_create_desc_s {
  uint32_t version;
  uint32_t reserved0;
  hsa_agent_t agent;
  // AUTO lets ROCR choose SDMA vs XGMI when possible. SDMA and XGMI request a
  // specific KFD queue class.
  hsa_amd_sdma_queue_type_t type;
  // Optional one-hot engine mask from hsa_amd_sdma_engine_id_t. If 0, ROCR
  // chooses an engine and reports the actual engine in queue info. If
  // HSA_AMD_SDMA_QUEUE_CREATE_FORCE_ENGINE is set, invalid or unavailable
  // engine masks fail queue creation instead of falling back.
  uint32_t engine_id_mask;
  // Requested ring capacity in bytes. Must be a power of two and at least the
  // KFD minimum; ROCR may round up and reports the actual capacity in info.
  size_t ring_size;
  hsa_amd_queue_priority_t priority;
  uint32_t queue_percentage;
  uint64_t flags;
  uint64_t reserved1[4];
} hsa_amd_sdma_queue_create_desc_t;

typedef enum hsa_amd_sdma_queue_info_attribute_s {
  HSA_AMD_SDMA_QUEUE_INFO_AGENT = 0,
  HSA_AMD_SDMA_QUEUE_INFO_RESOURCE = 1,
  HSA_AMD_SDMA_QUEUE_INFO_ENGINE_ID_MASK = 2,
  HSA_AMD_SDMA_QUEUE_INFO_QUEUE_ID = 3,
  HSA_AMD_SDMA_QUEUE_INFO_PACKET_PROPERTIES = 4,
} hsa_amd_sdma_queue_info_attribute_t;

#define HSA_AMD_SDMA_QUEUE_RESOURCE_VERSION 1

typedef struct hsa_amd_sdma_queue_resource_s {
  uint32_t version;
  uint32_t reserved0;
  // CPU-visible byte ring. The caller writes SDMA packet dwords here.
  void* ring_base;
  size_t ring_size;
  // CPU-visible queue pointer slots. For GFX9+ these are 64-bit byte offsets.
  volatile uint64_t* read_pointer;
  volatile uint64_t* write_pointer;
  // CPU-visible doorbell mapping. GFX9+ callers write a 64-bit byte-offset
  // WPTR. If older ASICs are supported, doorbell_size tells the caller whether
  // the doorbell is 32-bit or 64-bit.
  volatile void* doorbell;
  uint32_t doorbell_size;
  // Actual one-hot engine mask selected for this queue.
  uint32_t engine_id_mask;
  // KFD queue id for diagnostics only. Callers should not use it with raw KFD
  // ioctls unless ROCR explicitly documents that as supported.
  uint32_t queue_id;
  uint32_t reserved1;
  uint64_t reserved2[4];
} hsa_amd_sdma_queue_resource_t;

typedef struct hsa_amd_sdma_queue_packet_properties_s {
  uint32_t version;
  // Minimum packet-stream submission size in bytes, if the ASIC/firmware needs
  // padding. 0 means no additional minimum beyond packet alignment.
  uint32_t min_submission_size;
  // Packet-stream alignment in bytes. Expected to be 4 for SDMA dword packets.
  uint32_t packet_alignment;
  // True if the queue supports 64-bit atomics to system memory from SDMA.
  bool supports_atomic64;
  // True if the queue supports TRAP/event notification packets in this mode.
  bool supports_trap;
  uint8_t reserved0[2];
  uint64_t reserved1[4];
} hsa_amd_sdma_queue_packet_properties_t;

hsa_status_t HSA_API hsa_amd_sdma_queue_create(
    const hsa_amd_sdma_queue_create_desc_t* desc,
    hsa_amd_sdma_queue_t* queue);

hsa_status_t HSA_API hsa_amd_sdma_queue_destroy(
    hsa_amd_sdma_queue_t queue);

hsa_status_t HSA_API hsa_amd_sdma_queue_get_info(
    hsa_amd_sdma_queue_t queue,
    hsa_amd_sdma_queue_info_attribute_t attribute,
    void* value);
```
The names are placeholders, but the important design point is that ROCR owns
the resource and returns mapped CPU pointers for direct ring writes.
Expected Usage
Create one queue on a preferred engine:
```c
uint32_t preferred = 0;
hsa_status_t status =
    hsa_amd_memory_get_preferred_copy_engine(dst_agent, src_agent, &preferred);
if (status != HSA_STATUS_SUCCESS || preferred == 0) {
  preferred = HSA_AMD_SDMA_ENGINE_0;
}

hsa_amd_sdma_queue_create_desc_t desc = {
    .version = HSA_AMD_SDMA_QUEUE_CREATE_DESC_VERSION,
    .agent = dst_agent,
    .type = HSA_AMD_SDMA_QUEUE_TYPE_AUTO,
    .engine_id_mask = preferred,
    .ring_size = 1u << 20,
    .priority = HSA_AMD_QUEUE_PRIORITY_NORMAL,
    .queue_percentage = 100,
};
hsa_amd_sdma_queue_t queue = {0};
HSA_CHECK(hsa_amd_sdma_queue_create(&desc, &queue));

hsa_amd_sdma_queue_resource_t resource = {
    .version = HSA_AMD_SDMA_QUEUE_RESOURCE_VERSION,
};
HSA_CHECK(hsa_amd_sdma_queue_get_info(
    queue, HSA_AMD_SDMA_QUEUE_INFO_RESOURCE, &resource));
```
Submit direct packets:
```c
uint64_t write_pointer = *resource.write_pointer;
uint8_t* packet_base =
    (uint8_t*)resource.ring_base + (write_pointer & (resource.ring_size - 1));

// Caller emits SDMA packet dwords into packet_base and pads wrap/minimum
// submission size according to packet properties.
size_t packet_bytes = emit_sdma_copy_linear(packet_base, dst, src, size);
write_pointer += packet_bytes;

__atomic_store_n(resource.write_pointer, write_pointer, __ATOMIC_RELEASE);
if (resource.doorbell_size == sizeof(uint64_t)) {
  __atomic_store_n((volatile uint64_t*)resource.doorbell, write_pointer,
                   __ATOMIC_RELEASE);
} else {
  __atomic_store_n((volatile uint32_t*)resource.doorbell,
                   (uint32_t)write_pointer, __ATOMIC_RELEASE);
}
```
Destroy after the caller has drained its own work:
```c
wait_until_sdma_ring_idle(resource.read_pointer, write_pointer, timeout_ns);
HSA_CHECK(hsa_amd_sdma_queue_destroy(queue));
```
Required Semantics
The API should make the following contracts explicit:
- ROCR owns KFD queue creation and all KMT/FMM backing state for the returned
  resource.
- Returned ring_base, read_pointer, write_pointer, and doorbell pointers
  remain valid until hsa_amd_sdma_queue_destroy or hsa_shut_down, whichever
  comes first.
- The caller owns ring serialization. ROCR does not inspect or schedule packets
  written to this queue.
- read_pointer and write_pointer units are specified by the API. For GFX9+
  SDMA queues, the desired contract is 64-bit byte offsets.
- The caller is responsible for ring-space checks, wrap padding, packet
  alignment, memory ordering, doorbell writes, dependency packets, completion
  packets, and avoiding dependency cycles.
- The caller must not destroy a queue with in-flight work unless ROCR documents
  a bounded drain/reset behavior for that case.
- Queue creation consumes a real SDMA queue resource. Interactions with
  hsa_amd_memory_async_copy, hsa_amd_memory_async_copy_on_engine, and
  hsa_amd_memory_copy_engine_status should be documented. At minimum, ROCR
  should not hand the same KFD queue resource to its internal async-copy stack
  while it is held by a direct SDMA queue handle.
- The API should document whether direct SDMA queues survive GPU reset/queue
  eviction and what error status clients should expect after a reset.
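To make the caller-side obligations concrete, the ring-space check and wrap padding can be sketched as below; SDMA_OP_NOP == 0 is an assumption from public SDMA packet headers, and a real client would additionally honor min_submission_size from the packet properties:

```c
#include <stddef.h>
#include <stdint.h>

#define SDMA_OP_NOP 0u /* assumed NOP opcode; verify per ASIC */

/* Free bytes in a power-of-two ring, given the monotonically increasing
 * 64-bit byte offsets proposed for GFX9+ queues. */
static inline uint64_t ring_free_bytes(uint64_t rptr, uint64_t wptr,
                                       uint64_t ring_size) {
  return ring_size - (wptr - rptr);
}

/* If packet_bytes would straddle the ring end, fill the tail with
 * single-dword NOP packets so the packet can start at offset 0.
 * Returns the advanced write pointer (before the packet itself). */
static uint64_t pad_to_wrap(uint32_t* ring_base, uint64_t wptr,
                            uint64_t ring_size, size_t packet_bytes) {
  const uint64_t offset = wptr & (ring_size - 1);
  if (offset + packet_bytes <= ring_size) return wptr; /* fits contiguously */
  for (uint64_t pad = offset; pad < ring_size; pad += sizeof(uint32_t))
    ring_base[pad / sizeof(uint32_t)] = SDMA_OP_NOP;
  return wptr + (ring_size - offset);
}
```

The caller would first confirm ring_free_bytes covers the packet plus any wrap padding before emitting anything.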
Nice-To-Have Additions
These are useful but not required for the first usable API:
```c
hsa_status_t HSA_API hsa_amd_sdma_queue_wait_idle(
    hsa_amd_sdma_queue_t queue, uint64_t timeout_ns);

hsa_status_t HSA_API hsa_amd_sdma_queue_ring_doorbell(
    hsa_amd_sdma_queue_t queue, uint64_t write_pointer);
```
hsa_amd_sdma_queue_wait_idle would give clients a runtime-owned bounded idle
check for teardown diagnostics. hsa_amd_sdma_queue_ring_doorbell would let
clients avoid hard-coding 32-bit vs 64-bit doorbell writes while still retaining
direct packet control. Neither should be necessary if the resource struct
exposes enough information.
Another useful adjunct would be a documented way to obtain a signal value
address that SDMA may poll or update. hsa_amd_signal_value_pointer exists, but
its current constraints make it awkward as the only bridge between direct SDMA
packets and HSA signal waits. This can be separate from queue creation.
Why Not Just Export hsaKmtCreateQueueExt?
Exporting hsaKmtCreateQueueExt alone is not enough for modern users. The
queue-create ioctl is the visible end of a larger runtime-owned path:
- KFD must be opened and initialized in the same KMT context.
- The process VM must be acquired from the matching DRM render node.
- Queue control memory must be allocated, mapped, and tracked through KMT/FMM.
- dGPU doorbells usually go through the GPUVM doorbell path, not just a legacy
/dev/kfd mmap.
- The returned queue resource must remain consistent with ROCR's own lifetime
and shutdown behavior.
A public direct-SDMA extension API can preserve that ownership while still
giving callers the only thing they need for direct control: a valid SDMA ring,
RPTR/WPTR memory, and doorbell mapping.
Non-Goals
- No packet builder API in v1. Clients can emit SDMA packets themselves once
they have a valid queue resource and packet-property information.
- No runtime-managed copy scheduling in this API. That remains the job of
hsa_amd_memory_async_copy*.
- No implicit semaphore or signal protocol. Clients using direct SDMA queues
own their dependency and completion packets.
- No requirement to expose raw KFD file descriptors or doorbell mmap offsets.
The goal is to avoid depending on those implementation details.
Minimal Acceptance Test
A useful validation test for the new API would be:
- Initialize HSA and select one GPU agent.
- Create a direct SDMA queue with a 1 MiB runtime-owned ring.
- Query the resource and packet properties.
- Emit one bounded FENCE packet that writes a host-visible completion word.
- Publish WPTR, ring the doorbell, and wait on the completion word with a
finite host timeout.
- Destroy the queue cleanly.
That would prove the API is enough for clients to safely bring up direct SDMA
packet emission without copying KMT/FMM internals.
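The bounded FENCE step in that test could be emitted with a sketch like this; the SDMA_OP_FENCE value and 4-dword layout are assumptions taken from public GFX9 SDMA packet definitions and would need verification per ASIC:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SDMA_OP_FENCE 5u /* assumed GFX9 fence opcode; verify per ASIC */

/* Emit a 4-dword SDMA fence packet that writes data to the dword at
 * completion_va once prior packets on the queue complete.
 * Returns bytes written to the ring. */
static size_t emit_sdma_fence(void* packet_base, uint64_t completion_va,
                              uint32_t data) {
  uint32_t dw[4];
  dw[0] = SDMA_OP_FENCE;                   /* header */
  dw[1] = (uint32_t)completion_va;         /* addr lo */
  dw[2] = (uint32_t)(completion_va >> 32); /* addr hi */
  dw[3] = data;                            /* fence value */
  memcpy(packet_base, dw, sizeof(dw));
  return sizeof(dw);
}
```

The host side of the test would then poll the completion word with a finite timeout before destroying the queue.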
Operating System
No response
GPU
No response
ROCm Component
ROCR