[Feature]: ROCR SDMA Direct Queue Creation API Request #4771

@benvanik

Description

(written by codex at my request)

Summary

Please consider adding a public ROCR/HSA AMD extension API for creating and
destroying runtime-owned SDMA queues whose rings can be written directly by an
application or runtime layer.

The request is not for another async copy API. hsa_amd_memory_async_copy,
hsa_amd_memory_async_copy_on_engine, and the batch copy APIs are useful when
ROCR owns copy scheduling and packet emission. The missing surface is for
clients that need to directly emit SDMA packets while still relying on ROCR for
the hard resource-management parts: KFD process VM acquisition, FMM-backed queue
control allocation, GPU mapping, queue creation/destruction, and dGPU doorbell
mapping.

The goal is to avoid forcing advanced clients to either:

  • tunnel all work through hsa_amd_memory_async_copy*, losing control over
    packet sequences and cross-engine synchronization, or
  • fork/copy a large chunk of libhsakmt/FMM code just to obtain a valid SDMA
    queue resource.

Existing API Surface

The current AMD extension surface already has most of the surrounding concepts:

  • hsa_amd_sdma_engine_id_t names SDMA engines as one-hot masks.
  • HSA_AMD_AGENT_INFO_NUM_SDMA_ENG and
    HSA_AMD_AGENT_INFO_NUM_SDMA_XGMI_ENG expose engine counts.
  • hsa_amd_memory_async_copy_on_engine lets callers request a specific SDMA
    engine for runtime-managed copies.
  • hsa_amd_memory_copy_engine_status and
    hsa_amd_memory_get_preferred_copy_engine expose availability and preferred
    routing information.
  • hsa_amd_counted_queue_acquire / hsa_amd_counted_queue_release already
    establish an AMD extension pattern for queue-resource acquisition beyond
    core hsa_queue_create.

However, none of those APIs return a direct SDMA ring, RPTR/WPTR storage, or
doorbell mapping to the caller. The low-level KMT API has the resource shape
needed for this (hsaKmtCreateQueueExt returning HsaQueueResource), but in
modern TheRock-style deployments the relevant KMT functions may be linked into
libhsa-runtime64.so.1 as local symbols rather than exported dynamic symbols.
The install may also ship libhsakmt.a, but linking a second static KMT copy
into a process that already loaded libhsa-runtime64.so.1 risks creating two
independent KMT/FMM global states.

The API should therefore be an exported ROCR/HSA AMD extension that reuses the
runtime's existing KMT/FMM state instead of exposing raw KFD ioctls or requiring
clients to link their own thunk instance.

API Placement

The preferred placement is hsa_ext_amd.h as an AMD extension next to the
existing SDMA copy-engine APIs and counted-queue APIs. That keeps the surface in
the place where clients already discover AMD-specific queue/resource behavior
without changing the core HSA queue contract.

Two alternatives are possible, but less attractive:

  • Extending hsa_queue_create with a new queue type would blur the meaning of
    hsa_queue_t, whose visible fields and callbacks are built around AQL queue
    semantics rather than raw SDMA packet rings.
  • Extending hsa_amd_counted_queue_acquire could fit the existing
    resource-pool idea, but counted queues currently return hsa_queue_t* and
    are documented around shared AQL hardware queues. A direct SDMA queue has a
    different ownership contract: callers need explicit ring, pointer, doorbell,
    engine, and packet-property information.

So the cleanest shape is a new AMD extension resource handle with create,
destroy, and get-info entry points. ROCR can still implement that handle using
the same internal engine policy and resource pooling used by
hsa_amd_memory_async_copy*.

Preferred API Shape

The most useful v1 API would be a runtime-owned resource API:

typedef struct hsa_amd_sdma_queue_s {
  uint64_t handle;
} hsa_amd_sdma_queue_t;

typedef enum hsa_amd_sdma_queue_type_s {
  HSA_AMD_SDMA_QUEUE_TYPE_AUTO = 0,
  HSA_AMD_SDMA_QUEUE_TYPE_SDMA = 1,
  HSA_AMD_SDMA_QUEUE_TYPE_XGMI = 2,
} hsa_amd_sdma_queue_type_t;

typedef enum hsa_amd_sdma_queue_create_flag_s {
  HSA_AMD_SDMA_QUEUE_CREATE_NONE = 0,
  HSA_AMD_SDMA_QUEUE_CREATE_FORCE_ENGINE = 1u << 0,
} hsa_amd_sdma_queue_create_flag_t;

#define HSA_AMD_SDMA_QUEUE_CREATE_DESC_VERSION 1

typedef struct hsa_amd_sdma_queue_create_desc_s {
  uint32_t version;
  uint32_t reserved0;

  hsa_agent_t agent;

  // AUTO lets ROCR choose SDMA vs XGMI when possible. SDMA and XGMI request a
  // specific KFD queue class.
  hsa_amd_sdma_queue_type_t type;

  // Optional one-hot engine mask from hsa_amd_sdma_engine_id_t. If 0, ROCR
  // chooses an engine and reports the actual engine in queue info. If
  // HSA_AMD_SDMA_QUEUE_CREATE_FORCE_ENGINE is set, invalid or unavailable
  // engine masks fail queue creation instead of falling back.
  uint32_t engine_id_mask;

  // Requested ring capacity in bytes. Must be a power of two and at least the
  // KFD minimum; ROCR may round up and reports the actual capacity in info.
  size_t ring_size;

  hsa_amd_queue_priority_t priority;
  uint32_t queue_percentage;
  uint64_t flags;
  uint64_t reserved1[4];
} hsa_amd_sdma_queue_create_desc_t;

typedef enum hsa_amd_sdma_queue_info_attribute_s {
  HSA_AMD_SDMA_QUEUE_INFO_AGENT = 0,
  HSA_AMD_SDMA_QUEUE_INFO_RESOURCE = 1,
  HSA_AMD_SDMA_QUEUE_INFO_ENGINE_ID_MASK = 2,
  HSA_AMD_SDMA_QUEUE_INFO_QUEUE_ID = 3,
  HSA_AMD_SDMA_QUEUE_INFO_PACKET_PROPERTIES = 4,
} hsa_amd_sdma_queue_info_attribute_t;

#define HSA_AMD_SDMA_QUEUE_RESOURCE_VERSION 1

typedef struct hsa_amd_sdma_queue_resource_s {
  uint32_t version;
  uint32_t reserved0;

  // CPU-visible byte ring. The caller writes SDMA packet dwords here.
  void* ring_base;
  size_t ring_size;

  // CPU-visible queue pointer slots. For GFX9+ these are 64-bit byte offsets.
  volatile uint64_t* read_pointer;
  volatile uint64_t* write_pointer;

  // CPU-visible doorbell mapping. GFX9+ callers write a 64-bit byte-offset
  // WPTR. If older ASICs are supported, doorbell_size tells the caller whether
  // the doorbell is 32-bit or 64-bit.
  volatile void* doorbell;
  uint32_t doorbell_size;

  // Actual one-hot engine mask selected for this queue.
  uint32_t engine_id_mask;

  // KFD queue id for diagnostics only. Callers should not use it with raw KFD
  // ioctls unless ROCR explicitly documents that as supported.
  uint32_t queue_id;
  uint32_t reserved1;

  uint64_t reserved2[4];
} hsa_amd_sdma_queue_resource_t;

typedef struct hsa_amd_sdma_queue_packet_properties_s {
  uint32_t version;

  // Minimum packet-stream submission size in bytes, if the ASIC/firmware needs
  // padding. 0 means no additional minimum beyond packet alignment.
  uint32_t min_submission_size;

  // Packet-stream alignment in bytes. Expected to be 4 for SDMA dword packets.
  uint32_t packet_alignment;

  // True if the queue supports 64-bit atomics to system memory from SDMA.
  bool supports_atomic64;

  // True if the queue supports TRAP/event notification packets in this mode.
  bool supports_trap;

  uint8_t reserved0[2];
  uint64_t reserved1[4];
} hsa_amd_sdma_queue_packet_properties_t;

hsa_status_t HSA_API hsa_amd_sdma_queue_create(
    const hsa_amd_sdma_queue_create_desc_t* desc,
    hsa_amd_sdma_queue_t* queue);

hsa_status_t HSA_API hsa_amd_sdma_queue_destroy(
    hsa_amd_sdma_queue_t queue);

hsa_status_t HSA_API hsa_amd_sdma_queue_get_info(
    hsa_amd_sdma_queue_t queue,
    hsa_amd_sdma_queue_info_attribute_t attribute,
    void* value);

The names are placeholders, but the important design point is that ROCR owns
the resource and returns mapped CPU pointers for direct ring writes.

Expected Usage

Create one queue on a preferred engine:

uint32_t preferred = 0;
hsa_status_t status =
    hsa_amd_memory_get_preferred_copy_engine(dst_agent, src_agent, &preferred);
if (status != HSA_STATUS_SUCCESS || preferred == 0) {
  preferred = HSA_AMD_SDMA_ENGINE_0;
}

hsa_amd_sdma_queue_create_desc_t desc = {
    .version = HSA_AMD_SDMA_QUEUE_CREATE_DESC_VERSION,
    .agent = dst_agent,
    .type = HSA_AMD_SDMA_QUEUE_TYPE_AUTO,
    .engine_id_mask = preferred,
    .ring_size = 1u << 20,
    .priority = HSA_AMD_QUEUE_PRIORITY_NORMAL,
    .queue_percentage = 100,
};

hsa_amd_sdma_queue_t queue = {0};
HSA_CHECK(hsa_amd_sdma_queue_create(&desc, &queue));

hsa_amd_sdma_queue_resource_t resource = {
    .version = HSA_AMD_SDMA_QUEUE_RESOURCE_VERSION,
};
HSA_CHECK(hsa_amd_sdma_queue_get_info(
    queue, HSA_AMD_SDMA_QUEUE_INFO_RESOURCE, &resource));

Submit direct packets:

uint64_t write_pointer = *resource.write_pointer;
uint8_t* packet_base =
    (uint8_t*)resource.ring_base + (write_pointer & (resource.ring_size - 1));

// Caller emits SDMA packet dwords into packet_base and pads wrap/minimum
// submission size according to packet properties.
size_t packet_bytes = emit_sdma_copy_linear(packet_base, dst, src, size);

write_pointer += packet_bytes;
__atomic_store_n(resource.write_pointer, write_pointer, __ATOMIC_RELEASE);

if (resource.doorbell_size == sizeof(uint64_t)) {
  __atomic_store_n((volatile uint64_t*)resource.doorbell, write_pointer,
                   __ATOMIC_RELEASE);
} else {
  __atomic_store_n((volatile uint32_t*)resource.doorbell,
                   (uint32_t)write_pointer, __ATOMIC_RELEASE);
}
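
The emit_sdma_copy_linear helper above is caller-provided. A minimal sketch of what it might look like for a GFX9-class COPY LINEAR packet follows; the opcode values and dword layout here are assumptions for illustration and should be verified against the target ASIC's SDMA packet documentation:

```c
#include <stdint.h>
#include <string.h>

// Assumed GFX9-style opcode encoding; verify against the target ASIC.
#define SDMA_OP_COPY 1u
#define SDMA_SUBOP_COPY_LINEAR 0u

// Emits one linear copy packet at packet_base and returns its size in bytes.
static size_t emit_sdma_copy_linear(void* packet_base, const void* dst,
                                    const void* src, uint32_t size) {
  uint32_t dw[7];
  dw[0] = SDMA_OP_COPY | (SDMA_SUBOP_COPY_LINEAR << 8);  // header dword
  dw[1] = size - 1;  // byte count, encoded as count-1 on GFX9-class SDMA
  dw[2] = 0;         // parameter dword (no cache/override bits set)
  dw[3] = (uint32_t)((uintptr_t)src & 0xFFFFFFFFu);  // src address lo
  dw[4] = (uint32_t)((uintptr_t)src >> 32);          // src address hi
  dw[5] = (uint32_t)((uintptr_t)dst & 0xFFFFFFFFu);  // dst address lo
  dw[6] = (uint32_t)((uintptr_t)dst >> 32);          // dst address hi
  memcpy(packet_base, dw, sizeof(dw));
  return sizeof(dw);  // 7 dwords = 28 bytes
}
```

The returned byte count is what the caller adds to its shadow write pointer before publishing WPTR and ringing the doorbell.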

Destroy after the caller has drained its own work:

wait_until_sdma_ring_idle(resource.read_pointer, write_pointer, timeout_ns);
HSA_CHECK(hsa_amd_sdma_queue_destroy(queue));

Required Semantics

The API should make the following contracts explicit:

  • ROCR owns KFD queue creation and all KMT/FMM backing state for the returned
    resource.
  • Returned ring_base, read_pointer, write_pointer, and doorbell
    pointers remain valid until hsa_amd_sdma_queue_destroy or hsa_shut_down,
    whichever comes first.
  • The caller owns ring serialization. ROCR does not inspect or schedule packets
    written to this queue.
  • read_pointer and write_pointer units are specified by the API. For GFX9+
    SDMA queues, the desired contract is 64-bit byte offsets.
  • The caller is responsible for ring-space checks, wrap padding, packet
    alignment, memory ordering, doorbell writes, dependency packets, completion
    packets, and avoiding dependency cycles.
  • The caller must not destroy a queue with in-flight work unless ROCR documents
    a bounded drain/reset behavior for that case.
  • Queue creation consumes a real SDMA queue resource. Interactions with
    hsa_amd_memory_async_copy, hsa_amd_memory_async_copy_on_engine, and
    hsa_amd_memory_copy_engine_status should be documented. At minimum, ROCR
    should not hand the same KFD queue resource to its internal async-copy stack
    while it is held by a direct SDMA queue handle.
  • The API should document whether direct SDMA queues survive GPU reset/queue
    eviction and what error status clients should expect after a reset.
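
The ring-space and wrap-padding obligations above can be sketched as a caller-side reservation helper. This assumes a power-of-two ring, monotonically increasing byte-offset pointers, and a NOP opcode of 0 (an assumption to verify per ASIC):

```c
#include <stddef.h>
#include <stdint.h>

#define SDMA_OP_NOP 0u  // assumed NOP header encoding; verify per ASIC

// Reserves `bytes` of contiguous ring space. If the packet would wrap, the
// tail is padded with NOP dwords and *wptr is advanced past the pad. Spins
// until the engine frees enough space; a real client would bound the wait.
static void* reserve_contiguous(void* ring_base, size_t ring_size,
                                volatile uint64_t* read_pointer,
                                uint64_t* wptr, size_t bytes) {
  uint64_t offset = *wptr & (ring_size - 1);
  if (offset + bytes > ring_size) {
    size_t pad = ring_size - offset;  // pad tail so the packet stays contiguous
    uint32_t* tail = (uint32_t*)((uint8_t*)ring_base + offset);
    for (size_t i = 0; i < pad / 4; ++i) tail[i] = SDMA_OP_NOP;
    *wptr += pad;
    offset = 0;
  }
  // used bytes = wptr - rptr; wait until used + bytes fits in the ring.
  while (*wptr + bytes - __atomic_load_n(read_pointer, __ATOMIC_ACQUIRE) >
         ring_size) {
    // busy-wait for the engine to consume packets
  }
  return (uint8_t*)ring_base + offset;
}
```

The caller emits packets at the returned pointer, advances its shadow WPTR by the emitted size, and then publishes WPTR and the doorbell as in the submission example above.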

Nice-To-Have Additions

These are useful but not required for the first usable API:

hsa_status_t HSA_API hsa_amd_sdma_queue_wait_idle(
    hsa_amd_sdma_queue_t queue, uint64_t timeout_ns);

hsa_status_t HSA_API hsa_amd_sdma_queue_ring_doorbell(
    hsa_amd_sdma_queue_t queue, uint64_t write_pointer);

hsa_amd_sdma_queue_wait_idle would give clients a runtime-owned bounded idle
check for teardown diagnostics. hsa_amd_sdma_queue_ring_doorbell would let
clients avoid hard-coding 32-bit vs 64-bit doorbell writes while still retaining
direct packet control. Neither should be necessary if the resource struct
exposes enough information.

Another useful adjunct would be a documented way to obtain a signal value
address that SDMA may poll or update. hsa_amd_signal_value_pointer exists, but
its current constraints make it awkward as the only bridge between direct SDMA
packets and HSA signal waits. This can be separate from queue creation.

Why Not Just Export hsaKmtCreateQueueExt?

Exporting hsaKmtCreateQueueExt alone is not enough for modern users. The
queue-create ioctl is the visible end of a larger runtime-owned path:

  • KFD must be opened and initialized in the same KMT context.
  • The process VM must be acquired from the matching DRM render node.
  • Queue control memory must be allocated, mapped, and tracked through KMT/FMM.
  • dGPU doorbells usually go through the GPUVM doorbell path, not just a legacy
    /dev/kfd mmap.
  • The returned queue resource must remain consistent with ROCR's own lifetime
    and shutdown behavior.

A public direct-SDMA extension API can preserve that ownership while still
giving callers the only thing they need for direct control: a valid SDMA ring,
RPTR/WPTR memory, and doorbell mapping.

Non-Goals

  • No packet builder API in v1. Clients can emit SDMA packets themselves once
    they have a valid queue resource and packet-property information.
  • No runtime-managed copy scheduling in this API. That remains the job of
    hsa_amd_memory_async_copy*.
  • No implicit semaphore or signal protocol. Clients using direct SDMA queues
    own their dependency and completion packets.
  • No requirement to expose raw KFD file descriptors or doorbell mmap offsets.
    The goal is to avoid depending on those implementation details.

Minimal Acceptance Test

A useful validation test for the new API would be:

  1. Initialize HSA and select one GPU agent.
  2. Create a direct SDMA queue with a 1 MiB runtime-owned ring.
  3. Query the resource and packet properties.
  4. Emit one bounded FENCE packet that writes a host-visible completion word.
  5. Publish WPTR, ring the doorbell, and wait on the completion word with a
    finite host timeout.
  6. Destroy the queue cleanly.

That would prove the API is enough for clients to safely bring up direct SDMA
packet emission without copying KMT/FMM internals.
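
Step 4 of the test needs a bounded FENCE emitter. A sketch of a GFX9-style FENCE packet that stores a completion word to host-visible memory; the opcode value and 4-dword layout are assumptions to check against the target ASIC's SDMA packet documentation:

```c
#include <stdint.h>
#include <string.h>

#define SDMA_OP_FENCE 5u  // assumed GFX9-style fence opcode; verify per ASIC

// Emits a fence packet at packet_base that writes `data` to `addr` when the
// engine retires the packet. Returns the packet size in bytes.
static size_t emit_sdma_fence(void* packet_base, uint64_t addr, uint32_t data) {
  uint32_t dw[4];
  dw[0] = SDMA_OP_FENCE;                   // header dword, sub-op 0
  dw[1] = (uint32_t)(addr & 0xFFFFFFFFu);  // completion word address lo
  dw[2] = (uint32_t)(addr >> 32);          // completion word address hi
  dw[3] = data;                            // value the engine will store
  memcpy(packet_base, dw, sizeof(dw));
  return sizeof(dw);  // 4 dwords = 16 bytes
}
```

The host side of step 5 is then a finite-timeout poll on the completion word, analogous to the idle-wait helper described earlier.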

Operating System

No response

GPU

No response

ROCm Component

ROCR
