(written by codex at my request)
Summary
Please consider adding a public ROCR/HSA AMD extension API for creating and
destroying runtime-owned SDMA queues whose rings can be written directly by an
application or runtime layer.
The request is not for another async copy API. hsa_amd_memory_async_copy,
hsa_amd_memory_async_copy_on_engine, and the batch copy APIs are useful when
ROCR owns copy scheduling and packet emission. The missing surface is for
clients that need to directly emit SDMA packets while still relying on ROCR for
the hard resource-management parts: KFD process VM acquisition, FMM-backed queue
control allocation, GPU mapping, queue creation/destruction, and dGPU doorbell
mapping.
The goal is to avoid forcing advanced clients to either:
- tunnel all work through hsa_amd_memory_async_copy*, losing control over
  packet sequences and cross-engine synchronization, or
- fork/copy a large chunk of libhsakmt/FMM code just to obtain a valid SDMA
  queue resource.
Existing API Surface
The current AMD extension surface already has most of the surrounding concepts:
- hsa_amd_sdma_engine_id_t names SDMA engines as one-hot masks.
- HSA_AMD_AGENT_INFO_NUM_SDMA_ENG and HSA_AMD_AGENT_INFO_NUM_SDMA_XGMI_ENG
  expose engine counts.
- hsa_amd_memory_async_copy_on_engine lets callers request a specific SDMA
  engine for runtime-managed copies.
- hsa_amd_memory_copy_engine_status and hsa_amd_memory_get_preferred_copy_engine
  expose availability and preferred routing information.
- hsa_amd_counted_queue_acquire / hsa_amd_counted_queue_release already
  establish an AMD extension pattern for queue-resource acquisition beyond
  core hsa_queue_create.
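Because the engine enum is one-hot, clients can derive engine masks arithmetically. A tiny sketch, assuming the enum keeps the documented one-hot pattern (HSA_AMD_SDMA_ENGINE_0 == 1u << 0, HSA_AMD_SDMA_ENGINE_1 == 1u << 1, and so on):

```c
#include <stdint.h>

/* One-hot mask for the N-th SDMA engine, matching the assumed
 * hsa_amd_sdma_engine_id_t pattern (HSA_AMD_SDMA_ENGINE_0 == 1u << 0). */
static inline uint32_t sdma_engine_mask(uint32_t engine_index) {
  return 1u << engine_index;
}

/* Mask covering the first num_engines engines, e.g. a count obtained from
 * HSA_AMD_AGENT_INFO_NUM_SDMA_ENG. */
static inline uint32_t sdma_all_engines_mask(uint32_t num_engines) {
  return num_engines >= 32 ? ~0u : (1u << num_engines) - 1u;
}
```

Helpers like these pair naturally with the engine counts above when building the engine_id_mask proposed later in this issue.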
However, none of those APIs return a direct SDMA ring, RPTR/WPTR storage, or
doorbell mapping to the caller. The low-level KMT API has the resource shape
needed for this (hsaKmtCreateQueueExt returning HsaQueueResource), but in
modern TheRock-style deployments the relevant KMT functions may be linked into
libhsa-runtime64.so.1 as local symbols rather than exported dynamic symbols.
The install may also ship libhsakmt.a, but linking a second static KMT copy
into a process that already loaded libhsa-runtime64.so.1 risks creating two
independent KMT/FMM global states.
The API should therefore be an exported ROCR/HSA AMD extension that reuses the
runtime's existing KMT/FMM state instead of exposing raw KFD ioctls or requiring
clients to link their own thunk instance.
API Placement
The preferred placement is hsa_ext_amd.h as an AMD extension next to the
existing SDMA copy-engine APIs and counted-queue APIs. That keeps the surface in
the place where clients already discover AMD-specific queue/resource behavior
without changing the core HSA queue contract.
Two alternatives are possible, but less attractive:
- Extending hsa_queue_create with a new queue type would blur the meaning of
  hsa_queue_t, whose visible fields and callbacks are built around AQL queue
  semantics rather than raw SDMA packet rings.
- Extending hsa_amd_counted_queue_acquire could fit the existing
  resource-pool idea, but counted queues currently return hsa_queue_t* and
  are documented around shared AQL hardware queues. A direct SDMA queue has a
  different ownership contract: callers need explicit ring, pointer, doorbell,
  engine, and packet-property information.
So the cleanest shape is a new AMD extension resource handle with create,
destroy, and get-info entry points. ROCR can still implement that handle using
the same internal engine policy and resource pooling used by
hsa_amd_memory_async_copy*.
Preferred API Shape
The most useful v1 API would be a runtime-owned resource API:
```c
typedef struct hsa_amd_sdma_queue_s {
  uint64_t handle;
} hsa_amd_sdma_queue_t;

typedef enum hsa_amd_sdma_queue_type_s {
  HSA_AMD_SDMA_QUEUE_TYPE_AUTO = 0,
  HSA_AMD_SDMA_QUEUE_TYPE_SDMA = 1,
  HSA_AMD_SDMA_QUEUE_TYPE_XGMI = 2,
} hsa_amd_sdma_queue_type_t;

typedef enum hsa_amd_sdma_queue_create_flag_s {
  HSA_AMD_SDMA_QUEUE_CREATE_NONE = 0,
  HSA_AMD_SDMA_QUEUE_CREATE_FORCE_ENGINE = 1u << 0,
} hsa_amd_sdma_queue_create_flag_t;

#define HSA_AMD_SDMA_QUEUE_CREATE_DESC_VERSION 1

typedef struct hsa_amd_sdma_queue_create_desc_s {
  uint32_t version;
  uint32_t reserved0;
  hsa_agent_t agent;
  // AUTO lets ROCR choose SDMA vs XGMI when possible. SDMA and XGMI request a
  // specific KFD queue class.
  hsa_amd_sdma_queue_type_t type;
  // Optional one-hot engine mask from hsa_amd_sdma_engine_id_t. If 0, ROCR
  // chooses an engine and reports the actual engine in queue info. If
  // HSA_AMD_SDMA_QUEUE_CREATE_FORCE_ENGINE is set, invalid or unavailable
  // engine masks fail queue creation instead of falling back.
  uint32_t engine_id_mask;
  // Requested ring capacity in bytes. Must be a power of two and at least the
  // KFD minimum; ROCR may round up and reports the actual capacity in info.
  size_t ring_size;
  hsa_amd_queue_priority_t priority;
  uint32_t queue_percentage;
  uint64_t flags;
  uint64_t reserved1[4];
} hsa_amd_sdma_queue_create_desc_t;

typedef enum hsa_amd_sdma_queue_info_attribute_s {
  HSA_AMD_SDMA_QUEUE_INFO_AGENT = 0,
  HSA_AMD_SDMA_QUEUE_INFO_RESOURCE = 1,
  HSA_AMD_SDMA_QUEUE_INFO_ENGINE_ID_MASK = 2,
  HSA_AMD_SDMA_QUEUE_INFO_QUEUE_ID = 3,
  HSA_AMD_SDMA_QUEUE_INFO_PACKET_PROPERTIES = 4,
} hsa_amd_sdma_queue_info_attribute_t;

#define HSA_AMD_SDMA_QUEUE_RESOURCE_VERSION 1

typedef struct hsa_amd_sdma_queue_resource_s {
  uint32_t version;
  uint32_t reserved0;
  // CPU-visible byte ring. The caller writes SDMA packet dwords here.
  void* ring_base;
  size_t ring_size;
  // CPU-visible queue pointer slots. For GFX9+ these are 64-bit byte offsets.
  volatile uint64_t* read_pointer;
  volatile uint64_t* write_pointer;
  // CPU-visible doorbell mapping. GFX9+ callers write a 64-bit byte-offset
  // WPTR. If older ASICs are supported, doorbell_size tells the caller whether
  // the doorbell is 32-bit or 64-bit.
  volatile void* doorbell;
  uint32_t doorbell_size;
  // Actual one-hot engine mask selected for this queue.
  uint32_t engine_id_mask;
  // KFD queue id for diagnostics only. Callers should not use it with raw KFD
  // ioctls unless ROCR explicitly documents that as supported.
  uint32_t queue_id;
  uint32_t reserved1;
  uint64_t reserved2[4];
} hsa_amd_sdma_queue_resource_t;

typedef struct hsa_amd_sdma_queue_packet_properties_s {
  uint32_t version;
  // Minimum packet-stream submission size in bytes, if the ASIC/firmware needs
  // padding. 0 means no additional minimum beyond packet alignment.
  uint32_t min_submission_size;
  // Packet-stream alignment in bytes. Expected to be 4 for SDMA dword packets.
  uint32_t packet_alignment;
  // True if the queue supports 64-bit atomics to system memory from SDMA.
  bool supports_atomic64;
  // True if the queue supports TRAP/event notification packets in this mode.
  bool supports_trap;
  uint8_t reserved0[2];
  uint64_t reserved1[4];
} hsa_amd_sdma_queue_packet_properties_t;

hsa_status_t HSA_API hsa_amd_sdma_queue_create(
    const hsa_amd_sdma_queue_create_desc_t* desc,
    hsa_amd_sdma_queue_t* queue);

hsa_status_t HSA_API hsa_amd_sdma_queue_destroy(
    hsa_amd_sdma_queue_t queue);

hsa_status_t HSA_API hsa_amd_sdma_queue_get_info(
    hsa_amd_sdma_queue_t queue,
    hsa_amd_sdma_queue_info_attribute_t attribute,
    void* value);
```
The names are placeholders, but the important design point is that ROCR owns
the resource and returns mapped CPU pointers for direct ring writes.
Expected Usage
Create one queue on a preferred engine:
```c
uint32_t preferred = 0;
hsa_status_t status =
    hsa_amd_memory_get_preferred_copy_engine(dst_agent, src_agent, &preferred);
if (status != HSA_STATUS_SUCCESS || preferred == 0) {
  preferred = HSA_AMD_SDMA_ENGINE_0;
}

hsa_amd_sdma_queue_create_desc_t desc = {
    .version = HSA_AMD_SDMA_QUEUE_CREATE_DESC_VERSION,
    .agent = dst_agent,
    .type = HSA_AMD_SDMA_QUEUE_TYPE_AUTO,
    .engine_id_mask = preferred,
    .ring_size = 1u << 20,
    .priority = HSA_AMD_QUEUE_PRIORITY_NORMAL,
    .queue_percentage = 100,
};
hsa_amd_sdma_queue_t queue = {0};
HSA_CHECK(hsa_amd_sdma_queue_create(&desc, &queue));

hsa_amd_sdma_queue_resource_t resource = {
    .version = HSA_AMD_SDMA_QUEUE_RESOURCE_VERSION,
};
HSA_CHECK(hsa_amd_sdma_queue_get_info(
    queue, HSA_AMD_SDMA_QUEUE_INFO_RESOURCE, &resource));
```
Submit direct packets:
```c
uint64_t write_pointer = *resource.write_pointer;
uint8_t* packet_base =
    (uint8_t*)resource.ring_base + (write_pointer & (resource.ring_size - 1));

// Caller emits SDMA packet dwords into packet_base and pads wrap/minimum
// submission size according to packet properties.
size_t packet_bytes = emit_sdma_copy_linear(packet_base, dst, src, size);
write_pointer += packet_bytes;

__atomic_store_n(resource.write_pointer, write_pointer, __ATOMIC_RELEASE);
if (resource.doorbell_size == sizeof(uint64_t)) {
  __atomic_store_n((volatile uint64_t*)resource.doorbell, write_pointer,
                   __ATOMIC_RELEASE);
} else {
  __atomic_store_n((volatile uint32_t*)resource.doorbell,
                   (uint32_t)write_pointer, __ATOMIC_RELEASE);
}
```
Destroy after the caller has drained its own work:
```c
wait_until_sdma_ring_idle(resource.read_pointer, write_pointer, timeout_ns);
HSA_CHECK(hsa_amd_sdma_queue_destroy(queue));
```
Required Semantics
The API should make the following contracts explicit:
- ROCR owns KFD queue creation and all KMT/FMM backing state for the returned
  resource.
- Returned ring_base, read_pointer, write_pointer, and doorbell pointers
  remain valid until hsa_amd_sdma_queue_destroy or hsa_shut_down, whichever
  comes first.
- The caller owns ring serialization. ROCR does not inspect or schedule packets
  written to this queue.
- read_pointer and write_pointer units are specified by the API. For GFX9+
  SDMA queues, the desired contract is 64-bit byte offsets.
- The caller is responsible for ring-space checks, wrap padding, packet
  alignment, memory ordering, doorbell writes, dependency packets, completion
  packets, and avoiding dependency cycles.
- The caller must not destroy a queue with in-flight work unless ROCR documents
  a bounded drain/reset behavior for that case.
- Queue creation consumes a real SDMA queue resource. Interactions with
  hsa_amd_memory_async_copy, hsa_amd_memory_async_copy_on_engine, and
  hsa_amd_memory_copy_engine_status should be documented. At minimum, ROCR
  should not hand the same KFD queue resource to its internal async-copy stack
  while it is held by a direct SDMA queue handle.
- The API should document whether direct SDMA queues survive GPU reset/queue
  eviction and what error status clients should expect after a reset.
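To make the caller-side obligations concrete, the ring-space check and wrap padding can be sketched as below; SDMA_OP_NOP == 0 is an assumption from public SDMA packet headers, and a real client would additionally honor min_submission_size from the packet properties:

```c
#include <stddef.h>
#include <stdint.h>

#define SDMA_OP_NOP 0u /* assumed NOP opcode; verify per ASIC */

/* Free bytes in a power-of-two ring, given the monotonically increasing
 * 64-bit byte offsets proposed for GFX9+ queues. */
static inline uint64_t ring_free_bytes(uint64_t rptr, uint64_t wptr,
                                       uint64_t ring_size) {
  return ring_size - (wptr - rptr);
}

/* If packet_bytes would straddle the ring end, fill the tail with
 * single-dword NOP packets so the packet can start at offset 0.
 * Returns the advanced write pointer (before the packet itself). */
static uint64_t pad_to_wrap(uint32_t* ring_base, uint64_t wptr,
                            uint64_t ring_size, size_t packet_bytes) {
  const uint64_t offset = wptr & (ring_size - 1);
  if (offset + packet_bytes <= ring_size) return wptr; /* fits contiguously */
  for (uint64_t pad = offset; pad < ring_size; pad += sizeof(uint32_t))
    ring_base[pad / sizeof(uint32_t)] = SDMA_OP_NOP;
  return wptr + (ring_size - offset);
}
```

The caller would first confirm ring_free_bytes covers the packet plus any wrap padding before emitting anything.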
Nice-To-Have Additions
These are useful but not required for the first usable API:
```c
hsa_status_t HSA_API hsa_amd_sdma_queue_wait_idle(
    hsa_amd_sdma_queue_t queue, uint64_t timeout_ns);

hsa_status_t HSA_API hsa_amd_sdma_queue_ring_doorbell(
    hsa_amd_sdma_queue_t queue, uint64_t write_pointer);
```
hsa_amd_sdma_queue_wait_idle would give clients a runtime-owned bounded idle
check for teardown diagnostics. hsa_amd_sdma_queue_ring_doorbell would let
clients avoid hard-coding 32-bit vs 64-bit doorbell writes while still retaining
direct packet control. Neither should be necessary if the resource struct
exposes enough information.
Another useful adjunct would be a documented way to obtain a signal value
address that SDMA may poll or update. hsa_amd_signal_value_pointer exists, but
its current constraints make it awkward as the only bridge between direct SDMA
packets and HSA signal waits. This can be separate from queue creation.
Why Not Just Export hsaKmtCreateQueueExt?
Exporting hsaKmtCreateQueueExt alone is not enough for modern users. The
queue-create ioctl is the visible end of a larger runtime-owned path:
- KFD must be opened and initialized in the same KMT context.
- The process VM must be acquired from the matching DRM render node.
- Queue control memory must be allocated, mapped, and tracked through KMT/FMM.
- dGPU doorbells usually go through the GPUVM doorbell path, not just a legacy
/dev/kfd mmap.
- The returned queue resource must remain consistent with ROCR's own lifetime
and shutdown behavior.
A public direct-SDMA extension API can preserve that ownership while still
giving callers the only thing they need for direct control: a valid SDMA ring,
RPTR/WPTR memory, and doorbell mapping.
Non-Goals
- No packet builder API in v1. Clients can emit SDMA packets themselves once
they have a valid queue resource and packet-property information.
- No runtime-managed copy scheduling in this API. That remains the job of
hsa_amd_memory_async_copy*.
- No implicit semaphore or signal protocol. Clients using direct SDMA queues
own their dependency and completion packets.
- No requirement to expose raw KFD file descriptors or doorbell mmap offsets.
The goal is to avoid depending on those implementation details.
Minimal Acceptance Test
A useful validation test for the new API would be:
- Initialize HSA and select one GPU agent.
- Create a direct SDMA queue with a 1 MiB runtime-owned ring.
- Query the resource and packet properties.
- Emit one bounded FENCE packet that writes a host-visible completion word.
- Publish WPTR, ring the doorbell, and wait on the completion word with a
finite host timeout.
- Destroy the queue cleanly.
That would prove the API is enough for clients to safely bring up direct SDMA
packet emission without copying KMT/FMM internals.
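The bounded FENCE step in that test could be emitted with a sketch like this; the SDMA_OP_FENCE value and 4-dword layout are assumptions taken from public GFX9 SDMA packet definitions and would need verification per ASIC:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SDMA_OP_FENCE 5u /* assumed GFX9 fence opcode; verify per ASIC */

/* Emit a 4-dword SDMA fence packet that writes data to the dword at
 * completion_va once prior packets on the queue complete.
 * Returns bytes written to the ring. */
static size_t emit_sdma_fence(void* packet_base, uint64_t completion_va,
                              uint32_t data) {
  uint32_t dw[4];
  dw[0] = SDMA_OP_FENCE;                   /* header */
  dw[1] = (uint32_t)completion_va;         /* addr lo */
  dw[2] = (uint32_t)(completion_va >> 32); /* addr hi */
  dw[3] = data;                            /* fence value */
  memcpy(packet_base, dw, sizeof(dw));
  return sizeof(dw);
}
```

The host side of the test would then poll the completion word with a finite timeout before destroying the queue.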
Operating System
No response
GPU
No response
ROCm Component
ROCR