io/xgmi: honor sub-region base offset in IPC remapping (fixes #415) #416

Draft
carlushuang wants to merge 3 commits into ROCm:main from carlushuang:fix-xgmi-suballocation-offset

@carlushuang carlushuang commented Jun 22, 2026

Collaborator

What

Fixes #415.

XgmiBackendSession computes the remote/local transfer addresses from the IPC-remapped allocation base without accounting for the offset of the registered region within that allocation. hipIpcGetMemHandle / hipIpcOpenMemHandle are keyed to the allocation base, so registering a sub-region — e.g. a per-layer view of one paged KV-cache allocation, which is exactly how PD-disaggregation KV connectors register memory — yields a remote pointer aimed at the allocation base instead of the registered region.

Symptoms on the XGMI backend (intra-node, 8×MI355X):

  • Silent data corruption: BatchRead returns StatusCode::SUCCESS but moves bytes from the allocation base (wrong data).
  • SIGSEGV: when the mis-computed pointer falls outside the mapped IPC range, hipPointerGetAttributes reports it as host/unregistered (type=0), so hipMemcpyPeerAsync takes a CPU-memcpy path over a device address and crashes inside glibc __memmove_avx512.
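The silent-corruption symptom can be illustrated with a small toy model. This is only a sketch: a `bytearray` stands in for the IPC-mapped allocation, `batch_read_model` is a hypothetical stand-in for a transfer, and the sizes and offsets are made up. Nothing here calls HIP or MoRI.

```python
# Toy model only: a bytearray stands in for the IPC-mapped allocation.
ALLOC_SIZE = 4096    # hypothetical allocation size in bytes
REGION_OFF = 1024    # hypothetical offset of the registered sub-region
REGION_LEN = 512     # hypothetical length of the registered sub-region

# Zero-filled allocation with a recognizable pattern in the registered region.
allocation = bytearray(ALLOC_SIZE)
allocation[REGION_OFF:REGION_OFF + REGION_LEN] = b"\xab" * REGION_LEN


def batch_read_model(mapped, region_base, length):
    """Copy `length` bytes starting at `region_base`.

    The copy itself cannot tell whether `region_base` is the intended
    region, so it reports success either way.
    """
    return "SUCCESS", bytes(mapped[region_base:region_base + length])


# Pre-fix behaviour as described above: the region base collapses to the
# allocation base (offset 0), so the read returns the wrong bytes.
status_buggy, data_buggy = batch_read_model(allocation, 0, REGION_LEN)

# Intended behaviour: the region base includes the sub-region offset.
status_fixed, data_fixed = batch_read_model(allocation, REGION_OFF, REGION_LEN)
```

In this model both reads report success, which is why the corruption is silent: only a byte-level comparison against the expected pattern distinguishes them.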

Fix

At registration, resolve the underlying allocation with hipMemGetAddressRange, take the IPC handle on the allocation base, and record the registered pointer's offset within that allocation in MemoryDesc::ipcOffset. On the importing side, GetRemappedAddress adds that offset back to the remapped base.

ipcOffset == 0 for whole-allocation registrations, so existing callers (including the vLLM/SGLang connectors, which register whole per-layer tensors) are unaffected. The RDMA backend is untouched.
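The address arithmetic of the fix can be sketched in a few lines. This is a pure-Python model under stated assumptions, not the actual C++ change: `MemoryDescModel`, `register`, and `get_remapped_address` are hypothetical names that mirror `MemoryDesc::ipcOffset` and `GetRemappedAddress`, and the `alloc_base`/`alloc_size` arguments stand in for what `hipMemGetAddressRange` would report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryDescModel:
    """Hypothetical stand-in for the descriptor shared with the importer."""
    data: int        # registered pointer in the exporter's address space
    size: int        # registered length in bytes
    ipc_offset: int  # data minus the base of the underlying allocation


def register(ptr, size, alloc_base, alloc_size):
    """Exporter side: record where the region sits inside its allocation."""
    if not (alloc_base <= ptr and ptr + size <= alloc_base + alloc_size):
        raise ValueError("region is not contained in the allocation")
    return MemoryDescModel(data=ptr, size=size, ipc_offset=ptr - alloc_base)


def get_remapped_address(desc, remapped_base, offset_in_region):
    """Importer side: the IPC mapping starts at the allocation base,
    so the recorded offset has to be added back."""
    if not 0 <= offset_in_region < desc.size:
        raise ValueError("offset is outside the registered region")
    return remapped_base + desc.ipc_offset + offset_in_region


# Example values (all made up).
ALLOC_BASE, ALLOC_SIZE = 0x7000_0000, 1 << 20
REMAPPED_BASE = 0x9000_0000

whole = register(ALLOC_BASE, ALLOC_SIZE, ALLOC_BASE, ALLOC_SIZE)
view = register(ALLOC_BASE + 0x4000, 0x1000, ALLOC_BASE, ALLOC_SIZE)
```

For the whole-allocation descriptor the recorded offset is zero, so the resolved address is the same as before the change; only the sub-region descriptor resolves differently.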

Test

Added tests/python/io/test_xgmi_suballocation.py — a two-process XGMI test that registers a sub-region (offset view) of a larger allocation on one GPU and batch_reads it from another process/GPU, then checks the bytes. It must be cross-process: within a single process the XGMI backend serves remote memory via the same-process direct pointer (offset already baked in), so the offset handling is only exercised across processes (real hipIpcOpenMemHandle). Skips when fewer than 2 GPUs are visible.

Result on 2×MI355X (gfx950): fails before this fix (the reader gets the zero-filled allocation base → "registered region's base offset was not honored"), passes after.
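Why the test has to be cross-process can be seen in a small model of the two resolution paths described above. This is an illustrative sketch only: the function names, the `honor_offset` switch, and the addresses are hypothetical, and the real backend logic lives in C++.

```python
# Made-up addresses for illustration.
EXPORTER_ALLOC_BASE = 0x1000     # allocation base in the exporting process
IMPORTER_REMAPPED_BASE = 0x9000  # where the importing process maps it
REGION_OFF = 0x200               # registered sub-region offset

# The registered pointer already includes the sub-region offset.
registered_ptr = EXPORTER_ALLOC_BASE + REGION_OFF


def resolve_same_process(ptr, offset):
    """Same-process path: the registered pointer is used directly, so the
    sub-region offset is already baked in."""
    return ptr + offset


def resolve_cross_process(remapped_base, ipc_offset, offset, honor_offset):
    """Cross-process path: the address is rebuilt from the remapped
    allocation base; `honor_offset` toggles the pre-fix/post-fix behaviour."""
    base = remapped_base + (ipc_offset if honor_offset else 0)
    return base + offset


# Same process: the result does not depend on the offset handling at all,
# so a single-process test passes with or without the fix.
same = resolve_same_process(registered_ptr, 0)

# Cross process: the two behaviours diverge by exactly REGION_OFF.
before = resolve_cross_process(IMPORTER_REMAPPED_BASE, REGION_OFF, 0, False)
after = resolve_cross_process(IMPORTER_REMAPPED_BASE, REGION_OFF, 0, True)
```

Only the cross-process path consults the recorded offset, which matches the note that a single-process test would not exercise the fix.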

End-to-end

GLM-5.2-FP8 intra-node prefill/decode disaggregation (TP4+TP4, paged MLA KV cache registered as per-layer views): with this fix the IPC source pointers resolve as device memory (type=2) for the per-layer transfers, versus host/unregistered (type=0) before.

Note / follow-up

This PR fixes the offset/addressing bug, which causes the data corruption and is the dominant cause of the crashes. A separate issue remains for the scatter/gather kernel path (MORI_IO_XGMI_SCATTER_GATHER_THRESHOLD) under intra-node IPC: the device-side kernel cannot trigger hipIpcMemLazyEnablePeerAccess, so it needs peer access enabled eagerly to avoid a GPU page fault. Happy to follow up with that in a separate change if useful.

XgmiBackendSession computed remote transfer addresses from the IPC-remapped
allocation base without accounting for the offset of the registered region
within that allocation. hipIpcGetMemHandle/hipIpcOpenMemHandle are keyed to the
allocation base, so registering a sub-region (e.g. a per-layer view of one
paged KV-cache allocation, as PD-disaggregation KV connectors do) produced a
remote pointer pointing at the allocation base instead of the registered
region.

Concretely, BatchRead/BatchReadWrite either moved the wrong bytes (silent
corruption, StatusCode::SUCCESS) or, when the mis-computed pointer fell outside
the mapped range, hipPointerGetAttributes classified it as host memory and
hipMemcpyPeerAsync ran a CPU memcpy over a device address -> SIGSEGV.

Fix: record the registered pointer's offset within its allocation
(MemoryDesc::ipcOffset, computed via hipMemGetAddressRange at registration) and
add it back to the remapped base on the importing side. ipcOffset is 0 for
whole-allocation registrations, so existing callers are unaffected.

Repro (pure mori + torch): register a sub-region at base+OFF on one GPU and
XGMI BatchRead it from another -> returns zeros before, correct data after.
Two-process XGMI test that registers a sub-region (offset view) of a larger
allocation on one GPU and batch_reads it from another process/GPU, then checks
the bytes. Before the offset fix the reader gets the allocation base instead of
the registered region, so the data mismatches. Must be cross-process: a single
process serves remote memory via the same-process direct pointer, which already
carries the offset. Skips when fewer than 2 GPUs are visible.
Development

Successfully merging this pull request may close these issues.

MoRI-IO XGMI backend: BatchRead ignores the base offset of a registered sub-region (silent data corruption; SIGSEGV in production)

1 participant