io/xgmi: honor sub-region base offset in IPC remapping (fixes #415)#416
Draft
carlushuang wants to merge 3 commits into
Draft
io/xgmi: honor sub-region base offset in IPC remapping (fixes #415)#416carlushuang wants to merge 3 commits into
carlushuang wants to merge 3 commits into
Conversation
XgmiBackendSession transfers computed remote addresses from the IPC-remapped allocation base without accounting for the offset of the registered region within that allocation. hipIpcGetMemHandle/hipIpcOpenMemHandle are keyed to the allocation base, so registering a sub-region (e.g. a per-layer view of one paged KV-cache allocation, as PD-disaggregation KV connectors do) produced a remote pointer pointing at the allocation base instead of the registered region. Concretely this caused BatchRead/BatchReadWrite to either move the wrong bytes (silent corruption, StatusCode::SUCCESS) or, when the mis-computed pointer fell outside the mapped range, to be classified as host memory by hipPointerGetAttributes and trigger a CPU memcpy over a device address inside hipMemcpyPeerAsync -> SIGSEGV. Fix: record the registered pointer's offset within its allocation (MemoryDesc::ipcOffset, computed via hipMemGetAddressRange at registration) and add it back to the remapped base on the importing side. ipcOffset is 0 for whole-allocation registrations, so existing callers are unaffected. Repro (pure mori + torch): register a sub-region at base+OFF on one GPU and XGMI BatchRead it from another -> returns zeros before, correct data after.
Two-process XGMI test that registers a sub-region (offset view) of a larger allocation on one GPU and batch_reads it from another process/GPU, then checks the bytes. Before the offset fix the reader gets the allocation base instead of the registered region, so the data mismatches. Must be cross-process: a single process serves remote memory via the same-process direct pointer, which already carries the offset. Skips when fewer than 2 GPUs are visible.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes #415.
XgmiBackendSessioncomputes the remote/local transfer addresses from the IPC-remapped allocation base without accounting for the offset of the registered region within that allocation.hipIpcGetMemHandle/hipIpcOpenMemHandleare keyed to the allocation base, so registering a sub-region — e.g. a per-layer view of one paged KV-cache allocation, which is exactly how PD-disaggregation KV connectors register memory — yields a remote pointer aimed at the allocation base instead of the registered region.Symptoms on the XGMI backend (intra-node, 8×MI355X):
BatchReadreturnsStatusCode::SUCCESSbut moves bytes from the allocation base (wrong data).hipPointerGetAttributesreports it as host/unregistered (type=0), sohipMemcpyPeerAsynctakes a CPU-memcpypath over a device address and crashes inside glibc__memmove_avx512.Fix
Record the registered pointer's offset within its underlying allocation (
MemoryDesc::ipcOffset, resolved viahipMemGetAddressRangeat registration, and the IPC handle is taken on the allocation base), then add it back to the remapped base on the importing side inGetRemappedAddress.ipcOffset == 0for whole-allocation registrations, so existing callers (including the vLLM/SGLang connectors, which register whole per-layer tensors) are unaffected. The RDMA backend is untouched.Test
Added
tests/python/io/test_xgmi_suballocation.py— a two-process XGMI test that registers a sub-region (offset view) of a larger allocation on one GPU andbatch_reads it from another process/GPU, then checks the bytes. It must be cross-process: within a single process the XGMI backend serves remote memory via the same-process direct pointer (offset already baked in), so the offset handling is only exercised across processes (realhipIpcOpenMemHandle). Skips when fewer than 2 GPUs are visible.Result on 2×MI355X (gfx950): fails before this fix (the reader gets the zero-filled allocation base → "registered region's base offset was not honored"), passes after.
End-to-end
GLM-5.2-FP8 intra-node prefill/decode disaggregation (TP4+TP4, paged MLA KV cache registered as per-layer views): with this fix the IPC source pointers resolve as device memory (
type=2) for the per-layer transfers, versus host/unregistered (type=0) before.Note / follow-up
This PR fixes the offset/addressing bug (the data-corruption + the dominant crash cause). A separate issue remains for the scatter/gather kernel path (
MORI_IO_XGMI_SCATTER_GATHER_THRESHOLD) under intra-node IPC: the device-side kernel cannot triggerhipIpcMemLazyEnablePeerAccess, so it needs peer access enabled eagerly to avoid a GPU page fault. Happy to follow up with that in a separate change if useful.