[Issue]: Memory access fault in hipMemcpyAsync with valid parameters #4817

@IMbackK

Problem Description

In a multi-GPU scenario, hipMemcpyAsync fails with "Memory access fault by GPU node-6 (Agent handle: 0x55b4c3010740) on address 0x7f6a4cc27000. Reason: Page not present or supervisor privilege." even though the source and destination parameters are valid.

Operating System

Ubuntu 24.04 with mainline AMDGPU and AMDHSA kernel modules

CPU

AMD EPYC 7443

GPU

MI100, RX 7900 XTX

ROCm Version

7.2.0

ROCm Component

clr, ROCR-Runtime

Steps to Reproduce

Hi, I am a maintainer of the HIP backend of ggml.

Checkpoint restoring in llama.cpp causes the above issue; see the ggml bug ggml-org/llama.cpp#20176 and the reproduction notes there.

To debug this, since I found an ltrace of the HIP calls to be insufficient, I created an instrumented version of llama.cpp: https://github.com/IMbackK/llama.cpp/tree/memdebug. This version logs all calls to the HIP memory management functions. For the crashing case it produces this trace: mem.log. I also wrote a script that transforms this log into an equivalent replay: replay-gen.zip. The generated replay code for this crash is replay.cpp. Note that, unlike the original, this replay does not crash; the crash in the original binary also seems to be timing sensitive.

As mem.log and replay.cpp show, the crashing call at mem.log:13042 has a valid dst pointer inside a block allocated with hipMalloc on the same device, and its src pointer is a valid host pointer inside a block allocated with malloc.
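The validity check described above amounts to an interval-containment test over the allocations recorded in the trace. A minimal sketch in Python, assuming the allocations have already been extracted from the log; the `Allocation` type, the `find_containing` helper, and every address and size below are hypothetical placeholders, not values taken from mem.log:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Allocation:
    base: int   # start address returned by the allocator
    size: int   # size in bytes
    kind: str   # "device" (hipMalloc) or "host" (malloc)


def find_containing(allocs: Iterable[Allocation], ptr: int, nbytes: int) -> Optional[Allocation]:
    """Return the allocation that fully contains [ptr, ptr + nbytes), if any."""
    end = ptr + nbytes
    for alloc in allocs:
        if alloc.base <= ptr and end <= alloc.base + alloc.size:
            return alloc
    return None


# Placeholder allocations, for illustration only.
allocs = [
    Allocation(base=0x1000_0000, size=0x1000, kind="device"),
    Allocation(base=0x2000_0000, size=0x4000, kind="host"),
]

dst = find_containing(allocs, 0x1000_0100, 0x200)      # inside the device block
src = find_containing(allocs, 0x2000_0800, 0x200)      # inside the host block
overrun = find_containing(allocs, 0x1000_0F00, 0x200)  # runs past the device block

print(dst, src, overrun)
```

A copy is only considered valid in this sketch when both the start and the end of the copied range fall inside a single live allocation, which is why the third lookup returns `None`.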

The host buffer is allocated with malloc at https://github.com/IMbackK/llama.cpp/blob/b0b35e1c6be4fa9993e63d694c986612f1bc00b8/tools/server/server-context.cpp#L2642 (see also mem.log:4245). Note that, to ensure the buffer remains valid, the instrumented version of llama.cpp never frees this buffer and leaks it instead.

This issue is not reproducible on a single device; at least two devices are required. I am able to reproduce it with 2x and 3x MI100 as well as with 1x RX 7900 XTX plus 1x MI100.

Further, I would like to attempt to debug this using ASan. However, because my system contains an RDNA device, I cannot put the CDNA devices into xnack+ mode for this purpose, and I am unable to get a multi-GPU system on amd.digitalocean.com because there is never any capacity at the ATL1 location I have credits for. If some resources could be provided for this, that would be helpful as well.

Additional Information

The IOMMU of the system is in pt mode.

Here is the dmesg log from a crash of this type:

[11372.104682] amdgpu: Freeing queue vital buffer 0x7f3563000000, queue evicted
[11372.104979] amdgpu: Freeing queue vital buffer 0x7f3567800000, queue evicted
[11372.105265] amdgpu: Freeing queue vital buffer 0x7f3582800000, queue evicted
[11372.105552] amdgpu: Freeing queue vital buffer 0x7f3592800000, queue evicted
[11372.105848] amdgpu: Freeing queue vital buffer 0x7f3597200000, queue evicted
[11372.106134] amdgpu: Freeing queue vital buffer 0x7f359ba00000, queue evicted
[11372.106419] amdgpu: Freeing queue vital buffer 0x7f35a0000000, queue evicted
[11372.106712] amdgpu: Freeing queue vital buffer 0x7f35a0200000, queue evicted
[11372.106998] amdgpu: Freeing queue vital buffer 0x7f35c5800000, queue evicted
[11372.107282] amdgpu: Freeing queue vital buffer 0x7f35ca000000, queue evicted
[11372.107567] amdgpu: Freeing queue vital buffer 0x7f35ce800000, queue evicted
[11478.874161] gmc_v9_0_process_interrupt: 4834 callbacks suppressed
[11478.874168] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.874750] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.875094] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc00000 from IH client 0x12 (VMC)
[11478.875474] amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00320031
[11478.875747] amdgpu 0000:03:00.0: amdgpu:      Faulty UTCL2 client ID: SDMA0 (0x100)
[11478.876022] amdgpu 0000:03:00.0: amdgpu:      MORE_FAULTS: 0x1
[11478.876226] amdgpu 0000:03:00.0: amdgpu:      WALKER_ERROR: 0x0
[11478.876433] amdgpu 0000:03:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[11478.876655] amdgpu 0000:03:00.0: amdgpu:      MAPPING_ERROR: 0x0
[11478.876865] amdgpu 0000:03:00.0: amdgpu:      RW: 0x0
[11478.877045] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.877398] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.877738] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc28000 from IH client 0x12 (VMC)
[11478.878110] amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00320031
[11478.878379] amdgpu 0000:03:00.0: amdgpu:      Faulty UTCL2 client ID: SDMA0 (0x100)
[11478.878645] amdgpu 0000:03:00.0: amdgpu:      MORE_FAULTS: 0x1
[11478.878846] amdgpu 0000:03:00.0: amdgpu:      WALKER_ERROR: 0x0
[11478.879050] amdgpu 0000:03:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[11478.879269] amdgpu 0000:03:00.0: amdgpu:      MAPPING_ERROR: 0x0
[11478.879475] amdgpu 0000:03:00.0: amdgpu:      RW: 0x0
[11478.879651] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.880001] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.880339] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc29000 from IH client 0x12 (VMC)
[11478.880711] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.881060] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.881396] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc2a000 from IH client 0x12 (VMC)
[11478.881768] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.882119] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.882456] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc2b000 from IH client 0x12 (VMC)
[11478.882827] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.883176] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.883512] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc2c000 from IH client 0x12 (VMC)
[11478.883885] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.884234] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.884570] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc2d000 from IH client 0x12 (VMC)
[11478.884941] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.885289] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.885626] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc2e000 from IH client 0x12 (VMC)
[11478.885997] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.886347] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.886687] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc2f000 from IH client 0x12 (VMC)
[11478.887059] amdgpu 0000:03:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32787)
[11478.887409] amdgpu 0000:03:00.0: amdgpu:  Process llama-server pid 212858 thread llama-server pid 212858
[11478.887747] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00007f6a4cc30000 from IH client 0x12 (VMC)
[11478.888124] amdgpu 0000:03:00.0: amdgpu: ih ring buffer overflow (0x000BD480, 0x00014500, 0x0003D4A0)
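For reference, the raw VM_L2_PROTECTION_FAULT_STATUS word from the log can be unpacked by hand. The following is a small Python sketch, not driver code: the bit layout in `FIELDS` is an assumption based on my reading of the gmc_v9_0 register definitions, and it is only cross-checked here against the fields the kernel already printed above.

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS on gmc_v9_0:
# name -> (shift, width). This is an assumption, verified only against
# the decoded values that appear in the dmesg log above.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # UTCL2 client ID
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Extract each bit field from the raw status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


decoded = decode_fault_status(0x00320031)  # value taken from the log
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

Under this layout the word decodes to the same values the kernel reports: client ID 0x100 (SDMA0), PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x0, RW 0x0, and vmid 3. If RW 0x0 denotes a read access, as I understand it does, this would be consistent with SDMA0 faulting while reading from the host source buffer.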
