
pml/ob1: ARM64 memory ordering fixes and stale send range drain #13800

Open
blkqi wants to merge 3 commits into open-mpi:main from blkqi:blk/arm64-ob1-fixes-main

Conversation

@blkqi
Contributor

@blkqi blkqi commented Apr 1, 2026

Fixes #13799

blkqi added 2 commits March 31, 2026 08:38
On weakly-ordered architectures (ARM64), the ob1 send and receive
request locks (req_lock) use relaxed atomics without memory barriers.
This allows the request completion path to race ahead of request
processing when multiple threads are involved.

For send requests, the failure mode is: Thread A holds the lock and
runs schedule_once, processing a send range. Thread B handles a
completion callback, decrements req_state to 0 via a relaxed atomic,
acquires the lock (relaxed, no acquire barrier), and completes the
request. The sendreq is returned to the free list with stale send
ranges still linked, causing an infinite loop in schedule_once when
the sendreq is recycled.

For receive requests, the same pattern applies: lock_recv_request and
unlock_recv_request use identical relaxed atomics. A recvreq can be
completed and recycled while another thread still holds a reference,
leading to use-after-free crashes in recv_request_pml_complete.

Add opal_atomic_wmb() before the relaxed atomic decrement in both
unlock_send_request and unlock_recv_request to ensure all stores
performed under the lock are visible before the lock is released.
Add opal_atomic_rmb() after successful lock acquisition in both
lock_send_request and lock_recv_request to ensure the new holder
sees all stores from the previous holder.

Observed as a deadlock (sendreq) and segfault (recvreq) under
MPI_THREAD_MULTIPLE on ARM64 (Graviton4) clusters.

Related to open-mpi#13761, open-mpi#12011, open-mpi#11999

Signed-off-by: Brett Kleinschmidt <blk@blk.me>
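The lock/unlock barrier placement described in the commit message can be sketched with GCC builtins standing in for OPAL's atomics. This is a minimal illustration, not the actual ob1 code: the `request_t` type and the `try_lock_request`/`unlock_request` names are invented stand-ins for the `lock_send_request`/`unlock_send_request` helpers.

```c
#include <stdint.h>

/* Simplified stand-in for the ob1 request structure. */
typedef struct { int32_t req_lock; } request_t;

/* Relaxed CAS acquisition; the added read barrier ensures the new
 * holder observes all stores made by the previous holder. */
static int try_lock_request(request_t *req)
{
    int32_t expected = 0;
    if (__atomic_compare_exchange_n(&req->req_lock, &expected, 1,
                                    0 /* strong */, __ATOMIC_RELAXED,
                                    __ATOMIC_RELAXED)) {
        __atomic_thread_fence(__ATOMIC_ACQUIRE); /* opal_atomic_rmb() */
        return 1;
    }
    return 0;
}

/* Barrier before the relaxed release makes all stores performed
 * under the lock visible before another thread can take it. */
static void unlock_request(request_t *req)
{
    __atomic_thread_fence(__ATOMIC_RELEASE);     /* opal_atomic_wmb() */
    __atomic_store_n(&req->req_lock, 0, __ATOMIC_RELAXED);
}
```

Without the two fences, the relaxed atomics order only the lock word itself; on ARM64 the protected fields (send ranges, req_state) can be observed out of order, which is exactly the recycling race described above.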
On weakly-ordered architectures (ARM64), ob1 completion paths that
bypass the request lock perform relaxed atomic writes to req_state,
req_bytes_delivered, req_bytes_received, or req_pipeline_depth, then
immediately call send_request_pml_complete_check or
recv_request_pml_complete_check. Without a release barrier before the
atomic update, another thread's complete_check (which has an acquire
barrier from commit 0b0f9d1, 2007) may observe the updated counter
but not the preceding stores — leading to request recycling with stale
internal state.

Add opal_atomic_wmb() before each OPAL_THREAD_ADD_FETCH or
pml_complete_check call at the following unlocked sites:

  sendreq.c: rndv_completion_request, mca_pml_ob1_rget_completion,
             mca_pml_ob1_frag_completion, mca_pml_ob1_put_completion,
             mca_pml_ob1_send_request_put
  recvreq.c: mca_pml_ob1_put_completion, mca_pml_ob1_rget_completion,
             mca_pml_ob1_recv_request_progress_frag,
             mca_pml_ob1_recv_request_frag_copy_finished,
             mca_pml_ob1_recv_request_progress_rndv
  recvfrag.c: mca_pml_ob1_recv_frag_callback_ack

These pair with the existing opal_atomic_rmb() at the top of
send_request_pml_complete_check and recv_request_pml_complete_check.
On ARM64, wmb compiles to dmb st and rmb to dmb ld, which together
establish cross-thread visibility through the relaxed atomic
intermediary. On x86 (TSO), both are no-ops.

Also removes a TODO comment ("TODO -- read ordering") in
mca_pml_ob1_put_completion that identified this exact missing barrier
since 2016 (commit 1e2019c).

Observed as segfaults and infinite loops under MPI_THREAD_MULTIPLE on
128-rank Graviton4 (ARM64) clusters after deploying the lock ordering
fix, which masked these unlocked paths.

Related to open-mpi#13761, open-mpi#12011, open-mpi#11999

Signed-off-by: Brett Kleinschmidt <blk@blk.me>
/* ensure all prior stores (copy_in_out, rdma_frag, throttle_sends,
* req_state, accelerator flags) are visible before complete_check
* may recycle the request */
opal_atomic_wmb();
Contributor
Should we move the barrier into send_request_pml_complete_check instead? This seems to be a repeating pattern.

Contributor Author
I considered folding a full barrier into complete_check, but only two of the wmb sites can be absorbed — the rest need to order stores before intervening relaxed decrements that precede the call, so the explicit barriers would remain anyway.

The cleaner long-term solution would be adding memory order parameters to the atomics backend (release on decrement, acquire on the completion read), which would eliminate the need for explicit barriers entirely. Open to suggestions.
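For illustration, the suggested memory-order-aware backend calls might look like the following C11 stdatomic sketch. These names are hypothetical, not a real OPAL API; the point is that the ordering moves into the atomic operation itself, eliminating the freestanding barriers:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical release-on-decrement: all prior stores are ordered
 * before the decrement becomes visible to other threads. */
static inline int32_t req_state_dec_release(_Atomic int32_t *state)
{
    return atomic_fetch_sub_explicit(state, 1, memory_order_release) - 1;
}

/* Hypothetical acquire-on-load for the completion-side read: any
 * thread that sees the decremented value also sees the stores that
 * preceded it on the writer side. */
static inline int32_t req_state_load_acquire(_Atomic int32_t *state)
{
    return atomic_load_explicit(state, memory_order_acquire);
}
```

On ARM64 these compile to load-acquire/store-release instructions (`ldar`/`stlr` or fenced forms), which are typically cheaper than a full `dmb`; on x86 they remain plain loads and stores, matching the no-op behavior of the current wmb/rmb pairs.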

When a send request is reused (e.g. persistent requests or request
recycling), stale entries may remain on req_send_ranges from the
previous lifecycle. These stale ranges can cause the request to
operate on outdated BTL state.

Drain any remaining send ranges during request start before the
new lifecycle begins.
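The drain step can be sketched as follows, with a plain singly linked list standing in for the opal_list_t-based req_send_ranges machinery; all type and function names here are simplified stand-ins for illustration:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct range { struct range *next; } range_t;

typedef struct {
    range_t *req_send_ranges;  /* may hold entries from a prior lifecycle */
} sendreq_t;

/* Return every stale range before the request is (re)started, so the
 * scheduler never walks entries belonging to a previous message. */
static size_t send_request_drain_ranges(sendreq_t *req)
{
    size_t drained = 0;
    while (req->req_send_ranges != NULL) {
        range_t *r = req->req_send_ranges;
        req->req_send_ranges = r->next;
        free(r);               /* real code returns to the free list */
        drained++;
    }
    return drained;
}
```

In the actual code the ranges go back to their free list rather than to `free()`, but the invariant is the same: a request entering start must see an empty range list.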

Related to open-mpi#13761, open-mpi#12011, open-mpi#11999

Signed-off-by: Brett Kleinschmidt <blk@blk.me>
@blkqi blkqi force-pushed the blk/arm64-ob1-fixes-main branch from dc50790 to 07a05ae on April 1, 2026 at 15:58


Development

Successfully merging this pull request may close these issues:

pml/ob1: ARM64 memory ordering bugs and stale send range on sendreq reuse