pml/ob1: ARM64 memory ordering fixes and stale send range drain #13800

Open · blkqi wants to merge 3 commits into open-mpi:main
Conversation
On weakly-ordered architectures (ARM64), the ob1 send and receive request locks (req_lock) use relaxed atomics without memory barriers. This allows the request completion path to race ahead of request processing when multiple threads are involved.

For send requests, the failure mode is: Thread A holds the lock and runs schedule_once, processing a send range. Thread B handles a completion callback, decrements req_state to 0 via a relaxed atomic, acquires the lock (relaxed, no acquire barrier), and completes the request. The sendreq is returned to the free list with stale send ranges still linked, causing an infinite loop in schedule_once when the sendreq is recycled.

For receive requests, the same pattern applies: lock_recv_request and unlock_recv_request use identical relaxed atomics. A recvreq can be completed and recycled while another thread still holds a reference, leading to use-after-free crashes in recv_request_pml_complete.

Add opal_atomic_wmb() before the relaxed atomic decrement in both unlock_send_request and unlock_recv_request to ensure all stores performed under the lock are visible before the lock is released. Add opal_atomic_rmb() after successful lock acquisition in both lock_send_request and lock_recv_request to ensure the new holder sees all stores from the previous holder.

Observed as a deadlock (sendreq) and segfault (recvreq) under MPI_THREAD_MULTIPLE on ARM64 (Graviton4) clusters.

Related to open-mpi#13761, open-mpi#12011, open-mpi#11999

Signed-off-by: Brett Kleinschmidt <blk@blk.me>
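The lock/unlock pairing described above can be sketched with portable C11 atomics. This is a simplified model, not the actual ob1 code (the real lock in pml_ob1_sendreq.h is a counter manipulated through OPAL's atomics macros): a relaxed CAS stands in for the lock, with atomic_thread_fence(acquire) playing the role of opal_atomic_rmb() and atomic_thread_fence(release) the role of opal_atomic_wmb().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified model of the fixed req_lock pattern (hypothetical names).
 * Without the two fences, an ARM64 core may acquire the lock yet still
 * observe stale values in the plain fields the previous holder wrote. */

static atomic_int req_lock = 0;
static int req_bytes_delivered;   /* plain field protected by the lock */

static bool lock_request(void) {
    int expected = 0;
    if (!atomic_compare_exchange_strong_explicit(&req_lock, &expected, 1,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed)) {
        return false;                 /* another thread holds the lock  */
    }
    /* opal_atomic_rmb(): see all stores from the previous lock holder */
    atomic_thread_fence(memory_order_acquire);
    return true;
}

static void unlock_request(void) {
    /* opal_atomic_wmb(): publish stores made under the lock first */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&req_lock, 0, memory_order_relaxed);
}
```

On x86 (TSO) both fences compile to nothing, which is why the missing barriers were only observable on ARM64.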
On weakly-ordered architectures (ARM64), ob1 completion paths that bypass the request lock perform relaxed atomic writes to req_state, req_bytes_delivered, req_bytes_received, or req_pipeline_depth, then immediately call send_request_pml_complete_check or recv_request_pml_complete_check. Without a release barrier before the atomic update, another thread's complete_check (which has an acquire barrier from commit 0b0f9d1, 2007) may observe the updated counter but not the preceding stores, leading to request recycling with stale internal state.

Add opal_atomic_wmb() before each OPAL_THREAD_ADD_FETCH or pml_complete_check call at the following unlocked sites:

- sendreq.c: rndv_completion_request, mca_pml_ob1_rget_completion, mca_pml_ob1_frag_completion, mca_pml_ob1_put_completion, mca_pml_ob1_send_request_put
- recvreq.c: mca_pml_ob1_put_completion, mca_pml_ob1_rget_completion, mca_pml_ob1_recv_request_progress_frag, mca_pml_ob1_recv_request_frag_copy_finished, mca_pml_ob1_recv_request_progress_rndv
- recvfrag.c: mca_pml_ob1_recv_frag_callback_ack

These pair with the existing opal_atomic_rmb() at the top of send_request_pml_complete_check and recv_request_pml_complete_check. On ARM64, wmb compiles to dmb st and rmb to dmb ld, which together establish cross-thread visibility through the relaxed atomic intermediary. On x86 (TSO), both are no-ops.

Also removes a TODO comment ("TODO -- read ordering") in mca_pml_ob1_put_completion that identified this exact missing barrier since 2016 (commit 1e2019c).

Observed as segfaults and infinite loops under MPI_THREAD_MULTIPLE on 128-rank Graviton4 (ARM64) clusters after deploying the lock ordering fix, which masked these unlocked paths.

Related to open-mpi#13761, open-mpi#12011, open-mpi#11999

Signed-off-by: Brett Kleinschmidt <blk@blk.me>
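The wmb/rmb pairing through the relaxed counter can be modeled in C11 (a sketch, not the real code; the real sites use OPAL_THREAD_ADD_FETCH and opal_atomic_wmb/rmb on actual request fields). The release fence on the completion side and the acquire fence at the top of complete_check are what make the plain-field stores visible to whichever thread recycles the request:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of an unlocked completion path (hypothetical field names).
 * completion_callback publishes its plain stores before the relaxed
 * counter update; complete_check re-synchronizes with an acquire
 * fence before trusting those fields. */

static atomic_int req_state = 2;      /* outstanding completion events */
static int rdma_frag_freed;           /* plain store ordered by fences */

static void completion_callback(void) {
    rdma_frag_freed = 1;                         /* store, no lock held */
    atomic_thread_fence(memory_order_release);   /* opal_atomic_wmb()   */
    atomic_fetch_sub_explicit(&req_state, 1, memory_order_relaxed);
}

static bool complete_check(void) {
    if (atomic_load_explicit(&req_state, memory_order_relaxed) != 0) {
        return false;                            /* not done yet        */
    }
    atomic_thread_fence(memory_order_acquire);   /* opal_atomic_rmb()   */
    return rdma_frag_freed == 1;                 /* safe to recycle     */
}
```

Without the release fence, the counter could reach zero while rdma_frag_freed is still unobserved on another core, which is exactly the stale-state recycling the commit message describes.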
devreal reviewed Apr 1, 2026
    /* ensure all prior stores (copy_in_out, rdma_frag, throttle_sends,
     * req_state, accelerator flags) are visible before complete_check
     * may recycle the request */
    opal_atomic_wmb();
Contributor
Should we move the barrier into send_request_pml_complete_check instead? This seems to be a repeating pattern.
Contributor (Author)
I considered folding a full barrier into complete_check, but only two of the wmb sites can be absorbed — the rest need to order stores before intervening relaxed decrements that precede the call, so the explicit barriers would remain anyway.
The cleaner long-term solution would be adding memory order parameters to the atomics backend (release on decrement, acquire on the completion read), which would eliminate the need for explicit barriers entirely. Open to suggestions.
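The "memory order parameters" idea mentioned above could look like the following C11 sketch (hypothetical, not an existing OPAL API): fold the release into the decrement itself, and have the thread that observes zero issue the acquire before touching request fields.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical alternative to the explicit wmb/rmb pair: a decrement
 * with release semantics, plus an acquire fence only on the thread
 * that wins the race to zero and will complete/recycle the request. */

static atomic_int req_state = 2;

static bool state_decrement_and_check(void) {
    /* release: stores preceding this decrement become visible to any
     * thread that later observes the zero */
    if (atomic_fetch_sub_explicit(&req_state, 1,
                                  memory_order_release) == 1) {
        /* acquire: see every other completer's stores before recycling */
        atomic_thread_fence(memory_order_acquire);
        return true;    /* this thread may complete the request */
    }
    return false;
}
```

The upside is that non-winning threads pay only the release cost, and no caller can forget the barrier; the downside is that it requires threading memory-order arguments through the OPAL atomics backend.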
When a send request is reused (e.g. persistent requests or request recycling), stale entries may remain on req_send_ranges from the previous lifecycle. These stale ranges can cause the request to operate on outdated BTL state. Drain any remaining send ranges during request start before the new lifecycle begins.

Related to open-mpi#13761, open-mpi#12011, open-mpi#11999

Signed-off-by: Brett Kleinschmidt <blk@blk.me>
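The drain described above amounts to popping every leftover entry off the list at request start. A minimal sketch with a hand-rolled singly-linked list (the real code walks sendreq->req_send_ranges with Open MPI's opal_list API and returns mca_pml_ob1_send_range_t items to a free list; the names below are hypothetical stand-ins):

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of draining stale send ranges before a request's
 * new lifecycle begins. free() stands in for returning each range to
 * the free list. */

typedef struct range {
    struct range *next;
} range_t;

static size_t drain_send_ranges(range_t **head) {
    size_t drained = 0;
    while (*head != NULL) {       /* pop every stale range */
        range_t *r = *head;
        *head = r->next;
        free(r);
        drained++;
    }
    return drained;               /* list is now empty */
}

/* helper for exercising the sketch: push n fresh ranges */
static void push_ranges(range_t **head, size_t n) {
    for (size_t i = 0; i < n; i++) {
        range_t *r = malloc(sizeof(*r));
        r->next = *head;
        *head = r;
    }
}
```

Doing this at start (rather than at completion) is defensive: it clears anything a racing completion path left behind, regardless of how the previous lifecycle ended.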
Force-pushed dc50790 to 07a05ae
Fixes #13799