
pml/ob1: ARM64 memory ordering fixes and stale send range drain #13800

Open
blkqi wants to merge 3 commits into open-mpi:main from blkqi:blk/arm64-ob1-fixes-main

Conversation

@blkqi
Contributor

@blkqi blkqi commented Apr 1, 2026

Fixes #13799

blkqi added 2 commits March 31, 2026 08:38
On weakly-ordered architectures (ARM64), the ob1 send and receive
request locks (req_lock) use relaxed atomics without memory barriers.
This allows the request completion path to race ahead of request
processing when multiple threads are involved.

For send requests, the failure mode is: Thread A holds the lock and
runs schedule_once, processing a send range. Thread B handles a
completion callback, decrements req_state to 0 via a relaxed atomic,
acquires the lock (relaxed, no acquire barrier), and completes the
request. The sendreq is returned to the free list with stale send
ranges still linked, causing an infinite loop in schedule_once when
the sendreq is recycled.

For receive requests, the same pattern applies: lock_recv_request and
unlock_recv_request use identical relaxed atomics. A recvreq can be
completed and recycled while another thread still holds a reference,
leading to use-after-free crashes in recv_request_pml_complete.

Add opal_atomic_wmb() before the relaxed atomic decrement in both
unlock_send_request and unlock_recv_request to ensure all stores
performed under the lock are visible before the lock is released.
Add opal_atomic_rmb() after successful lock acquisition in both
lock_send_request and lock_recv_request to ensure the new holder
sees all stores from the previous holder.

Observed as a deadlock (sendreq) and segfault (recvreq) under
MPI_THREAD_MULTIPLE on ARM64 (Graviton4) clusters.

Related to open-mpi#13761, open-mpi#12011, open-mpi#11999

Signed-off-by: Brett Kleinschmidt <blk@blk.me>
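The lock/unlock barrier placement described in the commit message can be sketched with GCC builtins standing in for OPAL's atomics. This is a minimal illustration, not the actual ob1 code: the `request_t` type and the `try_lock_request`/`unlock_request` names are invented stand-ins for the `lock_send_request`/`unlock_send_request` helpers.

```c
#include <stdint.h>

/* Simplified stand-in for the ob1 request structure. */
typedef struct { int32_t req_lock; } request_t;

/* Relaxed CAS acquisition; the added read barrier ensures the new
 * holder observes all stores made by the previous holder. */
static int try_lock_request(request_t *req)
{
    int32_t expected = 0;
    if (__atomic_compare_exchange_n(&req->req_lock, &expected, 1,
                                    0 /* strong */, __ATOMIC_RELAXED,
                                    __ATOMIC_RELAXED)) {
        __atomic_thread_fence(__ATOMIC_ACQUIRE); /* opal_atomic_rmb() */
        return 1;
    }
    return 0;
}

/* Barrier before the relaxed release makes all stores performed
 * under the lock visible before another thread can take it. */
static void unlock_request(request_t *req)
{
    __atomic_thread_fence(__ATOMIC_RELEASE);     /* opal_atomic_wmb() */
    __atomic_store_n(&req->req_lock, 0, __ATOMIC_RELAXED);
}
```

Without the two fences, the relaxed atomics order only the lock word itself; on ARM64 the protected fields (send ranges, req_state) can be observed out of order, which is exactly the recycling race described above.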
On weakly-ordered architectures (ARM64), ob1 completion paths that
bypass the request lock perform relaxed atomic writes to req_state,
req_bytes_delivered, req_bytes_received, or req_pipeline_depth, then
immediately call send_request_pml_complete_check or
recv_request_pml_complete_check. Without a release barrier before the
atomic update, another thread's complete_check (which has an acquire
barrier from commit 0b0f9d1, 2007) may observe the updated counter
but not the preceding stores — leading to request recycling with stale
internal state.

Add opal_atomic_wmb() before each OPAL_THREAD_ADD_FETCH or
pml_complete_check call at the following unlocked sites:

  sendreq.c: rndv_completion_request, mca_pml_ob1_rget_completion,
             mca_pml_ob1_frag_completion, mca_pml_ob1_put_completion,
             mca_pml_ob1_send_request_put
  recvreq.c: mca_pml_ob1_put_completion, mca_pml_ob1_rget_completion,
             mca_pml_ob1_recv_request_progress_frag,
             mca_pml_ob1_recv_request_frag_copy_finished,
             mca_pml_ob1_recv_request_progress_rndv
  recvfrag.c: mca_pml_ob1_recv_frag_callback_ack

These pair with the existing opal_atomic_rmb() at the top of
send_request_pml_complete_check and recv_request_pml_complete_check.
On ARM64, wmb compiles to dmb st and rmb to dmb ld, which together
establish cross-thread visibility through the relaxed atomic
intermediary. On x86 (TSO), both are no-ops.

Also removes a TODO comment ("TODO -- read ordering") in
mca_pml_ob1_put_completion that identified this exact missing barrier
since 2016 (commit 1e2019c).

Observed as segfaults and infinite loops under MPI_THREAD_MULTIPLE on
128-rank Graviton4 (ARM64) clusters after deploying the lock ordering
fix, which masked these unlocked paths.

Related to open-mpi#13761, open-mpi#12011, open-mpi#11999

Signed-off-by: Brett Kleinschmidt <blk@blk.me>
/* ensure all prior stores (copy_in_out, rdma_frag, throttle_sends,
* req_state, accelerator flags) are visible before complete_check
* may recycle the request */
opal_atomic_wmb();
Contributor
Should we move the barrier into send_request_pml_complete_check instead? This seems to be a repeating pattern.

Contributor Author
I considered folding a full barrier into complete_check, but only two of the wmb sites can be absorbed — the rest need to order stores before intervening relaxed decrements that precede the call, so the explicit barriers would remain anyway.

The cleaner long-term solution would be adding memory order parameters to the atomics backend (release on decrement, acquire on the completion read), which would eliminate the need for explicit barriers entirely. Open to suggestions.
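For illustration, the suggested memory-order-aware backend calls might look like the following C11 stdatomic sketch. These names are hypothetical, not a real OPAL API; the point is that the ordering moves into the atomic operation itself, eliminating the freestanding barriers:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical release-on-decrement: all prior stores are ordered
 * before the decrement becomes visible to other threads. */
static inline int32_t req_state_dec_release(_Atomic int32_t *state)
{
    return atomic_fetch_sub_explicit(state, 1, memory_order_release) - 1;
}

/* Hypothetical acquire-on-load for the completion-side read: any
 * thread that sees the decremented value also sees the stores that
 * preceded it on the writer side. */
static inline int32_t req_state_load_acquire(_Atomic int32_t *state)
{
    return atomic_load_explicit(state, memory_order_acquire);
}
```

On ARM64 these compile to load-acquire/store-release instructions (`ldar`/`stlr` or fenced forms), which are typically cheaper than a full `dmb`; on x86 they remain plain loads and stores, matching the no-op behavior of the current wmb/rmb pairs.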

When a send request is reused (e.g. persistent requests or request
recycling), stale entries may remain on req_send_ranges from the
previous lifecycle. These stale ranges can cause the request to
operate on outdated BTL state.

Drain any remaining send ranges during request start before the
new lifecycle begins.
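The drain step can be sketched as follows, with a plain singly linked list standing in for the opal_list_t-based req_send_ranges machinery; all type and function names here are simplified stand-ins for illustration:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct range { struct range *next; } range_t;

typedef struct {
    range_t *req_send_ranges;  /* may hold entries from a prior lifecycle */
} sendreq_t;

/* Return every stale range before the request is (re)started, so the
 * scheduler never walks entries belonging to a previous message. */
static size_t send_request_drain_ranges(sendreq_t *req)
{
    size_t drained = 0;
    while (req->req_send_ranges != NULL) {
        range_t *r = req->req_send_ranges;
        req->req_send_ranges = r->next;
        free(r);               /* real code returns to the free list */
        drained++;
    }
    return drained;
}
```

In the actual code the ranges go back to their free list rather than to `free()`, but the invariant is the same: a request entering start must see an empty range list.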

Related to open-mpi#13761, open-mpi#12011, open-mpi#11999

Signed-off-by: Brett Kleinschmidt <blk@blk.me>
@blkqi blkqi force-pushed the blk/arm64-ob1-fixes-main branch from dc50790 to 07a05ae on April 1, 2026 at 15:58


Development

Successfully merging this pull request may close these issues:

pml/ob1: ARM64 memory ordering bugs and stale send range on sendreq reuse