[bugfix]: quiet all CXI proxy queues before reuse#998
Conversation
|
Here really hit @MaoZiming 's expertise on Btw, I saw aws people also asked about this previously. |
|
@PanJason @YangZhou1997 I think the |
| break; | ||
| } | ||
| } | ||
| #endif |
There was a problem hiding this comment.
| #endif | |
| #endif | |
| // NOTE: the `break` here is intentional (only post to proxy thread 0): unlike | |
| // QUIET, one proxy thread per GPU suffices since one GPU proxy thread is enough to form a complete barrier across all ranks. | |
| break; |
|
Sounds good. cc @PanJason, would this make sense to you. If so, then we can apply Ziming's edit and merge the PR |
|
@YangZhou1997 It seemed that the timeout still exists even after removing the two break in stress testing. I am still investigating |
|
Hey @PanJason, @MaoZiming - I've been investigating a similar issue. The formulation I got up and serving at substantial load is this one: doublewordai#8 (https://github.com/doublewordai/uccl/pull/8/changes#diff-d6afe63b16f2ff37b6c8aaeecf90c546748b0ba6a0c81feb453737169b93b1cbL333 relevant bit) My understanding is that this PR (#998) fixes a real issue, removing the premature break means QUIET is posted once per proxy thread. But then, each proxy thread owns several D2H rings ( The extra piece in doublewordai#8 is to make CXI QUIET stride across every D2H ring and wait on the exact rings it posted to. I'm not sure why this would show up only for CXI though? @PanJason do you have a repro for the timeout, and does this change fix it? |
This issue showed up when I was trying to do some sweep tests and change the Let me create a minimal reproducer for you. Then I will also test your code to see if the problem is gone. |
|
@fergusfinn You can try to apply this patch: https://pastebin.com/phH7ejuK To see if you can also generate this timeout |
|
I was checking the code and I found this |
|
@fergusfinn I applied your fix, but still some overflow bugs that I fixed as well. You can take a look I guess? |
|
Thanks. @PanJason, I feel this might be a genuine stability bug. We did run stability test (>24 hours) over EFA/Broadcom before and it didn't experience hang. This might be because of some subtle timing issues. In particular, enqueueing quiet is good (we might even differentiate between actually draining the cq vs. only draining the ring buffer). |
|
@MaoZiming By what I have tested, no hang was seen using the default configure. I got this when I tried to change the rdma buffer size in |
- restore the sync loop exit so BARRIER remains posted through one representative queue - keep QUIET handling separate from BARRIER semantics for CXI timeout debugging
- widen quiet and barrier wr id storage so FIFO ring bits are not truncated - keep the existing negative sentinel while preserving high 32-bit queue indices
- post QUIET to every CXI D2H queue so sibling proxy queues are drained - track the posted queue index and wait on the exact queue for completion
Summary
Fix the UCCL-EP CXI timeout by making
QUIETdrain every CXI D2H queue, preserving full FIFO control wr ids, and keepingBARRIERon the existing single representative sync queue.Symptom
The user-visible symptom was an NVL receiver
timeoutwithUCCL_BENCH_RDMA_BUFFER_SIZE=256when we testtest_internode.py. Because RDMA slots are reused, staleSourceMeta::is_token_in_nvl_rank_bitsfrom a previous token can still carry a valid epoch tag for the currentrdma_channel_tail, while its low routing bits belong to the old token.When this happens,
WarpRole::kRDMAAndNVLForwardercan decide the token is not destined for the currentdst_nvl_rankand skip the NVL enqueue. The matchingWarpRole::kNVLReceiverswarp still expects the advertised token count, so it keeps pollingnvl_channel_tailuntil the NVL receiver timeout fires.This is not a receiver-capacity deadlock. It is a stale-control-state race from non-quieted CXI queues overlapping RDMA/atomic buffer reset or reuse.
Root cause
So far, I have found 3 separate control-plane issues related to the timeout problem I found earlier.
Remove break
nvshmemi_ibgda_quiet()did not fence all CXI commands before RDMA/atomic buffer reuse. The original helper had an unconditional outerbreak, so it only posted oneQUIETfrom one thread in one sm. See for example inuccl/ep/src/internode.cu
Line 128 in d59f5b6
Removing that break made it post multiple
QUIETto different GPU to CPU queues, instead of simply queue 0.Loop over all queues in
nvshmemi_ibgda_quietIn
nvshmemi_ibgda_quietpreviously the quiet command was only put everykChannelPerProxy. In the default setting, there are 4 proxy threads, andkChannelPerProxy == 8so quiet command will only be put on queuefor example.
Since only one cuda thread is launching
nvshmemi_ibgda_quiet, at the end only 4 queues will have the quiet command and be drained.intoverflowAfter we fix the previous two, the timeout moved to
wait_until_cmd_consumed. WhenQUIETwas expanded to all CXI queues, the FIFO completion path revealed a wr-id truncation bug. FIFO control wr ids are generated as:So
rb_idxsays which FIFO inside this proxy thread the command came from.When the proxy sees a
QUIET, it records that wr id inctx_.quiet_wr. Later, after the quiet work is done,notify_gpu_completion()checks whether the completed wr id matches the pending FIFO entry:The
fifo->pop()is the important part. On the GPU side,wait_until_cmd_consumed()is waiting for the FIFO tail to advance:If
quiet_wris onlyint, then a wr id fromrb_idx = 1:gets truncated to:
So the proxy can finish the CXI quiet drain and even insert the ack, but when
notify_gpu_completion()looks at FIFO 1’s pending entry:they do not match. Therefore:
Last point.
nvshmem_sync_with_same_gpu_idx()should not be changed in the same way asQUIETfor this fix. The barrier path is a proxy-thread rendezvous, whileQUIETis the operation that must fence per-queue CXI writes/atomics before buffer reuse. This PR therefore keeps the sync/BARRIER helper on the existing single representative queue and only expands CXIQUIETcoverage.Fix
This is split into three commits:
breakinnvshmem_sync_with_same_gpu_idx().ProxyCtx::quiet_wrandProxyCtx::barrier_wrfrominttoint64_t.USE_LIBFABRIC_CXI, makenvshmemi_ibgda_quiet()stride every D2H queue, record the exact posted queue index, and wait on the same queue for completion. Non-CXI keeps the existing one-queue-per-proxy stride.Validation
On Alps/CXI, EP8 with
UCCL_BENCH_RDMA_BUFFER_SIZE=256previously reproduced forwarder epoch/NVL receiver timeouts. The reproducer initially failed 6/6 times. With this fix, 10 serial clean runs completed successfully with notimeout,wait_until_cmd_consumed,RuntimeError,Traceback,ChildFailedError,NVL receiver,EP epoch, or CUDA launch-failure signatures in the node logs.Best-result ranges across the 10 clean runs: