
[bugfix]: quiet all CXI proxy queues before reuse#998

Open
PanJason wants to merge 3 commits into
uccl-project:mainfrom
swiss-ai:control_fix


@PanJason PanJason commented Jun 16, 2026

Contributor

Summary

Fix the UCCL-EP CXI timeout by making QUIET drain every CXI D2H queue, preserving full FIFO control wr ids, and keeping BARRIER on the existing single representative sync queue.

Symptom

The user-visible symptom was an NVL receiver timeout when running test_internode.py with UCCL_BENCH_RDMA_BUFFER_SIZE=256. Because RDMA slots are reused, a stale SourceMeta::is_token_in_nvl_rank_bits from a previous token can still carry a valid epoch tag for the current rdma_channel_tail, while its low routing bits belong to the old token.

When this happens, WarpRole::kRDMAAndNVLForwarder can decide the token is not destined for the current dst_nvl_rank and skip the NVL enqueue. The matching WarpRole::kNVLReceivers warp still expects the advertised token count, so it keeps polling nvl_channel_tail until the NVL receiver timeout fires.

This is not a receiver-capacity deadlock. It is a stale-control-state race from non-quieted CXI queues overlapping RDMA/atomic buffer reset or reuse.

Root cause

So far, I have found three separate control-plane issues behind the timeout I reported earlier.

Remove break

nvshmemi_ibgda_quiet() did not fence all CXI commands before RDMA/atomic buffer reuse. The original helper had an unconditional outer break, so it posted only one QUIET, from one thread on one SM. See, for example, the call site:

uccl::nvshmemi_ibgda_quiet(d2h_channel_addrs, num_d2h_channel_addrs,

Removing that break makes it post QUIET to multiple GPU-to-CPU (D2H) queues instead of only queue 0.

Loop over all queues in nvshmemi_ibgda_quiet

Previously, nvshmemi_ibgda_quiet posted the QUIET command only to every kChannelPerProxy-th queue. In the default setting there are 4 proxy threads and kChannelPerProxy == 8, so QUIET is posted only to queues

0, 8, 16, 24

Since only one CUDA thread calls nvshmemi_ibgda_quiet, only those 4 queues ever receive the QUIET command and get drained.
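To make the coverage gap concrete, here is a small host-side sketch. It assumes 4 proxy threads with kChannelPerProxy == 8 (32 D2H queues in total); quiet_targets is a hypothetical helper written for illustration, not a function in the repo.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: lists the D2H queue indices that receive QUIET when a
// single thread walks all queues with the given stride.
std::vector<int> quiet_targets(int num_queues, int stride) {
  std::vector<int> targets;
  for (int q = 0; q < num_queues; q += stride) {
    targets.push_back(q);
  }
  return targets;
}
```

With 32 queues, a stride of 8 yields queues 0, 8, 16, 24, while a stride of 1 covers all 32 queues. The remaining 28 queues are the ones the old loop never quieted.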

int overflow

After fixing the previous two issues, the timeout moved to wait_until_cmd_consumed. Once QUIET was expanded to all CXI queues, the FIFO completion path revealed a wr-id truncation bug. FIFO control wr ids are generated as:

unique_wr_id = (static_cast<uint64_t>(rb_idx) << 32) |
               (fifo_seq_[rb_idx]++ & 0xFFFFFFFFULL);

So the upper 32 bits (rb_idx) say which FIFO inside this proxy thread the command came from, and the lower 32 bits are that FIFO's sequence number.
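A minimal sketch of that layout, for reference. The helper names make_wr_id, wr_rb_idx, and wr_seq are hypothetical and only mirror the expression above; they are not functions in the codebase.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the wr-id expression above:
// upper 32 bits = FIFO index inside the proxy thread, lower 32 bits = sequence.
constexpr uint64_t make_wr_id(uint32_t rb_idx, uint64_t seq) {
  return (static_cast<uint64_t>(rb_idx) << 32) | (seq & 0xFFFFFFFFULL);
}

// Hypothetical decoders for the two halves.
constexpr uint32_t wr_rb_idx(uint64_t wr_id) {
  return static_cast<uint32_t>(wr_id >> 32);
}

constexpr uint32_t wr_seq(uint64_t wr_id) {
  return static_cast<uint32_t>(wr_id & 0xFFFFFFFFULL);
}
```

Any wr id with rb_idx >= 1 is at least 2^32, so it cannot be stored in a 32-bit int without losing the FIFO index.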

When the proxy sees a QUIET, it records that wr id in ctx_.quiet_wr. Later, after the quiet work is done, notify_gpu_completion() checks whether the completed wr id matches the pending FIFO entry:

if (ctx_.quiet_wr != -1 && front_wr == (uint64_t)ctx_.quiet_wr) {
  ctx_.quiet_inflight = false;
  ctx_.quiet_wr = -1;
  fifo->pop();
}

The fifo->pop() is the important part. On the GPU side, wait_until_cmd_consumed() is waiting for the FIFO tail to advance:

if (h->fifo.poll(slot)) break;

If quiet_wr is only an int, then the first wr id from rb_idx = 1 (sequence 0):

4294967296

gets truncated to:

0

So the proxy can finish the CXI quiet drain and even insert the ack, but when notify_gpu_completion() looks at FIFO 1’s pending entry:

front_wr = 4294967296
ctx_.quiet_wr = 0

they do not match. Therefore:

fifo->pop() is not called
FIFO tail does not advance
GPU h->fifo.poll(slot) stays false
GPU waits forever
timeout
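The mismatch can be reproduced in isolation. The sketch below uses two hypothetical functions that stand in for the completion check with the stored field as int (before) and int64_t (after). It assumes the usual behaviour of mainstream compilers, where narrowing to int keeps the low 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the check with the old `int quiet_wr` field.
bool matches_narrow(uint64_t front_wr) {
  int quiet_wr = static_cast<int>(front_wr);  // drops the upper 32 bits
  return quiet_wr != -1 && front_wr == static_cast<uint64_t>(quiet_wr);
}

// Hypothetical stand-in for the check with the widened `int64_t quiet_wr`.
bool matches_wide(uint64_t front_wr) {
  int64_t quiet_wr = static_cast<int64_t>(front_wr);  // keeps all 64 bits
  return quiet_wr != -1 && front_wr == static_cast<uint64_t>(quiet_wr);
}
```

With front_wr = 4294967296 the narrow check fails and the wide check succeeds, while wr ids from rb_idx = 0 pass both. That would be consistent with the bug staying hidden while QUIET was only posted to the first FIFO of each proxy thread. The -1 sentinel still works after widening because a valid wr id would need rb_idx = 0xFFFFFFFF to collide with it.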

One last point: nvshmem_sync_with_same_gpu_idx() should not be changed in the same way as QUIET for this fix. The barrier path is a proxy-thread rendezvous, while QUIET is the operation that must fence per-queue CXI writes/atomics before buffer reuse. This PR therefore keeps the sync/BARRIER helper on the existing single representative queue and only expands CXI QUIET coverage.

Fix

This is split into three commits:

  1. Keep sync/BARRIER single-queued by restoring the outer break in nvshmem_sync_with_same_gpu_idx().
  2. Preserve full FIFO control wr ids by widening ProxyCtx::quiet_wr and ProxyCtx::barrier_wr from int to int64_t.
  3. For USE_LIBFABRIC_CXI, make nvshmemi_ibgda_quiet() stride every D2H queue, record the exact posted queue index, and wait on the same queue for completion. Non-CXI keeps the existing one-queue-per-proxy stride.
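The third item can be sketched on the host with a mock queue. This is only an illustration of "post to every queue, remember where you posted, wait on exactly those queues"; MockQueue, Posted, post_quiet_all, and all_consumed are hypothetical names, and the real code runs on the GPU against the D2H FIFOs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one D2H queue: head counts posted commands,
// tail counts commands the proxy has consumed.
struct MockQueue {
  uint64_t head = 0;
  uint64_t tail = 0;
  uint64_t post_quiet() { return head++; }  // returns the slot that was used
  void proxy_drain() { tail = head; }       // proxy consumes everything posted
  bool consumed(uint64_t slot) const { return tail > slot; }
};

// Records exactly which queue and slot a QUIET was posted to.
struct Posted {
  int queue;
  uint64_t slot;
};

// Post one QUIET to every queue (stride 1) and remember each location.
std::vector<Posted> post_quiet_all(std::vector<MockQueue>& queues) {
  std::vector<Posted> posted;
  for (int q = 0; q < static_cast<int>(queues.size()); ++q) {
    posted.push_back({q, queues[q].post_quiet()});
  }
  return posted;
}

// QUIET is complete only when every recorded slot has been consumed.
bool all_consumed(const std::vector<MockQueue>& queues,
                  const std::vector<Posted>& posted) {
  for (const Posted& p : posted) {
    if (!queues[p.queue].consumed(p.slot)) return false;
  }
  return true;
}
```

In this model, the QUIET does not complete until every queue it was posted to has been drained, so a sibling queue that still holds work keeps buffer reuse blocked.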

Validation

On Alps/CXI, EP8 with UCCL_BENCH_RDMA_BUFFER_SIZE=256 previously reproduced forwarder epoch/NVL receiver timeouts. The reproducer initially failed 6/6 times. With this fix, 10 serial clean runs completed successfully with no timeout, wait_until_cmd_consumed, RuntimeError, Traceback, ChildFailedError, NVL receiver, EP epoch, or CUDA launch-failure signatures in the node logs.

Best-result ranges across the 10 clean runs:

FP8 dispatch: 37.25-40.34 GB/s RDMA, 1496-1619 us
BF16 dispatch: 41.23-43.19 GB/s RDMA, 2710-2837 us
Combine: 41.40-42.68 GB/s RDMA, 2742-2827 us

@YangZhou1997

Member

This really calls for @MaoZiming's expertise on nvshmemi_ibgda_quiet.

Btw, I saw AWS people also asked about this previously.

@MaoZiming

Member

@PanJason @YangZhou1997 I think the break in the quiet can be removed (probably a bug). The break in the barrier should be correct and intentional. For example, you only need one thread per GPU to establish a barrier across multiple ranks. Having multiple threads per GPU to establish a barrier might be redundant.

Comment thread ep/include/uccl_ibgda.cuh
break;
}
}
#endif

@MaoZiming MaoZiming Jun 21, 2026

Member


Suggested change
#endif
#endif
// NOTE: the `break` here is intentional (only post to proxy thread 0): unlike
// QUIET, one proxy thread per GPU suffices since one GPU proxy thread is enough to form a complete barrier across all ranks.
break;

@MaoZiming MaoZiming mentioned this pull request Jun 21, 2026
11 tasks
@YangZhou1997

Member

Sounds good. cc @PanJason, does this make sense to you? If so, we can apply Ziming's edit and merge the PR.

@PanJason

Contributor Author

@YangZhou1997 It seems the timeout still occurs in stress testing even after removing the two breaks. I am still investigating.

@fergusfinn

fergusfinn commented Jun 22, 2026

Contributor

Hey @PanJason, @MaoZiming - I've been investigating a similar issue.

The version I have up and serving at substantial load is this one: doublewordai#8 (relevant part: https://github.com/doublewordai/uccl/pull/8/changes#diff-d6afe63b16f2ff37b6c8aaeecf90c546748b0ba6a0c81feb453737169b93b1cbL333)

My understanding is that this PR (#998) fixes a real issue: removing the premature break means QUIET is posted once per proxy thread.

But then, each proxy thread owns several D2H rings (kChannelPerProxy). Since the loop still advances by kChannelPerProxy, it only posts QUIET to the first ring for each proxy thread. Any work already enqueued on sibling rings is not necessarily pulled into the CXI outstanding-op set before QUIET completes, so buffer reuse can still race that work.

The extra piece in doublewordai#8 is to make CXI QUIET stride across every D2H ring and wait on the exact rings it posted to.

I'm not sure why this would show up only for CXI though? @PanJason do you have a repro for the timeout, and does this change fix it?

@PanJason

Contributor Author

I'm not sure why this would show up only for CXI though?

This issue showed up when I was running sweep tests and changed the rdma_buffer_size in Config in test_internode.py like this:

rdma_buffer_size = int(os.environ.get("UCCL_BENCH_RDMA_BUFFER_SIZE", "512"))
config = Config(num_sms, 8, nvl_buffer_size, 16, rdma_buffer_size)

Let me create a minimal reproducer for you. Then I will also test your code to see if the problem is gone.

@PanJason

Contributor Author

@fergusfinn You can try applying this patch: https://pastebin.com/phH7ejuK
to the swiss-ai/slingshot-dev-clean branch: https://github.com/swiss-ai/uccl/tree/slingshot-dev-clean
Then run the sbatch script (you will need to adapt it a bit to your environment):
https://pastebin.com/eCK25hc0

That should show whether you can also reproduce this timeout.

@PanJason

Contributor Author

While checking the code, I found that nvshmemi_ibgda_quiet is launched only once, from sm==0 and thread_id==WARP_SIZE, which is lane 0 of warp 1. So only one CUDA thread issues nvshmemi_ibgda_quiet. Now I am wondering why the IB code did not hang. Maybe @MaoZiming has some idea?

@PanJason PanJason changed the title from "[bugfix]: drain all proxy queues during sync" to "[bugfix]: quiet all CXI proxy queues before reuse" Jun 22, 2026
@PanJason

Contributor Author

@fergusfinn I applied your fix, but there were still some overflow bugs, which I fixed as well. Could you take a look?

@MaoZiming

MaoZiming commented Jun 24, 2026

Member

Thanks. @PanJason, I feel this might be a genuine stability bug. We ran a stability test (>24 hours) over EFA/Broadcom before and it did not hang. That might be because of some subtle timing issues. In particular, enqueueing quiet is good (we might even differentiate between actually draining the CQ vs. only draining the ring buffer).

@PanJason

Contributor Author

@MaoZiming In my tests, no hang occurs with the default configuration. I hit this when I changed the RDMA buffer size in the config in test_internode.py. Maybe you can also try lowering it from 512 to 256 and see if you can reproduce the problem with EFA/Broadcom.

PanJason added 3 commits June 26, 2026 19:10
- restore the sync loop exit so BARRIER remains posted through one representative queue
- keep QUIET handling separate from BARRIER semantics for CXI timeout debugging

- widen quiet and barrier wr id storage so FIFO ring bits are not truncated
- keep the existing negative sentinel while preserving high 32-bit queue indices

- post QUIET to every CXI D2H queue so sibling proxy queues are drained
- track the posted queue index and wait on the exact queue for completion

4 participants