[bugfix]: quiet all CXI proxy queues before reuse by PanJason · Pull Request #998 · uccl-project/uccl

PanJason · 2026-06-16T17:08:43Z

Summary

Fix the UCCL-EP CXI timeout by making QUIET drain every CXI D2H queue, preserving full FIFO control wr ids, and keeping BARRIER on the existing single representative sync queue.

Symptom

The user-visible symptom was an NVL receiver timeout when running test_internode.py with UCCL_BENCH_RDMA_BUFFER_SIZE=256. Because RDMA slots are reused, a stale SourceMeta::is_token_in_nvl_rank_bits from a previous token can still carry a valid epoch tag for the current rdma_channel_tail, while its low routing bits belong to the old token.

When this happens, WarpRole::kRDMAAndNVLForwarder can decide the token is not destined for the current dst_nvl_rank and skip the NVL enqueue. The matching WarpRole::kNVLReceivers warp still expects the advertised token count, so it keeps polling nvl_channel_tail until the NVL receiver timeout fires.

This is not a receiver-capacity deadlock. It is a stale-control-state race from non-quieted CXI queues overlapping RDMA/atomic buffer reset or reuse.

Root cause

So far, I have found 3 separate control-plane issues related to the timeout problem I found earlier.

Remove break

nvshmemi_ibgda_quiet() did not fence all CXI commands before RDMA/atomic buffer reuse. The original helper had an unconditional outer break, so it posted only one QUIET, from one thread in one SM. See, for example:

uccl/ep/src/internode.cu, line 128 in d59f5b6:

uccl::nvshmemi_ibgda_quiet(d2h_channel_addrs, num_d2h_channel_addrs,

Removing that break makes it post QUIET to multiple GPU-to-CPU (D2H) queues instead of only queue 0.

Loop over all queues in `nvshmemi_ibgda_quiet`

Previously, nvshmemi_ibgda_quiet posted the QUIET command only on every kChannelPerProxy-th queue. In the default setting there are 4 proxy threads and kChannelPerProxy == 8, so QUIET is posted only on queues 0, 8, 16, and 24.

Since only one CUDA thread launches nvshmemi_ibgda_quiet, in the end only those 4 queues receive the QUIET command and are drained.
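The effect of the outer break and of the loop stride can be illustrated with a small host-side model. This is only a sketch: `quiet_targets` and the constants below are hypothetical names that mirror the defaults described above, not the actual helper in the repository.

```cpp
#include <vector>

// Hypothetical constants mirroring the defaults described above.
constexpr int kChannelPerProxy = 8;
constexpr int kNumProxyThreads = 4;
constexpr int kNumQueues = kChannelPerProxy * kNumProxyThreads;  // 32 D2H queues

// Returns the queue indices that would receive a QUIET for a given stride.
// `break_after_first` models the original unconditional outer break.
std::vector<int> quiet_targets(int num_queues, int stride,
                               bool break_after_first) {
  std::vector<int> posted;
  for (int q = 0; q < num_queues; q += stride) {
    posted.push_back(q);           // stands in for posting QUIET to queue q
    if (break_after_first) break;  // original behavior: only queue 0 is reached
  }
  return posted;
}
```

With the break, only queue 0 is covered. Without it but with a stride of kChannelPerProxy, queues 0, 8, 16, and 24 are covered. A stride of 1 covers all 32 queues.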

`int` overflow

After fixing the previous two issues, the timeout moved to wait_until_cmd_consumed. Once QUIET was expanded to all CXI queues, the FIFO completion path revealed a wr-id truncation bug. FIFO control wr ids are generated as:

unique_wr_id = (static_cast<uint64_t>(rb_idx) << 32) |
               (fifo_seq_[rb_idx]++ & 0xFFFFFFFFULL);

So rb_idx says which FIFO inside this proxy thread the command came from.

When the proxy sees a QUIET, it records that wr id in ctx_.quiet_wr. Later, after the quiet work is done, notify_gpu_completion() checks whether the completed wr id matches the pending FIFO entry:

if (ctx_.quiet_wr != -1 && front_wr == (uint64_t)ctx_.quiet_wr) {
  ctx_.quiet_inflight = false;
  ctx_.quiet_wr = -1;
  fifo->pop();
}

The fifo->pop() is the important part. On the GPU side, wait_until_cmd_consumed() is waiting for the FIFO tail to advance:

if (h->fifo.poll(slot)) break;

If quiet_wr is only an int, then a wr id from rb_idx = 1:

4294967296

gets truncated to:

0

So the proxy can finish the CXI quiet drain and even insert the ack, but when notify_gpu_completion() looks at FIFO 1’s pending entry:

front_wr = 4294967296
ctx_.quiet_wr = 0

they do not match. Therefore:

fifo->pop() is not called
FIFO tail does not advance
GPU h->fifo.poll(slot) stays false
GPU waits forever
timeout
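The mismatch can be reproduced in isolation. The sketch below reuses the wr-id layout and the comparison from the snippets above; `NarrowCtx`, `WideCtx`, and the helper functions are hypothetical stand-ins for ProxyCtx, and the narrow store keeps the low 32 bits explicitly to model the truncation.

```cpp
#include <cstdint>

// Same layout as the generation code above: ring-buffer index in the high
// 32 bits, per-ring sequence number in the low 32 bits.
uint64_t make_wr_id(uint32_t rb_idx, uint64_t seq) {
  return (static_cast<uint64_t>(rb_idx) << 32) | (seq & 0xFFFFFFFFULL);
}

struct NarrowCtx { int quiet_wr = -1; };      // models the pre-fix field width
struct WideCtx { int64_t quiet_wr = -1; };    // models the widened field

// Records `posted` in a 32-bit field and applies the completion check.
// Going through uint32_t models the truncation: only the low 32 bits survive.
bool matches_narrow(uint64_t front_wr, uint64_t posted) {
  NarrowCtx ctx;
  ctx.quiet_wr = static_cast<int>(static_cast<uint32_t>(posted));
  return ctx.quiet_wr != -1 && front_wr == static_cast<uint64_t>(ctx.quiet_wr);
}

// Same check with a 64-bit field: the ring index in the high bits is kept.
bool matches_wide(uint64_t front_wr, uint64_t posted) {
  WideCtx ctx;
  ctx.quiet_wr = static_cast<int64_t>(posted);
  return ctx.quiet_wr != -1 && front_wr == static_cast<uint64_t>(ctx.quiet_wr);
}
```

For rb_idx = 0 both variants match, which is consistent with the bug staying hidden while QUIET only reached the first FIFO of each proxy thread. For rb_idx = 1 only the 64-bit variant matches.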

One last point: nvshmem_sync_with_same_gpu_idx() should not be changed in the same way as QUIET for this fix. The barrier path is a proxy-thread rendezvous, while QUIET is the operation that must fence per-queue CXI writes/atomics before buffer reuse. This PR therefore keeps the sync/BARRIER helper on the existing single representative queue and only expands CXI QUIET coverage.

Fix

This is split into three commits:

1. Keep sync/BARRIER single-queued by restoring the outer break in nvshmem_sync_with_same_gpu_idx().
2. Preserve full FIFO control wr ids by widening ProxyCtx::quiet_wr and ProxyCtx::barrier_wr from int to int64_t.
3. For USE_LIBFABRIC_CXI, make nvshmemi_ibgda_quiet() stride every D2H queue, record the exact posted queue index, and wait on the same queue for completion. Non-CXI keeps the existing one-queue-per-proxy stride.
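The post-then-wait pattern of the CXI QUIET commit can be sketched on the host as follows. This is a minimal model under stated assumptions, not the device code: `MockQueue`, `Posted`, `post_quiet_all`, and `all_quiet_done` are hypothetical names, and the real helper spins on the FIFO from the GPU rather than returning a flag.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one D2H queue: `head` counts posted commands,
// `tail` counts commands the proxy has consumed.
struct MockQueue {
  uint64_t head = 0;
  uint64_t tail = 0;
  uint64_t post_quiet() { return head++; }  // returns the slot that was used
  bool poll(uint64_t slot) const { return tail > slot; }
  void proxy_consume_all() { tail = head; }  // models the proxy popping entries
};

// Remembers exactly where a QUIET was posted.
struct Posted {
  int queue;
  uint64_t slot;
};

// Phase 1: post QUIET to every queue and record (queue index, slot).
std::vector<Posted> post_quiet_all(std::vector<MockQueue>& queues) {
  std::vector<Posted> posted;
  for (int q = 0; q < static_cast<int>(queues.size()); ++q) {
    posted.push_back({q, queues[q].post_quiet()});
  }
  return posted;
}

// Phase 2: completion requires every recorded (queue, slot) to be consumed.
bool all_quiet_done(const std::vector<MockQueue>& queues,
                    const std::vector<Posted>& posted) {
  for (const Posted& p : posted) {
    if (!queues[p.queue].poll(p.slot)) return false;
  }
  return true;
}
```

Recording the queue index alongside the slot keeps the wait tied to the queues that were actually posted to, so a sibling queue that has not yet been drained cannot be mistaken for a completed one.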

Validation

On Alps/CXI, EP8 with UCCL_BENCH_RDMA_BUFFER_SIZE=256 previously reproduced forwarder epoch/NVL receiver timeouts. The reproducer initially failed 6/6 times. With this fix, 10 serial clean runs completed successfully with no timeout, wait_until_cmd_consumed, RuntimeError, Traceback, ChildFailedError, NVL receiver, EP epoch, or CUDA launch-failure signatures in the node logs.

Best-result ranges across the 10 clean runs:

FP8 dispatch: 37.25-40.34 GB/s RDMA, 1496-1619 us
BF16 dispatch: 41.23-43.19 GB/s RDMA, 2710-2837 us
Combine: 41.40-42.68 GB/s RDMA, 2742-2827 us

YangZhou1997 · 2026-06-18T20:08:55Z

This really calls for @MaoZiming's expertise on nvshmemi_ibgda_quiet.

Btw, I saw AWS people also asked about this previously.

MaoZiming · 2026-06-21T16:11:33Z

@PanJason @YangZhou1997 I think the break in the quiet can be removed (probably a bug). The break in the barrier should be correct and intentional. For example, you only need one thread per GPU to establish a barrier across multiple ranks. Having multiple threads per GPU to establish a barrier might be redundant.

MaoZiming · 2026-06-21T16:16:05Z

        break;
      }
    }
 #endif


Suggested change

-#endif
+#endif
+    // NOTE: the `break` here is intentional (only post to proxy thread 0): unlike
+    // QUIET, one proxy thread per GPU suffices since one GPU proxy thread is enough to form a complete barrier across all ranks.
+    break;

YangZhou1997 · 2026-06-21T18:48:01Z

Sounds good. cc @PanJason, would this make sense to you? If so, we can apply Ziming's edit and merge the PR.

PanJason · 2026-06-22T06:51:40Z

@YangZhou1997 It seems that the timeout still occurs in stress testing even after removing the two breaks. I am still investigating.

fergusfinn · 2026-06-22T08:51:40Z

Hey @PanJason, @MaoZiming - I've been investigating a similar issue.

The formulation I got up and serving at substantial load is this one: doublewordai#8 (relevant bit: https://github.com/doublewordai/uccl/pull/8/changes#diff-d6afe63b16f2ff37b6c8aaeecf90c546748b0ba6a0c81feb453737169b93b1cbL333)

My understanding is that this PR (#998) fixes a real issue: removing the premature break means QUIET is posted once per proxy thread.

But then, each proxy thread owns several D2H rings (kChannelPerProxy). Since the loop still advances by kChannelPerProxy, it only posts QUIET to the first ring for each proxy thread. Any work already enqueued on sibling rings is not necessarily pulled into the CXI outstanding-op set before QUIET completes, so buffer reuse can still race that work.

The extra piece in doublewordai#8 is to make CXI QUIET stride across every D2H ring and wait on the exact rings it posted to.

I'm not sure why this would show up only for CXI though? @PanJason do you have a repro for the timeout, and does this change fix it?

PanJason · 2026-06-22T12:00:04Z

> I'm not sure why this would show up only for CXI though?

This issue showed up when I was running some sweep tests and changed rdma_buffer_size in the Config in test_internode.py like this:

rdma_buffer_size = int(os.environ.get("UCCL_BENCH_RDMA_BUFFER_SIZE", "512"))
config = Config(num_sms, 8, nvl_buffer_size, 16, rdma_buffer_size)

Let me create a minimal reproducer for you. Then I will also test your code to see if the problem is gone.

PanJason · 2026-06-22T12:49:50Z

@fergusfinn You can try applying this patch: https://pastebin.com/phH7ejuK
to the swiss-ai/slingshot-dev-clean branch: https://github.com/swiss-ai/uccl/tree/slingshot-dev-clean
Then run this sbatch script (you will need to adapt it a bit to your environment):
https://pastebin.com/eCK25hc0

and see if you can also reproduce the timeout.

PanJason · 2026-06-22T13:36:49Z

I was checking the code and found that nvshmemi_ibgda_quiet is launched only once, from sm == 0 and thread_id == WARP_SIZE, which is lane 0 of warp 1. So only one CUDA thread issues nvshmemi_ibgda_quiet. Now I am wondering why the IB code did not hang. Maybe @MaoZiming has some idea?

PanJason · 2026-06-22T16:21:31Z

@fergusfinn I applied your fix, but there were still some overflow bugs, which I fixed as well. Could you take a look?

MaoZiming · 2026-06-24T07:50:17Z

Thanks. @PanJason, I feel this might be a genuine stability bug. We did run a stability test (>24 hours) over EFA/Broadcom before and did not see a hang, so this might come down to some subtle timing issue. In particular, enqueueing QUIET is good (we might even differentiate between actually draining the CQ and only draining the ring buffer).

PanJason · 2026-06-24T08:55:36Z

@MaoZiming From what I have tested, no hang occurs with the default configuration. I hit this when I changed the RDMA buffer size in the config in test_internode.py. Maybe you can also try lowering it from 512 to 256 and see if you can reproduce the problem with EFA/Broadcom.

PanJason requested review from MaoZiming, YangZhou1997 and zhenhuang12 as code owners June 16, 2026 17:08

MaoZiming reviewed Jun 21, 2026

MaoZiming mentioned this pull request Jun 21, 2026

Add CXI transport support for EP #997

Merged

11 tasks

PanJason changed the title ~~[bugfix]: drain all proxy queues during sync~~ [bugfix]: quiet all CXI proxy queues before reuse Jun 22, 2026

fergusfinn mentioned this pull request Jun 24, 2026

fix(merge): drop stray break in amo_nonfetch_add doublewordai/uccl#17

Open

PanJason force-pushed the control_fix branch from a542012 to 3ded2f3 Compare June 26, 2026 16:46

PanJason added 3 commits June 26, 2026 19:10

[bugfix]: keep sync barrier single-queued

7d3766c

- restore the sync loop exit so BARRIER remains posted through one representative queue
- keep QUIET handling separate from BARRIER semantics for CXI timeout debugging

[bugfix]: preserve fifo control wr ids

b20d747

- widen quiet and barrier wr id storage so FIFO ring bits are not truncated
- keep the existing negative sentinel while preserving high 32-bit queue indices

[bugfix]: quiet all CXI proxy queues

9c470bf

- post QUIET to every CXI D2H queue so sibling proxy queues are drained
- track the posted queue index and wait on the exact queue for completion

PanJason force-pushed the control_fix branch from 1dd647e to 9c470bf Compare June 26, 2026 17:11

PanJason wants to merge 3 commits into uccl-project:main from swiss-ai:control_fix

4 participants
