ep: CXI/Slingshot (libfabric) transport backend behind USE_CXI#1
ep: CXI/Slingshot (libfabric) transport backend behind USE_CXI#1fergusfinn wants to merge 20 commits into
Conversation
Minimal, USE_CXI-gated CXI transport for the EP dispatch/combine path; the
existing verbs/EFA path is untouched when USE_CXI is not defined.
- New self-contained uccl::cxi::Transport module (ep/{include,src}/cxi_transport.*):
RAII libfabric (cxi provider) endpoint, CUDA + host MR registration, peer
insertion, RMA write, inject atomic-add64, and CQ completion polling.
- Wire-in via rdma.cpp/proxy.cpp behind #ifdef USE_CXI; one NIC per rank
(local_rank % 4, rail-aligned), no remote-CQ-data / write-with-imm dependency.
- internode.cu / ep_configs.cuh: support GH200 4-GPU nodes (drop hard-coded
8-local-rank assumption); test_internode.py allows 4 local ranks.
Public dispatch/combine/Buffer/Config interfaces unchanged (DeepEP-compatible).
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
…i-GPU / srun environments
…sor and make initialize_uccl robust
…n test_internode_simple
(cherry picked from commit 75a2617)
Carry only the runtime fix from 4c5b63f. The focused regression test remains split out with the delayed test changes.
Keep the zero-token runtime fix in the base branch and carry only the focused guard coverage here.
There was a problem hiding this comment.
Pull request overview
Adds an optional (USE_CXI) CXI/Slingshot (libfabric) transport backend for the EP internode dispatch/combine path while preserving the existing verbs/EFA implementation when CXI is disabled. It also improves GH200 4-GPU-per-node support, tightens initialization/robustness, and includes a couple of correctness fixes found during serving.
Changes:
- Introduce
uccl::cxi::Transport(libfabric/cxi provider) and wire it into RDMA/proxy paths behind#ifdef USE_CXI(RMA writes + atomic-add control signaling). - Update internode kernels and configs to support
NUM_MAX_NVL_PEERS <= 8and 4-GPU nodes (including LL internode fixes). - Improve runtime/bench robustness (CUDA device selection, scratch pointer handling), add build toggles in
setup.py, and expand test/bench CLI options.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| ep/src/uccl_proxy.cpp | CXI/host atomic buffer allocation + stop/destroy behavior adjustments. |
| ep/src/uccl_ep.cc | Fix RDMA-rank memset overrun; minor includes and helper wiring for atomic-buffer connection. |
| ep/src/rdma.cpp | Add CXI transport initialization, connection info exchange, batched CXI writes, and CXI atomics. |
| ep/src/proxy.cpp | CXI-aware peer selection, pinning, barrier signaling via CXI atomics, and 4-GPU stride fixes. |
| ep/src/internode.cu | Relax NVL peer assumptions and avoid type-punning loads by explicitly packing NVL presence. |
| ep/src/internode_ll.cu | Make LL receive path robust when CXI forces internode even for same-node peers. |
| ep/src/cxi_transport.cpp | New libfabric-based CXI transport implementation (endpoint + MR + RMA + CQ polling). |
| ep/setup.py | Add USE_CXI build plumbing and libfabric link/include configuration. |
| ep/include/rdma.hpp | Extend RDMAConnectionInfo with CXI endpoint/MR metadata (gated by USE_CXI). |
| ep/include/proxy_ctx.hpp | Add CXI transport fields to ProxyCtx under USE_CXI. |
| ep/include/ep_configs.cuh | Allow overriding NUM_MAX_NVL_PEERS at compile time. |
| ep/include/cxi_transport.hpp | New CXI transport public header (gated by USE_CXI). |
| ep/bench/utils.py | Robust device initialization + accept scratch as tensor or raw pointer. |
| ep/bench/test_internode.py | Support 4 local ranks, add smoke/fixed-chunk options, set CXI phase env for stats. |
| ep/bench/test_internode_simple.py | Use Buffer directly (less proxy plumbing), safer destroy flow. |
| ep/bench/buffer.py | Adjust EP8 combine config for 4-GPU nodes. |
| AGENTS.md | Add Isambard/GH200+CXI build and multinode test guidance. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Replace the blocking post/wait_all/atomics sequence with an async design: - writes post without waiting; ring slots retire on CQ completion via a per-loop retire sweep (verbs-style), with an outstanding cap as backpressure - tail/control atomics inject immediately behind their writes, relying on same-TX-context submission ordering (measured on Slingshot-11; see uccl-project#956 discussion) - -FI_EAGAIN on any post is handled by polling + retry instead of terminating - UCCL_CXI_SYNC_WRITES=1 restores the previous conservative behavior - quiet drains all outstanding writes; shutdown drains bounded Also: env-gated UCCL_CXI_DELIVERY_COMPLETE=1 knob on TX (audit A/B).
Immediate same-TX injection reordered under FI_CXI_RDZV_THRESHOLD=0: every write goes rendezvous (target pulls data), so an eager atomic lands before the payload it announces -> corrupted dispatch output at >=1024 tokens. Queue atomics per peer with threshold = writes posted before them; flush in cxi_retire_ctx as completions arrive. Standalone atomics (no writes in flight) still inject immediately. quiet/barrier/shutdown drain the queue.
ep/cxi: async write path with completion-gated control atomics
- The CXI barrier compared a 21-bit-masked barrier_seq against per-node slot counters that accumulate forever (NIC atomic adds, never reset); after 2^21 barriers in one process the seq wraps and every release check passes spuriously. Keep seq monotonic (unmasked) under USE_CXI; the mask only exists for the verbs immediate-data encoding. - atomic_buffer_ptr_ was uninitialized for thread_idx != 0 proxies, so the null-check guard in get_atomic_buffer_ptr() read an indeterminate pointer. Initialize to nullptr. Both flagged by Copilot review on the PR.
|
Closing this as superseded by uccl-project#997. uccl-project#997 is now the cleaner upstream CXI/Slingshot EP path and is already open against uccl-project/main from swiss-ai/main. This PR was useful as validation/history, but it mixes the older USE_CXI transport wiring with local serving/debug fixes. Any remaining useful deltas should move forward as small PRs on top of uccl-project#997 rather than keeping this branch alive as a competing upstream path. |
Summary
Adds a CXI / HPE Slingshot (libfabric) transport backend for the UCCL EP
(expert-parallelism) dispatch/combine path, as a minimal patch gated behind
USE_CXI. WhenUSE_CXIis not defined the existing verbs/EFA path iscompletely untouched.
The public
dispatch/combine/Buffer/Configinterfaces areunchanged, so this stays DeepEP-compatible and drop-in.
What's here
ep/{include,src}/cxi_transport.*—uccl::cxi::Transport, an RAII libfabric (cxiprovider) endpoint:CUDA + host MR registration (dmabuf-friendly), peer address exchange/insert,
RMA
write,inject_atomic_add64, and CQ completion polling. Whole file is#ifdef USE_CXI.rdma.cpp/proxy.cppbehind#ifdef USE_CXI, with#elsepreserving the verbs path. One NIC per rank (local_rank % 4,rail-aligned). Deliberately avoids write-with-immediate / remote-CQ-data,
which the CXI provider does not support here; control notifications map to
libfabric atomic adds instead.
(
internode_ll.cu).internode.cu/ep_configs.cuhdrop thehard-coded 8-local-rank assumption, including an EP8 combine-config fix for
4-GPU nodes;
test_internode.pyallows 4 local ranks.init_distsets the CUDA device correctlyunder srun/multi-GPU launchers;
Buffermanages scratch allocation andinitializes UCCL internally; proxy init takes a
data_ptrinstead of a rawtorch tensor.
Buffer::get_dispatch_layout()'s zero-token early return clearednum_ranksints in a[num_rdma_ranks]-sized tensor — a heap overrunwhenever a rank gets an empty DP-attention microbatch (e.g. EP16 on four
4-GPU nodes wrote 16 ints into a 4-int allocation). Extent is now
num_rdma_ranks * sizeof(int), with a distributed regression guard inep/bench/test_zero_layout_guard.pythat fails on the unfixed code.cudaHostAllocunderUSE_CXI) were initialized withcudaMemset, which can fail with CUDAinvalid argument; they now use
std::memset, withcudaMemsetkept forGPU/managed allocations.
Status / validation
in the project Podman container with
USE_CXI=1.test_internode.pypass.ep/bench/test_zero_layout_guard.pypasses on 2 nodes / 8 GPUs(
[zero-layout-guard] pass world=8 num_nodes=2); without the fix it failswith the intended guard-clobber assertion.
--all2all-backend deepep_high_throughputon this stack sustains multi-thousand-token/sserving across 2 nodes.
Notes for reviewers
fprintf(stderr, ...); could be routed through a loggingmacro.