ep: CXI/Slingshot (libfabric) transport backend behind USE_CXI by fergusfinn · Pull Request #1 · doublewordai/uccl

fergusfinn · 2026-06-10T14:20:50Z

Summary

Adds a CXI / HPE Slingshot (libfabric) transport backend for the UCCL EP
(expert-parallelism) dispatch/combine path, as a minimal patch gated behind
USE_CXI. When USE_CXI is not defined the existing verbs/EFA path is
completely untouched.

The public dispatch / combine / Buffer / Config interfaces are
unchanged, so this stays DeepEP-compatible and drop-in.

What's here

New self-contained module ep/{include,src}/cxi_transport.* —
uccl::cxi::Transport, an RAII libfabric (cxi provider) endpoint:
CUDA + host MR registration (dmabuf-friendly), peer address exchange/insert,
RMA write, inject_atomic_add64, and CQ completion polling. Whole file is
#ifdef USE_CXI.
Wire-in through rdma.cpp / proxy.cpp behind #ifdef USE_CXI, with
#else preserving the verbs path. One NIC per rank (local_rank % 4,
rail-aligned). Deliberately avoids write-with-immediate / remote-CQ-data,
which the CXI provider does not support here; control notifications map to
libfabric atomic adds instead.
Low-latency internode path fixed to work over the CXI transport
(internode_ll.cu).
GH200 4-GPU-per-node support: internode.cu / ep_configs.cuh drop the
hard-coded 8-local-rank assumption, including an EP8 combine-config fix for
4-GPU nodes; test_internode.py allows 4 local ranks.
Initialization robustness: init_dist sets the CUDA device correctly
under srun/multi-GPU launchers; Buffer manages scratch allocation and
initializes UCCL internally; proxy init takes a data_ptr instead of a raw
torch tensor.
Two bug fixes found during serving (relevant beyond CXI):
- Buffer::get_dispatch_layout()'s zero-token early return cleared
  num_ranks ints in a [num_rdma_ranks]-sized tensor — a heap overrun
  whenever a rank gets an empty DP-attention microbatch (e.g. EP16 on four
  4-GPU nodes wrote 16 ints into a 4-int allocation). Extent is now
  num_rdma_ranks * sizeof(int), with a distributed regression guard in
  ep/bench/test_zero_layout_guard.py that fails on the unfixed code.
- Host-pinned EP atomic buffers (allocated with cudaHostAlloc under
  USE_CXI) were initialized with cudaMemset, which can fail with CUDA
  invalid argument; they now use std::memset, with cudaMemset kept for
  GPU/managed allocations.

Status / validation

Builds and runs on Isambard (GH200 + Slingshot-11/CXI), 2 nodes × 4 GPUs,
in the project Podman container with USE_CXI=1.
Dispatch/combine correctness checks in test_internode.py pass.
ep/bench/test_zero_layout_guard.py passes on 2 nodes / 8 GPUs
([zero-layout-guard] pass world=8 num_nodes=2); without the fix it fails
with the intended guard-clobber assertion.
Serving validated end-to-end: vLLM with --all2all-backend deepep_high_throughput on this stack sustains multi-thousand-token/s
serving across 2 nodes.

Notes for reviewers

Error paths use fprintf(stderr, ...); could be routed through a logging
macro.
Produced with heavy coding-agent assistance

Minimal, USE_CXI-gated CXI transport for the EP dispatch/combine path; the existing verbs/EFA path is untouched when USE_CXI is not defined. - New self-contained uccl::cxi::Transport module (ep/{include,src}/cxi_transport.*): RAII libfabric (cxi provider) endpoint, CUDA + host MR registration, peer insertion, RMA write, inject atomic-add64, and CQ completion polling. - Wire-in via rdma.cpp/proxy.cpp behind #ifdef USE_CXI; one NIC per rank (local_rank % 4, rail-aligned), no remote-CQ-data / write-with-imm dependency. - internode.cu / ep_configs.cuh: support GH200 4-GPU nodes (drop hard-coded 8-local-rank assumption); test_internode.py allows 4 local ranks. Public dispatch/combine/Buffer/Config interfaces unchanged (DeepEP-compatible).

chatgpt-codex-connector · 2026-06-10T14:20:56Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

…i-GPU / srun environments

…sor and make initialize_uccl robust

…n test_internode_simple

(cherry picked from commit 75a2617)

Carry only the runtime fix from 4c5b63f. The focused regression test remains split out with the delayed test changes.

Keep the zero-token runtime fix in the base branch and carry only the focused guard coverage here.

Copilot

Pull request overview

Adds an optional (USE_CXI) CXI/Slingshot (libfabric) transport backend for the EP internode dispatch/combine path while preserving the existing verbs/EFA implementation when CXI is disabled. It also improves GH200 4-GPU-per-node support, tightens initialization/robustness, and includes a couple of correctness fixes found during serving.

Changes:

Introduce uccl::cxi::Transport (libfabric/cxi provider) and wire it into RDMA/proxy paths behind #ifdef USE_CXI (RMA writes + atomic-add control signaling).
Update internode kernels and configs to support NUM_MAX_NVL_PEERS <= 8 and 4-GPU nodes (including LL internode fixes).
Improve runtime/bench robustness (CUDA device selection, scratch pointer handling), add build toggles in setup.py, and expand test/bench CLI options.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
ep/src/uccl_proxy.cpp	CXI/host atomic buffer allocation + stop/destroy behavior adjustments.
ep/src/uccl_ep.cc	Fix RDMA-rank memset overrun; minor includes and helper wiring for atomic-buffer connection.
ep/src/rdma.cpp	Add CXI transport initialization, connection info exchange, batched CXI writes, and CXI atomics.
ep/src/proxy.cpp	CXI-aware peer selection, pinning, barrier signaling via CXI atomics, and 4-GPU stride fixes.
ep/src/internode.cu	Relax NVL peer assumptions and avoid type-punning loads by explicitly packing NVL presence.
ep/src/internode_ll.cu	Make LL receive path robust when CXI forces internode even for same-node peers.
ep/src/cxi_transport.cpp	New libfabric-based CXI transport implementation (endpoint + MR + RMA + CQ polling).
ep/setup.py	Add `USE_CXI` build plumbing and libfabric link/include configuration.
ep/include/rdma.hpp	Extend `RDMAConnectionInfo` with CXI endpoint/MR metadata (gated by `USE_CXI`).
ep/include/proxy_ctx.hpp	Add CXI transport fields to `ProxyCtx` under `USE_CXI`.
ep/include/ep_configs.cuh	Allow overriding `NUM_MAX_NVL_PEERS` at compile time.
ep/include/cxi_transport.hpp	New CXI transport public header (gated by `USE_CXI`).
ep/bench/utils.py	Robust device initialization + accept scratch as tensor or raw pointer.
ep/bench/test_internode.py	Support 4 local ranks, add smoke/fixed-chunk options, set CXI phase env for stats.
ep/bench/test_internode_simple.py	Use `Buffer` directly (less proxy plumbing), safer destroy flow.
ep/bench/buffer.py	Adjust EP8 combine config for 4-GPU nodes.
AGENTS.md	Add Isambard/GH200+CXI build and multinode test guidance.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Replace the blocking post/wait_all/atomics sequence with an async design: - writes post without waiting; ring slots retire on CQ completion via a per-loop retire sweep (verbs-style), with an outstanding cap as backpressure - tail/control atomics inject immediately behind their writes, relying on same-TX-context submission ordering (measured on Slingshot-11; see uccl-project#956 discussion) - -FI_EAGAIN on any post is handled by polling + retry instead of terminating - UCCL_CXI_SYNC_WRITES=1 restores the previous conservative behavior - quiet drains all outstanding writes; shutdown drains bounded Also: env-gated UCCL_CXI_DELIVERY_COMPLETE=1 knob on TX (audit A/B).

Immediate same-TX injection reordered under FI_CXI_RDZV_THRESHOLD=0: every write goes rendezvous (target pulls data), so an eager atomic lands before the payload it announces -> corrupted dispatch output at >=1024 tokens. Queue atomics per peer with threshold = writes posted before them; flush in cxi_retire_ctx as completions arrive. Standalone atomics (no writes in flight) still inject immediately. quiet/barrier/shutdown drain the queue.

…_loop

… assert

ep/cxi: async write path with completion-gated control atomics

- The CXI barrier compared a 21-bit-masked barrier_seq against per-node slot counters that accumulate forever (NIC atomic adds, never reset); after 2^21 barriers in one process the seq wraps and every release check passes spuriously. Keep seq monotonic (unmasked) under USE_CXI; the mask only exists for the verbs immediate-data encoding. - atomic_buffer_ptr_ was uninitialized for thread_idx != 0 proxies, so the null-check guard in get_atomic_buffer_ptr() read an indeterminate pointer. Initialize to nullptr. Both flagged by Copilot review on the PR.

fergusfinn · 2026-06-17T14:56:29Z

Closing this as superseded by uccl-project#997.

uccl-project#997 is now the cleaner upstream CXI/Slingshot EP path and is already open against uccl-project/main from swiss-ai/main. This PR was useful as validation/history, but it mixes the older USE_CXI transport wiring with local serving/debug fixes. Any remaining useful deltas should move forward as small PRs on top of uccl-project#997 rather than keeping this branch alive as a competing upstream path.

Copilot AI review requested due to automatic review settings June 10, 2026 14:20

Copilot started reviewing on behalf of fergusfinn June 10, 2026 14:21 View session

This was referenced Jun 10, 2026

ep: CXI/Slingshot (libfabric) transport backend behind USE_CXI fergusfinn/uccl#1

Closed

ep: vLLM serving integration fixes (device selection, disagg proxy, test controls) #2

Closed

fergusfinn added 9 commits June 10, 2026 15:26

Fix init_dist to correctly set CUDA device and default device in mult…

7eb9fd2

…i-GPU / srun environments

Fix proxy initialization by passing data_ptr instead of raw torch ten…

3c3e5b9

…sor and make initialize_uccl robust

Let Buffer manage scratch allocation and initialize uccl internally i…

c148a45

…n test_internode_simple

Revert changes to test_internode_simple.py

0ef8108

Fix CXI low-latency internode path

35ba3b6

Fix EP8 combine config for 4-GPU nodes

6be7e09

Fix host atomic buffer initialization

16de290

(cherry picked from commit 75a2617)

Fix zero-token RDMA rank count clearing

7dec293

Carry only the runtime fix from 4c5b63f. The focused regression test remains split out with the delayed test changes.

Add zero-layout guard regression test

c386222

Keep the zero-token runtime fix in the base branch and carry only the focused guard coverage here.

fergusfinn force-pushed the cxi-ep branch from 5e5adad to c386222 Compare June 10, 2026 14:26

Copilot AI reviewed Jun 10, 2026

View reviewed changes

Comment thread ep/src/uccl_proxy.cpp

Comment thread ep/src/proxy.cpp

This comment was marked as low quality.

Sign in to view

This comment was marked as abuse.

Sign in to view

This comment was marked as resolved.

Sign in to view

fergusfinn mentioned this pull request Jun 10, 2026

[Proposal] libfabric-CXI backend for UCCL-EP on HPE Slingshot uccl-project/uccl#956

Open

fergus barratt added 8 commits June 11, 2026 06:59

ep/bench: sustained dispatch loop for wire-utilization measurement

b9e6a6c

ep/bench: fix dispatch_loop teardown ordering (context-destroyed abort)

9b8dc12

ep/bench: drop cached-dispatch output ref before teardown in dispatch…

e808810

…_loop

ep/bench: scope measurement tensors so dispatch_loop tears down cleanly

98304f6

ep: account for the CXI barrier carve-out in the atomic-buffer sizing…

8b45c6d

… assert

gitignore core dumps (containers run with the repo as workdir)

19f36fc

fergusfinn mentioned this pull request Jun 11, 2026

ep/cxi: async write path with completion-gated control atomics #4

Merged

Merge pull request #4 from doublewordai/cxi-async-transport

dd2a672

ep/cxi: async write path with completion-gated control atomics

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ep: CXI/Slingshot (libfabric) transport backend behind USE_CXI#1

ep: CXI/Slingshot (libfabric) transport backend behind USE_CXI#1
fergusfinn wants to merge 20 commits into
mainfrom
cxi-ep

fergusfinn commented Jun 10, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot commented Jun 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

This comment was marked as low quality.

Uh oh!

This comment was marked as abuse.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as low quality.

Uh oh!

fergusfinn commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

fergusfinn commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's here

Status / validation

Notes for reviewers

Uh oh!

chatgpt-codex-connector Bot commented Jun 10, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

This comment was marked as low quality.

Uh oh!

This comment was marked as abuse.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as low quality.

Uh oh!

fergusfinn commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fergusfinn commented Jun 10, 2026 •

edited

Loading