Add CXI transport support for EP by PanJason · Pull Request #997 · uccl-project/uccl

PanJason · 2026-06-16T12:46:30Z

Description

This PR adds a libfabric/CXI transport path for UCCL EP so the EP proxy can run over HPE Slingshot/Cassini through the cxi provider.

Implementation details:

Adds an EpTransport abstraction and a CxiTransport implementation, enabled with USE_LIBFABRIC_CXI=1.
Initializes libfabric FI_EP_RDM endpoints on the CXI domain matching the local GPU rank (cxi0, cxi1, ...), requesting RMA, atomic, fence, HMEM, local, and remote communication capabilities.
Uses per-peer CXI transport state in the proxy: cxi_transports_by_rank_ is keyed by peer rank, each peer transport owns its local CXI endpoint/MRs and inserts the matching remote endpoint name into the libfabric AV, and post_cxi_commands() routes each WRITE/ATOMIC to the destination peer's transport. Per-peer endpoints are used because a shared endpoint reduced performance significantly.
Registers the EP main RDMA buffer and atomic buffer as CUDA HMEM memory regions with provider keys, and exchanges the CXI endpoint name plus MR keys through the existing proxy metadata path.
Maps EP WRITE commands to fi_write: the proxy decodes the local and remote EP buffer offsets and posts a CXI RMA write from the registered local main buffer to the peer main-buffer MR/key.
Maps EP atomic notifications/control commands to libfabric atomics: normal EP ATOMIC commands become fenced FI_SUM fi_atomicmsg operations into the peer atomic buffer; WRITE commands carrying an atomic signal post the payload fi_write first and then a fenced FI_SUM atomic notification so the completion associated with the WR id corresponds to the notification step.
Uses a small host-registered operand pool for outgoing atomics and polls the CXI CQ to translate libfabric completions back into the proxy's existing WR acknowledgement flow.
Adds generic build knobs for CXI builds (USE_LIBFABRIC_CXI, LIBFABRIC_HOME, NUM_MAX_NVL_PEERS) and relaxes the internode benchmark local-rank assumption for 4-GPU GH200 nodes.
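
To make the per-peer routing and the WRITE-then-notification ordering described above concrete, here is a minimal sketch. It is not the code from this PR: `Cmd`, `Router`, and `MockTransport` are hypothetical names, the `EpTransport` interface shown is a guess at its shape, and the mock records calls instead of issuing `fi_write` / `fi_atomicmsg`.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical command shape; the real EP proxy command layout is not shown here.
enum class CmdType { WRITE, ATOMIC };

struct Cmd {
  CmdType type;
  int dst_rank;
  uint64_t wr_id;
  bool has_atomic_signal;  // WRITE that also carries a notification
};

// Assumed minimal interface for the EpTransport abstraction.
class EpTransport {
 public:
  virtual ~EpTransport() = default;
  virtual void post_write(uint64_t wr_id) = 0;
  virtual void post_fenced_atomic_add(uint64_t wr_id) = 0;
};

// Recording stand-in for a libfabric-backed transport.
class MockTransport final : public EpTransport {
 public:
  std::vector<std::string> log;
  void post_write(uint64_t wr_id) override {
    log.push_back("write:" + std::to_string(wr_id));
  }
  void post_fenced_atomic_add(uint64_t wr_id) override {
    log.push_back("atomic:" + std::to_string(wr_id));
  }
};

// One transport per peer rank; each command goes to its destination's transport.
class Router {
 public:
  void add_peer(int rank, std::unique_ptr<EpTransport> transport) {
    transports_by_rank_[rank] = std::move(transport);
  }

  void post(Cmd const& cmd) {
    auto it = transports_by_rank_.find(cmd.dst_rank);
    if (it == transports_by_rank_.end()) {
      throw std::runtime_error("peer transport is not initialized");
    }
    EpTransport& transport = *it->second;
    if (cmd.type == CmdType::ATOMIC) {
      transport.post_fenced_atomic_add(cmd.wr_id);
      return;
    }
    // Payload first, then the fenced notification, so the completion tied
    // to wr_id corresponds to the notification step.
    transport.post_write(cmd.wr_id);
    if (cmd.has_atomic_signal) transport.post_fenced_atomic_add(cmd.wr_id);
  }

 private:
  std::unordered_map<int, std::unique_ptr<EpTransport>> transports_by_rank_;
};
```

In this sketch a WRITE with `has_atomic_signal` set produces two posts on the destination peer's transport, in payload-then-notification order, and a command for an unknown rank throws instead of falling back to a shared endpoint.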

Related issue: #956

Type of Change

Bug fix
New feature
Documentation update

How Has This Been Tested?

Unit tests
Integration tests
Manual testing

Formatting:

Format checker was run.

Build/validation:

Our Alps cluster uses the Slurm scheduler, so we wrote custom sbatch scripts that install ep inside the target container with USE_LIBFABRIC_CXI=1 via python3 setup.py install.
Full EP validation completed on Alps GH200 nodes with the CXI transport enabled.

| EP | Job | Result | FP8 dispatch | BF16 dispatch | Combine |
|---|---|---|---|---|---|
| 8 | `2544644` | completed `0:0` | 40.15-40.37 GB/s RDMA (1495-1502 us) | 42.94-43.00 GB/s RDMA (2722-2724 us) | 42.20-42.33 GB/s RDMA (2765-2772 us) |
| 16 | `2544645` | completed `0:0` | 25.95-26.23 GB/s RDMA (4172-4203 us) | 25.92-26.31 GB/s RDMA (8064-8126 us) | 26.63-27.10 GB/s RDMA (7829-7928 us) |
| 32 | `2544646` | completed `0:0` | 22.84-23.05 GB/s RDMA (5234-5281 us) | 22.24-22.45 GB/s RDMA (10423-10519 us) | 22.12-22.50 GB/s RDMA (10399-10566 us) |

No errors were seen in the log.

Note: the representative performance validation above used the same CXI implementation plus a local benchmark/process-affinity patch to pin torchrun forked processes by local rank. On GH200 this pinning is needed for representative numbers; without it, the clean-branch validation still completed but measured noticeably slower.
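
For readers who want to reproduce the pinning, a rough sketch of per-local-rank CPU affinity follows. The actual benchmark patch is not part of this PR text, so everything here is an assumption: the helper names are hypothetical, cores are assumed to be split evenly and contiguously across local ranks, and the Linux `sched_setaffinity` call stands in for whatever mechanism the patch uses.

```cpp
#include <cassert>
#include <utility>
#ifdef __linux__
#include <sched.h>
#endif

// Half-open core range [first, last) for one local rank, assuming an even,
// contiguous split of the node's cores across its local ranks.
inline std::pair<int, int> core_range_for_local_rank(int local_rank,
                                                     int ranks_per_node,
                                                     int cores_per_node) {
  int const per_rank = cores_per_node / ranks_per_node;
  return {local_rank * per_rank, (local_rank + 1) * per_rank};
}

// Pin the calling process to the given core range. Returns true on success.
// Core ids must be below CPU_SETSIZE for the fixed-size cpu_set_t used here.
inline bool pin_to_core_range(std::pair<int, int> range) {
#ifdef __linux__
  cpu_set_t set;
  CPU_ZERO(&set);
  for (int core = range.first; core < range.second; ++core) {
    CPU_SET(core, &set);
  }
  return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
  (void)range;
  return false;  // no portable equivalent in this sketch
#endif
}
```

With 4 local ranks and an illustrative 288 cores per node, local rank 1 would map to cores 72 through 143. The right mapping depends on the node's NUMA layout, which this sketch does not inspect.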

Checklist

I have run format.sh to follow the style guidelines.
I have run build.sh to verify compilation. The full build.sh container wheel build was not run; instead, the target Slurm/container validation built and installed ep with USE_LIBFABRIC_CXI=1.
I have removed redundant variables and comments.
I have updated the documentation. No user-facing documentation was added for this transport backend in this PR.
I have added tests. No standalone unit test was added; validation used the existing EP internode benchmark on real CXI hardware.

PanJason · 2026-06-16T13:30:20Z

@MaoZiming Could you take a look and let us know what other tests are needed for merging?

YangZhou1997 · 2026-06-19T17:45:02Z

/run-benchmark amd

YangZhou1997 · 2026-06-19T18:08:26Z

Hi @PanJason , thank you for the contribution!

Compilation on non-CXI machines triggers errors; see https://github.com/uccl-project/uccl/actions/runs/27840541212/job/82398387232

PanJason · 2026-06-20T10:41:52Z

Added a fix that guards the CXI libfabric include so it is only pulled in on CXI builds. Could you check again?
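
A guard of this kind typically looks like the sketch below. This is an illustration, not the committed change: it assumes the build knob reaches the compiler as a `USE_LIBFABRIC_CXI` preprocessor macro, and `kCxiCompiledIn` / `cxi_compiled_in()` are hypothetical names. `<rdma/fabric.h>` and `<rdma/fi_atomic.h>` are real libfabric headers, but which headers the PR actually guards is not shown in this thread.

```cpp
#include <cassert>

// Only pull in libfabric headers when the CXI backend is compiled in, so
// machines without libfabric installed can still build the other transports.
#if defined(USE_LIBFABRIC_CXI)
#include <rdma/fabric.h>
#include <rdma/fi_atomic.h>
constexpr bool kCxiCompiledIn = true;
#else
constexpr bool kCxiCompiledIn = false;
#endif

// Hypothetical helper: callers can branch on this instead of on the macro.
constexpr bool cxi_compiled_in() { return kCxiCompiledIn; }
```

Built without `-DUSE_LIBFABRIC_CXI`, the translation unit never references a libfabric header, which is the property the failing CI job needed.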

YangZhou1997 · 2026-06-20T17:55:56Z

/run-benchmark amd

github-actions · 2026-06-20T17:56:06Z

Benchmark `amd` passed

PR: #997
Commit: 224f02da646205036bd3782eceb52b72a1acd8f7
Result: success
Workflow run: https://github.com/uccl-project/uccl/actions/runs/27879248891

YangZhou1997 · 2026-06-20T18:21:37Z

/run-benchmark gh200

github-actions · 2026-06-20T18:21:47Z

Benchmark `gh200` passed

PR: #997
Commit: 224f02da646205036bd3782eceb52b72a1acd8f7
Result: success
Workflow run: https://github.com/uccl-project/uccl/actions/runs/27879881202

YangZhou1997 · 2026-06-20T18:42:53Z

Both benchmarks passed. Could you take a close look if possible? @MaoZiming

MaoZiming · 2026-06-21T16:17:44Z

@PanJason Thank you! Taking a look

MaoZiming · 2026-06-21T16:27:21Z

      }
    }
 #endif
-    break;


#998
Let's continue the discussion there. Do you see any issues when you (1) remove the first break and (2) toggle the second break on and off? Were the "forwarder epoch/NVL receiver timeouts" observed only with the first break removed and the second break kept, and do they disappear when both breaks are removed?

I will do some A/B testing to answer your question. One thing I want to know:

Shall I respond to you here or in [bugfix]: quiet all CXI proxy queues before reuse #998?

Also, should the following commits be included in [bugfix]: quiet all CXI proxy queues before reuse #998 or here?

I ask because I feel this is a generic fix rather than something specific to the CXI port.

Thanks! Let's include that in #998!
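
For context on why a stray `break` matters here, the sketch below contrasts a loop that exits after the first proxy queue with one that visits every queue. It is a simplified model with hypothetical names (`ProxyQueue`, `post_quiet_first_only`, `post_quiet_all`); the real proxy posts commands asynchronously, whereas this model drains synchronously.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one proxy thread's command queue.
struct ProxyQueue {
  int pending = 0;            // commands not yet completed
  bool quiet_posted = false;  // whether a quiet command reached this queue
};

// Buggy shape: the break leaves every queue after the first undrained.
inline void post_quiet_first_only(std::vector<ProxyQueue>& queues) {
  for (auto& queue : queues) {
    queue.quiet_posted = true;
    queue.pending = 0;
    break;  // premature loop exit
  }
}

// Fixed shape: every proxy queue receives the quiet command.
inline void post_quiet_all(std::vector<ProxyQueue>& queues) {
  for (auto& queue : queues) {
    queue.quiet_posted = true;
    queue.pending = 0;
  }
}

// True only if no queue still has outstanding work.
inline bool all_drained(std::vector<ProxyQueue> const& queues) {
  for (auto const& queue : queues) {
    if (queue.pending != 0 || !queue.quiet_posted) return false;
  }
  return true;
}
```

With more than one proxy thread, only the second variant guarantees that all outstanding work has completed before the RDMA buffer is reused.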

MaoZiming · 2026-06-21T16:32:52Z

+        throw std::runtime_error("CXI peer transport is not initialized");
+      }
+      transport->connect_peer(peer, remote_infos_[peer]);
+      if (proxy_trace_enabled()) {


Let's clean up the profiling code, e.g. proxy_trace_enabled, cmd_name, etc.

Cleaned in fbbaf3d

MaoZiming · 2026-06-21T16:36:29Z

+      cxi_transports_by_rank_[peer] = std::move(transport);
+    }
+
+    std::thread receiver_thread([this, num_ranks, my_rank]() {


Let's try to reduce duplicated code: e.g. std::thread receiver_thread, send_connection_info_as_client, etc. also appear on the non-CXI path.

These two are now folded into exchange_peer_connection_info. The duplicated if clause that decides whether to build a connection is also folded into should_connect_peer. Cleaned in b0a4d68

MaoZiming · 2026-06-21T16:57:08Z

 uint64_t Proxy::completed_wr() const { return completion_count_; }

+bool Proxy::use_cxi_transport() const {
+  char const* transport = std::getenv("UCCL_EP_TRANSPORT");


We can cache whether the transport is CXI rather than reading it from the environment every time.

Cached in the use_cxi_transport_ variable. See f46f191
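
The caching pattern can be sketched as follows. This is not the code from f46f191: `TransportSelector` is a hypothetical class, and the assumption that the value `"cxi"` selects the CXI path is mine, since the comparison is not visible in the diff excerpt above.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

class TransportSelector {
 public:
  // Read the environment once at construction and remember the answer.
  TransportSelector() : use_cxi_transport_(read_env()) {}

  // Hot-path query: no getenv call, just a member read.
  bool use_cxi_transport() const { return use_cxi_transport_; }

 private:
  static bool read_env() {
    char const* transport = std::getenv("UCCL_EP_TRANSPORT");
    // Assumption: the string "cxi" is what selects the CXI transport.
    return transport != nullptr && std::strcmp(transport, "cxi") == 0;
  }

  bool const use_cxi_transport_;
};
```

One consequence of caching is that later changes to the environment variable no longer affect an already constructed object, which is usually the desired behavior for a transport choice.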

MaoZiming · 2026-06-21T16:58:35Z

+      if (cfg_.rank == 0) {
+        uint64_t const expected_arrivals =
+            seq * static_cast<uint64_t>(std::max(0, cfg_.num_nodes - 1));
+        if (load_cxi_barrier_word_sum(kArrivalSlot) < expected_arrivals) {


Nit: the barrier seq never wraps around.

Removed the wrap handling from the CXI code in 7227ea4
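
The reason no wrap handling is needed can be seen in a small model of the check in the diff excerpt above. `BarrierModel` is a hypothetical stand-in: a local `std::atomic` replaces the remotely updated barrier word, `arrive()` stands in for a remote FI_SUM add of 1, and rounds are assumed to be numbered from 1.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of a barrier whose arrival slot is a running sum that
// is never reset between rounds.
struct BarrierModel {
  std::atomic<uint64_t> arrival_sum{0};
  int num_nodes;

  explicit BarrierModel(int nodes) : num_nodes(nodes) {}

  // A remote node signals arrival by adding 1 to the running sum.
  void arrive() { arrival_sum.fetch_add(1, std::memory_order_release); }

  // Rank 0's check for round `seq`: every other node must have arrived
  // `seq` times in total, so the threshold grows monotonically with seq.
  bool round_complete(uint64_t seq) const {
    uint64_t const expected_arrivals =
        seq * static_cast<uint64_t>(std::max(0, num_nodes - 1));
    return arrival_sum.load(std::memory_order_acquire) >= expected_arrivals;
  }
};
```

Because both the sequence number and the sum are 64-bit and only ever increase, comparing cumulative totals is enough, and a 64-bit counter will not realistically overflow during a run.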

MaoZiming · 2026-06-21T17:00:44Z

+#include <stdexcept>
+#include <vector>
+
+class CxiTransport final : public EpTransport {


cc @YangZhou1997 I think eventually it would be nice to package all the different kinds of transport into a class, since we have more of them now. Right now only CXI instantiates EpTransport.

MaoZiming · 2026-06-21T17:54:10Z

I also noticed that nproc_per_node=4 (< 8) doesn't work on other transport paths, e.g. on EFA. The value is currently hard-coded:
https://github.com/uccl-project/uccl/blob/main/ep/include/common.hpp#L88
We would also need to parameterize it for the other transports if we want to relax nproc_per_node < 8 in general. It might be good to keep that as a separate PR.

https://github.com/uccl-project/uccl/blob/main/ep/src/rdma.cpp#L1453-L1454

PanJason · 2026-06-21T18:45:05Z

It might be good to keep that as a separate PR.

@MaoZiming I created a separate PR that makes MAX_NUM_GPUS configurable: #1005
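
As an illustration of what "configurable with a default of 8" usually means at the source level, here is a minimal sketch. It is not taken from #1005: passing the value as `-DMAX_NUM_GPUS=<n>` is an assumption about how setup.py and the Makefile forward it, and `fits_local_world_size` is a hypothetical helper.

```cpp
#include <cassert>

// Default stays at 8 unless the build passes -DMAX_NUM_GPUS=<n>.
#ifndef MAX_NUM_GPUS
#define MAX_NUM_GPUS 8
#endif

static_assert(MAX_NUM_GPUS > 0, "MAX_NUM_GPUS must be positive");

// Hypothetical helper: check a launch-time nproc_per_node against the
// compile-time bound instead of assuming exactly 8 local ranks.
constexpr bool fits_local_world_size(int nproc_per_node) {
  return nproc_per_node > 0 && nproc_per_node <= MAX_NUM_GPUS;
}
```

With the default build, `fits_local_world_size(4)` holds for a 4-GPU node while 9 local ranks would be rejected. Code that treats the constant as an exact count rather than an upper bound would need further changes.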

PanJason · 2026-06-23T18:58:37Z

@MaoZiming Anything else needed for merging?

MaoZiming

Thanks @PanJason! I just tested it. LGTM

PanJason · 2026-06-24T08:52:52Z

@YangZhou1997 Could you please also take a look at your convenience?

Yueyang Pan and others added 2 commits June 16, 2026 14:23

[ep]: add cxi transport support

593e16f

- add a libfabric CXI transport backend and wire it into the EP proxy runtime
- add generic build switches for USE_LIBFABRIC_CXI and NUM_MAX_NVL_PEERS
- relax internode benchmark assumptions so validation works on non-8-GPU nodes

Merge pull request #7 from swiss-ai/slingshot-dev-clean

38a24ae

Add CXI transport support for EP

PanJason requested review from MaoZiming, YangZhou1997 and zhenhuang12 as code owners June 16, 2026 12:46

PanJason mentioned this pull request Jun 16, 2026

[Proposal] libfabric-CXI backend for UCCL-EP on HPE Slingshot #956

Open

Yueyang Pan and others added 2 commits June 16, 2026 19:05

[bugfix]: drain all proxy queues during sync

7bdfab2

- remove premature loop exits when posting quiet and barrier commands
- ensure every proxy thread is drained before RDMA buffer reuse

Merge pull request #8 from swiss-ai/slingshot-dev-clean

554dff8

[bugfix]: drain all proxy queues during sync

This was referenced Jun 17, 2026

p2p: add CXI transport endpoint doublewordai/uccl#6

Closed

[P2P]: add CXI transport endpoint #999

Merged

ep: CXI/Slingshot (libfabric) transport backend behind USE_CXI doublewordai/uccl#1

Closed

YangZhou1997 added the run-benchmark label Jun 18, 2026

Merge branch 'main' into main

298d622

YangZhou1997 removed the run-benchmark label Jun 19, 2026

Yueyang Pan and others added 2 commits June 19, 2026 23:26

[fix]: guard CXI libfabric atomic include

179fd96

Merge pull request #10 from swiss-ai/compilation-fix

0baf3bf

fix(ep): guard CXI libfabric include

Merge branch 'main' into main

224f02d

MaoZiming reviewed Jun 21, 2026

Yueyang Pan added 4 commits June 21, 2026 19:48

[refactor]: remove proxy trace logging

fbbaf3d

[refactor]: share proxy connection exchange

b0a4d68

[refactor]: cache CXI transport selection

f46f191

[fix]: keep CXI barrier sequence monotonic

7227ea4

[ep]: make MAX_NUM_GPUS configurable

6edc474

- keep the default MAX_NUM_GPUS value at 8
- allow setup.py and Makefile builds to override MAX_NUM_GPUS
- preserve NUM_MAX_NVL_PEERS as a separate EP topology knob

Merge branch 'main' into main

91026e6

MaoZiming approved these changes Jun 24, 2026

fergusfinn mentioned this pull request Jun 24, 2026

fix(merge): drop stray break in amo_nonfetch_add doublewordai/uccl#17

Open

YangZhou1997 merged commit 5f78eb6 into uccl-project:main Jun 24, 2026
14 checks passed
