Add CXI transport support for EP #997

Merged: YangZhou1997 merged 14 commits into uccl-project:main from swiss-ai:main on Jun 24, 2026

@PanJason PanJason commented Jun 16, 2026

Contributor

Description

This PR adds a libfabric/CXI transport path for UCCL EP so the EP proxy can run over HPE Slingshot/Cassini through the cxi provider.

Implementation details:

  • Adds an EpTransport abstraction and a CxiTransport implementation, enabled with USE_LIBFABRIC_CXI=1.
  • Initializes libfabric FI_EP_RDM endpoints on the CXI domain matching the local GPU rank (cxi0, cxi1, ...), requesting RMA, atomic, fence, HMEM, local, and remote communication capabilities.
  • Uses per-peer CXI transport state in the proxy: cxi_transports_by_rank_ is keyed by peer rank, each peer transport owns its local CXI endpoint and MRs and inserts the matching remote endpoint name into the libfabric AV, and post_cxi_commands() routes each WRITE/ATOMIC to the destination peer's transport. Per-peer endpoints are used because a shared endpoint decreased performance significantly.
  • Registers the EP main RDMA buffer and atomic buffer as CUDA HMEM memory regions with provider keys, and exchanges the CXI endpoint name plus MR keys through the existing proxy metadata path.
  • Maps EP WRITE commands to fi_write: the proxy decodes the local and remote EP buffer offsets and posts a CXI RMA write from the registered local main buffer to the peer main-buffer MR/key.
  • Maps EP atomic notifications/control commands to libfabric atomics: normal EP ATOMIC commands become fenced FI_SUM fi_atomicmsg operations into the peer atomic buffer; WRITE commands carrying an atomic signal post the payload fi_write first and then a fenced FI_SUM atomic notification so the completion associated with the WR id corresponds to the notification step.
  • Uses a small host-registered operand pool for outgoing atomics and polls the CXI CQ to translate libfabric completions back into the proxy's existing WR acknowledgement flow.
  • Adds generic build knobs for CXI builds (USE_LIBFABRIC_CXI, LIBFABRIC_HOME, NUM_MAX_NVL_PEERS) and relaxes the internode benchmark local-rank assumption for 4-GPU GH200 nodes.
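
The per-peer routing and the WRITE-then-notification ordering described in the list above can be pictured with a small mock. This is only a sketch under stated assumptions: `Command`, `MockTransport`, `TransportMap`, and `route_commands` are hypothetical names, the method names on `EpTransport` are invented for illustration, and the mock records strings where the real transport would post libfabric operations. Only the idea of a rank-keyed transport map and the ordering of the payload write before the fenced atomic come from the description.

```cpp
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical command shape; the real proxy command layout is not shown here.
enum class CmdType { kWrite, kAtomic };
struct Command {
  CmdType type;
  int dst_rank;
  bool with_atomic_signal;  // WRITE that also carries a notification
  uint64_t wr_id;
};

// Minimal stand-in for the transport abstraction (method names are assumed).
class EpTransport {
 public:
  virtual ~EpTransport() = default;
  virtual void post_write(uint64_t wr_id) = 0;
  virtual void post_fenced_atomic_add(uint64_t wr_id) = 0;
};

// Mock that records posted operations instead of calling libfabric.
class MockTransport final : public EpTransport {
 public:
  void post_write(uint64_t wr_id) override {
    log.push_back("write:" + std::to_string(wr_id));
  }
  void post_fenced_atomic_add(uint64_t wr_id) override {
    log.push_back("atomic:" + std::to_string(wr_id));
  }
  std::vector<std::string> log;
};

using TransportMap = std::unordered_map<int, std::unique_ptr<EpTransport>>;

// Route each command to the transport owned by its destination peer.
inline void route_commands(TransportMap& by_rank,
                           std::vector<Command> const& cmds) {
  for (Command const& cmd : cmds) {
    auto it = by_rank.find(cmd.dst_rank);
    if (it == by_rank.end()) {
      throw std::runtime_error("CXI peer transport is not initialized");
    }
    EpTransport& transport = *it->second;
    if (cmd.type == CmdType::kAtomic) {
      transport.post_fenced_atomic_add(cmd.wr_id);
    } else {
      // Payload first, then the fenced notification tied to the same WR id.
      transport.post_write(cmd.wr_id);
      if (cmd.with_atomic_signal) transport.post_fenced_atomic_add(cmd.wr_id);
    }
  }
}
```

Keeping one transport object per peer means no state is shared between destinations, which matches the stated reason for avoiding a shared endpoint.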

Related issue: #956

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

  • Unit tests
  • Integration tests
  • Manual testing

Formatting:

  • Format checker was run.

Build/validation:

  • Our Alps cluster uses the Slurm scheduler, so we wrote custom sbatch scripts that install ep inside the target container with USE_LIBFABRIC_CXI=1 via python3 setup.py install.
  • Full EP validation completed on Alps GH200 nodes with the CXI transport enabled.

| EP | Job | Result | FP8 dispatch | BF16 dispatch | Combine |
|----|-----|--------|--------------|---------------|---------|
| 8 | 2544644 | completed 0:0 | 40.15-40.37 GB/s RDMA (1495-1502 us) | 42.94-43.00 GB/s RDMA (2722-2724 us) | 42.20-42.33 GB/s RDMA (2765-2772 us) |
| 16 | 2544645 | completed 0:0 | 25.95-26.23 GB/s RDMA (4172-4203 us) | 25.92-26.31 GB/s RDMA (8064-8126 us) | 26.63-27.10 GB/s RDMA (7829-7928 us) |
| 32 | 2544646 | completed 0:0 | 22.84-23.05 GB/s RDMA (5234-5281 us) | 22.24-22.45 GB/s RDMA (10423-10519 us) | 22.12-22.50 GB/s RDMA (10399-10566 us) |

No errors were seen in the logs.

Note: the representative performance validation above used the same CXI implementation plus a local benchmark/process-affinity patch to pin torchrun forked processes by local rank. On GH200 this pinning is needed for representative numbers; without it, the clean-branch validation still completed but measured noticeably slower.

Checklist

  • I have run format.sh to follow the style guidelines.
  • I have run build.sh to verify compilation. The full build.sh container wheel build was not run; instead, the target Slurm/container validation built and installed ep with USE_LIBFABRIC_CXI=1.
  • I have removed redundant variables and comments.
  • I have updated the documentation. No user-facing documentation was added for this transport backend in this PR.
  • I have added tests. No standalone unit test was added; validation used the existing EP internode benchmark on real CXI hardware.

Yueyang Pan and others added 2 commits June 16, 2026 14:23
- add a libfabric CXI transport backend and wire it into the EP proxy runtime
- add generic build switches for USE_LIBFABRIC_CXI and NUM_MAX_NVL_PEERS
- relax internode benchmark assumptions so validation works on non-8-GPU nodes
Add CXI transport support for EP
@PanJason

Contributor Author

@MaoZiming Could you take a look and let us know what other tests are needed for merging?

Yueyang Pan and others added 2 commits June 16, 2026 19:05
- remove premature loop exits when posting quiet and barrier commands

- ensure every proxy thread is drained before RDMA buffer reuse
[bugfix]: drain all proxy queues during sync
@YangZhou1997

Member

/run-benchmark amd

YangZhou1997 commented Jun 19, 2026

Member

Hi @PanJason , thank you for the contribution!

Compilation on non-CXI machines triggers some errors; see https://github.com/uccl-project/uccl/actions/runs/27840541212/job/82398387232

Yueyang Pan and others added 2 commits June 19, 2026 23:26
@PanJason

Contributor Author

Added a fix to make the include conditional. Could you check again?
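
The actual fix is not shown in this thread. One common way to keep a non-CXI build compiling is to guard the libfabric include behind the build switch, sketched below. `USE_LIBFABRIC_CXI` is the switch named in this PR and `<rdma/fabric.h>` is the standard libfabric header; `HAS_LIBFABRIC_CXI`, `kCxiCompiledIn`, and `require_cxi_support` are hypothetical names used only for this illustration.

```cpp
// Include libfabric only when the build asks for CXI and the header exists.
#if defined(USE_LIBFABRIC_CXI)
#if __has_include(<rdma/fabric.h>)
#include <rdma/fabric.h>
#define HAS_LIBFABRIC_CXI 1
#endif
#endif
#ifndef HAS_LIBFABRIC_CXI
#define HAS_LIBFABRIC_CXI 0
#endif

#include <stdexcept>

// True only in builds where the CXI headers were actually included.
inline constexpr bool kCxiCompiledIn = (HAS_LIBFABRIC_CXI == 1);

// Builds without libfabric reject the CXI path at run time
// instead of failing to compile.
inline void require_cxi_support() {
  if (!kCxiCompiledIn) {
    throw std::runtime_error("built without USE_LIBFABRIC_CXI");
  }
}
```

With this shape, code that does not touch libfabric types compiles on any machine, and only the CXI-specific translation units need the provider headers.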

@YangZhou1997

Member

/run-benchmark amd

github-actions Bot commented Jun 20, 2026


Benchmark amd passed

PR: #997
Commit: 224f02da646205036bd3782eceb52b72a1acd8f7
Result: success
Workflow run: https://github.com/uccl-project/uccl/actions/runs/27879248891

@YangZhou1997

Member

/run-benchmark gh200

github-actions Bot commented Jun 20, 2026


Benchmark gh200 passed

PR: #997
Commit: 224f02da646205036bd3782eceb52b72a1acd8f7
Result: success
Workflow run: https://github.com/uccl-project/uccl/actions/runs/27879881202

@YangZhou1997

Member

Both tests passed. Could you take a close look if possible? @MaoZiming

@MaoZiming

Member

@PanJason Thank you! Taking a look

Comment thread ep/include/uccl_ibgda.cuh
}
}
#endif
break;

Member

#998
Let's maybe continue the discussion there. Do you see any issues when you (1) remove the first break and (2) toggle the second break on/off? Were the "forwarder epoch/NVL receiver timeouts" only observed with the first break removed and the second break on, and did they disappear with both breaks removed?

Contributor Author

I will do some A/B testing to answer your question. One thing I wanna know:

I ask because I feel this is a generic fix rather than something specific to the CXI port.

Member

Thanks! Let's include that in #998!

Comment thread ep/src/proxy.cpp Outdated
throw std::runtime_error("CXI peer transport is not initialized");
}
transport->connect_peer(peer, remote_infos_[peer]);
if (proxy_trace_enabled()) {

Member

Let's clean the profiling code, e.g. proxy_trace_enabled, and the cmd_name, etc.

Contributor Author

Cleaned in fbbaf3d

Comment thread ep/src/proxy.cpp Outdated
cxi_transports_by_rank_[peer] = std::move(transport);
}

std::thread receiver_thread([this, num_ranks, my_rank]() {

Member

Let's try to reduce duplicated code: for example, std::thread receiver_thread, send_connection_info_as_client, etc. also appear on the non-CXI path.

Contributor Author

These two are now folded into exchange_peer_connection_info. The duplicated if clause for building connections is also folded into should_connect_peer. Cleaned in b0a4d68

Comment thread ep/src/proxy.cpp Outdated
uint64_t Proxy::completed_wr() const { return completion_count_; }

bool Proxy::use_cxi_transport() const {
char const* transport = std::getenv("UCCL_EP_TRANSPORT");

Member

We can cache whether the transport is cxi rather than getting from env every time

Contributor Author

Cached in use_cxi_transport_ var. See f46f191
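
A minimal sketch of the caching idea discussed here, assuming the environment variable is read once at construction. `ProxySketch` and `read_transport_is_cxi` are hypothetical names, and the assumption that the value "cxi" selects the CXI path is mine; only `UCCL_EP_TRANSPORT`, `use_cxi_transport()`, and `use_cxi_transport_` appear in this thread.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the proxy class.
class ProxySketch {
 public:
  // Read the environment once; later calls return the cached flag.
  ProxySketch() : use_cxi_transport_(read_transport_is_cxi()) {}

  bool use_cxi_transport() const { return use_cxi_transport_; }

 private:
  // Assumption: the value "cxi" selects the CXI path.
  static bool read_transport_is_cxi() {
    char const* transport = std::getenv("UCCL_EP_TRANSPORT");
    return transport != nullptr && std::strcmp(transport, "cxi") == 0;
  }

  bool const use_cxi_transport_;
};
```

Besides avoiding a getenv call on every query, caching makes the choice stable for the lifetime of the object even if the environment changes later.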

Comment thread ep/src/proxy.cpp
if (cfg_.rank == 0) {
uint64_t const expected_arrivals =
seq * static_cast<uint64_t>(std::max(0, cfg_.num_nodes - 1));
if (load_cxi_barrier_word_sum(kArrivalSlot) < expected_arrivals) {

Member

Nit: the barrier seq never wraps around.

Contributor Author

Removed the wrap-around for the CXI code in 7227ea4
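
The arrival check in the snippet under review can be sketched in isolation as follows. The assumption, not confirmed in this thread, is that every node other than rank 0 adds 1 to the arrival slot per barrier, so the running sum only ever grows. `expected_arrivals` and `barrier_complete` are hypothetical free functions written for this illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Arrivals rank 0 should have observed after `seq` barriers, assuming each
// of the other nodes contributes one increment per barrier.
inline uint64_t expected_arrivals(uint64_t seq, int num_nodes) {
  return seq * static_cast<uint64_t>(std::max(0, num_nodes - 1));
}

// The barrier is complete once the accumulated sum reaches the expected total.
inline bool barrier_complete(uint64_t arrival_sum, uint64_t seq, int num_nodes) {
  return arrival_sum >= expected_arrivals(seq, num_nodes);
}
```

Because the comparison is against a monotonically growing 64-bit total rather than a per-round value, no reset or wrap handling is needed between barriers.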

#include <stdexcept>
#include <vector>

class CxiTransport final : public EpTransport {

Member

cc @YangZhou1997 I think it would eventually be nice to package all the different kinds of transport (since we have more of them now) into a class. Right now only cxi instantiates EpTransport.

MaoZiming commented Jun 21, 2026

Member

I also notice that the nproc_per_node=4 (< 8) doesn't work on other transport paths, e.g. on EFA. It's currently hard-coded
https://github.com/uccl-project/uccl/blob/main/ep/include/common.hpp#L88
We also need to parameterize that for other transports if we want to relax nproc_per_node<8 in general. It might be good to keep that as a separate PR.

https://github.com/uccl-project/uccl/blob/main/ep/src/rdma.cpp#L1453-L1454

- keep the default MAX_NUM_GPUS value at 8

- allow setup.py and Makefile builds to override MAX_NUM_GPUS

- preserve NUM_MAX_NVL_PEERS as a separate EP topology knob
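
A compile-time default that a build can override, as these commit notes describe, is commonly written as below. This is a sketch, not the project's header: `kMaxNumGpus` and `node_index` are hypothetical, and the idea that the constant is used to derive a node index from a global rank is an assumption made only to show why a fixed value of 8 would misplace ranks on 4-GPU nodes.

```cpp
// Default stays at 8 unless the build passes, for example, -DMAX_NUM_GPUS=4.
#ifndef MAX_NUM_GPUS
#define MAX_NUM_GPUS 8
#endif

static_assert(MAX_NUM_GPUS > 0, "MAX_NUM_GPUS must be positive");

inline constexpr int kMaxNumGpus = MAX_NUM_GPUS;

// Hypothetical helper: node that owns a global rank when every node
// hosts kMaxNumGpus ranks.
inline int node_index(int global_rank) {
  return global_rank / kMaxNumGpus;
}
```

Under that assumption, with the default of 8 a rank numbered 4 maps to node 0, while a build with MAX_NUM_GPUS set to 4 would place it on node 1.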
@PanJason

Contributor Author

It might be good to keep that as a separate PR.

@MaoZiming A separate PR that makes MAX_NUM_GPUS configurable is created here:
#1005

@PanJason

Contributor Author

@MaoZiming Anything else needed for merging?

@MaoZiming MaoZiming left a comment

Member

Thanks @PanJason ! I just tested it. It LGTM

@PanJason

Contributor Author

@YangZhou1997 Could you please also take a look at your convenience?

@YangZhou1997 YangZhou1997 merged commit 5f78eb6 into uccl-project:main Jun 24, 2026
14 checks passed