Add CXI transport support for EP#997
Conversation
- add a libfabric CXI transport backend and wire it into the EP proxy runtime - add generic build switches for USE_LIBFABRIC_CXI and NUM_MAX_NVL_PEERS - relax internode benchmark assumptions so validation works on non-8-GPU nodes
Add CXI transport support for EP
|
@MaoZiming Could you take a look and let us know what other tests are needed for merging? |
- remove premature loop exits when posting quiet and barrier commands - ensure every proxy thread is drained before RDMA buffer reuse
[bugfix]: drain all proxy queues during sync
|
/run-benchmark amd |
|
Hi @PanJason , thank you for the contribution! The compilation on non-CXI machines would trigger some errors, see https://github.com/uccl-project/uccl/actions/runs/27840541212/job/82398387232 |
fix(ep): guard CXI libfabric include
|
Add the fix to conditionally include. Check again? |
|
/run-benchmark amd |
Benchmark
|
|
/run-benchmark gh200 |
Benchmark
|
|
Both tests passed, can you give a close look if possible? @MaoZiming |
|
@PanJason Thank you! Taking a look |
| } | ||
| } | ||
| #endif | ||
| break; |
There was a problem hiding this comment.
#998
Let's maybe continue discussion there. do you see any issues when you (1) remove the first break (2) toggle the second break on/off. was the " forwarder epoch/NVL receiver timeouts" only observed with first break removed + second break on, and disappear with first break removed + second break removed?
There was a problem hiding this comment.
I will do some A/B testing to answer your question. One thing I wanna know:
- Shall I respond you here or in [bugfix]: quiet all CXI proxy queues before reuse #998?
- Also the following commits should be included in [bugfix]: quiet all CXI proxy queues before reuse #998 or here?
Reason why asking is because I feel this is a generic fix, rather than sth specific to CXI port.
| throw std::runtime_error("CXI peer transport is not initialized"); | ||
| } | ||
| transport->connect_peer(peer, remote_infos_[peer]); | ||
| if (proxy_trace_enabled()) { |
There was a problem hiding this comment.
Let's clean the profiling code, e.g. proxy_trace_enabled, and the cmd_name, etc.
| cxi_transports_by_rank_[peer] = std::move(transport); | ||
| } | ||
|
|
||
| std::thread receiver_thread([this, num_ranks, my_rank]() { |
There was a problem hiding this comment.
Let's try to reduce duplicated code, e.g. std::thread receiver_thread send_connection_info_as_client, etc. also appear on the non-cxi path
There was a problem hiding this comment.
These two are folded into exchange_peer_connection_info now. Also duplicated if clause for connection build is folded to should_connect_peer. Cleaned in b0a4d68
| uint64_t Proxy::completed_wr() const { return completion_count_; } | ||
|
|
||
| bool Proxy::use_cxi_transport() const { | ||
| char const* transport = std::getenv("UCCL_EP_TRANSPORT"); |
There was a problem hiding this comment.
We can cache whether the transport is cxi rather than getting from env every time
There was a problem hiding this comment.
Cached in use_cxi_transport_ var. See f46f191
| if (cfg_.rank == 0) { | ||
| uint64_t const expected_arrivals = | ||
| seq * static_cast<uint64_t>(std::max(0, cfg_.num_nodes - 1)); | ||
| if (load_cxi_barrier_word_sum(kArrivalSlot) < expected_arrivals) { |
There was a problem hiding this comment.
Nit: the barrier seq never wraps around.
| #include <stdexcept> | ||
| #include <vector> | ||
|
|
||
| class CxiTransport final : public EpTransport { |
There was a problem hiding this comment.
cc @YangZhou1997 I think eventually it would be nice to package all different kinds of transport (since we have more of them now), into a class, right now only cxi instantiates EpTransport.
|
I also notice that the nproc_per_node=4 (< 8) doesn't work on other transport paths, e.g. on EFA. It's currently hard-coded https://github.com/uccl-project/uccl/blob/main/ep/src/rdma.cpp#L1453-L1454 |
- keep the default MAX_NUM_GPUS value at 8 - allow setup.py and Makefile builds to override MAX_NUM_GPUS - preserve NUM_MAX_NVL_PEERS as a separate EP topology knob
@MaoZiming A separate PR that makes MAX_NUM_GPUS configurable is created here: |
|
@MaoZiming Anything else needed for merging? |
|
@YangZhou1997 Could you please also take a look at your convenience? |
Description
This PR adds a libfabric/CXI transport path for UCCL EP so the EP proxy can run over HPE Slingshot/Cassini through the
cxiprovider.Implementation details:
EpTransportabstraction and aCxiTransportimplementation, enabled withUSE_LIBFABRIC_CXI=1.FI_EP_RDMendpoints on the CXI domain matching the local GPU rank (cxi0,cxi1, ...), requesting RMA, atomic, fence, HMEM, local, and remote communication capabilities.cxi_transports_by_rank_is keyed by peer rank, each peer transport owns its local CXI endpoint/MRs, inserts the matching remote endpoint name into the libfabric AV, andpost_cxi_commands()routes each WRITE/ATOMIC to the destination peer's transport. This is becauseshared endpointdecreased the performance significantly.fi_write: the proxy decodes the local and remote EP buffer offsets and posts a CXI RMA write from the registered local main buffer to the peer main-buffer MR/key.FI_SUMfi_atomicmsgoperations into the peer atomic buffer; WRITE commands carrying an atomic signal post the payloadfi_writefirst and then a fencedFI_SUMatomic notification so the completion associated with the WR id corresponds to the notification step.USE_LIBFABRIC_CXI,LIBFABRIC_HOME,NUM_MAX_NVL_PEERS) and relaxes the internode benchmark local-rank assumption for 4-GPU GH200 nodes.Related issue: #956
Type of Change
How Has This Been Tested?
Formatting:
Build/validation:
epinside the target container withUSE_LIBFABRIC_CXI=1viapython3 setup.py install.25446440:025446450:025446460:0No error was seen in the log.
Note: the representative performance validation above used the same CXI implementation plus a local benchmark/process-affinity patch to pin torchrun forked processes by local rank. On GH200 this pinning is needed for representative numbers; without it, the clean-branch validation still completed but measured noticeably slower.
Checklist
format.shto follow the style guidelines.build.shto verify compilation. The fullbuild.shcontainer wheel build was not run; instead, the target Slurm/container validation built and installedepwithUSE_LIBFABRIC_CXI=1.