[Proposal]: libfabric-CXI backend for UCCL-EP on HPE Slingshot
Hi UCCL team,
We're from the Apertus / Swiss AI training effort, working on large-scale open MoE training on Alps (which uses HPE's Slingshot-11 Cassini NIC). For Apertus 2 we're evaluating replacements for the default NCCL all-gather-style MoE dispatcher. UCCL-EP looks like a strong fit: Slingshot does not provide IBGDA, so all-to-all communication has to go through a CPU-proxy design.
On Alps the network interface uses libfabric's cxi provider, not typical ibverbs. Before we start prototyping, we'd like to ask:
- Does UCCL-EP currently have a libfabric transport path? We saw EFA, Broadcom, and CX7 listed as supported. Is EFA implemented via libfabric (efa provider), and if so, is the transport layer abstracted enough that adding a cxi provider would be mostly provider substitution?
- Is any CXI / Slingshot work already underway or planned? If so, we'd be glad to test on our platform for debugging and help with performance benchmarking.
- If not, would a libfabric/CXI backend be welcomed? Our intent is to keep the UCCL-EP API unchanged and map proxy operations to libfabric RMA, starting with low-latency mode. As far as we know, many EU data centers run Slingshot because they were originally built for HPC rather than LLM training, so demonstrated performance gains would benefit many users.
What we'd contribute
Assuming the answer to the third question above is yes, we'd plan to upstream:
- A libfabric/CXI transport backend (low-latency mode first, then normal/high-throughput).
- Validation on real Apertus MoE training on Alps (Slingshot-11).
- Benchmarks against the all-gather dispatcher baseline.
What we plan to verify first
The main open questions on the CXI side are CUDA HMEM registration, MR-mode handling (FI_MR_ENDPOINT / FI_MR_PROV_KEY), the write-data / immediate-data path (FI_CXI_ENABLE_WRITEDATA, cq_data_size, FI_REMOTE_CQ_DATA), and ordering between payload writes and control notifications on CXI. We plan to work through these in standalone libfabric tests before touching UCCL: first fi_info -p cxi -v inside the actual training container, then host-memory RMA, CUDA-memory RMA, a write-data test, and a payload+control ordering stress test. Happy to share the full plan if useful.
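To make the last item concrete, here is a minimal, transport-free sketch of the check we have in mind for the payload+control ordering stress test. It is only a model: nothing here calls libfabric, and the names (make_payload, run_trial, PAYLOAD_BYTES) are hypothetical. In the real test the "payload" event would be an RMA write and the "notify" event a write carrying remote CQ data. The idea is that every payload byte depends on the sequence number, so a notification that overtakes its payload is caught by comparing the receive slot against the expected pattern.

```python
import random

PAYLOAD_BYTES = 64  # hypothetical payload size per message


def make_payload(seq: int, size: int = PAYLOAD_BYTES) -> bytes:
    """Payload whose every byte depends on seq, so stale data is detectable."""
    return bytes((seq + i) & 0xFF for i in range(size))


def run_trial(num_msgs: int, ordered: bool, rng: random.Random) -> int:
    """Return how many notifications arrived before their payload.

    Each message produces two events for the same receive slot:
    ("payload", seq) and ("notify", seq). With ordered=True the fabric is
    modeled as delivering them in issue order; otherwise each pair may be
    swapped with probability 0.5 (an assumption of this model only).
    """
    slot = bytes(PAYLOAD_BYTES)  # receive buffer, reused for every message
    violations = 0
    for seq in range(1, num_msgs + 1):
        events = [("payload", seq), ("notify", seq)]
        if not ordered and rng.random() < 0.5:
            events.reverse()
        for kind, s in events:
            if kind == "payload":
                slot = make_payload(s)
            elif slot != make_payload(s):
                violations += 1  # notification overtook its payload
    return violations


if __name__ == "__main__":
    rng = random.Random(0)
    print("ordered violations:  ", run_trial(1000, True, rng))
    print("unordered violations:", run_trial(1000, False, rng))
```

On real hardware the interesting result is whether the violation count stays at zero under load, with and without whatever ordering flags or fences the provider requires.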
From our reading of the paper, UCCL-EP uses immediate-data control messages to emulate ordering semantics for the EP modes. Please correct us if this is wrong. If it is right, write-with-immediate performance on CXI matters a lot, and we need to measure it. If it turns out to be low, we will need to consider design changes.
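One thing we want to pin down early is whether the control information fits in the immediate word at all, since the usable width is whatever the provider reports as cq_data_size. A rough sketch of the kind of packing we would test follows. The field layout (kind, slot, count) and the 4-byte budget are our assumptions for illustration, not UCCL-EP's actual wire format.

```python
def pack_imm(kind: int, slot: int, count: int, cq_data_size: int = 4) -> int:
    """Pack hypothetical control fields into one immediate-data word.

    Assumed layout, for illustration only:
    4 bits kind | 12 bits slot | 16 bits count (32 bits total).
    cq_data_size is the provider-reported remote CQ data width in bytes.
    """
    if cq_data_size < 4:
        raise ValueError("layout needs at least 4 bytes of remote CQ data")
    if not (0 <= kind < 1 << 4 and 0 <= slot < 1 << 12 and 0 <= count < 1 << 16):
        raise ValueError("field out of range")
    return (kind << 28) | (slot << 16) | count


def unpack_imm(imm: int) -> tuple[int, int, int]:
    """Inverse of pack_imm: return (kind, slot, count)."""
    return (imm >> 28) & 0xF, (imm >> 16) & 0xFFF, imm & 0xFFFF


if __name__ == "__main__":
    imm = pack_imm(kind=2, slot=37, count=512)
    print(hex(imm), unpack_imm(imm))
```

If CXI reports a larger cq_data_size than 4 bytes, the same approach leaves headroom for wider fields; if it reports less, the control path would need a different design.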
Why we think this can be useful for both sides
- For UCCL: real Slingshot/CXI deployment and validation on an open large-scale MoE workload.
- For Apertus: a reusable open-source EP layer instead of an internal one-off.
- For the broader community: other EU data centers that want to train or serve large MoE models on Slingshot NICs.
Finally, thanks for building UCCL-EP, which benefits the many people who do not use InfiniBand NICs.