[Proposal] libfabric-CXI backend for UCCL-EP on HPE Slingshot

# [Proposal]: libfabric-CXI backend for UCCL-EP on HPE Slingshot

Hi UCCL team,

We're from the [Apertus](https://www.swiss-ai.org/) / Swiss AI training effort, working on large-scale open MoE training on [Alps](https://www.cscs.ch/computers/alps), which uses HPE's Slingshot-11 Cassini NIC. For Apertus 2 we're evaluating replacements for the default NCCL all-gather-style MoE dispatcher. UCCL-EP looks like a strong fit: Slingshot does not provide IBGDA, so all-to-all communication has to go through some CPU-proxy design.

On Alps the network interface is accessed through libfabric's `cxi` provider rather than the usual ibverbs. Before we start prototyping, we'd like to ask:

1. **Does UCCL-EP currently have a libfabric transport path?** I saw that EFA, Broadcom, and CX7 are mentioned as supported. Is EFA implemented via libfabric (`efa` provider), and if so, is the transport layer abstracted enough that adding a `cxi` provider would be mostly provider substitution?
2. **Is any CXI / Slingshot work already underway or planned?** If so, we'd be glad to test and debug on our platform and to help with performance benchmarking.
3. **If not, would a libfabric/CXI backend be welcomed?** Our intent is to keep the UCCL-EP API unchanged and map proxy operations to libfabric RMA, starting with low-latency mode. As far as we know, many EU data centers run Slingshot because they were originally built for HPC rather than LLM training, so a backend that demonstrates performance gains would benefit many users.

## What we'd contribute

Assuming the answer to (3) is yes, we'd plan to upstream:

- A libfabric/CXI transport backend (low-latency mode first, then normal/high-throughput).
- Validation on real Apertus MoE training on Alps Slingshot.
- Benchmarks against the all-gather dispatcher baseline.

## What we plan to verify first

The main open questions on the CXI side are CUDA HMEM registration, MR-mode handling (`FI_MR_ENDPOINT` / `FI_MR_PROV_KEY`), the write-data / immediate-data path (`FI_CXI_ENABLE_WRITEDATA`, `cq_data_size`, `FI_REMOTE_CQ_DATA`), and ordering between payload writes and control notifications on CXI. We plan to work through these in standalone libfabric tests before touching UCCL: first `fi_info -p cxi -v` inside the actual training container, then host-memory RMA, CUDA-memory RMA, a write-data test, and a payload-plus-control ordering stress test. Happy to share the full plan if useful.
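As a rough illustration of the first step, the check on `fi_info -p cxi -v` could be scripted along these lines. This is only a sketch: the helper names (`summarize_fi_info`, `run_fi_info`) are hypothetical, and the parser assumes the verbose output prints lists as `key: [ A, B ]` and sizes as `cq_data_size: N` on single lines, which should be confirmed against the libfabric build in the container.

```python
import re
import shutil
import subprocess

# Tokens taken from the checklist above. Whether a given CXI build reports
# them is exactly what the check is meant to find out; nothing is assumed here.
WANTED_CAPS = ("FI_RMA", "FI_HMEM")
WANTED_MR_MODES = ("FI_MR_ENDPOINT", "FI_MR_PROV_KEY")


def _bracket_list(text, key):
    """Union of tokens from every 'key: [ A, B ]' line (assumed output layout)."""
    tokens = set()
    pattern = rf"^\s*{re.escape(key)}:\s*\[(.*?)\]"
    for body in re.findall(pattern, text, flags=re.M):
        tokens.update(t.strip() for t in body.split(",") if t.strip())
    return tokens


def summarize_fi_info(text):
    """Report which wanted caps / MR modes appear anywhere in the output."""
    caps = _bracket_list(text, "caps")
    mr_modes = _bracket_list(text, "mr_mode")
    sizes = [int(s) for s in re.findall(r"^\s*cq_data_size:\s*(\d+)", text, flags=re.M)]
    return {
        "caps": {c: c in caps for c in WANTED_CAPS},
        "mr_mode": {m: m in mr_modes for m in WANTED_MR_MODES},
        # Smallest value across all reported entries, or None if absent.
        "min_cq_data_size": min(sizes) if sizes else None,
    }


def run_fi_info(provider="cxi"):
    """Return the verbose fi_info output, or None if fi_info is not on PATH."""
    if shutil.which("fi_info") is None:
        return None
    result = subprocess.run(
        ["fi_info", "-p", provider, "-v"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


if __name__ == "__main__":
    out = run_fi_info()
    print("fi_info not found" if out is None else summarize_fi_info(out))
```

The summary merges tokens across every entry that `fi_info` prints, so it only answers "does any endpoint advertise this?"; per-endpoint filtering would be the next refinement.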

From my understanding of the paper, UCCL-EP uses immediate-data control messages to emulate the ordering semantics that the EP modes need; please correct me if that is wrong. If so, write-with-immediate performance on CXI matters a lot, and we need to measure it. If it turns out to be low, we will need to consider design changes.
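To make the ordering concern concrete, here is a toy model of what the ordering stress test would exercise. It is not taken from the UCCL-EP source: the 32-bit immediate layout, the `Receiver` class, and the batch/count scheme are all hypothetical, chosen only to show a receiver-side proxy holding back a control notification until the payload writes it covers have completed.

```python
from collections import defaultdict

KIND_PAYLOAD, KIND_CONTROL = 0, 1


def pack_imm(kind, batch, count):
    """Pack 1-bit kind | 15-bit batch id | 16-bit count (hypothetical layout)."""
    assert kind in (KIND_PAYLOAD, KIND_CONTROL)
    assert 0 <= batch < (1 << 15) and 0 <= count < (1 << 16)
    return (kind << 31) | (batch << 16) | count


def unpack_imm(imm):
    """Inverse of pack_imm: returns (kind, batch, count)."""
    return imm >> 31, (imm >> 16) & 0x7FFF, imm & 0xFFFF


class Receiver:
    """Releases a batch only after its control message AND all payloads arrived."""

    def __init__(self):
        self.payloads_seen = defaultdict(int)  # batch -> payload completions so far
        self.pending = {}                      # batch -> payload count the control expects
        self.released = []                     # batches made visible, in release order

    def on_completion(self, imm):
        kind, batch, count = unpack_imm(imm)
        if kind == KIND_PAYLOAD:
            self.payloads_seen[batch] += 1
        else:
            self.pending[batch] = count
        expected = self.pending.get(batch)
        if expected is not None and self.payloads_seen[batch] >= expected:
            # Control has arrived and every payload it covers is in place.
            del self.pending[batch]
            del self.payloads_seen[batch]
            self.released.append(batch)


if __name__ == "__main__":
    rx = Receiver()
    # The control notification overtakes its two payload writes.
    for imm in (pack_imm(KIND_CONTROL, 7, 2),
                pack_imm(KIND_PAYLOAD, 7, 0),
                pack_imm(KIND_PAYLOAD, 7, 0)):
        rx.on_completion(imm)
    print(rx.released)
```

In this model every payload write also carries immediate data, which is why the write-with-immediate rate on CXI would bound the whole scheme. A real design would additionally have to handle batch-id wraparound and the `cq_data_size` actually offered by the provider.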

## Why we think this can be useful for both sides

- For UCCL: real Slingshot/CXI deployment and validation on an open large-scale MoE workload.
- For Apertus: a reusable open-source EP layer instead of an internal one-off.
- For the broader community: other EU data centers that want to train or serve large MoE models on Slingshot NICs.

Finally, thank you for building UCCL-EP; it benefits the many people who do not use InfiniBand NICs.
