Skip to content

ep: CXI/Slingshot (libfabric) transport backend behind USE_CXI#1

Closed
fergusfinn wants to merge 20 commits into
mainfrom
cxi-ep
Closed

ep: CXI/Slingshot (libfabric) transport backend behind USE_CXI#1
fergusfinn wants to merge 20 commits into
mainfrom
cxi-ep

Conversation

@fergusfinn

@fergusfinn fergusfinn commented Jun 10, 2026

Copy link
Copy Markdown

Summary

Adds a CXI / HPE Slingshot (libfabric) transport backend for the UCCL EP
(expert-parallelism) dispatch/combine path, as a minimal patch gated behind
USE_CXI. When USE_CXI is not defined the existing verbs/EFA path is
completely untouched.

The public dispatch / combine / Buffer / Config interfaces are
unchanged, so this stays DeepEP-compatible and drop-in.

What's here

  • New self-contained module ep/{include,src}/cxi_transport.*
    uccl::cxi::Transport, an RAII libfabric (cxi provider) endpoint:
    CUDA + host MR registration (dmabuf-friendly), peer address exchange/insert,
    RMA write, inject_atomic_add64, and CQ completion polling. Whole file is
    #ifdef USE_CXI.
  • Wire-in through rdma.cpp / proxy.cpp behind #ifdef USE_CXI, with
    #else preserving the verbs path. One NIC per rank (local_rank % 4,
    rail-aligned). Deliberately avoids write-with-immediate / remote-CQ-data,
    which the CXI provider does not support here; control notifications map to
    libfabric atomic adds instead.
  • Low-latency internode path fixed to work over the CXI transport
    (internode_ll.cu).
  • GH200 4-GPU-per-node support: internode.cu / ep_configs.cuh drop the
    hard-coded 8-local-rank assumption, including an EP8 combine-config fix for
    4-GPU nodes; test_internode.py allows 4 local ranks.
  • Initialization robustness: init_dist sets the CUDA device correctly
    under srun/multi-GPU launchers; Buffer manages scratch allocation and
    initializes UCCL internally; proxy init takes a data_ptr instead of a raw
    torch tensor.
  • Two bug fixes found during serving (relevant beyond CXI):
    • Buffer::get_dispatch_layout()'s zero-token early return cleared
      num_ranks ints in a [num_rdma_ranks]-sized tensor — a heap overrun
      whenever a rank gets an empty DP-attention microbatch (e.g. EP16 on four
      4-GPU nodes wrote 16 ints into a 4-int allocation). Extent is now
      num_rdma_ranks * sizeof(int), with a distributed regression guard in
      ep/bench/test_zero_layout_guard.py that fails on the unfixed code.
    • Host-pinned EP atomic buffers (allocated with cudaHostAlloc under
      USE_CXI) were initialized with cudaMemset, which can fail with CUDA
      invalid argument; they now use std::memset, with cudaMemset kept for
      GPU/managed allocations.

Status / validation

  • Builds and runs on Isambard (GH200 + Slingshot-11/CXI), 2 nodes × 4 GPUs,
    in the project Podman container with USE_CXI=1.
  • Dispatch/combine correctness checks in test_internode.py pass.
  • ep/bench/test_zero_layout_guard.py passes on 2 nodes / 8 GPUs
    ([zero-layout-guard] pass world=8 num_nodes=2); without the fix it fails
    with the intended guard-clobber assertion.
  • Serving validated end-to-end: vLLM with --all2all-backend deepep_high_throughput on this stack sustains multi-thousand-token/s
    serving across 2 nodes.

Notes for reviewers

  • Error paths use fprintf(stderr, ...); could be routed through a logging
    macro.
  • Produced with heavy coding-agent assistance

Minimal, USE_CXI-gated CXI transport for the EP dispatch/combine path; the
existing verbs/EFA path is untouched when USE_CXI is not defined.

- New self-contained uccl::cxi::Transport module (ep/{include,src}/cxi_transport.*):
  RAII libfabric (cxi provider) endpoint, CUDA + host MR registration, peer
  insertion, RMA write, inject atomic-add64, and CQ completion polling.
- Wire-in via rdma.cpp/proxy.cpp behind #ifdef USE_CXI; one NIC per rank
  (local_rank % 4, rail-aligned), no remote-CQ-data / write-with-imm dependency.
- internode.cu / ep_configs.cuh: support GH200 4-GPU nodes (drop hard-coded
  8-local-rank assumption); test_internode.py allows 4 local ranks.

Public dispatch/combine/Buffer/Config interfaces unchanged (DeepEP-compatible).
Copilot AI review requested due to automatic review settings June 10, 2026 14:20
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

(cherry picked from commit 75a2617)
Carry only the runtime fix from 4c5b63f. The focused regression test remains split out with the delayed test changes.
Keep the zero-token runtime fix in the base branch and carry only the focused guard coverage here.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional (USE_CXI) CXI/Slingshot (libfabric) transport backend for the EP internode dispatch/combine path while preserving the existing verbs/EFA implementation when CXI is disabled. It also improves GH200 4-GPU-per-node support, tightens initialization/robustness, and includes a couple of correctness fixes found during serving.

Changes:

  • Introduce uccl::cxi::Transport (libfabric/cxi provider) and wire it into RDMA/proxy paths behind #ifdef USE_CXI (RMA writes + atomic-add control signaling).
  • Update internode kernels and configs to support NUM_MAX_NVL_PEERS <= 8 and 4-GPU nodes (including LL internode fixes).
  • Improve runtime/bench robustness (CUDA device selection, scratch pointer handling), add build toggles in setup.py, and expand test/bench CLI options.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
ep/src/uccl_proxy.cpp CXI/host atomic buffer allocation + stop/destroy behavior adjustments.
ep/src/uccl_ep.cc Fix RDMA-rank memset overrun; minor includes and helper wiring for atomic-buffer connection.
ep/src/rdma.cpp Add CXI transport initialization, connection info exchange, batched CXI writes, and CXI atomics.
ep/src/proxy.cpp CXI-aware peer selection, pinning, barrier signaling via CXI atomics, and 4-GPU stride fixes.
ep/src/internode.cu Relax NVL peer assumptions and avoid type-punning loads by explicitly packing NVL presence.
ep/src/internode_ll.cu Make LL receive path robust when CXI forces internode even for same-node peers.
ep/src/cxi_transport.cpp New libfabric-based CXI transport implementation (endpoint + MR + RMA + CQ polling).
ep/setup.py Add USE_CXI build plumbing and libfabric link/include configuration.
ep/include/rdma.hpp Extend RDMAConnectionInfo with CXI endpoint/MR metadata (gated by USE_CXI).
ep/include/proxy_ctx.hpp Add CXI transport fields to ProxyCtx under USE_CXI.
ep/include/ep_configs.cuh Allow overriding NUM_MAX_NVL_PEERS at compile time.
ep/include/cxi_transport.hpp New CXI transport public header (gated by USE_CXI).
ep/bench/utils.py Robust device initialization + accept scratch as tensor or raw pointer.
ep/bench/test_internode.py Support 4 local ranks, add smoke/fixed-chunk options, set CXI phase env for stats.
ep/bench/test_internode_simple.py Use Buffer directly (less proxy plumbing), safer destroy flow.
ep/bench/buffer.py Adjust EP8 combine config for 4-GPU nodes.
AGENTS.md Add Isambard/GH200+CXI build and multinode test guidance.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ep/src/uccl_proxy.cpp
Comment thread ep/src/proxy.cpp
doubleword-code[bot]

This comment was marked as low quality.

doubleword-code[bot]

This comment was marked as abuse.

doubleword-code[bot]

This comment was marked as resolved.

fergus barratt added 8 commits June 11, 2026 06:59
Replace the blocking post/wait_all/atomics sequence with an async design:
- writes post without waiting; ring slots retire on CQ completion via a
  per-loop retire sweep (verbs-style), with an outstanding cap as
  backpressure
- tail/control atomics inject immediately behind their writes, relying on
  same-TX-context submission ordering (measured on Slingshot-11; see
  uccl-project#956 discussion)
- -FI_EAGAIN on any post is handled by polling + retry instead of
  terminating
- UCCL_CXI_SYNC_WRITES=1 restores the previous conservative behavior
- quiet drains all outstanding writes; shutdown drains bounded

Also: env-gated UCCL_CXI_DELIVERY_COMPLETE=1 knob on TX (audit A/B).
Immediate same-TX injection reordered under FI_CXI_RDZV_THRESHOLD=0: every
write goes rendezvous (target pulls data), so an eager atomic lands before
the payload it announces -> corrupted dispatch output at >=1024 tokens.
Queue atomics per peer with threshold = writes posted before them; flush in
cxi_retire_ctx as completions arrive. Standalone atomics (no writes in
flight) still inject immediately. quiet/barrier/shutdown drain the queue.
ep/cxi: async write path with completion-gated control atomics
doubleword-code[bot]

This comment was marked as low quality.

- The CXI barrier compared a 21-bit-masked barrier_seq against per-node
  slot counters that accumulate forever (NIC atomic adds, never reset);
  after 2^21 barriers in one process the seq wraps and every release
  check passes spuriously. Keep seq monotonic (unmasked) under USE_CXI;
  the mask only exists for the verbs immediate-data encoding.
- atomic_buffer_ptr_ was uninitialized for thread_idx != 0 proxies, so
  the null-check guard in get_atomic_buffer_ptr() read an indeterminate
  pointer. Initialize to nullptr.

Both flagged by Copilot review on the PR.
@doublewordai doublewordai deleted a comment from doubleword-code Bot Jun 11, 2026
@doublewordai doublewordai deleted a comment from doubleword-code Bot Jun 11, 2026
@doublewordai doublewordai deleted a comment from doubleword-code Bot Jun 11, 2026
@doublewordai doublewordai deleted a comment from doubleword-code Bot Jun 11, 2026
@doublewordai doublewordai deleted a comment from doubleword-code Bot Jun 11, 2026
@fergusfinn

Copy link
Copy Markdown
Author

Closing this as superseded by uccl-project#997.

uccl-project#997 is now the cleaner upstream CXI/Slingshot EP path and is already open against uccl-project/main from swiss-ai/main. This PR was useful as validation/history, but it mixes the older USE_CXI transport wiring with local serving/debug fixes. Any remaining useful deltas should move forward as small PRs on top of uccl-project#997 rather than keeping this branch alive as a competing upstream path.

@fergusfinn fergusfinn closed this Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants