p2p: add CXI transport endpoint#6

Closed
fergusfinn wants to merge 5 commits into main from cxi-p2p-engine

@fergusfinn fergusfinn commented Jun 14, 2026

This PR adds a CXI/libfabric transport for the UCCL P2P endpoint used by NIXL. It is separate from the UCCL-EP CXI work in uccl-project#956 and uccl-project#997, but it targets the same Slingshot/Cassini cxi provider.

Benchmarking/testing

export UCCL_P2P_TRANSPORT=cxi
export UCCL_P2P_DISABLE_IPC=1
export UCCL_LIBFABRIC_SO=/opt/libfabric/lib/libfabric.so.1

# Run once per num_iovs in 1,2,4,8,16,32,64,128.
torchrun --nnodes=2 --nproc_per_node=1 --node-rank=<0|1> \
  --master_addr=127.0.0.1 --master_port=<port> \
  p2p/benchmarks/benchmark_uccl.py \
  --local-gpu-idx=<0|1> --device=gpu --sizes=1179648 --iters=16 \
  --num-iovs=<num_iovs> --mode=write --float-type=none

num_iovs  total_bytes  GB/s
1         1179648      19.60
2         2359296      21.58
4         4718592      22.56
8         9437184      23.34
16        18874368     23.81
32        37748736     24.00
64        75497472     24.10
128       150994944    24.17
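
The table can be sanity-checked with a few lines of Python. This is only a sketch over the numbers reported above: it assumes `total_bytes` is `num_iovs` times the `--sizes` value from the command, and it expresses each row as a percentage of the best observed rate. The names `ROWS`, `IOV_BYTES`, and `check_and_summarize` are illustrative and not part of the benchmark.

```python
# Rows copied from the table above: (num_iovs, total_bytes, GB/s).
ROWS = [
    (1, 1179648, 19.60),
    (2, 2359296, 21.58),
    (4, 4718592, 22.56),
    (8, 9437184, 23.34),
    (16, 18874368, 23.81),
    (32, 37748736, 24.00),
    (64, 75497472, 24.10),
    (128, 150994944, 24.17),
]

IOV_BYTES = 1179648  # the --sizes value used in the torchrun command


def check_and_summarize(rows, iov_bytes=IOV_BYTES):
    """Verify total_bytes == num_iovs * iov_bytes and report percent of peak."""
    peak = max(gbps for _, _, gbps in rows)
    summary = []
    for num_iovs, total_bytes, gbps in rows:
        # Assumption: each iov carries exactly iov_bytes bytes.
        assert total_bytes == num_iovs * iov_bytes, (num_iovs, total_bytes)
        summary.append((num_iovs, round(100.0 * gbps / peak, 1)))
    return summary


if __name__ == "__main__":
    for num_iovs, pct in check_and_summarize(ROWS):
        print(f"num_iovs={num_iovs:>3}  {pct:5.1f}% of peak")
```

Read this way, a single iov already reaches roughly four fifths of the best rate, and the curve flattens from 32 iovs onward.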

Then, KVBench in NIXL. kvbench doesn't currently have UCCL wired in, but it has a plan mode, and nixlbench does have UCCL wired in, so the kvbench plan below is followed by the nixlbench run with --backend UCCL:

python3 main.py plan \
  --model ./examples/model_deepseek_r1.yaml \
  --model_config ./examples/block-tp1-pp8.yaml \
  --backend UCX \
  --source gpu \
  --destination gpu \
  --runtime_type ASIO \
  --num_iter 16 \
  --warmup_iter 16 \
  --num_threads 1 \
  --page_size 256 \
  --format json

nixlbench \
  --runtime_type ASIO \
  --backend UCCL \
  --worker_type nixl \
  --initiator_seg_type VRAM \
  --target_seg_type VRAM \
  --scheme pairwise \
  --mode SG \
  --op_type WRITE \
  --total_buffer_size 351535104 \
  --num_initiator_dev 1 \
  --num_target_dev 1 \
  --start_block_size 1179648 \
  --max_block_size 1179648 \
  --start_batch_size 298 \
  --max_batch_size 298 \
  --num_iter 16 \
  --warmup_iter 16 \
  --large_blk_iter_ftr 16 \
  --num_threads 1

access  page_size  block_size  batch_size  total_bytes  GB/s       avg_lat_us  avg_prep_us  avg_post_us  avg_tx_us
block   256        1179648     298         351535104    23.141011  51.0        12.0         9.0          15165.0
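
The nixlbench row is internally consistent, which the following sketch checks. It rests on assumptions that are my reading of the columns rather than documented nixlbench semantics: that GB means 10^9 bytes, that `avg_tx_us` covers one whole batch, and that `avg_lat_us` is roughly the per-block share of that time. All names in the block are illustrative.

```python
# Values copied from the nixlbench command and result row above.
BLOCK_SIZE = 1179648
BATCH_SIZE = 298
TOTAL_BYTES = 351535104
AVG_TX_US = 15165.0
AVG_LAT_US = 51.0
REPORTED_GBPS = 23.141011


def implied_gbps(total_bytes, tx_us):
    """Bandwidth implied by moving total_bytes in tx_us microseconds.

    Assumes GB = 1e9 bytes: bytes/us equals MB/s * 1e0, so divide by 1e3.
    """
    return total_bytes / tx_us / 1e3


def per_block_us(tx_us, batch_size):
    """Average transfer time per block if tx_us covers the whole batch."""
    return tx_us / batch_size


if __name__ == "__main__":
    # The buffer size passed on the command line is one full batch of blocks.
    assert BLOCK_SIZE * BATCH_SIZE == TOTAL_BYTES
    print(f"implied bandwidth: {implied_gbps(TOTAL_BYTES, AVG_TX_US):.2f} GB/s")
    print(f"per-block time:    {per_block_us(AVG_TX_US, BATCH_SIZE):.1f} us")
```

Under those assumptions the implied rate is about 23.18 GB/s, within a fraction of a percent of the reported figure, and the per-block time lands close to the reported `avg_lat_us`.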

I also wired it into vLLM: disaggregated prefill completes for Qwen3-0.6b, with direct NIXL telemetry showing the expected KV transfer volume.

@fergusfinn fergusfinn changed the base branch from cxi-ep to main June 15, 2026 09:07
@fergusfinn fergusfinn force-pushed the cxi-p2p-engine branch 4 times, most recently from af8eff3 to 2014486 Compare June 15, 2026 17:54
@fergusfinn fergusfinn marked this pull request as ready for review June 16, 2026 10:14
Copilot AI review requested due to automatic review settings June 16, 2026 10:14

Copilot AI left a comment

Pull request overview

This PR adds an optional HPE Slingshot/CXI (libfabric) transport backend to the UCCL P2P engine, selectable at runtime via UCCL_P2P_TRANSPORT=cxi, with build-time enablement via USE_CXI=1.

Changes:

  • Introduces CxiEndpoint (libfabric-based) and wires it into the engine’s runtime endpoint variant and notification path.
  • Adds a libfabric dlopen/dlsym wrapper scaffold and Makefile build toggles for CXI.
  • Adds CXI-related runtime configuration (README) and adjusts one-sided in-flight limits / transient-post handling.
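
The dlopen/dlsym idea from the change list can be illustrated with Python's `ctypes`, which wraps the same mechanism. This is a hedged analogy, not the PR's C++ wrapper, whose code is not shown here. It assumes `UCCL_LIBFABRIC_SO` names the shared object to load (as in the export earlier), and the fallback to `libfabric.so.1` as well as the helper names are my own. `fi_version` is a real exported libfabric function that packs major and minor version into a 32-bit value.

```python
import ctypes
import os


def load_libfabric(env_var="UCCL_LIBFABRIC_SO", default="libfabric.so.1"):
    """Return a ctypes handle to libfabric, or None if it cannot be loaded.

    The default library name is an assumption made for this sketch.
    """
    path = os.environ.get(env_var) or default
    try:
        return ctypes.CDLL(path)  # dlopen() under the hood on Linux
    except OSError:
        return None


def libfabric_version(lib):
    """Return (major, minor) from fi_version(), or None if unavailable."""
    if lib is None:
        return None
    try:
        fi_version = lib.fi_version  # dlsym() lookup; raises if missing
    except AttributeError:
        return None
    fi_version.restype = ctypes.c_uint32
    fi_version.argtypes = []
    version = fi_version()
    return (version >> 16, version & 0xFFFF)


if __name__ == "__main__":
    lib = load_libfabric()
    print("libfabric version:", libfabric_version(lib))
```

Resolving the library at run time rather than link time lets a build with CXI support start on hosts without libfabric, as long as the CXI transport is not selected.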

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

File                        Description
p2p/util/transport_type.h   Adds CXI transport selection and helper is_cxi_transport().
p2p/util/common.h           Adds UCCL_POST_TRANSIENT return code used for transient post failures.
p2p/README.md               Documents how to build/run with CXI and new CXI-related env vars.
p2p/Makefile                Adds USE_CXI toggle, libfabric include path, and CXI sources.
p2p/engine.h                Extends GenericEndpoint + MR handle to support CXI endpoint/MR.
p2p/engine.cc               Adds CXI endpoint construction, transient rc handling, and CXI inflight tuning.
p2p/endpoint_wrapper.h      Adds CXI dispatch for regmr/deregmr/read/write and FIFO metadata encoding.
p2p/cxi/fabric_dl.h         New libfabric dlopen/dlsym helper.
p2p/cxi/fabric_dl.cc        New exported fi_* wrappers (currently partial).
p2p/cxi/cxi_endpoint.h      Declares CxiEndpoint and CXI FIFO metadata helpers.
p2p/cxi/cxi_endpoint.cc     Implements the CXI/libfabric endpoint, OOB handshake, MR reg, and RMA ops.
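
The transient return code listed for p2p/util/common.h suggests a retry-after-progress pattern, sketched below. Everything here is hypothetical: the numeric values of the constants, the `post` and `progress` callables, and the retry limit are assumptions for illustration, since the engine's actual handling is not shown on this page. The general shape mirrors how libfabric callers usually treat `-FI_EAGAIN`: drive completions, then post again.

```python
import time

# Hypothetical values; the PR's real constants are not shown on this page.
UCCL_POST_OK = 0
UCCL_POST_TRANSIENT = 1


def post_with_retry(post, progress, max_attempts=1000, backoff_s=0.0):
    """Call post() until it stops reporting a transient failure.

    post:     callable returning a status code (hypothetical interface).
    progress: callable that drains completions to free queue slots.
    Returns (status, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        rc = post()
        if rc != UCCL_POST_TRANSIENT:
            return rc, attempt  # success or a hard error; do not retry
        progress()  # make room before the next attempt
        if backoff_s:
            time.sleep(backoff_s)
    return UCCL_POST_TRANSIENT, max_attempts


if __name__ == "__main__":
    # Fake transport: the queue is "full" twice, then accepts the post.
    outcomes = iter([UCCL_POST_TRANSIENT, UCCL_POST_TRANSIENT, UCCL_POST_OK])
    print(post_with_retry(lambda: next(outcomes), lambda: None))
```

Bounding the number of attempts keeps a stuck queue from turning into a silent busy loop, and hard errors are returned immediately rather than retried.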

Comment thread p2p/cxi/fabric_dl.cc
Comment thread p2p/engine.cc Outdated
Comment thread p2p/engine.cc Outdated
Comment thread p2p/engine.cc Outdated
Comment thread p2p/engine.cc Outdated
Comment thread p2p/cxi/cxi_endpoint.cc

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Comment thread p2p/cxi/cxi_endpoint.cc
Comment thread p2p/cxi/cxi_endpoint.cc
Comment thread p2p/cxi/cxi_endpoint.cc
Comment thread p2p/cxi/cxi_endpoint.cc

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment thread p2p/Makefile
Comment thread p2p/cxi/cxi_endpoint.cc
@fergusfinn fergusfinn closed this Jun 17, 2026
2 participants