[P2P]: add CXI transport endpoint#999

Merged
praveingk merged 12 commits into
uccl-project:mainfrom
doublewordai:cxi-p2p-engine
Jun 23, 2026
@fergusfinn fergusfinn commented Jun 17, 2026

Contributor

Thanks for all your work on UCCL!

This PR adds a CXI/libfabric transport for the UCCL P2P endpoint used by NIXL. It is separate from the UCCL-EP CXI work in #956 and #997, but it targets the same Slingshot/Cassini cxi provider.

Would be good to know if there should be more code sharing between this and the earlier EP patch.

Benchmarking/testing

Benchmarking was performed on the Isambard cluster in the UK, on nodes with 4x GH200, HPE Slingshot, and 4x Cassini NICs (25 GB/s each).

export UCCL_P2P_TRANSPORT=cxi
export UCCL_P2P_DISABLE_IPC=1
export UCCL_LIBFABRIC_SO=/opt/libfabric/lib/libfabric.so.1

# Run once per num_iovs in 1,2,4,8,16,32,64,128.
torchrun --nnodes=2 --nproc_per_node=1 --node-rank=<0|1> \
  --master_addr=127.0.0.1 --master_port=<port> \
  p2p/benchmarks/benchmark_uccl.py \
  --local-gpu-idx=<0|1> --device=gpu --sizes=1179648 --iters=16 \
  --num-iovs=<num_iovs> --mode=write --float-type=none

| num_iovs | total_bytes | GB/s |
| --- | --- | --- |
| 1 | 1179648 | 19.60 |
| 2 | 2359296 | 21.58 |
| 4 | 4718592 | 22.56 |
| 8 | 9437184 | 23.34 |
| 16 | 18874368 | 23.81 |
| 32 | 37748736 | 24.00 |
| 64 | 75497472 | 24.10 |
| 128 | 150994944 | 24.17 |
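
For reading the sweep results, a small sketch of the arithmetic may help: `total_bytes` is `num_iovs` times the 1179648-byte message size from `--sizes`, and each reported rate can be compared against a NIC ceiling. The 25 GB/s ceiling (a single Cassini NIC serving the one GPU pair) and the 1 GB = 1e9 bytes convention are assumptions made here for illustration, not something the benchmark output states.

```python
# Message size per iov, taken from --sizes in the benchmark command.
MSG_BYTES = 1179648
# Assumption: a single 25 GB/s Cassini NIC is the relevant ceiling.
NIC_GBPS = 25.0

# (num_iovs, reported GB/s) copied from the results.
RESULTS = [
    (1, 19.60), (2, 21.58), (4, 22.56), (8, 23.34),
    (16, 23.81), (32, 24.00), (64, 24.10), (128, 24.17),
]


def summarize(num_iovs, gbps):
    """Return (total_bytes, implied transfer time in us, fraction of NIC_GBPS)."""
    total_bytes = num_iovs * MSG_BYTES
    # Time implied by the reported rate, assuming 1 GB = 1e9 bytes.
    implied_us = total_bytes / (gbps * 1e9) * 1e6
    return total_bytes, implied_us, gbps / NIC_GBPS


for n, g in RESULTS:
    total, us, frac = summarize(n, g)
    print(f"{n:>4} {total:>10} {g:6.2f} GB/s ~{us:9.1f} us {frac:6.1%} of assumed ceiling")
```

Under those assumptions, the single-iov case sits near 78% of the ceiling and the larger batches approach 97%.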

Then KVBench in NIXL (KVBench does not have UCCL wired in yet, but it has a plan mode, and nixlbench does have UCCL wired in):

 python3 main.py plan \
    --model ./examples/model_deepseek_r1.yaml \
    --model_config ./examples/block-tp1-pp8.yaml \
    --backend UCX \
    --source gpu \
    --destination gpu \
    --runtime_type ASIO \
    --num_iter 16 \
    --warmup_iter 16 \
    --num_threads 1 \
    --page_size 256 \
    --format json
    
  nixlbench \
    --runtime_type ASIO \
    --backend UCCL \
    --worker_type nixl \
    --initiator_seg_type VRAM \
    --target_seg_type VRAM \
    --scheme pairwise \
    --mode SG \
    --op_type WRITE \
    --total_buffer_size 351535104 \
    --num_initiator_dev 1 \
    --num_target_dev 1 \
    --start_block_size 1179648 \
    --max_block_size 1179648 \
    --start_batch_size 298 \
    --max_batch_size 298 \
    --num_iter 16 \
    --warmup_iter 16 \
    --large_blk_iter_ftr 16 \
    --num_threads 1
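
| access | page_size | block_size | batch_size | total_bytes | GB/s | avg_lat_us | avg_prep_us | avg_post_us | avg_tx_us |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| block | 256 | 1179648 | 298 | 351535104 | 23.141011 | 51.0 | 12.0 | 9.0 | 15165.0 |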
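
As a sanity check on the nixlbench parameters, the sketch below confirms that `--total_buffer_size` equals block size times batch size, and computes the rate implied by `avg_tx_us` alone. How nixlbench derives its own GB/s column is not stated here, so the helper is only an illustration that assumes 1 GB = 1e9 bytes; the function names are hypothetical.

```python
# Values copied from the nixlbench command and its result row.
BLOCK_SIZE = 1179648
BATCH_SIZE = 298
TOTAL_BUFFER_SIZE = 351535104
REPORTED_GBPS = 23.141011
AVG_TX_US = 15165.0


def batch_bytes(block_size, batch_size):
    """Bytes moved by one batch of equally sized blocks."""
    return block_size * batch_size


def rate_gbps(total_bytes, micros):
    """Rate in GB/s for total_bytes moved in `micros` microseconds (1 GB = 1e9 bytes)."""
    return total_bytes / (micros * 1e-6) / 1e9


total = batch_bytes(BLOCK_SIZE, BATCH_SIZE)
print("batch bytes match total_buffer_size:", total == TOTAL_BUFFER_SIZE)
print(f"rate implied by avg_tx_us: {rate_gbps(total, AVG_TX_US):.3f} GB/s")
print(f"rate reported by nixlbench: {REPORTED_GBPS:.3f} GB/s")
```

The rate implied by the transfer time alone comes out marginally above the reported figure, which would be consistent with the reported number including some per-batch overhead, though that is a guess.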

I also wired it into vLLM: disaggregated prefill completes for Qwen3-0.6B, and direct NIXL telemetry shows the expected KV transfer volume.

@fergusfinn fergusfinn changed the title p2p: add CXI transport endpoint [P2P]: add CXI transport endpoint Jun 17, 2026

@praveingk praveingk left a comment

Collaborator

Thanks for this contribution, and it's great to see that you validated with nixlbench. Yes, you are right that KVBench is not wired up yet. I have a few comments that could be looked into.

Comment thread p2p/Makefile Outdated
Comment thread p2p/engine.cc Outdated
Comment thread p2p/engine.cc Outdated
@praveingk

Collaborator

@fergusfinn You may need to run ./format.sh for the clang check to pass in CI

@YangZhou1997

Member

/run-benchmark gh200

github-actions Bot commented Jun 19, 2026


Benchmark gh200 failed

PR: #999
Commit: 462d7ee92701d64bfd9bc2faf7f9202303582679
Result: failure
Workflow run: https://github.com/uccl-project/uccl/actions/runs/27842975018

@YangZhou1997

Member

Thank you @fergusfinn for the contribution. It seems there are missing-header compilation errors on machines without libfabric/CXI installed; see the workflow run above.

@fergusfinn

Contributor Author

Hi @YangZhou1997, from the previous review by @praveingk I thought that including the various transport headers at compile time was the standard (e.g. rdma/efadv_dl.h includes <infiniband/efadv.h>). Is the idea to vendor them? Alternatively, I could update the PR to install them on the CI runner.

@YangZhou1997

Member

I see. My bad. I think your current approach is right. Our CI servers just do not have libfabric installed, thus triggering errors. I have created a PR on top of yours: doublewordai#14, and it should pass our CI server.
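
For machines where it is unclear whether libfabric is present at all, a small runtime preflight like the following could be run before benchmarking. It does not address the compile-time header errors discussed above; it only checks that a shared library can be loaded and reports its version through libfabric's `fi_version()`. Reading the path from `UCCL_LIBFABRIC_SO` mirrors the environment variable used in the benchmark setup, but the fallback name `libfabric.so.1` and the helper itself are assumptions for illustration, not part of this PR.

```python
import ctypes
import os


def libfabric_version(path):
    """Return (major, minor) if `path` loads as libfabric, else None."""
    try:
        lib = ctypes.CDLL(path)
        fi_version = lib.fi_version  # raises AttributeError if the symbol is absent
    except (OSError, AttributeError):
        return None
    fi_version.restype = ctypes.c_uint32
    fi_version.argtypes = []
    v = fi_version()
    # libfabric packs the version as (major << 16) | minor.
    return v >> 16, v & 0xFFFF


# Hypothetical fallback name; the real path depends on the installation.
so_path = os.environ.get("UCCL_LIBFABRIC_SO", "libfabric.so.1")
version = libfabric_version(so_path)
if version is None:
    print(f"libfabric not loadable from {so_path!r}")
else:
    print(f"libfabric {version[0]}.{version[1]} loaded from {so_path!r}")
```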

fixing CI and add libfabric dependency
@YangZhou1997

Member

/run-benchmark gh200

github-actions Bot commented Jun 21, 2026


Benchmark gh200 passed

PR: #999
Commit: eb9a9e74ab5b08a7f3e3dd6f92c7b4cbebc122c3
Result: success
Workflow run: https://github.com/uccl-project/uccl/actions/runs/27913992426

@YangZhou1997

Member

@praveingk the CI run looks good! Hope this may help your reviewing process

@praveingk praveingk self-requested a review June 22, 2026 04:12

@praveingk praveingk left a comment

Collaborator

Thanks for addressing the comments, @fergusfinn. Looks good to me.

@fergusfinn

Contributor Author

Anything still required to merge @praveingk?

@praveingk

Collaborator

> Anything still required to merge @praveingk?

Apologies. Let me do it now.

@praveingk praveingk merged commit 2dfa76e into uccl-project:main Jun 23, 2026
14 checks passed
3 participants