[P2P]: add CXI transport endpoint by fergusfinn · Pull Request #999 · uccl-project/uccl

fergusfinn · 2026-06-17T13:49:12Z

Thanks for all your work on UCCL!

This PR adds a CXI/libfabric transport for the UCCL P2P endpoint used by NIXL. It is separate from the UCCL-EP CXI work in #956 and #997, but it targets the same Slingshot/Cassini cxi provider.

It would be good to know whether this should share more code with the earlier EP patches.

Benchmarking/testing

Benchmarking was performed on the Isambard cluster in the UK: nodes with 4 x GH200, HPE Slingshot, and 4 x Cassini 25 GB/s NICs per node.

export UCCL_P2P_TRANSPORT=cxi
export UCCL_P2P_DISABLE_IPC=1
export UCCL_LIBFABRIC_SO=/opt/libfabric/lib/libfabric.so.1

# Run once per num_iovs in 1,2,4,8,16,32,64,128.
torchrun --nnodes=2 --nproc_per_node=1 --node-rank=<0|1> \
  --master_addr=127.0.0.1 --master_port=<port> \
  p2p/benchmarks/benchmark_uccl.py \
  --local-gpu-idx=<0|1> --device=gpu --sizes=1179648 --iters=16 \
  --num-iovs=<num_iovs> --mode=write --float-type=none
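
For anyone repeating the sweep, the per-`num_iovs` runs above could be generated with a small helper like the one below. This is only a sketch: the function names are hypothetical, and the port `29500` is an assumed placeholder for `<port>`. The flags themselves are copied from the command above.

```python
import shlex

# Values of --num-iovs listed in the comment above the torchrun command.
NUM_IOVS = [1, 2, 4, 8, 16, 32, 64, 128]


def build_command(num_iovs, node_rank, local_gpu_idx, master_port,
                  master_addr="127.0.0.1"):
    """Return the torchrun argv for one benchmark run (hypothetical helper)."""
    return [
        "torchrun", "--nnodes=2", "--nproc_per_node=1",
        f"--node-rank={node_rank}",
        f"--master_addr={master_addr}", f"--master_port={master_port}",
        "p2p/benchmarks/benchmark_uccl.py",
        f"--local-gpu-idx={local_gpu_idx}", "--device=gpu",
        "--sizes=1179648", "--iters=16",
        f"--num-iovs={num_iovs}", "--mode=write", "--float-type=none",
    ]


def sweep_commands(node_rank, local_gpu_idx, master_port):
    """One command per num_iovs value, in increasing order."""
    return [build_command(n, node_rank, local_gpu_idx, master_port)
            for n in NUM_IOVS]


if __name__ == "__main__":
    # 29500 is an assumed port; substitute the real <port> for your cluster.
    for argv in sweep_commands(node_rank=0, local_gpu_idx=0, master_port=29500):
        print(shlex.join(argv))
```

The same script would be run on both nodes with `node_rank` and `local_gpu_idx` set per node, alongside the exported environment variables.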

num_iovs	total_bytes	GB/s
1	1179648	19.60
2	2359296	21.58
4	4718592	22.56
8	9437184	23.34
16	18874368	23.81
32	37748736	24.00
64	75497472	24.10
128	150994944	24.17
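
As a rough way to read the table, the sketch below recomputes `total_bytes` from `num_iovs` and derives the implied time per transfer and the fraction of the nominal 25 GB/s NIC rate. It assumes that GB/s means 10^9 bytes per second and that each run drives a single NIC (one process per node); neither is stated explicitly above, and the function name is hypothetical.

```python
BLOCK_BYTES = 1179648   # --sizes value from the benchmark command
NIC_PEAK_GBPS = 25.0    # nominal per-NIC rate quoted in the description

# (num_iovs, GB/s) pairs copied from the table above.
RESULTS = [
    (1, 19.60), (2, 21.58), (4, 22.56), (8, 23.34),
    (16, 23.81), (32, 24.00), (64, 24.10), (128, 24.17),
]


def summarize(num_iovs, gbps):
    """Return (total_bytes, implied transfer time in us, fraction of NIC peak)."""
    total_bytes = num_iovs * BLOCK_BYTES
    # Assumption: 1 GB/s == 1e9 bytes per second.
    transfer_us = total_bytes / (gbps * 1e9) * 1e6
    # Assumption: a single 25 GB/s NIC carries the whole transfer.
    fraction = gbps / NIC_PEAK_GBPS
    return total_bytes, transfer_us, fraction


if __name__ == "__main__":
    for num_iovs, gbps in RESULTS:
        total_bytes, transfer_us, fraction = summarize(num_iovs, gbps)
        print(f"{num_iovs:>4} {total_bytes:>10} {transfer_us:>10.1f} us "
              f"{fraction:.1%}")
```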

Then KVBench in NIXL (KVBench doesn't currently have UCCL wired in, but it has a plan mode, and nixlbench does have UCCL wired in):

python3 main.py plan \
  --model ./examples/model_deepseek_r1.yaml \
  --model_config ./examples/block-tp1-pp8.yaml \
  --backend UCX \
  --source gpu \
  --destination gpu \
  --runtime_type ASIO \
  --num_iter 16 \
  --warmup_iter 16 \
  --num_threads 1 \
  --page_size 256 \
  --format json

nixlbench \
  --runtime_type ASIO \
  --backend UCCL \
  --worker_type nixl \
  --initiator_seg_type VRAM \
  --target_seg_type VRAM \
  --scheme pairwise \
  --mode SG \
  --op_type WRITE \
  --total_buffer_size 351535104 \
  --num_initiator_dev 1 \
  --num_target_dev 1 \
  --start_block_size 1179648 \
  --max_block_size 1179648 \
  --start_batch_size 298 \
  --max_batch_size 298 \
  --num_iter 16 \
  --warmup_iter 16 \
  --large_blk_iter_ftr 16 \
  --num_threads 1

access	page_size	block_size	batch_size	total_bytes	GB/s	avg_lat_us	avg_prep_us	avg_post_us	avg_tx_us
block	256	1179648	298	351535104	23.141011	51.0	12.0	9.0	15165.0
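
The nixlbench parameters can be sanity-checked with a few lines of arithmetic. The sketch below confirms that `block_size` times `batch_size` matches `--total_buffer_size`, and estimates bandwidth from `avg_tx_us`. How nixlbench actually derives its GB/s column is not shown here, so the second calculation is an assumption and is only expected to land near the reported value, not match it exactly.

```python
BLOCK_SIZE = 1179648       # --start_block_size / --max_block_size
BATCH_SIZE = 298           # --start_batch_size / --max_batch_size
TOTAL_BUFFER = 351535104   # --total_buffer_size
AVG_TX_US = 15165.0        # avg_tx_us column from the result row
REPORTED_GBPS = 23.141011  # GB/s column from the result row


def batch_bytes(block_size, batch_size):
    """Bytes moved by one batch of equally sized blocks."""
    return block_size * batch_size


def estimated_gbps(total_bytes, tx_us):
    """Bandwidth estimate, assuming 1 GB == 1e9 bytes and tx_us covers the whole batch."""
    return total_bytes / (tx_us * 1e-6) / 1e9


if __name__ == "__main__":
    total = batch_bytes(BLOCK_SIZE, BATCH_SIZE)
    print("batch bytes match total_buffer_size:", total == TOTAL_BUFFER)
    est = estimated_gbps(total, AVG_TX_US)
    print(f"estimated {est:.3f} GB/s vs reported {REPORTED_GBPS:.3f} GB/s")
```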

I also wired it into vLLM: disaggregated prefill completes for Qwen3-0.6b, with direct NIXL telemetry showing the expected KV transfer volume.

praveingk

Thanks for this contribution, and it's great to see that you validated it with nixlbench. Yes, you are right that KVBench is not wired up yet. I have a few comments that could be looked into.

praveingk · 2026-06-18T09:21:43Z

@fergusfinn You may need to run ./format.sh for the clang check to pass in CI

YangZhou1997 · 2026-06-19T18:45:03Z

/run-benchmark gh200

github-actions · 2026-06-19T18:45:17Z

Benchmark `gh200` failed

PR: #999
Commit: 462d7ee92701d64bfd9bc2faf7f9202303582679
Result: failure
Workflow run: https://github.com/uccl-project/uccl/actions/runs/27842975018

YangZhou1997 · 2026-06-19T18:52:08Z

Thank you @fergusfinn for the contribution. There seem to be missing-header compilation errors on machines without libfabric/CXI installed; see the workflow run above.

fergusfinn · 2026-06-20T10:14:45Z

Hi @YangZhou1997, from the previous review from @praveingk I thought that including the various transport headers at compile time was the standard (e.g. rdma/efadv_dl.h has <infiniband/efadv.h>). Is the idea to vendor them? Alternatively, I could update the PR to include them in the CI runner.

YangZhou1997 · 2026-06-20T17:53:11Z

I see. My bad. I think your current approach is right. Our CI servers just do not have libfabric installed, thus triggering errors. I have created a PR on top of yours: doublewordai#14, and it should pass our CI server.

fixing CI and add libfabric dependency

YangZhou1997 · 2026-06-21T18:46:29Z

/run-benchmark gh200

github-actions · 2026-06-21T18:46:39Z

Benchmark `gh200` passed

PR: #999
Commit: eb9a9e74ab5b08a7f3e3dd6f92c7b4cbebc122c3
Result: success
Workflow run: https://github.com/uccl-project/uccl/actions/runs/27913992426

YangZhou1997 · 2026-06-21T18:55:23Z

@praveingk the CI run looks good! Hope this may help your reviewing process

praveingk

Thanks for addressing the comments @fergusfinn . Looks good to me.

fergusfinn · 2026-06-23T06:38:45Z

Anything still required to merge @praveingk?

praveingk · 2026-06-23T06:40:12Z

> Anything still required to merge @praveingk?

Apologies. Let me do it now.

fergus barratt and others added 5 commits June 16, 2026 10:40

p2p: add CXI transport backend

72e0e0e

p2p: fail direct posts on hard errors

7088689

p2p: harden CXI completion contexts

276f517

p2p: avoid spinning on transient vector posts

b0a84d9

p2p: always request CXI delivery completion

c728050

fergusfinn requested review from YangZhou1997, derekwin, monopodium, praveingk and zhongjiechen as code owners June 17, 2026 13:49

fergusfinn changed the title ~~p2p: add CXI transport endpoint~~ [P2P]: add CXI transport endpoint Jun 17, 2026

praveingk added the run-benchmark label Jun 18, 2026

praveingk reviewed Jun 18, 2026

Comment thread p2p/Makefile Outdated

Comment thread p2p/engine.cc Outdated

Comment thread p2p/engine.cc Outdated

p2p: build CXI transport unconditionally

86c7b51

fergusfinn added 2 commits June 18, 2026 11:08

p2p: retry transient one-sided posts

c2f0e96

p2p: format CXI transport changes

462d7ee

fergusfinn force-pushed the cxi-p2p-engine branch from 1798cdc to 462d7ee Compare June 18, 2026 10:09

YangZhou1997 removed the run-benchmark label Jun 19, 2026

Merge branch 'main' into cxi-p2p-engine

40998a1

fixing CI and add libfabric dependency

c4ec1f3

Merge pull request #14 from uccl-project/yzhou/fix_cxi_ci

eb9a9e7

fixing CI and add libfabric dependency

praveingk self-requested a review June 22, 2026 04:12

praveingk approved these changes Jun 22, 2026

Merge branch 'main' into cxi-p2p-engine

462aa23

praveingk merged commit 2dfa76e into uccl-project:main Jun 23, 2026
14 checks passed
