[P2P]: add CXI transport endpoint #999
praveingk
left a comment
Thanks for this contribution, and it's great to see that you validated with nixlbench. Yes, you are right that KVBench is not wired up yet. I have a few comments that could be looked into.
@fergusfinn You may need to run |
1798cdc to 462d7ee
/run-benchmark gh200
Benchmark
Thank you @fergusfinn for the contribution. There seem to be compilation errors from missing headers on machines without libfabric/CXI installed; see the above workflow run.
hi @YangZhou1997 - from the previous review from @praveingk i thought that including the various transport headers at compile time was the standard (i.e. rdma/efadv_dl.h has |
I see. My bad. I think your current approach is right. Our CI servers just do not have libfabric installed, which triggers the errors. I have created a PR on top of yours, doublewordai#14, and it should pass on our CI server.
fixing CI and add libfabric dependency
/run-benchmark gh200
Benchmark
@praveingk the CI run looks good! Hope this helps your review process.
praveingk
left a comment
Thanks for addressing the comments, @fergusfinn. Looks good to me.
Is anything still required to merge, @praveingk?
Apologies. Let me do it now.
Thanks for all your work on UCCL!
This PR adds a CXI/libfabric transport for the UCCL P2P endpoint used by NIXL. It is separate from the UCCL-EP CXI work in #956 and #997, but it targets the same Slingshot/Cassini `cxi` provider. It would be good to know whether there should be more code sharing between this and the earlier EP patch.
Benchmarking/testing
Benchmarking was performed on the Isambard cluster in the UK: 4x GH200 nodes, with HPE Slingshot and 4x Cassini 25 GB/s NICs on each node.
Then, KVBench in NIXL (currently, KVBench doesn't have UCCL wired in, but it has a plan mode, and nixlbench does have UCCL wired in):

    python3 main.py plan \
      --model ./examples/model_deepseek_r1.yaml \
      --model_config ./examples/block-tp1-pp8.yaml \
      --backend UCX \
      --source gpu \
      --destination gpu \
      --runtime_type ASIO \
      --num_iter 16 \
      --warmup_iter 16 \
      --num_threads 1 \
      --page_size 256 \
      --format json

    nixlbench \
      --runtime_type ASIO \
      --backend UCCL \
      --worker_type nixl \
      --initiator_seg_type VRAM \
      --target_seg_type VRAM \
      --scheme pairwise \
      --mode SG \
      --op_type WRITE \
      --total_buffer_size 351535104 \
      --num_initiator_dev 1 \
      --num_target_dev 1 \
      --start_block_size 1179648 \
      --max_block_size 1179648 \
      --start_batch_size 298 \
      --max_batch_size 298 \
      --num_iter 16 \
      --warmup_iter 16 \
      --large_blk_iter_ftr 16 \
      --num_threads 1

Also wired it into vLLM, and vLLM disaggregated prefill completes for Qwen3-0.6b with direct NIXL telemetry showing the expected KV transfer volume.
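As a sanity check on the nixlbench flags above, the total buffer size works out to exactly the block size multiplied by the batch size. The sketch below reproduces that arithmetic and, purely as an illustration, derives an idealized lower bound on the time to move one batch over a single NIC. The helper names are hypothetical, and the 25 GB/s figure is an assumption: it takes the "25GBps" quoted for each Cassini NIC as decimal gigabytes per second and ignores all protocol and software overhead. It is not a measured result from this PR.

```python
# Values copied from the nixlbench command line in the PR description.
BLOCK_SIZE = 1_179_648           # --start_block_size / --max_block_size (bytes)
BATCH_SIZE = 298                 # --start_batch_size / --max_batch_size
TOTAL_BUFFER_SIZE = 351_535_104  # --total_buffer_size (bytes)

# Assumption: one Cassini NIC at 25 GB/s (decimal), no overhead of any kind.
NIC_BYTES_PER_SEC = 25e9


def batch_bytes(block_size: int, batch_size: int) -> int:
    """Bytes moved by one batch of equally sized blocks (hypothetical helper)."""
    return block_size * batch_size


def ideal_seconds(num_bytes: int, bytes_per_sec: float) -> float:
    """Idealized transfer time at line rate; a lower bound, not a prediction."""
    return num_bytes / bytes_per_sec


payload = batch_bytes(BLOCK_SIZE, BATCH_SIZE)
lower_bound = ideal_seconds(payload, NIC_BYTES_PER_SEC)

print("batch matches --total_buffer_size:", payload == TOTAL_BUFFER_SIZE)
print(f"payload per batch: {payload / 2**20:.2f} MiB")
print(f"idealized time per batch on one NIC: {lower_bound * 1e3:.2f} ms")
```

Comparing a measured nixlbench per-batch latency against this bound gives a rough sense of how close a transport gets to line rate, under the stated assumptions.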