p2p: add CXI transport endpoint #6
Closed
fergusfinn wants to merge 5 commits into
Pull request overview
This PR adds an optional HPE Slingshot/CXI (libfabric) transport backend to the UCCL P2P engine, selectable at runtime via UCCL_P2P_TRANSPORT=cxi, with build-time enablement via USE_CXI=1.
Changes:
- Introduces `CxiEndpoint` (libfabric-based) and wires it into the engine’s runtime endpoint variant and notification path.
- Adds a libfabric `dlopen`/`dlsym` wrapper scaffold and Makefile build toggles for CXI.
- Adds CXI-related runtime configuration (README) and adjusts one-sided in-flight limits / transient-post handling.
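The transient-post handling mentioned in the last bullet can be pictured with a minimal sketch. Only the name `UCCL_POST_TRANSIENT` comes from this PR; the numeric values, `post_with_retry`, and the `post`/`progress` callbacks below are hypothetical stand-ins, not the engine's actual code. The idea is that a post rejected for a temporary reason (libfabric reports back-pressure as `-FI_EAGAIN`) is retried after progress has been driven, rather than treated as a hard failure.

```cpp
#include <cassert>
#include <functional>

// Hypothetical values: only the name UCCL_POST_TRANSIENT appears in the PR.
constexpr int kPostOk = 0;
constexpr int kPostTransient = 1;

// Retry post() while it reports a transient failure. Between attempts,
// progress() is called so that completions can free in-flight slots.
// Returns the last code from post(); that is kPostTransient if every
// attempt was rejected, and any other code is passed through unchanged.
inline int post_with_retry(const std::function<int()>& post,
                           const std::function<void()>& progress,
                           int max_attempts) {
  assert(max_attempts > 0);
  int rc = kPostTransient;
  for (int attempt = 0; attempt < max_attempts && rc == kPostTransient;
       ++attempt) {
    if (attempt > 0) progress();  // drain completions before retrying
    rc = post();
  }
  return rc;
}
```

Bounding the attempts keeps a stuck queue from spinning forever; the caller decides what to do when the transient code is returned after the last attempt.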
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| p2p/util/transport_type.h | Adds CXI transport selection and helper is_cxi_transport(). |
| p2p/util/common.h | Adds UCCL_POST_TRANSIENT return code used for transient post failures. |
| p2p/README.md | Documents how to build/run with CXI and new CXI-related env vars. |
| p2p/Makefile | Adds USE_CXI toggle, libfabric include path, and CXI sources. |
| p2p/engine.h | Extends GenericEndpoint + MR handle to support CXI endpoint/MR. |
| p2p/engine.cc | Adds CXI endpoint construction, transient rc handling, and CXI inflight tuning. |
| p2p/endpoint_wrapper.h | Adds CXI dispatch for regmr/deregmr/read/write and FIFO metadata encoding. |
| p2p/cxi/fabric_dl.h | New libfabric dlopen/dlsym helper. |
| p2p/cxi/fabric_dl.cc | New exported fi_* wrappers (currently partial). |
| p2p/cxi/cxi_endpoint.h | Declares CxiEndpoint and CXI FIFO metadata helpers. |
| p2p/cxi/cxi_endpoint.cc | Implements the CXI/libfabric endpoint, OOB handshake, MR reg, and RMA ops. |
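As a rough illustration of the runtime selection described in the overview, the sketch below reads `UCCL_P2P_TRANSPORT` and reports whether CXI was requested. The `TransportType` enum, `parse_transport`, and the case-insensitive match are assumptions made for illustration; the real definitions live in `p2p/util/transport_type.h` and may differ. Only the environment variable name and the helper name `is_cxi_transport()` are taken from this PR.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical enum; not the PR's actual type.
enum class TransportType { kDefault, kCxi };

// Map a raw UCCL_P2P_TRANSPORT value to a transport.
// Assumption: matching is case-insensitive and unknown values fall back
// to the default transport.
inline TransportType parse_transport(const char* raw) {
  if (raw == nullptr) return TransportType::kDefault;
  std::string value(raw);
  std::transform(value.begin(), value.end(), value.begin(),
                 [](unsigned char c) {
                   return static_cast<char>(std::tolower(c));
                 });
  return value == "cxi" ? TransportType::kCxi : TransportType::kDefault;
}

// Illustrative stand-in for the helper named in the table above.
inline bool is_cxi_transport() {
  return parse_transport(std::getenv("UCCL_P2P_TRANSPORT")) ==
         TransportType::kCxi;
}
```

Separating the parsing from the `getenv` call keeps the selection logic testable without touching the process environment.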
This PR adds a CXI/libfabric transport for the UCCL P2P endpoint used by NIXL. It is separate from the UCCL-EP CXI work in uccl-project#956 and uccl-project#997, but it targets the same Slingshot/Cassini `cxi` provider.
Benchmarking/testing
Then, KVBench in NIXL (kvbench doesn't currently have UCCL wired in, but it has a plan mode, and nixlbench does have UCCL wired in):

    python3 main.py plan \
      --model ./examples/model_deepseek_r1.yaml \
      --model_config ./examples/block-tp1-pp8.yaml \
      --backend UCX \
      --source gpu \
      --destination gpu \
      --runtime_type ASIO \
      --num_iter 16 \
      --warmup_iter 16 \
      --num_threads 1 \
      --page_size 256 \
      --format json

    nixlbench \
      --runtime_type ASIO \
      --backend UCCL \
      --worker_type nixl \
      --initiator_seg_type VRAM \
      --target_seg_type VRAM \
      --scheme pairwise \
      --mode SG \
      --op_type WRITE \
      --total_buffer_size 351535104 \
      --num_initiator_dev 1 \
      --num_target_dev 1 \
      --start_block_size 1179648 \
      --max_block_size 1179648 \
      --start_batch_size 298 \
      --max_batch_size 298 \
      --num_iter 16 \
      --warmup_iter 16 \
      --large_blk_iter_ftr 16 \
      --num_threads 1

Also wired it into vLLM: vLLM disaggregated prefill completes for Qwen3-0.6b, with direct NIXL telemetry showing the expected KV transfer volume.
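As a sanity check on the nixlbench parameters above, the buffer size is exactly the block size times the batch size (1179648 × 298 = 351535104), so a single batch of 298 blocks spans the whole buffer. The compile-time sketch below only restates that arithmetic; the constant and function names are made up for illustration.

```cpp
#include <cstdint>

// Values copied from the nixlbench command line above.
constexpr std::uint64_t kBlockSize = 1179648;      // --start/--max_block_size
constexpr std::uint64_t kBatchSize = 298;          // --start/--max_batch_size
constexpr std::uint64_t kTotalBuffer = 351535104;  // --total_buffer_size

// Bytes moved by one batch of equally sized blocks.
constexpr std::uint64_t bytes_per_batch(std::uint64_t block_size,
                                        std::uint64_t batch_size) {
  return block_size * batch_size;
}

static_assert(bytes_per_batch(kBlockSize, kBatchSize) == kTotalBuffer,
              "one batch covers the whole buffer");
static_assert(kBlockSize == 1152 * 1024, "each block is 1152 KiB");
```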