PLUGINS/UCX: add scatter-gather (SGL) put path#1835
Conversation
|
👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
61383da to
0aa8bbb
Compare
📝 WalkthroughWalkthroughThe PR adds optional UCX SGL support detection and declarations, runtime enablement and posting helpers, and a new write-transfer path that can build and submit SGL-based UCX operations. ChangesUCX SGL support
Sequence Diagram(s)sequenceDiagram
participant nixlUcxEngine
participant nixlUcxEp
participant UCX
nixlUcxEngine->>nixlUcxEp: postSgl(local SGL, remote SGL, count, req)
nixlUcxEp->>UCX: ucp_put_nbx(...)
alt UCX returns a request pointer
UCX-->>nixlUcxEp: in-progress request
nixlUcxEp-->>nixlUcxEngine: NIXL_IN_PROG + req
else UCX completes inline
UCX-->>nixlUcxEp: completion status
nixlUcxEp-->>nixlUcxEngine: status + req = nullptr
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/plugins/ucx/ucx_backend.cpp`:
- Around line 1323-1325: The SGL fast path in ucx_backend.cpp currently skips
the per-endpoint batching logic in the send path. Update the NIXL_WRITE branch
in the transfer routine (the block that calls sendXferSgl) so it only takes this
route when the full remote range remote[start_idx:end_idx] resolves to a single
connection, or move the batching/connection-homogeneity check into sendXferSgl
itself. Use the existing transfer flow and metadata helpers in this backend to
verify the range before returning the SGL path.
- Around line 1249-1293: The sendXferSgl() path is using a single
getConnection(remote_agent) for the whole range, which can post remote
addrs/rkeys through the wrong UCX endpoint when descriptors span multiple
connections. Update ucx_backend.cpp so the SGL send is scoped to one connection
by grouping descriptors by rmd->conn (like the non-SGL batching logic) and
posting each group on its own conn->getEp(worker_id), or detect mixed
connections and fall back to the non-SGL path. Keep the existing
local_sgl/remote_sgl setup but build it per connection instead of once for the
whole range.
In `@src/plugins/ucx/ucx_utils.h`:
- Around line 112-117: The public endpoint API `postSgl()` in `ucx_utils.h`
needs a Doxygen block documenting the SGL lifetime contract. Add a `/** ... */`
comment above `postSgl()` that clearly states the caller-owned
`ucp_dt_local_sgl_t` and `ucp_dt_remote_sgl_t` arrays must remain valid until
the associated request completes, so users know the stored UCX pointers must
outlive completion. Use the function name `postSgl()` and the `nixlUcxReq`
request type in the comment to make the ownership and completion requirement
explicit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 8909565d-7350-4d42-8b9f-c50ca03ce3eb
📒 Files selected for processing (5)
meson.buildsrc/plugins/ucx/ucx_backend.cppsrc/plugins/ucx/ucx_backend.hsrc/plugins/ucx/ucx_utils.cppsrc/plugins/ucx/ucx_utils.h
| const ucx_connection_ptr_t conn = getConnection(remote_agent); | ||
| if (!conn) { | ||
| NIXL_ERROR << "No connection found for remote agent: " << remote_agent; | ||
| return NIXL_ERR_NOT_FOUND; | ||
| } | ||
|
|
||
| const size_t count = end_idx - start_idx; | ||
| auto &sgl = int_handle->sgl; | ||
| sgl.resize(count); | ||
| for (size_t i = start_idx; i < end_idx; ++i) { | ||
| const size_t out = i - start_idx; | ||
| const auto lmd = static_cast<nixlUcxPrivateMetadata *>(local[i].metadataP); | ||
| const auto rmd = static_cast<nixlUcxPublicMetadata *>(remote[i].metadataP); | ||
| NIXL_ASSERT(local[i].len == remote[i].len); | ||
|
|
||
| sgl.localAddrs[out] = reinterpret_cast<void *>(local[i].addr); | ||
| sgl.remoteAddrs[out] = static_cast<uint64_t>(remote[i].addr); | ||
| sgl.lengths[out] = local[i].len; | ||
| sgl.memhs[out] = lmd->getMem().getMemh(); | ||
| sgl.rkeys[out] = rmd->getRkey(worker_id).get(); | ||
| } | ||
|
|
||
| const ucp_dt_local_sgl_t local_sgl = { | ||
| .field_mask = UCP_DT_LOCAL_SGL_FIELD_BUFFERS | | ||
| UCP_DT_LOCAL_SGL_FIELD_LENGTHS | | ||
| UCP_DT_LOCAL_SGL_FIELD_MEMHS, | ||
| .buffers = sgl.localAddrs.data(), | ||
| .lengths = sgl.lengths.data(), | ||
| .memhs = sgl.memhs.data(), | ||
| }; | ||
| const ucp_dt_remote_sgl_t remote_sgl = { | ||
| .field_mask = UCP_DT_REMOTE_SGL_FIELD_REMOTE_ADDRS | | ||
| UCP_DT_REMOTE_SGL_FIELD_LENGTHS | | ||
| UCP_DT_REMOTE_SGL_FIELD_RKEYS, | ||
| .remote_addrs = sgl.remoteAddrs.data(), | ||
| .lengths = sgl.lengths.data(), | ||
| .rkeys = sgl.rkeys.data(), | ||
| }; | ||
|
|
||
| auto &ep = conn->getEp(worker_id); | ||
|
|
||
| int_handle->reserve(single_ep_request_count); | ||
|
|
||
| nixlUcxReq req; | ||
| const nixl_status_t post_ret = ep->postSgl(local_sgl, remote_sgl, count, req); |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Keep SGL posts scoped to a single UCX connection.
sendXferSgl() posts the entire range through getConnection(remote_agent), but the non-SGL path batches by each descriptor’s rmd->conn and flushes all distinct connections. If a range spans multiple connections, this sends remote addresses/rkeys through the wrong endpoint. Split the SGL path per rmd->conn or fall back unless all descriptors share the same connection.
🧰 Tools
🪛 GitHub Actions: Clang Format Check / 0_clang-format.txt
[error] 1269-1277: clang-format-diff-19 reported formatting changes required (field_mask line wrapping). Run clang-format-diff-19/clang-format to apply formatting.
[error] 1278-1286: clang-format-diff-19 reported formatting changes required (field_mask line wrapping). Run clang-format-diff-19/clang-format to apply formatting.
🪛 GitHub Actions: Clang Format Check / clang-format
[error] 1269-1276: clang-format-diff-19 reported formatting differences (clang format check failed). Run clang-format on this file to match project style.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/plugins/ucx/ucx_backend.cpp` around lines 1249 - 1293, The sendXferSgl()
path is using a single getConnection(remote_agent) for the whole range, which
can post remote addrs/rkeys through the wrong UCX endpoint when descriptors span
multiple connections. Update ucx_backend.cpp so the SGL send is scoped to one
connection by grouping descriptors by rmd->conn (like the non-SGL batching
logic) and posting each group on its own conn->getEp(worker_id), or detect mixed
connections and fall back to the non-SGL path. Keep the existing
local_sgl/remote_sgl setup but build it per connection instead of once for the
whole range.
| #ifdef HAVE_UCX_SGL_API | ||
| if (sglEnabled_ && operation == NIXL_WRITE) { | ||
| return sendXferSgl(local, remote, remote_agent, handle, start_idx, end_idx); |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Only route to SGL after confirming the range is connection-homogeneous.
This branch bypasses the existing per-endpoint batching logic. Gate it on all remote[start_idx:end_idx] metadata resolving to one connection, or make sendXferSgl() perform the same batching internally.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/plugins/ucx/ucx_backend.cpp` around lines 1323 - 1325, The SGL fast path
in ucx_backend.cpp currently skips the per-endpoint batching logic in the send
path. Update the NIXL_WRITE branch in the transfer routine (the block that calls
sendXferSgl) so it only takes this route when the full remote range
remote[start_idx:end_idx] resolves to a single connection, or move the
batching/connection-homogeneity check into sendXferSgl itself. Use the existing
transfer flow and metadata helpers in this backend to verify the range before
returning the SGL path.
| #ifdef HAVE_UCX_SGL_API | ||
| [[nodiscard]] nixl_status_t | ||
| postSgl(const ucp_dt_local_sgl_t &local, | ||
| const ucp_dt_remote_sgl_t &remote, | ||
| size_t count, | ||
| nixlUcxReq &req); |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Document the SGL buffer lifetime contract on the public endpoint API.
postSgl() stores UCX pointers into caller-owned SGL arrays, so callers must know those arrays must outlive request completion. Add a Doxygen block for the new public API. As per path instructions, “Use Doxygen block comments (/** ... */) for public APIs.”
Suggested documentation
`#ifdef` HAVE_UCX_SGL_API
+ /**
+ * Post a UCX put using local and remote scatter-gather descriptors.
+ *
+ * The arrays referenced by `local` and `remote` must remain valid until
+ * the returned UCX request completes.
+ */
[[nodiscard]] nixl_status_t
postSgl(const ucp_dt_local_sgl_t &local,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #ifdef HAVE_UCX_SGL_API | |
| [[nodiscard]] nixl_status_t | |
| postSgl(const ucp_dt_local_sgl_t &local, | |
| const ucp_dt_remote_sgl_t &remote, | |
| size_t count, | |
| nixlUcxReq &req); | |
| `#ifdef` HAVE_UCX_SGL_API | |
| /** | |
| * Post a UCX put using local and remote scatter-gather descriptors. | |
| * | |
| * The arrays referenced by `local` and `remote` must remain valid until | |
| * the returned UCX request completes. | |
| */ | |
| [[nodiscard]] nixl_status_t | |
| postSgl(const ucp_dt_local_sgl_t &local, | |
| const ucp_dt_remote_sgl_t &remote, | |
| size_t count, | |
| nixlUcxReq &req); | |
| `#endif` |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/plugins/ucx/ucx_utils.h` around lines 112 - 117, The public endpoint API
`postSgl()` in `ucx_utils.h` needs a Doxygen block documenting the SGL lifetime
contract. Add a `/** ... */` comment above `postSgl()` that clearly states the
caller-owned `ucp_dt_local_sgl_t` and `ucp_dt_remote_sgl_t` arrays must remain
valid until the associated request completes, so users know the stored UCX
pointers must outlive completion. Use the function name `postSgl()` and the
`nixlUcxReq` request type in the comment to make the ownership and completion
requirement explicit.
Source: Path instructions
| const auto engine_config = | ||
| nixl::getBackendParamDefaulted(custom_params, "engine_config", std::string()); | ||
|
|
||
| sglEnabled_ = nixl::config::getValueOptional<bool>("NIXL_UCX_SGL_ENABLE").value_or(false); |
There was a problem hiding this comment.
...::getValueDefaulted("NIXL_UCX_SGL_ENABLE", false)
Pending openucx/ucx#11577
What?
Add a UCX scatter-gather (SGL) put path. When enabled, a
NIXL_WRITEis issued as a singleucp_put_nbxover all of its descriptors instead of one put per element. It is gated at compile time byHAVE_UCX_SGL_API(meson-detected) and at runtime by theNIXL_UCX_SGL_ENABLEenv var.Why?
Posting one UCX request per descriptor is wasteful for multi-element writes. An SGL put hands UCX the whole descriptor list in one call, reducing per-element overhead.
How?
nixlUcxEp::postSgl()is a thin wrapper arounducp_put_nbxwith the SGL datatype.sendXferRange()selectssendXferSgl()for enabled writes.sendXferSgl()gathers the range into per-field arrays and posts once. All engines (base, progress-thread, thread pool per chunk) route through this one decision point.sglBuffers) because UCX keeps pointers into them until the put completes.