
PLUGINS/UCX: add scatter-gather (SGL) put path #1835

Draft
michal-shalev wants to merge 1 commit into ai-dynamo:main from michal-shalev:ucx-sgl-put

@michal-shalev michal-shalev commented Jun 25, 2026

Contributor

Pending openucx/ucx#11577

What?

Add a UCX scatter-gather (SGL) put path. When enabled, a NIXL_WRITE is issued as a single ucp_put_nbx over all of its descriptors instead of one put per element. It is gated at compile time by HAVE_UCX_SGL_API (meson-detected) and at runtime by the NIXL_UCX_SGL_ENABLE env var.
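The two-level gate described above can be sketched as follows. This is a minimal illustration, not the PR's code: `envFlagEnabled`, `selectPutPath`, and `PutPath` are hypothetical names, and the set of accepted "true" strings is an assumption (the PR reads the flag through `nixl::config`, whose parsing rules are not shown here). Only `HAVE_UCX_SGL_API` and `NIXL_UCX_SGL_ENABLE` come from the description.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for the runtime flag lookup. An unset variable means
// "disabled", which matches the default of false described in the PR.
inline bool envFlagEnabled(const char *name) {
    assert(name != nullptr);
    const char *value = std::getenv(name);
    if (value == nullptr) {
        return false;
    }
    const std::string v(value);
    // Assumed spellings of "true"; the real parser may accept a different set.
    return v == "1" || v == "y" || v == "yes" || v == "true" || v == "on";
}

enum class PutPath { PerDescriptor, Sgl };

// Picks the put path for one transfer. `is_write` stands in for
// `operation == NIXL_WRITE`; `sgl_enabled` is the cached runtime flag.
inline PutPath selectPutPath(bool is_write, bool sgl_enabled) {
#ifdef HAVE_UCX_SGL_API
    if (sgl_enabled && is_write) {
        return PutPath::Sgl;
    }
#else
    // SGL support is compiled out, so the runtime flag has no effect.
    (void)is_write;
    (void)sgl_enabled;
#endif
    return PutPath::PerDescriptor;
}
```

With this shape, a build against a UCX without the SGL datatype always takes the per-descriptor path, and a build with it still needs `NIXL_UCX_SGL_ENABLE` set before any write is routed to SGL.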

Why?

Posting one UCX request per descriptor is wasteful for multi-element writes. An SGL put hands UCX the whole descriptor list in one call, reducing per-element overhead.

How?

  • nixlUcxEp::postSgl() is a thin wrapper around ucp_put_nbx with the SGL datatype.
  • sendXferRange() selects sendXferSgl() for enabled writes. sendXferSgl() gathers the range into per-field arrays and posts once. All engines (base, progress-thread, thread pool per chunk) route through this one decision point.
  • The arrays live on the request handle (sglBuffers) because UCX keeps pointers into them until the put completes.
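The gather step and the lifetime rule in the list above might look roughly like this. It is a simplified sketch under stated assumptions: `Desc`, `SglBuffers`, `RequestHandle`, and `gatherRange` are illustrative types and names, and the memory-handle and rkey arrays from the PR are left out to keep the example free of UCX headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor: one local/remote pair of equal length.
struct Desc {
    void *local_addr;
    std::uint64_t remote_addr;
    std::size_t len;
};

// Structure-of-arrays storage: one array per SGL field.
struct SglBuffers {
    std::vector<void *> localAddrs;
    std::vector<std::uint64_t> remoteAddrs;
    std::vector<std::size_t> lengths;

    void resize(std::size_t n) {
        localAddrs.resize(n);
        remoteAddrs.resize(n);
        lengths.resize(n);
    }
};

// Hypothetical request handle. It owns the arrays, so they stay alive after
// the posting function returns and until the handle itself is released.
struct RequestHandle {
    SglBuffers sgl;
};

// Copies descs[start_idx, end_idx) into the handle-owned arrays and returns
// the number of elements gathered.
inline std::size_t gatherRange(const std::vector<Desc> &descs,
                               std::size_t start_idx,
                               std::size_t end_idx,
                               RequestHandle &handle) {
    assert(start_idx <= end_idx && end_idx <= descs.size());
    const std::size_t count = end_idx - start_idx;
    handle.sgl.resize(count);
    for (std::size_t i = start_idx; i < end_idx; ++i) {
        const std::size_t out = i - start_idx;
        handle.sgl.localAddrs[out] = descs[i].local_addr;
        handle.sgl.remoteAddrs[out] = descs[i].remote_addr;
        handle.sgl.lengths[out] = descs[i].len;
    }
    return count;
}
```

Keeping the vectors on the handle rather than on the stack is the point of the design: pointers taken from `data()` remain valid for as long as the handle lives and the vectors are not resized.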

@michal-shalev michal-shalev self-assigned this Jun 25, 2026
@michal-shalev michal-shalev requested review from a team, brminich, gleon99 and yosefe as code owners June 25, 2026 00:28
@copy-pr-bot

copy-pr-bot Bot commented Jun 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@michal-shalev michal-shalev marked this pull request as draft June 25, 2026 00:28
@github-actions

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution and then trigger the CI to test your changes.

🚀

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

📝 Walkthrough

The PR adds optional UCX SGL support detection and declarations, runtime enablement and posting helpers, and a new write-transfer path that can build and submit SGL-based UCX operations.

Changes

UCX SGL support

| Layer | File(s) | Summary |
|---|---|---|
| Feature gate and declarations | meson.build, src/plugins/ucx/ucx_utils.h, src/plugins/ucx/ucx_backend.h | HAVE_UCX_SGL_API is defined when UCX exposes ucp_dt_local_sgl_t, and the UCX endpoint and backend headers declare the SGL-specific API and runtime flag behind that macro. |
| SGL request state and posting | src/plugins/ucx/ucx_backend.cpp, src/plugins/ucx/ucx_utils.cpp | The backend request handle adds SGL buffer storage, the engine reads NIXL_UCX_SGL_ENABLE, and the endpoint posts UCX puts with local and remote SGL descriptors. |
| Write-path routing and request budgeting | src/plugins/ucx/ucx_backend.cpp | The engine adds sendXferSgl(...), builds per-segment SGL descriptors, records the UCX and flush requests, and routes NIXL_WRITE transfers through the SGL path when enabled. |

Sequence Diagram(s)

sequenceDiagram
  participant nixlUcxEngine
  participant nixlUcxEp
  participant UCX

  nixlUcxEngine->>nixlUcxEp: postSgl(local SGL, remote SGL, count, req)
  nixlUcxEp->>UCX: ucp_put_nbx(...)
  alt UCX returns a request pointer
    UCX-->>nixlUcxEp: in-progress request
    nixlUcxEp-->>nixlUcxEngine: NIXL_IN_PROG + req
  else UCX completes inline
    UCX-->>nixlUcxEp: completion status
    nixlUcxEp-->>nixlUcxEngine: status + req = nullptr
  end
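The two branches in the diagram correspond to the way UCX non-blocking calls report their result through a single status pointer: a null pointer means the operation completed inline, an error code is encoded in the pointer value, and anything else is a request to progress. The sketch below models that mapping without UCX headers. `Status`, `PostResult`, `interpretStatusPtr`, and the `kErrLast` bound are stand-ins chosen for illustration; real code would use the UCX macros such as `UCS_PTR_IS_ERR` and `UCS_PTR_STATUS` instead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status values mirroring the outcomes shown in the diagram.
enum class Status { Success, InProgress, Error };

struct PostResult {
    Status status;
    void *req; // non-null only while the operation is still in flight
};

// Assumed lower bound of the error range: small negative integers cast to a
// pointer are treated as error codes, as in UCX-style status pointers.
constexpr std::intptr_t kErrLast = -100;

// Maps a status pointer to (status, request), following the diagram:
// a request pointer yields InProgress + req, everything else yields req = nullptr.
inline PostResult interpretStatusPtr(void *status_ptr) {
    if (status_ptr == nullptr) {
        return {Status::Success, nullptr}; // completed inline
    }
    const auto raw = reinterpret_cast<std::uintptr_t>(status_ptr);
    if (raw >= static_cast<std::uintptr_t>(kErrLast)) {
        return {Status::Error, nullptr}; // error code encoded in the pointer
    }
    return {Status::InProgress, status_ptr}; // caller must progress and release it
}
```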

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through SGL lanes with a twitch of my nose,
One put, one flush, and the bunny path grows.
Meson said “yes” where the UCX leaves glow,
Now little packets leap neatly in rows.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: adding a UCX SGL put path. |
| Description check | ✅ Passed | The description follows the template with What, Why, and How sections and covers the main implementation details. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/plugins/ucx/ucx_backend.cpp`:
- Around line 1323-1325: The SGL fast path in ucx_backend.cpp currently skips
the per-endpoint batching logic in the send path. Update the NIXL_WRITE branch
in the transfer routine (the block that calls sendXferSgl) so it only takes this
route when the full remote range remote[start_idx:end_idx] resolves to a single
connection, or move the batching/connection-homogeneity check into sendXferSgl
itself. Use the existing transfer flow and metadata helpers in this backend to
verify the range before returning the SGL path.
- Around line 1249-1293: The sendXferSgl() path is using a single
getConnection(remote_agent) for the whole range, which can post remote
addrs/rkeys through the wrong UCX endpoint when descriptors span multiple
connections. Update ucx_backend.cpp so the SGL send is scoped to one connection
by grouping descriptors by rmd->conn (like the non-SGL batching logic) and
posting each group on its own conn->getEp(worker_id), or detect mixed
connections and fall back to the non-SGL path. Keep the existing
local_sgl/remote_sgl setup but build it per connection instead of once for the
whole range.

In `@src/plugins/ucx/ucx_utils.h`:
- Around line 112-117: The public endpoint API `postSgl()` in `ucx_utils.h`
needs a Doxygen block documenting the SGL lifetime contract. Add a `/** ... */`
comment above `postSgl()` that clearly states the caller-owned
`ucp_dt_local_sgl_t` and `ucp_dt_remote_sgl_t` arrays must remain valid until
the associated request completes, so users know the stored UCX pointers must
outlive completion. Use the function name `postSgl()` and the `nixlUcxReq`
request type in the comment to make the ownership and completion requirement
explicit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8909565d-7350-4d42-8b9f-c50ca03ce3eb

📥 Commits

Reviewing files that changed from the base of the PR and between a9f456b and 61383da.

📒 Files selected for processing (5)
  • meson.build
  • src/plugins/ucx/ucx_backend.cpp
  • src/plugins/ucx/ucx_backend.h
  • src/plugins/ucx/ucx_utils.cpp
  • src/plugins/ucx/ucx_utils.h

Comment on lines +1249 to +1293
    const ucx_connection_ptr_t conn = getConnection(remote_agent);
    if (!conn) {
        NIXL_ERROR << "No connection found for remote agent: " << remote_agent;
        return NIXL_ERR_NOT_FOUND;
    }

    const size_t count = end_idx - start_idx;
    auto &sgl = int_handle->sgl;
    sgl.resize(count);
    for (size_t i = start_idx; i < end_idx; ++i) {
        const size_t out = i - start_idx;
        const auto lmd = static_cast<nixlUcxPrivateMetadata *>(local[i].metadataP);
        const auto rmd = static_cast<nixlUcxPublicMetadata *>(remote[i].metadataP);
        NIXL_ASSERT(local[i].len == remote[i].len);

        sgl.localAddrs[out] = reinterpret_cast<void *>(local[i].addr);
        sgl.remoteAddrs[out] = static_cast<uint64_t>(remote[i].addr);
        sgl.lengths[out] = local[i].len;
        sgl.memhs[out] = lmd->getMem().getMemh();
        sgl.rkeys[out] = rmd->getRkey(worker_id).get();
    }

    const ucp_dt_local_sgl_t local_sgl = {
        .field_mask = UCP_DT_LOCAL_SGL_FIELD_BUFFERS |
                      UCP_DT_LOCAL_SGL_FIELD_LENGTHS |
                      UCP_DT_LOCAL_SGL_FIELD_MEMHS,
        .buffers = sgl.localAddrs.data(),
        .lengths = sgl.lengths.data(),
        .memhs = sgl.memhs.data(),
    };
    const ucp_dt_remote_sgl_t remote_sgl = {
        .field_mask = UCP_DT_REMOTE_SGL_FIELD_REMOTE_ADDRS |
                      UCP_DT_REMOTE_SGL_FIELD_LENGTHS |
                      UCP_DT_REMOTE_SGL_FIELD_RKEYS,
        .remote_addrs = sgl.remoteAddrs.data(),
        .lengths = sgl.lengths.data(),
        .rkeys = sgl.rkeys.data(),
    };

    auto &ep = conn->getEp(worker_id);

    int_handle->reserve(single_ep_request_count);

    nixlUcxReq req;
    const nixl_status_t post_ret = ep->postSgl(local_sgl, remote_sgl, count, req);

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Keep SGL posts scoped to a single UCX connection.

sendXferSgl() posts the entire range through getConnection(remote_agent), but the non-SGL path batches by each descriptor’s rmd->conn and flushes all distinct connections. If a range spans multiple connections, this sends remote addresses/rkeys through the wrong endpoint. Split the SGL path per rmd->conn or fall back unless all descriptors share the same connection.
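One way to read the "split per rmd->conn" option is to bucket descriptor indices by connection first and then build one SGL per bucket. The following is a sketch of that grouping step only, assuming simplified stand-in types: `Connection`, `RemoteDesc`, `ConnGroups`, and `groupByConnection` are hypothetical and do not appear in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-ins for the backend's connection and remote metadata.
struct Connection {
    int id;
};

struct RemoteDesc {
    const Connection *conn; // plays the role of rmd->conn
    std::uint64_t addr;
};

using ConnGroups = std::map<const Connection *, std::vector<std::size_t>>;

// Buckets the indices in [start_idx, end_idx) by the connection each
// descriptor targets, preserving the original order inside every bucket.
inline ConnGroups groupByConnection(const std::vector<RemoteDesc> &remote,
                                    std::size_t start_idx,
                                    std::size_t end_idx) {
    assert(start_idx <= end_idx && end_idx <= remote.size());
    ConnGroups groups;
    for (std::size_t i = start_idx; i < end_idx; ++i) {
        groups[remote[i].conn].push_back(i);
    }
    return groups;
}
```

Each bucket could then be gathered into its own set of SGL arrays and posted on that connection's endpoint, at the cost of one put and one set of buffers per distinct connection rather than one per range.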

🧰 Tools
🪛 GitHub Actions: Clang Format Check / 0_clang-format.txt

[error] 1269-1277: clang-format-diff-19 reported formatting changes required (field_mask line wrapping). Run clang-format-diff-19/clang-format to apply formatting.


[error] 1278-1286: clang-format-diff-19 reported formatting changes required (field_mask line wrapping). Run clang-format-diff-19/clang-format to apply formatting.

🪛 GitHub Actions: Clang Format Check / clang-format

[error] 1269-1276: clang-format-diff-19 reported formatting differences (clang format check failed). Run clang-format on this file to match project style.

Comment on lines +1323 to +1325
#ifdef HAVE_UCX_SGL_API
    if (sglEnabled_ && operation == NIXL_WRITE) {
        return sendXferSgl(local, remote, remote_agent, handle, start_idx, end_idx);

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Only route to SGL after confirming the range is connection-homogeneous.

This branch bypasses the existing per-endpoint batching logic. Gate it on all remote[start_idx:end_idx] metadata resolving to one connection, or make sendXferSgl() perform the same batching internally.
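The gating option suggested here amounts to a cheap pre-check before the SGL branch is taken. A minimal sketch, assuming a simplified metadata type: `RemoteMeta` and `isConnectionHomogeneous` are hypothetical names, and treating an empty range as homogeneous is a choice made for this example rather than something the review specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view of the remote metadata: only the connection matters here.
struct RemoteMeta {
    const void *conn;
};

// Returns true when every descriptor in [start_idx, end_idx) targets the same
// connection. An empty range is reported as homogeneous.
inline bool isConnectionHomogeneous(const std::vector<RemoteMeta> &remote,
                                    std::size_t start_idx,
                                    std::size_t end_idx) {
    assert(start_idx <= end_idx && end_idx <= remote.size());
    if (start_idx == end_idx) {
        return true;
    }
    const void *first = remote[start_idx].conn;
    const auto begin = remote.begin() + static_cast<std::ptrdiff_t>(start_idx);
    const auto end = remote.begin() + static_cast<std::ptrdiff_t>(end_idx);
    return std::all_of(begin, end, [first](const RemoteMeta &m) {
        return m.conn == first;
    });
}
```

A caller would take the SGL path only when this returns true and otherwise fall through to the existing per-descriptor batching, which keeps the fast path simple while leaving mixed ranges on the already-tested code.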

Comment on lines +112 to +117
#ifdef HAVE_UCX_SGL_API
    [[nodiscard]] nixl_status_t
    postSgl(const ucp_dt_local_sgl_t &local,
            const ucp_dt_remote_sgl_t &remote,
            size_t count,
            nixlUcxReq &req);

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Document the SGL buffer lifetime contract on the public endpoint API.

postSgl() stores UCX pointers into caller-owned SGL arrays, so callers must know those arrays must outlive request completion. Add a Doxygen block for the new public API. As per path instructions, “Use Doxygen block comments (/** ... */) for public APIs.”

Suggested documentation
 #ifdef HAVE_UCX_SGL_API
+    /**
+     * Post a UCX put using local and remote scatter-gather descriptors.
+     *
+     * The arrays referenced by `local` and `remote` must remain valid until
+     * the returned UCX request completes.
+     */
     [[nodiscard]] nixl_status_t
     postSgl(const ucp_dt_local_sgl_t &local,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-#ifdef HAVE_UCX_SGL_API
-    [[nodiscard]] nixl_status_t
-    postSgl(const ucp_dt_local_sgl_t &local,
-            const ucp_dt_remote_sgl_t &remote,
-            size_t count,
-            nixlUcxReq &req);
+#ifdef HAVE_UCX_SGL_API
+    /**
+     * Post a UCX put using local and remote scatter-gather descriptors.
+     *
+     * The arrays referenced by `local` and `remote` must remain valid until
+     * the returned UCX request completes.
+     */
+    [[nodiscard]] nixl_status_t
+    postSgl(const ucp_dt_local_sgl_t &local,
+            const ucp_dt_remote_sgl_t &remote,
+            size_t count,
+            nixlUcxReq &req);
+#endif
Source: Path instructions

    const auto engine_config =
        nixl::getBackendParamDefaulted(custom_params, "engine_config", std::string());

    sglEnabled_ = nixl::config::getValueOptional<bool>("NIXL_UCX_SGL_ENABLE").value_or(false);

Contributor

...::getValueDefaulted("NIXL_UCX_SGL_ENABLE", false)
