utils: add shared S3-over-RDMA transport and protocol helpers#1840
utils: add shared S3-over-RDMA transport and protocol helpers#1840harshavardhana wants to merge 1 commit into
Conversation
|
👋 Hi harshavardhana! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
📝 WalkthroughWalkthroughAdds S3-over-RDMA protocol helpers, public declarations, implementation wiring, signed PUT/GET handling with token retries, build updates, and unit tests for wire-format behavior. ChangesS3 RDMA object backend
Sequence Diagram(s)sequenceDiagram
participant rdmaPutWithRetry
participant SharedCuObjClient
participant S3RdmaControlPlane
participant S3
rdmaPutWithRetry->>SharedCuObjClient: getToken()
SharedCuObjClient-->>rdmaPutWithRetry: token or null
alt token available
rdmaPutWithRetry->>S3RdmaControlPlane: rdmaPut(ctx, token, buf_addr, size)
S3RdmaControlPlane->>S3: signed HTTP PUT with x-amz-rdma-token
S3-->>S3RdmaControlPlane: HTTP 200 / x-amz-rdma-reply
S3RdmaControlPlane-->>rdmaPutWithRetry: status, ETag, checksum
rdmaPutWithRetry->>SharedCuObjClient: putToken(token)
else token unavailable
Note over rdmaPutWithRetry: retry up to rdma_max_attempts
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/utils/object/rdma_protocol.h`:
- Around line 90-105: The empty-reply handling in parseRdmaReply is too broad:
it currently treats a missing x-amz-rdma-reply the same as an explicit RDMA
decline, which causes rdmaPutWithRetry in rdma.cpp to skip transient retries on
non-200 PUT failures. Update parseRdmaReply in rdma_protocol.h so the
missing-header behavior is caller-controlled (or otherwise split GET vs PUT
semantics), and make rdmaGet use the decline-on-empty path while rdmaPut treats
an absent reply as malformed/error instead of rdma_not_supported.
- Around line 66-75: The RDMA token formatting in formatRdmaToken is truncating
opaque descriptors by writing into a fixed 512-byte stack buffer and ignoring
the snprintf result. Update formatRdmaToken in rdma_protocol.h to build the
token with a dynamically sized string or otherwise ensure the full descriptor is
preserved before it is used as x-amz-rdma-token, and handle the formatted length
instead of discarding snprintf’s return value.
In `@src/utils/object/rdma.cpp`:
- Line 309: The control plane validity check is too permissive because `valid_`
in `rdma.cpp` only requires `impl_->http` and a non-empty `access_key`. Update
the validity logic in the control-plane setup path so it also requires a
non-empty `secret_key`, matching what `signV4()` needs for request signing. Keep
the change localized around the `valid_` assignment and the related credential
fields on `impl_`.
- Around line 507-513: The retry loop in rdmaGet() is minting a cuObject token
before validating the request size, so zero-length GETs still reach
rdma.getToken() even though S3RdmaControlPlane::rdmaGet() rejects them; add an
early size == 0 check at the start of rdmaGet() (before the attempt loop) and
return the same error path immediately so no token minting or cp.rdmaGet() call
happens for empty requests.
- Around line 454-458: Reject out-of-range x-amz-rdma-bytes-transferred values
in the RDMA response parsing path so callers never accept more bytes than were
requested. Update the logic around the response-header handling in the rdma
transfer code to validate the parsed header against the requested transfer size
before returning it, and treat any negative or oversized value as rdma_error.
Keep the existing std::stoll parsing and error handling, but add the bound check
in the same block where amz_rdma_bytes_transferred is read.
- Around line 416-420: The byte-range construction in the RDMA request header
can overflow when computing the end offset as offset + size - 1, which may
produce an invalid range. Update the logic in the range-setting block to compute
and validate the end value safely before calling SetHeaderValue, using the
relevant offset and size variables in this code path. If the range cannot be
represented without overflow, handle it explicitly rather than emitting a
malformed bytes= header.
In `@src/utils/object/rdma.h`:
- Around line 113-116: Add Doxygen-style documentation blocks for the remaining
public header APIs in rdma.h, specifically the valid() accessor and
rdmaGetWithRetry() function, matching the surrounding public declarations. Place
the comments immediately above valid() and rdmaGetWithRetry() using the same /**
... */ style as the other documented symbols so the public API is consistently
documented.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 91c32fec-931c-4672-bb22-1dd6134ddee9
📒 Files selected for processing (6)
src/utils/object/meson.buildsrc/utils/object/rdma.cppsrc/utils/object/rdma.hsrc/utils/object/rdma_protocol.htest/gtest/unit/obj/meson.buildtest/gtest/unit/obj/rdma_protocol.cpp
| bool | ||
| valid() const { | ||
| return valid_; | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Document the remaining public API entries.
valid() and rdmaGetWithRetry() are public header APIs, but unlike the surrounding declarations they have no Doxygen comment block.
Proposed fix
bool
+ /** Return true when the control-plane client was initialized successfully. */
valid() const {
return valid_;
}
@@
+/**
+ * Mint a token, run rdmaGet, release the token, with one transient retry
+ * (covering token-mint and control-plane hiccups). The buffer must already be
+ * registered via SharedCuObjClient::registerBuffer().
+ *
+ * `@return` >0 bytes transferred (success), rdma_not_supported (server declined),
+ * or rdma_error (failure). The caller treats anything < 0 as an error.
+ */
ssize_t
rdmaGetWithRetry(SharedCuObjClient &rdma,
S3RdmaControlPlane &cp,
S3RdmaClientCtx &ctx,As per path instructions, "Comments: use Doxygen-style /** ... */ for documenting public APIs/classes/functions".
Also applies to: 161-167
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/object/rdma.h` around lines 113 - 116, Add Doxygen-style
documentation blocks for the remaining public header APIs in rdma.h,
specifically the valid() accessor and rdmaGetWithRetry() function, matching the
surrounding public declarations. Place the comments immediately above valid()
and rdmaGetWithRetry() using the same /** ... */ style as the other documented
symbols so the public API is consistently documented.
Source: Path instructions
765e429 to
58aee55
Compare
Extract the reusable GPU-direct S3-over-RDMA infrastructure into
src/utils/object/ so the object plugin's accel engines (and future vendor
engines) can compose it, rather than embedding it in the plugin. No plugin
wiring yet — that lands in a follow-up PR.
- rdma_protocol.h: dependency-free wire helpers (token formatting, reply
parsing, constants), covered by a unit test.
- rdma.{h,cpp}: SharedCuObjClient (process-wide cuObject RDMA plumbing) and
S3RdmaControlPlane (SigV4 UNSIGNED-PAYLOAD control plane carrying
x-amz-rdma-token and reading x-amz-rdma-reply / x-amz-rdma-bytes-transferred),
plus the retry wrappers. Guarded by HAVE_CUOBJ_CLIENT; compiles to nothing
without cuObjClient.
- Built into the obj_utils library; protocol unit test in test/gtest/unit/obj.
The header names and token layout come from NVIDIA's published cuObject docs and
the protocol proposed to AWS (aws-c-s3 RDMA_PROTOCOL_SPEC.md).
Review fixes:
- formatRdmaToken uses PRIx64 (portable).
- rdmaGet rejects size==0 up front.
- Credentials are resolved once at construction (documented limitation:
rotating IAM session tokens require a fresh backend).
Signed-off-by: Harshavardhana <harsha@minio.io>
58aee55 to
48be98d
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/utils/object/meson.build (1)
33-49: 🗄️ Data Integrity & Integration | 🟠 MajorPropagate
cuobj_depthroughobj_utils_interface.rdma.his now part of the public object-utils API and includes<cuobjclient.h>behindHAVE_CUOBJ_CLIENT;link_withdoes not carry that dependency transitively, so consumers ofobj_utils_interfacestill needcuobj_depindependencies:when it is available.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/utils/object/meson.build` around lines 33 - 49, Propagate the optional cuobj dependency through obj_utils_interface so consumers of the public object-utils API can compile rdma.h when HAVE_CUOBJ_CLIENT is enabled. Update the declare_dependency block for obj_utils_interface in meson.build to include cuobj_dep in dependencies alongside the existing partial_aws_* deps, while keeping it conditional via the same cuobj_dep.found() check used for obj_utils_lib.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/utils/object/rdma_protocol.h`:
- Around line 17-23: The RDMA protocol header relies on ssize_t for its
return-code sentinels, but that type is not declared by the current includes.
Update rdma_protocol.h so ssize_t is properly available by adding the
appropriate POSIX type declaration include, or alternatively replace the
sentinels with a fixed-width signed type defined within this header. Keep the
change localized near the existing rdma_protocol definitions so the return-code
constants remain consistent.
In `@src/utils/object/rdma.cpp`:
- Around line 454-468: Move the ctx.etag assignment in rdmaGet so it only
happens after the response has been fully validated and the transfer size has
been accepted, because ctx.etag is currently set before checking
x-amz-rdma-bytes-transferred. Keep the validation and return-path logic in
rdmaGet consistent with the rdma.h contract that etag is populated only on
success, and ensure malformed or out-of-range transferred-byte headers return
rdma_error without leaving ctx.etag populated. Use the existing rdmaGet and
ctx.etag symbols to place the fix in the GET success path.
In `@src/utils/object/rdma.h`:
- Around line 90-97: The new public S3RdmaClientCtx API currently exposes
snake_case member names, which conflicts with the repo’s lower-camel-case member
convention. Rename the fields upload_id, part_number, and checksum_crc64nvme to
lower camel case in S3RdmaClientCtx, and update all uses of those members
accordingly so downstream code only sees the corrected public names.
---
Outside diff comments:
In `@src/utils/object/meson.build`:
- Around line 33-49: Propagate the optional cuobj dependency through
obj_utils_interface so consumers of the public object-utils API can compile
rdma.h when HAVE_CUOBJ_CLIENT is enabled. Update the declare_dependency block
for obj_utils_interface in meson.build to include cuobj_dep in dependencies
alongside the existing partial_aws_* deps, while keeping it conditional via the
same cuobj_dep.found() check used for obj_utils_lib.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 91d0eeaa-dd6f-4bde-9076-474a62cc28b2
📒 Files selected for processing (6)
src/utils/object/meson.buildsrc/utils/object/rdma.cppsrc/utils/object/rdma.hsrc/utils/object/rdma_protocol.htest/gtest/unit/obj/meson.buildtest/gtest/unit/obj/rdma_protocol.cpp
| #include <charconv> | ||
| #include <cinttypes> | ||
| #include <cstdint> | ||
| #include <cstdio> | ||
| #include <string> | ||
| #include <system_error> | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
set -euo pipefail
# Inspect the target header with line numbers
sed -n '1,140p' src/utils/object/rdma_protocol.h | cat -n
# Check whether ssize_t is declared anywhere this header includes, and where it is used
grep -RIn --line-number --fixed-strings "ssize_t" src/utils/object/rdma_protocol.h src | head -n 50
# Map nearby declarations/usages in the same file
python3 - <<'PY'
from pathlib import Path
p = Path('src/utils/object/rdma_protocol.h')
for i, line in enumerate(p.read_text().splitlines(), 1):
if 'ssize_t' in line or 'rdma_not_supported' in line or 'rdma_error' in line or '`#include`' in line:
print(f"{i}: {line}")
PYRepository: ai-dynamo/nixl
Length of output: 9433
Include <sys/types.h> for ssize_t
rdma_protocol.h uses ssize_t for the return-code sentinels, but none of the current includes declare it. Add the POSIX type declaration here or switch these sentinels to a fixed-width signed type owned by this header.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/object/rdma_protocol.h` around lines 17 - 23, The RDMA protocol
header relies on ssize_t for its return-code sentinels, but that type is not
declared by the current includes. Update rdma_protocol.h so ssize_t is properly
available by adding the appropriate POSIX type declaration include, or
alternatively replace the sentinels with a fixed-width signed type defined
within this header. Keep the change localized near the existing rdma_protocol
definitions so the return-code constants remain consistent.
| if (resp->HasHeader("etag")) { | ||
| ctx.etag = stripQuotes(resp->GetHeader("etag").c_str()); | ||
| } | ||
|
|
||
| // Trust the server's reported transferred byte count (can be < requested | ||
| // for ranged/partial GETs). | ||
| if (resp->HasHeader(amz_rdma_bytes_transferred)) { | ||
| try { | ||
| const long long n = std::stoll(resp->GetHeader(amz_rdma_bytes_transferred).c_str()); | ||
| if (n < 0 || static_cast<uint64_t>(n) > size) { | ||
| NIXL_ERROR << "rdmaGet: invalid x-amz-rdma-bytes-transferred=" << n | ||
| << " (requested " << size << ") for key=" << ctx.object; | ||
| return rdma_error; | ||
| } | ||
| return static_cast<ssize_t>(n); |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Only publish ctx.etag after the GET is fully accepted.
Line 96 in src/utils/object/rdma.h documents etag as populated on success, but this code writes it before validating x-amz-rdma-bytes-transferred. If that header is malformed or out of range, rdmaGet() returns rdma_error with a success-looking ETag left behind.
Proposed fix
- if (resp->HasHeader("etag")) {
- ctx.etag = stripQuotes(resp->GetHeader("etag").c_str());
- }
+ const std::string etag =
+ resp->HasHeader("etag") ? stripQuotes(resp->GetHeader("etag").c_str()) : "";
@@
- return static_cast<ssize_t>(n);
+ ctx.etag = etag;
+ return static_cast<ssize_t>(n);
}
catch (const std::exception &) {
return rdma_error;
}
}
+ ctx.etag = etag;
return static_cast<ssize_t>(size);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (resp->HasHeader("etag")) { | |
| ctx.etag = stripQuotes(resp->GetHeader("etag").c_str()); | |
| } | |
| // Trust the server's reported transferred byte count (can be < requested | |
| // for ranged/partial GETs). | |
| if (resp->HasHeader(amz_rdma_bytes_transferred)) { | |
| try { | |
| const long long n = std::stoll(resp->GetHeader(amz_rdma_bytes_transferred).c_str()); | |
| if (n < 0 || static_cast<uint64_t>(n) > size) { | |
| NIXL_ERROR << "rdmaGet: invalid x-amz-rdma-bytes-transferred=" << n | |
| << " (requested " << size << ") for key=" << ctx.object; | |
| return rdma_error; | |
| } | |
| return static_cast<ssize_t>(n); | |
| const std::string etag = | |
| resp->HasHeader("etag") ? stripQuotes(resp->GetHeader("etag").c_str()) : ""; | |
| // Trust the server's reported transferred byte count (can be < requested | |
| // for ranged/partial GETs). | |
| if (resp->HasHeader(amz_rdma_bytes_transferred)) { | |
| try { | |
| const long long n = std::stoll(resp->GetHeader(amz_rdma_bytes_transferred).c_str()); | |
| if (n < 0 || static_cast<uint64_t>(n) > size) { | |
| NIXL_ERROR << "rdmaGet: invalid x-amz-rdma-bytes-transferred=" << n | |
| << " (requested " << size << ") for key=" << ctx.object; | |
| return rdma_error; | |
| } | |
| ctx.etag = etag; | |
| return static_cast<ssize_t>(n); | |
| } | |
| catch (const std::exception &) { | |
| return rdma_error; | |
| } | |
| } | |
| ctx.etag = etag; | |
| return static_cast<ssize_t>(size); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/object/rdma.cpp` around lines 454 - 468, Move the ctx.etag
assignment in rdmaGet so it only happens after the response has been fully
validated and the transfer size has been accepted, because ctx.etag is currently
set before checking x-amz-rdma-bytes-transferred. Keep the validation and
return-path logic in rdmaGet consistent with the rdma.h contract that etag is
populated only on success, and ensure malformed or out-of-range transferred-byte
headers return rdma_error without leaving ctx.etag populated. Use the existing
rdmaGet and ctx.etag symbols to place the fix in the GET success path.
| struct S3RdmaClientCtx { | ||
| std::string bucket; | ||
| std::string object; | ||
| std::string upload_id; // empty for single-shot (non-multipart) | ||
| uint32_t part_number = 0; // 1..=10000 when upload_id is set | ||
| std::string checksum_crc64nvme; // optional, in/out | ||
| std::string etag; // populated on success | ||
| }; |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Rename the new S3RdmaClientCtx fields before this API escapes.
upload_id, part_number, and checksum_crc64nvme expose snake_case on a brand-new public struct, so every downstream engine will inherit names that already violate the repo rule for members. This is still cheap to align now.
Proposed fix
struct S3RdmaClientCtx {
std::string bucket;
std::string object;
- std::string upload_id; // empty for single-shot (non-multipart)
- uint32_t part_number = 0; // 1..=10000 when upload_id is set
- std::string checksum_crc64nvme; // optional, in/out
+ std::string uploadId; // empty for single-shot (non-multipart)
+ uint32_t partNumber = 0; // 1..=10000 when uploadId is set
+ std::string checksumCrc64nvme; // optional, in/out
std::string etag; // populated on success
};As per path instructions, "Use lower camel case for functions/members; snake case for variables/namespace names/constants".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| struct S3RdmaClientCtx { | |
| std::string bucket; | |
| std::string object; | |
| std::string upload_id; // empty for single-shot (non-multipart) | |
| uint32_t part_number = 0; // 1..=10000 when upload_id is set | |
| std::string checksum_crc64nvme; // optional, in/out | |
| std::string etag; // populated on success | |
| }; | |
| struct S3RdmaClientCtx { | |
| std::string bucket; | |
| std::string object; | |
| std::string uploadId; // empty for single-shot (non-multipart) | |
| uint32_t partNumber = 0; // 1..=10000 when uploadId is set | |
| std::string checksumCrc64nvme; // optional, in/out | |
| std::string etag; // populated on success | |
| }; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/object/rdma.h` around lines 90 - 97, The new public S3RdmaClientCtx
API currently exposes snake_case member names, which conflicts with the repo’s
lower-camel-case member convention. Rename the fields upload_id, part_number,
and checksum_crc64nvme to lower camel case in S3RdmaClientCtx, and update all
uses of those members accordingly so downstream code only sees the corrected
public names.
Source: Path instructions
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/utils/object/meson.build`:
- Around line 32-42: The obj_utils_interface target is missing public
dependencies needed by the new rdma.h header, so downstream consumers can
compile-fail even though obj_utils_lib links correctly. Update
obj_utils_interface to forward the same public dependencies as obj_utils for
headers that include object/rdma.h, specifically nixl_common_dep and
conditionally cuobj_dep alongside the existing AWS deps, so any consumer of the
interface inherits the required include paths and symbols.
In `@src/utils/object/rdma.cpp`:
- Around line 276-280: The endpoint override handling in the RDMA setup only
switches the scheme for HTTP, so HTTPS overrides can still leave impl_->scheme
on http. Update the endpoint parsing block in rdma.cpp to explicitly set
impl_->scheme based on the parsed Aws::Http::URI scheme, using the existing
config.scheme as the fallback when the override is not HTTP or HTTPS, and keep
the host/port assignment tied to the same endpointOverride path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 79b69abf-81c1-4381-9f51-b108f7e8345e
📒 Files selected for processing (6)
src/utils/object/meson.buildsrc/utils/object/rdma.cppsrc/utils/object/rdma.hsrc/utils/object/rdma_protocol.htest/gtest/unit/obj/meson.buildtest/gtest/unit/obj/rdma_protocol.cpp
| obj_utils_sources = ['s3/utils.cpp', 's3/utils.h', 'rdma_protocol.h', 'rdma.cpp', 'rdma.h'] | ||
| obj_utils_deps = [nixl_common_dep, partial_aws_s3, partial_aws_s3_crt, partial_aws_core] | ||
| if cuobj_dep.found() | ||
| obj_utils_deps += [cuobj_dep] | ||
| endif | ||
|
|
||
| # Generic object utilities library containing S3 utilities and other object storage utilities | ||
| obj_utils_lib = library('obj_utils', | ||
| 's3/utils.cpp', 's3/utils.h', | ||
| obj_utils_sources, | ||
| include_directories: [nixl_inc_dirs, utils_inc_dirs], | ||
| dependencies: [nixl_common_dep, partial_aws_s3, partial_aws_s3_crt, partial_aws_core], | ||
| dependencies: obj_utils_deps, |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Propagate the new public header dependencies through obj_utils_interface.
rdma.h is now part of obj_utils’s public surface, but the interface target still exports only the AWS deps. Consumers that include object/rdma.h will also need nixl_common_dep for nixl_types.h and, when enabled, cuobj_dep for <cuobjclient.h>. obj_utils_lib links today, but downstream builds can still fail at compile time because obj_utils_interface does not forward those dependencies.
Proposed fix
obj_utils_interface = declare_dependency(
link_with: obj_utils_lib,
include_directories: [include_directories('.'), include_directories('s3')],
- dependencies: [partial_aws_s3, partial_aws_s3_crt, partial_aws_core]
+ dependencies: obj_utils_deps
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/object/meson.build` around lines 32 - 42, The obj_utils_interface
target is missing public dependencies needed by the new rdma.h header, so
downstream consumers can compile-fail even though obj_utils_lib links correctly.
Update obj_utils_interface to forward the same public dependencies as obj_utils
for headers that include object/rdma.h, specifically nixl_common_dep and
conditionally cuobj_dep alongside the existing AWS deps, so any consumer of the
interface inherits the required include paths and symbols.
| if (!config.endpointOverride.empty()) { | ||
| Aws::Http::URI ep(config.endpointOverride); | ||
| impl_->scheme = (ep.GetScheme() == Aws::Http::Scheme::HTTP) ? "http" : impl_->scheme; | ||
| impl_->host = ep.GetAuthority(); | ||
| impl_->port = ep.GetPort(); |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Honor https:// endpoint overrides explicitly.
Line 278 only updates impl_->scheme when the parsed override is HTTP. If config.scheme is HTTP and endpointOverride is an HTTPS URI, this leaves the control plane on "http" and builds/signs requests for the wrong transport.
Proposed fix
if (!config.endpointOverride.empty()) {
Aws::Http::URI ep(config.endpointOverride);
- impl_->scheme = (ep.GetScheme() == Aws::Http::Scheme::HTTP) ? "http" : impl_->scheme;
+ if (ep.GetScheme() == Aws::Http::Scheme::HTTP) {
+ impl_->scheme = "http";
+ } else if (ep.GetScheme() == Aws::Http::Scheme::HTTPS) {
+ impl_->scheme = "https";
+ }
impl_->host = ep.GetAuthority();
impl_->port = ep.GetPort();
} else {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (!config.endpointOverride.empty()) { | |
| Aws::Http::URI ep(config.endpointOverride); | |
| impl_->scheme = (ep.GetScheme() == Aws::Http::Scheme::HTTP) ? "http" : impl_->scheme; | |
| impl_->host = ep.GetAuthority(); | |
| impl_->port = ep.GetPort(); | |
| if (!config.endpointOverride.empty()) { | |
| Aws::Http::URI ep(config.endpointOverride); | |
| if (ep.GetScheme() == Aws::Http::Scheme::HTTP) { | |
| impl_->scheme = "http"; | |
| } else if (ep.GetScheme() == Aws::Http::Scheme::HTTPS) { | |
| impl_->scheme = "https"; | |
| } | |
| impl_->host = ep.GetAuthority(); | |
| impl_->port = ep.GetPort(); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/object/rdma.cpp` around lines 276 - 280, The endpoint override
handling in the RDMA setup only switches the scheme for HTTP, so HTTPS overrides
can still leave impl_->scheme on http. Update the endpoint parsing block in
rdma.cpp to explicitly set impl_->scheme based on the parsed Aws::Http::URI
scheme, using the existing config.scheme as the fallback when the override is
not HTTP or HTTPS, and keep the host/port assignment tied to the same
endpointOverride path.
What
Adds the shared GPU-direct S3-over-RDMA infrastructure as reusable utilities under
src/utils/object/, so object engines compose it rather than each embedding its own copy. This keeps the transport/protocol layer in one place and easy to maintain as more engines adopt it. Additive only — no existing code paths change.Contents
rdma_protocol.h— dependency-free wire helpers (token formatting, reply parsing, constants), unit-tested.rdma.{h,cpp}—SharedCuObjClient(process-wide cuObject RDMA plumbing) andS3RdmaControlPlane(SigV4UNSIGNED-PAYLOADcontrol plane carryingx-amz-rdma-token, readingx-amz-rdma-reply/x-amz-rdma-bytes-transferred) plus retry wrappers. Guarded byHAVE_CUOBJ_CLIENT; compiles to nothing without cuObjClient.obj_utilslibrary;RdmaProtocolunit test undertest/gtest/unit/obj.The header names and token layout follow NVIDIA's published cuObject docs and the protocol proposed to AWS (aws-c-s3
RDMA_PROTOCOL_SPEC.md).Notes
formatRdmaTokenusesPRIx64;rdmaGetrejectssize==0; credentials are resolved once at construction (documented limitation: rotating IAM session tokens need a fresh backend).SharedCuObjClienthas no S3 dependency, so it can move undersrc/utils/cuobj/if preferred — happy to do that.Test
Builds into
obj_utils;RdmaProtocol.*unit tests pass (7/7).Types of changes
Summary by CodeRabbit
New Features
Bug Fixes
Tests