examples/ep: support configurable NVL group size#1820
Conversation
|
👋 Hi xtyao66! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
|
Fresh replacement for #1799 and #1819. This version keeps the PR branch clean: one core commit, API and runtime rank partitioning land together, and the old debug/VMM scope is removed. @ebarilanM @rakhmets @itayalroy could you take a look when you have a chance? |
79713a9 to
0711aab
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a runtime ChangesConfigurable nvl_group_size for EP High-Throughput Mode
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/device/ep/csrc/kernels/layout.cu`:
- Around line 73-83: Add a validation constraint to enforce that the fixed
8-lane layout (kNumRanksPerSM = 8) is properly divisible by the nvl_group_size
parameter before any RDMA indexing operations occur. Specifically, add an
assertion after the existing EP_DEVICE_ASSERT validation block to ensure that
kNumRanksPerSM is evenly divisible by nvl_group_size, so that the rank_idx
division operation on line 106 and the subsequent RDMA rank accumulation in the
loops around lines 147-155 operate on correctly aligned boundaries, preventing
incorrect RDMA group assignments when invalid divisors like 3 or 5 are passed
through.
In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu`:
- Around line 43-45: The `remote_mvh_ptr` function uses cached P2P pointers for
any rank without validating that the rank is in the same NVL group, which can
bypass security checks in `nixlPut`. Add a guard condition in the
`remote_mvh_ptr` function to check if the given global_rank is in the same NVL
group as the current context (by comparing `nvl_group_size` and rank
membership). If the rank is not in the same NVL group, return null to force the
request through RDMA descriptors instead of using the unsafe cached P2P pointer.
Ensure the null-MVH guard is preserved before calling `nixlGetPtr` to maintain
safety for valid same-group ranks.
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 404-413: After successfully loading the agent name via
loadRemoteMD in the else branch and storing it in
remote_agent_names[remote_rank], all subsequent operations targeting this remote
agent must use the stored remote_agent_names[remote_rank] instead of
std::to_string(remote_rank). Update any peer-info notification generation code
(such as in genNotif or _nixl_agents_peer_info_gather) to use the actual agent
name from remote_agent_names[remote_rank] when constructing notifications,
ensuring consistency between metadata loading and peer-info gathering
operations.
- Around line 1345-1360: The HT kernels use a single ht_barrier_counter_offset
for all remote RDMA memviews, but the current `>=` check allows peers to have
larger RDMA buffer sizes, causing barrier atomics to be written at the wrong
offset. Change the comparison operator from `>=` to `==` in the assertion that
checks `nixl_peer_info[r].rdma_buffer_bytes` to enforce that all remote peers
must have exactly matching RDMA buffer sizes with this rank, ensuring the
uniform barrier offset aligns correctly with each peer's barrier tail.
In `@examples/device/ep/nixl_ep/buffer.py`:
- Around line 89-91: The assertion used to validate the nvl_group_size parameter
can be disabled when Python is run with the -O optimization flag, allowing
invalid values to bypass validation and cause harder-to-debug failures later in
C++ code. Replace the assert statement (checking that nvl_group_size is between
0 and 8 and divides 8 evenly) with an explicit exception like ValueError or
RuntimeError that will always be raised regardless of Python optimization flags.
Ensure the exception includes the same descriptive message about the
nvl_group_size constraint.
In `@examples/device/ep/README.md`:
- Line 41: The documentation for the Buffer function initialization on line 41
is missing the valid range and divisibility constraints for the nvl_group_size
parameter. Update the API documentation line for the Buffer function to include
these constraints alongside the default value of 8, ensuring developers
understand what values are acceptable when initializing the NIXL communication
buffer with the nvl_group_size parameter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: e0f94db6-0704-4866-b6d3-0f5ee6a845aa
📒 Files selected for processing (11)
examples/device/ep/README.mdexamples/device/ep/csrc/config.hppexamples/device/ep/csrc/kernels/api.cuhexamples/device/ep/csrc/kernels/launch.cuhexamples/device/ep/csrc/kernels/layout.cuexamples/device/ep/csrc/kernels/nixl_ep_ht.cuexamples/device/ep/csrc/kernels/nixl_ep_ll.cuexamples/device/ep/csrc/kernels/utils.cuhexamples/device/ep/csrc/nixl_ep.cppexamples/device/ep/csrc/nixl_ep.hppexamples/device/ep/nixl_ep/buffer.py
0711aab to
ff631b3
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/device/ep/csrc/kernels/nixl_ep_ht.cu`:
- Line 500: Add local assertions before each `num_rdma_ranks` derivation in the
functions `notify_dispatch`, `dispatch`, `cached_notify`, and `combine` to
validate the runtime rank partition. For each location where `num_rdma_ranks` is
calculated from `num_ranks / nixl_ctx.nvl_group_size`, insert assertions that
verify: the group size is nonzero, the division is exact (num_ranks is evenly
divisible by nvl_group_size), the resulting RDMA rank count is positive, and the
context is consistent. This ensures incompatible rank layouts fail immediately
at the launch boundary rather than causing misrouting or crashes during later
operations like the division in the `combine` function.
In `@examples/device/ep/csrc/kernels/utils.cuh`:
- Around line 510-514: The barrier signal pointer dereferences in the
atomicAdd_system and atomicSub_system calls using indices rank and thread_id
occur before runtime bounds validation. Move the EP_DEVICE_ASSERT check for
num_ranks to occur before the atomic operations, and add an additional assertion
to ensure rank is within valid bounds (rank < num_ranks) before it is used as an
index into barrier_signal_ptrs. This ensures invalid runtime topology is caught
before any out-of-bounds system atomic operations are issued.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 36e36fcd-165c-40d3-9f6d-4b48a26c192b
📒 Files selected for processing (11)
examples/device/ep/README.mdexamples/device/ep/csrc/config.hppexamples/device/ep/csrc/kernels/api.cuhexamples/device/ep/csrc/kernels/launch.cuhexamples/device/ep/csrc/kernels/layout.cuexamples/device/ep/csrc/kernels/nixl_ep_ht.cuexamples/device/ep/csrc/kernels/nixl_ep_ll.cuexamples/device/ep/csrc/kernels/utils.cuhexamples/device/ep/csrc/nixl_ep.cppexamples/device/ep/csrc/nixl_ep.hppexamples/device/ep/nixl_ep/buffer.py
ff631b3 to
f5748fd
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
examples/device/ep/csrc/kernels/nixl_ep_ll.cu (1)
43-45: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winPreserve the null-MVH fallback before calling
nixlGetPtr.The cross-group guard is now in place, but same-group P2P cache initialization still calls
remote_mvh_ptr()unconditionally. Ifremote_mvh_for_rank()returnsnullptr, Line 44 passes it intonixlGetPtrinstead of cachingnullptrand falling back to RDMA.Proposed fix
__device__ __forceinline__ void* remote_mvh_ptr(const gpu_nixl_ctx& ctx, int global_rank) { - return nixlGetPtr(remote_mvh_for_rank(ctx, global_rank), remote_mvh_index(ctx, global_rank)); + auto mvh = remote_mvh_for_rank(ctx, global_rank); + return mvh == nullptr ? nullptr : nixlGetPtr(mvh, remote_mvh_index(ctx, global_rank)); }Also applies to: 1148-1148
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu` around lines 43 - 45, The `remote_mvh_ptr` function on line 44 passes the result of `remote_mvh_for_rank()` directly to `nixlGetPtr` without checking for null. Add a null-check on the result of `remote_mvh_for_rank(ctx, global_rank)` and return nullptr if it's null, rather than passing a null pointer to `nixlGetPtr`. This preserves the null-MVH fallback behavior allowing the code to fall back to RDMA when the remote MVH is unavailable. Apply the same fix to the similar function at line 1148.examples/device/ep/csrc/kernels/utils.cuh (1)
510-514: 🩺 Stability & Availability | 🔴 Critical | ⚡ Quick winMove runtime topology asserts before barrier pointer indexing.
Line 511 and Line 512 index
barrier_signal_ptrsbefore any runtime bounds validation. Ifrank/num_ranksare invalid, this can issue OOB system atomics before failure is detected.Suggested fix
barrier_block_runtime(int** barrier_signal_ptrs, int rank, int num_ranks, uint64_t timeout_cycles, int tag = 0) { auto thread_id = static_cast<int>(threadIdx.x); + EP_DEVICE_ASSERT(num_ranks > 0); + EP_DEVICE_ASSERT(num_ranks <= blockDim.x); + EP_DEVICE_ASSERT(rank >= 0 and rank < num_ranks); + if constexpr (not kSyncOnly) { memory_fence(); __syncthreads(); } if (thread_id < num_ranks) { atomicAdd_system(barrier_signal_ptrs[rank] + thread_id, FINISHED_SUM_TAG); atomicSub_system(barrier_signal_ptrs[thread_id] + rank, FINISHED_SUM_TAG); } - EP_DEVICE_ASSERT(num_ranks <= blockDim.x);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/device/ep/csrc/kernels/utils.cuh` around lines 510 - 514, The EP_DEVICE_ASSERT validation that checks num_ranks against blockDim.x is placed after the if statement that indexes into barrier_signal_ptrs using rank and thread_id. If these values are invalid, the atomicAdd_system and atomicSub_system operations will access out-of-bounds memory before the assertion detects the error. Move the EP_DEVICE_ASSERT(num_ranks <= blockDim.x) statement to immediately before the if block so that bounds validation occurs before any array indexing operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu`:
- Around line 43-45: The `remote_mvh_ptr` function on line 44 passes the result
of `remote_mvh_for_rank()` directly to `nixlGetPtr` without checking for null.
Add a null-check on the result of `remote_mvh_for_rank(ctx, global_rank)` and
return nullptr if it's null, rather than passing a null pointer to `nixlGetPtr`.
This preserves the null-MVH fallback behavior allowing the code to fall back to
RDMA when the remote MVH is unavailable. Apply the same fix to the similar
function at line 1148.
In `@examples/device/ep/csrc/kernels/utils.cuh`:
- Around line 510-514: The EP_DEVICE_ASSERT validation that checks num_ranks
against blockDim.x is placed after the if statement that indexes into
barrier_signal_ptrs using rank and thread_id. If these values are invalid, the
atomicAdd_system and atomicSub_system operations will access out-of-bounds
memory before the assertion detects the error. Move the
EP_DEVICE_ASSERT(num_ranks <= blockDim.x) statement to immediately before the if
block so that bounds validation occurs before any array indexing operations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a0f0b274-dc33-49bb-89f0-6b69930da7f9
📒 Files selected for processing (11)
examples/device/ep/README.mdexamples/device/ep/csrc/config.hppexamples/device/ep/csrc/kernels/api.cuhexamples/device/ep/csrc/kernels/launch.cuhexamples/device/ep/csrc/kernels/layout.cuexamples/device/ep/csrc/kernels/nixl_ep_ht.cuexamples/device/ep/csrc/kernels/nixl_ep_ll.cuexamples/device/ep/csrc/kernels/utils.cuhexamples/device/ep/csrc/nixl_ep.cppexamples/device/ep/csrc/nixl_ep.hppexamples/device/ep/nixl_ep/buffer.py
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/device/ep/csrc/nixl_ep.cpp (1)
322-329: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winGuard destroy-time IPC cleanup on pointer-table synchronization.
Line 322 only proves the local pointer-table storage exists; the GPU table is populated at Lines 487-488 only when
_ipc_handles_sync()runs. Ifdestroy()followsupdate_memory_buffers()before IPC sync, orconnect_ranks(..., ipc_handles={}), the intranode barrier can read uninitialized/null peer entries and hang or fault. Track anipc_handles_synced/local_ipc_readyflag set after the twocudaMemcpycalls, use it for the barrier and remote-handle close loop, but still free the local NVL allocation.🐛 Proposed direction
- if (num_nvl_bytes > 0 && buffer_ptrs[nvl_rank] != nullptr && barrier_signal_ptrs_gpu != nullptr) { + if (num_nvl_bytes > 0 && buffer_ptrs[nvl_rank] != nullptr) { + if (ipc_handles_synced && barrier_signal_ptrs_gpu != nullptr) { intranode::barrier(barrier_signal_ptrs_gpu, nvl_rank, num_nvl_ranks, timeout_cycles, comm_stream); warn_cuda(cudaDeviceSynchronize(), "synchronize device after intranode barrier"); // Close remote IPC if (is_available()) { for (int i = 0; i < num_nvl_ranks; ++ i) if (i != nvl_rank) warn_cuda(cudaIpcCloseMemHandle(buffer_ptrs[i]), "close remote IPC handle"); } + } // Free local buffer warn_cuda(cudaFree(buffer_ptrs[nvl_rank]), "free local NVL buffer"); }CUDA_CHECK(cudaMemcpy(buffer_ptrs_gpu, buffer_ptrs, sizeof(void*) * NUM_MAX_NVL_PEERS, cudaMemcpyHostToDevice)); CUDA_CHECK(cudaMemcpy(barrier_signal_ptrs_gpu, barrier_signal_ptrs, sizeof(int*) * NUM_MAX_NVL_PEERS, cudaMemcpyHostToDevice)); CUDA_CHECK(cudaDeviceSynchronize()); + ipc_handles_synced = true;This also needs a private boolean member initialized to
falseand reset during destroy.Also applies to: 466-489
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/device/ep/csrc/nixl_ep.cpp` around lines 322 - 329, Add a private boolean member variable (such as ipc_handles_synced or local_ipc_ready) initialized to false to track whether IPC handles have been synchronized on the GPU. Set this flag to true only after the two cudaMemcpy calls that populate the GPU table (at lines 487-488 in _ipc_handles_sync or similar). Guard both the intranode::barrier call and the remote IPC cleanup loop that calls cudaIpcCloseMemHandle (lines 323-327) with this flag, so they only execute if the GPU table has been properly synced. Allow the local NVL allocation to still be freed regardless of the flag state. Reset this flag to false during the destroy() method to ensure it is properly reinitialized.
♻️ Duplicate comments (1)
examples/device/ep/csrc/kernels/nixl_ep_ll.cu (1)
43-44: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winPreserve the null-MVH fallback before
nixlGetPtr.The new group guard prevents cross-group P2P caching, but same-group cache population still unconditionally calls
nixlGetPtr(remote_mvh_for_rank(...), ...). If IPC/P2P MVH is unavailable and the helper returnsnullptr, this can fault instead of cachingnullptrand falling back tonixlPut.Proposed fix
__device__ __forceinline__ void* remote_mvh_ptr(const gpu_nixl_ctx& ctx, int global_rank) { - return nixlGetPtr(remote_mvh_for_rank(ctx, global_rank), remote_mvh_index(ctx, global_rank)); + auto mvh = remote_mvh_for_rank(ctx, global_rank); + return mvh == nullptr ? nullptr : nixlGetPtr(mvh, remote_mvh_index(ctx, global_rank)); }Also applies to: 1148-1148
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu` around lines 43 - 44, In the remote_mvh_ptr function, add a null-check on the result of remote_mvh_for_rank(ctx, global_rank) before passing it to nixlGetPtr. If remote_mvh_for_rank returns nullptr (indicating unavailable IPC/P2P MVH), the function should return nullptr directly instead of calling nixlGetPtr, which allows the calling code to properly handle the fallback to nixlPut instead of causing a fault.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 322-329: Add a private boolean member variable (such as
ipc_handles_synced or local_ipc_ready) initialized to false to track whether IPC
handles have been synchronized on the GPU. Set this flag to true only after the
two cudaMemcpy calls that populate the GPU table (at lines 487-488 in
_ipc_handles_sync or similar). Guard both the intranode::barrier call and the
remote IPC cleanup loop that calls cudaIpcCloseMemHandle (lines 323-327) with
this flag, so they only execute if the GPU table has been properly synced. Allow
the local NVL allocation to still be freed regardless of the flag state. Reset
this flag to false during the destroy() method to ensure it is properly
reinitialized.
---
Duplicate comments:
In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu`:
- Around line 43-44: In the remote_mvh_ptr function, add a null-check on the
result of remote_mvh_for_rank(ctx, global_rank) before passing it to nixlGetPtr.
If remote_mvh_for_rank returns nullptr (indicating unavailable IPC/P2P MVH), the
function should return nullptr directly instead of calling nixlGetPtr, which
allows the calling code to properly handle the fallback to nixlPut instead of
causing a fault.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a24c6668-21f7-4ce8-85a1-70481d3c1485
📒 Files selected for processing (11)
examples/device/ep/README.mdexamples/device/ep/csrc/config.hppexamples/device/ep/csrc/kernels/api.cuhexamples/device/ep/csrc/kernels/launch.cuhexamples/device/ep/csrc/kernels/layout.cuexamples/device/ep/csrc/kernels/nixl_ep_ht.cuexamples/device/ep/csrc/kernels/nixl_ep_ll.cuexamples/device/ep/csrc/kernels/utils.cuhexamples/device/ep/csrc/nixl_ep.cppexamples/device/ep/csrc/nixl_ep.hppexamples/device/ep/nixl_ep/buffer.py
Signed-off-by: xt66 <60164575+xtyao66@users.noreply.github.com>
f5748fd to
ce03f45
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu`:
- Around line 1145-1150: The code indexes p2p_ptrs with rank_id without
validating that rank_id is within valid bounds, which can cause device memory
corruption. Add a bounds check on rank_id before the cross-group branch that
sets p2p_ptrs[rank_id] = nullptr. The validation should ensure rank_id is a
valid index into the p2p_ptrs array, and if it is invalid, return early or skip
the indexing operation to prevent out-of-bounds access.
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 1366-1367: The pointer declarations for remote_rdma_ptr and
remote_sync_ptr do not follow the repository's clang-format style guide which
uses PointerAlignment: Right. Modify both pointer declarations to move the
asterisk from the right of the type to the left, binding it to the type name
instead of the variable name (change void* to void * for both declarations).
- Around line 136-138: The assertion on line 136 is rejecting low-latency jobs
with rank counts not divisible by nvl_group_size, which contradicts the
exemption of low-latency mode from the HT RDMA-peer bound mentioned on line 135.
Modify the EP_HOST_ASSERT on line 136 to exclude the divisibility requirement
when low_latency_mode is enabled, allowing low-latency jobs to proceed with any
rank count less than or greater than nvl_group_size. Additionally, the second
assertion in lines 137-138 should be tightened to require num_rdma_bytes > 0
when num_ranks > nvl_group_size, unless low_latency_mode is active, to ensure
proper HT allocation (currently it allows multi-group configurations with only
the 8-byte barrier tail, which is insufficient).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d8421550-0654-43f0-ae80-d802081a89d9
📒 Files selected for processing (11)
examples/device/ep/README.mdexamples/device/ep/csrc/config.hppexamples/device/ep/csrc/kernels/api.cuhexamples/device/ep/csrc/kernels/launch.cuhexamples/device/ep/csrc/kernels/layout.cuexamples/device/ep/csrc/kernels/nixl_ep_ht.cuexamples/device/ep/csrc/kernels/nixl_ep_ll.cuexamples/device/ep/csrc/kernels/utils.cuhexamples/device/ep/csrc/nixl_ep.cppexamples/device/ep/csrc/nixl_ep.hppexamples/device/ep/nixl_ep/buffer.py
Signed-off-by: xt66 <60164575+xtyao66@users.noreply.github.com>
What?
This PR adds runtime-configurable NVLink-local group sizing to the NIXL EP high-throughput path.
Today the HT EP topology assumes an 8-rank local NVLink group:
This PR introduces
nvl_group_size, defaulting to8, and uses it for runtime rank mapping:The main target is rank-contiguous GB200 deployments where each local CUDA-IPC/NVLink group has 4 GPUs, such as
0..3,4..7,8..11, etc.Closes/follows design issue #1818.
Why?
The hard-coded 8-rank grouping is correct for existing 8-GPU-local deployments, but it is too rigid for GB200 deployments that expose fewer local GPUs per worker. With 4 GPUs per worker, the old mapping can treat ranks across worker boundaries as CUDA-IPC-local peers when they should be reached through the RDMA/fabric path.
How?
nvl_group_size=8to the EP Buffer constructor and Python wrapper.0 < nvl_group_size <= 8and that it divides the fixed 8-lane scratch layout.NUM_MAX_NVL_PEERSwith runtimenvl_group_size.NUM_MAX_NVL_PEERS=8as the maximum kernel/scratch-layout width.nvl_group_size=8behavior.Scope
This is intentionally limited to equal-size, rank-contiguous local groups. It does not add arbitrary topology discovery or a scheduler-provided rank-to-group mapping API.
This PR also intentionally excludes the CUDA VMM utility extraction and debug APIs from the older PR, so reviewers can focus on the core EP topology change.
Validation
Local/static checks:
git diff --check origin/main..codex/nixl-ep-gb200-core: passCommands run locally:
I did not run the CUDA/Ninja build or GPU functional tests in this local environment.
Follow-up
I have a separate validation branch ready with GB200-style
nvl_group_size=4test coverage:I plan to open that after this core topology PR is reviewed.
Summary by CodeRabbit
Release Notes
New Features
nvl_group_size(default:8) to theBufferconstructor to control NVLink/RDMA grouping and sizing.nvl_group_sizefor rank mapping, remote pointer/descriptor selection, layout sizing, and RDMA/NVL dispatch behavior.Bug Fixes
Documentation
nvl_group_sizeconstraints (0 < nvl_group_size <= 8and must divide8) and updated buffer-size hint APIs accordingly.