Skip to content

examples/ep: support configurable NVL group size#1820

Open
xtyao66 wants to merge 5 commits into
ai-dynamo:mainfrom
xtyao66:codex/nixl-ep-gb200-core
Open

examples/ep: support configurable NVL group size#1820
xtyao66 wants to merge 5 commits into
ai-dynamo:mainfrom
xtyao66:codex/nixl-ep-gb200-core

Conversation

@xtyao66

@xtyao66 xtyao66 commented Jun 23, 2026

Copy link
Copy Markdown

What?

This PR adds runtime-configurable NVLink-local group sizing to the NIXL EP high-throughput path.

Today the HT EP topology assumes an 8-rank local NVLink group:

rdma_rank = rank / 8
nvl_rank  = rank % 8

This PR introduces nvl_group_size, defaulting to 8, and uses it for runtime rank mapping:

rdma_rank = rank / nvl_group_size
nvl_rank  = rank % nvl_group_size

The main target is rank-contiguous GB200 deployments where each local CUDA-IPC/NVLink group has 4 GPUs, such as 0..3, 4..7, 8..11, etc.

Closes/follows design issue #1818.

Why?

The hard-coded 8-rank grouping is correct for existing 8-GPU-local deployments, but it is too rigid for GB200 deployments that expose fewer local GPUs per worker. With 4 GPUs per worker, the old mapping can treat ranks across worker boundaries as CUDA-IPC-local peers when they should be reached through the RDMA/fabric path.

How?

  • Add nvl_group_size=8 to the EP Buffer constructor and Python wrapper.
  • Validate 0 < nvl_group_size <= 8 and that it divides the fixed 8-lane scratch layout.
  • Replace HT rank partitioning from fixed NUM_MAX_NVL_PEERS with runtime nvl_group_size.
  • Keep NUM_MAX_NVL_PEERS=8 as the maximum kernel/scratch-layout width.
  • Guard active NVL lanes at runtime for smaller local groups.
  • Use per-rank remote memory views so cross-group HT traffic can target the correct remote rank.
  • Keep the HT barrier counter inside the registered RDMA region.
  • Preserve the default nvl_group_size=8 behavior.

Scope

This is intentionally limited to equal-size, rank-contiguous local groups. It does not add arbitrary topology discovery or a scheduler-provided rank-to-group mapping API.

This PR also intentionally excludes the CUDA VMM utility extraction and debug APIs from the older PR, so reviewers can focus on the core EP topology change.

Validation

Local/static checks:

  • git diff --check origin/main..codex/nixl-ep-gb200-core: pass
  • Python compile with redirected pycache: pass

Commands run locally:

git diff --check origin/main..codex/nixl-ep-gb200-core
PYTHONPYCACHEPREFIX=/private/tmp/nixl_pycache python3 -m py_compile examples/device/ep/nixl_ep/buffer.py examples/device/ep/tests/test_ht.py examples/device/ep/tests/test_ht_gb200.py

I did not run the CUDA/Ninja build or GPU functional tests in this local environment.

Follow-up

I have a separate validation branch ready with GB200-style nvl_group_size=4 test coverage:

xtyao66:codex/nixl-ep-gb200-validation

I plan to open that after this core topology PR is reviewed.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added nvl_group_size (default: 8) to the Buffer constructor to control NVLink/RDMA grouping and sizing.
    • Updated expert-parallel dispatch to use runtime nvl_group_size for rank mapping, remote pointer/descriptor selection, layout sizing, and RDMA/NVL dispatch behavior.
  • Bug Fixes

    • Improved safety and runtime handling for out-of-range NVL ranks and zero top-k cases.
    • Enhanced HT barrier timeouts with tagged timeout diagnostics.
  • Documentation

    • Documented nvl_group_size constraints (0 < nvl_group_size <= 8 and must divide 8) and updated buffer-size hint APIs accordingly.

@xtyao66 xtyao66 requested review from a team, ebarilanM, itayalroy and rakhmets as code owners June 23, 2026 23:53
@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Copy link
Copy Markdown

👋 Hi xtyao66! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@xtyao66

xtyao66 commented Jun 23, 2026

Copy link
Copy Markdown
Author

Fresh replacement for #1799 and #1819. This version keeps the PR branch clean: one core commit, API and runtime rank partitioning land together, and the old debug/VMM scope is removed. @ebarilanM @rakhmets @itayalroy could you take a look when you have a chance?

@xtyao66 xtyao66 force-pushed the codex/nixl-ep-gb200-core branch from 79713a9 to 0711aab Compare June 24, 2026 00:07
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a runtime nvl_group_size parameter (default 8) to the EP Buffer API, propagating it through the entire stack: C++/Python constructor, gpu_nixl_ctx GPU state, buffer sizing hints, a new runtime barrier utility, layout kernel, all HT CUDA kernels, LL kernel memview helpers, and host-side memory registration/lifecycle logic, replacing hardcoded NUM_MAX_NVL_PEERS assumptions throughout.

Changes

Configurable nvl_group_size for EP High-Throughput Mode

Layer / File(s) Summary
Public API contracts and GPU context data structures
examples/device/ep/README.md, examples/device/ep/csrc/nixl_ep.hpp, examples/device/ep/csrc/kernels/api.cuh, examples/device/ep/nixl_ep/buffer.py
Buffer constructor gains nvl_group_size parameter (C++, Python, docs). gpu_nixl_ctx gains per-rank MVH arrays (remote_mvh_by_rank, barrier_mvh_by_rank), index array (remote_mvh_indices), and new fields (num_nvl_ranks, nvl_group_size, ht_barrier_counter_offset). NixlPeerInfo gains rdma_buffer_bytes; NixlAgentInfo drops ht_barrier_reg_descs. Buffer state adds default initialization for tracking fields. Python wrapper validates range and divisibility constraints.
Buffer sizing hints parameterized by nvl_group_size
examples/device/ep/csrc/config.hpp
Config::get_nvl_buffer_size_hint and Config::get_rdma_buffer_size_hint accept optional nvl_group_size parameter; RDMA rank count derived as num_ranks / nvl_group_size with new divisibility and bounds assertions replacing constant references to NUM_MAX_NVL_PEERS.
Runtime barrier utility and launch macros
examples/device/ep/csrc/kernels/utils.cuh, examples/device/ep/csrc/kernels/launch.cuh, examples/device/ep/csrc/kernels/api.cuh, examples/device/ep/csrc/kernels/nixl_ep_ht.cu
Introduces barrier_block_runtime with runtime num_ranks parameter instead of template argument; gates atomic adds and polling on threadIdx.x < num_ranks; adds runtime validation and extends timeout diagnostic with optional tag value. Adds remote_mdesc helper to construct runtime descriptors. Adds rank-translation helpers (translate_dst_rdma_rank, global_rank_from_rdma_nvl, load_token_in_rdma_rank_bits) and extends SWITCH_NVL_RANKS with case 1 and SWITCH_RDMA_RANKS with cases 3, 5, and 6.
Layout kernel parameterized by nvl_group_size
examples/device/ep/csrc/kernels/layout.cu
Device kernel and host wrapper of get_dispatch_layout gain nvl_group_size parameter; RDMA rank boundary calculations and per-token RDMA group membership use runtime nvl_group_size instead of NUM_MAX_NVL_PEERS; prior static assertion on kNumRanksPerSM divisibility is removed and replaced with host-side dynamic assertion.
HT kernels: notify_dispatch, dispatch, cached_notify, combine
examples/device/ep/csrc/kernels/nixl_ep_ht.cu
All four kernels replace fixed NUM_MAX_NVL_PEERS with nixl_ctx.nvl_group_size and nixl_ctx.num_nvl_ranks for rank splitting, loop bounds, shared-memory barrier counts, and descriptor construction. dispatch and combine introduce WarpRole::kInactive for out-of-range NVL lanes with early-exit returns. All use global_rank_from_rdma_nvl, translate_dst_rdma_rank, and remote_mdesc for runtime descriptor building. Replace compile-time barrier_block<NUM_MAX_NVL_PEERS> with barrier_block_runtime(..., nixl_ctx.num_nvl_ranks, ...). Replace nixl_barrier_wait signature with timeout-aware variant accepting timeout_cycles and tag. combine_token returns \-1 for empty top-k rank set. Host-side num_rdma_ranks computed as num_ranks / nixl_ctx.nvl_group_size.
LL kernel call sites updated to per-rank MVH helpers
examples/device/ep/csrc/kernels/nixl_ep_ll.cu
All nixlPut/nixlAtomicAdd descriptor constructions replace direct nixl_ctx.remote_mvh/barrier_mvh references with calls to remote_mdesc, barrier_mdesc, and remote_mvh_ptr. Updates p2p_ptr_get to return nullptr early when dst_rank is in a different NVL group than ctx.rank. Refactors cache_p2p_ptr_kernel to validate nvl_group_size, set cached pointers to nullptr for cross-group ranks, otherwise fill via runtime helper.
Buffer host lifecycle: init, RDMA allocation, memory views, agent connect, pybind11
examples/device/ep/csrc/nixl_ep.cpp
Buffer constructor validates and stores nvl_group_size; init computes rdma_rank/nvl_rank/num_nvl_ranks from the group size. In non-low-latency mode, embeds HT barrier counter in RDMA allocation tail with computed ht_barrier_counter_offset. _nixl_ep_memory_views_create/destroy maintain per-rank MVH host vectors and device-side copies. _nixl_ep_init/_ep_destroy allocate/free per-rank device MVH arrays and index buffers. _nixl_agents_connect refactored to split fetchRemoteMD/loadRemoteMD and use agent-name-based polling. destroy adds extra null-checks and removes conditional HT-counter deallocation. All sizing-hint and layout calls pass nvl_group_size. pybind11 binding updated for new constructor signature.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🐇 Hoppity-hop through the GPU lanes,
Eight was the max, now four remains!
nvl_group_size hops in at last,
GB200 groups zoom past,
No more hard-wired peer refrain—
The rabbit says: reconfigure your domain! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'examples/ep: support configurable NVL group size' clearly and concisely summarizes the main change—adding runtime-configurable NVL group sizing to the NIXL EP module.
Description check ✅ Passed The PR description covers all required template sections: What (runtime NVL group sizing), Why (GB200 4-GPU-local deployments), and How (implementation approach, scope, validation).
Linked Issues check ✅ Passed The PR fully addresses issue #1818 objectives: adds nvl_group_size parameter with validation (0 < nvl_group_size <= 8), replaces hard-coded NUM_MAX_NVL_PEERS with runtime group size, maintains backward compatibility (default=8), guards active NVL lanes, and uses per-rank remote memory views for correct inter-group routing.
Out of Scope Changes check ✅ Passed All changes are directly aligned with #1818 requirements: documentation updates, rank-partitioning logic, HT kernel/dispatch modifications, and per-rank remote memory view setup. No unrelated features or utilities are introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/device/ep/csrc/kernels/layout.cu`:
- Around line 73-83: Add a validation constraint to enforce that the fixed
8-lane layout (kNumRanksPerSM = 8) is properly divisible by the nvl_group_size
parameter before any RDMA indexing operations occur. Specifically, add an
assertion after the existing EP_DEVICE_ASSERT validation block to ensure that
kNumRanksPerSM is evenly divisible by nvl_group_size, so that the rank_idx
division operation on line 106 and the subsequent RDMA rank accumulation in the
loops around lines 147-155 operate on correctly aligned boundaries, preventing
incorrect RDMA group assignments when invalid divisors like 3 or 5 are passed
through.

In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu`:
- Around line 43-45: The `remote_mvh_ptr` function uses cached P2P pointers for
any rank without validating that the rank is in the same NVL group, which can
bypass security checks in `nixlPut`. Add a guard condition in the
`remote_mvh_ptr` function to check if the given global_rank is in the same NVL
group as the current context (by comparing `nvl_group_size` and rank
membership). If the rank is not in the same NVL group, return null to force the
request through RDMA descriptors instead of using the unsafe cached P2P pointer.
Ensure the null-MVH guard is preserved before calling `nixlGetPtr` to maintain
safety for valid same-group ranks.

In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 404-413: After successfully loading the agent name via
loadRemoteMD in the else branch and storing it in
remote_agent_names[remote_rank], all subsequent operations targeting this remote
agent must use the stored remote_agent_names[remote_rank] instead of
std::to_string(remote_rank). Update any peer-info notification generation code
(such as in genNotif or _nixl_agents_peer_info_gather) to use the actual agent
name from remote_agent_names[remote_rank] when constructing notifications,
ensuring consistency between metadata loading and peer-info gathering
operations.
- Around line 1345-1360: The HT kernels use a single ht_barrier_counter_offset
for all remote RDMA memviews, but the current `>=` check allows peers to have
larger RDMA buffer sizes, causing barrier atomics to be written at the wrong
offset. Change the comparison operator from `>=` to `==` in the assertion that
checks `nixl_peer_info[r].rdma_buffer_bytes` to enforce that all remote peers
must have exactly matching RDMA buffer sizes with this rank, ensuring the
uniform barrier offset aligns correctly with each peer's barrier tail.

In `@examples/device/ep/nixl_ep/buffer.py`:
- Around line 89-91: The assertion used to validate the nvl_group_size parameter
can be disabled when Python is run with the -O optimization flag, allowing
invalid values to bypass validation and cause harder-to-debug failures later in
C++ code. Replace the assert statement (checking that nvl_group_size is between
0 and 8 and divides 8 evenly) with an explicit exception like ValueError or
RuntimeError that will always be raised regardless of Python optimization flags.
Ensure the exception includes the same descriptive message about the
nvl_group_size constraint.

In `@examples/device/ep/README.md`:
- Line 41: The documentation for the Buffer function initialization on line 41
is missing the valid range and divisibility constraints for the nvl_group_size
parameter. Update the API documentation line for the Buffer function to include
these constraints alongside the default value of 8, ensuring developers
understand what values are acceptable when initializing the NIXL communication
buffer with the nvl_group_size parameter.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e0f94db6-0704-4866-b6d3-0f5ee6a845aa

📥 Commits

Reviewing files that changed from the base of the PR and between b775042 and 79713a9.

📒 Files selected for processing (11)
  • examples/device/ep/README.md
  • examples/device/ep/csrc/config.hpp
  • examples/device/ep/csrc/kernels/api.cuh
  • examples/device/ep/csrc/kernels/launch.cuh
  • examples/device/ep/csrc/kernels/layout.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ht.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ll.cu
  • examples/device/ep/csrc/kernels/utils.cuh
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp
  • examples/device/ep/nixl_ep/buffer.py

Comment thread examples/device/ep/csrc/kernels/layout.cu
Comment thread examples/device/ep/csrc/kernels/nixl_ep_ll.cu
Comment thread examples/device/ep/csrc/nixl_ep.cpp
Comment thread examples/device/ep/csrc/nixl_ep.cpp Outdated
Comment thread examples/device/ep/nixl_ep/buffer.py Outdated
Comment thread examples/device/ep/README.md Outdated
@xtyao66 xtyao66 force-pushed the codex/nixl-ep-gb200-core branch from 0711aab to ff631b3 Compare June 24, 2026 00:13

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/device/ep/csrc/kernels/nixl_ep_ht.cu`:
- Line 500: Add local assertions before each `num_rdma_ranks` derivation in the
functions `notify_dispatch`, `dispatch`, `cached_notify`, and `combine` to
validate the runtime rank partition. For each location where `num_rdma_ranks` is
calculated from `num_ranks / nixl_ctx.nvl_group_size`, insert assertions that
verify: the group size is nonzero, the division is exact (num_ranks is evenly
divisible by nvl_group_size), the resulting RDMA rank count is positive, and the
context is consistent. This ensures incompatible rank layouts fail immediately
at the launch boundary rather than causing misrouting or crashes during later
operations like the division in the `combine` function.

In `@examples/device/ep/csrc/kernels/utils.cuh`:
- Around line 510-514: The barrier signal pointer dereferences in the
atomicAdd_system and atomicSub_system calls using indices rank and thread_id
occur before runtime bounds validation. Move the EP_DEVICE_ASSERT check for
num_ranks to occur before the atomic operations, and add an additional assertion
to ensure rank is within valid bounds (rank < num_ranks) before it is used as an
index into barrier_signal_ptrs. This ensures invalid runtime topology is caught
before any out-of-bounds system atomic operations are issued.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 36e36fcd-165c-40d3-9f6d-4b48a26c192b

📥 Commits

Reviewing files that changed from the base of the PR and between 79713a9 and 0711aab.

📒 Files selected for processing (11)
  • examples/device/ep/README.md
  • examples/device/ep/csrc/config.hpp
  • examples/device/ep/csrc/kernels/api.cuh
  • examples/device/ep/csrc/kernels/launch.cuh
  • examples/device/ep/csrc/kernels/layout.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ht.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ll.cu
  • examples/device/ep/csrc/kernels/utils.cuh
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp
  • examples/device/ep/nixl_ep/buffer.py

Comment thread examples/device/ep/csrc/kernels/nixl_ep_ht.cu
Comment thread examples/device/ep/csrc/kernels/utils.cuh Outdated
@xtyao66 xtyao66 force-pushed the codex/nixl-ep-gb200-core branch from ff631b3 to f5748fd Compare June 24, 2026 00:24

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
examples/device/ep/csrc/kernels/nixl_ep_ll.cu (1)

43-45: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Preserve the null-MVH fallback before calling nixlGetPtr.

The cross-group guard is now in place, but same-group P2P cache initialization still calls remote_mvh_ptr() unconditionally. If remote_mvh_for_rank() returns nullptr, Line 44 passes it into nixlGetPtr instead of caching nullptr and falling back to RDMA.

Proposed fix
 __device__ __forceinline__ void* remote_mvh_ptr(const gpu_nixl_ctx& ctx, int global_rank) {
-    return nixlGetPtr(remote_mvh_for_rank(ctx, global_rank), remote_mvh_index(ctx, global_rank));
+    auto mvh = remote_mvh_for_rank(ctx, global_rank);
+    return mvh == nullptr ? nullptr : nixlGetPtr(mvh, remote_mvh_index(ctx, global_rank));
 }

Also applies to: 1148-1148

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu` around lines 43 - 45, The
`remote_mvh_ptr` function on line 44 passes the result of
`remote_mvh_for_rank()` directly to `nixlGetPtr` without checking for null. Add
a null-check on the result of `remote_mvh_for_rank(ctx, global_rank)` and return
nullptr if it's null, rather than passing a null pointer to `nixlGetPtr`. This
preserves the null-MVH fallback behavior allowing the code to fall back to RDMA
when the remote MVH is unavailable. Apply the same fix to the similar function
at line 1148.
examples/device/ep/csrc/kernels/utils.cuh (1)

510-514: 🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

Move runtime topology asserts before barrier pointer indexing.

Line 511 and Line 512 index barrier_signal_ptrs before any runtime bounds validation. If rank/num_ranks are invalid, this can issue OOB system atomics before failure is detected.

Suggested fix
 barrier_block_runtime(int** barrier_signal_ptrs, int rank, int num_ranks, uint64_t timeout_cycles, int tag = 0) {
     auto thread_id = static_cast<int>(threadIdx.x);
 
+    EP_DEVICE_ASSERT(num_ranks > 0);
+    EP_DEVICE_ASSERT(num_ranks <= blockDim.x);
+    EP_DEVICE_ASSERT(rank >= 0 and rank < num_ranks);
+
     if constexpr (not kSyncOnly) {
         memory_fence();
         __syncthreads();
     }
 
     if (thread_id < num_ranks) {
         atomicAdd_system(barrier_signal_ptrs[rank] + thread_id, FINISHED_SUM_TAG);
         atomicSub_system(barrier_signal_ptrs[thread_id] + rank, FINISHED_SUM_TAG);
     }
-    EP_DEVICE_ASSERT(num_ranks <= blockDim.x);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/device/ep/csrc/kernels/utils.cuh` around lines 510 - 514, The
EP_DEVICE_ASSERT validation that checks num_ranks against blockDim.x is placed
after the if statement that indexes into barrier_signal_ptrs using rank and
thread_id. If these values are invalid, the atomicAdd_system and
atomicSub_system operations will access out-of-bounds memory before the
assertion detects the error. Move the EP_DEVICE_ASSERT(num_ranks <= blockDim.x)
statement to immediately before the if block so that bounds validation occurs
before any array indexing operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu`:
- Around line 43-45: The `remote_mvh_ptr` function on line 44 passes the result
of `remote_mvh_for_rank()` directly to `nixlGetPtr` without checking for null.
Add a null-check on the result of `remote_mvh_for_rank(ctx, global_rank)` and
return nullptr if it's null, rather than passing a null pointer to `nixlGetPtr`.
This preserves the null-MVH fallback behavior allowing the code to fall back to
RDMA when the remote MVH is unavailable. Apply the same fix to the similar
function at line 1148.

In `@examples/device/ep/csrc/kernels/utils.cuh`:
- Around line 510-514: The EP_DEVICE_ASSERT validation that checks num_ranks
against blockDim.x is placed after the if statement that indexes into
barrier_signal_ptrs using rank and thread_id. If these values are invalid, the
atomicAdd_system and atomicSub_system operations will access out-of-bounds
memory before the assertion detects the error. Move the
EP_DEVICE_ASSERT(num_ranks <= blockDim.x) statement to immediately before the if
block so that bounds validation occurs before any array indexing operations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a0f0b274-dc33-49bb-89f0-6b69930da7f9

📥 Commits

Reviewing files that changed from the base of the PR and between 0711aab and ff631b3.

📒 Files selected for processing (11)
  • examples/device/ep/README.md
  • examples/device/ep/csrc/config.hpp
  • examples/device/ep/csrc/kernels/api.cuh
  • examples/device/ep/csrc/kernels/launch.cuh
  • examples/device/ep/csrc/kernels/layout.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ht.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ll.cu
  • examples/device/ep/csrc/kernels/utils.cuh
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp
  • examples/device/ep/nixl_ep/buffer.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/device/ep/csrc/nixl_ep.cpp (1)

322-329: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard destroy-time IPC cleanup on pointer-table synchronization.

Line 322 only proves the local pointer-table storage exists; the GPU table is populated at Lines 487-488 only when _ipc_handles_sync() runs. If destroy() follows update_memory_buffers() before IPC sync, or connect_ranks(..., ipc_handles={}), the intranode barrier can read uninitialized/null peer entries and hang or fault. Track an ipc_handles_synced/local_ipc_ready flag set after the two cudaMemcpy calls, use it for the barrier and remote-handle close loop, but still free the local NVL allocation.

🐛 Proposed direction
-    if (num_nvl_bytes > 0 && buffer_ptrs[nvl_rank] != nullptr && barrier_signal_ptrs_gpu != nullptr) {
+    if (num_nvl_bytes > 0 && buffer_ptrs[nvl_rank] != nullptr) {
+        if (ipc_handles_synced && barrier_signal_ptrs_gpu != nullptr) {
         intranode::barrier(barrier_signal_ptrs_gpu, nvl_rank, num_nvl_ranks, timeout_cycles, comm_stream);
         warn_cuda(cudaDeviceSynchronize(), "synchronize device after intranode barrier");
 
         // Close remote IPC
         if (is_available()) {
             for (int i = 0; i < num_nvl_ranks; ++ i) if (i != nvl_rank)
                 warn_cuda(cudaIpcCloseMemHandle(buffer_ptrs[i]), "close remote IPC handle");
         }
+        }
 
         // Free local buffer
         warn_cuda(cudaFree(buffer_ptrs[nvl_rank]), "free local NVL buffer");
     }
         CUDA_CHECK(cudaMemcpy(buffer_ptrs_gpu, buffer_ptrs, sizeof(void*) * NUM_MAX_NVL_PEERS, cudaMemcpyHostToDevice));
         CUDA_CHECK(cudaMemcpy(barrier_signal_ptrs_gpu, barrier_signal_ptrs, sizeof(int*) * NUM_MAX_NVL_PEERS, cudaMemcpyHostToDevice));
         CUDA_CHECK(cudaDeviceSynchronize());
+        ipc_handles_synced = true;

This also needs a private boolean member initialized to false and reset during destroy.

Also applies to: 466-489

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/device/ep/csrc/nixl_ep.cpp` around lines 322 - 329, Add a private
boolean member variable (such as ipc_handles_synced or local_ipc_ready)
initialized to false to track whether IPC handles have been synchronized on the
GPU. Set this flag to true only after the two cudaMemcpy calls that populate the
GPU table (at lines 487-488 in _ipc_handles_sync or similar). Guard both the
intranode::barrier call and the remote IPC cleanup loop that calls
cudaIpcCloseMemHandle (lines 323-327) with this flag, so they only execute if
the GPU table has been properly synced. Allow the local NVL allocation to still
be freed regardless of the flag state. Reset this flag to false during the
destroy() method to ensure it is properly reinitialized.
♻️ Duplicate comments (1)
examples/device/ep/csrc/kernels/nixl_ep_ll.cu (1)

43-44: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Preserve the null-MVH fallback before nixlGetPtr.

The new group guard prevents cross-group P2P caching, but same-group cache population still unconditionally calls nixlGetPtr(remote_mvh_for_rank(...), ...). If IPC/P2P MVH is unavailable and the helper returns nullptr, this can fault instead of caching nullptr and falling back to nixlPut.

Proposed fix
 __device__ __forceinline__ void* remote_mvh_ptr(const gpu_nixl_ctx& ctx, int global_rank) {
-    return nixlGetPtr(remote_mvh_for_rank(ctx, global_rank), remote_mvh_index(ctx, global_rank));
+    auto mvh = remote_mvh_for_rank(ctx, global_rank);
+    return mvh == nullptr ? nullptr : nixlGetPtr(mvh, remote_mvh_index(ctx, global_rank));
 }

Also applies to: 1148-1148

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu` around lines 43 - 44, In the
remote_mvh_ptr function, add a null-check on the result of
remote_mvh_for_rank(ctx, global_rank) before passing it to nixlGetPtr. If
remote_mvh_for_rank returns nullptr (indicating unavailable IPC/P2P MVH), the
function should return nullptr directly instead of calling nixlGetPtr, which
allows the calling code to properly handle the fallback to nixlPut instead of
causing a fault.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 322-329: Add a private boolean member variable (such as
ipc_handles_synced or local_ipc_ready) initialized to false to track whether IPC
handles have been synchronized on the GPU. Set this flag to true only after the
two cudaMemcpy calls that populate the GPU table (at lines 487-488 in
_ipc_handles_sync or similar). Guard both the intranode::barrier call and the
remote IPC cleanup loop that calls cudaIpcCloseMemHandle (lines 323-327) with
this flag, so they only execute if the GPU table has been properly synced. Allow
the local NVL allocation to still be freed regardless of the flag state. Reset
this flag to false during the destroy() method to ensure it is properly
reinitialized.

---

Duplicate comments:
In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu`:
- Around line 43-44: In the remote_mvh_ptr function, add a null-check on the
result of remote_mvh_for_rank(ctx, global_rank) before passing it to nixlGetPtr.
If remote_mvh_for_rank returns nullptr (indicating unavailable IPC/P2P MVH), the
function should return nullptr directly instead of calling nixlGetPtr, which
allows the calling code to properly handle the fallback to nixlPut instead of
causing a fault.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a24c6668-21f7-4ce8-85a1-70481d3c1485

📥 Commits

Reviewing files that changed from the base of the PR and between ff631b3 and f5748fd.

📒 Files selected for processing (11)
  • examples/device/ep/README.md
  • examples/device/ep/csrc/config.hpp
  • examples/device/ep/csrc/kernels/api.cuh
  • examples/device/ep/csrc/kernels/launch.cuh
  • examples/device/ep/csrc/kernels/layout.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ht.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ll.cu
  • examples/device/ep/csrc/kernels/utils.cuh
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp
  • examples/device/ep/nixl_ep/buffer.py

Signed-off-by: xt66 <60164575+xtyao66@users.noreply.github.com>
@xtyao66 xtyao66 force-pushed the codex/nixl-ep-gb200-core branch from f5748fd to ce03f45 Compare June 24, 2026 01:06

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/device/ep/csrc/kernels/nixl_ep_ll.cu`:
- Around line 1145-1150: The code indexes p2p_ptrs with rank_id without
validating that rank_id is within valid bounds, which can cause device memory
corruption. Add a bounds check on rank_id before the cross-group branch that
sets p2p_ptrs[rank_id] = nullptr. The validation should ensure rank_id is a
valid index into the p2p_ptrs array, and if it is invalid, return early or skip
the indexing operation to prevent out-of-bounds access.

In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 1366-1367: The pointer declarations for remote_rdma_ptr and
remote_sync_ptr do not follow the repository's clang-format style guide which
uses PointerAlignment: Right. Modify both pointer declarations to move the
asterisk from the right of the type to the left, binding it to the type name
instead of the variable name (change void* to void * for both declarations).
- Around line 136-138: The assertion on line 136 is rejecting low-latency jobs
with rank counts not divisible by nvl_group_size, which contradicts the
exemption of low-latency mode from the HT RDMA-peer bound mentioned on line 135.
Modify the EP_HOST_ASSERT on line 136 to exclude the divisibility requirement
when low_latency_mode is enabled, allowing low-latency jobs to proceed with any
rank count less than or greater than nvl_group_size. Additionally, the second
assertion in lines 137-138 should be tightened to require num_rdma_bytes > 0
when num_ranks > nvl_group_size, unless low_latency_mode is active, to ensure
proper HT allocation (currently it allows multi-group configurations with only
the 8-byte barrier tail, which is insufficient).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d8421550-0654-43f0-ae80-d802081a89d9

📥 Commits

Reviewing files that changed from the base of the PR and between f5748fd and ce03f45.

📒 Files selected for processing (11)
  • examples/device/ep/README.md
  • examples/device/ep/csrc/config.hpp
  • examples/device/ep/csrc/kernels/api.cuh
  • examples/device/ep/csrc/kernels/launch.cuh
  • examples/device/ep/csrc/kernels/layout.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ht.cu
  • examples/device/ep/csrc/kernels/nixl_ep_ll.cu
  • examples/device/ep/csrc/kernels/utils.cuh
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp
  • examples/device/ep/nixl_ep/buffer.py

Comment thread examples/device/ep/csrc/kernels/nixl_ep_ll.cu
Comment thread examples/device/ep/csrc/nixl_ep.cpp Outdated
Comment thread examples/device/ep/csrc/nixl_ep.cpp Outdated
Signed-off-by: xt66 <60164575+xtyao66@users.noreply.github.com>
@nv-nmailhot nv-nmailhot requested review from a team, aranadive, brminich and ovidiusm as code owners June 24, 2026 06:28
@rakhmets rakhmets removed the request for review from a team June 24, 2026 15:41
@rakhmets rakhmets removed request for a team, aranadive, brminich and ovidiusm June 24, 2026 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants