examples/ep: prepare configurable NVL group size by xtyao66 · Pull Request #1819 · ai-dynamo/nixl

xtyao66 · 2026-06-23T23:25:40Z

What?

Prepare NIXL EP for configurable CUDA-IPC/NVLink-local group sizing by adding the public nvl_group_size API, defaulting to 8.

Why?

GB200 deployments may expose 4 CUDA-IPC-local GPUs per worker, but EP HT currently assumes 8-rank local groups. This first PR is the small API/validation prep step.

How?

- Add nvl_group_size=8 to the EP Buffer constructor.
- Add validation and read-only helpers.
- Thread the value through buffer-size hints.
- Preserve default behavior.
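
The validation step above can be sketched in Python roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `validate_nvl_group_size`, the upper bound of 8, and the exact divisibility rule are assumptions inferred from the "range/divisibility" wording in the review summary.

```python
# Assumed upper bound, mirroring the NUM_MAX_NVL_PEERS constant named in the review.
NUM_MAX_NVL_PEERS = 8


def validate_nvl_group_size(nvl_group_size: int, num_ranks: int) -> int:
    """Hypothetical helper: return nvl_group_size if it passes the assumed checks."""
    # Range check (assumption): a group holds between 1 and NUM_MAX_NVL_PEERS ranks.
    if not 1 <= nvl_group_size <= NUM_MAX_NVL_PEERS:
        raise ValueError(
            f"nvl_group_size must be in [1, {NUM_MAX_NVL_PEERS}], got {nvl_group_size}"
        )
    # Divisibility check (assumption): once ranks span more than one group,
    # they must fill whole groups.
    if num_ranks > nvl_group_size and num_ranks % nvl_group_size != 0:
        raise ValueError(
            f"num_ranks ({num_ranks}) must be a multiple of nvl_group_size ({nvl_group_size})"
        )
    return nvl_group_size
```

Under these assumptions, a GB200-style call such as `validate_nvl_group_size(4, 8)` passes, while `validate_nvl_group_size(4, 10)` is rejected because 10 ranks do not fill whole groups of 4.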

Stack:

- This PR
- codex/nixl-ep-gb200-ht-groups
- codex/nixl-ep-gb200-ht-tests

Validation:

- DCO signed
- PR size gate: 54 modified-file additions
- `git diff --check`: pass

Summary by CodeRabbit

New Features
- Buffer initialization now supports a configurable nvl_group_size parameter for flexible NVLink group partitioning (default: 8).
- New getter methods get_nvl_rank() and get_nvl_group_size() enable querying NVLink topology details.
Documentation
- Updated Buffer constructor documentation to explicitly include the new nvl_group_size parameter.

Signed-off-by: xt66 <60164575+xtyao66@users.noreply.github.com>

copy-pr-bot · 2026-06-23T23:25:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-23T23:25:49Z

👋 Hi xtyao66! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

coderabbitai · 2026-06-23T23:31:35Z

📝 Walkthrough

Adds an optional nvl_group_size parameter (defaulting to NUM_MAX_NVL_PEERS, i.e. 8) to nixl_ep::Buffer and the two Config buffer-size hint methods. The parameter is validated, stored, and used in NVL/RDMA rank partitioning arithmetic. Two new accessors, get_nvl_rank() and get_nvl_group_size(), are introduced and exposed via pybind11 and the Python wrapper. The README is updated to reflect the new default.

Changes

nvl_group_size Parameter Addition

Layer / File(s)	Summary
C++ header contracts: Buffer member, constructor, getters, and Config sizing (`examples/device/ep/csrc/nixl_ep.hpp`, `examples/device/ep/csrc/config.hpp`)	Adds `nvl_group_size` data member and optional constructor parameter to `Buffer`, declares `get_nvl_rank`/`get_nvl_group_size`, and extends both `Config` sizing helpers with an optional `nvl_group_size` parameter that replaces hardcoded `NUM_MAX_NVL_PEERS` in assertions and rank partition formulas.
C++ constructor body, getter implementations, and pybind bindings (`examples/device/ep/csrc/nixl_ep.cpp`)	Extends the `Buffer` constructor to initialize `nvl_group_size` via a bounds/divisibility lambda, implements the two new getters, and updates pybind11 `Config` and `Buffer` bindings to expose the new parameter and methods.
Python Buffer wrapper and README docs (`examples/device/ep/nixl_ep/buffer.py`, `examples/device/ep/README.md`)	Adds `nvl_group_size: int = 8` to `Buffer.__init__` with range/divisibility validation, stores it and forwards it to the C++ runtime, adds Python `get_nvl_rank`/`get_nvl_group_size` delegating to C++, and updates the README Key APIs signature.
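
The "rank partition formulas" mentioned in the table can be illustrated with a short Python sketch. The formulas below follow a common convention (integer division for the RDMA rank, remainder for the NVL rank) and are an assumption; the diff itself is not shown on this page, and `RankLayout` and `partition_rank` are hypothetical names.

```python
from typing import NamedTuple


class RankLayout(NamedTuple):
    """Hypothetical container for the values derived from a global rank."""
    rdma_rank: int       # index of the NVL group this rank belongs to
    nvl_rank: int        # position of this rank inside its NVL group
    num_rdma_ranks: int  # number of NVL groups
    num_nvl_ranks: int   # ranks per NVL group actually in use


def partition_rank(rank: int, num_ranks: int, nvl_group_size: int = 8) -> RankLayout:
    """Split a global rank into group index and in-group index (assumed convention)."""
    return RankLayout(
        rdma_rank=rank // nvl_group_size,
        nvl_rank=rank % nvl_group_size,
        num_rdma_ranks=max(1, num_ranks // nvl_group_size),
        num_nvl_ranks=min(num_ranks, nvl_group_size),
    )
```

With 8 ranks and `nvl_group_size=4`, rank 5 lands in group 1 at local position 1; with the default of 8, the same rank stays in group 0 at local position 5.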

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 A new little knob for the NVL crew,
nvl_group_size — now eight by default is true!
Assertions checked, the ranks all align,
Python and C++ dance down the vine.
The README agrees, the getters are new —
Hop hop, dear buffer, the grouping shines through! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: preparing the EP module for configurable NVL group size by adding the nvl_group_size API parameter.
Description check	✅ Passed	The PR description comprehensively follows the template with clear What/Why/How sections, includes rationale, design approach, and validation details.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 80-90: The Buffer constructor now accepts a configurable
nvl_group_size parameter and validates it, but the init() method still uses the
hardcoded NUM_MAX_NVL_PEERS constant for rank partitioning and topology
derivation instead of using the configured nvl_group_size member variable.
Update the init() method to replace all references to NUM_MAX_NVL_PEERS with the
this->nvl_group_size member variable when performing rank partitioning, topology
initialization, and any related NVL/RDMA rank mapping calculations (including
those that affect get_nvl_rank() behavior) to ensure the configured group size
is actually used at runtime.
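
To make the concern in this comment concrete, here is a small Python sketch of how a hardcoded group size of 8 would disagree with a configured value. It is illustrative only: the modulo convention is assumed, and the function names are hypothetical rather than taken from `nixl_ep.cpp`.

```python
# Value of the hardcoded constant named in the review comment (assumed to be 8).
NUM_MAX_NVL_PEERS = 8


def nvl_rank_hardcoded(rank: int) -> int:
    """NVL rank as derived when the constant is used regardless of configuration."""
    return rank % NUM_MAX_NVL_PEERS


def nvl_rank_configured(rank: int, nvl_group_size: int) -> int:
    """NVL rank as derived when the configured group size is honored."""
    return rank % nvl_group_size


def mismatched_ranks(num_ranks: int, nvl_group_size: int) -> list[int]:
    """Global ranks whose NVL rank differs between the two derivations."""
    return [
        rank
        for rank in range(num_ranks)
        if nvl_rank_hardcoded(rank) != nvl_rank_configured(rank, nvl_group_size)
    ]
```

Under these assumptions, `mismatched_ranks(8, 4)` returns `[4, 5, 6, 7]`: every rank in the second group of four would be mapped inconsistently, while the default `nvl_group_size=8` yields no mismatches.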

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b019dd9b-9e79-4614-90e9-03930db419a2

📥 Commits

Reviewing files that changed from the base of the PR and between b775042 and 97c040b.

📒 Files selected for processing (5)

examples/device/ep/README.md
examples/device/ep/csrc/config.hpp
examples/device/ep/csrc/nixl_ep.cpp
examples/device/ep/csrc/nixl_ep.hpp
examples/device/ep/nixl_ep/buffer.py

xtyao66 · 2026-06-23T23:54:37Z

Superseded by the clean replacement PR #1820. The new PR lands the nvl_group_size API and runtime rank partitioning together so the earlier half-wired split is gone.

examples/ep: prepare configurable NVL group size

97c040b

Signed-off-by: xt66 <60164575+xtyao66@users.noreply.github.com>

xtyao66 requested review from a team, ebarilanM, itayalroy and rakhmets as code owners June 23, 2026 23:25

pull-request-size Bot added the size/M label Jun 23, 2026

github-actions Bot added the external-contribution label Jun 23, 2026

coderabbitai Bot reviewed Jun 23, 2026

Comment thread examples/device/ep/csrc/nixl_ep.cpp

xtyao66 closed this Jun 23, 2026

xtyao66 reopened this Jun 23, 2026

xtyao66 closed this Jun 23, 2026

xtyao66 deleted the codex/nixl-ep-nvl-group-size-api branch June 23, 2026 23:54

xtyao66 mentioned this pull request Jun 23, 2026

examples/ep: support configurable NVL group size #1820

Open

xtyao66 wants to merge 1 commit into ai-dynamo:main from xtyao66:codex/nixl-ep-nvl-group-size-api
