examples/ep: prepare configurable NVL group size#1819

Closed
xtyao66 wants to merge 1 commit into
ai-dynamo:mainfrom
xtyao66:codex/nixl-ep-nvl-group-size-api

xtyao66 commented Jun 23, 2026

What?

Prepare NIXL EP for configurable CUDA-IPC/NVLink-local group sizing by adding the public nvl_group_size API, defaulting to 8.

Why?

GB200 deployments may expose 4 CUDA-IPC-local GPUs per worker, but EP HT currently assumes 8-rank local groups. This first PR is the small API/validation prep step.

How?

  • Add nvl_group_size=8 to the EP Buffer constructor.
  • Add validation and read-only helpers.
  • Thread the value through buffer-size hints.
  • Preserve default behavior.
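
The validation code itself is not quoted in this conversation. As a minimal sketch of what a range/divisibility check could look like, assuming the upper bound is 8 and that the total rank count must be a multiple of the group size (both are assumptions, and `validate_nvl_group_size` is a hypothetical helper name, not the PR's actual function):

```python
NUM_MAX_NVL_PEERS = 8  # assumed upper bound, matching the PR's default of 8


def validate_nvl_group_size(num_ranks: int, nvl_group_size: int = NUM_MAX_NVL_PEERS) -> int:
    """Hypothetical helper: return nvl_group_size if it passes both checks."""
    # Range check: the group must hold at least one rank and at most the assumed maximum.
    if not 1 <= nvl_group_size <= NUM_MAX_NVL_PEERS:
        raise ValueError(
            f"nvl_group_size must be in [1, {NUM_MAX_NVL_PEERS}], got {nvl_group_size}"
        )
    # Divisibility check (assumed): ranks must split evenly into NVL-local groups.
    if num_ranks % nvl_group_size != 0:
        raise ValueError(
            f"num_ranks ({num_ranks}) is not a multiple of nvl_group_size ({nvl_group_size})"
        )
    return nvl_group_size
```

Under these assumptions, a 16-rank job with `nvl_group_size=4` passes, while `nvl_group_size=3` or `nvl_group_size=16` is rejected. The real check may treat jobs smaller than one group differently.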

Stack:

  1. This PR
  2. codex/nixl-ep-gb200-ht-groups
  3. codex/nixl-ep-gb200-ht-tests

Validation:

  • DCO signed
  • PR size gate: 54 modified-file additions
  • git diff --check: pass

Summary by CodeRabbit

  • New Features

    • Buffer initialization now supports a configurable nvl_group_size parameter for flexible NVLink group partitioning (default: 8).
    • New getter methods get_nvl_rank() and get_nvl_group_size() enable querying NVLink topology details.
  • Documentation

    • Updated Buffer constructor documentation to explicitly include the new nvl_group_size parameter.

Signed-off-by: xt66 <60164575+xtyao66@users.noreply.github.com>
@xtyao66 xtyao66 requested review from a team, ebarilanM, itayalroy and rakhmets as code owners June 23, 2026 23:25
copy-pr-bot Bot commented Jun 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions


👋 Hi xtyao66! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

coderabbitai Bot commented Jun 23, 2026

📝 Walkthrough

Adds an optional nvl_group_size parameter (defaulting to NUM_MAX_NVL_PEERS, i.e. 8) to nixl_ep::Buffer and the two Config buffer-size hint methods. The parameter is validated, stored, and used in NVL/RDMA rank partitioning arithmetic. Two new accessors, get_nvl_rank() and get_nvl_group_size(), are introduced and exposed via pybind11 and the Python wrapper. The README is updated to reflect the new default.
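
The partitioning arithmetic is not spelled out in the walkthrough. A common convention, assumed here rather than confirmed by this PR, is that consecutive global ranks share an NVL group, so the RDMA rank is the quotient and the NVL rank is the remainder of the global rank divided by the group size. The function name `partition_rank` is hypothetical:

```python
def partition_rank(rank: int, num_ranks: int, nvl_group_size: int = 8) -> tuple[int, int, int]:
    """Return (rdma_rank, nvl_rank, num_rdma_ranks) for a global rank.

    Assumes consecutive global ranks are placed in the same NVL-local group.
    """
    rdma_rank = rank // nvl_group_size            # index of the NVL-local group
    nvl_rank = rank % nvl_group_size              # position inside that group
    num_rdma_ranks = num_ranks // nvl_group_size  # number of groups overall
    return rdma_rank, nvl_rank, num_rdma_ranks
```

With 8 ranks and `nvl_group_size=4`, global rank 5 maps to RDMA rank 1 and NVL rank 1 out of 2 groups; with the default of 8 the same rank stays in group 0 at NVL rank 5.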

Changes

nvl_group_size Parameter Addition

| Layer | File(s) | Summary |
|---|---|---|
| C++ header contracts: Buffer member, constructor, getters, and Config sizing | examples/device/ep/csrc/nixl_ep.hpp, examples/device/ep/csrc/config.hpp | Adds nvl_group_size data member and optional constructor parameter to Buffer, declares get_nvl_rank/get_nvl_group_size, and extends both Config sizing helpers with an optional nvl_group_size parameter that replaces hardcoded NUM_MAX_NVL_PEERS in assertions and rank partition formulas. |
| C++ constructor body, getter implementations, and pybind bindings | examples/device/ep/csrc/nixl_ep.cpp | Extends the Buffer constructor to initialize nvl_group_size via a bounds/divisibility lambda, implements the two new getters, and updates pybind11 Config and Buffer bindings to expose the new parameter and methods. |
| Python Buffer wrapper and README docs | examples/device/ep/nixl_ep/buffer.py, examples/device/ep/README.md | Adds nvl_group_size: int = 8 to Buffer.__init__ with range/divisibility validation, stores it and forwards it to the C++ runtime, adds Python get_nvl_rank/get_nvl_group_size delegating to C++, and updates the README Key APIs signature. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 A new little knob for the NVL crew,
nvl_group_size — now eight by default is true!
Assertions checked, the ranks all align,
Python and C++ dance down the vine.
The README agrees, the getters are new —
Hop hop, dear buffer, the grouping shines through! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: preparing the EP module for configurable NVL group size by adding the nvl_group_size API parameter. |
| Description check | ✅ Passed | The PR description comprehensively follows the template with clear What/Why/How sections, includes rationale, design approach, and validation details. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 80-90: The Buffer constructor now accepts a configurable
nvl_group_size parameter and validates it, but the init() method still uses the
hardcoded NUM_MAX_NVL_PEERS constant for rank partitioning and topology
derivation instead of using the configured nvl_group_size member variable.
Update the init() method to replace all references to NUM_MAX_NVL_PEERS with the
this->nvl_group_size member variable when performing rank partitioning, topology
initialization, and any related NVL/RDMA rank mapping calculations (including
those that affect get_nvl_rank() behavior) to ensure the configured group size
is actually used at runtime.
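
To illustrate why this matters, here is a small sketch, assuming the NVL rank is derived as the global rank modulo the group size (an assumption about init(), whose body is not shown here). It lists the ranks whose NVL rank, computed with the hardcoded constant, would fall outside the configured group. `HARDCODED_NVL_PEERS` and `out_of_group_ranks` are hypothetical names:

```python
HARDCODED_NVL_PEERS = 8  # stand-in for NUM_MAX_NVL_PEERS


def out_of_group_ranks(num_ranks: int, nvl_group_size: int) -> list[int]:
    """Ranks whose NVL rank, derived from the hardcoded constant,
    is not a valid index into a group of nvl_group_size ranks."""
    return [
        rank
        for rank in range(num_ranks)
        if rank % HARDCODED_NVL_PEERS >= nvl_group_size
    ]


print(out_of_group_ranks(8, 4))  # [4, 5, 6, 7]
print(out_of_group_ranks(8, 8))  # []
```

Under that assumption, the default of 8 is unaffected, while a configured group size of 4 leaves half of an 8-rank job with an NVL rank that disagrees with get_nvl_group_size().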
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b019dd9b-9e79-4614-90e9-03930db419a2

📥 Commits

Reviewing files that changed from the base of the PR and between b775042 and 97c040b.

📒 Files selected for processing (5)
  • examples/device/ep/README.md
  • examples/device/ep/csrc/config.hpp
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp
  • examples/device/ep/nixl_ep/buffer.py

Comment thread examples/device/ep/csrc/nixl_ep.cpp
@xtyao66 xtyao66 closed this Jun 23, 2026
@xtyao66 xtyao66 reopened this Jun 23, 2026
@xtyao66 xtyao66 closed this Jun 23, 2026
@xtyao66 xtyao66 deleted the codex/nixl-ep-nvl-group-size-api branch June 23, 2026 23:54
xtyao66 commented Jun 23, 2026
Author

Superseded by the clean replacement PR #1820. The new PR lands the nvl_group_size API and runtime rank partitioning together so the earlier half-wired split is gone.
