examples/ep: prepare configurable NVL group size#1819
Conversation
Signed-off-by: xt66 <60164575+xtyao66@users.noreply.github.com>
|
👋 Hi xtyao66! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
📝 WalkthroughWalkthroughAdds an optional Changesnvl_group_size Parameter Addition
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 80-90: The Buffer constructor now accepts a configurable
nvl_group_size parameter and validates it, but the init() method still uses the
hardcoded NUM_MAX_NVL_PEERS constant for rank partitioning and topology
derivation instead of using the configured nvl_group_size member variable.
Update the init() method to replace all references to NUM_MAX_NVL_PEERS with the
this->nvl_group_size member variable when performing rank partitioning, topology
initialization, and any related NVL/RDMA rank mapping calculations (including
those that affect get_nvl_rank() behavior) to ensure the configured group size
is actually used at runtime.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b019dd9b-9e79-4614-90e9-03930db419a2
📒 Files selected for processing (5)
examples/device/ep/README.mdexamples/device/ep/csrc/config.hppexamples/device/ep/csrc/nixl_ep.cppexamples/device/ep/csrc/nixl_ep.hppexamples/device/ep/nixl_ep/buffer.py
|
Superseded by the clean replacement PR #1820. The new PR lands the nvl_group_size API and runtime rank partitioning together so the earlier half-wired split is gone. |
What?
Prepare NIXL EP for configurable CUDA-IPC/NVLink-local group sizing by adding the public
nvl_group_sizeAPI, defaulting to 8.Why?
GB200 deployments may expose 4 CUDA-IPC-local GPUs per worker, but EP HT currently assumes 8-rank local groups. This first PR is the small API/validation prep step.
How?
nvl_group_size=8to the EP Buffer constructor.Stack:
codex/nixl-ep-gb200-ht-groupscodex/nixl-ep-gb200-ht-testsValidation:
git diff --check: passSummary by CodeRabbit
New Features
nvl_group_sizeparameter for flexible NVLink group partitioning (default: 8).get_nvl_rank()andget_nvl_group_size()enable querying NVLink topology details.Documentation
nvl_group_sizeparameter.