Skip to content

nixlbench: change to range lookup in Neuron allocation tracking#1833

Open
fengjica wants to merge 1 commit into
ai-dynamo:mainfrom
fengjica:dev/trn_fix_neuron_allocaion_tracking
Open

nixlbench: change to range lookup in Neuron allocation tracking#1833
fengjica wants to merge 1 commit into
ai-dynamo:mainfrom
fengjica:dev/trn_fix_neuron_allocaion_tracking

Conversation

@fengjica

@fengjica fengjica commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Today in nixlbench, when data is copied from Neuron device to host, it checks the address of an IOV within the existing Neuron memory allocation. The latter is done once with a single large buffer, which has the base address different from the IOV's base address. Not found neuron allocation leads to nixlbench failure.

What?

Change to range lookup when searching in Neuron allocation tracking.

Summary by CodeRabbit

  • Bug Fixes
    • Improved device memory handling so copies and fills now work correctly even when operations target an address inside an existing allocation.
    • Fixed allocation tracking to better recognize the full range of device buffers, reducing the chance of missed or incorrect memory operations.
    • Added more helpful error reporting when device reads or writes fail.

@fengjica fengjica requested review from a team, aranadive, brminich and ovidiusm as code owners June 24, 2026 19:29
@copy-pr-bot

copy-pr-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Copy link
Copy Markdown

👋 Hi fengjica! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Device tensor allocations now track base addresses and sizes, and memory operations resolve arbitrary device VAs to the containing tensor plus byte offset. neuronMalloc, neuronFree, neuronMemcpy, and neuronMemset all use the updated lookup and access path.

Changes

Neuron address-range tracking

Layer / File(s) Summary
Tracker and lookup helper
benchmark/nixlbench/src/utils/neuron.cpp
std::map<uintptr_t, NrtAllocation> stores tensors by base VA with allocation size, and findTensorForVA returns the containing tensor and intra-allocation byte offset.
Malloc registration
benchmark/nixlbench/src/utils/neuron.cpp
neuronMalloc checks nrt_tensor_get_va, inserts NrtAllocation{tensor, buffer_size} under the computed base VA, and releases the tensor when no VA is returned.
Free, memcpy, and memset
benchmark/nixlbench/src/utils/neuron.cpp
neuronFree erases by base address, and neuronMemcpy/neuronMemset use the resolved offset for reads, writes, and chunked fills, with std::cerr logging on failures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I nibbled the map by moonlit light,
and found each tensor tucked just right.
Base hops in, offsets glide along,
memcpy and memset hum a bunny song.
Soft ears up—these bytes feel strong.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately describes the main change to Neuron allocation tracking.
Description check ✅ Passed The description includes the bug context and What/Why, and the missing How section is optional.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmark/nixlbench/src/utils/neuron.cpp`:
- Around line 202-214: The VA lookup in neuronMemcpy and neuronMemset only
confirms the starting address is inside an allocation, so you need to enforce
the full byte range before calling NRT. Update findTensorForVA to return the
remaining bytes in the tracked tensor, then in both neuronMemcpy and
neuronMemset reject requests where count exceeds that remaining space. Use the
existing tensor/offset flow to add the bounds check immediately after lookup and
before nrt_tensor_read, nrt_tensor_write, or any memset path.
- Around line 137-149: findTensorForVA currently returns a raw nrt_tensor* after
releasing allocation_tracker_mutex, which can leave neuronMemcpy and
neuronMemset with a dangling tensor if neuronFree runs concurrently. Update the
lookup flow around findTensorForVA and its callers to keep the tensor alive by
returning shared ownership from the allocation_tracker entry or by holding the
tracker lock through the NRT read/write operation, so the tensor cannot be freed
before use.
- Around line 195-196: The allocation tracking return value in the neuron
utility is relying on unsigned wraparound from std::map::erase, which is
implicit and brittle. Update the logic in the allocation-tracker path around
allocation_tracker.erase so the erased count is stored explicitly and converted
to a clear status code, returning 0 when the key is removed and -1 when it is
not found. Keep the fix localized to the code using allocation_tracker_mutex and
allocation_tracker.erase so the intent is obvious.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 63499274-5046-4495-b75c-d70c0a516ef6

📥 Commits

Reviewing files that changed from the base of the PR and between a9f456b and 07ca967.

📒 Files selected for processing (1)
  • benchmark/nixlbench/src/utils/neuron.cpp

Comment on lines +137 to +149
std::pair<nrt_tensor *, size_t>
findTensorForVA(const void *va) {
std::lock_guard lock{allocation_tracker_mutex};
uintptr_t addr = reinterpret_cast<uintptr_t>(va);

auto it = allocation_tracker.find(va);
if (it == allocation_tracker.end()) {
return nullptr;
// Find the first entry with base > addr, then step back one.
auto it = allocation_tracker.upper_bound(addr);
if (it == allocation_tracker.begin()) {
return {nullptr, 0};
}
--it;
if (addr < it->first + it->second.size) {
return {it->second.tensor.get(), addr - it->first};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Keep the tensor alive after lookup.

findTensorForVA releases allocation_tracker_mutex before callers use the returned raw nrt_tensor *. A concurrent neuronFree can erase the entry and free the tensor, leaving neuronMemcpy/neuronMemset with a dangling pointer. Return shared ownership or keep the tracker lock held through the NRT read/write.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/nixlbench/src/utils/neuron.cpp` around lines 137 - 149,
findTensorForVA currently returns a raw nrt_tensor* after releasing
allocation_tracker_mutex, which can leave neuronMemcpy and neuronMemset with a
dangling tensor if neuronFree runs concurrently. Update the lookup flow around
findTensorForVA and its callers to keep the tensor alive by returning shared
ownership from the allocation_tracker entry or by holding the tracker lock
through the NRT read/write operation, so the tensor cannot be freed before use.

Comment on lines 195 to +196
std::lock_guard lock{allocation_tracker_mutex};
return allocation_tracker.erase(addr) - 1;
return allocation_tracker.erase(reinterpret_cast<uintptr_t>(addr)) - 1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant file and nearby context.
sed -n '150,230p' benchmark/nixlbench/src/utils/neuron.cpp

# Find the declaration/type of allocation_tracker and related functions.
rg -n "allocation_tracker|free|malloc|erase\\(" benchmark/nixlbench/src/utils/neuron.cpp benchmark/nixlbench/src/utils -g '!**/node_modules/**'

Repository: ai-dynamo/nixl

Length of output: 5278


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the API surface and callers to understand the intended status convention.
rg -n "neuronFree\\(|neuronMalloc\\(|neuronMemcpy\\(|neuronMemset\\(" benchmark/nixlbench/src -g '!**/node_modules/**'

# Show the declarations around the public interface.
sed -n '1,220p' benchmark/nixlbench/src/utils/neuron.cpp

Repository: ai-dynamo/nixl

Length of output: 7909


Return the status explicitly instead of relying on unsigned wraparound. std::map::erase(key) returns 0 or 1; erase(...) - 1 underflows when the key is missing and then narrows to int. Use return erased == 1 ? 0 : -1;.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/nixlbench/src/utils/neuron.cpp` around lines 195 - 196, The
allocation tracking return value in the neuron utility is relying on unsigned
wraparound from std::map::erase, which is implicit and brittle. Update the logic
in the allocation-tracker path around allocation_tracker.erase so the erased
count is stored explicitly and converted to a clear status code, returning 0
when the key is removed and -1 when it is not found. Keep the fix localized to
the code using allocation_tracker_mutex and allocation_tracker.erase so the
intent is obvious.

Comment on lines +202 to +214
auto [tensor, offset] = findTensorForVA(device_addr);
if (tensor == nullptr) {
std::cerr << "neuronMemcpy: no allocation found for VA " << device_addr
<< " (kind=" << (kind == neuronMemcpyHostToDevice ? "H2D" : "D2H")
<< ", count=" << count << ")" << std::endl;
return -1;
}

int status;
if (kind == neuronMemcpyHostToDevice) {
return nrt_tensor_write(tensor, src, 0, count);
status = nrt_tensor_write(tensor, src, offset, count);
} else {
return nrt_tensor_read(tensor, dest, 0, count);
status = nrt_tensor_read(tensor, dest, offset, count);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Validate the full byte range, not just the starting VA.

The range lookup proves only that device_addr/addr starts inside an allocation. If offset + count exceeds the tracked allocation size, these paths can call nrt_tensor_read/write past the owning tensor. Have the lookup return remaining bytes and reject count > remaining before calling NRT.

Suggested direction
-std::pair<nrt_tensor *, size_t>
+struct TensorLookup {
+    nrt_tensor *tensor;
+    size_t offset;
+    size_t remaining;
+};
+
+TensorLookup
 findTensorForVA(const void *va) {
 ...
-    if (addr < it->first + it->second.size) {
-        return {it->second.tensor.get(), addr - it->first};
+    const size_t offset = addr - it->first;
+    if (offset < it->second.size) {
+        return {it->second.tensor.get(), offset, it->second.size - offset};
     }
-    return {nullptr, 0};
+    return {nullptr, 0, 0};
 }

Then check count <= lookup.remaining in both neuronMemcpy and neuronMemset.

Also applies to: 227-238

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/nixlbench/src/utils/neuron.cpp` around lines 202 - 214, The VA
lookup in neuronMemcpy and neuronMemset only confirms the starting address is
inside an allocation, so you need to enforce the full byte range before calling
NRT. Update findTensorForVA to return the remaining bytes in the tracked tensor,
then in both neuronMemcpy and neuronMemset reject requests where count exceeds
that remaining space. Use the existing tensor/offset flow to add the bounds
check immediately after lookup and before nrt_tensor_read, nrt_tensor_write, or
any memset path.

Today, when data is copied from Neuron device to host, it checks the address
of an IOV within the existing Neuron memory allocation. The latter is done
once with a single large buffer, which has the base address different from
the IOV's base address.
@fengjica fengjica force-pushed the dev/trn_fix_neuron_allocaion_tracking branch from 07ca967 to b9bfbf3 Compare June 24, 2026 20:17

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
benchmark/nixlbench/src/utils/neuron.cpp (3)

137-150: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Keep the tensor alive through the NRT operation.

Line 140 releases allocation_tracker_mutex before the raw nrt_tensor * returned from findTensorForVA is used by neuronMemcpy/neuronMemset; a concurrent neuronFree can erase and free the tensor first. Return shared ownership or hold the tracker lock through the read/write.

Also applies to: 203-215, 227-239

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/nixlbench/src/utils/neuron.cpp` around lines 137 - 150, The raw
nrt_tensor pointer returned by findTensorForVA can outlive
allocation_tracker_mutex, so a concurrent neuronFree may erase the tensor before
neuronMemcpy/neuronMemset uses it. Update findTensorForVA and its callers to
keep the tensor alive through the NRT operation, either by returning shared
ownership from allocation_tracker or by holding the tracker lock across the
read/write in neuronMemcpy and neuronMemset. Make sure the fix covers the same
pattern in the related call sites referenced by the review.

196-197: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Return the erase status explicitly.

Line 197 relies on std::map::erase() unsigned wraparound when the key is missing. Store the erased count and return 0 only when one entry was removed.

Proposed fix
     std::lock_guard lock{allocation_tracker_mutex};
-    return allocation_tracker.erase(reinterpret_cast<uintptr_t>(addr)) - 1;
+    const auto erased = allocation_tracker.erase(reinterpret_cast<uintptr_t>(addr));
+    return erased == 1 ? 0 : -1;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/nixlbench/src/utils/neuron.cpp` around lines 196 - 197, The erase
result in the allocation tracking path is implicit and depends on unsigned
wraparound, which should be made explicit. Update the logic in the allocation
tracker erase flow around the
`allocation_tracker.erase(reinterpret_cast<uintptr_t>(addr))` call to store the
erase count first, then return 1 when an entry was removed and 0 otherwise. Keep
the change localized to the code using `allocation_tracker_mutex` so the status
is clear and does not rely on subtraction behavior.

149-150: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Validate the full requested range before calling NRT.

The lookup only proves the starting VA is inside an allocation. If offset + count exceeds the tracked size, nrt_tensor_read/write can run past the owning tensor; return remaining bytes from the lookup and reject oversized count before read/write/memset chunks.

Also applies to: 211-215, 233-239

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/nixlbench/src/utils/neuron.cpp` around lines 149 - 150, The range
check in the neuron address lookup only validates the starting address, so
oversized requests can still overrun the owning tensor. Update the lookup and
the callers in the read/write/memset paths to return the remaining bytes from
the matched allocation and verify that offset + count stays within the tracked
size before issuing any NRT calls. Use the existing tensor lookup helper and the
chunked transfer logic in the affected read/write/memset routines to reject or
trim requests that would exceed the allocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmark/nixlbench/src/utils/neuron.cpp`:
- Around line 185-187: The allocation tracking in neuronMalloc is accepting
duplicate virtual addresses without checking whether allocation_tracker.emplace
succeeded, which can leave *addr pointing to freed memory while still returning
success. Update the neuronMalloc path around allocation_tracker/emplace to
verify the insert result before returning 0, and on a duplicate base value,
cleanly report failure and avoid exposing the newly allocated tensor through
*addr. Use the existing allocation_tracker_mutex, allocation_tracker, and
NrtAllocation flow to locate the fix.

---

Duplicate comments:
In `@benchmark/nixlbench/src/utils/neuron.cpp`:
- Around line 137-150: The raw nrt_tensor pointer returned by findTensorForVA
can outlive allocation_tracker_mutex, so a concurrent neuronFree may erase the
tensor before neuronMemcpy/neuronMemset uses it. Update findTensorForVA and its
callers to keep the tensor alive through the NRT operation, either by returning
shared ownership from allocation_tracker or by holding the tracker lock across
the read/write in neuronMemcpy and neuronMemset. Make sure the fix covers the
same pattern in the related call sites referenced by the review.
- Around line 196-197: The erase result in the allocation tracking path is
implicit and depends on unsigned wraparound, which should be made explicit.
Update the logic in the allocation tracker erase flow around the
`allocation_tracker.erase(reinterpret_cast<uintptr_t>(addr))` call to store the
erase count first, then return 1 when an entry was removed and 0 otherwise. Keep
the change localized to the code using `allocation_tracker_mutex` so the status
is clear and does not rely on subtraction behavior.
- Around line 149-150: The range check in the neuron address lookup only
validates the starting address, so oversized requests can still overrun the
owning tensor. Update the lookup and the callers in the read/write/memset paths
to return the remaining bytes from the matched allocation and verify that offset
+ count stays within the tracked size before issuing any NRT calls. Use the
existing tensor lookup helper and the chunked transfer logic in the affected
read/write/memset routines to reject or trim requests that would exceed the
allocation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2aa7dae1-c580-41f9-bd0e-5377ac9ab9da

📥 Commits

Reviewing files that changed from the base of the PR and between 07ca967 and b9bfbf3.

📒 Files selected for processing (1)
  • benchmark/nixlbench/src/utils/neuron.cpp

Comment on lines 185 to +187
std::lock_guard lock{allocation_tracker_mutex};
allocation_tracker.emplace(*addr, std::move(ptr));
uintptr_t base = reinterpret_cast<uintptr_t>(*addr);
allocation_tracker.emplace(base, NrtAllocation{tensor, buffer_size});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Handle duplicate VA insertion before returning success.

Line 187 ignores emplace failure. If base already exists, the temporary NrtAllocation{tensor, buffer_size} owns and frees the newly allocated tensor, but neuronMalloc still returns 0 with *addr pointing at that freed allocation.

Proposed fix
     std::lock_guard lock{allocation_tracker_mutex};
     uintptr_t base = reinterpret_cast<uintptr_t>(*addr);
-    allocation_tracker.emplace(base, NrtAllocation{tensor, buffer_size});
+    auto [it, inserted] = allocation_tracker.try_emplace(base, tensor, buffer_size);
+    if (!inserted) {
+        nrt_tensor_free(&tensor);
+        *addr = nullptr;
+        return -1;
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::lock_guard lock{allocation_tracker_mutex};
allocation_tracker.emplace(*addr, std::move(ptr));
uintptr_t base = reinterpret_cast<uintptr_t>(*addr);
allocation_tracker.emplace(base, NrtAllocation{tensor, buffer_size});
std::lock_guard lock{allocation_tracker_mutex};
uintptr_t base = reinterpret_cast<uintptr_t>(*addr);
auto [it, inserted] = allocation_tracker.try_emplace(base, tensor, buffer_size);
if (!inserted) {
nrt_tensor_free(&tensor);
*addr = nullptr;
return -1;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/nixlbench/src/utils/neuron.cpp` around lines 185 - 187, The
allocation tracking in neuronMalloc is accepting duplicate virtual addresses
without checking whether allocation_tracker.emplace succeeded, which can leave
*addr pointing to freed memory while still returning success. Update the
neuronMalloc path around allocation_tracker/emplace to verify the insert result
before returning 0, and on a duplicate base value, cleanly report failure and
avoid exposing the newly allocated tensor through *addr. Use the existing
allocation_tracker_mutex, allocation_tracker, and NrtAllocation flow to locate
the fix.

@fengjica fengjica requested a review from amitrad-aws June 26, 2026 19:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant