telemetry: don't unlink the producer's file when attaching fails#1826

Open
EylonKrause wants to merge 2 commits into ai-dynamo:main from EylonKrause:fix/telemetry-reader-unlink

EylonKrause commented Jun 24, 2026

What?

sharedRingBuffer<T>::openCyclicBuffer() (src/utils/common/cyclic_buffer.tpp)
is the reader/attach path (create == false), but on two error branches it
unlink()s the shared-memory file it only opened:

  • header mmap failure (line 211)
  • version mismatch (line 223)

This PR removes those two unlink() calls.

Why?

A reader is attaching to a file that a producer agent created and is actively
using. Deleting it on the reader's error path removes the producer's live
telemetry file from the filesystem. The most realistic trigger is a version
mismatch: a reader built from a different NIXL version (or attaching to an
exporter file from an older run) reads version != TELEMETRY_VERSION and unlinks
the producer's file.

The reader's other two error paths already do the right thing — "File too small
for buffer data" (~line 236) and the final whole-buffer mmap failure (~line 247)
both only munmap and throw, without unlinking. So this just makes all of
openCyclicBuffer consistent: a reader never removes a file it did not create.
The unlink()s in createCyclicBuffer() (the create == true path) are correct
and left unchanged — a creator may remove a file it just made. The file_fd
unique-ptr still closes the descriptor on the error path.
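
To illustrate the last point, here is a minimal sketch of a descriptor guard whose deleter only closes the descriptor and never touches the path. The names `FdCloser`, `UniqueFd`, `attach_and_fail`, `fd_is_open` and `g_last_fd` are hypothetical and are not taken from cyclic_buffer.tpp; the real `file_fd` type may be declared differently.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical deleter: closes the descriptor, never unlinks the path.
struct FdCloser {
    void operator()(int *fd) const {
        if (fd == nullptr) return;
        if (*fd >= 0) ::close(*fd);
        delete fd;
    }
};
using UniqueFd = std::unique_ptr<int, FdCloser>;

// Records the last descriptor opened, only so a caller can check it was closed.
static int g_last_fd = -1;

// Hypothetical reader-side attach that always fails after a successful open.
// It stands in for the header-mmap-failure and version-mismatch branches.
void attach_and_fail(const std::string &path) {
    UniqueFd fd(new int(-1));
    *fd = ::open(path.c_str(), O_RDONLY);
    if (*fd < 0) throw std::runtime_error("open failed: " + path);
    g_last_fd = *fd;
    // Error path: throw without calling unlink(path). The guard closes the fd
    // during stack unwinding, and the file stays on disk for its producer.
    throw std::runtime_error("version mismatch");
}

// True if fd still refers to an open descriptor in this process.
bool fd_is_open(int fd) {
    return ::fcntl(fd, F_GETFD) != -1;
}
```

After `attach_and_fail(path)` throws, `fd_is_open(g_last_fd)` is expected to be false in a single-threaded caller, while the path still exists.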

Reproduction

A self-contained fs+mmap reproducer: a "producer" creates the file stamped with an
older version, and a "reader" attaches expecting a newer version.

BEFORE:  file present before reader: yes -> version mismatch -> present after: NO   (reader deleted it)
AFTER :  file present before reader: yes -> version mismatch -> present after: yes  (intact)
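
For readers who want to try this locally, a reproducer along these lines could look like the sketch below. It is not the exact program used for the output above: `DemoHeader`, `producer_create`, `reader_attach` and `present_after_mismatch` are hypothetical names, the one-field header is an assumption (the real header in cyclic_buffer.tpp may differ), and the `unlink_on_mismatch` flag exists only to model the before and after behavior side by side.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical header layout, assumed for this demo only.
struct DemoHeader {
    std::uint32_t version;
};

// "Producer": create the file and stamp it with the given version.
void producer_create(const std::string &path, std::uint32_t version) {
    int fd = ::open(path.c_str(), O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0) throw std::runtime_error("producer: open failed");
    DemoHeader hdr{version};
    ssize_t n = ::write(fd, &hdr, sizeof(hdr));
    ::close(fd);
    if (n != static_cast<ssize_t>(sizeof(hdr)))
        throw std::runtime_error("producer: short write");
}

// "Reader": map the header and compare versions.
// unlink_on_mismatch == true models the pre-fix behavior, false the fixed one.
void reader_attach(const std::string &path, std::uint32_t expected,
                   bool unlink_on_mismatch) {
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("reader: open failed");
    void *p = ::mmap(nullptr, sizeof(DemoHeader), PROT_READ, MAP_SHARED, fd, 0);
    ::close(fd);  // the mapping, if any, stays valid after close
    if (p == MAP_FAILED) throw std::runtime_error("reader: header mmap failed");
    std::uint32_t found = static_cast<const DemoHeader *>(p)->version;
    ::munmap(p, sizeof(DemoHeader));
    if (found != expected) {
        if (unlink_on_mismatch) ::unlink(path.c_str());  // the bug being removed
        throw std::runtime_error("reader: version mismatch");
    }
}

bool file_exists(const std::string &path) {
    struct stat st;
    return ::stat(path.c_str(), &st) == 0;
}

// Runs producer (version 1) then reader (expects version 2) and reports
// whether the file is still present afterwards. Cleans up the demo file.
bool present_after_mismatch(const std::string &path, bool unlink_on_mismatch) {
    producer_create(path, 1);
    try {
        reader_attach(path, 2, unlink_on_mismatch);
    } catch (const std::runtime_error &) {
        // expected: the versions differ
    }
    bool present = file_exists(path);
    if (present) ::unlink(path.c_str());
    return present;
}
```

Calling `present_after_mismatch(path, true)` models the BEFORE line and `present_after_mismatch(path, false)` models the AFTER line. The sketch uses a regular file under a caller-chosen path rather than the telemetry directory.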

How (verification)

  • Confirmed the before/after with the reproducer above.
  • Compiled the telemetry consumers (buffer_exporter.cpp, telemetry.cpp — which
    instantiate sharedRingBuffer / include cyclic_buffer.tpp) in-tree with
    -Dsanitizer=address,undefined (exit 0).

Happy to add a GoogleTest regression (construct the reader with a mismatched
version under EXPECT_THROW, then assert the file still exists) in test/gtest/
if you'd like one.

Related Issues

None.

Summary by CodeRabbit

  • Bug Fixes
    • Improved error handling when opening cyclic buffers: on header mapping failures or on-disk version mismatches, the backing file is now preserved instead of being removed.
    • Errors are still logged and reported, but the existing file remains available for inspection and recovery.

@EylonKrause EylonKrause requested a review from a team as a code owner June 24, 2026 15:12
copy-pr-bot Bot commented Jun 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi EylonKrause! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

coderabbitai Bot commented Jun 24, 2026

Walkthrough

openCyclicBuffer no longer removes the backing file on header mmap failure or version mismatch. Both branches still log the error and throw std::runtime_error.

Changes

Preserve backing file on open errors

Layer / File(s): Remove unlink on open failure paths (src/utils/common/cyclic_buffer.tpp)
Summary: Updates the file header comment and removes unlink(name.c_str()) from the header mmap failure branch and the version-mismatch branch in openCyclicBuffer.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐇 Two unlinks hopped out of sight,
The buffer’s file stayed put tonight.
On mmap woes and version fuss,
It logs and throws, but keeps the husk.
A gentler hop for error-light.

🚥 Pre-merge checks | ✅ 5 passed

  • Title check: ✅ Passed. The title accurately summarizes the main change: stopping reader-side unlinking when attaching fails.
  • Description check: ✅ Passed. The description follows the template well with What, Why, How, and Related Issues sections and includes verification details.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

sharedRingBuffer::openCyclicBuffer() is the reader/attach path
(create=false), but on two error branches it unlink()s the shared-memory
file it only opened: header mmap failure and version mismatch. A reader
built from a different NIXL version (or attaching to an exporter file from
an older run) therefore deletes the producer agent's live telemetry file
from the filesystem.

The reader's other two error paths -- "File too small for buffer data"
and the final whole-buffer mmap failure -- already only munmap and throw
without unlinking, so this just makes all of openCyclicBuffer consistent:
a reader never removes a file it did not create. The unlink()s in
createCyclicBuffer() (the create=true path) are correct and left
unchanged -- a creator may remove a file it just made. The file_fd
unique-ptr still closes the descriptor on the error path.

Signed-off-by: Eylon Krause <eylon1909@gmail.com>
@EylonKrause EylonKrause force-pushed the fix/telemetry-reader-unlink branch from 5047ba2 to c2779a7 Compare June 24, 2026 15:27

coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/utils/common/cyclic_buffer.tpp (1)

221-224: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Update telemetry docs to match new mismatch handling.

The C++ reader no longer unlinks on version mismatch, but docs/telemetry.md still documents unlink-on-mismatch behavior. Please update the docs contract to avoid operator confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/common/cyclic_buffer.tpp` around lines 221 - 224, The
version-mismatch handling in cyclic_buffer no longer unlinks the buffer, so the
telemetry documentation contract is now outdated. Update docs/telemetry.md to
reflect the current behavior described by the mismatch path in
cyclic_buffer.tpp, using the existing version-mismatch handling and NIXL_ERROR
semantics as the source of truth, and remove any mention that the reader unlinks
on mismatch.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 97832167-6985-4419-b9fa-14521125d7fd

📥 Commits

Reviewing files that changed from the base of the PR and between 5047ba2 and c2779a7.

📒 Files selected for processing (1)
  • src/utils/common/cyclic_buffer.tpp

@iyastreb

Contributor

/build
