telemetry: don't unlink the producer's file when attaching fails#1826
telemetry: don't unlink the producer's file when attaching fails#1826EylonKrause wants to merge 2 commits into
Conversation
|
👋 Hi EylonKrause! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
📝 WalkthroughWalkthrough
ChangesPreserve backing file on open errors
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
sharedRingBuffer::openCyclicBuffer() is the reader/attach path (create=false), but on two error branches it unlink()s the shared-memory file it only opened: header mmap failure and version mismatch. A reader built from a different NIXL version (or attaching to an exporter file from an older run) therefore deletes the producer agent's live telemetry file from the filesystem. The reader's other two error paths -- "File too small for buffer data" and the final whole-buffer mmap failure -- already only munmap and throw without unlinking, so this just makes all of openCyclicBuffer consistent: a reader never removes a file it did not create. The unlink()s in createCyclicBuffer() (the create=true path) are correct and left unchanged -- a creator may remove a file it just made. The file_fd unique-ptr still closes the descriptor on the error path. Signed-off-by: Eylon Krause <eylon1909@gmail.com>
5047ba2 to
c2779a7
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/utils/common/cyclic_buffer.tpp (1)
221-224: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winUpdate telemetry docs to match new mismatch handling.
The C++ reader no longer unlinks on version mismatch, but
docs/telemetry.mdstill documents unlink-on-mismatch behavior. Please update the docs contract to avoid operator confusion.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/utils/common/cyclic_buffer.tpp` around lines 221 - 224, The version-mismatch handling in cyclic_buffer no longer unlinks the buffer, so the telemetry documentation contract is now outdated. Update docs/telemetry.md to reflect the current behavior described by the mismatch path in cyclic_buffer.tpp, using the existing version-mismatch handling and NIXL_ERROR semantics as the source of truth, and remove any mention that the reader unlinks on mismatch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/utils/common/cyclic_buffer.tpp`:
- Around line 221-224: The version-mismatch handling in cyclic_buffer no longer
unlinks the buffer, so the telemetry documentation contract is now outdated.
Update docs/telemetry.md to reflect the current behavior described by the
mismatch path in cyclic_buffer.tpp, using the existing version-mismatch handling
and NIXL_ERROR semantics as the source of truth, and remove any mention that the
reader unlinks on mismatch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 97832167-6985-4419-b9fa-14521125d7fd
📒 Files selected for processing (1)
src/utils/common/cyclic_buffer.tpp
|
/build |
What?
sharedRingBuffer<T>::openCyclicBuffer()(src/utils/common/cyclic_buffer.tpp)is the reader/attach path (
create == false), but on two error branches itunlink()s the shared-memory file it only opened:mmapfailure (line 211)This PR removes those two
unlink()calls.Why?
A reader is attaching to a file that a producer agent created and is actively
using. Deleting it on the reader's error path removes the producer's live
telemetry file from the filesystem. The most realistic trigger is a version
mismatch: a reader built from a different NIXL version (or attaching to an
exporter file from an older run) reads
version != TELEMETRY_VERSIONand unlinksthe producer's file.
The reader's other two error paths already do the right thing — "File too small
for buffer data" (~line 236) and the final whole-buffer
mmapfailure (~line 247)both only
munmapandthrow, without unlinking. So this just makes all ofopenCyclicBufferconsistent: a reader never removes a file it did not create.The
unlink()s increateCyclicBuffer()(thecreate == truepath) are correctand left unchanged — a creator may remove a file it just made. The
file_fdunique-ptr still closes the descriptor on the error path.
Reproduction
A self-contained fs+mmap reproducer: a "producer" creates the file stamped with an
older version, and a "reader" attaches expecting a newer version.
How (verification)
buffer_exporter.cpp,telemetry.cpp— whichinstantiate
sharedRingBuffer/ includecyclic_buffer.tpp) in-tree with-Dsanitizer=address,undefined(exit 0).Happy to add a GoogleTest regression (construct the reader with a mismatched
version under
EXPECT_THROW, then assert the file still exists) intest/gtest/if you'd like one.
Related Issues
None.
Summary by CodeRabbit