Fix(io): track RDMA notification completions in transfer status #434

Open
amd-dlimpus wants to merge 2 commits into ROCm:main from amd-dlimpus:fix-io-notify-completion-status
@amd-dlimpus

Motivation

Mori can use a separate network message to tell the remote side that a data transfer has finished. Before this change, a transfer could be marked successful as soon as the data write completed, even though that completion message was still in progress.

That created a race: the sender could report success, but the receiver might never get the completion message if that later send failed. In that case, both sides could disagree about whether the transfer was actually complete.

Technical Details

This change makes the data write and its completion message part of the same tracked operation.

The transfer is now marked successful only after both pieces have completed:

  • the data write
  • the completion message sent to the remote side

If sending the completion message fails, the transfer status is set to failed rather than left at an earlier success.
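
To illustrate the rule above, here is a minimal sketch of a status tracker that counts the data write and the completion-message send as one operation. The names `TransferStatus`, `StatusCode`, and `OnCompletion` are hypothetical and are not taken from the Mori sources; the actual implementation in this PR may be structured differently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical status codes; the real Mori enum may differ.
enum class StatusCode { kInProgress, kSuccess, kFailed };

// Hypothetical tracker for one transfer: the data write plus its
// completion-message send are both counted in `expected`.
class TransferStatus {
 public:
  explicit TransferStatus(uint32_t expected) : pending_(expected) {}

  // Called once per completed work request, in any order.
  void OnCompletion(bool ok) {
    // Record the failure before decrementing, so that a reader who sees
    // pending_ == 0 is guaranteed to also see the failure flag.
    if (!ok) failed_.store(true);
    uint32_t before = pending_.fetch_sub(1);
    assert(before > 0 && "more completions than submitted work requests");
    (void)before;
  }

  StatusCode Code() const {
    // Read the counter first, then the flag (see the comment above).
    uint32_t remaining = pending_.load();
    if (failed_.load()) return StatusCode::kFailed;  // failure is sticky
    return remaining == 0 ? StatusCode::kSuccess : StatusCode::kInProgress;
  }

 private:
  std::atomic<uint32_t> pending_;
  std::atomic<bool> failed_{false};
};
```

With `expected = 2`, a successful data write alone leaves the status in progress. Success is reported only after the completion-message send also completes, and a failed send yields a failure regardless of arrival order.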

The implementation reuses the existing submission tracking path for completion-message sends, so these sends are handled the same way as other network work requests. A small fallback remains for older completion-message identifiers.
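
As a rough sketch of that idea, the snippet below routes every completion through one ledger keyed by work-request ID, and falls back to a separate path only for identifiers that were never registered. Everything here is an assumption for illustration: `SubmissionLedger`, `WrKind`, `Route`, and in particular the `kLegacyNotifyTag` bit used to recognize older completion-message identifiers are hypothetical and do not describe Mori's real ID encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

enum class WrKind { kDataWrite, kNotifySend };
enum class Route { kTracked, kLegacyNotify, kUnknown };

// Hypothetical marker for identifiers produced by the older notification path.
constexpr uint64_t kLegacyNotifyTag = 1ull << 63;

class SubmissionLedger {
 public:
  // Register a posted work request; notification sends use the same call
  // as data writes.
  void Track(uint64_t wr_id, WrKind kind) { inflight_[wr_id] = kind; }

  // Decide how a completion with this wr_id should be handled.
  Route Complete(uint64_t wr_id) {
    auto it = inflight_.find(wr_id);
    if (it != inflight_.end()) {
      inflight_.erase(it);
      return Route::kTracked;  // common path for all tracked work requests
    }
    if (wr_id & kLegacyNotifyTag) return Route::kLegacyNotify;  // fallback
    return Route::kUnknown;
  }

  std::size_t InFlight() const { return inflight_.size(); }

 private:
  std::unordered_map<uint64_t, WrKind> inflight_;
};
```

Keeping a single lookup path means a failed notification send reaches the same status-update logic as a failed data write, so it cannot be silently dropped.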

Test Plan

Added a focused C++ regression test that verifies:

  • completing the data write alone does not mark the transfer successful while the completion message is still pending
  • a completion-message failure is not overwritten by a later success update

Also ran editor diagnostics on the touched files.

Test Result

Editor diagnostics passed.

A local C++ syntax check could not complete in this environment because a required Boost header is missing: boost/predef/other/endian.h.

Submission Checklist

dlimpus added 2 commits June 26, 2026 10:15
When transfer chunking rewrites batch accounting from request count to final WR count, keep the notification SEND completions in the total so success cannot be skipped or reported early when CQEs arrive out of order.

Signed-off-by: dlimpus <dlimpus@users.noreply.github.com>
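
The accounting described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the code in the commit: `ChunkCount` and `ExpectedCompletions` are hypothetical helpers, every posted work request is assumed to produce exactly one CQE, and a zero-length request is assumed to still post one work request.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of work requests one request becomes after chunking
// (ceil(bytes / max_chunk), at least one by assumption).
inline uint32_t ChunkCount(std::size_t bytes, std::size_t max_chunk) {
  if (bytes == 0) return 1;
  return static_cast<uint32_t>((bytes + max_chunk - 1) / max_chunk);
}

// Total completions the batch must observe before it can report success:
// all chunked data WRs plus the notification SENDs.
inline uint32_t ExpectedCompletions(const std::vector<std::size_t>& request_bytes,
                                    std::size_t max_chunk,
                                    uint32_t notify_sends) {
  uint32_t total = notify_sends;  // keep notification SENDs in the total
  for (std::size_t bytes : request_bytes) {
    total += ChunkCount(bytes, max_chunk);
  }
  return total;
}
```

For two requests of 4096 and 10000 bytes with a 4096-byte chunk limit and one notification SEND, the expected total is 1 + 3 + 1 = 5. If the SEND were left out of the total, the batch would look complete after four CQEs even when the SEND's CQE had not yet arrived.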