EXAMPLES/DEVICE/EP/CSRC: Replaced torch::Event by cudaEvent_t wrapper. #1822

Open
rakhmets wants to merge 3 commits into ai-dynamo:main from rakhmets:topic/device-api-rm-torch-event

@rakhmets rakhmets commented Jun 24, 2026

Contributor

What?

Replaced torch::Event with nixl_ep::cuda::Event.
Cleaned up includes in nixl_ep.[c|h]pp.
Minor refactoring of nixl_ep::EventHandle.

Why?

This is part of #1793.
The original PR is divided into parts to make review easier.

Summary by CodeRabbit

  • New Features
    • Introduced RAII-based CUDA event support and a lightweight CUDA warning helper for clearer runtime messaging.
  • Bug Fixes
    • Improved event cleanup safety and centralized CUDA error reporting for operations like buffer destruction.
    • Updated cross-stream synchronization to use a direct CUDA event–based wait path.
  • Refactor
    • Reworked event synchronization to manage CUDA events internally, simplifying the public waiting APIs.
    • Updated CUDA-related header inclusion order to support the revised event flow.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 08c65954-cecb-49d5-92f5-53f205e76e78

📥 Commits

Reviewing files that changed from the base of the PR and between c0846fd and 3d0c532.

📒 Files selected for processing (5)
  • examples/device/ep/csrc/cuda_event.hpp
  • examples/device/ep/csrc/cuda_warn.hpp
  • examples/device/ep/csrc/event_handle.hpp
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp

📝 Walkthrough

Adds CUDA warning and event wrappers, refactors EventHandle to use them, and updates nixl_ep includes plus stream-synchronization call sites.

Changes

CUDA Event RAII Refactor

| Layer / File(s) | Summary |
| --- | --- |
| CUDA warning and event utilities (examples/device/ep/csrc/cuda_warn.hpp, examples/device/ep/csrc/cuda_event.hpp) | Adds nixl_ep::cuda::warn(...) for CUDA status logging and nixl_ep::cuda::Event with RAII cudaEvent_t ownership, move semantics, get(), and record(stream). |
| EventHandle uses cuda::Event (examples/device/ep/csrc/event_handle.hpp) | Changes EventHandle to hold std::shared_ptr<cuda::Event>, records events during construction, waits with cudaStreamWaitEvent via CUDA_CHECK, and removes the old free wait helpers. |
| nixl_ep includes and stream waits (examples/device/ep/csrc/nixl_ep.hpp, examples/device/ep/csrc/nixl_ep.cpp) | Reorders includes to use event_handle.hpp, updates the anonymous stream_wait helper, routes Buffer::destroy() warnings through cuda::warn, and switches three previous_event checks to previous_event->stream_wait(comm_stream). |
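
The RAII ownership and move semantics that the walkthrough attributes to nixl_ep::cuda::Event can be illustrated with a small sketch. This is not the code from the PR: fake_event_t, fake_event_create and fake_event_destroy are hypothetical stand-ins for cudaEvent_t, cudaEventCreateWithFlags and cudaEventDestroy, used so the example builds without the CUDA toolkit, and live_after_move is a helper invented for the demonstration.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for the CUDA runtime (assumption, not from the PR).
// Real code would use cudaEvent_t, cudaEventCreateWithFlags, cudaEventDestroy.
using fake_event_t = int *;
static int live_events = 0; // number of handles currently alive

static fake_event_t fake_event_create() {
    ++live_events;
    return new int(0);
}

static void fake_event_destroy(fake_event_t e) {
    --live_events;
    delete e;
}

// Move-only RAII owner: at most one Event owns a given handle at a time.
class Event {
public:
    Event() : event(fake_event_create()) {}
    ~Event() { reset(); }

    Event(const Event &) = delete;
    Event &operator=(const Event &) = delete;

    // Moving steals the handle and leaves the source empty.
    Event(Event &&other) noexcept : event(std::exchange(other.event, nullptr)) {}

    Event &operator=(Event &&other) noexcept {
        if (this != &other) {
            reset();
            event = std::exchange(other.event, nullptr);
        }
        return *this;
    }

    fake_event_t get() const noexcept { return event; }

private:
    // Destroys the handle if this object still owns one.
    void reset() noexcept {
        if (event != nullptr) {
            fake_event_destroy(event);
            event = nullptr;
        }
    }

    fake_event_t event;
};

// Returns how many handles are alive after one Event is moved into another.
static int live_after_move() {
    Event a;
    Event b(std::move(a));
    assert(a.get() == nullptr); // the moved-from object owns nothing
    return live_events;
}
```

Deleting the copy operations is what makes a double destroy impossible here: the handle is released exactly once, by whichever object last received it.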

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop hop, the streams now sync in tune,
A little CUDA moonlight, bright by noon.
Events wear RAII coats so neat,
And warning notes keep every hop complete.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: replacing torch::Event with a CUDA event wrapper. |
| Description check | ✅ Passed | The description covers the required What and Why sections and is sufficiently specific for this PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/device/ep/csrc/cuda_event.hpp`:
- Around line 28-67: The cuda::Event wrapper currently creates the event on
whichever device is current and then records it on an arbitrary caller stream,
which can cross CUDA contexts. Update Event::create() and
Event::record(cudaStream_t) to be device-aware by either taking an explicit
device index or deriving the stream’s device and switching to it before
cudaEventCreateWithFlags/cudaEventRecord, then restoring the previous device
state. Use the cuda::Event class, its create() helper, and record() method as
the main touchpoints.

In `@examples/device/ep/csrc/cuda_warn.hpp`:
- Around line 18-23: The header was already rewritten by the trailing-whitespace
pre-commit hook, so the PR still contains unstaged whitespace-only changes.
Update the `cuda_warn.hpp` header by committing the hook’s cleanup exactly as
produced, ensuring the whitespace-only edits are included in the change set so
CI can pass.
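
The first inline comment asks for the current device to be switched before the event call and restored afterwards. One common way to do that is a scoped guard, sketched below. It is an illustration only, not code from the PR: fake_get_device and fake_set_device are hypothetical stand-ins for cudaGetDevice and cudaSetDevice so that the example builds without the CUDA toolkit, and DeviceGuard and record_on_device are names invented here.

```cpp
#include <cassert>

// Hypothetical stand-ins for cudaGetDevice / cudaSetDevice (assumption).
static int current_device = 0;
static int fake_get_device() { return current_device; }
static void fake_set_device(int device) { current_device = device; }

// Switches to `target` on construction and restores the previous device
// when the guard goes out of scope, including on early return.
class DeviceGuard {
public:
    explicit DeviceGuard(int target) : previous(fake_get_device()) {
        if (previous != target) {
            fake_set_device(target);
        }
    }

    ~DeviceGuard() {
        if (fake_get_device() != previous) {
            fake_set_device(previous);
        }
    }

    DeviceGuard(const DeviceGuard &) = delete;
    DeviceGuard &operator=(const DeviceGuard &) = delete;

private:
    int previous;
};

// Returns the device that was current while the "record" step ran.
// A real version would call cudaEventRecord where the comment indicates.
static int record_on_device(int stream_device) {
    DeviceGuard guard(stream_device);
    // ... event creation or recording would happen here ...
    return fake_get_device();
}
```

Whether the wrapper should take an explicit device index or derive it from the stream is left open by the review comment; the guard works with either choice.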
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 52f639c5-a8c5-43d7-bc54-d013224c630b

📥 Commits

Reviewing files that changed from the base of the PR and between b11012e and b7ae5b5.

📒 Files selected for processing (5)
  • examples/device/ep/csrc/cuda_event.hpp
  • examples/device/ep/csrc/cuda_warn.hpp
  • examples/device/ep/csrc/event_handle.hpp
  • examples/device/ep/csrc/nixl_ep.cpp
  • examples/device/ep/csrc/nixl_ep.hpp

Comment thread examples/device/ep/csrc/cuda_event.hpp
Comment thread examples/device/ep/csrc/cuda_warn.hpp Outdated
event = nullptr;
}

cudaEvent_t event;

Contributor

event_

Contributor Author

I tried to keep the same code style as in nixl_ep.[h|c]pp.
Do you think we should change the code style here?

Comment thread examples/device/ep/csrc/cuda_warn.hpp Outdated

namespace nixl_ep::cuda {
inline void
warn(cudaError_t status, const char *caller, const char *operation) noexcept {

Contributor

Is the noexcept intentional?

Contributor Author

Yes, this function is supposed to be used when objects are destroyed.
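
For context on that answer: destructors are implicitly noexcept, so an exception escaping one calls std::terminate, which is why a reporting helper used during cleanup is usually written so that it cannot throw. The sketch below shows the idea under stated assumptions. Only the warn(status, caller, operation) signature comes from the snippet above; the function body, the fake_status_t stand-in for cudaError_t (with 0 playing the role of cudaSuccess), the warnings_emitted counter, and the Resource type are all hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for cudaError_t; 0 plays the role of cudaSuccess.
using fake_status_t = int;
static int warnings_emitted = 0; // demonstration aid only

// Reports a failed call without throwing, so it is safe inside destructors.
// The body is an assumption; the PR's implementation is not shown here.
inline void
warn(fake_status_t status, const char *caller, const char *operation) noexcept {
    if (status == 0) {
        return; // nothing to report on success
    }
    ++warnings_emitted;
    std::fprintf(stderr, "WARNING: %s: %s failed with status %d\n",
                 caller, operation, status);
}

// Hypothetical resource whose release step can fail.
struct Resource {
    fake_status_t release_status;

    // Cleanup failures are logged instead of thrown.
    ~Resource() { warn(release_status, "Resource::~Resource", "release"); }
};

// Destroys one healthy and one failing resource, then returns the number
// of warnings that were emitted.
static int warnings_after_cleanup() {
    {
        Resource ok{0};
        Resource bad{7};
    }
    return warnings_emitted;
}
```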

rakhmets added 3 commits June 24, 2026 18:39
Signed-off-by: Raul Akhmetshin <rakhmetshin@nvidia.com>
Signed-off-by: Raul Akhmetshin <rakhmetshin@nvidia.com>
Signed-off-by: Raul Akhmetshin <rakhmetshin@nvidia.com>
@rakhmets rakhmets force-pushed the topic/device-api-rm-torch-event branch from 8e0367d to 3d0c532 Compare June 24, 2026 15:39
@rakhmets rakhmets removed request for a team, aranadive, brminich and ovidiusm June 24, 2026 15:40
@ai-dynamo ai-dynamo deleted a comment from copy-pr-bot Bot Jun 24, 2026
@rakhmets

Contributor Author

/build
