Make a first pass at using NVSHMEM4Py for host-side library management, etc. #4
benhg wants to merge 7 commits into hpcgroup:develop from
Conversation
First pass at replacing custom bindings with nvshmem4py versions
```python
uid_bytes = nvshmem_comm_cuda.NVSHMEMCommWrapper.get_unique_id_bytes()
uid_gpu = uid_bytes.to(device)
dist.broadcast(uid_gpu, src=0)
# Set device current
```
This should really be a helper function because it's used in both the benchmarks and the library itself. I wasn't sure where the best place to put it would be.
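As a purely illustrative sketch of what such a shared helper could look like (the function name, the injected `broadcast` callable, and the 128-byte placeholder size are all my assumptions, not the wrapper's API; the real version would call `NVSHMEMCommWrapper.get_unique_id_bytes()` and `dist.broadcast` on a GPU copy, as in the snippet above):

```python
def exchange_unique_id(rank, get_uid_bytes, broadcast):
    """Rank 0 generates the NVSHMEM unique id; every rank ends up with a copy.

    get_uid_bytes: callable returning the uid buffer (only invoked on rank 0)
    broadcast: callable(buf, src) -> buf; in the library this would wrap
               dist.broadcast over a GPU tensor holding the uid bytes
    """
    # Non-root ranks just need a same-sized placeholder to receive into.
    uid = get_uid_bytes() if rank == 0 else bytearray(128)  # size is illustrative
    return broadcast(uid, src=0)
```

Injecting the broadcast step keeps the helper usable from both the benchmarks (which set up their own process group) and the library itself.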
```diff
 if comm_wrapper is not None:
-    nvrar_tensor, nvrar_tensor_id = comm_wrapper.allocate_tensor(num_elems, dtype, device, nvshmem_comm_cuda.Protocol.LL8)
+    # Allocate symmetric tensor via nvshmem4py and register with wrapper
+    nvrar_tensor = nvshmem.tensor((num_elems,), dtype=dtype)
```
This is the first major difference. I couldn't think of a good way to handle the tensor_id bookkeeping purely in Python, so what I did is:
- Replace the tensor allocation with the nvshmem.core wrapper
- Keep the rest of the process in your C code (renaming allocate_tensor to register_tensor, since it no longer allocates)
```python
# This should be idempotent
cuda_dev.set_current()
stream = torch.cuda.current_stream()
```
```cmake
# Allow user override via -DCUDA_CCCL_INCLUDE_DIR
set(CUDA_CCCL_INCLUDE_DIR "" CACHE PATH "Path to CUDA CCCL include directory (contains cuda/std)")
set(_CUDA_ROOT "")
if(DEFINED ENV{CUDA_HOME})
```
This is hacky and terrible, and there is a better way to do it. In NVSHMEM's source, we handle it like this: https://github.com/NVIDIA/nvshmem/blob/2d7d25f0816235e3c2b51779571ec032606ea0dd/src/device/CMakeLists.txt#L188
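Something along these lines might be cleaner, letting CMake search rather than reconstructing the toolkit root by hand. This is only a sketch loosely modeled on the linked CMakeLists, not the upstream code; the `PATH_SUFFIXES` values are guesses and `CUDAToolkit_INCLUDE_DIRS` assumes a prior `find_package(CUDAToolkit)`:

```cmake
# Honor an explicit -DCUDA_CCCL_INCLUDE_DIR, otherwise search the toolkit.
set(CUDA_CCCL_INCLUDE_DIR "" CACHE PATH "Path to CUDA CCCL include directory (contains cuda/std)")
if(NOT CUDA_CCCL_INCLUDE_DIR)
  find_path(CUDA_CCCL_INCLUDE_DIR
    NAMES cuda/std/type_traits
    HINTS ${CUDAToolkit_INCLUDE_DIRS}
    PATH_SUFFIXES cccl targets/x86_64-linux/include)
endif()
if(NOT CUDA_CCCL_INCLUDE_DIR)
  message(FATAL_ERROR "Could not locate CUDA CCCL headers; set -DCUDA_CCCL_INCLUDE_DIR")
endif()
```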
```cpp
virtual void free_tensor(uint64_t id) = 0;
// Register an externally-allocated symmetric tensor (e.g., via nvshmem4py)
// Returns a newly assigned tensor id
virtual uint64_t register_external_tensor(torch::Tensor& t) = 0;
```
Here's the renaming I mentioned above.
```cpp
throw std::runtime_error("Failed to allocate signal memory");
uint64_t* seq_num_signal = nullptr;
// TODO:
if (steps_inter_ > 0) {
```
This is just here so my tests would pass on one node. Without this check in the 1-node case, the calloc fails: steps_inter_ is 0, so we allocate nothing, and the null-pointer check then treats that as an allocation failure.
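A small standalone demo of why the guard matters (the helper name is made up; the real code throws `std::runtime_error` instead of aborting). The C standard allows `calloc(0, size)` to return either `NULL` or a unique pointer, so an unguarded null check can misread a legal zero-size allocation as out-of-memory:

```c
#include <stdint.h>
#include <stdlib.h>

/* Guarded allocation: skip calloc entirely when steps_inter is 0 (single node),
 * so a NULL return from a zero-size calloc is never mistaken for a failure. */
static uint64_t *alloc_seq_num_signals(size_t steps_inter) {
    uint64_t *seq_num_signal = NULL;
    if (steps_inter > 0) {
        seq_num_signal = calloc(steps_inter, sizeof(uint64_t));
        if (!seq_num_signal)
            abort(); /* the real code throws std::runtime_error here */
    }
    return seq_num_signal;
}
```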
```cpp
void RecursiveLL8Coll::deregister_tensor(uint64_t id) {
    // TODO: Implement
```

```cpp
// TODO: Adding this so that I can test on 1-node. Is this valuable?
if (!chunk_signal_) {
    throw std::runtime_error("Failed to allocate chunk signal memory");
```

```cpp
// TODO: Adding this so that I can test on 1-node. Is this valuable?
if (steps_inter_ > 0) {
```
```python
uid_gpu = uid_bytes.to(f"cuda:{local_device}")
dist.broadcast(uid_gpu, src=0)
# Initialize NVSHMEM via nvshmem4py using UID method
cuda_dev = Device(local_device)
```
Oh, somehow I missed the notification for this PR last week. I will look over the comments and changes and respond to them as soon as possible.
Posting this here just for discussion and out of my own interest. This PR migrates from custom C/Python bindings to nvshmem4py where it's easy/simple. I'll leave some comments with questions around specific areas of the code.