Make a first pass at using NVSHMEM4Py for host-side library management, etc. #4
benhg wants to merge 7 commits into hpcgroup:develop from
Conversation
First pass at replacing custom bindings with nvshmem4py versions
```python
uid_bytes = nvshmem_comm_cuda.NVSHMEMCommWrapper.get_unique_id_bytes()
uid_gpu = uid_bytes.to(device)
dist.broadcast(uid_gpu, src=0)
# Set device current
```
This should really be a helper function because it's used in both the benchmarks and the library itself. I wasn't sure where the best place to put it would be.
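As a purely illustrative sketch of what such a shared helper could look like (the function name, the injected `broadcast` callable, and the 128-byte placeholder size are all my assumptions, not the wrapper's API; the real version would call `NVSHMEMCommWrapper.get_unique_id_bytes()` and `dist.broadcast` on a GPU copy, as in the snippet above):

```python
def exchange_unique_id(rank, get_uid_bytes, broadcast):
    """Rank 0 generates the NVSHMEM unique id; every rank ends up with a copy.

    get_uid_bytes: callable returning the uid buffer (only invoked on rank 0)
    broadcast: callable(buf, src) -> buf; in the library this would wrap
               dist.broadcast over a GPU tensor holding the uid bytes
    """
    # Non-root ranks just need a same-sized placeholder to receive into.
    uid = get_uid_bytes() if rank == 0 else bytearray(128)  # size is illustrative
    return broadcast(uid, src=0)
```

Injecting the broadcast step keeps the helper usable from both the benchmarks (which set up their own process group) and the library itself.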
```diff
 if comm_wrapper is not None:
-    nvrar_tensor, nvrar_tensor_id = comm_wrapper.allocate_tensor(num_elems, dtype, device, nvshmem_comm_cuda.Protocol.LL8)
+    # Allocate symmetric tensor via nvshmem4py and register with wrapper
+    nvrar_tensor = nvshmem.tensor((num_elems,), dtype=dtype)
```
This is the first major difference. I couldn't think of a good way to handle the tensor_id bookkeeping purely in Python, so what I did is:
- Replace the tensor allocation with the nvshmem.core wrapper
- Keep the rest of the process in your C code (renaming allocate_tensor to register_tensor, since it no longer allocates)
```python
# This should be idempotent
cuda_dev.set_current()
stream = torch.cuda.current_stream()
```
```cmake
# Allow user override via -DCUDA_CCCL_INCLUDE_DIR
set(CUDA_CCCL_INCLUDE_DIR "" CACHE PATH "Path to CUDA CCCL include directory (contains cuda/std)")
set(_CUDA_ROOT "")
if(DEFINED ENV{CUDA_HOME})
```
This is hacky and terrible, and there is a better way to do it. In NVSHMEM's source, we handle it like this: https://github.com/NVIDIA/nvshmem/blob/2d7d25f0816235e3c2b51779571ec032606ea0dd/src/device/CMakeLists.txt#L188
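Something along these lines might be cleaner, letting CMake search rather than reconstructing the toolkit root by hand. This is only a sketch loosely modeled on the linked CMakeLists, not the upstream code; the `PATH_SUFFIXES` values are guesses and `CUDAToolkit_INCLUDE_DIRS` assumes a prior `find_package(CUDAToolkit)`:

```cmake
# Honor an explicit -DCUDA_CCCL_INCLUDE_DIR, otherwise search the toolkit.
set(CUDA_CCCL_INCLUDE_DIR "" CACHE PATH "Path to CUDA CCCL include directory (contains cuda/std)")
if(NOT CUDA_CCCL_INCLUDE_DIR)
  find_path(CUDA_CCCL_INCLUDE_DIR
    NAMES cuda/std/type_traits
    HINTS ${CUDAToolkit_INCLUDE_DIRS}
    PATH_SUFFIXES cccl targets/x86_64-linux/include)
endif()
if(NOT CUDA_CCCL_INCLUDE_DIR)
  message(FATAL_ERROR "Could not locate CUDA CCCL headers; set -DCUDA_CCCL_INCLUDE_DIR")
endif()
```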
```cpp
virtual void free_tensor(uint64_t id) = 0;
// Register an externally-allocated symmetric tensor (e.g., via nvshmem4py)
// Returns a newly assigned tensor id
virtual uint64_t register_external_tensor(torch::Tensor& t) = 0;
```
Here's the renaming I mentioned above.
```cpp
throw std::runtime_error("Failed to allocate signal memory");
uint64_t* seq_num_signal = nullptr;
// TODO:
if (steps_inter_ > 0) {
```
This is just here so my tests would pass on one node. Without this check in the 1-node case, the calloc fails: steps_inter_ is 0, so we allocate nothing, and the null-pointer check then treats that as an allocation failure.
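A small standalone demo of why the guard matters (the helper name is made up; the real code throws `std::runtime_error` instead of aborting). The C standard allows `calloc(0, size)` to return either `NULL` or a unique pointer, so an unguarded null check can misread a legal zero-size allocation as out-of-memory:

```c
#include <stdint.h>
#include <stdlib.h>

/* Guarded allocation: skip calloc entirely when steps_inter is 0 (single node),
 * so a NULL return from a zero-size calloc is never mistaken for a failure. */
static uint64_t *alloc_seq_num_signals(size_t steps_inter) {
    uint64_t *seq_num_signal = NULL;
    if (steps_inter > 0) {
        seq_num_signal = calloc(steps_inter, sizeof(uint64_t));
        if (!seq_num_signal)
            abort(); /* the real code throws std::runtime_error here */
    }
    return seq_num_signal;
}
```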
```cpp
void RecursiveLL8Coll::deregister_tensor(uint64_t id) {
    // TODO: Implement
```

```cpp
// TODO: Adding this so that I can test on 1-node. Is this valuable?
if (!chunk_signal_) {
    throw std::runtime_error("Failed to allocate chunk signal memory");
```

```cpp
// TODO: Adding this so that I can test on 1-node. Is this valuable?
if (steps_inter_ > 0) {
```
```python
uid_gpu = uid_bytes.to(f"cuda:{local_device}")
dist.broadcast(uid_gpu, src=0)
# Initialize NVSHMEM via nvshmem4py using UID method
cuda_dev = Device(local_device)
```
Oh, somehow I missed the notification for this PR last week. I will look over the comments and changes and respond to them as soon as possible.
Posting this here just for discussion and out of my own interest. This PR migrates from custom C/Python bindings to nvshmem4py where it's easy/simple. I'll leave some comments with questions around specific areas of the code.