Hi, I am testing mKernel on a two-node NVIDIA H20 RoCE environment with mlx5 HCAs and CUDA 13. The GEMM + AllReduce path can run, but the other inter-node RDMA-heavy paths appear to depend on CUDA VMM + DMA-BUF memory registration.
VMM DMA-BUF MR registration fails with:
ibv_reg_dmabuf_mr (retain) failed: Invalid argument
ibv_reg_mr(gpu) failed: Bad address
Do you have plans to support an nvidia_peermem / ibv_reg_mr(cudaMalloc) fallback path? If yes, which design would you prefer: a global buffer-backing option, a per-kernel RDMA source/target option, or staging buffers?
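To make the question concrete, here is a minimal sketch of the fallback ordering such a path might use: try DMA-BUF registration first, and fall back to a peermem-style registration only if it fails. This is not mKernel code. The names `register_with_fallback`, `dmabuf_stub`, and `peermem_stub` are hypothetical, and the stubs stand in for the real ibv_reg_dmabuf_mr / ibv_reg_mr calls. It also assumes the buffer is allocated in a way the fallback backend can register.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


class RegistrationError(Exception):
    """Raised when a backend cannot register the buffer."""


@dataclass
class Registration:
    backend: str    # name of the backend that succeeded
    handle: object  # opaque MR handle returned by that backend


# A backend is (name, callable); the callable takes (addr, length) and
# either returns a handle or raises RegistrationError.
Backend = Tuple[str, Callable[[int, int], object]]


def register_with_fallback(addr: int, length: int,
                           backends: Sequence[Backend]) -> Registration:
    """Try each backend in order and return the first successful registration."""
    errors = []
    for name, register in backends:
        try:
            return Registration(name, register(addr, length))
        except RegistrationError as exc:
            errors.append(f"{name}: {exc}")
    # Every backend failed: report all errors together.
    raise RegistrationError("; ".join(errors))


# Hypothetical stand-ins for the real verbs calls (assumptions, not real APIs).
def dmabuf_stub(addr: int, length: int) -> object:
    # Mimics the DMA-BUF failure reported above.
    raise RegistrationError("Invalid argument")


def peermem_stub(addr: int, length: int) -> object:
    # Pretends a peermem-style registration succeeded.
    return ("mr", addr, length)
```

With backends ordered as `[("dmabuf", dmabuf_stub), ("peermem", peermem_stub)]`, the call returns a `Registration` whose `backend` is `"peermem"`. Whether this selection should live at the global buffer level or per kernel is exactly the design question above.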
Thanks!