On a clean(-er) PyTorch-Torchcomms integration

I am trying to set up a test bed for working with Torchcomms (NCCLX specifically) through PyTorch and its distributed infrastructure.

However, I am starting to question whether this is sustainably possible in the current state of things, mostly on the Torchcomms side.

As of now, I think there is no clean way to do this, because of a name collision and an API mismatch in the NCCL dependency.
Let me detail this.
1. PyTorch may require NCCL for its NCCL backend implementation.
2. Torchcomms requires PyTorch as a dependency, since it is a drop-in replacement for PyTorch's distributed backends.
3. But Torchcomms also requires an NCCL implementation of its own: a modified one shipped within Torchcomms, which differs from PyTorch's NCCL dependency (most likely the standard NCCL implementation from NVIDIA).

And now the dependency hell, which I have not figured out how to solve.
To keep the two NCCL installations apart, the "system-wide" one (the default implementation PyTorch is set to use) and the Torchcomms one (which resides in a different location, i.e. `${CONDA_LIB}` in Torchcomms's CMake terms), I forced the build-time RPATH of the Torchcomms binaries to include `${CONDA_LIB}`, so that every linker dependency points at the "right" binary location.
However, this is not enough once Python is involved, since Python loads extension modules dynamically at import time.
What happens is that if I run:
```python
import torchcomms
```
everything is fine.
However,
```python
import torchcomms._comms_ncclx

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: [...]/torchcomms/_comms_ncclx.cpython-312-x86_64-linux-gnu.so: undefined symbol: ncclReduceScatterQuantize
```
although
```text
ldd _comms_ncclx.cpython-312-x86_64-linux-gnu.so | grep "nccl"
	libnccl.so.2 => [...]/torchcomms/install/lib/libnccl.so.2 (0x00007ff96549d000)

nm -a [...]/torchcomms/install/lib/libnccl.so.2 | grep "Quantize"
000000000037edd0 t _ZL33validateReduceScatterQuantizeArgs14ncclDataType_tS_11ncclRedOp_tPm
000000000037f1c0 T ncclReduceScatterQuantize
00000000000b49c6 t ncclReduceScatterQuantize.cold
```
This is probably because, at some point during the import process, `import torch` is processed first and brings in its own `libnccl`. By the time `_comms_ncclx` is loaded, a `libnccl.so.2` is already present in the process, so any further `libnccl` dependency is probably resolved to that first occurrence, which is incompatible with Torchcomms.
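One way to test this hypothesis is to list which `libnccl` files are actually mapped into the Python process. The following is a minimal sketch, assuming a Linux system where `/proc/self/maps` is available; the helper names `mapped_libraries` and `mapped_in_this_process` are mine and not part of PyTorch or Torchcomms.

```python
import re


def mapped_libraries(maps_text, pattern=r"libnccl"):
    """Return the distinct file paths in a /proc/<pid>/maps dump that match pattern."""
    paths = []
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (pathname is optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        if re.search(pattern, path) and path not in paths:
            paths.append(path)
    return paths


def mapped_in_this_process(pattern=r"libnccl"):
    """Convenience wrapper (Linux only): inspect the current interpreter."""
    with open("/proc/self/maps") as fh:
        return mapped_libraries(fh.read(), pattern)
```

Calling `mapped_in_this_process()` right after `import torch`, and again after the failing `import torchcomms._comms_ncclx`, should show whether a `libnccl.so.2` from outside the Torchcomms install directory was mapped first.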

Hence, I am not sure how this could and should be solved, but one thing that quickly comes to mind is:
if Torchcomms uses a modified version of NCCL for its NCCLX implementation, why not name it accordingly, i.e. `libncclx.so` (and the same goes for the API, which a simple macro can easily switch between `nccl*/pnccl*` and `ncclx*/pncclx*`)?
This would avoid, once and for all, any collision with a system-wide NCCL installation.

I am open to a constructive discussion on this topic, and eventually to working on a PR to solve this issue if it is found relevant. I have already worked on the macro magic to easily set the Torchcomms NCCL APIs to `ncclx*/pncclx*` instead of relying on `rename_symbols.sh`, which does not work properly unless one also changes the name baked into the ELF sections.
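To check the names baked into the ELF dynamic section, one can read the `SONAME`, `NEEDED`, `RPATH` and `RUNPATH` entries reported by `readelf -d`. The following is a minimal sketch, assuming binutils' `readelf` is on `PATH`; the helper names are hypothetical, and the parsing relies on the usual `(TAG) ... [value]` layout of the `readelf -d` output.

```python
import re
import subprocess

# Matches e.g. " 0x...0e (SONAME)   Library soname: [libnccl.so.2]"
_ENTRY = re.compile(r"\((SONAME|NEEDED|RPATH|RUNPATH)\)\s+[^\[]*\[([^\]]*)\]")


def parse_dynamic_section(readelf_output):
    """Collect SONAME / NEEDED / RPATH / RUNPATH values from `readelf -d` text."""
    info = {"SONAME": [], "NEEDED": [], "RPATH": [], "RUNPATH": []}
    for line in readelf_output.splitlines():
        match = _ENTRY.search(line)
        if match:
            info[match.group(1)].append(match.group(2))
    return info


def dynamic_section(path):
    """Run `readelf -d` on a shared object and parse its dynamic entries."""
    result = subprocess.run(
        ["readelf", "-d", path], check=True, capture_output=True, text=True
    )
    return parse_dynamic_section(result.stdout)
```

If a renamed library still reports `libnccl.so.2` as its `SONAME`, or `_comms_ncclx` still lists `libnccl.so.2` under `NEEDED`, the collision described above would presumably remain regardless of the file name on disk.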

On a clean(-er) PyTorch-Torchcomms integration #2665
