I am trying to set up a test bed to work with Torchcomms (and NCCLX specifically) via PyTorch and its distributed infrastructure.
However, I am starting to question whether this is even (sustainably) possible in the current state of things, mostly on the Torchcomms side.
Basically, as of now, I think this cannot be done in a clean way because of a name collision and an API mismatch in the NCCL dependency.
Let me detail this.
- PyTorch might require NCCL to support its NCCL backend implementation
- Torchcomms requires PyTorch as a dependency since it is a drop-in replacement for its distributed backends
- But... Torchcomms also requires an NCCL implementation of its own: a modified version shipped within Torchcomms, which differs from the one PyTorch depends on (most likely the standard NCCL implementation from NVIDIA).
And now... the dependency hell, which I can't figure out how to solve.
Let me mention up front that, to better separate the "system-wide" NCCL installation (the default implementation PyTorch is set to use) from the Torchcomms one (which resides in a different location, i.e. ${CONDA_LIB} in Torchcomms's CMake terms), I've forced the build-time RPATH of the Torchcomms binaries to include ${CONDA_LIB}, so that every linker dependency points at the "right" binary location.
However, this is not enough when dealing with Python, since it brings in its own dynamic loading mechanism.
So, what happens is that if I try to do:
everything is fine.
However,

    import torchcomms._comms_ncclx
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: [...]/torchcomms/_comms_ncclx.cpython-312-x86_64-linux-gnu.so: undefined symbol: ncclReduceScatterQuantize

although

    ldd _comms_ncclx.cpython-312-x86_64-linux-gnu.so | grep "nccl"
    libnccl.so.2 => [...]/torchcomms/install/lib/libnccl.so.2 (0x00007ff96549d000)
    nm -a [...]/torchcomms/install/lib/libnccl.so.2 | grep "Quantize"
    000000000037edd0 t _ZL33validateReduceScatterQuantizeArgs14ncclDataType_tS_11ncclRedOp_tPm
    000000000037f1c0 T ncclReduceScatterQuantize
    00000000000b49c6 t ncclReduceScatterQuantize.cold

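To double-check that the symbol is a global one (type `T`) and not just a local helper (type `t`), the `nm` output above can be filtered with a few lines of Python. This is only a sketch: the helper name `exported_text_symbols` is mine and not part of Torchcomms, and running `nm -D` on the library would be a more direct look at the dynamic symbol table.

```python
def exported_text_symbols(nm_output: str) -> set[str]:
    """Return the names of global text symbols (type 'T') found in `nm` output."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols are printed as three fields: address, type, name.
        if len(parts) == 3 and parts[1] == "T":
            symbols.add(parts[2])
    return symbols


# The three lines reported by `nm -a ... | grep "Quantize"` above.
sample = """\
000000000037edd0 t _ZL33validateReduceScatterQuantizeArgs14ncclDataType_tS_11ncclRedOp_tPm
000000000037f1c0 T ncclReduceScatterQuantize
00000000000b49c6 t ncclReduceScatterQuantize.cold
"""

print(exported_text_symbols(sample))  # {'ncclReduceScatterQuantize'}
```

So the Torchcomms build of libnccl.so.2 does define the missing symbol, which points at the wrong library being used at import time rather than at a broken build.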
This is probably because, at some point in the import process, import torch is processed and brings in its own libnccl. By the time the Torchcomms extension is loaded, a libnccl.so.2 is already present in the process, so any further libnccl dependency is probably resolved to that first (incompatible for Torchcomms) occurrence, regardless of the RPATH.
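One way to check this guess is to ask the process itself which libnccl files it has mapped, right after the failing import. Below is a minimal, Linux-only sketch that assumes `/proc/self/maps` is available. The helper names are hypothetical, and if PyTorch happened to link NCCL statically, nothing coming from PyTorch would show up.

```python
from pathlib import Path


def loaded_libraries(maps_text: str, needle: str) -> list[str]:
    """Return the unique mapped file paths whose file name contains `needle`."""
    paths = []
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) == 6:
            path = fields[5].strip()
            if needle in Path(path).name and path not in paths:
                paths.append(path)
    return paths


def loaded_in_this_process(needle: str = "libnccl") -> list[str]:
    """Inspect the current process (Linux only)."""
    return loaded_libraries(Path("/proc/self/maps").read_text(), needle)
```

Calling `loaded_in_this_process()` in the same interpreter session, after `import torch` and the failing `import torchcomms._comms_ncclx`, should show whether the only libnccl in the process is the one PyTorch pulled in.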
Hence, I am not sure how this could or should be solved, but one thing that quickly comes to mind is:
if Torchcomms uses a modified version of NCCL for its NCCLX implementation, why not name it accordingly, i.e. libncclx.so (and the same goes for the API, which a simple macro can easily switch between nccl*/pnccl* and ncclx*/pncclx*)?
This would avoid, once and for all, any collision with a system-wide NCCL installation.
I am open to a constructive discussion on this topic, and eventually to working on a PR to solve this issue if it is found relevant (I have already worked on the macro magic to easily set Torchcomms' NCCL APIs to ncclx*/pncclx*, instead of needing rename_symbols.sh, which does not work properly unless one also changes the names baked into the ELF sections).