Training fails on Vertex AI (GCP) due to NCCL error on A100 GPUs

### Bug description
Training does not start on Vertex AI when using >1 A100 GPUs with NCCL due to an unhandled system error. The problem currently only occurs on A100 GPUs, probably due to GPU partitioning in the K8s cluster. Bumping up NCCL during compilation to the most recent master (https://github.com/NVIDIA/nccl) seems to fix the issue. Could we update Marian NCCL fork (https://github.com/marian-nmt/nccl), or would that possibly break something else? @snukky what are your thoughts on that?

Sample log:
```
[2023-04-21 19:17:31] Error: NCCL error 2 'unhandled system error' - /marian/src/training/communicator_nccl.h:43: ncclGroupEnd()
[2023-04-21 19:17:31] Error: Aborted from void marian::NCCLCommunicator::groupEnd() const in /marian/src/training/communicator_nccl.h:43
[CALL STACK]
[0x561c348a40d2] marian::NCCLCommunicator:: groupEnd () const + 0x222
[0x561c348ab90a] marian::NCCLCommunicator:: NCCLCommunicator (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&, marian::ShardingMode, std::shared_ptr<marian::IMPIWrapper>) + 0x2c4a
[0x561c3489203b] marian:: createCommunicator (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&, bool, marian::ShardingMode, std::shared_ptr<marian::IMPIWrapper>) + 0x44b
[0x561c3482438d] marian::GraphGroup:: GraphGroup (std::shared_ptr<marian::Options>, std::shared_ptr<marian::IMPIWrapper>) + 0x4fd
[0x561c347fd4e4] marian::SyncGraphGroup:: SyncGraphGroup (std::shared_ptr<marian::Options>, std::shared_ptr<marian::IMPIWrapper>) + 0x74
[0x561c3433af33] marian::Train<marian::SyncGraphGroup>:: run () + 0x333
[0x561c34260fe7] mainTrainer (int, char**) + 0x157
[0x561c3421931c] main + 0x3c
[0x7fc4a2cce083] __libc_start_main + 0xf3
[0x561c3425fc6e] _start + 0x2e 
```

If NCCL debug log is enabled, the following warning shows up, while it does not show up e.g. when using V100 GPUs:
`graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Training fails on Vertex AI (GCP) due to NCCL error on A100 GPUs #998

Bug description

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Training fails on Vertex AI (GCP) due to NCCL error on A100 GPUs #998

Description

Bug description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions