What happened + What you expected to happen
I started a multi-GPU node in the AWS SageMaker JupyterLab environment, git cloned neuralforecast, navigated to the experiments/long_horizon directory, and created the long_horizon conda environment from environment.yml (as described in long_horizon/readme.md). Then I ran:
- python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 1
and I got the following error:
(_train_tune pid=22946) [rank: 1] Child process with PID 23026 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff14b6988a6a4935518fd4985a01000000 Worker ID: fa49156216bca352caee735419dc8cf5a59f668a96d20f17a94ddec9 Node ID: 6eabca3b606628a676f68efc346d601b765017331b5032e290b6b99e Worker IP address: 169.255.255.2 Worker port: 33095 Worker PID: 22946 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2025-03-18 19:36:13,639 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_39116_00000
Traceback (most recent call last):
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 921, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: ImplicitFunc
actor_id: 14b6988a6a4935518fd4985a01000000
pid: 22946
namespace: efcdee2a-00c0-40d7-920b-84c180446a48
ip: 169.255.255.2
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
I expected the model to train on all 4 GPUs and run to completion. Here is the full error trace across all 4 GPUs:
(long_horizon) sagemaker-user@default:~/Nixtla/neuralforecast/experiments/long_horizon$ python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 1
/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: ray-project/ray#49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
2025-03-18 19:36:03,203 INFO worker.py:1841 -- Started a local Ray instance.
2025-03-18 19:36:04,295 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment _train_tune_2025-03-18_19-36-02 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-02
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-02_098330_19785/artifacts/2025-03-18_19-36-04/_train_tune_2025-03-18_19-36-02/driver_artifacts
(_train_tune pid=22946) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=22946) Seed set to 2
(_train_tune pid=22946) GPU available: True (cuda), used: True
(_train_tune pid=22946) TPU available: False, using: 0 TPU cores
(_train_tune pid=22946) HPU available: False, using: 0 HPUs
(_train_tune pid=22946) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
(_train_tune pid=22946) [rank: 1] Child process with PID 23026 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff14b6988a6a4935518fd4985a01000000 Worker ID: fa49156216bca352caee735419dc8cf5a59f668a96d20f17a94ddec9 Node ID: 6eabca3b606628a676f68efc346d601b765017331b5032e290b6b99e Worker IP address: 169.255.255.2 Worker port: 33095 Worker PID: 22946 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2025-03-18 19:36:13,639 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_39116_00000
Traceback (most recent call last):
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 921, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: ImplicitFunc
actor_id: 14b6988a6a4935518fd4985a01000000
pid: 22946
namespace: efcdee2a-00c0-40d7-920b-84c180446a48
ip: 169.255.255.2
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Trial _train_tune_39116_00000 errored after 0 iterations at 2025-03-18 19:36:13. Total running time: 9s
Error file: /tmp/ray/session_2025-03-18_19-36-02_098330_19785/artifacts/2025-03-18_19-36-04/_train_tune_2025-03-18_19-36-02/driver_artifacts/_train_tune_39116_00000_0_activation=ReLU,batch_size=7,dropout_prob_theta=0.5000,input_size=672,interpolation_mode=linear,learning_2025-03-18_19-36-04/error.txt
2025-03-18 19:36:13,646 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-02' in 0.0041s.
2025-03-18 19:36:13,647 ERROR tune.py:1037 -- Trials did not complete: [_train_tune_39116_00000]
Seed set to 2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: ray-project/ray#49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: ray-project/ray#49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: ray-project/ray#49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
2025-03-18 19:36:19,745 INFO worker.py:1841 -- Started a local Ray instance.
2025-03-18 19:36:19,755 INFO worker.py:1841 -- Started a local Ray instance.
2025-03-18 19:36:19,786 INFO worker.py:1841 -- Started a local Ray instance.
2025-03-18 19:36:21,623 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment _train_tune_2025-03-18_19-36-18 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-18_635368_23291/artifacts/2025-03-18_19-36-21/_train_tune_2025-03-18_19-36-18/driver_artifacts
2025-03-18 19:36:21,674 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment _train_tune_2025-03-18_19-36-18 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-18_645623_23290/artifacts/2025-03-18_19-36-21/_train_tune_2025-03-18_19-36-18/driver_artifacts
2025-03-18 19:36:21,956 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment _train_tune_2025-03-18_19-36-18 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-18_682673_23289/artifacts/2025-03-18_19-36-21/_train_tune_2025-03-18_19-36-18/driver_artifacts
(_train_tune pid=32621) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=32621) [rank: 2] Seed set to 2
(_train_tune pid=32554) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=32554) [rank: 3] Seed set to 2
(_train_tune pid=32822) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=32822) [rank: 1] Seed set to 2
(_train_tune pid=32621) Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
(_train_tune pid=32554) Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
(_train_tune pid=32822) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
(_train_tune pid=32554) LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Sanity Checking DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s](_train_tune pid=32621) LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
(_train_tune pid=32822) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Epoch 999: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.94it/s, v_num=2, train_loss_step=0.126, train_loss_epoch=0.126, valid_loss=0.537]
2025-03-18 19:36:48,017 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18' in 0.0040s.
2025-03-18 19:36:48,018 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18' in 0.0036s.
2025-03-18 19:36:48,018 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18' in 0.0042s.
[rank: 3] Seed set to 2
[rank: 1] Seed set to 2
[rank: 2] Seed set to 2
Predicting DataLoader 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.60it/s]
Parsed results
NHITS ETTh1 h=96
test_size 2880
y_true.shape (n_series, n_windows, n_time_out): (7, 2785, 96)
y_hat.shape (n_series, n_windows, n_time_out): (7, 2785, 96)
MSE: 0.33694613410636276
MAE: 0.4027628248310379
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/sagemaker-user/Nixtla/neuralforecast/experiments/long_horizon/run_nhits.py", line 84, in <module>
[rank2]: Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1187, in cross_validation
[rank2]: return self._no_refit_cross_validation(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1036, in _no_refit_cross_validation
[rank2]: model.fit(dataset=self.dataset, val_size=val_size, test_size=test_size)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 433, in fit
[rank2]: self.model = self._fit_model(
[rank2]: ^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 366, in _fit_model
[rank2]: model = model.fit(
[rank2]: ^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 1468, in fit
[rank2]: return self._fit(
[rank2]: ^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 546, in _fit
[rank2]: trainer.fit(model, datamodule=datamodule)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank2]: call._call_and_handle_interrupt(
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank2]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank2]: return function(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
[rank2]: self._run(model, ckpt_path=ckpt_path)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
[rank2]: self.__setup_profiler()
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1072, in __setup_profiler
[rank2]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank2]: ^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in log_dir
[rank2]: dirpath = self.strategy.broadcast(dirpath)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank2]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list
[rank2]: broadcast(object_sizes_tensor, src=global_src, group=group)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank2]: work = group.broadcast([tensor], opts)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank2]: Last error:
[rank2]: socketStartConnect: Connect to 169.255.255.2<56139> failed : Software caused connection abort
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/sagemaker-user/Nixtla/neuralforecast/experiments/long_horizon/run_nhits.py", line 84, in <module>
[rank1]: Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size,
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1187, in cross_validation
[rank1]: return self._no_refit_cross_validation(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1036, in _no_refit_cross_validation
[rank1]: model.fit(dataset=self.dataset, val_size=val_size, test_size=test_size)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 433, in fit
[rank1]: self.model = self._fit_model(
[rank1]: ^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 366, in _fit_model
[rank1]: model = model.fit(
[rank1]: ^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 1468, in fit
[rank1]: return self._fit(
[rank1]: ^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 546, in _fit
[rank1]: trainer.fit(model, datamodule=datamodule)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
[rank1]: self.__setup_profiler()
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1072, in __setup_profiler
[rank1]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank1]: ^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in log_dir
[rank1]: dirpath = self.strategy.broadcast(dirpath)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank1]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list
[rank1]: broadcast(object_sizes_tensor, src=global_src, group=group)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank1]: work = group.broadcast([tensor], opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: socketStartConnect: Connect to 169.255.255.2<56139> failed : Software caused connection abort
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/sagemaker-user/Nixtla/neuralforecast/experiments/long_horizon/run_nhits.py", line 84, in <module>
[rank3]: Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size,
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1187, in cross_validation
[rank3]: return self._no_refit_cross_validation(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1036, in _no_refit_cross_validation
[rank3]: model.fit(dataset=self.dataset, val_size=val_size, test_size=test_size)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 433, in fit
[rank3]: self.model = self._fit_model(
[rank3]: ^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 366, in _fit_model
[rank3]: model = model.fit(
[rank3]: ^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 1468, in fit
[rank3]: return self._fit(
[rank3]: ^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 546, in _fit
[rank3]: trainer.fit(model, datamodule=datamodule)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank3]: call._call_and_handle_interrupt(
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank3]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank3]: return function(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
[rank3]: self._run(model, ckpt_path=ckpt_path)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
[rank3]: self.__setup_profiler()
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1072, in __setup_profiler
[rank3]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank3]: ^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in log_dir
[rank3]: dirpath = self.strategy.broadcast(dirpath)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank3]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list
[rank3]: broadcast(object_sizes_tensor, src=global_src, group=group)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank3]: work = group.broadcast([tensor], opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank3]: Last error:
[rank3]: socketStartConnect: Connect to 169.255.255.2<56139> failed : Software caused connection abort
[rank: 1] Child process with PID 23289 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
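The NCCL error above suggests rerunning with NCCL_DEBUG=INFO for details. For reference, this is the environment I would set before retrying (the interface name eth0 is an assumption about the SageMaker instance's networking; the failing connect targets 169.255.255.2, which may not be the interface NCCL should be using):

```shell
# Verbose NCCL logging, as suggested by the error message itself.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Hypothetical: pin NCCL to a specific network interface so it does not try
# to connect over the SageMaker-internal 169.255.255.x address.
export NCCL_SOCKET_IFNAME=eth0
```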
Versions / Dependencies
Here is the pip freeze output:
aiohappyeyeballs==2.6.1
aiohttp==3.11.14
aiosignal==1.3.2
alembic==1.15.1
attrs==25.3.0
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
colorlog==6.9.0
coreforecast==0.0.15
datasetsforecast @ git+https://github.com/Nixtla/datasetsforecast.git@c0023084c52c244740598affe7afafa3d59f2729
filelock==3.18.0
frozenlist==1.5.0
fsspec==2025.3.0
greenlet==3.1.1
idna==3.10
Jinja2==3.1.6
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lightning-utilities==0.14.1
Mako==1.3.9
MarkupSafe==3.0.2
mpmath==1.3.0
msgpack==1.1.0
multidict==6.2.0
networkx==3.4.2
neuralforecast @ git+https://github.com/Nixtla/neuralforecast.git@e2f473a51ba15fbf4c33ff76cc8d1687ab68c517
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1668919096335/work
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
optuna==4.2.1
packaging==24.2
pandas==2.2.3
propcache==0.3.0
protobuf==6.30.1
pyarrow==19.0.1
python-dateutil==2.9.0.post0
pytorch-lightning==2.5.0.post0
pytz==2025.1
PyYAML==6.0.2
ray==2.43.0
referencing==0.36.2
requests==2.32.3
rpds-py==0.23.1
scikit-learn==1.6.1
scipy==1.15.2
six==1.17.0
SQLAlchemy==2.0.39
sympy==1.13.1
tensorboardX==2.6.2.2
threadpoolctl==3.6.0
torch==2.6.0
torchmetrics==1.6.3
tqdm==4.67.1
triton==3.2.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
utilsforecast==0.2.12
xlrd==2.0.1
yarl==1.18.3
Reproduction script
On a node with multiple GPUs (e.g., an AWS ml.g4dn.12xlarge with 4 GPUs):
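The setup steps from the description above, collected into a sketch of a reproduction script (the exact `conda env create` invocation is an assumption based on the long_horizon README; run this on the multi-GPU node itself):

```shell
# Clone the repo and enter the long-horizon experiment directory
git clone https://github.com/Nixtla/neuralforecast.git
cd neuralforecast/experiments/long_horizon

# Create and activate the conda environment described in the README
# (assumed invocation; adjust if the README specifies otherwise)
conda env create -f environment.yml
conda activate long_horizon

# Run the experiment that triggers the NCCL / Ray worker failure
python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 1
```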
Issue Severity
Medium: It is a significant difficulty but I can work around it.