What happened + What you expected to happen
I started a multi-GPU node in the AWS SageMaker JupyterLab environment, git cloned neuralforecast, navigated to the experiments/long_horizon directory, and created the long_horizon conda environment from environment.yml (as described in long_horizon/readme.md). Then I ran:
- python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 1
and I got the following error:
(_train_tune pid=22946) [rank: 1] Child process with PID 23026 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff14b6988a6a4935518fd4985a01000000 Worker ID: fa49156216bca352caee735419dc8cf5a59f668a96d20f17a94ddec9 Node ID: 6eabca3b606628a676f68efc346d601b765017331b5032e290b6b99e Worker IP address: 169.255.255.2 Worker port: 33095 Worker PID: 22946 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2025-03-18 19:36:13,639 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_39116_00000
Traceback (most recent call last):
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 921, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: ImplicitFunc
actor_id: 14b6988a6a4935518fd4985a01000000
pid: 22946
namespace: efcdee2a-00c0-40d7-920b-84c180446a48
ip: 169.255.255.2
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
I expected the model to train on all 4 GPUs and run to completion. Here is the full error trace across all 4 GPUs:
(long_horizon) sagemaker-user@default:~/Nixtla/neuralforecast/experiments/long_horizon$ python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 1
/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: ray-project/ray#49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
2025-03-18 19:36:03,203 INFO worker.py:1841 -- Started a local Ray instance.
2025-03-18 19:36:04,295 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment _train_tune_2025-03-18_19-36-02 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-02
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-02_098330_19785/artifacts/2025-03-18_19-36-04/_train_tune_2025-03-18_19-36-02/driver_artifacts
(_train_tune pid=22946) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=22946) Seed set to 2
(_train_tune pid=22946) GPU available: True (cuda), used: True
(_train_tune pid=22946) TPU available: False, using: 0 TPU cores
(_train_tune pid=22946) HPU available: False, using: 0 HPUs
(_train_tune pid=22946) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
(_train_tune pid=22946) [rank: 1] Child process with PID 23026 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff14b6988a6a4935518fd4985a01000000 Worker ID: fa49156216bca352caee735419dc8cf5a59f668a96d20f17a94ddec9 Node ID: 6eabca3b606628a676f68efc346d601b765017331b5032e290b6b99e Worker IP address: 169.255.255.2 Worker port: 33095 Worker PID: 22946 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2025-03-18 19:36:13,639 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_39116_00000
Traceback (most recent call last):
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 921, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: ImplicitFunc
actor_id: 14b6988a6a4935518fd4985a01000000
pid: 22946
namespace: efcdee2a-00c0-40d7-920b-84c180446a48
ip: 169.255.255.2
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Trial _train_tune_39116_00000 errored after 0 iterations at 2025-03-18 19:36:13. Total running time: 9s
Error file: /tmp/ray/session_2025-03-18_19-36-02_098330_19785/artifacts/2025-03-18_19-36-04/_train_tune_2025-03-18_19-36-02/driver_artifacts/_train_tune_39116_00000_0_activation=ReLU,batch_size=7,dropout_prob_theta=0.5000,input_size=672,interpolation_mode=linear,learning_2025-03-18_19-36-04/error.txt
2025-03-18 19:36:13,646 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-02' in 0.0041s.
2025-03-18 19:36:13,647 ERROR tune.py:1037 -- Trials did not complete: [_train_tune_39116_00000]
Seed set to 2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: ray-project/ray#49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: ray-project/ray#49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: ray-project/ray#49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
2025-03-18 19:36:19,745 INFO worker.py:1841 -- Started a local Ray instance.
2025-03-18 19:36:19,755 INFO worker.py:1841 -- Started a local Ray instance.
2025-03-18 19:36:19,786 INFO worker.py:1841 -- Started a local Ray instance.
2025-03-18 19:36:21,623 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment _train_tune_2025-03-18_19-36-18 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-18_635368_23291/artifacts/2025-03-18_19-36-21/_train_tune_2025-03-18_19-36-18/driver_artifacts
2025-03-18 19:36:21,674 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment _train_tune_2025-03-18_19-36-18 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-18_645623_23290/artifacts/2025-03-18_19-36-21/_train_tune_2025-03-18_19-36-18/driver_artifacts
2025-03-18 19:36:21,956 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment _train_tune_2025-03-18_19-36-18 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-18_682673_23289/artifacts/2025-03-18_19-36-21/_train_tune_2025-03-18_19-36-18/driver_artifacts
(_train_tune pid=32621) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=32621) [rank: 2] Seed set to 2
(_train_tune pid=32554) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=32554) [rank: 3] Seed set to 2
(_train_tune pid=32822) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=32822) [rank: 1] Seed set to 2
(_train_tune pid=32621) Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
(_train_tune pid=32554) Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
(_train_tune pid=32822) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
(_train_tune pid=32554) LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Sanity Checking DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s](_train_tune pid=32621) LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
(_train_tune pid=32822) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Epoch 999: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.94it/s, v_num=2, train_loss_step=0.126, train_loss_epoch=0.126, valid_loss=0.537]
2025-03-18 19:36:48,017 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18' in 0.0040s.
2025-03-18 19:36:48,018 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18' in 0.0036s.
2025-03-18 19:36:48,018 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18' in 0.0042s.
[rank: 3] Seed set to 2
[rank: 1] Seed set to 2
[rank: 2] Seed set to 2
Predicting DataLoader 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.60it/s]
Parsed results
NHITS ETTh1 h=96
test_size 2880
y_true.shape (n_series, n_windows, n_time_out): (7, 2785, 96)
y_hat.shape (n_series, n_windows, n_time_out): (7, 2785, 96)
MSE: 0.33694613410636276
MAE: 0.4027628248310379
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/sagemaker-user/Nixtla/neuralforecast/experiments/long_horizon/run_nhits.py", line 84, in <module>
[rank2]: Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1187, in cross_validation
[rank2]: return self._no_refit_cross_validation(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1036, in _no_refit_cross_validation
[rank2]: model.fit(dataset=self.dataset, val_size=val_size, test_size=test_size)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 433, in fit
[rank2]: self.model = self._fit_model(
[rank2]: ^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 366, in _fit_model
[rank2]: model = model.fit(
[rank2]: ^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 1468, in fit
[rank2]: return self._fit(
[rank2]: ^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 546, in _fit
[rank2]: trainer.fit(model, datamodule=datamodule)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank2]: call._call_and_handle_interrupt(
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank2]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank2]: return function(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
[rank2]: self._run(model, ckpt_path=ckpt_path)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
[rank2]: self.__setup_profiler()
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1072, in __setup_profiler
[rank2]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank2]: ^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in log_dir
[rank2]: dirpath = self.strategy.broadcast(dirpath)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank2]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list
[rank2]: broadcast(object_sizes_tensor, src=global_src, group=group)
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank2]: work = group.broadcast([tensor], opts)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank2]: Last error:
[rank2]: socketStartConnect: Connect to 169.255.255.2<56139> failed : Software caused connection abort
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/sagemaker-user/Nixtla/neuralforecast/experiments/long_horizon/run_nhits.py", line 84, in <module>
[rank1]: Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size,
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1187, in cross_validation
[rank1]: return self._no_refit_cross_validation(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1036, in _no_refit_cross_validation
[rank1]: model.fit(dataset=self.dataset, val_size=val_size, test_size=test_size)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 433, in fit
[rank1]: self.model = self._fit_model(
[rank1]: ^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 366, in _fit_model
[rank1]: model = model.fit(
[rank1]: ^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 1468, in fit
[rank1]: return self._fit(
[rank1]: ^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 546, in _fit
[rank1]: trainer.fit(model, datamodule=datamodule)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
[rank1]: self.__setup_profiler()
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1072, in __setup_profiler
[rank1]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank1]: ^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in log_dir
[rank1]: dirpath = self.strategy.broadcast(dirpath)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank1]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list
[rank1]: broadcast(object_sizes_tensor, src=global_src, group=group)
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank1]: work = group.broadcast([tensor], opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: socketStartConnect: Connect to 169.255.255.2<56139> failed : Software caused connection abort
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/sagemaker-user/Nixtla/neuralforecast/experiments/long_horizon/run_nhits.py", line 84, in <module>
[rank3]: Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size,
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1187, in cross_validation
[rank3]: return self._no_refit_cross_validation(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1036, in _no_refit_cross_validation
[rank3]: model.fit(dataset=self.dataset, val_size=val_size, test_size=test_size)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 433, in fit
[rank3]: self.model = self._fit_model(
[rank3]: ^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 366, in _fit_model
[rank3]: model = model.fit(
[rank3]: ^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 1468, in fit
[rank3]: return self._fit(
[rank3]: ^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 546, in _fit
[rank3]: trainer.fit(model, datamodule=datamodule)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank3]: call._call_and_handle_interrupt(
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank3]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank3]: return function(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
[rank3]: self._run(model, ckpt_path=ckpt_path)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
[rank3]: self.__setup_profiler()
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1072, in __setup_profiler
[rank3]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank3]: ^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in log_dir
[rank3]: dirpath = self.strategy.broadcast(dirpath)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank3]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list
[rank3]: broadcast(object_sizes_tensor, src=global_src, group=group)
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank3]: work = group.broadcast([tensor], opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank3]: Last error:
[rank3]: socketStartConnect: Connect to 169.255.255.2<56139> failed : Software caused connection abort
[rank: 1] Child process with PID 23289 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
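The NCCL error above suggests rerunning with NCCL_DEBUG=INFO for details. For reference, this is the environment I would set before retrying (the interface name eth0 is an assumption about the SageMaker instance's networking; the failing connect targets 169.255.255.2, which may not be the interface NCCL should be using):

```shell
# Verbose NCCL logging, as suggested by the error message itself.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Hypothetical: pin NCCL to a specific network interface so it does not try
# to connect over the SageMaker-internal 169.255.255.x address.
export NCCL_SOCKET_IFNAME=eth0
```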
Versions / Dependencies
Here is the pip freeze output:
aiohappyeyeballs==2.6.1
aiohttp==3.11.14
aiosignal==1.3.2
alembic==1.15.1
attrs==25.3.0
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
colorlog==6.9.0
coreforecast==0.0.15
datasetsforecast @ git+https://github.com/Nixtla/datasetsforecast.git@c0023084c52c244740598affe7afafa3d59f2729
filelock==3.18.0
frozenlist==1.5.0
fsspec==2025.3.0
greenlet==3.1.1
idna==3.10
Jinja2==3.1.6
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lightning-utilities==0.14.1
Mako==1.3.9
MarkupSafe==3.0.2
mpmath==1.3.0
msgpack==1.1.0
multidict==6.2.0
networkx==3.4.2
neuralforecast @ git+https://github.com/Nixtla/neuralforecast.git@e2f473a51ba15fbf4c33ff76cc8d1687ab68c517
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1668919096335/work
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
optuna==4.2.1
packaging==24.2
pandas==2.2.3
propcache==0.3.0
protobuf==6.30.1
pyarrow==19.0.1
python-dateutil==2.9.0.post0
pytorch-lightning==2.5.0.post0
pytz==2025.1
PyYAML==6.0.2
ray==2.43.0
referencing==0.36.2
requests==2.32.3
rpds-py==0.23.1
scikit-learn==1.6.1
scipy==1.15.2
six==1.17.0
SQLAlchemy==2.0.39
sympy==1.13.1
tensorboardX==2.6.2.2
threadpoolctl==3.6.0
torch==2.6.0
torchmetrics==1.6.3
tqdm==4.67.1
triton==3.2.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
utilsforecast==0.2.12
xlrd==2.0.1
yarl==1.18.3
Reproduction script
On a node with multiple GPUs (e.g., an AWS ml.g4dn.12xlarge with 4 GPUs):
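The setup steps from the description above, collected into a sketch of a reproduction script (the exact `conda env create` invocation is an assumption based on the long_horizon README; run this on the multi-GPU node itself):

```shell
# Clone the repo and enter the long-horizon experiment directory
git clone https://github.com/Nixtla/neuralforecast.git
cd neuralforecast/experiments/long_horizon

# Create and activate the conda environment described in the README
# (assumed invocation; adjust if the README specifies otherwise)
conda env create -f environment.yml
conda activate long_horizon

# Run the experiment that triggers the NCCL / Ray worker failure
python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 1
```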
Issue Severity
Medium: It is a significant difficulty but I can work around it.