Unable to run TorchRec DLRM using the provided Dockerfile and requirements.txt. I'm using the latest revision of the master branch.
> cp Dockerfile Dockerfile.torchx
> torchx run -s local_docker dist.ddp -j 1x2 --script dlrm_main.py
torchx 2024-08-05 13:26:15 INFO Tracker configurations: {}
torchx 2024-08-05 13:26:15 INFO Checking for changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm`...
torchx 2024-08-05 13:26:15 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-08-05 13:26:15 INFO Workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` resolved to filesystem path `/proj/java-gpu/training/recommendation_v2/torchrec_dlrm`
torchx 2024-08-05 13:26:16 INFO Building workspace docker image (this may take a while)...
torchx 2024-08-05 13:26:16 INFO Step 1/7 : ARG FROM_IMAGE_NAME=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
torchx 2024-08-05 13:26:16 INFO Step 2/7 : FROM ${FROM_IMAGE_NAME}
torchx 2024-08-05 13:26:16 INFO ---> 71eb2d092138
torchx 2024-08-05 13:26:16 INFO Step 3/7 : RUN apt-get -y update && apt-get -y install git
torchx 2024-08-05 13:26:16 INFO ---> Using cache
torchx 2024-08-05 13:26:16 INFO ---> 45eded198de2
torchx 2024-08-05 13:26:16 INFO Step 4/7 : WORKDIR /workspace/torchrec_dlrm
torchx 2024-08-05 13:26:16 INFO ---> Using cache
torchx 2024-08-05 13:26:16 INFO ---> 1b41a30dcd79
torchx 2024-08-05 13:26:16 INFO Step 5/7 : COPY . .
torchx 2024-08-05 13:26:16 INFO ---> ae30b5f5e5a1
torchx 2024-08-05 13:26:16 INFO Step 6/7 : RUN pip install --no-cache-dir -r requirements.txt
torchx 2024-08-05 13:26:16 INFO ---> Running in 3ef0c644fc38
...
torchx 2024-08-05 13:27:02 INFO ---> Removed intermediate container 3ef0c644fc38
torchx 2024-08-05 13:27:02 INFO ---> addfe3ce01cb
torchx 2024-08-05 13:27:02 INFO Step 7/7 : LABEL torchx.pytorch.org/version=0.7.0
torchx 2024-08-05 13:27:02 INFO ---> Running in 4e254643ce54
torchx 2024-08-05 13:27:02 INFO ---> Removed intermediate container 4e254643ce54
torchx 2024-08-05 13:27:02 INFO ---> 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO [Warning] One or more build-args [IMAGE WORKSPACE] were not consumed
torchx 2024-08-05 13:27:02 INFO Successfully built 861ee2a4e5d3
torchx 2024-08-05 13:27:02 INFO Built new image `sha256:861ee2a4e5d33dca93d9fe8847feccd4028d2e27c8f281654307aeec203452bd` based on original image `ghcr.io/pytorch/torchx:0.7.0` and changes in workspace `file:///proj/java-gpu/training/recommendation_v2/torchrec_dlrm` for role[0]=dlrm_main.
local_docker://torchx/dlrm_main-sbz7tbpcb2sqvd
torchx 2024-08-05 13:27:03 INFO Waiting for the app to finish...
dlrm_main/0 WARNING:torch.distributed.run:
dlrm_main/0 *****************************************
dlrm_main/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
dlrm_main/0 *****************************************
dlrm_main/0 [0]:
dlrm_main/0 [0]:A module that was compiled using NumPy 1.x cannot be run in
dlrm_main/0 [0]:NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
dlrm_main/0 [0]:versions of NumPy, modules must be compiled with NumPy 2.0.
dlrm_main/0 [0]:Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
dlrm_main/0 [0]:
dlrm_main/0 [0]:If you are a user of the module, the easiest solution will be to
dlrm_main/0 [0]:downgrade to 'numpy<2' or try to upgrade the affected module.
dlrm_main/0 [0]:We expect that some modules will need time to support NumPy 2.
dlrm_main/0 [0]:
dlrm_main/0 [0]:Traceback (most recent call last):
dlrm_main/0 [0]: File "/workspace/torchrec_dlrm/dlrm_main.py", line 19, in <module>
dlrm_main/0 [0]: import torchmetrics as metrics
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
dlrm_main/0 [0]: from torchmetrics import functional # noqa: E402
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
dlrm_main/0 [0]: from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
dlrm_main/0 [0]: from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate # noqa: F401
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
dlrm_main/0 [0]: from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
dlrm_main/0 [0]: from torchmetrics.utilities.checks import check_forward_full_state_property # noqa: F401
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 22, in <module>
dlrm_main/0 [0]: from torchmetrics.utilities.data import select_topk, to_onehot
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 19, in <module>
dlrm_main/0 [0]: from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 113, in <module>
dlrm_main/0 [0]: _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
dlrm_main/0 [0]: if not _module_available(package):
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
dlrm_main/0 [0]: module = import_module(module_names[0])
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
dlrm_main/0 [0]: return _bootstrap._gcd_import(name[level:], package, level)
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchvision/__init__.py", line 5, in <module>
dlrm_main/0 [0]: from torchvision import datasets, io, models, ops, transforms, utils
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/__init__.py", line 17, in <module>
dlrm_main/0 [0]: from . import detection, optical_flow, quantization, segmentation, video
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
dlrm_main/0 [0]: from .faster_rcnn import *
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 16, in <module>
dlrm_main/0 [0]: from .anchor_utils import AnchorGenerator
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 10, in <module>
dlrm_main/0 [0]: class AnchorGenerator(nn.Module):
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 63, in AnchorGenerator
dlrm_main/0 [0]: device: torch.device = torch.device("cpu"),
dlrm_main/0 [0]:/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py:63: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/utils/tensor_numpy.cpp:77.)
dlrm_main/0 [0]: device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:
dlrm_main/0 [1]:A module that was compiled using NumPy 1.x cannot be run in
dlrm_main/0 [1]:NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
dlrm_main/0 [1]:versions of NumPy, modules must be compiled with NumPy 2.0.
dlrm_main/0 [1]:Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
dlrm_main/0 [1]:
dlrm_main/0 [1]:If you are a user of the module, the easiest solution will be to
dlrm_main/0 [1]:downgrade to 'numpy<2' or try to upgrade the affected module.
dlrm_main/0 [1]:We expect that some modules will need time to support NumPy 2.
dlrm_main/0 [1]:
dlrm_main/0 [1]:Traceback (most recent call last):
dlrm_main/0 [1]: File "/workspace/torchrec_dlrm/dlrm_main.py", line 96, in <module>
dlrm_main/0 [1]: import torchmetrics as metrics
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
dlrm_main/0 [1]: from torchmetrics import functional # noqa: E402
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
dlrm_main/0 [1]: from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
dlrm_main/0 [1]: from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate # noqa: F401
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
dlrm_main/0 [1]: from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
dlrm_main/0 [1]: from torchmetrics.utilities.checks import check_forward_full_state_property # noqa: F401
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/checks.py", line 22, in <module>
dlrm_main/0 [1]: from torchmetrics.utilities.data import select_topk, to_onehot
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 19, in <module>
dlrm_main/0 [1]: from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_12
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 113, in <module>
dlrm_main/0 [1]: _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
dlrm_main/0 [1]: if not _module_available(package):
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
dlrm_main/0 [1]: module = import_module(module_names[0])
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
dlrm_main/0 [1]: return _bootstrap._gcd_import(name[level:], package, level)
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchvision/__init__.py", line 5, in <module>
dlrm_main/0 [1]: from torchvision import datasets, io, models, ops, transforms, utils
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/__init__.py", line 17, in <module>
dlrm_main/0 [1]: from . import detection, optical_flow, quantization, segmentation, video
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
dlrm_main/0 [1]: from .faster_rcnn import *
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 16, in <module>
dlrm_main/0 [1]: from .anchor_utils import AnchorGenerator
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 10, in <module>
dlrm_main/0 [1]: class AnchorGenerator(nn.Module):
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py", line 63, in AnchorGenerator
dlrm_main/0 [1]: device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:/opt/conda/lib/python3.10/site-packages/torchvision/models/detection/anchor_utils.py:63: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/utils/tensor_numpy.cpp:77.)
dlrm_main/0 [1]: device: torch.device = torch.device("cpu"),
dlrm_main/0 [1]:Traceback (most recent call last):
dlrm_main/0 [1]: File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [1]: main(sys.argv[1:])
dlrm_main/0 [1]: File "/workspace/torchrec_dlrm/dlrm_main.py", line 813, in main
dlrm_main/0 [1]: plan = planner.collective_plan(
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/planner/planners.py", line 177, in collective_plan
dlrm_main/0 [1]: return invoke_on_rank_and_broadcast_result(
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/collective_utils.py", line 58, in invoke_on_rank_and_broadcast_result
dlrm_main/0 [1]: dist.broadcast_object_list(object_list, rank, group=pg)
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2106, in broadcast_object_list
dlrm_main/0 [1]: object_list[i] = _tensor_to_object(obj_view, obj_size)
dlrm_main/0 [1]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in _tensor_to_object
dlrm_main/0 [1]: buf = tensor.numpy().tobytes()[:tensor_size]
dlrm_main/0 [1]:RuntimeError: Numpy is not available
dlrm_main/0 [0]:Traceback (most recent call last):
dlrm_main/0 [0]: File "/workspace/torchrec_dlrm/dlrm_main.py", line 939, in <module>
dlrm_main/0 [0]: main(sys.argv[1:])
dlrm_main/0 [0]: File "/workspace/torchrec_dlrm/dlrm_main.py", line 817, in main
dlrm_main/0 [0]: model = DistributedModelParallel(
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 232, in __init__
dlrm_main/0 [0]: self.init_data_parallel()
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 266, in init_data_parallel
dlrm_main/0 [0]: self._data_parallel_wrapper.wrap(self, self._env, self.device)
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 97, in wrap
dlrm_main/0 [0]: DistributedDataParallel(
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
dlrm_main/0 [0]: _verify_param_shape_across_processes(self.process_group, parameters)
dlrm_main/0 [0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
dlrm_main/0 [0]: return dist._verify_params_across_processes(process_group, tensors, logger)
dlrm_main/0 [0]:RuntimeError: [/opt/conda/conda-bld/pytorch_1670525552843/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.20.0.2]:54499
dlrm_main/0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25) of binary: /opt/conda/bin/python
dlrm_main/0 [0]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [1]:libcuda.so.1: cannot open shared object file: No such file or directory
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625323, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 660}}
dlrm_main/0 [1]::::MLLOG {"namespace": "", "time_ms": 1722889625376, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 661}}
dlrm_main/0 [0]:{'adagrad': False,
dlrm_main/0 [0]: 'allow_tf32': False,
dlrm_main/0 [0]: 'batch_size': 32,
dlrm_main/0 [0]: 'collect_multi_hot_freqs_stats': False,
dlrm_main/0 [0]: 'dataset_name': 'criteo_1t',
dlrm_main/0 [0]: 'dcn_low_rank_dim': 512,
dlrm_main/0 Traceback (most recent call last):
dlrm_main/0 File "/opt/conda/bin/torchrun", line 33, in <module>
dlrm_main/0 sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
dlrm_main/0 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
dlrm_main/0 [0]: 'dcn_num_layers': 3,
dlrm_main/0 [0]: 'dense_arch_layer_sizes': [512, 256, 64],
dlrm_main/0 [0]: 'drop_last_training_batch': False,
dlrm_main/0 [0]: 'embedding_dim': 64,
dlrm_main/0 [0]: 'epochs': 1,
dlrm_main/0 [0]: 'evaluate_on_epoch_end': False,
dlrm_main/0 [0]: 'evaluate_on_training_end': False,
dlrm_main/0 [0]: 'in_memory_binary_criteo_path': None,
dlrm_main/0 [0]: 'interaction_branch1_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_branch2_layer_sizes': [2048, 2048],
dlrm_main/0 [0]: 'interaction_type': <InteractionType.ORIGINAL: 'original'>,
dlrm_main/0 [0]: 'learning_rate': 15.0,
dlrm_main/0 [0]: 'limit_test_batches': None,
dlrm_main/0 [0]: 'limit_train_batches': None,
dlrm_main/0 [0]: 'limit_val_batches': None,
dlrm_main/0 [0]: 'lr_decay_start': 0,
dlrm_main/0 [0]: 'lr_decay_steps': 0,
dlrm_main/0 [0]: 'lr_warmup_steps': 0,
dlrm_main/0 [0]: 'mmap_mode': False,
dlrm_main/0 [0]: 'multi_hot_distribution_type': None,
dlrm_main/0 [0]: 'multi_hot_sizes': None,
dlrm_main/0 [0]: 'num_embeddings': 100000,
dlrm_main/0 [0]: 'num_embeddings_per_feature': None,
dlrm_main/0 [0]: 'over_arch_layer_sizes': [512, 512, 256, 1],
dlrm_main/0 [0]: 'pin_memory': False,
dlrm_main/0 [0]: 'print_lr': False,
dlrm_main/0 [0]: 'print_progress': False,
dlrm_main/0 [0]: 'print_sharding_plan': False,
dlrm_main/0 [0]: 'seed': None,
dlrm_main/0 [0]: 'shuffle_batches': False,
dlrm_main/0 [0]: 'shuffle_training_set': False,
dlrm_main/0 [0]: 'synthetic_multi_hot_criteo_path': None,
dlrm_main/0 [0]: 'test_batch_size': None,
dlrm_main/0 [0]: 'validation_auroc': None,
dlrm_main/0 [0]: 'validation_freq_within_epoch': None}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "dlrm_dcnv2", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 7}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 11}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 15}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626392, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 19}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "reference_implementation", "metadata": {"file": "/workspace/torchrec_dlrm/mlperf_logging_utils.py", "lineno": 23}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 64, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 705}}
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 709}}
dlrm_main/0 return f(*args, **kwargs)
dlrm_main/0 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
dlrm_main/0 [0]::::MLLOG {"namespace": "", "time_ms": 1722889626393, "event_type": "POINT_IN_TIME", "key": "seed", "value": null, "metadata": {"file": "/workspace/torchrec_dlrm/dlrm_main.py", "lineno": 713}}
dlrm_main/0 run(args)
dlrm_main/0 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
dlrm_main/0 elastic_launch(
dlrm_main/0 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
dlrm_main/0 return launch_agent(self._config, self._entrypoint, list(args))
dlrm_main/0 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
dlrm_main/0 raise ChildFailedError(
dlrm_main/0 torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
dlrm_main/0 ============================================================
dlrm_main/0 dlrm_main.py FAILED
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Failures:
dlrm_main/0 [1]:
dlrm_main/0 time : 2024-08-05_20:27:09
dlrm_main/0 host : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0 rank : 1 (local_rank: 1)
dlrm_main/0 exitcode : 1 (pid: 26)
dlrm_main/0 error_file: <N/A>
dlrm_main/0 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ------------------------------------------------------------
dlrm_main/0 Root Cause (first observed failure):
dlrm_main/0 [0]:
dlrm_main/0 time : 2024-08-05_20:27:09
dlrm_main/0 host : dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
dlrm_main/0 rank : 0 (local_rank: 0)
dlrm_main/0 exitcode : 1 (pid: 25)
dlrm_main/0 error_file: <N/A>
dlrm_main/0 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dlrm_main/0 ============================================================
torchx 2024-08-05 13:27:10 INFO Job finished: FAILED
torchx 2024-08-05 13:27:10 ERROR AppStatus:
msg: <NONE>
num_restarts: -1
roles:
- replicas:
- hostname: dlrm_main-sbz7tbpcb2sqvd-dlrm_main-0
id: 0
role: dlrm_main
state: !!python/object/apply:torchx.specs.api.AppState
- 5
structured_error_msg: <NONE>
role: dlrm_main
state: FAILED (5)
structured_error_msg: <NONE>
ui_url: null
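
The NumPy message in the log reports that NumPy 2.0.1 was installed in the image while at least one module (the traceback points at torch 1.13.1) was compiled against NumPy 1.x, and it suggests downgrading to 'numpy<2'. Below is a minimal sketch for checking which NumPy version ended up in the container. The helper names (`numpy_major`, `needs_numpy_pin`, `installed_numpy_version`) are hypothetical and not part of the repository, and pinning `numpy<2` in requirements.txt is only the workaround proposed by the NumPy message itself; it has not been verified to fix this run.

```python
from importlib import metadata


def numpy_major(version: str) -> int:
    """Return the major component of a version string such as '2.0.1'."""
    return int(version.split(".")[0])


def needs_numpy_pin(version: str) -> bool:
    """True when NumPy is 2.x or newer, which the log above reports as
    incompatible with modules compiled against NumPy 1.x."""
    return numpy_major(version) >= 2


def installed_numpy_version():
    """Return the installed NumPy version string, or None if it is absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_numpy_version()
    if found is None:
        print("numpy is not installed")
    elif needs_numpy_pin(found):
        # Assumption: pinning 'numpy<2' is the workaround, as the log suggests.
        print(f"numpy {found} found; try pinning 'numpy<2' in requirements.txt")
    else:
        print(f"numpy {found} found; the NumPy 1.x/2.x mismatch does not apply")
```

Running this inside the built image would show whether `pip install -r requirements.txt` pulled in NumPy 2.x. The separate `libcuda.so.1` message in the log is not addressed by this check.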