Summary
Add support for multi-node distributed training (2 nodes × 4 GPUs = 8 GPUs total) when using SLURM batch scripts with Singularity containers. The current setup only supports single-node training via main.py (multiprocessing) or main_distributed.py (submitit, which submits a new job and does not fit the existing SBATCH + Singularity workflow).
Context
- Current workflow: dev/run_on_hpc/mn5/train_multinoise.sh is an SBATCH script that stages data to $TMPDIR, then runs singularity exec ... python main.py --devices cuda:0 cuda:1 cuda:2 cuda:3
- main.py is single-node only: it uses multiprocessing and MASTER_ADDR=localhost
- main_distributed.py uses submitit to submit a new SLURM job, which is not suitable when already inside an SBATCH job with data staging and Singularity
Problem 1: MASTER_ADDR in src/utils/distributed.py
Location: src/utils/distributed.py, line 30
Issue: When using SLURM env vars (SLURM_NTASKS, SLURM_PROCID), the code sets os.environ['MASTER_ADDR'] = os.environ['HOSTNAME']. Each process sets MASTER_ADDR to its own node's hostname. On 2 nodes, processes disagree on who the master is → init_process_group hangs or fails.
Fix: Resolve MASTER_ADDR to the first node in the job allocation.
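To make the disagreement concrete, here is a toy simulation, not code from the repository. The hostnames node-a and node-b and the helper master_addr_per_task are made up for illustration. It only models which address each of the 8 tasks would pick under the current behaviour and under the proposed fix.

```python
def master_addr_per_task(task_hosts, use_first_node):
    """Return the MASTER_ADDR each task would choose.

    task_hosts lists the host each task runs on, in rank order.
    """
    first_node = task_hosts[0]
    return [first_node if use_first_node else host for host in task_hosts]


# Hypothetical layout: 2 nodes x 4 tasks per node.
task_hosts = ["node-a"] * 4 + ["node-b"] * 4

# Current behaviour: every task uses its own HOSTNAME.
current = set(master_addr_per_task(task_hosts, use_first_node=False))
# Proposed behaviour: every task uses the first node of the allocation.
proposed = set(master_addr_per_task(task_hosts, use_first_node=True))

print(sorted(current))   # ['node-a', 'node-b']
print(sorted(proposed))  # ['node-a']
```

Two distinct addresses mean the two nodes wait on different rendezvous points. A single address is what init_process_group needs.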
Problem 2: scontrol not available inside Singularity
Issue: Using subprocess.getoutput('scontrol show hostnames ...') inside Python fails in a Singularity container — scontrol is not installed there.
Error observed:
[c10d] The IPv6 network addresses of (/bin/sh: 1: scontrol: not found, 40112) cannot be retrieved (gai error: -2 - Name or service not known)
Fix: Set MASTER_ADDR in the bash job script (where scontrol is available) before launching the container.
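The error above shows a shell message where a hostname should be. This happens because subprocess.getoutput returns whatever the shell printed, including stderr, and does not raise when the command is missing. The following is a minimal sketch of a stricter lookup, assuming a hypothetical helper named resolve_master_addr that is not part of the repository. It prefers a MASTER_ADDR exported by the job script and fails loudly otherwise.

```python
import os
import shutil
import subprocess


def resolve_master_addr(env=None):
    """Return the rendezvous host, preferring a value exported by the job script.

    Hypothetical helper for illustration; not an existing function in the repo.
    """
    env = os.environ if env is None else env

    # Value exported in the bash job script before launching the container.
    addr = env.get("MASTER_ADDR")
    if addr:
        return addr

    # Fail loudly instead of passing a shell error message on as a hostname.
    if shutil.which("scontrol") is None:
        raise RuntimeError("MASTER_ADDR is not set and scontrol is not on PATH")

    result = subprocess.run(
        ["scontrol", "show", "hostnames", env["SLURM_JOB_NODELIST"]],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    hosts = result.stdout.split()
    if not hosts:
        raise RuntimeError("scontrol returned no hostnames")
    return hosts[0]
```

Inside the container this would raise a RuntimeError that names the real cause, instead of surfacing later as a c10d name-resolution failure.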
Required Changes
Change 1: src/utils/distributed.py
Add import subprocess and only set MASTER_ADDR when not already set:
import subprocess

# In init_distributed(), replace lines 26-34:
if (rank is None) or (world_size is None):
    try:
        world_size = int(os.environ['SLURM_NTASKS'])
        rank = int(os.environ['SLURM_PROCID'])
        # Keep a MASTER_ADDR exported by the job script; scontrol is only a
        # fallback where it is available (not inside Singularity).
        if 'MASTER_ADDR' not in os.environ:
            os.environ['MASTER_ADDR'] = subprocess.getoutput(
                'scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1'
            )
    except Exception:
        logger.info('SLURM vars not set (distributed training not available)')
        world_size, rank = 1, 0
        return world_size, rank
Change 2: New script dev/run_on_hpc/mn5/train_multinoise_2node.sh
SBATCH header:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:4
Before srun:
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=40112
Data staging: Wrap in srun --ntasks-per-node=1 --ntasks="$SLURM_NNODES" bash -c '...' so it runs on every node.
Launch command: Use srun --ntasks=8 --ntasks-per-node=4 and call train.main() directly (bypass main.py).
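Under srun, each of the 8 tasks is a separate process, so an entry point that bypasses main.py has to read its identity from the environment. The sketch below is an assumption-laden illustration: slurm_task_layout is a hypothetical helper, and the example values (node-a, rank 5) are made up. It does not call train.main(), since its signature is not shown in this issue.

```python
def slurm_task_layout(env):
    """Collect the per-task values that srun exposes through the environment.

    Hypothetical helper for illustration; not an existing function in the repo.
    """
    # MASTER_ADDR and MASTER_PORT must come from the job script, because
    # scontrol is not available inside the container.
    missing = [k for k in ("MASTER_ADDR", "MASTER_PORT") if not env.get(k)]
    if missing:
        raise RuntimeError(f"export {', '.join(missing)} in the job script before srun")

    return {
        "world_size": int(env["SLURM_NTASKS"]),   # total tasks across all nodes
        "rank": int(env["SLURM_PROCID"]),         # global rank of this task
        "local_rank": int(env["SLURM_LOCALID"]),  # task index on this node
        "master": f"{env['MASTER_ADDR']}:{env['MASTER_PORT']}",
    }


# Made-up environment for one task of a 2-node, 8-task job.
example_env = {
    "SLURM_NTASKS": "8",
    "SLURM_PROCID": "5",
    "SLURM_LOCALID": "1",
    "MASTER_ADDR": "node-a",
    "MASTER_PORT": "40112",
}

layout = slurm_task_layout(example_env)
print(layout["rank"], layout["world_size"], layout["local_rank"], layout["master"])
```

In a real launcher, os.environ would be passed instead of example_env, and local_rank corresponds to the SLURM_LOCALID value that train.py uses for CUDA_VISIBLE_DEVICES.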
References
- train.py line 16: CUDA_VISIBLE_DEVICES = SLURM_LOCALID
- train.py line 138: init_distributed() with no args
- Base script: dev/run_on_hpc/mn5/train_multinoise.sh