Multi-node training (2 nodes × 4 GPUs) for I-JEPA with SLURM + Singularity #73

Description

@kutayeroglu

Summary

Add support for multi-node distributed training (2 nodes × 4 GPUs = 8 GPUs total) when using SLURM batch scripts with Singularity containers. The current setup has two entry points, and neither covers this case: main.py is single-node only (multiprocessing), and main_distributed.py relies on submitit, which submits a new job and does not fit the existing SBATCH + Singularity workflow.

Context

  • Current workflow: dev/run_on_hpc/mn5/train_multinoise.sh — SBATCH script that stages data to $TMPDIR, runs singularity exec ... python main.py --devices cuda:0 cuda:1 cuda:2 cuda:3
  • main.py is single-node only: uses multiprocessing and MASTER_ADDR=localhost
  • main_distributed.py uses submitit to submit a new SLURM job — not suitable when already inside an SBATCH job with data staging and Singularity

Problem 1: MASTER_ADDR in src/utils/distributed.py

Location: src/utils/distributed.py, line 30

Issue: When using SLURM env vars (SLURM_NTASKS, SLURM_PROCID), the code sets os.environ['MASTER_ADDR'] = os.environ['HOSTNAME']. Each process sets MASTER_ADDR to its own node's hostname. On 2 nodes, processes disagree on who the master is → init_process_group hangs or fails.

Fix: Resolve MASTER_ADDR to the first node in the job allocation.
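
To make the disagreement concrete, here is a small sketch that simulates which MASTER_ADDR each of the 8 ranks would pick under the current rule (own hostname) and under the proposed rule (first node). The hostnames node-a and node-b and the helper name are hypothetical, and the layout assumes ranks 0-3 land on the first node.

```python
def master_addr_per_rank(node_of_rank, use_first_node):
    """Return the MASTER_ADDR that each rank would choose.

    node_of_rank: list indexed by global rank, holding that rank's hostname.
    use_first_node: False mimics the current code (own hostname),
                    True mimics the proposed fix (first node for everyone).
    """
    first_node = node_of_rank[0]  # assumption: rank 0 runs on the first node
    return [first_node if use_first_node else host for host in node_of_rank]


# Hypothetical 2 nodes x 4 tasks layout.
layout = ["node-a"] * 4 + ["node-b"] * 4

old = set(master_addr_per_rank(layout, use_first_node=False))
new = set(master_addr_per_rank(layout, use_first_node=True))

print(sorted(old))  # ['node-a', 'node-b'] -> two masters, rendezvous cannot complete
print(sorted(new))  # ['node-a'] -> all ranks agree
```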

Problem 2: scontrol not available inside Singularity

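Issue: Calling subprocess.getoutput('scontrol show hostnames ...') from Python fails inside a Singularity container because scontrol is not installed there. getoutput does not raise on failure; it returns the shell's error message, which then ends up stored as MASTER_ADDR.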

Error observed:

[c10d] The IPv6 network addresses of (/bin/sh: 1: scontrol: not found, 40112) cannot be retrieved (gai error: -2 - Name or service not known)

Fix: Set MASTER_ADDR in the bash job script (where scontrol is available) before launching the container.
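
As an optional safeguard, the Python side could check that MASTER_ADDR looks like a hostname before it reaches init_process_group, so a bad value fails fast with a readable message instead of the c10d error above. This is only a sketch: checked_master_addr is a hypothetical helper, not part of the repository, and the pattern is deliberately permissive (it would reject IPv6 literals).

```python
import re

# Permissive check for a hostname or IPv4 address: letters, digits, dots, hyphens.
# Assumption: cluster node names follow this shape; IPv6 literals are not covered.
_HOST_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?")


def checked_master_addr(env):
    """Return env['MASTER_ADDR'] if it looks like a hostname, else raise."""
    addr = env.get("MASTER_ADDR", "").strip()
    if not _HOST_RE.fullmatch(addr):
        raise RuntimeError(
            f"MASTER_ADDR is missing or malformed: {addr!r}. "
            "Export it in the SBATCH script before launching the container."
        )
    return addr
```

Passing os.environ as env would validate the value exported by the job script; the shell error text from the log above would be rejected because it contains spaces and slashes.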


Required Changes

Change 1: src/utils/distributed.py

Add import subprocess and set MASTER_ADDR only when it is not already set, so a value exported by the job script takes precedence:

import subprocess

# In init_distributed(), replace lines 26-34:
if (rank is None) or (world_size is None):
    try:
        world_size = int(os.environ['SLURM_NTASKS'])
        rank = int(os.environ['SLURM_PROCID'])
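        # Keep a MASTER_ADDR exported by the job script; query scontrol only
        # as a fallback (it is not available inside the Singularity container).
        if 'MASTER_ADDR' not in os.environ:
            os.environ['MASTER_ADDR'] = subprocess.getoutput(
                'scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1'
            )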
    except Exception:
        logger.info('SLURM vars not set (distributed training not available)')
        world_size, rank = 1, 0
        return world_size, rank
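
One possible hardening of the fallback, sketched under assumptions: instead of going through a shell pipeline, look up scontrol explicitly and return None when it is missing or fails, so the caller can decide how to report the problem. first_node and its scontrol parameter are hypothetical names introduced for illustration; only the `scontrol show hostnames <nodelist>` command comes from the issue.

```python
import shutil
import subprocess


def first_node(nodelist, scontrol="scontrol"):
    """Return the first hostname in a SLURM nodelist.

    Returns None if the scontrol binary is not on PATH (for example inside a
    container without SLURM client tools) or if the command fails.
    """
    if shutil.which(scontrol) is None:
        return None
    result = subprocess.run(
        [scontrol, "show", "hostnames", nodelist],
        capture_output=True,
        text=True,
    )
    hosts = result.stdout.split()
    if result.returncode != 0 or not hosts:
        return None
    return hosts[0]
```

A caller could use `first_node(os.environ['SLURM_JOB_NODELIST'])` and raise a clear error on None rather than writing an error string into MASTER_ADDR.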

Change 2: New script dev/run_on_hpc/mn5/train_multinoise_2node.sh

SBATCH header:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:4

Before srun:

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=40112

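Data staging: Wrap the staging commands in srun --ntasks-per-node=1 --ntasks="$SLURM_NNODES" bash -c '...' so they run once on every node; the batch script body itself executes only on the first node of the allocation.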

Launch command: Use srun --ntasks=8 --ntasks-per-node=4 and call train.main() directly (bypass main.py).
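
With this launch command, srun starts 8 copies of the Python entry point, and each copy learns its role from per-task SLURM variables. The sketch below shows how such an entry point might read and sanity-check them before calling train.main(). slurm_task_layout is a hypothetical helper, and the exact arguments train.main() expects are not shown here because they depend on the repository.

```python
def slurm_task_layout(env):
    """Read the per-task SLURM variables that srun sets for each task."""
    world_size = int(env["SLURM_NTASKS"])    # 8 for 2 nodes x 4 tasks
    rank = int(env["SLURM_PROCID"])          # global rank, 0..world_size-1
    local_rank = int(env["SLURM_LOCALID"])   # index of the task on its node
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world size {world_size}")
    return {"world_size": world_size, "rank": rank, "local_rank": local_rank}
```

Called as `slurm_task_layout(os.environ)`, the local_rank value is what train.py (see References) assigns to CUDA_VISIBLE_DEVICES so that each task sees exactly one GPU.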


References

  • train.py line 16: CUDA_VISIBLE_DEVICES = SLURM_LOCALID
  • train.py line 138: init_distributed() with no args
  • Base script: dev/run_on_hpc/mn5/train_multinoise.sh
