Multi-node training (2 nodes × 4 GPUs) for I-JEPA with SLURM + Singularity

## Summary

Add support for multi-node distributed training (2 nodes × 4 GPUs = 8 GPUs total) when using SLURM batch scripts with Singularity containers. The current setup only supports single-node training via `main.py` (multiprocessing) or `main_distributed.py` (submitit, which submits a new job and does not fit the existing SBATCH + Singularity workflow).

## Context

- **Current workflow:** `dev/run_on_hpc/mn5/train_multinoise.sh` — SBATCH script that stages data to `$TMPDIR`, runs `singularity exec ... python main.py --devices cuda:0 cuda:1 cuda:2 cuda:3`
- **main.py** is single-node only: uses `multiprocessing` and `MASTER_ADDR=localhost`
- **main_distributed.py** uses submitit to submit a new SLURM job — not suitable when already inside an SBATCH job with data staging and Singularity

## Problem 1: MASTER_ADDR in `src/utils/distributed.py`

**Location:** `src/utils/distributed.py`, line 30

**Issue:** When using SLURM env vars (`SLURM_NTASKS`, `SLURM_PROCID`), the code sets `os.environ['MASTER_ADDR'] = os.environ['HOSTNAME']`. Each process sets `MASTER_ADDR` to its own node's hostname. On 2 nodes, processes disagree on who the master is → `init_process_group` hangs or fails.

**Fix:** Resolve `MASTER_ADDR` to the first node in the job allocation.
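
To make the failure mode concrete, the following sketch simulates both strategies in plain Python, without touching `torch.distributed`. The host names and the helper `master_addr_seen_by_ranks` are hypothetical, and ranks are assumed to be laid out 4 per node.

```python
def master_addr_seen_by_ranks(rank_hosts, use_first_node):
    """Return the MASTER_ADDR each rank would pick.

    rank_hosts: host name of the node each rank runs on, indexed by rank.
    use_first_node: if True, every rank uses the first node of the job;
    if False, every rank uses its own HOSTNAME (the current behaviour).
    """
    # Unique node names in order of first appearance
    # (a stand-in for the job's node list).
    nodes = list(dict.fromkeys(rank_hosts))
    if use_first_node:
        return [nodes[0] for _ in rank_hosts]
    return list(rank_hosts)


# Hypothetical 2-node job with 4 ranks per node.
rank_hosts = ['node-a'] * 4 + ['node-b'] * 4

own_hostname = master_addr_seen_by_ranks(rank_hosts, use_first_node=False)
first_node = master_addr_seen_by_ranks(rank_hosts, use_first_node=True)

# With HOSTNAME there are two different "masters"; with the first node, one.
print(sorted(set(own_hostname)))  # ['node-a', 'node-b']
print(sorted(set(first_node)))    # ['node-a']
```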

## Problem 2: `scontrol` not available inside Singularity

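**Issue:** Calling `subprocess.getoutput('scontrol show hostnames ...')` from Python fails inside the Singularity container because `scontrol` is not installed there. `getoutput` does not raise on failure; it returns the shell's error text, which then ends up as the value of `MASTER_ADDR`.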

**Error observed:**
```
[c10d] The IPv6 network addresses of (/bin/sh: 1: scontrol: not found, 40112) cannot be retrieved (gai error: -2 - Name or service not known)
```

**Fix:** Set `MASTER_ADDR` in the bash job script (where `scontrol` is available) before launching the container.
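
Because the bad value only surfaces later inside c10d, a small guard could fail earlier with a clearer message. This is a sketch, not part of the repository: `check_master_addr` is a hypothetical helper, and the rule it applies (no whitespace, no `/`) is an assumption about what a usable host name looks like.

```python
def check_master_addr(value):
    """Return the stripped value if it looks like a host name, else raise ValueError."""
    candidate = value.strip()
    # A host name or IP address contains neither whitespace nor '/';
    # shell error output such as '/bin/sh: 1: scontrol: not found' contains both.
    if not candidate or '/' in candidate or any(ch.isspace() for ch in candidate):
        raise ValueError(f'MASTER_ADDR does not look like a host name: {value!r}')
    return candidate


# Hypothetical values: a plausible node name and the error text from the log above.
print(check_master_addr('node-a'))
try:
    check_master_addr('/bin/sh: 1: scontrol: not found')
except ValueError as err:
    print(err)
```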

---

## Required Changes

### Change 1: `src/utils/distributed.py`

Add `import subprocess` and set `MASTER_ADDR` only when it is not already present in the environment, so that a value exported by the job script takes precedence:

```python
import subprocess

# In init_distributed(), replace lines 26-34:
if (rank is None) or (world_size is None):
    try:
        world_size = int(os.environ['SLURM_NTASKS'])
        rank = int(os.environ['SLURM_PROCID'])
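        # Respect a MASTER_ADDR exported by the job script. The scontrol
        # fallback only works where scontrol is on PATH, i.e. not inside
        # the Singularity container (see Problem 2).
        if 'MASTER_ADDR' not in os.environ:
            os.environ['MASTER_ADDR'] = subprocess.getoutput(
                'scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1'
            )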
    except Exception:
        logger.info('SLURM vars not set (distributed training not available)')
        world_size, rank = 1, 0
        return world_size, rank
```
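
One way to make the `scontrol` fallback fail loudly inside the container is to check that the binary exists before calling it. The sketch below is an alternative for discussion, not the change proposed in this issue; `resolve_master_addr` and its `env` and `which` parameters are hypothetical names, introduced so the logic can be exercised without SLURM.

```python
import os
import shutil
import subprocess


def resolve_master_addr(env=None, which=shutil.which):
    """Prefer MASTER_ADDR from the environment, else ask scontrol if present."""
    env = os.environ if env is None else env
    addr = env.get('MASTER_ADDR')
    if addr:
        return addr
    if which('scontrol') is None:
        # Typical inside the container: fail with a clear message instead of
        # passing shell error text on as a host name.
        raise RuntimeError('MASTER_ADDR is unset and scontrol is not on PATH; '
                           'export MASTER_ADDR in the job script')
    result = subprocess.run(
        ['scontrol', 'show', 'hostnames', env['SLURM_JOB_NODELIST']],
        capture_output=True, text=True, check=True,
    )
    hosts = result.stdout.split()
    if not hosts:
        raise RuntimeError('scontrol returned no host names')
    return hosts[0]  # first node of the allocation


# Hypothetical value, as exported by the job script.
print(resolve_master_addr({'MASTER_ADDR': 'node-a'}))
```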

### Change 2: New script `dev/run_on_hpc/mn5/train_multinoise_2node.sh`

**SBATCH header:**
```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:4
```
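
With this header the job runs 2 × 4 = 8 tasks, one per GPU. Assuming SLURM's default block distribution of tasks across nodes (an assumption; `--distribution` can change it), each global rank maps to a node index and a node-local ID as sketched below. `rank_layout` is a hypothetical helper, not part of the repository.

```python
def rank_layout(nodes, tasks_per_node):
    """Map each global rank to (node_index, local_id) under block distribution."""
    return {
        rank: divmod(rank, tasks_per_node)
        for rank in range(nodes * tasks_per_node)
    }


layout = rank_layout(nodes=2, tasks_per_node=4)
print(len(layout))  # 8
print(layout[0])    # (0, 0)
print(layout[5])    # (1, 1)
```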

**Before srun:**
```bash
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=40112
```

**Data staging:** Wrap in `srun --ntasks-per-node=1 --ntasks="$SLURM_NNODES" bash -c '...'` so the staging step runs once on every node instead of only on the node that executes the batch script.

**Launch command:** Use `srun --ntasks=8 --ntasks-per-node=4` and call `train.main()` directly (bypass `main.py`).
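
Each task started by `srun` can derive its rank and GPU from the SLURM environment, in line with the `CUDA_VISIBLE_DEVICES = SLURM_LOCALID` convention listed under References. The sketch below shows only that lookup. `task_settings` is a hypothetical helper, the example values are made up, and how the result is handed to `train.main()` is left out because that signature is not shown in this issue.

```python
import os


def task_settings(env=None):
    """Read the per-task settings that srun and the job script export."""
    env = os.environ if env is None else env
    return {
        'rank': int(env['SLURM_PROCID']),
        'world_size': int(env['SLURM_NTASKS']),
        # One GPU per task: the node-local task ID selects the device.
        'cuda_visible_devices': env['SLURM_LOCALID'],
        'master_addr': env['MASTER_ADDR'],
        'master_port': int(env['MASTER_PORT']),
    }


# Hypothetical environment of one task on the second node of the 2 x 4 job.
example_env = {
    'SLURM_PROCID': '5',
    'SLURM_NTASKS': '8',
    'SLURM_LOCALID': '1',
    'MASTER_ADDR': 'node-a',
    'MASTER_PORT': '40112',
}
settings = task_settings(example_env)
print(settings)
```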

---

## References

- `train.py` line 16: `CUDA_VISIBLE_DEVICES = SLURM_LOCALID`
- `train.py` line 138: `init_distributed()` with no args
- Base script: `dev/run_on_hpc/mn5/train_multinoise.sh`
