Multi-node Training with DDP stuck at "Initialize distributed..." on SLURM cluster #19817

Open
OswaldHe opened this issue Apr 25, 2024 · 3 comments
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)

Comments

OswaldHe commented Apr 25, 2024

Bug description

I'm working on a SLURM cluster with 8 AMD MI100 GPUs spread across 2 nodes (4 GPUs per node). I followed the instructions at https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html to submit a multi-node training job, but the job gets stuck at "Initializing distributed: ...". I checked all related issues and none of them solve the problem.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

Training Script:

import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))


# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1, num_nodes=2, devices=4, strategy="ddp")
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

SLURM batch script:

#!/bin/bash

#SBATCH -p mi1004x
#SBATCH --nodes=2             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4   # This needs to match Trainer(devices=...)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight
# run script from above
srun python train.py
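
For reference, here is a sketch of the same batch script with verbose rendezvous and collective logging enabled, which can help show where the initialization stalls. It assumes the ROCm build routes RCCL logging through the standard NCCL_* variables; the TORCH_* variables are standard PyTorch debug switches:

#!/bin/bash

#SBATCH -p mi1004x
#SBATCH --nodes=2             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4   # This needs to match Trainer(devices=...)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight

# extra diagnostics (assumption: RCCL reads NCCL_DEBUG on ROCm builds)
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO

# run script from above
srun python train.py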

Error messages and logs

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('AMD Instinct MI100') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
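
Only four of the eight expected ranks print the "Initializing distributed" message above. A quick sanity check on what SLURM actually launches on each node (a diagnostic sketch that uses only standard SLURM environment variables) would be:

# print one line per task: which host it runs on and which rank SLURM assigned
srun bash -c 'echo "host=$(hostname) SLURM_PROCID=$SLURM_PROCID SLURM_NODEID=$SLURM_NODEID SLURM_NTASKS=$SLURM_NTASKS"'

If this does not print eight lines spread across both nodes, the problem is likely in the job allocation or launch rather than in Lightning itself.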

Environment

Current environment
  • CUDA:
    • GPU:
      • AMD Instinct MI100
      • AMD Instinct MI100
      • AMD Instinct MI100
      • AMD Instinct MI100
    • available: True
    • version: None
  • Lightning:
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • pytorch-lightning: 2.2.1
    • pytorch-triton-rocm: 2.2.0
    • torch: 2.2.0+rocm5.6
    • torchaudio: 2.2.0+rocm5.6
    • torchmetrics: 1.3.2
    • torchvision: 0.17.0+rocm5.6
  • Packages:
    • absl-py: 2.1.0
    • aiohttp: 3.9.3
    • aiosignal: 1.3.1
    • annotated-types: 0.6.0
    • async-timeout: 4.0.3
    • attrs: 23.2.0
    • certifi: 2022.12.7
    • charset-normalizer: 2.1.1
    • deepspeed: 0.14.0
    • filelock: 3.9.0
    • frozenlist: 1.4.1
    • fsspec: 2023.4.0
    • future: 1.0.0
    • grpcio: 1.62.1
    • hjson: 3.1.0
    • idna: 3.4
    • imageio: 2.34.0
    • jinja2: 3.1.2
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • markdown: 3.6
    • markupsafe: 2.1.3
    • mpmath: 1.3.0
    • multidict: 6.0.5
    • networkx: 3.2.1
    • ninja: 1.11.1.1
    • numpy: 1.26.3
    • packaging: 24.0
    • pandas: 2.2.1
    • pillow: 10.2.0
    • pip: 23.3.1
    • protobuf: 5.26.1
    • psutil: 5.9.8
    • py-cpuinfo: 9.0.0
    • pydantic: 2.7.0
    • pydantic-core: 2.18.1
    • pynvml: 11.5.0
    • python-dateutil: 2.9.0.post0
    • pytorch-lightning: 2.2.1
    • pytorch-triton-rocm: 2.2.0
    • pytz: 2024.1
    • pyyaml: 6.0.1
    • requests: 2.28.1
    • setuptools: 68.2.2
    • six: 1.16.0
    • sympy: 1.12
    • tensorboard: 2.16.2
    • tensorboard-data-server: 0.7.2
    • test-tube: 0.7.5
    • torch: 2.2.0+rocm5.6
    • torchaudio: 2.2.0+rocm5.6
    • torchmetrics: 1.3.2
    • torchvision: 0.17.0+rocm5.6
    • tqdm: 4.66.2
    • typing-extensions: 4.8.0
    • tzdata: 2024.1
    • urllib3: 1.26.13
    • werkzeug: 3.0.1
    • wheel: 0.41.2
    • yarl: 1.9.4
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.10.14
    • release: 5.14.0-162.18.1.el9_1.x86_64
    • version: SMP PREEMPT_DYNAMIC Wed Mar 1 22:02:24 UTC 2023

More info

No response

OswaldHe added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Apr 25, 2024
@jaydeepradeJD

Try using "srun python3 train.py". python --> python3

@OswaldHe (Author)

I tried python3, but the issue remains.

@FelixBrakel

I have the same issue. It works fine when launched directly with srun, but it hangs when submitted as a job with sbatch.
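
For reference, a minimal pair of invocations that contrasts the two launch paths. The partition, node, and task counts are taken from the batch script above; the exact srun flags for the working interactive run are an assumption, and submit.sh is only a placeholder name for that batch script:

# interactive launch (reported to work)
srun -p mi1004x --nodes=2 --ntasks-per-node=4 --time=0-00:30:00 python3 train.py

# batch submission of the same workload (reported to hang at "Initializing distributed: ...")
# submit.sh is a placeholder name for the SLURM batch script shown earlier
sbatch submit.sh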
