DDP training never starts #11865
Replies: 6 comments 2 replies
-
I met the same problem.
-
FYI, there's #13087 created from this discussion, although there's no update nor repro.
-
+1, I am facing the same problem.
-
An update for those who might be facing the same issue: the problem I ran into was the dataset and tokenization logic causing the CPU to OOM. In my case it was due to tokenizing all the text up front when the data is loaded, instead of lazily per sample.
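The fix described above can be sketched as a map-style dataset that stores only the raw strings and tokenizes per sample in `__getitem__`. This is a minimal illustration with a toy `tokenize_fn` (all names here are illustrative); any class with `__len__` and `__getitem__` works as a map-style `torch.utils.data.Dataset`, so no framework import is needed for the sketch:

```python
# Minimal sketch of lazy tokenization (toy tokenize_fn; names are illustrative).
# Storing only raw strings keeps host RAM flat; each sample is tokenized on
# demand inside the DataLoader workers instead of all at once at load time.
class LazyTextDataset:
    def __init__(self, texts, tokenize_fn):
        self.texts = texts              # raw strings only, no up-front tokenization
        self.tokenize_fn = tokenize_fn

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize a single sample on demand.
        return self.tokenize_fn(self.texts[idx])

ds = LazyTextDataset(["hello world", "ddp hangs"], str.split)
print(ds[1])  # ['ddp', 'hangs']
```

With a real tokenizer you would pass e.g. a Hugging Face tokenizer call as `tokenize_fn`; the point is only that tokenization moves out of `__init__`.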
-
Solution for 4090 users, or anyone with NCCL P2P communication unsupported or disabled: set the env variable NCCL_P2P_DISABLE=1. I ran into this problem when running my code on a 4090 in exclusive-process compute mode.
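If you set the variable from Python rather than the shell, it has to happen before NCCL is initialized, i.e. before the Trainer or `torch.distributed` sets up process groups. A minimal sketch:

```python
import os

# Disable NCCL peer-to-peer transfers. This must run before anything
# initializes NCCL (e.g. before constructing the Trainer or calling
# torch.distributed.init_process_group), or it has no effect.
os.environ["NCCL_P2P_DISABLE"] = "1"
print(os.environ["NCCL_P2P_DISABLE"])  # 1
```

Equivalently, `export NCCL_P2P_DISABLE=1` in the shell before launching avoids the ordering concern entirely.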
-
Setting NCCL_P2P_DISABLE=1 also worked for me (4x Nvidia A5000).
-
I'm trying to run DDP training with the PyTorch Lightning Trainer via hydra on a multi-GPU GCP instance, but when I launch the experiment, I get the following output, and the process gets stuck at that point. I cannot exit the process, nor can I ssh back into the VM if I exit. If I switch the strategy to `dp` instead, the experiment launches normally and I'm able to complete training. Does anyone know what the issue might be, and how to go about solving it?
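For anyone debugging a similar hang: assuming a hydra config that exposes the trainer fields (the `train.py` script name and config keys below are hypothetical), a useful first step is re-launching with NCCL debug logging enabled and comparing strategies:

```shell
# Hypothetical launch commands; script name and hydra keys are assumed.
# NCCL_DEBUG=INFO makes NCCL print rendezvous/transport details, which
# usually shows where the hang occurs (e.g. stuck during P2P setup).
NCCL_DEBUG=INFO python train.py trainer.strategy=ddp trainer.devices=2

# If ddp hangs but dp works (as reported above), trying ddp with
# peer-to-peer disabled is a common next step:
NCCL_P2P_DISABLE=1 python train.py trainer.strategy=ddp trainer.devices=2
```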