DDP training never starts #11865
Replies: 6 comments 2 replies
-
I met the same problem.
-
FYI, there's #13087 created from this discussion, although there's no update nor repro.
-
+1, I am facing the same problem.
-
An update for those who might be facing the same issue: the problem I ran into was the dataset and tokenization logic causing the CPU to OOM. In my case it was due to tokenizing all the text up front when the data is loaded, instead of lazily per sample.
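The fix described above can be sketched as a map-style dataset that stores only the raw strings and tokenizes per sample in `__getitem__`. This is a minimal illustration with a toy `tokenize_fn` (all names here are illustrative); any class with `__len__` and `__getitem__` works as a map-style `torch.utils.data.Dataset`, so no framework import is needed for the sketch:

```python
# Minimal sketch of lazy tokenization (toy tokenize_fn; names are illustrative).
# Storing only raw strings keeps host RAM flat; each sample is tokenized on
# demand inside the DataLoader workers instead of all at once at load time.
class LazyTextDataset:
    def __init__(self, texts, tokenize_fn):
        self.texts = texts              # raw strings only, no up-front tokenization
        self.tokenize_fn = tokenize_fn

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize a single sample on demand.
        return self.tokenize_fn(self.texts[idx])

ds = LazyTextDataset(["hello world", "ddp hangs"], str.split)
print(ds[1])  # ['ddp', 'hangs']
```

With a real tokenizer you would pass e.g. a Hugging Face tokenizer call as `tokenize_fn`; the point is only that tokenization moves out of `__init__`.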
-
Solution for 4090 users, or anyone with NCCL P2P communication unsupported or disabled: set the env variable NCCL_P2P_DISABLE=1. I ran into this problem when running my code on a 4090 in exclusive-process compute mode.
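If you set the variable from Python rather than the shell, it has to happen before NCCL is initialized, i.e. before the Trainer or `torch.distributed` sets up process groups. A minimal sketch:

```python
import os

# Disable NCCL peer-to-peer transfers. This must run before anything
# initializes NCCL (e.g. before constructing the Trainer or calling
# torch.distributed.init_process_group), or it has no effect.
os.environ["NCCL_P2P_DISABLE"] = "1"
print(os.environ["NCCL_P2P_DISABLE"])  # 1
```

Equivalently, `export NCCL_P2P_DISABLE=1` in the shell before launching avoids the ordering concern entirely.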
-
Setting NCCL_P2P_DISABLE=1 also worked for me (4x Nvidia A5000).
-
I'm trying to run DDP training with the PyTorch Lightning Trainer via hydra on a multi-GPU GCP instance, but when I launch the experiment, I get the following output, and the process gets stuck at that point. I cannot exit the process, nor can I ssh back into the VM if I exit. If I switch the strategy to `dp` instead, the experiment launches normally and I'm able to complete training. Does anyone know what the issue might be, and how to go about solving it?
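For anyone debugging a similar hang: assuming a hydra config that exposes the trainer fields (the `train.py` script name and config keys below are hypothetical), a useful first step is re-launching with NCCL debug logging enabled and comparing strategies:

```shell
# Hypothetical launch commands; script name and hydra keys are assumed.
# NCCL_DEBUG=INFO makes NCCL print rendezvous/transport details, which
# usually shows where the hang occurs (e.g. stuck during P2P setup).
NCCL_DEBUG=INFO python train.py trainer.strategy=ddp trainer.devices=2

# If ddp hangs but dp works (as reported above), trying ddp with
# peer-to-peer disabled is a common next step:
NCCL_P2P_DISABLE=1 python train.py trainer.strategy=ddp trainer.devices=2
```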