Bug description
I use DeepSpeed ZeRO Stage 2 Offload integrated in Lightning to train my model. With a single GPU, one epoch takes about 5 hours; however, with two ranks it takes about 12.5 hours.
Something strange: with a single GPU, `nvidia-smi` shows only one process, but with 2-GPU training there are two processes on each rank.

If I use a large batch size, the run reports an `Out of CUDA memory` error and exits. With a single GPU, training simply terminates. With 2 GPUs, however, only one process on each rank exits while the other keeps running, and training then continues normally at a much faster speed (2.5 h/epoch), which looks like the expected speed (half of the single-GPU time).

It looks like the additional process is what limits the speed of multi-GPU training. Is its existence normal? If so, what is its function? Is there any way to avoid it? (Obviously I don't want to kill it by triggering an `Out of CUDA memory` error manually.)

What version are you seeing the problem on?
v2.1
How to reproduce the bug
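No reproduction script was attached; below is a minimal sketch of the setup described above (DeepSpeed ZeRO Stage 2 Offload on 2 GPUs via the Lightning Trainer). The model, dataset, and hyperparameters are placeholders, not the reporter's actual training code.

```python
# Minimal sketch, assuming a toy LightningModule and random data; only the
# strategy/devices settings reflect the configuration described in the report.
import torch
from torch.utils.data import DataLoader, Dataset
import lightning as L


class RandomDataset(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return torch.randn(32), torch.randn(2)


class BoringModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,                             # 1 for the single-GPU comparison run
        strategy="deepspeed_stage_2_offload",  # DeepSpeed ZeRO Stage 2 Offload
        precision="16-mixed",
        max_epochs=1,
    )
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=64))
```

With `devices=2`, checking `nvidia-smi` during training should show the per-rank processes in question; comparing against the `devices=1` run reproduces the timing difference described above.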
Error messages and logs
2-gpu:
single-gpu:
Environment
Current environment
More info
No response