seconds/iteration is fast in first epoch, gets slower every subsequent epoch #8659
Unanswered
angadkalra
asked this question in
DDP / multi-GPU / multi-node
Replies: 2 comments 3 replies
-
Do you have any data cached across epochs that needs to be reset? Does performance drop sharply at each epoch boundary, or does it degrade gradually as training runs?
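One way to tell the two cases apart is to log per-iteration wall-clock time grouped by epoch, then compare the start and end of each epoch. The snippet below is a minimal sketch in plain Python, not something taken from Lightning; `IterationTimer` and `classify` are hypothetical helper names, and the 1.2 tolerance is an arbitrary assumption. On GPU you would probably also want to call `torch.cuda.synchronize()` before each clock read, since CUDA kernels run asynchronously.

```python
import time
from collections import defaultdict
from statistics import mean


class IterationTimer:
    """Hypothetical helper: collects seconds per iteration, grouped by epoch."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._last = None
        self.durations = defaultdict(list)  # epoch -> [seconds, ...]

    def start_epoch(self):
        # Reset the reference point so time spent between epochs
        # (validation, checkpointing) is not charged to the first iteration.
        self._last = self._clock()

    def end_iteration(self, epoch):
        now = self._clock()
        self.durations[epoch].append(now - self._last)
        self._last = now

    def summary(self, edge=10):
        """Per epoch: mean s/it over the first and the last `edge` iterations."""
        return {
            epoch: (mean(secs[:edge]), mean(secs[-edge:]))
            for epoch, secs in sorted(self.durations.items())
        }


def classify(rows, tolerance=1.2):
    """Flag slowdown inside an epoch and/or a jump between consecutive epochs."""
    epochs = sorted(rows)
    within = any(rows[e][1] > tolerance * rows[e][0] for e in epochs)
    boundary = any(
        rows[nxt][0] > tolerance * rows[prev][1]
        for prev, nxt in zip(epochs, epochs[1:])
    )
    return {"within_epoch": within, "at_boundary": boundary}
```

Call `start_epoch()` at the top of each epoch and `end_iteration(epoch)` after each training step, then print `classify(timer.summary())`. If only `at_boundary` is set, something that happens between epochs (for example cached state or dataloader workers) is a more likely suspect than a leak that grows step by step.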
-
@tchaton Any updates on this? What is probably causing this?
-
I'm training a 3D ResNet-101 model on grayscale images (224x224x320) with 16-bit precision on a VM with 4x V100 GPUs, using DDP and num_workers = 4. The batch size is 8 (2 per GPU). The first epoch runs fast at 2.5 s/it, but every epoch after that gets slower, and by epoch 6 I'm at 6.5 s/it. Any idea why this is happening, or any tips to speed it up?
Thanks!