Gradient Clipping in DDP/DDPSharded strategy behavior #19340
chetwinlow asked this question in DDP / multi-GPU / multi-node (unanswered)
Based on my understanding, when training with plain PyTorch DDP (not Lightning), loss.backward() automatically triggers a gradient sync that averages gradients across ranks. However, an image in the documentation says that with gradient accumulation, gradients are not synced until optimizer.step(), and the local effective batch size is (bs * n_accums). If gradient clipping is enabled, is the clipping applied before or after the sync / averaging? I believe the correct approach is to clip after the global averaging / sync, but I cannot find the actual implementation of this mechanism in the codebase. Can someone confirm how this flow is implemented in PyTorch Lightning, or shed some light on the matter?
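For reference, here is a minimal sketch of the ordering in plain PyTorch DDP with gradient accumulation; this is not Lightning's implementation, and `train_epoch`, `loss_fn`, `dataloader`, `n_accum`, and `max_norm` are placeholder names. Because DDP launches its gradient all-reduce inside backward(), except on steps wrapped in no_sync(), clipping placed between the final backward() and optimizer.step() operates on the already-averaged gradients:

```python
# Minimal sketch of plain PyTorch DDP with gradient accumulation and clipping.
# Assumes the process group is already initialized and `model` is wrapped in DDP.
import contextlib

import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def train_epoch(model: DDP, dataloader, loss_fn, optimizer,
                n_accum: int = 4, max_norm: float = 1.0) -> None:
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(dataloader):
        last_micro_batch = (step + 1) % n_accum == 0
        # On intermediate micro-batches, no_sync() suppresses the cross-rank
        # all-reduce, so gradients only accumulate locally in .grad.
        ctx = contextlib.nullcontext() if last_micro_batch else model.no_sync()
        with ctx:
            loss = loss_fn(model(x), y) / n_accum
            loss.backward()  # on the last micro-batch, DDP averages grads here
        if last_micro_batch:
            # At this point .grad holds the globally averaged gradients,
            # so the clip threshold is applied to the synced values.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            optimizer.step()
            optimizer.zero_grad()
```

Clipping after the sync also keeps all ranks consistent: if each rank clipped its local gradients before the all-reduce, the ranks could apply different scaling factors, and the result would no longer match single-process training with the same effective batch size.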