Gradient Clipping in DDP/DDPSharded strategy behavior #19340
chetwinlow asked this question in DDP / multi-GPU / multi-node (unanswered)
Based on my understanding, when training with plain PyTorch DDP (not Lightning), loss.backward() automatically triggers a gradient sync that averages gradients across ranks. However, an image in the documentation says that with gradient accumulation, gradients are not synced until optimizer.step(), and the local effective batch size is (bs * n_accums). If gradient clipping is enabled, is the clipping applied before or after the sync / averaging? I believe the correct approach is to clip after the global averaging / sync, but I cannot find the actual implementation of this mechanism in the codebase. Can someone confirm how this flow is implemented in PyTorch Lightning, or shed some light on the matter?
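For reference, here is a minimal sketch of the ordering in plain PyTorch DDP with gradient accumulation; this is not Lightning's implementation, and `train_epoch`, `loss_fn`, `dataloader`, `n_accum`, and `max_norm` are placeholder names. Because DDP launches its gradient all-reduce inside backward(), except on steps wrapped in no_sync(), clipping placed between the final backward() and optimizer.step() operates on the already-averaged gradients:

```python
# Minimal sketch of plain PyTorch DDP with gradient accumulation and clipping.
# Assumes the process group is already initialized and `model` is wrapped in DDP.
import contextlib

import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def train_epoch(model: DDP, dataloader, loss_fn, optimizer,
                n_accum: int = 4, max_norm: float = 1.0) -> None:
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(dataloader):
        last_micro_batch = (step + 1) % n_accum == 0
        # On intermediate micro-batches, no_sync() suppresses the cross-rank
        # all-reduce, so gradients only accumulate locally in .grad.
        ctx = contextlib.nullcontext() if last_micro_batch else model.no_sync()
        with ctx:
            loss = loss_fn(model(x), y) / n_accum
            loss.backward()  # on the last micro-batch, DDP averages grads here
        if last_micro_batch:
            # At this point .grad holds the globally averaged gradients,
            # so the clip threshold is applied to the synced values.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            optimizer.step()
            optimizer.zero_grad()
```

Clipping after the sync also keeps all ranks consistent: if each rank clipped its local gradients before the all-reduce, the ranks could apply different scaling factors, and the result would no longer match single-process training with the same effective batch size.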