How to scale learning rate with batch size for DDP training? #3706

As far as I know, the learning rate is scaled with the batch size so that the sample variance of the gradients stays approximately constant.

Since DDP averages the gradients from all the devices, I think the LR should be scaled based on the effective batch size, namely batch_size * num_accumulated_batches * num_gpus * num_nodes.

In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_nodes=1, the effective batch size is 1024, so the LR should be scaled by sqrt(2) (following the square-root scaling rule that keeps gradient variance constant), compared to a single GPU with effective batch size 512.
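For illustration, here is a minimal sketch of that scaling. The function name `scaled_lr`, its parameters, and the `base_lr`/`rule` defaults are hypothetical, not part of Lightning's API; it only assumes `torch.distributed` is initialized by the DDP launcher so that `get_world_size()` already reflects num_gpus * num_nodes.

```python
import math

import torch.distributed as dist


def scaled_lr(
    base_lr: float,            # LR tuned at base_batch_size on a single GPU
    base_batch_size: int,
    per_gpu_batch_size: int,
    accumulate_grad_batches: int = 1,
    rule: str = "sqrt",
) -> float:
    """Scale a reference LR to the effective (global) batch size under DDP."""
    # world_size already covers num_gpus * num_nodes; fall back to 1 outside DDP.
    world_size = (
        dist.get_world_size() if dist.is_available() and dist.is_initialized() else 1
    )
    effective_batch_size = per_gpu_batch_size * accumulate_grad_batches * world_size
    ratio = effective_batch_size / base_batch_size
    if rule == "linear":
        # Linear scaling rule: LR grows proportionally with the batch size.
        return base_lr * ratio
    # Square-root rule: keeps the per-step gradient noise roughly constant.
    return base_lr * math.sqrt(ratio)


# Example from the question: LR tuned at batch size 512 on 1 GPU, now running
# DDP on 2 GPUs with 512 samples per GPU (effective batch size 1024).
lr = scaled_lr(base_lr=1e-3, base_batch_size=512, per_gpu_batch_size=512)
# -> 1e-3 * sqrt(1024 / 512) = 1e-3 * sqrt(2) when launched on 2 processes
```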

Labels: distributed (generic distributed-related topic), strategy: ddp (DistributedDataParallel)
This discussion was converted from issue #3706 on December 23, 2020 19:51.