How to scale learning rate with batch size for DDP training? #3706

As far as I know, the learning rate is scaled with the batch size so that the sample variance of the gradients stays approximately constant.

Since DDP averages the gradients from all the devices, I think the LR should be scaled based on the effective batch size, namely batch_size * num_accumulated_batches * num_gpus * num_nodes.

In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_nodes=1, the effective batch size is 1024, so the LR should be scaled by sqrt(2) (following the square-root scaling rule that keeps gradient variance constant), compared to a single GPU with effective batch size 512.
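For illustration, here is a minimal sketch of that scaling. The function name `scaled_lr`, its parameters, and the `base_lr`/`rule` defaults are hypothetical, not part of Lightning's API; it only assumes `torch.distributed` is initialized by the DDP launcher so that `get_world_size()` already reflects num_gpus * num_nodes.

```python
import math

import torch.distributed as dist


def scaled_lr(
    base_lr: float,            # LR tuned at base_batch_size on a single GPU
    base_batch_size: int,
    per_gpu_batch_size: int,
    accumulate_grad_batches: int = 1,
    rule: str = "sqrt",
) -> float:
    """Scale a reference LR to the effective (global) batch size under DDP."""
    # world_size already covers num_gpus * num_nodes; fall back to 1 outside DDP.
    world_size = (
        dist.get_world_size() if dist.is_available() and dist.is_initialized() else 1
    )
    effective_batch_size = per_gpu_batch_size * accumulate_grad_batches * world_size
    ratio = effective_batch_size / base_batch_size
    if rule == "linear":
        # Linear scaling rule: LR grows proportionally with the batch size.
        return base_lr * ratio
    # Square-root rule: keeps the per-step gradient noise roughly constant.
    return base_lr * math.sqrt(ratio)


# Example from the question: LR tuned at batch size 512 on 1 GPU, now running
# DDP on 2 GPUs with 512 samples per GPU (effective batch size 1024).
lr = scaled_lr(base_lr=1e-3, base_batch_size=512, per_gpu_batch_size=512)
# -> 1e-3 * sqrt(1024 / 512) = 1e-3 * sqrt(2) when launched on 2 processes
```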

Labels: distributed (generic distributed-related topic), strategy: ddp (DistributedDataParallel)
This discussion was converted from issue #3706 on December 23, 2020 19:51.