Does PyTorch Lightning divide the loss by the number of gradient accumulation steps? #17035

Answered by JuanFMontesinos
Asked by offchan42 in: Lightning Trainer API: Trainer, LightningModule, LightningDataModule
For example, in this code, does the `Trainer` divide the loss by the number of gradient accumulation steps (i.e. 7)?

```python
# Accumulate gradients for 7 batches
trainer = Trainer(accumulate_grad_batches=7)
```

I want to know this in order to understand how to set the learning rate correctly. I saw that the Hugging Face `accelerate` package does divide the loss: https://huggingface.co/docs/accelerate/usage_guides/gradient_accumulation

```python
loss = loss / gradient_accumulation_steps  # this is the line that divides the loss
accelerator.backward(loss)

if (index + 1) % gradient_accumulation_steps == 0:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

If the loss is not divided, the accumulated gradient becomes larger, which has implications for whether you should scale the learning rate by the number of gradient accumulation steps.
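To make the concern above concrete, here is a framework-free numerical sketch (plain Python, no Lightning or `accelerate`; the function name `grad_mse` is illustrative). It checks that summing unscaled per-micro-batch gradients gives `k` times the full-batch mean gradient, while dividing each micro-batch loss by `k` recovers the full-batch gradient exactly:

```python
# Toy model: y_hat = w * x with mean-squared-error loss.
# Compare the full-batch gradient against accumulated micro-batch
# gradients, with and without dividing each micro-batch loss by k.

def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
k = 2  # accumulation steps; two micro-batches of size 2

full_grad = grad_mse(w, xs, ys)  # gradient over the whole batch

micro = [(xs[i:i + 2], ys[i:i + 2]) for i in range(0, 4, 2)]
acc_raw = sum(grad_mse(w, mx, my) for mx, my in micro)      # losses not divided
acc_div = sum(grad_mse(w, mx, my) / k for mx, my in micro)  # losses divided by k

print(full_grad, acc_raw, acc_div)
```

Here `acc_div` matches `full_grad`, while `acc_raw` is `k` times larger, which is exactly why the division (wherever the framework performs it) matters for choosing the learning rate.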
Answered by JuanFMontesinos, Mar 21, 2023
Now I know why my model doesn't converge :)
Answer selected by offchan42
The answer references the relevant Lightning source, which shows the loss being scaled for gradient accumulation:
https://github.com/Lightning-AI/lightning/blob/bb861cba7e2a4597c56def506f0a64c9a30b9e8a/src/lightning/pytorch/core/module.py#L1038-L1054
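The learning-rate implication can be sketched in plain Python (no Lightning APIs; the values are made up for illustration). Assuming the framework divides the loss by `k`, as the accepted answer indicates Lightning does, the accumulated update matches a full-batch step at the same learning rate; if the framework summed unscaled gradients instead, the effective learning rate would be multiplied by `k`:

```python
lr, k = 0.1, 7
g = -22.5  # full-batch mean gradient (illustrative value)

w0 = 1.0
w_full = w0 - lr * g                              # plain full-batch SGD step
w_raw = w0 - lr * (k * g)                         # accumulated, loss NOT divided
w_div = w0 - lr * sum(g / k for _ in range(k))    # accumulated, loss divided by k

print(w_full, w_raw, w_div)
```

`w_div` equals `w_full`, so no learning-rate adjustment is needed when the framework divides the loss; `w_raw` corresponds to an effective learning rate of `lr * k`.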