
Huge metrics jump between epochs, and step and epoch logs do not match, when accumulate_grad_batches > 1 #19779

stg1205 commented Apr 15, 2024

Bug description

At first I noticed a huge jump between epochs in both the loss and the accuracy computed with torchmetrics. I debugged for a couple of days: I added drop_last=True to the dataloader, added some dropout, and changed the model, but nothing changed.

To clarify: experiment 4302c358770fe8041adbdc5137f079b8 uses accumulate_grad_batches=4, batch_size=2, and DDP on 8 GPUs; experiment a67c8a809390fe0b06b0d6737009f6e2 uses accumulate_grad_batches=1, batch_size=4, and DDP on 4 GPUs. All other settings are identical, so the overall batch sizes are equal.
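For reference, a minimal sketch of the two trainer setups as described above (hypothetical; the model, datamodule, and all other arguments are omitted, and the exact configuration may differ):

```python
from lightning.pytorch import Trainer

# Exp 4302c358770fe8041adbdc5137f079b8 (metrics jump, step/epoch mismatch)
trainer_accum = Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp",
    accumulate_grad_batches=4,  # DataLoader batch_size=2
    log_every_n_steps=1,
)

# Exp a67c8a809390fe0b06b0d6737009f6e2 (behaves as expected)
trainer_no_accum = Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    accumulate_grad_batches=1,  # DataLoader batch_size=4
    log_every_n_steps=1,
)
```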

Some observations that may be related to this problem:

  1. There is no cyclical LR schedule, and I shuffle the training dataset before each epoch.
  2. The loss fluctuates a lot due to some random masking during training.
  3. The per-epoch training metric curves and the validation curves look normal and keep decreasing. However, as shown in the screenshots below for experiment 4302c358770fe8041adbdc5137f079b8, the step metrics and the epoch metrics do not match (log_every_n_steps=1). I also tried averaging the step metrics manually, and they still did not match the epoch values.
  4. After debugging for a couple of days, I set accumulate_grad_batches to 1 (experiment a67c8a809390fe0b06b0d6737009f6e2) and the problem went away.

After these experiments I tested the models: both the jumping and the normal runs work just fine at inference time. So there may be an issue in the logging process, but it is hard to trace through the code. A reproduction script may take me a while to put together, so I am just leaving the description here in case you have any thoughts on it. If you do need the code, I will see what I can do.
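For context, the logging pattern involved looks roughly like the sketch below. This is not the actual code from the report, just an illustration of a typical torchmetrics + self.log setup under the assumptions above; the module, network, and metric names are placeholders.

```python
import torch
from lightning.pytorch import LightningModule
from torchmetrics.classification import MulticlassAccuracy


class LitModel(LightningModule):
    """Hypothetical module mirroring the logging setup described above."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.net = torch.nn.Linear(32, num_classes)
        self.train_acc = MulticlassAccuracy(num_classes=num_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.net(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        self.train_acc(logits, y)
        # Both step- and epoch-level logging are enabled; with
        # accumulate_grad_batches=4 the step curve and the epoch aggregate
        # are expected to agree, but in the report they do not.
        self.log("train_loss", loss, on_step=True, on_epoch=True, sync_dist=True)
        self.log("train_acc", self.train_acc, on_step=True, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```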

[Screenshots: training loss and accuracy curves showing the jump between epochs and the step vs. epoch metric mismatch for the two experiments]

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response
