
Huge metrics jump between epochs, and step and epoch logs do not match, when accumulate_grad_batches > 1 #19779

stg1205 commented Apr 15, 2024

Bug description

At first I noticed a huge jump between epochs in both the loss and the accuracy computed with torchmetrics. I debugged for a couple of days: I added drop_last=True to the dataloader, added some dropout, and changed the model, but nothing changed.

To clarify: experiment 4302c358770fe8041adbdc5137f079b8 uses accumulate_grad_batches=4, batch_size=2, and DDP on 8 GPUs; experiment a67c8a809390fe0b06b0d6737009f6e2 uses accumulate_grad_batches=1, batch_size=4, and DDP on 4 GPUs. All other settings are identical, so the overall batch sizes are equal.
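For reference, a minimal sketch of the two trainer setups as described above (hypothetical; the model, datamodule, and all other arguments are omitted, and the exact configuration may differ):

```python
from lightning.pytorch import Trainer

# Exp 4302c358770fe8041adbdc5137f079b8 (metrics jump, step/epoch mismatch)
trainer_accum = Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp",
    accumulate_grad_batches=4,  # DataLoader batch_size=2
    log_every_n_steps=1,
)

# Exp a67c8a809390fe0b06b0d6737009f6e2 (behaves as expected)
trainer_no_accum = Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    accumulate_grad_batches=1,  # DataLoader batch_size=4
    log_every_n_steps=1,
)
```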

Some observations that may be related to this problem:

  1. There is no cyclical LR schedule, and I shuffle the training dataset before each epoch.
  2. The loss fluctuates a lot due to some random masking during training.
  3. The per-epoch training metric curves and the validation curves look normal and keep decreasing. However, as shown in the screenshots below for experiment 4302c358770fe8041adbdc5137f079b8, the step metrics and the epoch metrics do not match (log_every_n_steps=1). I also tried averaging the step metrics manually, and they still did not match the epoch values.
  4. After debugging for a couple of days, I set accumulate_grad_batches to 1 (experiment a67c8a809390fe0b06b0d6737009f6e2) and the problem went away.

After these experiments I tested the models: both the jumping and the normal runs work just fine at inference time. So there may be an issue in the logging process, but it is hard to trace through the code. A reproduction script may take me a while to put together, so I am just leaving the description here in case you have any thoughts on it. If you do need the code, I will see what I can do.
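For context, the logging pattern involved looks roughly like the sketch below. This is not the actual code from the report, just an illustration of a typical torchmetrics + self.log setup under the assumptions above; the module, network, and metric names are placeholders.

```python
import torch
from lightning.pytorch import LightningModule
from torchmetrics.classification import MulticlassAccuracy


class LitModel(LightningModule):
    """Hypothetical module mirroring the logging setup described above."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.net = torch.nn.Linear(32, num_classes)
        self.train_acc = MulticlassAccuracy(num_classes=num_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.net(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        self.train_acc(logits, y)
        # Both step- and epoch-level logging are enabled; with
        # accumulate_grad_batches=4 the step curve and the epoch aggregate
        # are expected to agree, but in the report they do not.
        self.log("train_loss", loss, on_step=True, on_epoch=True, sync_dist=True)
        self.log("train_acc", self.train_acc, on_step=True, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```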

[Screenshots: training loss and accuracy curves showing the jump between epochs and the step vs. epoch metric mismatch for the two experiments]

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response
