Bug description
At first I noticed a huge jump between epochs in both the loss and the accuracy computed by torchmetrics. I spent a couple of days debugging: setting `drop_last=True` in the dataloader, adding some dropout, and changing the model, but nothing helped.
To clarify: exp 4302c358770fe8041adbdc5137f079b8 runs with accumulate_grad_batches=4, batch_size=2, and DDP on 8 GPUs, while exp a67c8a809390fe0b06b0d6737009f6e2 runs with accumulate_grad_batches=1, batch_size=4, and DDP on 4 GPUs. All other configuration is identical, so the overall effective batch sizes are equal.
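For concreteness, the two runs boil down to roughly the following sketch (the dataset is a stand-in; only the Trainer flags and batch sizes are taken from my actual configs):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# Stand-in dataset; the real data pipeline is more involved.
train_dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 10, (1024,)))

# Exp 4302c358770fe8041adbdc5137f079b8: step and epoch metrics disagree.
trainer_jump = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp",
    accumulate_grad_batches=4,
    log_every_n_steps=1,
)
loader_jump = DataLoader(train_dataset, batch_size=2, shuffle=True, drop_last=True)

# Exp a67c8a809390fe0b06b0d6737009f6e2: behaves normally.
trainer_ok = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    accumulate_grad_batches=1,
    log_every_n_steps=1,
)
loader_ok = DataLoader(train_dataset, batch_size=4, shuffle=True, drop_last=True)
```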
Some observations that may be related to this problem:
- There is no cyclic LR schedule, and I shuffle the training dataset before each epoch.
- The loss fluctuates a lot because of some random masking used during training.
- The per-epoch training metric curves and the validation curves look normal and keep decreasing. However, as you can see in the picture for exp 4302c358770fe8041adbdc5137f079b8, the step metrics and the epoch metrics do not match (with log_every_n_steps=1). I also tried averaging the step metrics manually, and they still did not match the epoch values; the logging itself looks roughly like the sketch below.
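A minimal sketch of how the metrics are logged (the model and metric here are simplified placeholders, not my actual module):

```python
import torch
import pytorch_lightning as pl
import torchmetrics

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 10)
        self.train_acc = torchmetrics.classification.MulticlassAccuracy(num_classes=10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.net(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        self.train_acc(logits, y)
        # Log both step- and epoch-level values; with log_every_n_steps=1,
        # every step should appear in the curves, and the epoch value should
        # be the aggregate of the step values.
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        self.log("train_acc", self.train_acc, on_step=True, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
```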
After debugging for a couple of days, I set accumulate_grad_batches to 1 (exp a67c8a809390fe0b06b0d6737009f6e2), and the problem was solved.
After these experiments I tested the models themselves: both the jumping and the normal experiments produced models that work just fine, so there may be an issue in the logging process, but it is hard for me to trace the code. A minimal reproduction would take me a while to prepare, so I'm just leaving this description here in case you have any thoughts. If you do need the code, I'll see what I can do.
What version are you seeing the problem on?
v2.2
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response