OOM after automatic batch size finder #19811
Unanswered
JonathanDZiegler
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi and thanks in advance for reading!
I am running into a situation where a training run occasionally hits an out-of-memory (OOM) error on the first training step after the automatic batch size finder has completed. The callback takes a fixed effective batch size and scales the per-GPU batch size and the number of gradient accumulation steps to match. I even cut the batch size it finds by a factor of 2 after the callback has run, to be on the safe side (or so I assumed). Has anyone seen this kind of behavior? The batch size finder should handle all the garbage collection, and I am using the callback in a fairly vanilla way, running on an A100. Thanks for any pointers!
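To make the scaling step concrete, here is a minimal sketch of the arithmetic described above, not my actual callback. The function name `split_effective_batch` and its parameters are hypothetical and not part of Lightning. It assumes the effective batch size is divisible by the number of devices, and it shrinks the per-device batch size until it divides the effective batch size evenly.

```python
def split_effective_batch(effective_batch_size, found_batch_size,
                          safety_factor=2, num_devices=1):
    """Return (per_device_batch_size, accumulate_grad_batches).

    Hypothetical helper: `found_batch_size` is whatever the batch size
    finder reported, `safety_factor` is the extra back-off applied afterwards.
    """
    if min(effective_batch_size, found_batch_size, safety_factor, num_devices) < 1:
        raise ValueError("all arguments must be positive integers")
    if effective_batch_size % num_devices != 0:
        raise ValueError("effective batch size must be divisible by num_devices")

    # Back off from the size the finder reported (never below 1).
    per_device = max(1, found_batch_size // safety_factor)
    # Never use more samples per optimizer step than the effective batch size.
    per_device = min(per_device, effective_batch_size // num_devices)
    # Shrink until one accumulation cycle adds up to exactly the effective size.
    while effective_batch_size % (per_device * num_devices) != 0:
        per_device -= 1

    accumulate = effective_batch_size // (per_device * num_devices)
    return per_device, accumulate


if __name__ == "__main__":
    # Finder reports 96 on one GPU, effective batch size is fixed at 256.
    print(split_effective_batch(256, 96))
```

In this sketch the product `per_device * num_devices * accumulate` always equals the effective batch size, so backing off only changes how many accumulation steps are needed, not the optimization behavior.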