PT potential CUDA mem leak? #1478
Note, in the PYTORCH_CUDA_ALLOC_CONF doc, maybe the […]

I introduced the option […]. Before (via […]): […] And then, after: […] So the option […]
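Which allocator option was meant did not survive extraction; as a sketch of the general mechanism, the following uses `expandable_segments:True`, a documented `PYTORCH_CUDA_ALLOC_CONF` setting that targets fragmentation. Whether that is the option referred to above is an assumption.

```python
import os

# Must be set before the first CUDA allocation, so ideally before importing torch.
# "expandable_segments:True" lets the caching allocator grow existing segments
# instead of allocating new ones that fragment; "max_split_size_mb:..." is an
# alternative knob. Which one (if either) this issue used is an assumption.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402

x = torch.zeros(1024, 1024, device="cuda")  # first alloc picks up the config
```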
From the log (/work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.XPpeLPG9camH/log.run.1), filtered the CUDA mem usage reports: […]

And then OOM on the A10 (24GB, effectively 22GB): […]
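The filtered log lines themselves did not survive extraction. A minimal sketch of how such "alloc cur" / reserved reports can be produced with the standard `torch.cuda` memory APIs (the helper name and output format are hypothetical, not RETURNN's actual reporting code):

```python
import torch

def report_cuda_mem(step: int) -> None:
    """Hypothetical helper mirroring the 'alloc cur' style reports in the log."""
    dev = torch.cuda.current_device()
    alloc_cur = torch.cuda.memory_allocated(dev)       # bytes in live tensors
    alloc_peak = torch.cuda.max_memory_allocated(dev)  # peak since last reset
    reserved = torch.cuda.memory_reserved(dev)         # bytes held by the caching allocator
    print(
        f"step {step}: alloc cur {alloc_cur / 2**20:.1f}MB, "
        f"alloc peak {alloc_peak / 2**20:.1f}MB, "
        f"reserved {reserved / 2**20:.1f}MB"
    )
    torch.cuda.reset_peak_memory_stats(dev)  # make 'peak' per reporting interval
```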
From the very first output, you see that the model params take 427.8MB ("alloc cur"). Then it stays somewhat constant at around 2GB, maybe due to some caches for convolution etc. (maybe see #1450).
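For reference, a params-only figure like 427.8MB is just the sum of parameter sizes; a minimal sketch (the model below is a stand-in, not the one from this setup):

```python
import torch

def param_mem_mb(model: torch.nn.Module) -> float:
    """Total parameter memory in MB, matching what 'alloc cur' shows right
    after the model is constructed on the GPU."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 2**20

model = torch.nn.Linear(1024, 1024)  # stand-in; ~4.0MB in float32
print(f"params: {param_mem_mb(model):.1f}MB")
```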
The "alloc cur" increases then, and fluctuates, but also sometimes still has the old low value of around 2GB. This indicates that maybe the Python GC has not yet freed everything, and maybe a
gc.collect()
would free this (but not sure, not tested).But it's strange that overall the magnitude of the fluctuations increase more and more. That also causes the reserved memory to increase. Maybe it's just then bad memory fragmentation?
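Untested, as said above, but the suggested check would look roughly like this: a forced GC pass to free tensors held alive only by reference cycles, plus `torch.cuda.empty_cache()` to return unused cached blocks, which makes the allocated-vs-reserved gap observable:

```python
import gc
import torch

def try_release_cuda_mem() -> None:
    # Collect Python garbage first: tensors kept alive only by reference
    # cycles are freed here, which is the "GC has not yet freed everything"
    # hypothesis from above.
    gc.collect()
    # Then return unused cached blocks to the CUDA driver. This does not help
    # against fragmentation *within* still-used segments, but it shrinks the
    # reserved pool if the cached blocks really were free.
    torch.cuda.empty_cache()
```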