Auto3DSeg CUDA OOM during ensembling #1505
Comments
Hi @dongyang0122, could you please share some comments here?
Hi @KumoLiu, just following up on this. Are there any other similar issues I could reference to troubleshoot? Thanks!
Hi @KumoLiu, #1089 worked for me to get training going; as I mentioned in that issue, the same fixes applied here as well (i.e., setting the spacing in SwinUNETR to 1.5, 1.5, 1.5), so thanks for that! I am, however, still running into the ensembling issue, which doesn't seem to be addressed in #975 specifically. The good thing about the crash happening so late is that the inferences on the test images are indeed saved, but what I am missing is a model.pth to run the model on some ground-truth images as I had hoped to do. Do you know if there is a way to extract this, similar to a model trained with the https://github.com/Project-MONAI/tutorials/blob/main/3d_segmentation/swin_unetr_btcv_segmentation_3d.ipynb routine? Once I have that model file, I shouldn't need to go through the rest of the Auto3DSeg pipeline. All of this with the caveat that Auto3DSeg doing this automatically, without GPU issues, would be great!
Hi @udiram, I looked at the source code and found that the model is saved under "bundle_root/models". Thanks!
Thanks @KumoLiu! I'll give it a go!
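For readers in a similar position, a minimal sketch of reloading a saved fold checkpoint for standalone inference might look like the following. This is an assumption-laden illustration, not the Auto3DSeg code: the tiny Conv3d network stands in for the real SwinUNETR built from the bundle's config, and the checkpoint layout (a dict with a "state_dict" key) is hypothetical — inspect the actual file under "bundle_root/models" to see what it contains.

```python
import torch
import torch.nn as nn

# Stand-in network; in practice this would be the SwinUNETR (or other algo)
# instance built with the same hyperparameters the bundle was trained with.
net = nn.Conv3d(1, 2, kernel_size=3, padding=1)

# Simulate a saved fold checkpoint; the {"state_dict": ...} layout is an
# assumption here, not a documented Auto3DSeg format.
torch.save({"state_dict": net.state_dict()}, "model.pt")

# Rebuild an identical architecture and load the weights for inference.
restored = nn.Conv3d(1, 2, kernel_size=3, padding=1)
restored.load_state_dict(torch.load("model.pt", map_location="cpu")["state_dict"])
restored.eval()

with torch.no_grad():
    logits = restored(torch.randn(1, 1, 8, 8, 8))
print(tuple(logits.shape))  # (1, 2, 8, 8, 8)
```

With the weights loaded this way, inference can run outside the Auto3DSeg pipeline entirely, as the comment above suggests.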
Hi @KumoLiu, is there any way for me to see which model performed best during training, so I can run inference using that model? I notice that every fold of every model has an associated .pt file, but I'm not seeing a global best model/fold. Thanks!
Hi @udiram, I think "model.pt" is the best model for each fold. A final model is also saved. You may need to ensemble the folds to get the final result.
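The fold-ensembling mentioned here can be sketched with plain PyTorch: average the softmax probabilities from each fold's model and take the argmax. The random tensors below are hypothetical stand-ins for the per-fold logits you would get by running each "model.pt" on the same image; this is an illustration of mean-ensembling, not the pipeline's actual ensembler.

```python
import torch

torch.manual_seed(0)

# Hypothetical per-fold class logits for one image: (batch, classes, D, H, W).
# In practice these would come from the five fold checkpoints.
fold_logits = [torch.randn(1, 2, 8, 8, 8) for _ in range(5)]

# Mean ensemble: average the softmax probabilities across folds, then argmax.
probs = torch.stack([torch.softmax(l, dim=1) for l in fold_logits]).mean(dim=0)
segmentation = probs.argmax(dim=1)

print(tuple(segmentation.shape))  # (1, 8, 8, 8)
```

Averaging probabilities (rather than raw logits) keeps each voxel's class scores summing to one, which makes the result interpretable as an ensemble probability map.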
Hi @KumoLiu, thanks for the info. I guess I'm a bit stuck until this ensembling issue is figured out; is there anything else, debugging- or log-wise, that you or @dongyang0122 need in order to figure it out? Thanks!
Hi @KumoLiu, thanks for the resources. Does this integrate into the Auto3DSeg pipeline in any way? Is there any way to point the ensembler at the files generated by Auto3DSeg? Thanks!
Hi @udiram, yes, it has already been integrated into the You can also override it by:
Just FYI: Thanks!
Sure, I'll give the override a try. Do you have any ideas on how to run ensembling with less GPU usage, similar to the fix during validation for #1089? Thanks!
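One general pattern for cutting GPU memory use during ensembling, sketched below under stated assumptions: run one fold's model at a time and move its full-volume prediction to host memory immediately, so only a single fold's output ever resides on the GPU. The random tensor stands in for a real model's output; this is a workaround sketch, not the Auto3DSeg ensembler. (MONAI's SlidingWindowInferer offers related knobs, `device` and `sw_device`, for keeping the aggregation buffer on CPU.)

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

ensemble_cpu = []
for fold in range(5):
    with torch.no_grad():
        # Stand-in for model_fold(image); shape (batch, classes, D, H, W).
        logits = torch.randn(1, 2, 8, 8, 8, device=device)
    # Off-load each fold's probabilities to host memory right away so GPU
    # memory holds at most one fold's prediction at a time.
    ensemble_cpu.append(torch.softmax(logits, dim=1).cpu())
    del logits
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# All ensembling arithmetic happens on CPU.
mean_probs = torch.stack(ensemble_cpu).mean(dim=0)
print(tuple(mean_probs.shape))  # (1, 2, 8, 8, 8)
```

The trade-off is extra host–device transfer time per fold, which is usually acceptable when the alternative is an OOM crash at the end of a long run.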
Hi @KumoLiu, just following up on this issue!
Describe the bug
All models have finished training, and during the ensembling process, CUDA runs out of memory.
Reproduce
Steps to reproduce the behavior:
Run AutoRunner on the AMOS22 dataset.
Manually resetting the CUDA cache, restarting the kernel, and restarting the instance all lead back to this error.
Expected behavior
Training completes and ensembling proceeds without a CUDA out-of-memory error.
MONAI version: 1.2.0
Numpy version: 1.25.2
Pytorch version: 2.0.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: c33f1ba588ee00229a309000e888f9817b4f1934
MONAI file: /home/exouser/.local/lib/python3.10/site-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.21.0
Pillow version: 9.0.1
Tensorboard version: 2.14.0
gdown version: 4.7.1
TorchVision version: 0.15.2+cu117
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.0
pandas version: 2.0.3
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.6.0
pynrrd version: 1.0.0
Environment (please complete the following information):
OS: ubuntu 22.04
Python 3.10.12
Driver Version: 525.85.05 CUDA Version: 12.0
GRID A100X-40C - 125GB RAM
I'm happy to provide any other logs to help. This is the second time I've run into this issue; it persists after a full kernel restart and RAM clearing.