
Auto3Dseg cuda OOM during Ensembling #1505

Open
udiram opened this issue Sep 4, 2023 · 14 comments

udiram commented Sep 4, 2023

Describe the bug
Models have all finished training, and during the ensembling process, cuda runs out of memory.

Reproduce
Steps to reproduce the behavior:
Run AutoRunner on the AMOS22 dataset.

Manually clearing the CUDA cache, restarting the kernel, and restarting the instance all lead back to this error.
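For reference, a minimal sketch of the task description I pass to the runner (the paths and the "name" value are placeholders for my setup; the keys follow the Auto3DSeg input format):

```python
# Sketch of the Auto3DSeg task description for the failing run; paths are
# placeholders for my environment, not real dataset locations.
task = {
    "name": "AMOS22",
    "task": "segmentation",
    "modality": "CT",
    "datalist": "./amos22_datalist.json",  # placeholder path
    "dataroot": "./AMOS22",                # placeholder path
}

# With MONAI 1.2.0 installed, the failing run is then just:
#   from monai.apps.auto3dseg import AutoRunner
#   runner = AutoRunner(work_dir="./amos22_work", input=task)
#   runner.run()  # training completes; CUDA OOM hits in the ensembling stage
print(sorted(task))
```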

Expected behavior
Training completes and the ensembling step finishes without running out of GPU memory.
MONAI version: 1.2.0
Numpy version: 1.25.2
Pytorch version: 2.0.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: c33f1ba588ee00229a309000e888f9817b4f1934
MONAI file: /home/exouser/.local/lib/python3.10/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.21.0
Pillow version: 9.0.1
Tensorboard version: 2.14.0
gdown version: 4.7.1
TorchVision version: 0.15.2+cu117
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.0
pandas version: 2.0.3
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.6.0
pynrrd version: 1.0.0

Environment (please complete the following information):
OS: ubuntu 22.04
Python 3.10.12
Driver Version: 525.85.05 CUDA Version: 12.0
GRID A100X-40C - 125GB RAM

[screenshots attached showing the CUDA out-of-memory error]

I'm happy to provide any other logs to help. This is the second time I've run into this issue, and it persists after a full kernel restart and RAM clearing.

KumoLiu commented Sep 5, 2023

Hi @dongyang0122, could you please share some comments here?
Thanks in advance!

udiram commented Sep 7, 2023

Hi @KumoLiu, just following up on this. Are there any other similar issues I could reference to troubleshoot? Thanks!

KumoLiu commented Sep 8, 2023

Hi @udiram, here are some similar issues you could refer to:
#1089
#975
Thanks!

udiram commented Sep 8, 2023

Hi @KumoLiu, #1089 worked for me to get training going; as I mentioned in that issue, the same fixes apply (i.e. setting the spacing in SwinUNETR to 1.5, 1.5, 1.5), so thanks for this!

I am, however, still running into the ensembling issue, which doesn't seem to be addressed in #975 specifically. The good thing about the crash happening so late is that the inferences on the test images are indeed saved, but what I am missing is a model.pth so I can run the model on some ground-truth images, as I had hoped to do. Do you know if there is a way to extract this, similar to a model trained with the https://github.com/Project-MONAI/tutorials/blob/main/3d_segmentation/swin_unetr_btcv_segmentation_3d.ipynb routine? Once I have that model file, I shouldn't need to go through the rest of the auto3dseg pipeline.

All of this with the caveat that Auto3dseg doing this automatically without GPU issues would be great!
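In case it helps anyone else in the same spot, here is a hypothetical helper I've been using to enumerate the per-fold checkpoints left in the working directory; the recursive `*.pt`/`*.pth` search and the `find_checkpoints` name are my own assumptions about the layout, not a documented contract:

```python
# Hypothetical helper: list the per-fold checkpoints auto3dseg leaves in its
# working directory. The recursive *.pt/*.pth search is an assumption about
# my work_dir layout, not a documented contract.
from pathlib import Path

def find_checkpoints(work_dir):
    """Return every .pt/.pth checkpoint found under work_dir, sorted by path."""
    root = Path(work_dir)
    return sorted(list(root.rglob("*.pt")) + list(root.rglob("*.pth")))

# Any one of these could then be loaded for standalone inference, roughly:
#   import torch
#   state = torch.load(ckpt, map_location="cpu")  # may be a raw state_dict
#   model.load_state_dict(state)  # model rebuilt from that algo's config
print(find_checkpoints("./amos22_work"))
```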

udiram commented Sep 12, 2023

Thanks @KumoLiu! I'll give it a go!

udiram commented Sep 12, 2023

Hi @KumoLiu, is there anywhere for me to see which model performed best during training, so I can run inference using that model? I notice that every fold of every model has an associated .pt file, but I'm not seeing a global best model/fold.

thanks
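What I'd like to do boils down to something like the sketch below: gather the per-fold validation scores and pick the maximum. The algo names and Dice values here are made up for illustration; I believe `monai.apps.auto3dseg.import_bundle_algo_history` can recover the trained algos from the work dir, but I haven't confirmed it exposes a single global best:

```python
# Hypothetical sketch: pick the best (algo, fold) pair from per-fold
# validation scores. The names and Dice values below are invented, since
# there is no global "best model" file I can find to read them from.
scores = {
    ("segresnet", 0): 0.861,
    ("segresnet", 1): 0.874,
    ("swinunetr", 0): 0.842,
    ("dints", 2): 0.869,
}

# max over dict keys, ranked by their scores
best = max(scores, key=scores.get)
print(best, scores[best])
```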

udiram commented Sep 13, 2023

Hi @KumoLiu, thanks for the info. I guess I'm a bit stuck until this ensembling issue is figured out; is there anything else, debugging- or log-wise, that you or @dongyang0122 need in order to figure it out?

thanks!

udiram commented Sep 18, 2023

Hi @KumoLiu

Thanks for the resources. Does this integrate into the Auto3dseg pipeline in any way? Is there any way to point the ensembler at the files generated by auto3dseg?

Thanks

KumoLiu commented Sep 18, 2023

Hi @udiram, yes, it has already been integrated into the AutoRunner.
https://github.com/Project-MONAI/MONAI/blob/281cb0119c01eaa8e6c841880b91f92f45e8d7f7/monai/apps/auto3dseg/auto_runner.py#L815

You can also override it by:

runner = AutoRunner(input=input)
runner.set_ensemble_method(ensemble_method_name="AlgoEnsembleBestByFold")

Just FYI:
https://github.com/Project-MONAI/tutorials/tree/main/auto3dseg/notebooks

Thanks!

udiram commented Sep 18, 2023

Sure, I'll give the override a try. Do you have any ideas on how to run ensembling with less GPU usage, similar to the fix during validation for #1089?

thanks!
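For what it's worth, one thing I'm planning to try is shrinking the prediction workload that ensembling processes at once. The `files_slices` key appears in the Auto3DSeg tutorial notebooks as a prediction parameter; whether it actually lowers peak GPU memory here is my assumption:

```python
# Hedged sketch: prediction parameters I'd try to shrink the ensembling
# workload. "files_slices" limits how many test images are predicted per
# call (as used in the Auto3DSeg tutorial notebooks); treating it as an
# OOM workaround is an assumption for my setup.
pred_params = {
    "files_slices": slice(0, 1),  # predict/ensemble one test image at a time
}

# Applied to an existing runner before run():
#   runner = AutoRunner(work_dir="./amos22_work", input="./task.yaml")
#   runner.set_ensemble_method(ensemble_method_name="AlgoEnsembleBestByFold")
#   runner.set_prediction_params(pred_params)
#   runner.run()
print(pred_params["files_slices"])
```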

udiram commented Sep 27, 2023

Hi @KumoLiu, just following up on this issue!
