Auto3DSeg CUDA OOM during ensembling #1505
Comments
Hi @dongyang0122, could you please share some comments here?
Hi @KumoLiu, just following up on this. Are there any other similar issues I could reference to troubleshoot? Thanks!
Hi @KumoLiu, #1089 worked for me to get training going; as I mentioned in that issue, the same fixes applied here as well (i.e., setting the spacing in SwinUNETR to 1.5, 1.5, 1.5), so thanks for that! I am, however, still running into the ensembling issue, which doesn't seem to be addressed in #975 specifically. The good thing about the crash happening so late is that the inferences on the test images are indeed saved, but what I am missing is a model.pth to run the model on some ground-truth images as I had hoped to do. Do you know if there is a way to extract this, similar to a model trained with the https://github.com/Project-MONAI/tutorials/blob/main/3d_segmentation/swin_unetr_btcv_segmentation_3d.ipynb routine? Once I have that model file, I shouldn't need to go through the rest of the Auto3DSeg pipeline. All of this with the caveat that Auto3DSeg doing this automatically, without GPU issues, would be great!
Hi @udiram, I looked at the source code and found that the model is saved under "bundle_root/models". Thanks!
Thanks @KumoLiu! I'll give it a go!
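For readers in a similar position, a minimal sketch of reloading a saved fold checkpoint for standalone inference might look like the following. This is an assumption-laden illustration, not the Auto3DSeg code: the tiny Conv3d network stands in for the real SwinUNETR built from the bundle's config, and the checkpoint layout (a dict with a "state_dict" key) is hypothetical — inspect the actual file under "bundle_root/models" to see what it contains.

```python
import torch
import torch.nn as nn

# Stand-in network; in practice this would be the SwinUNETR (or other algo)
# instance built with the same hyperparameters the bundle was trained with.
net = nn.Conv3d(1, 2, kernel_size=3, padding=1)

# Simulate a saved fold checkpoint; the {"state_dict": ...} layout is an
# assumption here, not a documented Auto3DSeg format.
torch.save({"state_dict": net.state_dict()}, "model.pt")

# Rebuild an identical architecture and load the weights for inference.
restored = nn.Conv3d(1, 2, kernel_size=3, padding=1)
restored.load_state_dict(torch.load("model.pt", map_location="cpu")["state_dict"])
restored.eval()

with torch.no_grad():
    logits = restored(torch.randn(1, 1, 8, 8, 8))
print(tuple(logits.shape))  # (1, 2, 8, 8, 8)
```

With the weights loaded this way, inference can run outside the Auto3DSeg pipeline entirely, as the comment above suggests.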
Hi @KumoLiu, is there any way for me to see which model performed best during training, so I can run inference using that model? I notice that every fold of every model has an associated .pt file, but I'm not seeing a global best model/fold. Thanks!
Hi @udiram, I think "model.pt" is the best model for each fold. A final model is also saved. You may need to ensemble the folds to get the final result.
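The fold-ensembling mentioned here can be sketched with plain PyTorch: average the softmax probabilities from each fold's model and take the argmax. The random tensors below are hypothetical stand-ins for the per-fold logits you would get by running each "model.pt" on the same image; this is an illustration of mean-ensembling, not the pipeline's actual ensembler.

```python
import torch

torch.manual_seed(0)

# Hypothetical per-fold class logits for one image: (batch, classes, D, H, W).
# In practice these would come from the five fold checkpoints.
fold_logits = [torch.randn(1, 2, 8, 8, 8) for _ in range(5)]

# Mean ensemble: average the softmax probabilities across folds, then argmax.
probs = torch.stack([torch.softmax(l, dim=1) for l in fold_logits]).mean(dim=0)
segmentation = probs.argmax(dim=1)

print(tuple(segmentation.shape))  # (1, 8, 8, 8)
```

Averaging probabilities (rather than raw logits) keeps each voxel's class scores summing to one, which makes the result interpretable as an ensemble probability map.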
Hi @KumoLiu, thanks for the info. I guess I'm a bit stuck until this ensembling issue is figured out; is there anything else, debugging- or log-wise, that you or @dongyang0122 need in order to figure it out? Thanks!
Hi @KumoLiu, thanks for the resources. Does this integrate into the Auto3DSeg pipeline in any way? Is there any way to point the ensembler at the files generated by Auto3DSeg? Thanks!
Hi @udiram, yes, it has already been integrated into the You can also override it by:
Just FYI: Thanks!
Sure, I'll give the override a try. Do you have any ideas on how to run ensembling with less GPU usage, similar to the fix during validation for #1089? Thanks!
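One general pattern for cutting GPU memory use during ensembling, sketched below under stated assumptions: run one fold's model at a time and move its full-volume prediction to host memory immediately, so only a single fold's output ever resides on the GPU. The random tensor stands in for a real model's output; this is a workaround sketch, not the Auto3DSeg ensembler. (MONAI's SlidingWindowInferer offers related knobs, `device` and `sw_device`, for keeping the aggregation buffer on CPU.)

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

ensemble_cpu = []
for fold in range(5):
    with torch.no_grad():
        # Stand-in for model_fold(image); shape (batch, classes, D, H, W).
        logits = torch.randn(1, 2, 8, 8, 8, device=device)
    # Off-load each fold's probabilities to host memory right away so GPU
    # memory holds at most one fold's prediction at a time.
    ensemble_cpu.append(torch.softmax(logits, dim=1).cpu())
    del logits
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# All ensembling arithmetic happens on CPU.
mean_probs = torch.stack(ensemble_cpu).mean(dim=0)
print(tuple(mean_probs.shape))  # (1, 2, 8, 8, 8)
```

The trade-off is extra host–device transfer time per fold, which is usually acceptable when the alternative is an OOM crash at the end of a long run.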
Hi @KumoLiu, just following up on this issue!
Describe the bug
All models have finished training, and during the ensembling process, CUDA runs out of memory.
Reproduce
Steps to reproduce the behavior:
Run AutoRunner on the AMOS22 dataset.
Manually resetting the CUDA cache, restarting the kernel, and restarting the instance all lead back to this error.
Expected behavior
Training completes and ensembling proceeds without a CUDA out-of-memory error.
MONAI version: 1.2.0
Numpy version: 1.25.2
Pytorch version: 2.0.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: c33f1ba588ee00229a309000e888f9817b4f1934
MONAI file: /home/exouser/.local/lib/python3.10/site-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.21.0
Pillow version: 9.0.1
Tensorboard version: 2.14.0
gdown version: 4.7.1
TorchVision version: 0.15.2+cu117
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.0
pandas version: 2.0.3
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.6.0
pynrrd version: 1.0.0
Environment (please complete the following information):
OS: ubuntu 22.04
Python 3.10.12
Driver Version: 525.85.05 CUDA Version: 12.0
GRID A100X-40C - 125GB RAM
I'm happy to provide any other logs to help. This is the second time I've run into this issue; it persists after a full kernel restart and RAM clearing.