
Lightning sends SIGTERM when using other SLURM manager #14893

Closed
YannDubs opened this issue Sep 26, 2022 · 12 comments
Labels: bug (Something isn't working), environment: slurm

Comments

@YannDubs commented Sep 26, 2022

Bug description

PyTorch Lightning does not work when using another tool for SLURM scheduling. In particular, all my jobs receive many SIGTERM signals when using submitit.

This and similar issues seem to have been raised many times but never resolved (maybe due to a lack of reproducible code); see #5969, #5225, and maybe #10154.

How to reproduce the bug

I made a minimal reproducible repo for the bug here; please see the README there. Needless to say, you need SLURM, and hopefully the error does not depend on the SLURM config.

The code only consists of scheduling some model on SLURM and checking the logs. The main code (main.py) runs a logistic regression.

The rest is simply the SLURM config (config/sigterm.py), where you should change the partition to match your SLURM setup.

Running python main.py -m schedules the job on SLURM and prints the logging directory (e.g. multirun/2022-09-25/20-28-21/). If you open the log file (e.g. less multirun/2022-09-25/20-28-21/0/main.log) you should see all the SIGTERM signals reported as Bypassing signal SIGTERM.

Error messages and logs

[Screenshot 2022-09-25 at 20:53:42: log output showing repeated "Bypassing signal SIGTERM" messages]

Important info

Please see the requirements.txt. The Lightning version is 1.7.7, but I have had those SIGTERMs since at least version 1.5.

More info

More generally, there should be an easy way to completely deactivate the SLURM integration in PyTorch Lightning. It has already caused many issues (e.g. #6389, #3651) and will probably continue doing so. The thread in #6389 shows that there is a lot of interest in being able to deactivate it (as suggested by @Queuecumber and @carmocca), and it seems very cheap to do.

In my case, I often need two PyTorch Lightning models in a single script (self-supervised learning + linear probing), so I want to manage SLURM across multiple Lightning trainers myself and don't want Lightning to do it for me (there are other reasons too; this is the most prominent one).

Tagging people who seem to have thoughts and knowledge about all of this: @awaelchli

YannDubs added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Sep 26, 2022
@awaelchli (Member)

Hi @YannDubs

To be clear, Lightning does not trigger the SIGTERM, right? It is the SLURM cluster. The "Bypassing signal" messages you see come from Lightning's signal handling.

In 1.6 we introduced a flag auto_requeue=True|False (#10601) that you can set to False if you prefer that Lightning not handle any signals to requeue the job. Try setting it to False and see if it works for you :)
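
For reference, a minimal sketch of passing that flag (assuming the SLURMEnvironment import path of recent 1.x releases):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Ask Lightning not to install its own requeue signal handlers on SLURM,
# leaving signal handling to submitit (or whatever wrapper schedules the job).
trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])
```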

Also, I think it would be awesome if we had a Submitit example in our SLURM docs :)

awaelchli added the environment: slurm label and removed the needs triage (Waiting to be triaged by maintainers) label on Sep 26, 2022
@Queuecumber (Contributor)

I'm pretty sure auto_requeue=False is what you want, but I haven't actually tried it.

I use PL and submitit quite heavily, and I haven't had any big issues since landing #14626.

I do see these messages in my logs, but they don't seem to do anything besides look ugly. I always assumed this wasn't being caused by PL, but maybe it's worth looking into?

@Queuecumber (Contributor)

The "Bypassing signal" messages you see come from Lightning's signal handling.

They're actually from submitit, but they're getting printed multiple times, as though the SIGTERM is sent more than once.

@awaelchli (Member)

True, they are from submitit. We have a very similar info message in PL, which is why I got misled.

@Queuecumber (Contributor)

Also, one more thing to keep in mind (it may be unrelated): when using submitit, unless you take particular steps, Lightning doesn't even set its signal handlers.

This is because Lightning is "polite" and won't set its signal handlers if some other library has already set them up. Submitit registers its handlers very early in the lifetime of the application, so by the time Lightning gets around to the SLURM setup, submitit's handlers are already present.

I had to work around this by doing signal.signal(signal.SIGUSR2, signal.SIG_DFL) right before I create my Trainer (sketched below).

See facebookincubator/submitit#1709 and https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/trainer/connectors/signal_connector.py#L63
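
A minimal sketch of that workaround, assuming you want Lightning (rather than submitit) to end up owning the handlers:

```python
import signal

from pytorch_lightning import Trainer

# Reset SIGUSR2 to its default disposition so that Lightning sees no
# pre-existing handler (submitit installs one very early) and registers its own.
signal.signal(signal.SIGUSR2, signal.SIG_DFL)

trainer = Trainer()  # add your usual Trainer arguments here
```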

@YannDubs (Author)

Thanks for the quick answers, @Queuecumber and @awaelchli.

I forgot to say that I tried plugins=[SLURMEnvironment(auto_requeue=False)], but this did not make any difference. I even tried deleting the SLURM environment variables as suggested by this comment, but I still see the warnings.
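
"Deleting the SLURM environment variables" here means something like the sketch below; the exact variables Lightning inspects vary between versions, so removing everything prefixed with SLURM_ is the blunt version of the idea.

```python
import os

# Remove all SLURM-related environment variables before creating the Trainer,
# so that Lightning's SLURM detection does not kick in for this process.
for key in list(os.environ):
    if key.startswith("SLURM_"):
        del os.environ[key]
```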

I did some digging: the warning is raised by CUDAAccelerator.is_available(), which is called in several places when initializing the trainer. In particular, it seems to come from pool.apply(torch.cuda.device_count).

I'm not sure why this sends a SIGTERM, or why this line uses multiprocessing in the first place. Any thoughts?

@Queuecumber (Contributor)

I have no thoughts, other than that this is super weird and interesting.

@awaelchli (Member) commented Sep 26, 2022

I did some digging: the warning is raised by CUDAAccelerator.is_available(), which is called in several places when initializing the trainer. In particular, it seems to come from pool.apply(torch.cuda.device_count).

I'm not sure why this sends a SIGTERM, or why this line uses multiprocessing in the first place. Any thoughts?

This was a workaround for a torch issue with CUDA and forking. The code was recently removed on master in favor of a different solution that does not use multiprocessing. I also can't say why it would be emitting the SIGTERM.

Maybe it's worth testing your code with the latest version on master. You can install from source via pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/master.zip -U. Hope this helps, and sorry for the trouble.
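
For context, here is a rough sketch of what such a multiprocessing-based device query can look like (an illustration, not Lightning's exact code). The relevant detail is that tearing down the pool terminates its worker with SIGTERM, which a process-wide handler like submitit's will then report; that would explain the repeated log lines.

```python
import multiprocessing

import torch


def num_cuda_devices() -> int:
    # Query the device count in a short-lived worker process so the parent
    # never initializes CUDA itself (the original motivation for the workaround).
    # Leaving the `with` block calls pool.terminate(), which delivers SIGTERM
    # to the still-alive worker -- a plausible source of the repeated messages.
    with multiprocessing.Pool(1) as pool:
        return pool.apply(torch.cuda.device_count)
```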

@Queuecumber (Contributor)

Just to clarify, is this actually crashing your script or is it just that your logs have extra stuff in them?

@YannDubs (Author) commented Sep 26, 2022

No, my scripts aren't crashing, because submitit bypasses those signals; I've actually been seeing these warnings for a year. But my logs are full of them, and I wanted to make sure it was not an issue with our internal SLURM configs. Now I'm confident it is not actually an important warning.

Thanks @awaelchli, there seems to be no error using the latest version on master. Let's see once it's released and I use it for larger projects.

Thanks to you both; I'm closing the issue for now, although I'm still very surprised about why that happened.

@awaelchli (Member)

Thanks @awaelchli, there seems to be no error using the latest version on master. Let's see once it's released and I use it for larger projects.

Thanks @YannDubs
This will be released in the next few days as part of the 1.8 release.

@Queuecumber (Contributor)

Actually, I think it's good that this is resolved, because it may have been causing a real problem.

Apparently you're not supposed to print inside signal handlers; doing so can cause random crashes.

Since submitit prints inside its signal handlers (and I think Lightning does this too), I've actually been getting intermittent crashes.

Of course, the more often that print statement executes, the more likely you are to see a crash. Because whatever is happening here raises many SIGTERMs, each of which triggers the signal handler and its print, that crash becomes much more likely. A flag-based alternative is sketched below.
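
A minimal sketch of a flag-based pattern that keeps I/O out of the handler (illustrative; not what submitit or Lightning currently do):

```python
import signal

# The handler only records that the signal arrived; ordinary code checks the
# flag later (e.g. between training batches) and does any printing/requeuing.
_sigterm_received = False


def _handle_sigterm(signum, frame):
    global _sigterm_received
    _sigterm_received = True  # no I/O inside the handler


signal.signal(signal.SIGTERM, _handle_sigterm)

# ... later, in normal control flow:
if _sigterm_received:
    print("SIGTERM received, preparing to requeue")  # safe: outside the handler
```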

Will try this again on 1.8 when it's released.

awaelchli self-assigned this on Sep 30, 2022