Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Toward a better signal handling #1705

Open
concatime opened this issue Apr 18, 2020 · 19 comments
Open

Toward a better signal handling #1705

concatime opened this issue Apr 18, 2020 · 19 comments
Labels
not-stale Tell stalebot to ignore this issue

Comments

@concatime
Copy link

concatime commented Apr 18, 2020

  • nodemon -v: 2.0.3
  • node -v: v13.13.0
  • Operating system/terminal environment: Void Linux / kitty + fish.
  • Using Docker? Nop.
  • Command you ran: npx nodemon

This is not a bug, but rather a discussion on how to handle signals.
Heavely related to #1667.

Actual behaviour

Currently, it’s quite a mess ( ͡° ͜ʖ ͡°).

Basics

to close, to end, to exit, to quit, to stop, to terminate, to shut down, a bit confusing isn’t it?

Let me quote the official documentation.

SIGTERM:

It is the normal way to politely ask a program to terminate.
The shell command kill generates SIGTERM by default.

SIGQUIT:

Certain kinds of cleanups are best omitted in handling SIGQUIT.

SIGHUP:

[…] used to report that the user’s terminal is disconnected

Expected behaviour

This is rather debatable, but here I go.

When nodmeon receives SIGINT or SIGHUP, it should pass the signal to the child process, and if the process does not quit after X seconds, nodemon should ignore the signal. This way, the child process is free to do whatever it wants (e.g. reload config or gracefully quit). If the child stops before the timeout, nodemon should also stops. This should solves #1661. Also, if SIGINT is received before timeout, SIGKILL is sent to the child. This way, the user can forcefully quit with two ^C.

When nodemon receives SIGTERM, it should pass the signal to the child process, and if the process does not quit after X seconds, send SIGKILL to the child, then quit nodemon. This is the behaviour of systemd and it simplifies integration with system’s init.

When nodmeon receives SIGQUIT, it should immediatly SIGKILL the child, then quit nodemon. This way, no cleaning is done by the child.

Also, the restart should be interpretad more as a realod. This way, we can specify a reloadSignal (set to SIGHUP by default) which is sent each time a change is detected by watch or user types restart. nodemon should not use (implicitely at least) SIGUSR2 since it’s a user-defined condition (aka. reserved for the child process for other tasks).

We should also add a timeout config to set the value used by nodemon.

WDYT?

@remy
Copy link
Owner

remy commented Apr 18, 2020

The issue raised on #1667 wasn't entirely clear on what the problem was, but this explanation has given me a lot more context, so for starters: thank you.

On pass through of signals, on that I'm agreed - though it raises some questions:

If I, as a user of nodemon, start nodemon on a process, then hit ctrl-c - it should send a SIGINT signal to the subprocess but it should also actually send the signal to nodemon - and thus nodemon should quit. Going by your description of the issue, it suggests that if the subprocess doesn't quit, neither should nodemon - which feels…odd (though maybe entirely correct, IDK).

On sending the SIGUSR2 to the subprocess, this is entirely configurable by the user. Since there's a limited number of signals, I had to pick one, and this one was … the best from the available choice. IIRC, sending a SIGHUP to the subprocess didn't always cause the subprocess to stop (or there was higher chance of the user-code intercepting it and handling), so I settled on SIGUSR2 - but I settled nearly a decade ago, so my memory is hazy on exactly why. Again, this is user configurable, so I don't see it as being a problem.

Finally, you mention a timeout config. In what respect? It should default to a timeout before restarting only to make the default usage as simple as possible (i.e. run the thing, it works, focus on dev), but the user can configure a delay before sending the restart signal (currently). But you might be thinking of another use for timeout?

@concatime
Copy link
Author

concatime commented Apr 18, 2020

From documentation:

The (obvious) default action for all of these signals is to cause the process to terminate.

  1. When the user hit ^c, it sends SIGINT to nodemon. Then nodemon forward the signal to the child. The process child then terminates if it does not explicitly listen for SIGINT. And so nodemon also quits. This behaviour is perfect. Also, if the process explicitly does not want to stop on SIGINT, then nodemon should also stay up.
    Right now, nodemon spams with still waiting for 1 sub-process to finish for 10 seconds then quits while the child process is still running in background. I find this problematic.

  2. I think it would be better if SIGHUP is used by default (over SIGUSR2). This way, we respect ssh, tmux, vscode, etc. The child process either terminates if it does not listen for SIGHUP or handle the signal (see Incorrect handling of SIGHUP #1667 (comment)). Like I said, we should let SIGUSR1/SIGUSR2 for the child process. Nodemon itself should not listen for SIGUSR1 or SIGUSR2. Sure, the user can always set reloadSignal to SIGUSR2.

  3. The timeout is the value that replaces the "X seconds" in my comment. 10 seconds should be enough (default value).

PS: Thank you for your quick response!

@concatime
Copy link
Author

concatime commented Apr 18, 2020

"reload"/watch ==> nodemon ==> reloadSignal (SIGHUP by default) ==> child
*if child exits:
                           start child
*endif
-----------------------------------------------
SIGINT|HUP ==> nodemon ==> SIGINT|HUP ==> child
*if SIGINT before $timeout:
                       ==> SIGKILL ==> child
*endif
*if child exits before $timeout:
                       exit nodemon
*endif
-----------------------------------------------
SIGTERM ==> nodemon ==> SIGTERM ==> child
*if ! child exits before $timeout:
                    ==> SIGKILL ==> child
*endif
                    exit nodemon
-----------------------------------------------
SIGQUIT ==> nodemon ==> SIGKILL ==> child
                    exit nodemon

@TheLudd
Copy link

TheLudd commented Apr 21, 2020

@concatime
Nice diagram. I read up on the signals and this approach seems generally correct to me, just one quesion:
Isn't the case for SIG[INT|HUP|TERM] generally the same? I.e. send the same signal to the child process and then:

  • Exit if the child exits
  • Send SIGKILL to child if it still runs after a timeout.
  • Send SIGKILL to child if the same signal is received by nodemon again (not sure if this is a special case for SIGINT)

@remy
Copy link
Owner

remy commented Apr 21, 2020

I'm trying to get my head fully around the problem this is trying to address - but it seems that it is either: about the signal forwarding from nodemon to the sub-process, or it's about all signals but gently brushing over how signals are handled in a restart request (this is the bulk of the complexity in nodemon, so expect suprises here).


On this example:

SIGINT|HUP ==> nodemon ==> SIGINT|HUP ==> child

Currently nodemon will listen for a SIGHUP to perform a triggered reload (as per unix daemons), but with the scheme above it would cause nodemon to exit.


Re - forwarding signals, this is the current logic with the exception of SIGQUIT (which nodemon doesn't listen for at present).


There's also logic that states (in nodemon) that for the full process tree, each sub-process (of the user code) is also sent the signal (which is why you're seeing the spammy waiting message - which I think should be moved to verbose messaging instead of the standard log). Only after those sub-sub-processes have quit, is the final signal sent to the user process - but this only happens on linux - on a mac the OS handles it differently - which is nice and confusing. Currently there's no timeout for this - it simply waits until all the user sub-processes have finished.


I'm sure there's more to consider (like Windows is entirely ignored in this logic, understandably because windows is mostly handled through a "simple" taskkill /T /F), but I think these initial items are worth fleshing out.

On last point, it seems that the SIGUSR2 being changed to SIGHUP for the internal restart signal (from file changes) - a) doesn't make sense, it seems more approate, if changed, to changed to SIGTERM (since that's what's happening from the sub-process POV), b) if it has no real tangle effect, I'm wary of changing something so core from one default to another after nearly a decade (not because I care about holding on to it, but because I know something will break and it will break in new fun and frankly horrible ways).

* Side note: I suspect, but am not 100% certain, that sending a SIGTERM to the sub-process can't be caught inside of nodemon. I could be wrong…

@concatime
Copy link
Author

concatime commented Apr 30, 2020

Let’s start with SIGHUP. This signal is quite problematic because it has two roles in the UNIX world.

  1. […] used to report that the user’s terminal is disconnected

  2. Reloading.
    SSHd:

    rereads its configuration file when it receives a hangup signal, SIGHUP, by executing itself with the name and options it was started with

    NGiИX:

    HUP changing configuration, […] starting new worker processes with a new configuration, graceful shutdown of old worker processes

    systemd:

    /bin/kill -HUP $MAINPID
    Note however that reloading a daemon by sending a signal (as with the example line above) is usually not a good choice, because this is an asynchronous operation and hence not suitable to order reloads of multiple services against each other. It is strongly recommended to set ExecReload= to a command that not only triggers a configuration reload of the daemon, but also synchronously waits for it to complete.

However, the second role applies to only daemonized processes because deamons are detached from their terminal. So the system will never send this signal to them.

In our case, nodemon is NOT a daemon but rather a foreground process, so it should not interpret SIGHUP as a reload, but rather an exit.

Now,

  1. Should nodemon listen on SIGUSR1 (or 2) to reload the children?

Also, after thinking about it, it’s useless for nodemon to listen to SIGQUIT, so forget my previous recommendation on this signal. That being said, my recommendation about SIGINT remains valid.

@remy, I understand. Changing the SIGUSR2 to SIGHUP now would be too drastic. So forget about what I’ve said. The current behaviour is ok (send signal from signal).

One last thing. The command “rs” should be renamed to “rl” (reload). Strictly speaking, a restart is sending SIGTERM, SIGKILL after timeout if child does not exit, then start the child (systemd).


Version 2:

(SIGUSR1?)/“rl”/watch ==> nodemon ==> `signal` (SIGUSR2 by default) ==> child
*if child exits:
	start child
*endif
-----------------------------------------------
SIGINT ==> nodemon ==> SIGINT ==> child
*if SIGINT before $timeout:
                   ==> SIGKILL ==> child
*endif
*if child exits before $timeout:
	exit nodemon
*endif
-----------------------------------------------
SIGHUP|TERM ==> nodemon ==> SIGTERM ==> child
*if ! child exits before $timeout:
                        ==> SIGKILL ==> child
*endif
exit nodemon

@stale
Copy link

stale bot commented May 15, 2020

This issue has been automatically marked as idle and stale because it hasn't had any recent activity. It will be automtically closed if no further activity occurs. If you think this is wrong, or the problem still persists, just pop a reply in the comments and @remy will (try!) to follow up.
Thank you for contributing <3

@stale stale bot added the stale no activity for 2 weeks label May 15, 2020
@remy
Copy link
Owner

remy commented May 15, 2020

So…

Firstly, there's no real reason to justify changing the rs to rl particularly as it's in common usage. The design of my software is driven by ease of use rather than strict technical perfection (particularly after the code has shipped).

Coming back to the SIGHUP - a specific issue was raised asking how to programmatically reload the subprocess. nodemon handling SIGHUP seemed like the right way to do this. Your proposal will change this and set SIGHUP to terminate the subprocess.

Which begs the question: how do you now programatically reload the subprocess whilst nodemon is running?

On SIGINT - this is missing functionality in nodemon and should be added 👍

@stale stale bot removed the stale no activity for 2 weeks label May 15, 2020
@remy
Copy link
Owner

remy commented May 15, 2020

I've also documented the current state of nodemon and how signals are handled when sent both to nodemon and to the subprocess: https://docs.google.com/spreadsheets/d/1gFxwZqv1cBPJpiqeVtYh0MIC30RJlx6g0VF-2B8pQDE/edit?usp=sharing

@remy
Copy link
Owner

remy commented May 15, 2020

This doesn't cover the common case where the subprocess of nodemon then spawns it's own processes (with concurrently for for instance), but I would image the signal handling is up to the subprocess that was spawned by nodemon so perhaps it's a moot point.

@remy
Copy link
Owner

remy commented May 15, 2020

This is possibly the better justification for a change to the system (as tmux is a common tool):

I found this issue because it also gives me trouble.
I use tmux when developing, and in some windows I run nodemon to run my web server in dev mode.
When executing :kill-session in tmux it sends SIGHUP to all processes. What then happens is that nodemon restarts and the window that nodemon is in is destroyed. But my server is still running.

So now I have an "invisible" process running and if I try to start up a new dev session it says that the port I use is already in use.

So what should happen in my specific case would be that
1 tmux sends SIGHUP to nodemon
2 nodemon sends SIGHUP to my application that can listen for this and stop itself
3 nodemon exists gracefully

@remy
Copy link
Owner

remy commented May 15, 2020

I've just close #1667 to continue conversation here, and want to thank @axxie for starting this initial discussion.

I hope it's clear, but my hesitation is based entirely on trying to keep a stable system (nodemon is downloaded some 500,000 times a day and used in over a million repositories - so I want to be careful about these changes - something I've learnt and been burnt from the past!).

After documenting the current state of signal handling and specifically seeing the tmux example (which I think I missed originally), I'm moving towards the following:

  1. Change nodemon reload to listen on SIGUSR2
  2. Provide an option to change the reload signal (as I do for the signal to send the kill signal to the subprocess)
  3. Forward SIGHUP to the subprocess (as it does with SIGINT)
  4. Handle SIGQUIT correctly (currently it leaves the subprocess running - a simple fix)

The code that tries to shutdown the sub-subprocesses needs review, currently this is the situation:

When the user hit ^c, it sends SIGINT to nodemon. Then nodemon forward the signal to the child. The process child then terminates if it does not explicitly listen for SIGINT. And so nodemon also quits. This behaviour is perfect. Also, if the process explicitly does not want to stop on SIGINT, then nodemon should also stay up.
Right now, nodemon spams with still waiting for 1 sub-process to finish for 10 seconds then quits while the child process is still running in background.

This spamming logic isn't for signal handling but because subprocesses would get stuck (particularly when there was a broad tree of spawned processes) and nodemon had no way to know whether the spawned process was ignoring the signal or if it was stuck somehow.

A simple search shows that nodemon leaving a process running after exit or after it thinks it has restarted the subprocess is very much an issue for devs - and it's hard to get a good handle on the problem because it ranges from Windows to linux to alpine (where there's no ps tooling) to oddball programs being spawned.

--

Finally, on reflection, I'm not sure this logic is right:

SIGINT ==> nodemon ==> SIGINT ==> child
*if SIGINT before $timeout:
                   ==> SIGKILL ==> child
*endif
*if child exits before $timeout:
	exit nodemon
*endif

It means that if the subprocess intentionally handles SIGINT then nodemon will decide to send another signal out of brute force.


Final thought and question: should this be a BREAKING CHANGE and cause a bump to v3? I'd rather not, but technically, for one very thin slice of user (I assume exists) sending SIGHUP will no longer work.

Potential solution: go back to the original issue that caused the change, and understand the context it was created in.

@concatime
Copy link
Author

SIGINT ==> nodemon ==> SIGINT ==> child
*if SIGINT before $timeout:
                   ==> SIGKILL ==> child
*endif
*if child exits before $timeout:
	exit nodemon
*endif

It means that if the subprocess intentionally handles SIGINT then nodemon will decide to send another signal out of brute force.

The SIGINT in the if is a second SIGINT. So I hit ^c, a timeout of say 10 seconds starts. And if I hit ^c a second time, before the timeout ends, then nodemon will SIGKILL the child. If the child process handles SIGINT without exiting and I do not send a second SIGINT before timeout, nodemon will not exit (condition *if child exits before $timeout is false).


  1. Forward SIGHUP to the subprocess (as it does with SIGINT)

That was my first proposition. Good choice as long as SIGHUP, by default at least, is not interpreted as a reload because nodemon, despite its name 😅, is not a daemon.
Just to be sure, what should happen if the child does not exit? Should nodemon stay up?

@remy
Copy link
Owner

remy commented May 16, 2020

The SIGINT in the if is a second SIGINT. So I hit ^c, a timeout of say 10 seconds starts. And if I hit ^c a second time, before the timeout ends, then nodemon will SIGKILL the child. If the child process handles SIGINT without exiting and I do not send a second SIGINT before timeout, nodemon will not exit (condition *if child exits before $timeout is false).

Is that normal behaviour though - in general? (I hadn't read it as a second SIGINT which makes sense reading now). Are there any examples of this type of system (double tap to end).


nodemon -> "no demon"? Or "node mon"? Or "nodemon" (like pokemon)? 😆

@concatime
Copy link
Author

concatime commented May 16, 2020

I don’t remember where I’ve seen the two consecutive SIGINT, but I know it’s quite useful to debug a program not handling correctly SIGINT or to test/bypass the graceful shutdown. Say you interpret SIGINT as graceful shutdown (just like SIGTERM), but during the shutdown, you have a bug or blocking code. You can then forcefully quit by sending another SIGINT (^c) and fix your code.
EDIT: how about “digimon”

@stale
Copy link

stale bot commented May 30, 2020

This issue has been automatically marked as idle and stale because it hasn't had any recent activity. It will be automtically closed if no further activity occurs. If you think this is wrong, or the problem still persists, just pop a reply in the comments and @remy will (try!) to follow up.
Thank you for contributing <3

@stale stale bot added the stale no activity for 2 weeks label May 30, 2020
@remy remy added not-stale Tell stalebot to ignore this issue and removed stale no activity for 2 weeks labels May 31, 2020
@memark
Copy link

memark commented Oct 28, 2020

I just read through this issue out of pure interest. You are both very knowledgeable and well-spoken. Keep up the good work! @remy @concatime

@bduffany
Copy link

bduffany commented Dec 14, 2020

Just a data point here: I remember using a program (though I can't remember what it is) that printed out something like "attempting to shut down gracefully, press Ctrl+C again to force quit" when I pressed Ctrl+C. And I have used software before that didn't seem to correctly shut down when I press Ctrl+C, and in those cases, I usually mash Ctrl+C until the software seems to listen (I can't be the only one who does this).

So IMO the suggestion of first sending SIGINT (on first Ctrl+C), then SIGKILL (on second or maybe even third Ctrl+C) makes complete sense.

@corytheboyd
Copy link

corytheboyd commented Feb 8, 2021

@bduffany Docker Compose does this, and IMO it's a great escape hatch for early development, before you have nailed down your process handling.

image

Any anyway, I got to this issue because I too was wondering why my signal handling logic wasn't working behind nodemon. I'm split between participating here to figure out a nodemon solution, or just biting the bullet and using something like PM2. PM2 feels like complete overkill because I only need a file watcher to restart my script on change, which nodemon is good at, but I also want to be able to exercise the production signal handling logic in my development environment. Not really sure what to do :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
not-stale Tell stalebot to ignore this issue
Projects
None yet
Development

No branches or pull requests

6 participants