Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libev engine has inconsistent handling of signals that raise an exception on the main thread #134

Open
metcalf opened this issue Mar 24, 2017 · 8 comments

Comments

@metcalf
Copy link

metcalf commented Mar 24, 2017

When sending a signal that raises an exception (e.g SIGTERM) while waiting on a NIO select on the main thread, behavior is either:

  • Raise the correct SignalException
  • Fail to raise the correct SignalException but exit with a status code that indicates that SIGTERM was processed at some level.
  • Fail to raise the correct SignalException and exit with a normal status code.

This only reproduces with libev. My test fails for Java because I'm using fork which isn't supported in JRuby.

You can see this reproduced in: #133

@metcalf
Copy link
Author

metcalf commented Mar 24, 2017

The specific issue is not that the signal causes the selector to wake up, but that the behavior is inconsistent. It seems like something internal to Ruby is getting into an unexpected state if we can exit with a status code indicating a SIGTERM without raising the correct SignalException.

@tarcieri
Copy link
Contributor

tarcieri commented Mar 24, 2017

I agree that consistent signal handling behavior is desirable. The correct way to handle this in libev would be to wrap its own signal handling behavior:

http://search.cpan.org/~mlehmann/EV-4.21/libev/ev.pod#ev_signal_-_signal_me_when_a_signal_gets_signalled!

Handling forks is especially tricky. From the libev docs:

The special problem of inheritance over fork/execve/pthread_create

Both the signal mask (sigprocmask) and the signal disposition (sigaction) are unspecified after starting a signal watcher (and after stopping it again), that is, libev might or might not block the signal, and might or might not set or restore the installed signal handler (but see EVFLAG_NOSIGMASK).

While this does not matter for the signal disposition (libev never sets signals to SIG_IGN, so handlers will be reset to SIG_DFL on execve), this matters for the signal mask: many programs do not expect certain signals to be blocked.

This means that before calling exec (from the child) you should reset the signal mask to whatever "default" you expect (all clear is a good choice usually).

The simplest way to ensure that the signal mask is reset in the child is to install a fork handler with pthread_atfork that resets it. That will catch fork calls done by libraries (such as the libc) as well.

@metcalf
Copy link
Author

metcalf commented Mar 25, 2017

I confirmed that I can reproduce this with Process.spawn instead of forking:
metcalf@ca870e3

@tarcieri
Copy link
Contributor

tarcieri commented Mar 25, 2017

The tl;dr: of the libev advice is to install a fork handler (i.e. pthread_atfork()) which resets the signal mask. However this claims it isn't as simple as that:

https://lwn.net/Articles/415684/

From the post, it looks like it should hopefully provide "best effort" treatment of the signal mask, at least on Linux, and should work with posix_spawn() (which I would assume is what Process.spawn invokes, at least on Linux):

  1. For the system() call, POSIX says: "It is unspecified whether the handlers registered with pthread_atfork() are called as part of the creation of the child process." In glibc, they aren't.
  2. Regarding posix_spawn, POSIX says: "It is implementation-defined whether the fork handlers are run when posix_spawn() or posix_spawnp() is called." In glibc, they are.
  3. The linux-specific clone() system-call does not have atfork handlers called.

@andrew-stripe
Copy link

I was also able to repro outside the tests where I don't even require 'nio' until after spawning the new process. That leads me to believe that this isn't specific to forking and is instead a timing issue with when the signal is received in any process that uses NIO.

I'm using this for a relatively low-concurrency application so it appears that I could use the pure Ruby backend as a short term mitigation. You mention here that the Ruby backend should only be used under "exceptional circumstances". Do you think it's a reasonable choice for event loops that handle <<100 monitors?

@andrew-stripe
Copy link

Ah sorry, commented from my work account andrew-stripe == metcalf

@tarcieri
Copy link
Contributor

You can certainly try to use it, although I'd definitely like to get the libev backend fixed.

@ioquatix
Copy link
Member

ioquatix commented May 4, 2018

Just wondering where we are at with this. Do we have a test case to reproduce the bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants