
Return proper exit code for TERM signal #1337

Merged — 1 commit merged into puma:master from the s/exit-handler branch on Aug 16, 2017

Conversation

@shayonj (Contributor) commented Jun 19, 2017

Attempt at returning the proper exit code (128+15) when a TERM signal
is sent to the server, for both single and clustered mode.

The changes are achieved by restoring the default signal handler from within
the trap and then killing the process with TERM.

As an added plus, single mode now stops gracefully as well.

Potential fix for #1183.
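
For illustration, a minimal standalone sketch of the restore-and-re-kill idiom described above (hypothetical; not the actual Puma code):

# Hypothetical sketch: do the graceful-shutdown work, put the default TERM
# handler back, then re-send TERM so the process dies with 128 + 15 = 143.
Signal.trap("TERM") do
  # ... graceful shutdown work would go here ...
  Signal.trap("TERM", "DEFAULT")
  Process.kill("TERM", Process.pid)
end

sleep # wait for the signal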

@shayonj force-pushed the s/exit-handler branch 3 times, most recently from eddad58 to 3fa684d (June 19, 2017 03:45)
@shayonj force-pushed the s/exit-handler branch 3 times, most recently from 08f16c9 to fd38f5c (July 10, 2017 04:35)
@dekellum (Contributor)

Note, if this is merged (or #1183 is fixed as intended), that systemd service configs, as in /docs/systemd, should be updated to include:

SuccessExitStatus=143

...to avoid a normal shutdown/restart being reported with a warning and a failed state. I'm assuming this isn't a systemd default because traditional forked Unix daemons only return an exit status during startup (before forking away).
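
For context, that directive goes in the unit's [Service] section; a minimal hypothetical fragment (the ExecStart path is an assumption, not the actual /docs/systemd content):

[Service]
# Hypothetical example; adjust the path to your own Puma binstub and config.
ExecStart=/usr/local/bin/puma -C /srv/app/config/puma.rb
Restart=on-failure
# Treat "killed by SIGTERM" (128 + 15 = 143) as a clean stop.
SuccessExitStatus=143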

@shayonj (Contributor Author) commented Jul 15, 2017

@dekellum Thanks for the review!

I believe both SIGINT and SIGTERM are already part of the existing success exit handling for SuccessExitStatus:

Takes a list of exit status definitions that, when returned by the main service process, will be considered successful termination, in addition to the normal successful exit code 0 and the signals SIGHUP, SIGINT, SIGTERM, and SIGPIPE.

Source: https://www.freedesktop.org/software/systemd/man/systemd.service.html

So, for instance, if we were to run a systemd service with Restart=on-success, a TERM signal will be interpreted as a success and a proper restart will be conducted by systemd.

Existing setups still won't be affected, because the exit status returned before this PR was 0, which is also considered a successful exit in this case.

Let me know if I missed something :) and/or if you think it might be worth adding a note about success exit handling in https://github.com/puma/puma/blob/master/docs/systemd.md?

@dekellum (Contributor) commented Jul 17, 2017

I agree that your interpretation of the systemd docs is plausible and that the docs are not particularly clear. I've assumed that SuccessExitStatus=143 is necessary because I had to do the same with a Java-based project (the JVM exits with 143 on SIGTERM); see this serverfault answer. Without SuccessExitStatus=143 on that project, I would get the following on a systemctl stop:

Jul 17 12:05:24 klein systemd[877]: Stopping Guppy (Fishwife) HTTP Server...
Jul 17 12:05:24 klein systemd[877]: guppy.service: Main process exited, code=exited, status=143/n/a
Jul 17 12:05:24 klein systemd[877]: Stopped Guppy (Fishwife) HTTP Server.
Jul 17 12:05:24 klein systemd[877]: guppy.service: Unit entered failed state.
Jul 17 12:05:24 klein systemd[877]: guppy.service: Failed with result 'exit-code'.

But perhaps the JVM is doing something wrong, while your PR does it correctly? Next, I've gone ahead and set up a direct test of your changes, using MRI Ruby. Here is the complete tree for this test application (branch 'puma'):

https://github.com/dekellum/lionfish

With this setup, if I use your exact PR branch, it fails on startup here:

Jul 17 11:33:46 klein systemd[877]: Started Lionfish (Puma) HTTP Server.
Jul 17 11:33:46 klein puma[2025]: % Registered tcp:0.0.0.0:9238 for activation from LISTEN_FDS
Jul 17 11:33:46 klein puma[2025]: /home/david/.gem/ruby/2.2.0/bundler/gems/puma-fd38f5c8ce97/lib/puma/launcher.rb:171:in `run': undefined method `include?' for nil:NilClass (
Jul 17 11:33:46 klein puma[2025]:         from /home/david/.gem/ruby/2.2.0/bundler/gems/puma-fd38f5c8ce97/lib/puma/cli.rb:77:in `run'
Jul 17 11:33:46 klein puma[2025]:         from /home/david/.gem/ruby/2.2.0/bundler/gems/puma-fd38f5c8ce97/bin/puma:10:in `<top (required)>'
Jul 17 11:33:46 klein puma[2025]:         from /home/david/src/lionfish/sbin/puma:17:in `load'
Jul 17 11:33:46 klein puma[2025]:         from /home/david/src/lionfish/sbin/puma:17:in `<main>'
Jul 17 11:33:46 klein systemd[877]: lionfish.service: Main process exited, code=exited, status=1/FAILURE
Jul 17 11:33:46 klein systemd[877]: lionfish.service: Unit entered failed state.
Jul 17 11:33:46 klein systemd[877]: lionfish.service: Failed with result 'exit-code'.
Jul 17 11:33:47 klein systemd[877]: lionfish.service: Service hold-off time over, scheduling restart.
Jul 17 11:33:47 klein systemd[877]: Stopped Lionfish (Puma) HTTP Server.
Jul 17 11:33:47 klein systemd[877]: lionfish.service: Start request repeated too quickly.
Jul 17 11:33:47 klein systemd[877]: Failed to start Lionfish (Puma) HTTP Server.
Jul 17 11:33:47 klein systemd[877]: lionfish.service: Unit entered failed state.
Jul 17 11:33:47 klein systemd[877]: lionfish.service: Failed with result 'exit-code'.

If I instead cherry-pick fd38f5c onto a new testing branch rooted at the v3.9.1 tag, it starts correctly, but I get this error on stop:

Jul 17 11:47:20 klein puma[2487]: - Gracefully stopping, waiting for requests to finish
Jul 17 11:47:20 klein systemd[877]: Stopping Lionfish (Puma) HTTP Server...
Jul 17 11:47:20 klein puma[2487]: /home/david/src/puma/lib/puma/server.rb:392:in `close': can't modify frozen IOError (RuntimeError)
Jul 17 11:47:20 klein puma[2487]:         from /home/david/src/puma/lib/puma/server.rb:392:in `ensure in handle_servers'
Jul 17 11:47:20 klein puma[2487]:         from /home/david/src/puma/lib/puma/server.rb:396:in `handle_servers'
Jul 17 11:47:20 klein puma[2487]:         from /home/david/src/puma/lib/puma/server.rb:325:in `block in run'
Jul 17 11:47:20 klein systemd[877]: lionfish.service: Main process exited, code=exited, status=1/FAILURE
Jul 17 11:47:20 klein systemd[877]: Stopped Lionfish (Puma) HTTP Server.
Jul 17 11:47:20 klein systemd[877]: lionfish.service: Unit entered failed state.
Jul 17 11:47:20 klein systemd[877]: lionfish.service: Failed with result 'exit-code'.

So, in summary, I'm finding other problems on the master branch and/or with your change that occur before I can even determine whether the SuccessExitStatus=143 setting is necessary. Could you try to reproduce this yourself? One more note in case it matters: my local version of MRI Ruby is:

ruby 2.2.7p470 (2017-03-28 revision 58194) [x86_64-linux]

dekellum added a commit to dekellum/guppy that referenced this pull request Jul 17, 2017
@dekellum (Contributor) commented Jul 17, 2017

The puma/launcher.rb:171:in `run': undefined method `include?' issue is apparently #1344. I found a workaround for this (853c01d), and now I'm able to confirm that your initial branch works as you expected with systemd, no problems and without SuccessExitStatus=143 being required. I tested it both with and without clustered mode.

Again with the workaround, if I merge this PR to master it also works.

My cherry-picked variant based on v3.9.1 has that other error, but I don't think it's relevant to this PR and current master.

So I guess the JVM has it wrong and you have it right, at least for MRI. Thanks @shayonj. Next I want to check this with JRuby (on the JVM).

@dekellum (Contributor)

OK, this PR also seems to work fine under JRuby with systemd (without socket activation), and SuccessExitStatus=143 is not required for any of these combinations. I assume the difference is that the JVM's signal handler calls exit(128+15) instead of kill("TERM") on itself after removing the exit handler? Even on JRuby, installing the DEFAULT signal handler (presumably via JNI) works as intended.

@shayonj (Contributor Author) commented Jul 18, 2017

Thanks so much for such a thorough review @dekellum 🎉 .

Yes, I have Puma running under a systemd config on my VPS; now that I think of it, I should perhaps have shared the config here :). And yes, MRI Ruby.

Another quick way to test the SuccessExitStatus requirements would be to fire up a miniature systemd config pointing to a bash script that just sleeps. You can kill that process with -15 and observe the on-success behavior accordingly.

Lastly, exit(143) is different from kill("TERM"): exit simply terminates the program and reports a status code to the parent, whereas kill delivers a signal and lets the process shut down gracefully (if it handles it).
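
To illustrate the distinction with a throwaway Ruby script (hypothetical, not part of Puma or this PR):

# How a parent process sees each form of termination.
pid = fork { exit 143 }                       # child exits normally with code 143
_, st = Process.wait2(pid)
p [st.exitstatus, st.termsig]                 # => [143, nil]

pid = fork { Process.kill("TERM", Process.pid); sleep }  # child is killed by SIGTERM
_, st = Process.wait2(pid)
p [st.exitstatus, st.termsig]                 # => [nil, 15]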

@shayonj (Contributor Author) commented Jul 19, 2017

TL;DR on the above: no additional changes to the existing implementation were introduced, TERM signals now return the exit status code correctly, and single mode gets graceful-stop behavior.

@stereobooster (Contributor)

Please add a test for this.

@shayonj (Contributor Author) commented Aug 4, 2017

@stereobooster do you mean tests for making sure the right exit status code is returned? I suppose verifying the exit status of a server (or process) from within an existing process (the test suite) could be tricky ($? only reflects the last child of the current process), which is probably why there aren't existing specs/tests.

I am going to attempt it and also look into any additional ways to assert the trapping of the signal.

@stereobooster (Contributor) commented Aug 4, 2017

It is supposed to be an integration test. The idea is that you launch an instance of Puma in a subprocess and capture the return status from it.

# returns the pid of the Puma process
def start_puma
  fork do
    exec "bundle exec puma -C puma_config.rb"
  end
end

# sends SIGTERM to the given pid
def stop(pid)
  Process.kill("TERM", pid)
end

We'd need to google how to capture the return status in this scenario.
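
One way to capture it, sketched on top of the helpers above (the fixed sleep and puma_config.rb are assumptions; a real test would poll until the server is up):

pid = start_puma
sleep 5                          # crude wait for boot
stop(pid)
_, status = Process.wait2(pid)   # reap the child; returns [pid, Process::Status]
status.termsig                   # 15 if the process died from an untrapped SIGTERM
status.exitstatus                # 143 if it trapped TERM and exited with 128 + 15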

@shayonj force-pushed the s/exit-handler branch 3 times, most recently from 3928781 to f43ceea (August 5, 2017 05:23)
@shayonj (Contributor Author) commented Aug 5, 2017

@stereobooster yes, I am going for integration tests; I was looking for a way of working with a subprocess. I have added some integration tests; they pass locally for me but appear to be failing in TravisCI. It looks like re-raising/killing the process might be causing an exception to be raised. I will follow up on this soon. Also, the teardown isn't working as expected:

_, status = Process.wait2(@server.pid)
assert_equal 15, status

@server = nil
A reviewer (Contributor) commented on this hunk, suggesting the server be closed before being cleared:

@server.close
@server = nil

@shayonj force-pushed the s/exit-handler branch 3 times, most recently from 8db6ee5 to c9eca37 (August 5, 2017 17:45)
@shayonj force-pushed the s/exit-handler branch 3 times, most recently from 96fa815 to d4b5ad2 (August 5, 2017 22:43)
@shayonj (Contributor Author) commented Aug 11, 2017

r? @schneems (I see you worked on the launcher migration) or @nateberkopec || @stereobooster
🙂

@schneems (Contributor)

We can do something like this to re-raise the exact exception https://stackoverflow.com/questions/29568298/run-code-when-signal-is-sent-but-do-not-trap-the-signal-in-ruby
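
A minimal sketch of that re-raise pattern (illustrative only, not the exact code this PR ended up with):

# Run shutdown work on SIGTERM, then re-raise the signal exception so Ruby's
# default handling terminates the process with the conventional TERM status (143).
Signal.trap("SIGTERM") do
  # ... graceful shutdown work would go here ...
  raise SignalException, "SIGTERM"
end

sleep # wait for the signal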

@shayonj (Contributor Author) commented Aug 11, 2017

TIL, we can re-raise the same signal exception instead of restoring the signal handler and killing the process.

Updated 🙂

@nateberkopec merged commit 137a80d into puma:master on Aug 16, 2017.
@wodka commented Oct 11, 2017

Hi, with the applied changes (3.10) I get an error on Heroku during a restart:

[screenshot: error output during restart, Oct 11, 2017]

Is this intended (in terms of my having to catch the exception somehow)?

@dekellum (Contributor)

This PR was merged. I suggest you (@wodka) please create a new issue for this, including the text of the error. I do believe the change in this PR has uncovered a bug, but the bug probably resides in how bundler's bundle exec rescues Puma's SignalException, logs it to stderr, and calls Kernel::abort (status 1). I am able to reproduce the error outside of Heroku with bundle exec. The issue is avoided when using a bundle binstubs puma wrapper; I'm not sure whether that is under user control and thus a viable workaround on Heroku.
