Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Socket removed after restart #1988

Closed
mcg opened this issue Sep 24, 2019 · 30 comments
Closed

Socket removed after restart #1988

mcg opened this issue Sep 24, 2019 · 30 comments
Labels

Comments

@mcg
Copy link

mcg commented Sep 24, 2019

Using systemd and hot reloading as described(https://github.com/puma/puma/blob/master/docs/systemd.md#socket-activation) the puma socket file is removed when systemctl restart puma.service is issues. This worked until Puma version > 3.12.

ruby 2.6.3
puma 4.2.0

Puma config:

directory '/var/app/current'
environment 'staging'

pidfile '/var/app/shared/tmp/pids/puma.pid'
state_path '/var/app/shared/tmp/pids/puma.state'
stdout_redirect '/var/app/shared/log/puma_access.log', '/var/app/shared/log/puma_error.log', true

workers 2
threads 1, 2

bind 'unix:///var/app/shared/tmp/sockets/puma.sock'
activate_control_app 'tcp://127.0.0.1:9293', { auth_token: '...' }

preload_app!

on_restart do
  puts 'Refreshing Gemfile'
  ENV["BUNDLE_GEMFILE"] = '/var/app/current/Gemfile'
end

before_fork do
  ActiveRecord::Base.connection_pool.disconnect!
end

on_worker_boot do
  SemanticLogger.reopen if defined?(SemanticLogger)

  ActiveSupport.on_load(:active_record) do
    ActiveRecord::Base.establish_connection
  end
end

puma.service:

[Unit]
Description=Puma HTTP Server
After=network.target

Requires=puma.socket

[Service]
Type=simple

User=appuser
Group=appuser

WorkingDirectory=/var/app/current

Environment=PUMA_DEBUG=1

ExecStart=/var/app/rvm1scripts/rvm-auto.sh . bundle exec --keep-file-descriptors puma -C /var/app/shared/config/puma.rb

Restart=always

TimeoutStopSec=180

[Install]
WantedBy=multi-user.target

puma.socket:

[Unit]
Description=Puma HTTP Server Accept Sockets

[Socket]
SocketUser=appuser
SocketGroup=appuser
ListenStream=/var/app/shared/tmp/sockets/puma.sock

NoDelay=true
ReusePort=true
Backlog=1024

[Install]
WantedBy=sockets.target

To Reproduce

  • With Puma running under systemd with socket activation, confirm that the configured socket file exists.
  • Issue a systemctl restart puma.service
  • Confirm that the configured socket file is now missing
@MSP-Greg
Copy link
Member

@mcg,

Trying to tie all this together. I'm looking at #1842 (comment), and confused, as that's Puma 4.0.1, and it seems to be working.

Hence, what am I missing? If you have any logs, they may be helpful...

@mcg
Copy link
Author

mcg commented Sep 24, 2019

@MSP-Greg #1842 (comment) isn't working for me, that's my non-working config.

Logs look "normal" tho the socket gets removed:

puma.service log

-- Unit puma.service has finished starting up.
--
-- The start-up result is done.
Sep 24 18:39:18 staging-appserver-0 rvm-auto.sh[8174]: % Registered unix:/var/app/shared/tmp/sockets/puma.sock for activation from LISTEN_FDS
Sep 24 18:39:18 staging-appserver-0 rvm-auto.sh[8174]: [8174] Puma starting in cluster mode...
Sep 24 18:39:18 staging-appserver-0 rvm-auto.sh[8174]: [8174] * Version 4.2.0 (ruby 2.6.3-p62), codename: Distant Airhorns
Sep 24 18:39:18 staging-appserver-0 rvm-auto.sh[8174]: [8174] * Min threads: 1, max threads: 2
Sep 24 18:39:18 staging-appserver-0 rvm-auto.sh[8174]: [8174] * Environment: staging
Sep 24 18:39:18 staging-appserver-0 rvm-auto.sh[8174]: [8174] * Process workers: 2
Sep 24 18:39:18 staging-appserver-0 rvm-auto.sh[8174]: [8174] * Preloading application
Sep 24 18:39:29 staging-appserver-0 rvm-auto.sh[8174]: [8174] * Activated unix:///var/app/shared/tmp/sockets/puma.sock
Sep 24 18:39:29 staging-appserver-0 rvm-auto.sh[8174]: [8174] ! WARNING: Detected 4 Thread(s) started in app boot:
Sep 24 18:39:29 staging-appserver-0 rvm-auto.sh[8174]: [8174] Use Ctrl-C to stop

puma_access.log

[12489] - Gracefully shutting down workers...
=== puma startup: 2019-09-24 18:39:29 +0000 ===
[8174] * Starting control server on tcp://127.0.0.1:9293
[8174] - Worker 0 (pid: 9222) booted, phase: 0
[8174] - Worker 1 (pid: 9232) booted, phase: 0

puma_error.log

=== puma startup: 2019-09-24 18:39:29 +0000 ===

@MSP-Greg
Copy link
Member

@gingerime @mcg

A decision was made to remove UNIXSocket files, and the change took effect in 4.2.0. I believe this may affect systemd when (and only when) a stop/start is done.

I think, but I'm not sure, that it should not affect puma 'phased-restart' or puma 'restart'.

JFYI:

Puma 'phased-restart' leaves all the listener sockets open.

Puma 'restart' replaces itself via Kernel.exec, and doesn't remove the UNIXSocket files.

If the UNIXSocket files exists, but is not connected to a UNIXServer, Puma will delete it and start as it should.

@mcg
Copy link
Author

mcg commented Sep 25, 2019

This has been broken since, 4.0.0. not 4.2.0. This effects systemd when used as documented with socket activation and hot restarts. If this change is intentional, it needs to be documented and marked as a breaking change.

@gingerlime
Copy link
Contributor

gingerlime commented Sep 25, 2019

@MSP-Greg It's tricky for me to reproduce this, but I believe this also affects phased-restart systemctl reload puma.service, and I see that log file handles aren't being updated with logrotate... I mentioned this on #1962 and also #1842 (comment)

EDIT: to clarify, my systemd script runs pumactl ... phased-restart when you issue systemctl reload puma.service

EDIT2: my repro was focusing on the restart (not phased-restart) issue so far, which was easier to reproduce... I can try to also reproduce the phased-restart issue, but this might be trickier.

@mcg
Copy link
Author

mcg commented Sep 25, 2019

Referencing @gingerlime repo that seems to reproduce this here so it's not lost. https://github.com/gingerlime/puma-test-socket

@mcg
Copy link
Author

mcg commented Sep 25, 2019

May be an unrelated code change but while trying to create a reproducible VM for this I notice that the puma_access.log is different for 4.2.0 on restarts.
For 4.2.0

=== puma startup: 2019-09-25 15:36:02 +0000 ===
[17048] * Starting control server on tcp://127.0.0.1:9293
[17048] - Worker 0 (pid: 17059) booted, phase: 0
[17048] - Worker 1 (pid: 17061) booted, phase: 0
[17059] ! Detected parent died, dying
[17061] ! Detected parent died, dying

For 3.21

=== puma startup: 2019-09-25 15:33:40 +0000 ===
[16513] * Starting control server on tcp://127.0.0.1:9293
[16513] - Worker 0 (pid: 16524) booted, phase: 0
[16513] - Worker 1 (pid: 16526) booted, phase: 0
[16513] - Gracefully shutting down workers...

@MSP-Greg
Copy link
Member

@mcg

Thanks for the info. All this is helpful.

But, I'm only windows locally, and using systemd is something that hasn't been implemented in CI yet. Maybe another Actions workflow can test it, not sure...

From above:

ExecStart=/var/app/rvm1scripts/rvm-auto.sh . bundle exec --keep-file-descriptors puma -C /var/app/shared/config/puma.rb

Restart=always

From @gingerlime's https://github.com/gingerlime/puma-test-socket

ExecStart=/usr/bin/bundle exec --keep-file-descriptors puma -C /app/puma.rb
ExecReload=/usr/bin/bundle exec pumactl -F /app/puma.rb phased-restart

Those are different. As I tried to mention above, the code paths for start/stop/start, start/restart, and start/phased-restart are all different.

Also, Puma's existing CI is mostly 'signal', but people seem to be using pumactl socket control with systemd. I mention this because I've been working in my Puma fork on an updated test framework, and there are some differences between control behavior, dependent on whether signal or socket used.

So, at least from my perspective, it's very important to know what commands/protocl were used with logs showing problems. IOW, start/stop/start, start/restart, or start/phased-restart, and whether signal or socket.

Small point, I've seen 3.12 mentioned, but 3.12.0 and 3.12.1 were released about 9 months apart. I haven't really reviewed the differences, but the exact version may be helpful to know....

I can't speak for @nateberkopec, but I suspect he feels the same way I do. I strongly want to get Puma working and stable in all configurations. For instance, I've got a lot of test changes (improvements?) , but I don't want to push them until Puma is bug free and stable...

Finally, what you've shown above I'll see if I can repo without systemd...

@mcg
Copy link
Author

mcg commented Sep 25, 2019

The documented systemd config is using signals. It seems almost like a waste of time to try and reproduce without systemd, at least for my issue as it's using systemd's socket activation.

@MSP-Greg
Copy link
Member

@mcg

Thanks. I've looked at the systemd files a bit more, and found some systemd man pages (haven't read yet).

A recent change affected closing UNIXSocket files, PR #1992 reverts the recent change.

There is one more place where this is done, it was added between 3.12.1 and 4.0.0. I haven't reverted that, but if this doesn't change anything, we can try that next.

I'm assuming that systemd will consider UNIXSockets 'handled' if the file exists, and that the issue now is that because the files have been removed, it's trying to 'hold' them?

@mcg
Copy link
Author

mcg commented Sep 25, 2019

@MSP-Greg No change for me using #1992. Socket is still removed.

@MSP-Greg
Copy link
Member

@mcg

Thanks for checking. I know this is a PITA.

I mentioned I'm working on revising some of the test framework, and I've added a lot of tests just because. Puma is stable, this issue is particular to using Puma with systemd. Excepting the log issue that I haven't looked at.

The other place where files are unlinked only happens when SIGTERM is sent to the process. Could that be the case with your setup?

Otherwise, the remaining unlink statements happen in the code involved with inheriting listeners...

Sanity check. You're issue is with start(1)/stop/start(2) and start(2) is missing the socket?

@mcg
Copy link
Author

mcg commented Sep 25, 2019

Sanity check. You're issue is with start(1)/stop/start(2) and start(2) is missing the socket?

@MSP-Greg Correct.

Starting with a system that Puma is not running on. If I systemctl start puma.socket puma.service. The socket is created by systemd, and Puma starts and works. Then if a systemctl restart puma.service is issued, puma is stopped and restarted by systemd. Puma comes back up, but the socket is gone.

@MSP-Greg
Copy link
Member

MSP-Greg commented Sep 25, 2019

@mcg

Thanks. Not going crazy.

=== puma startup: 2019-09-25 15:36:02 +0000 ===
[17048] * Starting control server on tcp://127.0.0.1:9293
[17048] - Worker 0 (pid: 17059) booted, phase: 0
[17048] - Worker 1 (pid: 17061) booted, phase: 0
[17059] ! Detected parent died, dying
[17061] ! Detected parent died, dying

The above seems to indicate that Puma is being killed by maybe SIGKILL?

I don't know if you can try a SIGINT (stop) or SIGTERM (graceful stop)?

JFYI, I'm looking at
https://www.freedesktop.org/software/systemd/man/systemd.service.html
and
https://www.freedesktop.org/software/systemd/man/systemd.kill.html
but not being able to actually use it is not good...

EDIT: Original systemd PR: #231 commit: 32088ec

@truongduyng
Copy link

I also meet this issue without using systemd
#1860 (comment)

@mcg mcg changed the title Socket removed after reload Socket removed after restart Sep 26, 2019
@mcg
Copy link
Author

mcg commented Sep 26, 2019

@MSP-Greg

  • Tried sending 2 - SIGINT, Puma restarted, socket removed
  • Tried sending 15 - SIGTERM, Puma restarted, socket removed
  • Tried sending 12 - SIGUSR2, Initially appeared to restart workers as it should, then the Puma primary process died and was restarted by systemd. Socket also removed.

@MSP-Greg
Copy link
Member

@mcg

Thanks for checking. After looking at both Puma & systemd docs last night, I may have an idea where the issue is, but that's based on some assumptions that may be incorrect.

I need to check some things with Puma first, as I'm going to see if I can mimic the LISTEN_FDS used by systemd, as that's where the info re activated sockets is contained.

Overall, there is one issue that I am confused about regarding start/stop/start operations not working. The implication is that systemd is providing a different environment on the second start than it is on the first. Why is that?

JFYI,

  1. 'inherited sockets' exists when (and only when) Puma is restarted. They do not exist with stop/start, or with phased-restart. These are created based on ENV[/PUMA_INHERIT_\d+/] values.

  2. 'activated sockets' only exist when 'imported' from the systemd environment. These are created based on ENV['LISTEN_FDS'] and ENV['LISTEN_PID'] values.

BTW, sorry for all the info. but months from now, if there are issues with systemd, maybe I or someone else may find them helpful...

@mcg
Copy link
Author

mcg commented Sep 26, 2019

Found a bad workaround. chmod the directory the socket is stored in such that the user Puma is running under cannot delete the socket. Restarts work fine.

@mcg
Copy link
Author

mcg commented Sep 26, 2019

Perhaps this helps find where the socket is being deleted. With socket deletion blocked by chmod, errors in the puma_error.log.

Errno::EACCES: Permission denied @ apply2files - /var/www/app/shared/tmp/sockets/puma.sock
/ruby-2.6.3@app/gems/puma-4.2.0/lib/puma/binder.rb:427:in `unlink'
/ruby-2.6.3@app/gems/puma-4.2.0/lib/puma/binder.rb:427:in `block in close_unix_paths'
/ruby-2.6.3@app/gems/puma-4.2.0/lib/puma/binder.rb:427:in `each'
/ruby-2.6.3@app/gems/puma-4.2.0/lib/puma/binder.rb:427:in `close_unix_paths'
/ruby-2.6.3@app/gems/puma-4.2.0/lib/puma/launcher.rb:208:in `close_binder_unix_paths'
/ruby-2.6.3@app/gems/puma-4.2.0/lib/puma/cluster.rb:532:in `ensure in run'
/ruby-2.6.3@app/gems/puma-4.2.0/lib/puma/cluster.rb:532:in `run'
/ruby-2.6.3@app/gems/puma-4.2.0/lib/puma/launcher.rb:172:in `run'
/ruby-2.6.3@app/gems/puma-4.2.0/lib/puma/cli.rb:80:in `run'
/ruby-2.6.3@app/gems/puma-4.2.0/bin/puma:10:in `<top (required)>'
/ruby-2.6.3@app/bin/puma:23:in `load'
/ruby-2.6.3@app/bin/puma:23:in `<top (required)>'

@MSP-Greg
Copy link
Member

MSP-Greg commented Sep 26, 2019

@mcg

Thanks.

Hate to ask, are you sure you tested off of #1992? The first diff shown in the first commit removes cluster.rb:532 (you may have to scroll to it, shown in 'orange'), which is shown above? IOW, #1992 should have worked?

@mcg
Copy link
Author

mcg commented Sep 26, 2019

I tested restarts with #1992 but reverted back to released 4.2.0. I'll go back to #1992 and try this again.

@mcg
Copy link
Author

mcg commented Sep 26, 2019

With #1992

Errno::EACCES: Permission denied @ apply2files - /var/www/app/shared/tmp/sockets/puma.sock
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/binder.rb:422:in `unlink'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/binder.rb:422:in `block in close_listeners'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/binder.rb:418:in `each'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/binder.rb:418:in `close_listeners'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/launcher.rb:205:in `close_binder_listeners'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/cluster.rb:391:in `block in setup_signals'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/cluster.rb:491:in `select'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/cluster.rb:491:in `run'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/launcher.rb:172:in `run'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/lib/puma/cli.rb:80:in `run'
/ruby-2.6.3@app/bundler/gems/puma-23a61ad827e5/bin/puma:10:in `<top (required)>'
/ruby-2.6.3@app/bin/puma:23:in `load'
/ruby-2.6.3@app/bin/puma:23:in `<top (required)>'

@MSP-Greg
Copy link
Member

MSP-Greg commented Sep 26, 2019

@mcg

Thank you. Remember how I said in #1988 (comment) that there was a 2nd instance where the files were unlinked, but I thought it didn't affect this use case?

I was wrong.

EDIT:

You've been very helpful and patient. Thank you. Personally, I'd like to try to mimic some of the environment settings used by systemd in the test suite, I may have some questions.

@MSP-Greg
Copy link
Member

@nateberkopec

I think I know what changes are needed for this. Puma has always worked fine if the UNIXSocket files were not removed, so I think we should remove all code that does so. Maybe even test for that.

But, I think there are some duplicate test method names, so maybe we should use file names that also include the test class...

You're working on some refactoring, I'm doing the same with the tests and also puma/control_cli.rb.

Maybe get this, #1986, and whatever else patched, do a point release, let the dust settle, and start back on the refactoring work?

Also, maybe set JRuby back to 'allow-failure' until it can be made stable? After all the work on tests, always having to check logs, etc, I'd really like to see a few green checks...

@nateberkopec nateberkopec added this to the 4.2.1 milestone Sep 30, 2019
This was referenced Mar 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants