
Socket removed after reload #1775

Closed
j15e opened this issue Apr 16, 2019 · 18 comments


j15e commented Apr 16, 2019

Steps to reproduce

I followed the instructions in the wiki to set up puma with systemd on Ubuntu 18.04 (https://github.com/puma/puma/blob/master/docs/systemd.md#socket-activation), but puma reload is not working as expected with the latest version of puma.
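For context, a minimal socket-activation setup along the lines of that doc looks roughly like the units below. The paths match the logs that follow; the exact option set and file locations are illustrative, not a verbatim copy of the wiki:

```ini
# /etc/systemd/system/puma.socket -- systemd owns the listening socket
[Unit]
Description=Puma HTTP Server Accept Sockets

[Socket]
ListenStream=/home/app/web/shared/tmp/sockets/puma.sock

[Install]
WantedBy=sockets.target

# /etc/systemd/system/puma.service -- puma inherits the socket via LISTEN_FDS
[Unit]
Description=Puma HTTP Server
Requires=puma.socket
After=network.target

[Service]
Type=simple
WorkingDirectory=/home/app/web/current
ExecStart=/usr/local/bin/bundle exec puma -C /home/app/web/shared/puma.rb
# USR1 triggers puma's phased restart; this is the reload path in the logs
ExecReload=/bin/kill -USR1 $MAINPID
Restart=always

[Install]
WantedBy=multi-user.target
```

With these units, `systemctl reload puma` sends USR1 to the master process, which is the phased restart shown failing in the logs below.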

During a reload, I see these logs which highlight the problem (with PUMA_DEBUG=1):

Apr 16 03:38:22 fizz bundle[1184]: % Registered unix:/home/app/web/shared/tmp/sockets/puma.sock for activation from LISTEN_FDS
Apr 16 03:38:22 fizz bundle[1184]: [1184] Puma starting in cluster mode...
Apr 16 03:38:22 fizz bundle[1184]: [1184] * Version 3.12.1 (ruby 2.5.5-p157), codename: Llamas in Pajamas
Apr 16 03:38:22 fizz bundle[1184]: [1184] * Min threads: 8, max threads: 16
Apr 16 03:38:22 fizz bundle[1184]: [1184] * Environment: production
Apr 16 03:38:22 fizz bundle[1184]: [1184] * Process workers: 2
Apr 16 03:38:22 fizz bundle[1184]: [1184] * Phased restart available
Apr 16 03:38:22 fizz bundle[1184]: [1184] * Activated unix:///home/app/web/shared/tmp/sockets/puma.sock
Apr 16 03:38:22 fizz bundle[1184]: [1184] Use Ctrl-C to stop
Apr 16 03:38:25 fizz bundle[1184]: [1184] - Worker 1 (pid: 1225) booted, phase: 0
Apr 16 03:38:25 fizz bundle[1184]: [1184] - Worker 0 (pid: 1221) booted, phase: 0
Apr 16 03:38:43 fizz systemd[1]: Reloading Puma HTTP Server.
Apr 16 03:38:43 fizz bundle[1184]: [1184] - Starting phased worker restart, phase: 1
Apr 16 03:38:43 fizz bundle[1184]: [1184] + Changing to /home/app/web/current
Apr 16 03:38:43 fizz systemd[1]: Reloaded Puma HTTP Server.
Apr 16 03:38:45 fizz bundle[1184]: [1184] - Stopping 1221 for phased upgrade...
Apr 16 03:38:45 fizz bundle[1184]: [1184] - TERM sent to 1221...
Apr 16 03:38:49 fizz bundle[1184]: [1184] - Worker 0 (pid: 1302) booted, phase: 1
Apr 16 03:38:49 fizz bundle[1184]: [1184] - Stopping 1225 for phased upgrade...
Apr 16 03:38:49 fizz bundle[1184]: [1184] - TERM sent to 1225...
Apr 16 03:38:49 fizz bundle[1184]: Exception handling servers: No such file or directory @ apply2files - /home/app/web/shared/tmp/sockets/puma.sock (Errno::ENOENT)
Apr 16 03:38:49 fizz bundle[1184]: /home/app/web/shared/bundle/ruby/2.5.0/bundler/gems/puma-200c95d30c37/lib/puma/binder.rb:53:in `unlink'
Apr 16 03:38:49 fizz bundle[1184]: /home/app/web/shared/bundle/ruby/2.5.0/bundler/gems/puma-200c95d30c37/lib/puma/binder.rb:53:in `block in close'
Apr 16 03:38:49 fizz bundle[1184]: /home/app/web/shared/bundle/ruby/2.5.0/bundler/gems/puma-200c95d30c37/lib/puma/binder.rb:53:in `each'
Apr 16 03:38:49 fizz bundle[1184]: /home/app/web/shared/bundle/ruby/2.5.0/bundler/gems/puma-200c95d30c37/lib/puma/binder.rb:53:in `close'
Apr 16 03:38:49 fizz bundle[1184]: /home/app/web/shared/bundle/ruby/2.5.0/bundler/gems/puma-200c95d30c37/lib/puma/server.rb:939:in `graceful_shutdown'
Apr 16 03:38:49 fizz bundle[1184]: /home/app/web/shared/bundle/ruby/2.5.0/bundler/gems/puma-200c95d30c37/lib/puma/server.rb:419:in `handle_servers'
Apr 16 03:38:49 fizz bundle[1184]: /home/app/web/shared/bundle/ruby/2.5.0/bundler/gems/puma-200c95d30c37/lib/puma/server.rb:354:in `block in run'
Apr 16 03:38:52 fizz bundle[1184]: [1184] - Worker 1 (pid: 1327) booted, phase: 1

After the reload, the worker processes are reloaded and running, but nginx stops serving anything from that point on, because the puma.sock socket is removed (not only closed; the file itself is no longer there).

I think removing the @own_binder logic in #1685 may now cause puma to incorrectly unlink the socket: Puma::Binder#close unlinks unix socket files, and during a phased restart it does so in the wrong order (the unlink occurs after the new workers are spawned, when the old workers are stopped):

puma/lib/puma/binder.rb

Lines 51 to 54 in 200c95d

def close
  @ios.each { |i| i.close }
  @unix_paths.each { |i| File.unlink i }
end
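The ownership guard being discussed could be sketched like this. This is NOT Puma's actual code; the class and method names are illustrative, and it simulates socket-activation handover rather than reading LISTEN_FDS. The idea is simply that close should only unlink socket files the process created itself:

```ruby
require 'socket'
require 'tmpdir'

# Minimal sketch of the guard discussed in this thread: on close, only
# unlink UNIX socket files that this process created itself, never ones
# inherited via systemd socket activation (LISTEN_FDS).
class GuardedBinder
  def initialize
    @ios = []        # open server sockets
    @unix_paths = [] # socket files we created and therefore own
  end

  # Bind a UNIX socket. When `inherited` is true we pretend the listener
  # was handed over by systemd, so its path must not be deleted on close.
  # (For demonstration we still create the file ourselves.)
  def bind_unix(path, inherited: false)
    @ios << UNIXServer.new(path)
    @unix_paths << path unless inherited
  end

  def close
    @ios.each(&:close)
    # The File.socket? check also avoids the Errno::ENOENT seen in the
    # logs above, when the path has already been removed by someone else.
    @unix_paths.each { |p| File.unlink(p) if File.socket?(p) }
  end
end
```

Whether Puma should track ownership this way (as the pre-#1685 @own_binder flag effectively did) is exactly what this issue is debating.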

Expected behavior

Puma should keep serving via the unix socket after a reload.

Actual behavior

The unix socket is removed after being activated by the newly restarted workers.

System configuration

Ruby version: 2.5.5
Rails version: 5.1.6.2
Puma version: master


Edit: Just for your information, sadly I am now using port binding with my systemd setup (without any issue). Thanks to everyone investigating & trying to fix those socket issues.


j15e commented Apr 16, 2019

Also, I'm not sure how this is related, but maybe it can help pinpoint the issue(s) with sockets: reloading puma also removes the socket on v3.12.1 with ruby 2.6.2 and 2.5.5 when application preloading is enabled.

In that situation there is no error message, since there is no additional attempt to remove the socket, but the socket disappears after the first reload too.

The puma systemd documentation states that preload is compatible:

 Among other advantages, this keeps listening sockets open across puma restarts and achieves graceful restarts, including when upgraded puma, and is compatible with both clustered mode and application preload.

But the sample configurations do not specify preload, so maybe this wasn't really tested. Maybe I looked in the wrong place, but I didn't see any tests of signal handling or reload management. Maybe we should add integration tests covering these usages in https://github.com/puma/puma/blob/master/test/test_integration.rb?


dekellum commented Apr 19, 2019

This issue might be better described as: "phased restart" is not currently compatible with systemd socket activation. The systemd documentation you quote doesn't actually claim otherwise: cluster mode, preload, and socket activation all work with systemd-level restarts (graceful when socket activation is used), which are non-rolling, restarting the master and all workers in parallel.

Assuming this is true, I'm not sure whether the limitation can or should be fixed (per your Binder::close suggestion), or whether it might at least be better documented. For the latter, the restart doc could be improved by mentioning and linking systemd with socket activation as a fourth way to achieve graceful restarts.

If you really need phased restart (e.g. because your load time is too long), then as a workaround you might need to disable systemd socket activation.


j15e commented Apr 19, 2019

OK, so if I understand correctly: when using puma with systemd socket activation, puma should not be reloaded via a reload signal, as this isn't supported by design? It should only be stopped and started by systemd, and systemd will keep the socket active during the stop/start transition?

dekellum (Contributor) commented

I'm personally not sure what the answer to your first question ("…by design?") is. That puma's phased/rolling restart unlinks the UNIX socket file on reload (which I must presume has some motivation) is not going to work with systemd socket activation. There might be some permission scheme that would disallow puma from doing this, but I'd guess that would abort the reload.

As for your second question: yes, systemd will gracefully start a new worker or cluster of workers and shut down the old ones without any request loss. The major difference is that the entire cluster is started in parallel (like a cold start) instead of restarting workers serially, as with phased/rolling restart. That can be an advantage, in that it makes it less likely that a single user hits different deployed versions across subsequent requests within a short time period. The linked restart doc discusses this in the context of migrations, for example.

A related question: could you also work around this by using TCP instead of a UNIX socket? I see lots of puma issues specific to UNIX sockets where the OP isn't clear on why they want UNIX sockets in the first place. A minor performance advantage is one possibility, but IMO it is likely not measurably significant, at least at peak throughput if Rails is in use.
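For reference, the TCP workaround suggested here is a one-line change in the puma config (the address and port below are illustrative); nginx would then proxy to that port instead of the socket path:

```ruby
# config/puma.rb -- bind over loopback TCP instead of a UNIX socket,
# sidestepping the unlink-on-restart problem entirely
bind 'tcp://127.0.0.1:3000'
```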

dekellum (Contributor) commented

Also, regarding your original comment on testing: yes, I agree there are many test coverage holes in this area, and the potential for regressions is high. As part of my work on puma+systemd, I have https://github.com/dekellum/lionfish. I just updated it for puma 3.12.1 to sanity-check that recent releases haven't totally hosed systemd and (TCP) socket activation. Parts of this could be reused for puma tests and CI. It's unlikely I'll find time soon, but I think my next step would be to add CI to lionfish; if that looks good, it could be repackaged for a puma PR. I'd accept a PR to lionfish if you wanted to work on it.


j15e commented Apr 19, 2019

Well, I was kind of asking myself the same question: unix sockets seem to be suggested in the documentation, yet they seem to cause many issues, and the performance gains are not well documented either. They have a security advantage in that no port is opened, but I don't think that is a strong argument either, since you should configure your binding and firewall correctly anyway.

I think I'll move to a 0.0.0.0 port socket.

dekellum (Contributor) commented

So hopefully there are two possible workarounds, but it does seem like #1685 must have caused this regression. Reading there, it appears Binder::close was never called previously. Perhaps puma should add some configuration for this behavior, as a follow-up to #1685 and a possible fix for this? I would suggest keeping this issue open even if your workaround succeeds.

calmyournerves commented

Any progress on this one? It seems like Puma doesn't work with systemd socket activation anymore as of 4.0.0.

nateberkopec (Member) commented

Can you provide more details @calmyournerves? The original issue here was on 3.12.

calmyournerves commented

Hey @nateberkopec!

I think this issue was reported on master. The changes in #1685 (which was already merged when this issue was created) seem to cause the socket to be removed when Puma is stopped. So calling systemctl stop app (or restart) makes the socket managed by systemd disappear, breaking socket activation. This works perfectly fine for us on 3.12.

As described in the first comment, removing the logic around @own_binder seems to be the problem.

mcg commented Jun 26, 2019

I ran into this updating from 3.12.1 to 4.0.0 using socket activation. The socket is removed on restart.


Fudoshiki commented Jun 30, 2019

ruby 2.6.3
rails 6.1
puma 4

config/puma.rb

# frozen_string_literal: true

# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
max_threads_count = ENV.fetch('RAILS_MAX_THREADS') { 5 }
min_threads_count = ENV.fetch('RAILS_MIN_THREADS') { max_threads_count }
threads min_threads_count, max_threads_count

# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch('RAILS_ENV') { 'development' }
# early_hints true

# Specifies the number of `workers` to boot in clustered mode.
# Workers are forked webserver processes. If using threads and workers together
# the concurrency of the application would be max `threads` * `workers`.
# Workers do not work on JRuby or Windows (both of which do not support
# processes).
#
workers ENV.fetch('WEB_CONCURRENCY') { 2 }

# Use the `preload_app!` method when specifying a `workers` number.
# This directive tells Puma to first boot the application and load code
# before forking the application. This takes advantage of Copy On Write
# process behavior so workers use less memory. If you use this option
# you need to make sure to reconnect any threads in the `on_worker_boot`
# block.
#
# preload_app!

# Allow workers to reload bundler context when master process is issued
# a USR1 signal. This allows proper reloading of gems while the master
# is preserved across a phased-restart. (incompatible with preload_app)
# (off by default)
#
prune_bundler

# Bind unix domain socket
bind ENV.fetch('PUMA_SOCK') { 'unix://tmp/sockets/puma.sock' }

# Use `path` as the file to store the server info state. This is
# used by `pumactl` to query and control the server.
#
state_path ENV.fetch('SERVER_STATE') { 'tmp/pids/puma.state' }

# === Puma control rack application ===

# Start the puma control rack application on `url`. This application can
# be communicated with to control the main server. Additionally, you can
# provide an authentication token, so all requests to the control server
# will need to include that token as a query parameter. This allows for
# simple authentication.
#
# Check out https://github.com/puma/puma/blob/master/lib/puma/app/status.rb
# to see what the app has available.
#
activate_control_app ENV.fetch('PUMACTL_SOCK') { 'unix://tmp/sockets/pumactl.sock' }

# Store the pid of the server in the file at `path`.
#
pidfile 'tmp/pids/puma.pid'

# Change the default timeout of workers
#
worker_timeout 10

# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart

config/deploy.rb

after 'deploy:published', 'puma:phased-restart'

After cap production deploy

$ ps aux | grep puma
webapp   18304  0.0  0.8 368084  8324 ?        Sl   04:17   0:00 puma 4.0.0 (unix://tmp/sockets/puma.sock) [20190630041534]
webapp   30011  0.2 11.7 591616 118548 ?       Sl   04:54   0:01 puma: cluster worker 0: 18304 [20190630041534]
webapp   30082  0.2 11.8 591596 119296 ?       Sl   04:54   0:01 puma: cluster worker 1: 18304 [20190630041534]
webapp   30871  0.0  0.2 226484  2756 pts/0    S+   05:02   0:00 grep --color=auto puma

$ ls /srv/webapp/repository/current/tmp/sockets
pumactl.sock

puma.sock is missing, and Nginx can't establish a connection.

Fudoshiki (Contributor) commented

@nateberkopec this isn't related to systemd:

pumactl -S tmp/pids/puma.state -F config/puma.rb phased-restart

puma 4 removes puma.sock; puma 3.12 doesn't.

Fudoshiki (Contributor) commented

This seems to work: https://github.com/Fudoshiki/puma/commit/ae44a9333354ce45b473550dca492277f4dafb58


schneems commented Jul 2, 2019

seems to work Fudoshiki@ae44a93

Can you send us a PR? Is it possible to add a test for this behavior, since it was working and then we broke it? I can't guarantee we won't break it again.

nateberkopec (Member) commented

PR is here #1829


mcg commented Jul 11, 2019

I'm using puma 4.0.1, which should have this fix? The socket is still missing after restart.

nateberkopec (Member) commented

Well, that's not good. Moving discussion to #1842.

@puma puma locked as resolved and limited conversation to collaborators Jul 11, 2019
nateberkopec added a commit that referenced this issue Sep 2, 2019
This reverts commit 3b48d32.
nateberkopec added a commit that referenced this issue Sep 3, 2019
This reverts commit 3b48d32.