
Terminating timed out worker on phased-restarts #2527

Closed
gingerlime opened this issue Jan 14, 2021 · 14 comments

Comments

@gingerlime
Contributor

gingerlime commented Jan 14, 2021

Describe the bug

related to #2470 ?

We noticed recently that it takes longer to see changes after a deploy, much slower than it was with puma 3.x (we never used 4.x due to issues with prune_bundler that were only fixed in 5.x).

Looking at the puma.log file, I can see Terminating timed out worker notices and a rather long wait ... so eventually everything is reloaded, but it definitely takes some time.

If I do a full restart, the problem seems to go away, until we deploy again (and change the code folder), and then it takes a long time to reload.

[24875] - Starting phased worker restart, phase: 3
[24875] + Changing to /var/local/app/app
[24875] - Stopping 25921 for phased upgrade...
[24875] - TERM sent to 25921...
[2006] + Gemfile in context: /var/local/app/app.d/1790/Gemfile
[24875] ! Terminating timed out worker: 2006
...

This is reproducible on our staging environment, when it's not busy at all, so it's not likely that the puma processes are really busy and taking time to receive the signal ... it really feels like a long delay for no apparent reason.

Puma config:

#!/usr/bin/env puma

# The directory to operate out of.
#
# The default is the current directory.
#
directory "/var/local/app/app"

# Set the environment in which the rack's app will run. The value must be a string.
#
# The default is "development".
#
environment ENV["RAILS_ENV"] || "development"

# Store the pid of the server in the file at "path".
#
pidfile "/var/run/puma/puma.pid"

# Use "path" as the file to store the server info state. This is
# used by "pumactl" to query and control the server.
#
state_path "/var/run/puma/puma.state"

# Redirect STDOUT and STDERR to files specified. The 3rd parameter
# ("append") specifies whether the output is appended, the default is
# "false".
#
stdout_redirect "/var/log/app/puma.log", "/var/log/app/puma.log", true

# Configure "min" to be the minimum number of threads to use to answer
# requests and "max" the maximum.
#
# The default is "0, 16".
#
# threads 0, 16
threads 8, 8


# Bind the server to "url". "tcp://", "unix://" and "ssl://" are the only
# accepted protocols.
#
# The default is "tcp://0.0.0.0:9292".
#
bind 'unix:///var/run/puma/puma.sock'


# === Cluster mode ===

# How many worker processes to run.  Typically this is set
# to the number of available cores.
#
# The default is "0".
#
workers ENV["PUMA_WORKERS"] || 16

# Allow workers to reload bundler context when master process is issued
# a USR1 signal. This allows proper reloading of gems while the master
# is preserved across a phased-restart. (incompatible with preload_app)
# (off by default)

# see https://github.com/puma/puma/blob/master/docs/deployment.md#restarting
prune_bundler

To Reproduce

This seems to happen when doing phased-restarts using versioned deploy folders with symlinks.

We use a folder structure similar to this:

/app/app.d/{version (int)}
/app/app -> /app/app.d/{version} # symlink
/app/prev -> /app/app.d/{version-1} # symlink

When we deploy, we increment the version, run bundle, etc., then switch the symlink and do a phased-restart of puma using /usr/bin/bundle exec pumactl -F /var/local/app/app/config/puma.rb phased-restart.

We then see Terminating timed out worker in the puma log, and the phased restart takes a while, even when no puma processes are busy.

Expected behavior
The phased-restart should be reasonably fast and without timeouts, so that code changes are reflected quickly.

Desktop (please complete the following information):

  • OS: Linux (Ubuntu)
  • Puma Version 5.1.1
@cjlarose
Member

If you're seeing Terminating timed out worker appear in the logs, it indicates that either an already-booted worker failed to report itself to the master process within the last worker_timeout seconds (default 60), or that a not-yet-fully-booted worker failed to check in within the last worker_boot_timeout seconds (default 60).

Because you're seeing those logs around phased restarts, my first guess is that workers in your application sometimes take less than 60 seconds to boot and sometimes more. The reason a phased restart would take a long time is that the cluster has to wait for each worker to boot one-by-one within that 60-second time limit. If a worker fails to boot, it is killed and the cluster tries to launch a new one, with a new 60-second timeout.

In order to debug this, I'd start with increasing your worker_boot_timeout to 5 minutes or so and see if that helps. If it does, that potentially indicates that your application is just taking too long to start up, and you should investigate that in order to get faster phased restarts. If increasing worker_boot_timeout doesn't help (and the application doesn't actually take 60 seconds to boot), then it might be a puma bug.
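
For reference, that is a one-line change in the config above (a sketch only; the 300-second value is an arbitrary, generous number for debugging rather than a recommendation):

# Give workers more time to check in after boot before the master
# terminates them. 300s is only a debugging value.
worker_boot_timeout 300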

@gingerlime
Contributor Author

gingerlime commented Jan 14, 2021

Thanks for looking into this so quickly @cjlarose !

We're using bootsnap, and normally the app loads within 5-6 seconds (e.g. rails console), so I find it hard to imagine some workers take longer than 60 seconds to boot... Another thing is that if I restart puma and then do some phased-restarts, it works fine (if I look at the log when I issue a phased restart, it cycles through each worker, and each one takes about 3-4s), but the problem happens when we deploy (where a new folder is created, the symlink switched, etc., even when there are tiny code changes).

I'll try to run some simulations with various settings and see.

Any tips looking at the boot time of our app specifically? i.e. how to benchmark or profile what might be taking long to boot?

@gingerlime
Contributor Author

gingerlime commented Jan 14, 2021

hmm... it seems harder to reproduce consistently than I thought. I tried a few deploys and the phased restarts are doing fine... I'll have to try and see if I can figure out what triggers this.

I'm pretty sure our workers don't normally take anywhere near 60 seconds to start though.

@cjlarose
Member

cjlarose commented Jan 14, 2021

We're using bootsnap, and normally the app loads within 5-6 seconds (e.g. rails console), so I find it hard to imagine some workers take longer than 60 seconds to boot... Another thing is that if I restart puma and then do some phased-restarts, it works fine (if I look at the log when I issue a phased restart, it cycles through each worker, and each one takes about 3-4s), but the problem happens when we deploy (where a new folder is created, the symlink switched, etc., even when there are tiny code changes).

I'll try to run some simulations with various settings and see.

Any tips looking at the boot time of our app specifically? i.e. how to benchmark or profile what might be taking long to boot?

One possible explanation as to why the Rails console would load quickly, but new worker boots after a phased restart take a while is that the bootsnap cache is written into a subdirectory of the release directory (a new release would start with a clean bootsnap cache). So one thing to try to benchmark would be to see how long it takes to start up a new production Rails console (eager loading on, etc), with a clean bootsnap cache.
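
A rough way to time such a cold boot (a sketch, not from this thread: it assumes the script sits in the app root, the bootsnap cache lives under tmp/cache, and it is run with RAILS_ENV=production after clearing that cache):

# benchmark_boot.rb -- hypothetical helper script
# Clear the bootsnap cache first (e.g. tmp/cache/bootsnap*) to simulate
# the cold cache a fresh release would see, then run:
#   RAILS_ENV=production ruby benchmark_boot.rb
require "benchmark"

elapsed = Benchmark.realtime do
  # boots the Rails app, roughly the same work a new puma worker does
  require_relative "config/environment"
end

puts format("Booted in %.1fs", elapsed)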

@gingerlime
Contributor Author

The reason a phased restart would take a long time is that the cluster has to wait for each worker to boot one-by-one within that 60-second time limit.

Does this mean that the more workers an app has, the longer it takes to see the deployed changes with phased-restarts? Is there a way to execute this in parallel rather than sequentially? (I assume the 60-second limit applies to each worker individually, right?)

@gingerlime
Contributor Author

One possible explanation as to why the Rails console would load quickly, but new worker boots after a phased restart take a while is that the bootsnap cache is written into a subdirectory of the release directory (a new release would start with a clean bootsnap cache). So one thing to try to benchmark would be to see how long it takes to start up a new production Rails console (eager loading on, etc), with a clean bootsnap cache.

The bootsnap cache is shared. I also tested removing the bootsnap cache and restarting, and it's still pretty fast...

Hope it's not a wild-goose chase, but I checked our logs and we don't see this timeout very frequently, which will make reproducing it tricky. Luckily(?) it only seems to happen on our staging and not our live environment, though. Maybe it's specific to staging, where we might have multiple deploys in quick succession, load spikes, etc., so things can go funny.

@cjlarose
Member

Does this mean that the more workers an app has, the longer it takes to see the deployed changes with phased-restarts?

Phased restarts take at least as long as (average worker boot time * number of workers). It's possible to "see the deployed changes" before the phased restart is complete, though, because new workers will serve requests. So during a phased restart, you can have a request served from the "old" version and your next request served by the "new" version and the next served by the "old" version again, depending on which worker picks up your request.
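(With the 16 workers configured above and the 3-4 second boots mentioned earlier in this thread, that lower bound works out to roughly 48-64 seconds for the whole phased restart, even when nothing goes wrong.)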

Is there a way to execute this in parallel rather than sequentially? (I assume the 60-second limit applies to each worker individually, right?)

Your understanding of the 60-second timeout is correct. The reason why puma restarts workers one-by-one is for availability: we make the promise that if workers is set to x, then at no point will you have fewer than x-1 workers running. If we reboot workers in parallel (essentially just a hot restart instead of a phased restart), it's possible that the cluster will be temporarily unable to service the volume of requests it needs to.

In theory, we could boot workers in small batches in parallel as a compromise between availability and fast restarts (this was attempted in #1861). But that idea was shot down at the time because the core of the issue is that phased restarts take a long time for applications that take a long time to boot (assuming preload_app is off). That issue was meant to be solved by the fork_worker option introduced in 5.x (still experimental).
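
For reference, enabling that experimental option is itself a one-line config change (a sketch; whether it helps for a given app is a separate question):

# Experimental as of Puma 5.x: fork additional workers from worker 0
# instead of the master, so new workers start from an already-booted process.
fork_worker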

@cjlarose
Member

cjlarose commented Jan 14, 2021

One thing to try, just to rule out that the problem is slow worker booting, is to take a look at the logs whenever you see some of those Terminating timed out worker messages.

Your logs might look like this during a phased restart:

[19844] - Starting phased worker restart, phase: 2
[19844] + Changing to /Users/chris.larose/.puma-dev/mycase
[19844] - Stopping 20648 for phased upgrade...
[19844] - TERM sent to 20648...
[19844] - Worker 0 (pid: 20934) booted, phase: 2
[19844] - Stopping 20705 for phased upgrade...
[19844] - TERM sent to 20705...
[19844] - Worker 1 (pid: 20966) booted, phase: 2
[19844] - Stopping 20814 for phased upgrade...
[19844] - TERM sent to 20814...
[19844] - Worker 2 (pid: 20996) booted, phase: 2
[19844] - Stopping 20846 for phased upgrade...
[19844] - TERM sent to 20846...
[19844] - Worker 3 (pid: 21029) booted, phase: 2
[19844] - Stopping 20877 for phased upgrade...
[19844] - TERM sent to 20877...
[19844] - Worker 4 (pid: 21059) booted, phase: 2

I don't have timestamps attached to these logs, but if you did, you could measure the time between the messages that say "TERM sent to x" and "Worker y booted" to give you the approximate amount of time it takes workers to boot.

If that ever gets close to the 60-second threshold, that might be a clue as to what's happening.
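
One way to get those timestamps (a sketch, assuming your Puma version supports the log_formatter config option) is to prepend the time to every log line in the config:

# In config/puma.rb -- prepend a UTC timestamp to every Puma log line.
# Assumes the log_formatter DSL option is available in this Puma version.
require "time"

log_formatter do |str|
  "[#{Time.now.utc.iso8601}] #{str}"
end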

@gingerlime
Contributor Author

@cjlarose thank you soooo much. It makes a lot of sense!

And yeah I’ll check the timestamps on our centralized logs and see if I can spot anything and keep trying to reproduce.

@cjlarose
Member

Awesome! I'm of course hoping it's not a bug in puma, but hopefully we'll find out with some more debugging.

One thing that might come out of this issue, too, is maybe just better log messages around worker booting, especially when those Terminating timed out worker messages appear.

@gingerlime
Contributor Author

Hey @cjlarose !

Looking at it again this morning, it looks very likely that this happens simply because the system is busy. We run some functional tests on deploy in our staging environment, and those kick off as soon as the deploy finishes, which is shortly after issuing a phased-restart. Given that the phased restart might take a little while to complete, cycling through each worker, we start getting hit pretty heavily by automated requests. This in turn keeps the workers busy and loads the system, which slows things down and can therefore cause timeouts.

Is there a way to check when the phased restart completes? Then we could trigger the automated tests only once that happens, which should reduce the chances of timeouts, I imagine.

One thing that might come out of this issue, too, is maybe just better log messages around worker booting, especially when those Terminating timed out worker messages appear.

Yeah, great idea. Maybe adding timestamps to the logs could be useful as well?

I'll close this for now, since it doesn't look like a puma bug. Sorry for the trouble and thanks again so much for your patience. I always learn something new from you.

@cjlarose
Member

Is there a way to check when the phased restart completes? Then we could trigger the automated tests only once that happens, which should reduce the chances of timeouts, I imagine.

Such an option is proposed in #2213 but has not yet been implemented. In the meantime, it's possible to do this yourself with a script that queries the puma control/status server. From its stats output you can extract the phase that each worker is in. Once the phase has been incremented for all workers, you know that the phased restart is complete.
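
A minimal sketch of such a script (not an official Puma tool; it assumes the control app is enabled so that pumactl can reach the server through the state file configured above, and that the JSON keys match Puma 5.x cluster stats output):

# wait_for_phased_restart.rb -- hypothetical deploy helper
require "json"

STATE_FILE = "/var/run/puma/puma.state" # matches state_path in the config above

def cluster_stats
  raw  = `bundle exec pumactl -S #{STATE_FILE} stats`
  json = raw[/\{.*\}/m] or abort "unexpected pumactl output: #{raw}"
  JSON.parse(json)
end

loop do
  stats  = cluster_stats
  target = stats["phase"]
  # each worker reports its own phase; once every worker is booted and on
  # the new phase, the phased restart is complete
  break if stats["worker_status"].all? { |w| w["booted"] && w["phase"] == target }
  sleep 1
end

puts "phased restart complete"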

One thing that might come out of this issue, too, is maybe just better log messages around worker booting, especially when those Terminating timed out worker messages appear.

Yeah, great idea. Maybe adding timestamps to the logs could be useful as well?

I think we leave timestamps up to the user's logging implementation (e.g. syslog). But there is an open issue, #2511, that would let you specify your own logger instance, which would make this possible.

I'll close this for now, since it doesn't look like a puma bug. Sorry for the trouble and thanks again so much for your patience. I always learn something new from you.

Always happy to help!

@cjlarose
Member

I opened #2528 to help improve log messaging around worker boot timeouts.

@gingerlime
Contributor Author

Thanks again @cjlarose. Super-useful stuff 👍

Such an option is proposed in #2213 but has not yet been implemented. In the meantime, it's possible to do this yourself with a script that queries the puma control/status server. From its stats output you can extract the phase that each worker is in. Once the phase has been incremented for all workers, you know that the phased restart is complete.

I did a bit of digging into using the control/status server via pumactl stats. Correct me if I'm wrong, but I cannot set up the control server via the config, only with the command line, right? And this means running the control server, which we currently have no need for and which opens up potential ports/attack vectors... Otherwise stats are not available via the standard command-line PID, nor from ruby code outside the main thread (as far as I gathered from the new docs). So a huge +1 for implementing #2213 from my side.

I think we leave timestamps up to the user's logging implementation (e.g. syslog). But there is an open issue, #2511, that would let you specify your own logger instance, which would make this possible.

Yeah, we use Datadog for logging, so I can pick up the timestamp there. Adding it by default or using our existing logger would be nice too. In general though, timestamps are kinda essential in any logging, so adding them by default makes sense in my opinion. I can't see any big downside... but I can also see it not being a super high priority.
