Terminating timed out worker on phased-restarts #2527
Because you're seeing those logs around phased restarts, my first guess is that workers in your application sometimes take fewer than 60 seconds to boot and sometimes more. The reason a phased restart can take a long time is that the cluster has to wait for each worker to boot one-by-one, each within that 60-second limit. If a worker fails to boot in time, it is killed and the cluster tries to launch a new one, with a fresh 60-second timeout. In order to debug this, I'd start with increasing your worker boot timeout.
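For reference, a sketch of raising that limit using Puma's `worker_boot_timeout` config DSL option (the value here is arbitrary, pick one that fits your app):

```ruby
# config/puma.rb -- give slow-booting workers more headroom during a
# phased restart. The default is 60 seconds; 120 is an arbitrary example.
worker_boot_timeout 120
```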
Thanks for looking into this so quickly @cjlarose! We're using bootsnap, and normally the app loads within 5-6 seconds (e.g. rails console), so I find it hard to imagine some workers taking longer than 60 seconds to boot... Another thing is that if I restart puma and then do some phased-restarts, it works fine (if I look at the log when I issue a phased restart, it cycles through each worker, and each one takes about 3-4s), but the problem happens when we deploy (where a new folder is created, the symlink is switched, etc., even when there are tiny code changes). I'll try to run some simulations with various settings and see. Any tips for looking at the boot time of our app specifically, i.e. how to benchmark or profile what might be slow to boot?
Hmm... it seems harder to reproduce consistently than I thought. I tried a few deploys and the phased restarts are doing fine... I'll have to keep experimenting and see if I can figure out what triggers this. I'm pretty sure our workers don't normally take anywhere near 60 seconds to start, though.
One possible explanation for why the Rails console loads quickly, but new worker boots after a phased restart take a while, is that the bootsnap cache is written into a subdirectory of the release directory (a new release would start with a clean bootsnap cache). So one thing to benchmark would be how long it takes to start up a new production Rails console (eager loading on, etc.) with a clean bootsnap cache.
Does this mean that the more workers an app has, the longer it takes to see the deployed changes with phased-restarts? Is there a way to execute this in parallel rather than sequentially? (I assume the 60-second limit applies to each worker individually, right?)
The bootsnap cache is shared. I also tested removing the bootsnap cache and restarting, and it's still pretty fast... Hope it's not a ghost chase, but I checked our logs and we don't see this timeout very frequently, which would make reproducing it tricky. Luckily(?) it only seems to happen on our staging and not live environment though. Maybe it's specific to staging, where we might have multiple deploys in quick succession, load spikes etc, so things can go funny.
Phased restarts take at least as long as (average worker boot time * number of workers). It's possible to "see the deployed changes" before the phased restart is complete, though, because new workers will serve requests. So during a phased restart, you can have a request served from the "old" version and your next request served by the "new" version and the next served by the "old" version again, depending on which worker picks up your request.
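As a back-of-the-envelope sketch of that lower bound (the worker count and boot time below are made-up example numbers, not measurements from this issue):

```ruby
# Rough lower bound on phased-restart duration: workers are replaced one
# at a time, so total time is at least the sum of individual boot times.
def phased_restart_lower_bound(worker_count, avg_boot_seconds)
  worker_count * avg_boot_seconds
end

phased_restart_lower_bound(16, 4) # => 64 (seconds, at minimum)
```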
Your understanding of the 60-second timeout is correct. The reason puma restarts workers one-by-one is availability: during a phased restart, only one worker is down at a time, so the cluster keeps serving requests at close to full capacity. In theory, we could boot workers in small batches in parallel as a compromise between availability and fast restarts (this has been attempted in #1861). But that idea was shot down at the time, because the core of the issue is that phased restarts take a long time for applications that take a long time to boot.
One thing to try, just to rule out that the problem is slow worker booting, is to take a look at the logs whenever you see one of those `Terminating timed out worker` messages. During a phased restart, your logs will show pairs of messages like "TERM sent to worker x" followed by "Worker y booted".
I don't have timestamps attached to these logs, but if you did, you could measure the time between the messages that say "TERM sent to x" and "Worker y booted" to get the approximate amount of time it takes workers to boot. If that's ever close to the 60-second threshold, that might be a clue as to what's happening.
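A minimal sketch of that measurement, assuming your log shipper prepends ISO-8601 timestamps (the exact message formats below are approximations, not Puma's verbatim output):

```ruby
require "time"

# Pair each "TERM sent" line with the next "booted" line and report the
# elapsed seconds -- roughly how long each replacement worker took to boot.
def worker_boot_durations(lines)
  durations = []
  term_time = nil
  lines.each do |line|
    time_str, message = line.split(" ", 2)
    t = Time.parse(time_str)
    if message =~ /TERM sent/
      term_time = t
    elsif message =~ /booted/ && term_time
      durations << (t - term_time)
      term_time = nil
    end
  end
  durations
end

sample = [
  "2021-01-30T10:00:00Z - TERM sent to worker 0",
  "2021-01-30T10:00:04Z - Worker 0 (pid: 123) booted, phase: 1",
]
worker_boot_durations(sample) # => [4.0]
```

Any duration approaching 60 would point at the boot timeout being the culprit.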
@cjlarose thank you soooo much. It makes a lot of sense! And yeah, I'll check the timestamps on our centralized logs, see if I can spot anything, and keep trying to reproduce.
Awesome! I'm hoping of course it's not a bug in puma, but we'll find out with some more debugging hopefully. One thing that might come out of this issue, too, is maybe just better log messages around worker booting, especially when those `Terminating timed out worker` timeouts happen.
Hey @cjlarose! Looking at it again this morning, it looks very likely that this is simply caused by the system being busy. We run some functional tests on deploy in our staging environment, and those kick off as soon as the deploy finishes, which is shortly after issuing a phased-restart. Given that the phased restart can take a little while to complete, cycling through each worker, we start getting hit pretty heavily by automated requests. This in turn makes the workers busy and loads the system, which slows things down and can therefore cause timeouts. Is there a way to check when the phased restart completes? Then we could trigger the automated tests only once that happens, which should reduce the chances of timeouts, I imagine.
Yeah, great idea. Maybe adding timestamps to the logs could be useful as well? I'll close this for now, since it doesn't look like a puma bug. Sorry for the trouble and thanks again so much for your patience. I always learn something new from you. |
Such an option is proposed in #2213 but has not yet been implemented. In the meantime, it's possible to do this yourself with a script that queries the puma control/status server. There, you can extract the `phase` of each worker and wait until they have all reached the cluster's new phase.
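A sketch of that check, operating on the JSON that `pumactl stats` prints; the key names (`phase`, `worker_status`, `booted`) are based on Puma 5's cluster stats output, so treat them as an assumption and verify against your version:

```ruby
require "json"

# Decide whether a phased restart has finished: every worker should be
# booted and on the cluster's current phase. Key names assume Puma 5's
# cluster stats JSON -- verify against your Puma version.
def phased_restart_complete?(stats_json)
  stats = JSON.parse(stats_json)
  phase = stats["phase"]
  stats["worker_status"].all? { |w| w["booted"] && w["phase"] == phase }
end

mid_restart = '{"phase":2,"worker_status":[' \
              '{"pid":11,"phase":2,"booted":true},' \
              '{"pid":12,"phase":1,"booted":true}]}'
phased_restart_complete?(mid_restart) # => false (worker 12 is still on phase 1)
```

A deploy script could poll this in a loop and kick off the functional tests only once it returns true.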
I think we leave timestamps up to the user's logging implementation (e.g. syslog). But there is an open issue (#2511) that would let you specify your own logger instance, which would make this possible.
Always happy to help!
I opened #2528 to help improve log messaging around worker boot timeouts. |
Thanks again @cjlarose. Super-useful stuff 👍
I did a bit of digging into using the control/status server.
Yeah, we use datadog for logging, so I can pick up the timestamp there. Adding it by default or using our existing logger would be nice too. In general though, timestamps are kinda essential in any logging, so adding them by default makes sense in my opinion. I can't see any big downside... But I can see it not being a super high priority either. |
Describe the bug
Related to #2470?
We noticed recently that it takes longer to see changes after a deploy, much slower than it was with puma 3.x (we never used 4 due to issues with `prune_bundler` only fixed in 5.x). Looking at the puma.log file, I can see `Terminating timed out worker` notices and a rather long wait... so eventually everything is reloaded, but it definitely takes some time. If I do a full restart, the problem seems to go away until we deploy again (and change the code folder), and then it takes a long time to reload.
This is reproducible on our staging environment when it's not busy at all, so it's unlikely the puma processes are busy and slow to receive the signal... it really feels like a long delay for no apparent reason.
Puma config:
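A rough sketch of a phased-restart-compatible Puma config; all values and paths here are illustrative, not the reporter's actual settings:

```ruby
# Illustrative values only -- not the reporter's actual config.
# Phased restarts require that preload_app! is NOT used.
workers 4
threads 5, 5
bind "unix:///var/local/app/tmp/puma.sock"       # socket path assumed
prune_bundler                                    # per the report; fully fixed in puma 5.x
activate_control_app "unix:///var/local/app/tmp/pumactl.sock"  # assumed; enables pumactl
```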
To Reproduce
This seems to happen when doing phased-restarts using versioned deploy folders with symlinks.
We use a folder structure similar to this:
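Something like the following, with directory and version names assumed for illustration (only the `/var/local/app/app` symlink target is taken from the command below):

```
/var/local/app/
├── releases/
│   ├── v41/
│   └── v42/              # new versioned folder created on each deploy
└── app -> releases/v42   # symlink switched once the new release is ready
```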
When we deploy, we increment the version, run bundle etc, then switch the symlink and do a phased-restart of puma using
/usr/bin/bundle exec pumactl -F /var/local/app/app/config/puma.rb phased-restart
We then see `Terminating timed out worker` in the puma log, and the phased restart takes a while, even when no puma processes are busy.
Expected behavior
The phased-restart should be reasonably fast without timeouts, so the code changes are reflected quickly