Requests fail during phased restarts, hot restarts, and shutdown #2337
Thanks for testing Puma.
Phased restarts are not meant to be performed repeatedly in a short timespan. There are one or more time settings that affect how long a phased restart takes. Do you get unsuccessful responses when doing a single phased restart?
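(For context, the time settings mentioned above live in puma's config DSL. A minimal sketch with illustrative values, assuming a cluster-mode setup:)

```ruby
# config/puma.rb (illustrative values, not recommendations)
workers 2

# Seconds a worker gets to finish in-flight requests before it is
# force-killed during a shutdown or phased restart.
worker_shutdown_timeout 30

# Seconds the master waits for a replacement worker to boot; this
# bounds how quickly a phased restart can advance from worker to worker.
worker_boot_timeout 60
```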
For the repo that reproduces the problem, I just did a bunch of phased restarts repeatedly to increase the likelihood of failure; I understand these are not normal operating conditions. But yes, I originally became aware of this problem because it happens in two different applications of mine where we perform phased restarts once every hour or so. Specifically, we encounter the errors listed in the issue description.
Ok, that's reasonable. Using MRI or JRuby?
MRI 2.6.1 and 2.6.5
Scenario 3 definitely seems like something we could easily prevent.
A colleague and I are looking into fixing the problem.
Just wanted to post an update on this: my company is spending resources on this since it's important to us that this is fixed. We have tried several approaches that have failed in various ways, but we think we're close to a solution that we're happy with. We'll open a PR soon with a new integration test and a fix for all 3 of the problems in the issue description. I should also mention that the root issue is a little broader than the title of this issue implies: we've demonstrated that requests can fail during normal "hot" restarts on a non-cluster puma deployment (single server process), as long as you're using `queue_requests` (the default).
Thanks so much @cjlarose! I appreciate Appfolio's continued investment in the ecosystem!
I have been investigating this problem for the last few days. In our case it is the #2343 issue, but I will share my findings here since they may help your work. I am seeing issues very similar to the ones mentioned in the first comment; from the Nginx side, they surface as the same kinds of connection errors.
I can easily reproduce these errors with 1-second requests overflowing multiple puma pods in non-cluster mode while one of the pods goes through a normal k8s restart. I also experimented with the `drain_on_shutdown` option.
@volodymyr-mykhailyk Thanks for this info! This will help in our efforts to fix the problem. For some of the errors you describe, I can say confidently that we're able to reproduce them.
I haven't experimented much with this setting. My understanding is that it's off by default but, if enabled, it makes puma process all connections on its socket before shutting down. If you're running puma in k8s, it seems like the right thing to do in order to make sure that puma processes connections that were sent to the pod before the pod dies. My experiments have mostly been of two types.
In both cases, my testing so far has been without `drain_on_shutdown` enabled.
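(For reference, enabling the option described above is a one-line config change. A sketch, assuming a puma version that supports the `drain_on_shutdown` DSL method:)

```ruby
# config/puma.rb
# When enabled, puma keeps accepting and processing connections that
# are already queued on the listening socket before it exits, instead
# of closing the listener immediately on shutdown.
drain_on_shutdown true
```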
Just to provide more insight into what we've discovered so far: one source of a lot of problems is related to how the Reactor is shut down (lines 320 to 324 in 7d00e1d).
The intent here is that after the Reactor shuts down, any connections handled by the ThreadPool no longer have the option of sending the request back to the Reactor: the connection must either be closed (assuming we've already written a response to the socket), or the ThreadPool must wait until the client has finished writing its request, then write a response, then finally close the connection. This works just fine most of the time. The problem is that there are races in the implementation of the ThreadPool where it's possible for a thread to see stale shutdown state (lines 401 to 404 in 7d00e1d).
This can cause connection reset errors or empty replies from the server. We have a fix for this (just wrapping the critical sections in a Mutex acquisition), but we want to make sure we don't adversely affect puma's concurrency by introducing lock contention. Even after fixing that issue, though, we still see some connection reset errors for requests that enter the ThreadPool after the shutdown sequence has started: the ThreadPool writes the final response to the client socket, then closes the socket, but the client still sees a connection reset. Lastly, I should mention that while the `Unable to add work while shutting down` error at least produces a response (via the `lowlevel_error_handler`), the other failure modes drop the connection entirely.
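(To make the class of race concrete, here is an illustrative sketch, not puma's actual ThreadPool code: a check-then-act on a shutdown flag can interleave with the shutdown sequence unless both sides synchronize on the same lock.)

```ruby
# Illustrative sketch of a check-then-act race; not puma's ThreadPool.
class Pool
  def initialize
    @shutdown = false
    @mutex = Mutex.new
    @todo = []
  end

  # Racy: another thread can flip @shutdown between the check and the
  # push, so work is enqueued into a pool that will never run it.
  def add_work_racy(job)
    raise "Unable to add work while shutting down" if @shutdown
    @todo << job
  end

  # Fixed: the check and the enqueue are atomic with respect to
  # shutdown!, which acquires the same mutex.
  def add_work(job)
    @mutex.synchronize do
      raise "Unable to add work while shutting down" if @shutdown
      @todo << job
    end
  end

  def shutdown!
    @mutex.synchronize { @shutdown = true }
  end
end
```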
@cjlarose The setup is very straightforward: nginx-ingress in front of puma running as a single-mode server with 4+ pods, using rolling restarts/deploys. Looking forward to your updates and will be happy to test them out.
@cjlarose I'm sorry I didn't come upon this discussion until now; it might have saved a bit of time.
@wjordan Thanks for posting these notes and thanks for your work on Puma!
This issue has since been resolved by #2377.
This is good to know! I haven't done a lot of testing myself with the changes from #2377.
Just tested puma 5.0 in our setup. I can confirm that the restart errors are gone. Big thanks to all Puma contributors for your hard work.
Worth mentioning one interesting finding from my testing. The graphs showed response times when there are more requests than puma can handle (4 pods with 4 threads each, requests that take 1 second, 18 requests per second); the spikes at the beginning are the effect of the rolling deploy. I think disabling queueing is still the way to go if there is Nginx in front and you have longer requests: it produces more predictable results, prevents overloading any particular instance (by taking on more requests than puma can handle within the graceful shutdown period), and balances the load evenly across puma instances.
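(For anyone who wants to replicate this setup, disabling puma's request queueing is a single config line. A sketch:)

```ruby
# config/puma.rb
# Disables puma's reactor buffering: each accepted connection is handed
# straight to a worker thread, which reads the request itself. With no
# reactor, a busy instance accumulates less hidden work that could be
# dropped at shutdown, and the load balancer sees backpressure sooner.
queue_requests false
```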
Just a quick update: #2279 seems to have improved things generally with regard to connection handling during phased restarts, hot restarts, and shutdowns. I opened #2417 to add an integration test that'll hopefully prevent regressions related to these kinds of failures. I'm not ready to declare this issue closed quite yet (I still want to do a little more testing), but things are looking really good.
#2423 is merged. It adds integration tests to both single-mode and cluster-mode puma to help prevent the kinds of concurrency errors that were causing connections to be dropped during hot restarts, phased restarts, and shutdowns. From my testing, I think #2279 fixed some of the last problems. That fix isn't in a release of puma yet, but I'd expect it to be soon. I think we can close this issue for now. If another, more specific issue crops up, we can open a new issue.
Describe the bug
During a phased restart, some clients do not get successful responses from the server.
Requests fail in a variety of ways. The ones I've observed and that I'm able to reproduce reliably are these:
- `curl: (52) Empty reply from server`
- `curl: (56) Recv failure: Connection reset by peer`
- `Unable to add work while shutting down` is raised, invoking the `lowlevel_error_handler` (by default, this returns a response with response code 500)

Puma config:
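(A minimal cluster-mode sketch that permits phased restarts; illustrative only, not necessarily the exact config from the repro repo:)

```ruby
# config/puma.rb (minimal sketch)
workers 2
threads 5, 5
bind "tcp://0.0.0.0:9292"

# Note: preload_app! must NOT be used here; phased restarts require
# each worker to load the application code independently.
```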
To Reproduce
I've created a repo that reliably reproduces the errors: https://github.com/cjlarose/puma-phased-restart-errors
Expected behavior
All requests are passed to the application and the application's responses are all delivered to the requesting clients. The restart documentation for puma suggests that phased restarts are meant to have zero downtime.