Still getting H13s during autoscaling #2200
How did you figure that out? |
It's included in the log message that heroku logs shows, and I know the rough range of request times for the paths that are H13ing. Here's a scrubbed example. There are some requests that do take a couple of seconds, but that's well below the 30-second timeout.
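For reference, an H13 line from the Heroku router generally has roughly this shape (an illustrative line with made-up values, not the scrubbed example above); the service= field is the request duration being referred to:

```
at=error code=H13 desc="Connection closed without response" method=GET path="/some/path" host=example-app.herokuapp.com dyno=web.2 connect=1ms service=243ms status=503 bytes=0 protocol=https
```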
|
Not sure if it is a reasonable thing to try, but it would be interesting to know if disabling queue_requests makes a difference: https://github.com/puma/puma/blob/v4.3.3/docs/architecture.md#disabling-queue_requests |
I wouldn't recommend turning off request queuing altogether. Heroku will "quarantine" (fitting word currently) any dyno it gets a rejected request from for 5 seconds. If all your dynos are rejecting requests because you need burst capacity that you haven't got, our router will constantly be quarantining all your dynos on and off, which would not be great. There is another setting you can try, though, that may make a difference: the ability to drain the queue when shutdown is initiated. Line 217 in 636732c
You can add drain_on_shutdown to your Puma config.
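A minimal sketch of what that looks like in a Puma config file; everything other than the drain_on_shutdown line is just a placeholder for whatever configuration already exists:

```ruby
# config/puma.rb
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))  # placeholder for existing settings
threads 5, 5                                      # placeholder for existing settings

# Keep processing requests that are already queued while shutdown proceeds.
drain_on_shutdown
```
|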
Thanks, I will try that out. How does it ever work without that? Seems like that should be the default. |
@schneems it seems like this may have helped reduce them, but we just got another one. |
Perhaps fixed by #2122. |
Looks promising, any idea when it will get merged? |
Internal ticket id for Heroku 841375 |
Just adding some more anecdotal evidence here. I've also been continuing to see H13 errors when autoscaling or restarting, and I tried adding the setting suggested above. I was pretty sure I wasn't seeing any H13 errors after #1802 was merged, but at some point they crept back in. I hadn't taken note of it until recently, which led me to this issue. Unfortunately I don't have enough history to know when I started seeing the H13s again. |
I revisited my example repo for my prior H13 work, https://github.com/schneems/puma_connection_closed_reproduction, and when I run it on Heroku I'm not able to reproduce H13 exceptions, meaning that if there's a regression, it's not a full regression of my prior work. Right now I've got a few different theories about this behavior. I think it is either race-condition or timing based somehow; otherwise we would be seeing it more frequently when scaling and deploying instead of intermittently.

Theory 1) This behavior causing the H13s has always existed in Puma, but it wasn't noticed before due to other problems causing even more H13s.

It's possible, but if both of you remember an H13-free time before, then something else would have had to change between right after 4.0.1 was released and now. It's possible that your request volume increased per dyno and that only after a certain threshold does this start to show up. I'm not sure how we test this theory, but it's not my main candidate. (It seems unlikely that you both used to see no H13s and now you see some.)

I did an in-depth review of #2122. It doesn't look like it's fixing a regression, but rather is new behavior (though I'm happy to be proven wrong; if anyone can investigate and confirm/deny, that would be amazing). If that's the case, then it's possible intermittent H13s could be caused by the behavior described in one of the scenarios in #2122 (review), where a request is accepted RIGHT before the server is shut down, then the reactor is turned off and the connection is closed. I think that would return an H13. Now I'm not totally sure what behavior we want from Scenario C. In an ideal world we could tell the Heroku router "hey, I can't serve this request right now, hold on to it and try again please", but I don't think there's an HTTP-compliant spec for communicating that upstream to a routing server. In that case there would still be a handful of dropped connections or 503 responses (depending on how we want to handle the scenario).

To test: If we can answer the questions and figure out what our desired behavior is on that PR, then we can work towards shipping it, and then y'all can try it out to see if it helps with your problem. Even if it doesn't, it looks like it's worth investing more in understanding what we want to happen.

Theory 2) This H13 is a regression, but not from my prior PR.

If we've got multiple people saying they had a time where they saw no H13s and now they see H13s, it seems likely that something got broken and we're just now noticing.

To test: Does anyone have an app that they would be willing to run an older version of Puma on for a bit? Version 4.0.1 seems like a likely candidate. It's not ideal, as there are some security patches since then, so I wouldn't run in that mode for long (a few hours or a day should show it). There's a little security in obscurity if you don't let anyone know what you're doing publicly until after you're back to the latest, most secure version of Puma. Maybe you've got a server that isn't as critical but still sees this behavior that you could try? (A minimal way to pin the older version for such a test is sketched after this comment.)

Theory 3) Not a regression and not #2122.

If the race condition is somewhere else and not in #2122, then I'm not sure how we could narrow things down.

To test: Eliminate theories 1 and 2.

Action items
|
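For anyone trying the Theory 2 test above, pinning an app to the older release for a short window could be as simple as the following Gemfile change; 4.0.1 is the version mentioned in the comment, and the caveat about missing security patches applies:

```ruby
# Gemfile: temporary pin for the Theory 2 experiment only.
# Revert to the latest Puma release as soon as the test window is over.
gem "puma", "4.0.1"
```

After changing the pin, run bundle update puma and deploy as usual.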
Thanks for explaining that, I didn't know about the quarantine. For anyone interested in more details, they are at https://devcenter.heroku.com/articles/http-routing#dyno-connection-behavior-on-the-common-runtime |
Since #2122 was merged in, would any of you who have reported this issue be willing to try out master and let us know if your issue is resolved or if we need to keep hunting? |
@schneems Just deployed, I'll let you know how it goes. As an aside, thanks for the theories you posted last month. I've had it on my list to revisit your post and put some thought into it, but it just hasn't happened yet. As another aside, it looks like Barnes is not compatible with Puma master. I had to remove Barnes in order to deploy, so there will be a couple of variables at play in any results I provide. |
Thanks for the report, I'll take a look at that. |
I ran on Puma master (3060a75) for several days, then ran on v4.0.1 for several days. Both versions showed the same behavior—a few H13 errors every day or two. I'm now on 5.0.0.beta1, and I expect I'll continue to see the same. Based on these results, I'm leaning toward this theory from @schneems:
I may have thought there was a time when I was seeing no H13 errors, but my request volume (and the amount of autoscaling I'm doing) is much higher now. |
Has anyone had much success in resolving this? |
I think what we need to do is:
Right now I'm thinking that we could add some kind of detailed logging during shutdown. But I don't know quite what information we need to prove/disprove the above theories. If this is the case:
At the low level what do we think might be happening? The error code itself can tell us something as there can be multiple errors due to socket closure:
Since we're seeing H13s and not H18s, we know that no data is being written back to the socket. Unfortunately this doesn't narrow things down for us. Here are the places where a request could be:
Right now we don't know WHERE the request is when the server shuts down and the H13 is logged. I think that's the biggest information gap. I have a feeling that there are cases that it's not possible to 100% handle. For example:
But if that's the case then there's nothing Puma could do, except perhaps try to hint to the client that they should retry. Maybe there's a good status code for this; 408 seems like it fits, but I don't like that we're throwing a 4xx for what is a server problem. I'm open to suggestions here. In #2122 they're using 503.
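A rough Rack-level sketch of the "ask the client to retry" idea, assuming a hypothetical shutting-down flag that gets flipped once shutdown starts (the middleware name and flag are illustrative, not existing Puma API):

```ruby
# Illustrative Rack middleware: once the process knows it is shutting down,
# answer new requests with 503 + Retry-After instead of letting the connection drop.
class RetryLaterWhileShuttingDown
  def initialize(app, shutting_down)
    @app = app
    @shutting_down = shutting_down # a callable, e.g. flipped to true when SIGTERM arrives
  end

  def call(env)
    if @shutting_down.call
      [503, { "Retry-After" => "1" }, ["Shutting down, please retry\n"]]
    else
      @app.call(env)
    end
  end
end
```

Whether such a response actually helps depends on whether the router in front will retry on 503, which is part of the open question above.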
Recap

I'm thinking we could add a flag for logging requests in 3 locations:
We could output these stats on some kind of timer that is triggered by shutdown: for example, every 1 second we output those results until the server completely shuts down. Alternatively we could output those stats when some trigger is hit, such as when server shutdown starts and then again when SIGKILL is received. I don't know how much time or wiggle room we have once a SIGKILL happens.
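A sketch of what that kind of shutdown logging could look like; the three counters stand in for the hypothetical "3 locations" above and are not existing Puma stats:

```ruby
# Illustrative only: log where requests are sitting, once per second, while the
# server drains. All three accessors here are hypothetical.
def log_shutdown_progress(server, interval: 1)
  Thread.new do
    until server.shutdown_complete?
      puts "shutdown: waiting_on_socket=#{server.waiting_on_socket_count} " \
           "in_reactor=#{server.reactor_count} " \
           "in_thread_pool=#{server.thread_pool_backlog}"
      sleep interval
    end
  end
end
```
|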
Another internal ticket id: https://heroku.support/914138 |
@schneems Have you seen any issues re: this on 5.0.3+? |
I’m watching my kids full time until January right now. I haven’t seen anything unusual on the Heroku support side though.
|
👋 I'm running into the same issue on Puma 5.2.1. Scale-down events trigger H13s. We have 4 to 6 dynos running, so only some requests are affected. I was suspecting that these were due to the short worker_shutdown_timeout in our config. Here's our Puma config:

require 'barnes'
threads_count = 1
workers 30
threads threads_count, threads_count
# based on defaults from
# https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts#puma
worker_timeout 15
worker_shutdown_timeout 8
preload_app!
rackup DefaultRackup
port ENV['PORT'] || 7000
environment ENV['RACK_ENV'] || 'development'
before_fork do
Barnes.start # send metrics to Heroku's statsd agent
Barnes.start(statsd: Statsd.new('127.0.0.1', 8125)) # send metrics to Datadog's statsd agent
end
on_worker_boot do
if Rails.configuration.respond_to?(:initialize_features_client)
Rails.configuration.initialize_features_client.call
Rails.logger.info('Connected to LaunchDarkly (Puma)')
end
end
plugin :tmp_restart
lowlevel_error_handler do |e|
Raven.capture_exception(e)
[500, {}, ["An error has occurred, and engineers have been informed. Please reload the page. If you continue to have problems, contact XXX\n"]]
end |
This is messy. Part of the problem is that frontends and Puma are not always processing data from the other immediately. There may be room for improvement during shutdown re the handling of clients that are sending multiple requests. You might try adding |
Closing as stale/not actionable with the information we have. |
I can confirm that we still see this on Heroku in production during autoscaling scenarios when the dyno formation is scaling down. We are on puma 6.3.1. |
We're also experiencing this in autoscaling scenarios when the formation is scaling down. As far as I know there's no way to force Heroku to extend the life of the dyno past the 30s kill signal to make sure existing connections are completed before the dyno instance is removed. I'm going to open a support request with Heroku to see what they say, as I think this problem may land more on their side, but the type of response could perhaps be improved as @MSP-Greg suggested. |
I read https://blog.heroku.com/puma-4-hammering-out-h13s-a-debugging-story again and this part stands out to me (emphasis added)
Considering that behaviour from Heroku / the router, I'm not surprised there are still H13 errors happening (despite the fix in #1808). Sounds like there can still be race conditions. Maybe someone should take a closer look at these other web servers that can handle this just fine? What is the difference between them and Puma? |
FWIW, I am experiencing the same thing with Heroku in a Python/Gunicorn app. The router continues to send requests to the Dyno despite having already sent a SIGTERM. |
We're having this issue and not sure it's even Puma related. I've tried walking through support with it multiple times and have gotten nowhere. It looks like the router continues to send requests to the process that received a SIGTERM in downscale events. |
Chiming in here. We're also seeing this behavior on autoscaling events. Should this issue be reopened? Not sure why it was closed in the first place. |
Here's the reason why the issue was closed: #2200 (comment). Can you reproduce the issue without Heroku? Can you suggest what to change in Puma to address this? It is not clear whether this is an issue that can be addressed within Puma, which is why it was closed. |
Describe the bug
Hi, despite the fix described in https://blog.heroku.com/puma-4-hammering-out-h13s-a-debugging-story our high load app is still seeing H13s during autoscaling down. These requests are only running for a couple hundred ms when they H13.
I've reported this to Heroku, but so far they've just explained to me how H12s happen and suggested I install rack-timeout. While this app does have some timeouts, it's not the timeout requests that are H13ing.
/cc @schneems
Puma config:
RAILS_MAX_THREADS is 4
RAILS_MIN_THREADS is 1
WEB_CONCURRENCY is 8
We're running on Performance-L dynos. (A sketch of how these variables typically map into config/puma.rb follows.)
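For context, a typical Rails-generated config/puma.rb wires those environment variables up roughly like this; this is a generic sketch, not the reporter's actual file:

```ruby
# config/puma.rb (typical Rails default, shown for context)
max_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
min_threads_count = ENV.fetch("RAILS_MIN_THREADS") { max_threads_count }
threads min_threads_count, max_threads_count

workers ENV.fetch("WEB_CONCURRENCY") { 2 }
preload_app!

port ENV.fetch("PORT") { 3000 }
environment ENV.fetch("RACK_ENV") { "development" }
```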
To Reproduce
This would be a challenge, but I'd be happy to work with someone to narrow the issue down.
Expected behavior
No H13s during shutdown/scale down