
Still getting H13s during autoscaling #2200

Closed
aaronjensen opened this issue Mar 24, 2020 · 32 comments
@aaronjensen

aaronjensen commented Mar 24, 2020

Describe the bug
Hi, despite the fix described in https://blog.heroku.com/puma-4-hammering-out-h13s-a-debugging-story, our high-load app is still seeing H13s when autoscaling down. These requests are only running for a couple hundred ms when they H13.

I've reported this to Heroku, but so far they've just explained to me how H12s happen and suggested I install rack-timeout. While this app does have some timeouts, it's not the timed-out requests that are H13ing.

/cc @schneems

Puma config:

RAILS_MAX_THREADS is 4
RAILS_MIN_THREADS is 1
WEB_CONCURRENCY is 8

We're running on Performance-L dynos.

# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
max_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
min_threads_count = ENV.fetch("RAILS_MIN_THREADS") { max_threads_count }
threads min_threads_count, max_threads_count

# Specifies the `port` that Puma will listen on to receive requests; default is 3000.
#
port        ENV.fetch("PORT") { 5000 }

# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch("RAILS_ENV") { "development" }

# Specifies the `pidfile` that Puma will use.
pidfile ENV.fetch("PIDFILE") { "tmp/pids/server.pid" }

# Specifies the number of `workers` to boot in clustered mode.
# Workers are forked web server processes. If using threads and workers together
# the concurrency of the application would be max `threads` * `workers`.
# Workers do not work on JRuby or Windows (both of which do not support
# processes).
#
workers ENV.fetch("WEB_CONCURRENCY") { 2 }

# Use the `preload_app!` method when specifying a `workers` number.
# This directive tells Puma to first boot the application and load code
# before forking the application. This takes advantage of Copy On Write
# process behavior so workers use less memory.
#
preload_app!

before_fork do
  Barnes.start
end

on_worker_boot do
  ActiveRecord::Base.establish_connection
end

# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart

To Reproduce

This would be a challenge, but I'd be happy to work with someone to narrow the issue down.

Expected behavior
No H13s during shutdown/scale down

Desktop (please complete the following information):

  • OS: Heroku Cedar-18 (Linux)
  • Puma Version 4.3.3
@nateberkopec
Member

These requests are only running for a couple hundred ms when they H13.

How did you figure that out?

@aaronjensen
Author

It's included in the log message that Heroku emits, and I know the rough duration range of requests for the paths that are H13ing. Here's a scrubbed example. There are some requests that do take a couple of seconds, but that's well below the 30-second timeout.

23 Mar 2020 10:31:55.592394 <158>1 2020-03-23T17:31:55.266210+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=PATCH path=PATH host=HOST request_id=5737ed07-17ea-4eb2-9f85-e13f7cb033dd fwd=FWD dyno=web.8 connect=1ms service=3170ms status=503 bytes=0 protocol=https
23 Mar 2020 10:34:59.358422 <158>1 2020-03-23T17:34:59.130272+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path=PATH host=HOST request_id=95324969-e012-4218-b9de-cc210397a632 fwd=FWD dyno=web.9 connect=3ms service=1274ms status=503 bytes=0 protocol=https
23 Mar 2020 10:34:59.370364 <158>1 2020-03-23T17:34:59.124933+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path=PATH host=HOST request_id=c8aae1d3-bc75-43d8-bbda-8374ee7d04c4 fwd=FWD dyno=web.9 connect=2ms service=1277ms status=503 bytes=0 protocol=https
23 Mar 2020 10:34:59.485381 <158>1 2020-03-23T17:34:59.118949+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path=PATH host=HOST request_id=66a2dc13-188f-49ef-8e5a-0f809d9394f8 fwd=FWD dyno=web.9 connect=0ms service=1291ms status=503 bytes=0 protocol=https
23 Mar 2020 10:36:14.113391 <158>1 2020-03-23T17:36:13.959009+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=PATCH path=PATH host=HOST request_id=e5b73e1f-bbe6-48ee-995a-92965815862a fwd=FWD dyno=web.8 connect=1ms service=186ms status=503 bytes=0 protocol=https
23 Mar 2020 10:36:14.158403 <158>1 2020-03-23T17:36:13.945749+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path=PATH host=HOST request_id=058c67ca-e3f8-44ac-a1aa-174b7daee166 fwd=FWD dyno=web.8 connect=0ms service=152ms status=503 bytes=0 protocol=https
23 Mar 2020 10:36:14.244391 <158>1 2020-03-23T17:36:13.940852+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path=PATH host=HOST request_id=d37a6102-824d-485d-b69d-94c11c1aa13e fwd=FWD dyno=web.8 connect=1ms service=156ms status=503 bytes=0 protocol=https
23 Mar 2020 10:36:17.825403 <158>1 2020-03-23T17:36:17.557040+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path=PATH host=HOST request_id=aa26df7a-3b7b-401b-a5d1-8d0f5cb0b55a fwd=FWD dyno=web.6 connect=0ms service=143ms status=503 bytes=0 protocol=https
23 Mar 2020 10:36:17.882380 <158>1 2020-03-23T17:36:17.586285+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path=PATH host=HOST request_id=f80fb5b6-9f5b-4814-97f5-e9150627deac fwd=FWD dyno=web.6 connect=1ms service=704ms status=503 bytes=0 protocol=https
23 Mar 2020 10:36:23.445404 <158>1 2020-03-23T17:36:23.175333+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path=PATH host=HOST request_id=b7523cdd-6a9f-43a8-a3b2-157eaa90fa67 fwd=FWD dyno=web.9 connect=0ms service=3382ms status=503 bytes=0 protocol=https
23 Mar 2020 10:36:39.933430 <158>1 2020-03-23T17:36:39.681539+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path=PATH host=HOST request_id=d02afeba-e738-4e01-8c14-f79c47e7b076 fwd=FWD dyno=web.6 connect=1ms service=4086ms status=503 bytes=0 protocol=https
23 Mar 2020 10:37:12.263393 <158>1 2020-03-23T17:37:12.108762+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=PATCH path=PATH host=HOST request_id=3f4f6140-3432-4816-bfc6-aa0b0c14e2ce fwd=FWD dyno=web.10 connect=1ms service=1022ms status=503 bytes=0 protocol=https

@dentarg
Member

dentarg commented Mar 24, 2020

Not sure if it is a reasonable thing to try, but it would be interesting to know if setting queue_requests to false would make the H13 errors go away. (Changing this might impact the performance of your app.)

https://github.com/puma/puma/blob/v4.3.3/docs/architecture.md#disabling-queue_requests
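
(A minimal sketch of that setting, not part of the original comment: queue_requests is a standard Puma config option and defaults to true. With it disabled, Puma stops buffering incoming requests in the reactor and only accepts a connection when a thread is free to process it.)

# config/puma.rb: illustrative only; see the architecture doc linked
# above for the trade-offs before changing this in production.
queue_requests false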

@schneems
Contributor

I wouldn't recommend turning off request queuing altogether. Heroku will "quarantine" (fitting word currently) any dyno that it gets a rejected request from for 5 seconds. If all your dynos are rejecting requests because you need burst capacity that you've not got, our router will constantly be quarantining all your dynos on and off, which would not be great.

There is another setting you can try, though, that may make a difference: the ability to drain the queue when shutdown is initiated:

def drain_on_shutdown(which=true)

You can add drain_on_shutdown to your config/puma.rb to see if it's got any benefit.
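
(A minimal sketch for readers following along, not part of the original comment. Since the method signature above defaults to true, calling it with no argument enables the behavior: Puma accepts the connections already waiting on its listening socket when SIGTERM arrives and serves them before exiting.)

# config/puma.rb: illustrative only
drain_on_shutdown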

@aaronjensen
Author

Thanks, I will try that out. How does it ever work without that? Seems like that should be the default.

@aaronjensen
Author

aaronjensen commented Mar 24, 2020

@schneems it seems like this may have helped reduce them, but we just got another one. ☹️

@nateberkopec
Member

Perhaps fixed by #2122.

@aaronjensen
Author

Looks promising, any idea when it will get merged?

@schneems
Contributor

Internal ticket id for Heroku 841375

@adamlogic

Just adding some more anecdotal evidence here. I've also been continuing to see H13 errors when autoscaling or restarting. I tried adding drain_on_shutdown, which may have reduced the H13 errors (too early to tell), but it has not eliminated them.

[screenshot: H13 errors chart, 2020-03-26]

I was pretty sure I wasn't seeing any H13 errors after #1802 was merged, but at some point they crept back in. I hadn't taken note of it until recently, which led me to this issue. Unfortunately I don't have enough history to know when I started seeing the H13s again.

@schneems
Contributor

schneems commented Apr 2, 2020

I revisited my example repo from my prior H13 work (https://github.com/schneems/puma_connection_closed_reproduction), and when I run it on Heroku I'm not able to reproduce H13 errors, meaning that if there's a regression, it's not a full regression of my prior work.

Right now I've got a few different theories about this behavior. I think it is either a race condition or timing-based somehow; otherwise we would be seeing it more frequently when scaling and deploying, instead of intermittently.

Theory 1) This behavior causing the H13s has always existed in Puma but it wasn't noticed before due to other problems causing even more H13s.

It's possible, but if both of you remember an H13-free time before, then something else would have had to change between right after 4.0.1 was released and now. It's possible that your request volume increased per dyno and that only after a certain threshold does this start to show up. I'm not sure how we test this theory, but it's not my main candidate. (It seems unlikely that you both used to see no H13s and now you see some.)

I did an in-depth review of #2122. It doesn't look like it's fixing a regression, but rather adding new behavior (though I'm happy to be proven wrong; if anyone can investigate and confirm/deny, that would be amazing). If that's the case then it's possible intermittent H13s could be caused by the behavior described in one of the scenarios in #2122 (review), where a request is accepted RIGHT before the server is shut down, then the reactor is turned off and the connection is closed. I think that would return an H13.

Now I'm not totally sure what behavior we want from Scenario C. In an ideal world we could tell the Heroku router "hey, I can't serve this request right now, hold on to it and try again please", but I don't think there's an HTTP-compliant way to communicate that upstream to a routing server. In that case there would still be a handful of dropped connections or 503 responses (depending on how we want to handle the scenario).

To test: If we can answer the questions on that PR and figure out what our desired behavior is, then we can work towards shipping it, and then y'all can try it out to see if it helps with your problem. Even if it doesn't, it looks like it's worth investing more in understanding what we want to happen.

Theory 2) This H13 is a regression, but not from my prior PR

If we've got multiple people saying they had a time when they saw no H13s and now they see H13s, it seems likely that something got broken and we're just now noticing.

To test: Does anyone have an app that they would be willing to run an older version of Puma on for a bit? Version 4.0.1 seems like a likely candidate. It's not ideal, as there have been some security patches since then, so I wouldn't run in that mode for long (a few hours or a day should show it). There's a little security in obscurity if you don't let anyone know what you're doing publicly until after you're back on the latest, most secure version of Puma. Maybe you've got a server that isn't as critical but still sees this behavior that you could try?

Theory 3) Not a regression and not #2122

If the race condition is somewhere else and not in #2122 then I'm not sure how we could narrow things down.

To test: Eliminate theories 1 and 2.

Action items

@dentarg
Member

dentarg commented Apr 6, 2020

I wouldn't recommend turning off request queuing altogether. Heroku will "quarantine" (fitting word currently) any dyno that it gets a rejected request from for 5 seconds. If all your dynos are rejecting requests because you need burst capacity that you've not got, our router will constantly be quarantining all your dynos on and off, which would not be great.

Thanks for explaining that, I didn't know about the quarantine. For anyone interested in more details, they are at https://devcenter.heroku.com/articles/http-routing#dyno-connection-behavior-on-the-common-runtime

@schneems
Contributor

schneems commented May 7, 2020

Since #2122 was merged in, would any of you who have reported this issue be willing to try out master and let us know if your issue is resolved or if we need to keep hunting?

@adamlogic

@schneems Just deployed, I'll let you know how it goes.

As an aside, thanks for the theories you posted last month. I've had it on my list to revisit your post and put some thought into it, but it just hasn't happened yet.

As another aside, it looks like Barnes is not compatible with Puma master. I had to remove Barnes in order to deploy, so there will be a couple of variables at play in any results I provide.

@nateberkopec
Member

Thanks for the report, I'll take a look at that.

@adamlogic

Not enough data yet for anything conclusive, but here's an update.

[screenshot: H13 errors chart, 2020-05-11]

I did see one H13 error since deploying with Puma master, but so far it seems less frequent. Could've also been due to less traffic and autoscaling on the weekend. I'll keep watching.

@adamlogic

I ran on Puma master (3060a75) for several days, then ran on v4.0.1 for several days. Both versions showed the same behavior—a few H13 errors every day or two.

[screenshots: H13 error charts, 2020-05-13 and 2020-05-17]

I'm now on 5.0.0.beta1, and I expect I'll continue to see the same.

Based on these results, I'm leaning toward this theory from @schneems:

Theory 1) This behavior causing the H13s has always existed in Puma but it wasn't noticed before due to other problems causing even more H13s. ... It's possible that your request volume increased per dyno and that only after a certain threshold does this start to show up.

I may have thought there was a time when I was seeing no H13 errors, but my request volume (and the amount of autoscaling I'm doing) is much higher now.

@jclusso
Contributor

jclusso commented Sep 9, 2020

Has anyone had much success in resolving this?

@schneems
Contributor

I think what we need to do is:

  1. Generate a list of theories, such as the ones in my earlier comment on this issue (#2200 (comment)). Keep in mind this is not comprehensive; the actual cause might not be listed.
  2. Generate a systematic way of testing the theories
  • This might require engineering, and/or adding logging.

Right now I'm thinking that we could add some kind of detailed logging during shutdown. But I don't know quite what information we need to prove/disprove the above theories.

If this is the case:

Theory 1) This behavior causing the H13s has always existed in Puma but it wasn't noticed before due to other problems causing even more H13s. ... It's possible that your request volume increased per dyno and that only after a certain threshold does this start to show up.

At a low level, what do we think might be happening? The error code itself can tell us something, as there are multiple error codes related to socket closure:

  • H13 - the socket was connected, then destroyed without sending any data
  • H18 - the socket connected, some data was sent as part of a response by the app, but then the socket was destroyed without completing the response

Since we're seeing H13s and not H18s, we know that no data is being written back to the socket. Unfortunately this doesn't narrow things down for us.

Here are the places where a request could be:

  • On the socket, accepted by the OS but not read by Puma
  • On the socket, read by Puma but incomplete
  • On the socket, read by Puma and complete, ready to be worked on (in the waiting queue)
  • Being actively worked on by a Puma thread.

Right now we don't know WHERE the request is when the server shuts down and the H13 is logged. I think that's the biggest information gap. I have a feeling that there are cases that aren't possible to handle 100%. For example:

  • A client on a slow 2G mobile connection wants to send 1 MB of JSON to your server
  • It connects to the dyno's socket; there are 10 other requests ahead of it
  • The dyno gets the restart signal
  • Puma processes the other 10 requests first and starts slowly accepting data from the mobile connection
  • The reactor is still buffering the request on the Puma side, but 10 seconds go by
  • The dyno gets the hard KILL signal
  • The 2G client still hasn't uploaded the whole request when Puma gets SIGKILL and the worker shuts down
  • The router sees the socket was closed and throws an H13

But if that's the case then there's nothing Puma could do, except perhaps try to hint to the client that they should retry. Maybe there's a good status code for this; 408 seems like it fits, but I don't like that we're throwing a 4xx for what is a server problem. I'm open to suggestions here. In #2122 they're using 503.
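
(An illustrative aside, not something proposed in the thread: in plain HTTP terms the usual "please retry" hint is a 503 with a Retry-After header, which RFC 7231 allows for exactly this purpose. As a bare Rack response triple, that would look something like the sketch below.)

# Hypothetical sketch only, not Puma's actual behavior: a Rack-style
# "server is going away, retry shortly" response, i.e. a 503 with a
# Retry-After header hinting when the client may try again.
RETRY_LATER_RESPONSE = [
  503,
  { "Content-Type" => "text/plain", "Retry-After" => "1" },
  ["Server is shutting down, please retry\n"]
]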

Recap

  • We need to find out where requests are when the H13 is triggered.
    • If they're accepted by Puma then we can handle them, which is what #2122 (Pass queued requests to thread pool on server shutdown) is trying to do. If this is the case, Puma could add or change its logic to alleviate the situation
    • If they're not yet being worked on by Puma, then they're on the TCP socket and this is much harder.

I'm thinking we could add a flag for logging requests in 3 locations:

  • Being processed in threads (we have request IDs for them)
  • Waiting to be processed by threads (we have request IDs for them)
  • In the reactor, waiting to be buffered (We might not have request IDs for them)

We could output these stats on some kind of timer triggered by shutdown. For example: every second we output those results until the server completely shuts down. Alternatively, we could output those stats when certain triggers are hit, such as when server shutdown starts and then again when SIGKILL is received.

I don't know how much time or wiggle room we have once a SIGKILL happens.
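
(An illustrative aside, not part of the original comment: Puma's built-in control app already exposes a /stats endpoint reporting per-worker backlog and running counts, which covers part of what's proposed above: the queue depth, though not request IDs or what is still buffering in the reactor. A rough sketch, assuming a token of your choosing and that the endpoint stays responsive during the shutdown window:)

# config/puma.rb: illustrative sketch only. activate_control_app and its
# /stats route are existing Puma features; PUMA_CONTROL_TOKEN is a
# placeholder environment variable you would set yourself.
activate_control_app "tcp://127.0.0.1:9293", { auth_token: ENV.fetch("PUMA_CONTROL_TOKEN", "changeme") }

# From another process on the same dyno, poll during the shutdown window:
#   curl "http://127.0.0.1:9293/stats?token=$PUMA_CONTROL_TOKEN"
# In clustered mode the JSON includes, per worker, "backlog", "running"
# and "pool_capacity".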

@schneems
Contributor

Another internal ticket id: https://heroku.support/914138

@nateberkopec
Member

@schneems Have you seen any issues re: this on 5.0.3+?

@schneems
Contributor

schneems commented Dec 14, 2020 via email

@pcreux

pcreux commented Mar 18, 2021

👋 I'm running into the same issue on Puma 5.2.1.

Scale-down events trigger H13s:

[screenshot: H13 errors during scale-down events, 2021-03-18]

We have 4 to 6 dynos running, so only web.5 and web.6 get shut down. The connection is closed within 2 seconds max.

[screenshot, 2021-03-17]

I suspected these were due to the short worker_shutdown_timeout we have set to 8, which would only affect long-running requests (8+ seconds). But the logs show connections closed after 100ms to 2 seconds, so it seems to be affecting all kinds of requests.

Here's our Puma config:

require 'barnes'

threads_count = 1

workers 30
threads threads_count, threads_count

# based on defaults from
# https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts#puma
worker_timeout          15
worker_shutdown_timeout 8

preload_app!

rackup      DefaultRackup
port        ENV['PORT'] || 7000
environment ENV['RACK_ENV'] || 'development'

before_fork do
  Barnes.start # send metrics to Heroku's statsd agent
  Barnes.start(statsd: Statsd.new('127.0.0.1', 8125)) # send metrics to Datadog's statsd agent
end

on_worker_boot do
  if Rails.configuration.respond_to?(:initialize_features_client)
    Rails.configuration.initialize_features_client.call
    Rails.logger.info('Connected to LaunchDarkly (Puma)')
  end
end

plugin :tmp_restart

lowlevel_error_handler do |e|
  Raven.capture_exception(e)
  [500, {}, ["An error has occurred, and engineers have been informed. Please reload the page. If you continue to have problems, contact XXX\n"]]
end

@MSP-Greg
Member

@pcreux

This is messy. Part of the problem is that frontends and Puma are not always processing data from the other immediately.

There may be room for improvement during shutdown re the handling of clients that are sending multiple requests.

You might try adding drain_on_shutdown to your config file; it may lower the number of H13s, but not eliminate them...

@nateberkopec
Member

nateberkopec commented Sep 7, 2023

Closing as stale/not actionable with the information we have.

@amitsaxena

I can confirm that we still see this on Heroku in production during autoscaling scenarios when the dyno formation is scaling down. We are on puma 6.3.1.

@dinsley

dinsley commented Dec 7, 2023

We're also experiencing this in autoscaling scenarios when formation is scaling down. As far as I know there's no way to force Heroku to extend the life of the dyno past the 30s kill signal to make sure existing connections are completed before the dyno instance is removed.

I'm going to open a support request with Heroku to see what they say, as I think this problem may land more on their side, but the impact can perhaps be reduced on Puma's side as @MSP-Greg suggested.

@dentarg
Member

dentarg commented Dec 7, 2023

I read https://blog.heroku.com/puma-4-hammering-out-h13s-a-debugging-story again and this part stands out to me (emphasis added):

To shut down a dyno two things have to happen, we need to send a SIGTERM to the processes on the dyno which tells the webserver to safely shutdown. We also need to tell our router to stop sending requests to that dyno since it will be shut down soon.

These two operations happen on two different systems. The dyno runs on one server, the router which serves our requests is a separate system. It's itself a distributed system. It turns out that while both systems get the message at about the same time, the router might still let a few requests trickle into the dyno that is being shut down after it receives the SIGTERM.

That explains the problem then, right? The reason this only happens on apps with a large amount of traffic is they get so many requests there is more chance that there will be a race condition between when the router stops sending requests and the dyno receives the SIGTERM.

That sounds like a bug with the router then right? Before we get too deep into the difficulties of distributed coordination, I noticed that other apps with just as much load weren't getting H13 errors. What did that tell me? It told me that the distributed behavior of our system wasn't to blame. If other webservers can handle this just fine, then we need to update our webserver, Puma in this case.

Considering that behaviour from Heroku / the router, I'm not surprised there are still H13 errors happening (despite the fix in #1808). Sounds like there can still be race conditions.

Maybe someone should take a closer look at these other webservers that can handle this just fine? What is the difference between them and Puma?

@dougbradbury

FWIW, I am experiencing the same thing with Heroku in a Python/Gunicorn app. The router continues to send requests to the Dyno despite having already sent a SIGTERM.

@dinsley

dinsley commented Mar 23, 2024

We're having this issue and not sure it's even Puma related. I've tried walking through support with it multiple times and have gotten nowhere.

It looks like the router continues to send requests to the process that received a SIGTERM in downscale events.

@ezekg

ezekg commented Apr 7, 2024

Chiming in here. We're also seeing this behavior on autoscaling events. Should this issue be reopened? Not sure why it was closed in the first place.

@dentarg
Member

dentarg commented Apr 7, 2024

Here's the reason why the issue was closed: #2200 (comment)

Can you reproduce the issue without Heroku? Can you suggest what to change in Puma to address this?

It is not clear if this is an issue that can be addressed within Puma; therefore it was closed.
