Keepalive connections blocking threads (again?) #2625

Closed
jarthod opened this issue May 10, 2021 · 12 comments

@jarthod

jarthod commented May 10, 2021

What I noticed here looks extremely similar to GHSA-7xx3-m584-x994 (and #1565), but somehow I can reproduce it on 5.3.0 and 4.3.7, so I'm not sure whether the problem was not completely fixed or whether it's something else. Basically, when hitting Puma (default config) with more keepalive connections than the number of threads it is configured with, and with some traffic flowing through them, all other connections hang almost indefinitely without getting a response.

After seeing this in our production environment (using Puma 5.0) and reproducing it locally, I tried looking for a cause (that's when I found GHSA-7xx3-m584-x994) and also tried other versions of Puma (4.3.7 and 5.3.0), but I reproduced the issue every time. Here are the minimal instructions to reproduce:

I used the following sample application (put this in config.ru), which simply sleeps 10 ms to simulate work.

run ->(env) { sleep 0.01; [200, {}, ["Puma test"]] }

Then starting puma with 4 threads (for example):

> puma -t 4:4
Puma starting in single mode...
* Puma version: 5.3.0 (ruby 2.6.4-p104) ("Sweetnighter")
*  Min threads: 4
*  Max threads: 4
*  Environment: development
*          PID: 2174364
* Listening on http://0.0.0.0:9292
Use Ctrl-C to stop

Then I run a benchmark using the hey tool (which, unlike wrk, shows when a connection just waited the entire time for a response) with a concurrency higher than the number of threads in Puma (here -c 16). -z 10s is the duration of the test:

> hey -z 10s -c 16 http://localhost:9292
Summary:
  Total:	10.6404 secs
  Slowest:	10.6359 secs
  Fastest:	0.0101 secs
  Average:	0.0433 secs
  Requests/sec:	357.8826
  
  Total data:	34272 bytes
  Size/request:	9 bytes

Response time histogram:
  0.010 [1]	|
  1.073 [3797]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  2.135 [1]	|
  3.198 [0]	|
  4.260 [0]	|
  5.323 [0]	|
  6.386 [0]	|
  7.448 [0]	|
  8.511 [0]	|
  9.573 [0]	|
  10.636 [9]	|


Latency distribution:
  10% in 0.0102 secs
  25% in 0.0103 secs
  50% in 0.0106 secs
  75% in 0.0108 secs
  90% in 0.0117 secs
  95% in 0.1133 secs
  99% in 0.1173 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0101 secs, 10.6359 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0015 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0011 secs
  resp wait:	0.0432 secs, 0.0101 secs, 10.6344 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0003 secs

Status code distribution:
  [200]	3808 responses

Here we can see in the response time histogram that 9 requests waited 10 seconds (the duration of the test) before getting their response. We can also see the "slowest" resp wait at the bottom, which is 10+ seconds. With a concurrency of just 4 it's fine; with 5-7 it's somewhere in between: requests don't always wait the full 10 seconds, but there are still some slow responses. Beyond that it's almost always the case, and there are always some requests (roughly proportional to test concurrency minus Puma threads) that wait for the full 10 seconds.

Expected behavior

The expected behavior would be for all requests to be processed fairly, in the order they arrive, without one connection blocking the others, as explained in #1565 (comment).

If we disable keepalive on the hey side, we can see what is basically the expected behavior IMO (same overall throughput but no long responses; every request is answered fairly):

> hey -z 10s -disable-keepalive -c 16 http://localhost:9292

Summary:
  Total:	10.0423 secs
  Slowest:	0.0483 secs
  Fastest:	0.0124 secs
  Average:	0.0427 secs
  Requests/sec:	374.1170
  
  Total data:	33813 bytes
  Size/request:	9 bytes

Response time histogram:
  0.012 [1]	|
  0.016 [3]	|
  0.020 [0]	|
  0.023 [1]	|
  0.027 [3]	|
  0.030 [0]	|
  0.034 [0]	|
  0.038 [4]	|
  0.041 [495]	|■■■■■■
  0.045 [3109]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.048 [141]	|■■


Latency distribution:
  10% in 0.0410 secs
  25% in 0.0415 secs
  50% in 0.0429 secs
  75% in 0.0436 secs
  90% in 0.0442 secs
  95% in 0.0445 secs
  99% in 0.0457 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0003 secs, 0.0124 secs, 0.0483 secs
  DNS-lookup:	0.0001 secs, 0.0000 secs, 0.0018 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0012 secs
  resp wait:	0.0422 secs, 0.0112 secs, 0.0467 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0006 secs

Status code distribution:
  [200]	3757 responses

I also noticed something interesting (maybe an unrelated side effect): if I use the queue_requests false option on the Puma side, this problem is no longer present. But still, with queue_requests enabled (the default), I don't believe this behavior is desired? Is it? cc @nateberkopec @ioquatix: as you both worked on this issue specifically, maybe you'll have more insight into this.
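For reference, a minimal sketch of a Puma config file with that workaround (the file name and the 4:4 / 9292 values are simply taken from the reproduction above; queue_requests is Puma's documented DSL option), started with puma -C puma.rb:

# puma.rb -- workaround sketch for the behavior described above
threads 4, 4           # same 4:4 thread pool as in the repro
port 9292
queue_requests false   # threads read requests directly instead of going through the reactor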

A bit more details about my local machine:

> sudo lsb_release -a
[sudo] password for bigbourin: 
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.10
Release:	20.10
Codename:	groovy
> uname -a
Linux desktop 5.8.0-49-generic #55-Ubuntu SMP Wed Mar 24 14:45:45 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> puma --version
puma version 5.3.0

Thanks!

@nateberkopec
Member

I'm not sure if the problem was not completely fixed or if it's something else

That security issue had a much, much larger effect. You're demonstrating an effect on the order of 1 request in 300 under heavy traffic; the original issue was a true denial of service where you could take down a Puma server with keepalive. So that issue was definitely fixed.

That said, it seems like you're on to something here, and we definitely want to fix the p99 behavior you've demonstrated.

@MSP-Greg
Member

@jarthod

Using an updated test framework, I have the following with Puma running with -w2 -t5:5

Each client sends 10 requests and is 'keep-alive'.

The first set runs clients in 10 loops/threads, which matches Puma's capacity (2 workers * 5 threads).

The second set runs clients in 25 loops/threads, so Puma is 'overloaded'. Notice the large increase in the 90% and 95% times.

I've got a patch for this; let me see if I can post a simple change you could test against.

────────────────────────────────────────────────────────────────────────────
10,000 successful requests (10 loops of 100 clients * 10 requests per client) - total request time
          5%     10%     20%     40%     50%     60%     80%     90%     95%
  mS   0.307   0.309   0.321   0.348   0.355   0.368   0.461   0.689   1.127

────────────────────────────────────────────────────────────────────────────
10,000 successful requests (25 loops of 40 clients * 10 requests per client) - total request time
          5%     10%     20%     40%     50%     60%     80%     90%     95%
  mS    0.23    0.25    0.29    0.34    0.38    0.44    0.68    2.06  103.93
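For illustration only (this is not the test framework used above, just a minimal sketch of what a 'keep-alive client sending 10 requests' means), such a client could look like this with Ruby's Net::HTTP:

require 'net/http'

# Net::HTTP.start opens one TCP connection and keeps it alive for the
# duration of the block, so all 10 requests share the same socket.
Net::HTTP.start('localhost', 9292) do |http|
  10.times do
    res = http.get('/')
    puts res.code
  end
end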

@ioquatix
Contributor

Last time I checked this was still an issue, but I am happy to review whether the reported details align with what I was seeing.

@ioquatix
Contributor

By the way, it's not p99 latency: those connections only get served when all the other connections are finished, so they have effectively infinite latency. At least that's how it looked last time I checked.

@jarthod
Author

jarthod commented May 10, 2021

Thank you all for these quick responses! 🙇

That security issue had a much, much larger effect. You're demonstrating an effect on the order of 1 request in 300 under heavy traffic; the original issue was a true denial of service where you could take down a Puma server with keepalive. So that issue was definitely fixed.

Last time I checked this was still an issue, but I am happy to review whether the reported details align with what I was seeing.
By the way, it's not p99 latency: those connections only get served when all the other connections are finished, so they have effectively infinite latency. At least that's how it looked last time I checked.

Indeed, this is not just a 1-in-300 problem; it's only in my testing setup, where hey starts the couple of extra requests as a demonstration and waits patiently while they are stuck indefinitely. But with production traffic, where requests keep coming in, this could block an unbounded number of requests as soon as the threads are saturated.

So it still sounds exactly like the description of the security issue, unless I'm missing something here? Was the previous fix about a variant of this situation maybe? As the description matches for me, I would just like to better understand the distinction between that previous security issue and the problem we are seeing.

@MSP-Greg yes I will gladly test some patches if you have one 👍

@nateberkopec
Member

We've got a fix. We're gonna sleep on it and work on it tomorrow.

@nateberkopec
Member

Fixed by df72887, released in 5.3.1 and 4.3.8.

CVE pending, GHSA is here: GHSA-q28m-8xjw-8vr5

In the future, please report anything that you think might be a security issue in accordance with our security policy. Reporting security issues in a public forum stresses maintainers out and gives us less time to work on fixes.

@jarthod
Author

jarthod commented May 11, 2021

Thanks @nateberkopec for the quick update!

So if I understand correctly, after this fix Puma will close the connection after 10 (default) requests on the same keep-alive socket? I didn't dig deeper, but is the connection closed at a proper time? When testing with POST requests, I see a lot of "EOF" and "connection reset by peer" errors with this new version, whereas there are none with 5.3.0.

(same sample app as before)

5.3.1:

> hey -z 10s -c 16 -m POST -d '{}' http://localhost:9292

Summary:
  Total:	10.6382 secs
  Slowest:	0.7210 secs
  Fastest:	0.0101 secs
  Average:	0.0439 secs
  Requests/sec:	356.3573
  
  Total data:	33867 bytes
  Size/request:	9 bytes

Response time histogram:
  0.010 [1]	|
  0.081 [3378]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.152 [4]	|
  0.223 [0]	|
  0.294 [5]	|
  0.366 [362]	|■■■■
  0.437 [1]	|
  0.508 [3]	|
  0.579 [1]	|
  0.650 [4]	|
  0.721 [4]	|


Latency distribution:
  10% in 0.0103 secs
  25% in 0.0105 secs
  50% in 0.0107 secs
  75% in 0.0108 secs
  90% in 0.2305 secs
  95% in 0.3319 secs
  99% in 0.3366 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0101 secs, 0.7210 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0002 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0003 secs
  resp wait:	0.0437 secs, 0.0101 secs, 0.7207 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0003 secs

Status code distribution:
  [200]	3763 responses

Error distribution:
  [8]	Post http://localhost:9292: EOF
  [9]	Post http://localhost:9292: http: server closed idle connection
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56114->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56130->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56172->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56190->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56194->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56206->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56214->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56290->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56356->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56456->127.0.0.1:9292: read: connection reset by peer
  [1]	Post http://localhost:9292: read tcp 127.0.0.1:56840->127.0.0.1:9292: read: connection reset by peer

5.3.0:

> hey -z 10s -c 16 -m POST -d '{}' http://localhost:9292

Summary:
  Total:	10.6423 secs
  Slowest:	10.6353 secs
  Fastest:	0.0101 secs
  Average:	0.0437 secs
  Requests/sec:	354.9971
  
  Total data:	34002 bytes
  Size/request:	9 bytes

Response time histogram:
  0.010 [1]	|
  1.073 [3767]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  2.135 [0]	|
  3.198 [0]	|
  4.260 [0]	|
  5.323 [0]	|
  6.385 [0]	|
  7.448 [0]	|
  8.510 [0]	|
  9.573 [0]	|
  10.635 [10]	|


Latency distribution:
  10% in 0.0103 secs
  25% in 0.0104 secs
  50% in 0.0108 secs
  75% in 0.0108 secs
  90% in 0.0119 secs
  95% in 0.1042 secs
  99% in 0.1160 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0101 secs, 10.6353 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0017 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:	0.0436 secs, 0.0101 secs, 10.6337 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0002 secs

Status code distribution:
  [200]	3778 responses

I haven't dug deeper yet, but this doesn't sound reassuring. Is Puma maybe closing the socket after receiving the request but before sending the reply? If it is after the response, maybe it doesn't send the Connection: close header? (So the client is not aware and starts sending the next request, which leads to the first case.)
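One way to check this (a minimal sketch, assuming the 10-request limit from the fix and the same sample app; not something from the original report) is to send a dozen POSTs over a single keep-alive connection and print each response's Connection header:

require 'net/http'

# Send 12 POSTs over one keep-alive connection and print the Connection
# response header; if Puma closes the socket abruptly instead of announcing
# it, Net::HTTP raises an error here (EOFError / ECONNRESET), matching the
# errors hey reports above.
Net::HTTP.start('localhost', 9292) do |http|
  12.times do |i|
    res = http.post('/', '{}')
    puts "request #{i + 1}: status=#{res.code} Connection=#{res['Connection'].inspect}"
  end
end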

In the future, please report anything that you think might be a security issue in accordance with our security policy. Reporting security issues in a public forum stresses maintainers out and gives us less time to work on fixes.

Will do! sorry about that 🙇

@nateberkopec
Member

If it is after the response, maybe it doesn't send the Connection: close header

Certainly possible, and this was a known issue with the fix. Our concern was security > correctness at the margins, and work will be done over the next few weeks to fix the edges created here. I'm sufficiently confident that, because this only affects clients that have sent more than 10 keepalive requests, the impact here is small.

@nateberkopec
Member

Tracking at #2627

@jarthod
Author

jarthod commented May 11, 2021

Certainly possible and this was a known issue with the fix. Our concern was security > correctness at the margins and work will be done over the next few weeks to fix the edges created here.

Ok that is totally understandable

I'm sufficiently confident that, because this only affects clients that have sent more than 10 keepalive requests, the impact here is small.

Yes, I agree.

Thanks for the new issue! I'll follow it then and also verify if this behavior could impact us negatively in the meantime.

@dthomason

dthomason commented May 18, 2021

I have been troubleshooting slow response times from one of our backend APIs hitting another backend API. The failing server uses WebSockets/ActionCable and has enough threads, so I think the queue_requests true setting may still be a problem. We're in a Kubernetes deployment, so I suspect the round-robin requests are getting queued to death. Great documentation, by the way; there are some other config options I just learned about that should help our prod performance.

thanks
