Keepalive connections blocking threads (again?) #2625
Comments
That security issue had a much, much larger effect. You're demonstrating an effect on the order of 1 request in 300 under heavy traffic; the original issue was a true denial-of-service where you could take down a Puma server. That said, it seems like you're on to something here and we definitely want to fix the p99 behavior you've demonstrated.
Using an updated test framework, I have the following results with Puma running with 2 workers and 5 threads. Each client sends 10 requests and is 'keep-alive'. The first set runs clients in 10 loops/threads, which matches Puma's capacity (2 workers * 5 threads). The second set runs clients in 25 loops/threads, so Puma is 'overloaded'. Notice the large increase in the 90% and 95% times. I've got a patch for this; let me see if I can post a simple change you could test against.
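Not MSP-Greg's actual test framework, but a minimal sketch of the kind of client loop described above, assuming the app is served on 127.0.0.1:9292: each thread opens one keep-alive connection, sends 10 requests over it, and records per-request latency.

```ruby
require "net/http"

CLIENTS  = 25   # 10 matches Puma's capacity (2 workers * 5 threads); 25 overloads it
REQUESTS = 10   # requests sent over each keep-alive connection

threads = CLIENTS.times.map do
  Thread.new do
    times = []
    # Net::HTTP.start keeps the same TCP connection open for the whole block,
    # so all REQUESTS requests go over a single keep-alive socket.
    Net::HTTP.start("127.0.0.1", 9292) do |http|
      REQUESTS.times do
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        http.get("/")
        times << Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      end
    end
    times
  end
end

latencies = threads.flat_map(&:value).sort
p90 = latencies[(latencies.size * 0.90).floor]
p95 = latencies[(latencies.size * 0.95).floor]
puts format("requests: %d  p90: %.3fs  p95: %.3fs", latencies.size, p90, p95)
```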
Last time I checked this was still an issue, but I am happy to review whether the reported details align with what I was seeing.
By the way, it's not p99 latency: those connections only get served when all the other connections are finished, so they have effectively infinite latency. At least, that's how it looked last time I checked.
Thank you all for these quick responses! 🙇
Indeed this is not just a 1-in-300 problem; that ratio is only an artifact of my testing setup. So it still sounds exactly like the description of the security issue, unless I'm missing something here? Was that previous fix maybe about a variant of this situation? The description matches, as far as I can tell; I would just like to understand better the distinction between that previous security issue and the problem we are seeing. @MSP-Greg yes, I will gladly test some patches if you have one 👍
We've got a fix. We're gonna sleep on it and work on it tomorrow.
Fixed by df72887, released in 5.3.1 and 4.3.8. CVE pending, GHSA is here: GHSA-q28m-8xjw-8vr5. In the future, please report anything that you think might be a security issue in accordance with our security policy. Reporting security issues in a public forum stresses maintainers out and gives us less time to work on fixes.
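For context on what the fix does: it bounds how many requests Puma will serve back-to-back on a single keep-alive socket (10 by default) before the client goes back through normal request ordering, so one chatty connection can no longer monopolize a thread. If I remember correctly, Puma 5.x exposes this limit as the `max_fast_inline` config option, but treat that name and its default as an assumption to verify against your Puma version; a minimal sketch:

```ruby
# config/puma.rb -- sketch only; `max_fast_inline` is assumed to be available
# in your Puma 5.x version (check Puma's DSL docs before relying on it).
workers 2
threads 5, 5

# Serve at most this many requests inline on one keep-alive connection
# before handing the client back to the normal request queue.
max_fast_inline 10
```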
Thanks @nateberkopec for the quick update! So if I understand correctly, after this fix Puma will close the connection after 10 (default) requests on the same keep-alive socket? I didn't dive deeper, but is the connection stopped at a proper time? When testing with POST requests, I see a lot of "EOF" and "connection reset by peer" errors with this new version, whereas there are none with 5.3.0 (same sample app as before). 5.3.1:
5.3.0:
I didn't dive deeper yet, but this doesn't sound reassuring. Is Puma closing the socket after receiving the request and before sending the reply, maybe? If it is after the response, maybe it doesn't send the `Connection: close` header?
Will do! Sorry about that 🙇
Certainly possible, and this was a known issue with the fix. Our concern was…
Tracking at #2627
OK, that is totally understandable.
Yes, I agree. Thanks for the new issue! I'll follow it then, and also verify whether this behavior could impact us negatively in the meantime.
I have been troubleshooting slow response times from one of our backend APIs hitting another backend API. The failing server uses WebSockets/ActionCable and has enough threads, so I think the `queue_requests true` setting may still be a problem. We're in a Kubernetes deployment, so I suspect the round-robin requests are getting queued to death. Great documentation, by the way; there are some other config options I just learned about that should help our prod performance. Thanks!
What I noticed here looks extremely similar to GHSA-7xx3-m584-x994 (and #1565), but somehow I can reproduce it on 5.3.0 and 4.3.7, so I'm not sure whether the problem was never completely fixed or whether this is something else. Basically, when hitting Puma (default config) with more keep-alive connections than the number of threads it is configured with, with some traffic on them, all the other connections will hang almost indefinitely without getting a response.
After seeing this in our production (using puma 5.0) and reproducing it locally, I tried looking for a cause (that's when I found GHSA-7xx3-m584-x994) and also tried other versions of puma (`4.3.7` and `5.3.0`), but reproduced the issue every time. Here are the minimal instructions to reproduce.

I used the following sample application (put this in `config.ru`), which simply sleeps 10ms to simulate work, and then started puma with 4 threads (for example):
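A reconstruction of the kind of app described (the reporter's exact snippet is not shown here): a bare Rack app that sleeps 10ms per request, plus one way to start Puma with 4 threads.

```ruby
# config.ru -- minimal Rack app simulating 10ms of work per request.
# Start it with 4 threads, for example:  puma -t 4:4 config.ru
run lambda { |env|
  sleep 0.01
  [200, { "Content-Type" => "text/plain" }, ["Hello World\n"]]
}
```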
Then I ran a benchmark using the hey tool (which, unlike wrk, shows when a connection just waited the entire time for a response) with a concurrency higher than the number of threads in puma (here `-c 16`); `-z 10s` is the duration of the test:

Here we can see in the response time histogram that 9 requests waited 10 seconds (the duration of the test) before getting their response. We can also see the "slowest" response time at the bottom, which is 10s+. With a concurrency of just 4 it's fine; with 5-7 it's somewhere in between: requests don't always wait 10s, but there are still some slow responses. Beyond that it's almost always the case, and there are always some requests (roughly proportional to: test concurrency - puma threads) which wait 10 seconds.
Expected behavior
The expected behavior would be for all requests to be processed fairly in the order they arrive, without one connection blocking the others, as explained in #1565 (comment).
If we disable keepalive on the `hey` side, we can see what is basically the expected behavior IMO (same overall throughput, but no long responses; every request is answered fairly):

I also noticed something interesting (maybe an unrelated side-effect): if I use the `queue_requests false` option on the puma side, this problem is not present any more.
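A minimal sketch of that workaround as a standalone Puma config file; this is just the setting the report describes, and any side effects of disabling request queueing are not evaluated here.

```ruby
# config/puma.rb -- sketch of the observed workaround: with queue_requests
# disabled, the hanging keep-alive behavior was not reproducible (per the
# report above).
threads 4, 4
queue_requests false
```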
But still, with `queue_requests` enabled (the default), I don't believe this behavior is desired? Is it? cc @nateberkopec @ioquatix: as you both worked on this issue specifically, maybe you'll have more insight into this.

A bit more details about my local machine:
Thanks!