Performance degradation with multithreading and connection pools #1254
Comments
@mwpastore Sorry, I didn't visit your links... Did you share your puma config (specifically the min and max values for the threads)? Also, what is the Rails connection pool configuration? It's very interesting.
@mwpastore Is it easy to run the benchmark against a different adapter such as PostgreSQL? I'm curious whether you'd see exactly the same behavior... I'd assume so. I don't have any answers but am curious to know what's happening here.
@jrafanie Tests were run under Puma, Unicorn, and Passenger, with Sequel and ActiveRecord, against MySQL and PostgreSQL. All the configuration is available at the linked GitHub repository, and all the benchmark data is available at the linked Round 14 Preview 1.1 results.
@mwpastore Since you use that autotune script to set the workers and threads, I can't know their exact values on the machine. Could you confirm what they were? You said 100 workers and 4 threads per worker, is that correct?
@evanphx Sorry, it's somewhat inscrutable. Yes, that's correct.
@mwpastore Ok. I suspect the issue isn't really something that puma can correct, sadly. What I believe is happening is that the workload causes Ruby threads to fight over the GVL, because there is enough work being done in Ruby, especially if the queries don't really interleave but instead all return results at the same time. In that case, all 4 threads would be asleep waiting on IO, then all wake up at the same time to process the results. Because wrk does not simulate a "normal" traffic pattern but just issues requests as fast as possible, you'll see results like this. You might consider generating staggered requests, which better represent a normal traffic pattern. That situation should look better.
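A toy illustration of that wake-up pattern (plain Ruby, not Puma code; the sleep interval and iteration counts are arbitrary):

```ruby
require 'benchmark'

# Simulate 4 request threads that all wait on "IO" for the same interval,
# then do CPU-bound post-processing of the "results".
def cpu_work
  x = 0
  500_000.times { x += 1 }
  x
end

threaded = Benchmark.realtime do
  Array.new(4) { Thread.new { sleep 0.1; cpu_work } }.each(&:join)
end

sequential = Benchmark.realtime do
  sleep 0.1
  4.times { cpu_work }
end

# Under MRI the sleeps overlap, but the four cpu_work calls still run one
# at a time because of the GVL, so the threaded version saves little.
puts format('threaded: %.2fs  sequential: %.2fs', threaded, sequential)
```

On MRI the two timings come out close together, because only the sleeps parallelize.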
@evanphx That's a pleasantly simple and elegant explanation of this issue, and hearing it from the master himself definitely eases my mind. 🙇 I was starting to question my sanity, or at least my understanding of how computers work! I will try to reproduce the anomaly using Siege instead of wrk and then fiddle with Siege's request-staggering option to see if it "solves" the "problem." If it does, maybe I can try convincing wg to add a similar feature to wrk, or failing that, try convincing TechEmpower to use a different benchmarking tool (which isn't likely since they rely pretty heavily on wrk's Lua feature). Thank you! Is it alright if we leave this issue open while I keep poking at it?
Absolutely, leave it open. It's good for documentation. I'll try to think of a way to detect and verify my explanation as well.
@mwpastore Couldn't TechEmpower just delay requests to create a more realistic arrival pattern?
@mwpastore I'm interested to see how the threadpool changes in 3.9 affect our benchmark results.
@nateberkopec Running some quick benchmarks just to get a feel for it: In a database test (query) with Sequel, Unicorn 5.3.0 (three workers, single-threaded) outperforms Puma 3.9.1 (three workers, 86 threads/worker) by up to 60% (e.g. 3,454 req/sec cf. 2,152 at a wrk concurrency of 256) on my test system. That thread count is definitely oversized for this machine; when TFB R15-preview is available we can see the results on larger hardware. However, this is consistent with the results I've been seeing, and with the results published in TFB R14. Interestingly, single-threaded Puma outperforms Unicorn (cf. above at 4,117 req/sec). In a non-database test (json), a similar pattern can be observed: single-threaded Puma clocks in at 8,385 req/sec, Unicorn at 6,520, multi-threaded Puma at 5,856.
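For reference, the two Puma configurations compared above look roughly like this (a sketch reconstructed from the numbers in this comment, not the exact TFB config files):

```ruby
# config/puma.rb - multi-threaded variant: three workers, 86 threads each
workers 3
threads 86, 86

# Single-threaded variant: same worker count, one thread per worker
# workers 3
# threads 1, 1
```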
I'm currently testing Puma with a large, high-traffic Rails web-application, specifically evaluating multi-threaded workers as a mitigation against cascading failures caused by I/O latency spikes. To compare performance on my production workload, I deployed a 'canary' frontend running Puma v3.12.0 configured with min 0, max 4 threads alongside a fleet of 'baseline' frontends running Unicorn. Both configurations fork 40 worker-processes with the application preloaded and a connection-pool size of 5. In metrics captured by New Relic, I'm seeing ~2x higher average database-query latency, and significantly higher variation (std dev), using Puma compared to Unicorn. This query latency results in a noticeably more spiky 99%ile for DB-query transactions overall. I suspect the results I'm seeing are related to the benchmark results discussed in this issue; however, my use-case is a 'normal' production workload (mostly CPU-bound, servers kept under 50% CPU-utilization).
The reason for the database latency spikes is that while the thread that issued the query was waiting, another thread began to run. Only once the data from the query was returned did the original thread go back into the run queue. Because your traffic load is causing all threads to run, there is simply more perceived time waiting for the database to return, when what you're really seeing is thread-scheduling delay. The question is: are you ok with that? In the unicorn case, the worker process monopolizes the database connection all to itself, so when the database results come back, it begins to work immediately.
Sounds plausible that thread contention within a single worker process (made worse by Ruby's global VM lock, I assume) is behind the increased DB-query latency variation. Correct me if I'm mistaken, but I don't think the performance degradation in the multi-thread case is only a matter of traffic load being too high. Rather, it seems that requests don't get distributed across processes as evenly as in the single-thread case, causing some requests to encounter thread/GVL contention on an already-busy process while other processes on the server sit idle without any requests. Given my own use-case, I'm inclined to think that this scenario is not at all pathological, and may indeed be "quietly afflicting thousands of Ruby applications running under Puma." Further, I think the issue is something that can be corrected in Puma, but not necessarily easily. For comparison, Passenger has implemented least-busy-process-first request load balancing to address this issue. A code-comment in Passenger's
Does this sound like a plausible description of the issue underlying the performance degradation here? If so, then one possible fix would be to implement a similar load-balancer in Puma that accepts client connections and forwards them to individual worker threads in a least-busy, round-robin manner. There might also be a short-cut: Puma could distribute request load more evenly by adding a bit of cross-process synchronization to the existing thread auto-scaling behavior. I'm going to investigate this alternative a bit and see if I can come up with anything tangible.
It's certainly possible that is happening, though less so now, because Puma attempts not to accept new sockets it cannot work on (Line 399 in 5a7d884).
But that change doesn't solve the condition you're talking about, where one process keeps having 1 thread slot available and keeps filling it, while another process has 5 available and rarely fills them. Implementing a mechanism for having only the least-loaded worker accept new connections would require some sort of communication between the workers (there is none atm), or having the master process accept and then distribute the socket to a worker it chooses. Even in that latter case, the worker processes would have to communicate their load to the master so it can make the proper decision. I'll take a look at how Passenger solves this.
It appears that Passenger attempts to solve this problem by having the master process accept sockets and hand them out to the workers not based on load, but in simple round-robin order, with the hope that each request takes about the same amount of work. Ergo it's still possible for Passenger to hand sockets to backends that are not the most efficient choice. But knowing which one is the most efficient requires that master <-> worker communication about load.
I put together a first attempt at partially addressing this issue in PR #1646. Rather than adding extra communication/synchronization between the workers or proxying requests through the master, I found that non-idle workers can do a non-blocking poll on the socket for queued work as a form of indirect communication, keeping the PR relatively simple compared to the other alternatives. This seems to help balance load across processes slightly and gets much closer to single-thread performance in my tests, but the caveat of this simple implementation is that it's only effective when the average thread depth is <= 1. Balancing requests across higher thread-depths might see diminishing returns anyway, due to the GVL. @mwpastore I'd be interested in hearing if this PR has any impact on the benchmarks you've been checking in this issue. In my local tests against
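The idea can be sketched in plain Ruby (a hypothetical illustration, not the actual code from PR #1646; the 5ms back-off is an arbitrary value):

```ruby
require 'socket'

# A busy worker polls the listen socket without blocking; if connections are
# queued, it backs off briefly so an idle worker can accept them first.
def accept_with_backoff(listener, busy:)
  if busy
    readable, = IO.select([listener], nil, nil, 0) # zero-timeout poll
    sleep 0.005 if readable                        # yield to idle workers
  end
  listener.accept_nonblock(exception: false)       # socket or :wait_readable
end

listener = TCPServer.new('127.0.0.1', 0)
client   = TCPSocket.new('127.0.0.1', listener.addr[1])

conn = accept_with_backoff(listener, busy: true)
conn = accept_with_backoff(listener, busy: true) while conn == :wait_readable

conn.close
client.close
listener.close
```

The non-blocking `IO.select` on the shared listen socket is the "indirect communication": a worker can see that work is queued without talking to its siblings.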
One key detail about this TechEmpower Framework benchmark I didn't realize earlier: the wrk clients hold a fixed set of persistent keepalive connections open for the duration of the run. So changes in connection-balancing can only impact this benchmark by routing the initial connections across processes as evenly as possible. However, an optimal solution would be to move the Reactor (which buffers requests and manages keepalive connections) into the parent process (or a separate layer of workers), and then send individual, load-balanced HTTP requests across processes. An alternative solution would be to modify Puma to have each worker-process listen on a separate socket, and run an upstream proxy (nginx or haproxy) to manage the client connections and load-balance individual incoming requests across the set of Puma-worker sockets directly.
We're probably going to look a lot different on any benchmarks using
I'm probably going to close this with 5.0 and the various perf/balancing PRs that are going to be involved with that release.
Fixed in #2079. |
I maintain several of the Ruby framework samples for TechEmpower's FrameworkBenchmarks. The latest Round 14 Preview 1.1 numbers reflect very favorably on all the perf work that's been going on in the Ruby community over the past few months and years. However, there is an anomaly in Puma's numbers that may be worth digging into!
Context
Round 14 Preview 1.1 was executed on a machine with 80 hardware threads (and some ungodly amount of RAM). The wrk benchmark was executed against these instances with a concurrency of 256.
I'll be talking about YARV/MRI 2.4 throughout the rest of this report unless otherwise specified. (There is a JRuby/TorqueBox benchmark but it has some other undiagnosed performance issues.)
The Good
On most of the tests, Puma wins handily thanks to its increased concurrency, even on CPU-bound tasks. For example, JSON serialization under Sinatra:
And single SELECT queries against PostgreSQL under Roda/Sequel (note the FE column):
We can see in that last example that Puma pulls ahead despite the overhead of Sequel automatically checking connections out of and back into its pool. The route being called looks like this (Roda automatically serializes the resulting Hash object to JSON):
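Roughly (a sketch, since the original snippet did not survive extraction; the `world` table, `:json` plugin, and connection setup follow the TFB spec and may differ from the actual source):

```ruby
require 'roda'
require 'sequel'

DB = Sequel.connect(ENV.fetch('DATABASE_URL'))

class App < Roda
  plugin :json # serialize returned Hashes to JSON

  route do |r|
    r.get 'db' do
      # Sequel checks a connection out of its pool, runs one SELECT,
      # and checks the connection back in.
      DB[:world].first(id: rand(10_000) + 1)
    end
  end
end
```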
The Bad
However, the wheels start to come off as soon as we start executing more than one query in a route. For example, multiple SELECT queries against PostgreSQL under Roda/Sequel:
Oops! That's not good at all. Here's roughly what that code looks like:
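A sketch of the multiple-query route (the loop bound of 20 comes from the notes below; exact code may differ):

```ruby
r.get 'queries' do
  # 20 independent SELECTs; each iteration checks a connection out of
  # the pool, runs one query, and checks the connection back in.
  Array.new(20) { DB[:world].first(id: rand(10_000) + 1) }
end
```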
Note: I am able to reproduce this "anomaly" locally, and wrapping the loop in Sequel::Database#synchronize (to ensure a single database connection is checked out from the pool and used for all 20 queries) improves the performance of this route by 10%, leaving it still 30% behind Unicorn and Passenger on my machine.
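The synchronize experiment described in that note looks roughly like this (a sketch under the same assumptions as above):

```ruby
r.get 'queries' do
  # Hold a single connection for all 20 queries instead of checking
  # one out of the pool per query.
  DB.synchronize do
    Array.new(20) { DB[:world].first(id: rand(10_000) + 1) }
  end
end
```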
This performance disparity can't be explained by the overhead of Sequel's connection pool alone, as the prior single-query example shows. We can see it in ActiveRecord, too: multiple SELECT and UPDATE queries performed against MySQL under Sinatra/ActiveRecord:
Here's roughly the code for that route:
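A sketch of that Sinatra/ActiveRecord route (the `World` model, `randomnumber` column, and `json` helper follow the TFB spec; details may differ from the actual source):

```ruby
get '/updates' do
  worlds = Array.new(20) do
    # Each iteration performs one SELECT and one UPDATE through
    # ActiveRecord's connection pool.
    world = World.find(rand(10_000) + 1)
    world.update(randomnumber: rand(10_000) + 1)
    world
  end
  json worlds
end
```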
The Ugly
It seems like the issue manifests from a "tight" loop of queries occurring in a single request thread with a connection checked out from a pool. When the queries are distributed across multiple request threads, Puma plays nice with a connection pool. When Sequel or ActiveRecord are single-threaded, there are no performance issues whatsoever. That's true even for Puma: single-threaded Puma beats Unicorn and Passenger in these benchmarks (on my machine)!
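In a real application, a tight loop of single-row SELECTs like this can be collapsed into one round-trip (a Sequel sketch; the `world` table is an assumption from the benchmark):

```ruby
# One SELECT .. WHERE id IN (...) instead of 20 separate SELECTs.
ids = Array.new(20) { rand(10_000) + 1 }
DB[:world].where(id: ids).all
```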
This is particularly pernicious because it's exactly the use case where you would expect Puma's multithreading to help you, not hurt you: lots of IO-bound tasks. In the real world, we could easily solve this problem by rewriting these routes to perform a single `SELECT .. WHERE .. IN` query. However, I think these benchmarks raise an interesting question: Is this a pathological case that can't be fixed without changing the application, or does it demonstrate an edge case that could be quietly afflicting thousands of Ruby applications running under Puma?
Updates
(25-Mar) Interestingly, a `SELECT .. WHERE .. IN` query is also slower (fewer requests/second) on Puma: multi-threaded (1,443) cf. single-threaded (2,200). So the issue doesn't appear to be strictly related to a "tight" loop of IO-bound tasks.
(25-Mar) I was able to reproduce my findings under Ruby 2.2.6 and Puma 2.15.3: multi-threaded (256) cf. single-threaded (387). So the performance anomaly (if there truly is one) has been around for some time.
(25-Mar) Using a connection pool with fewer slots than Puma's thread depth seems to give decent compromise numbers. I can get within 20% of single-threaded performance using a connection pool size of `floor(2*ln(x))`, but that's just a wild guess at a formula based on what I'm seeing at a variety of Puma thread depths.