
Performance degradation with multithreading and connection pools #1254

Closed
mwpastore opened this issue Mar 24, 2017 · 21 comments

@mwpastore
Contributor

mwpastore commented Mar 24, 2017

I maintain several of the Ruby framework samples for TechEmpower's FrameworkBenchmarks. The latest Round 14 Preview 1.1 numbers reflect very favorably on all the perf work that's been going on in the Ruby community over the past few months and years. However, there is an anomaly in Puma's numbers that may be worth digging into!

Context

Round 14 Preview 1.1 was executed on a machine with 80 hardware threads (and some ungodly amount of RAM).

  • Puma, Unicorn, and Passenger were configured to run with 100 processes.
  • Puma was additionally configured to run with four threads (min and max) per process.
  • Under Puma, Sequel and ActiveRecord were configured to use connection pools with four slots.
  • Under Unicorn and Passenger, Sequel and ActiveRecord were configured to run single-threaded.
  • The wrk benchmark was executed against these instances with a concurrency of 256.
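
For reference, here is a minimal sketch of what that configuration amounts to. The real TFB setup autotunes these values per machine and lives in the linked repository; the file name and the DATABASE_URL environment variable below are assumptions.

# config/puma.rb -- sketch only; the actual TFB config autotunes these values
workers 100   # 100 worker processes on the 80-hardware-thread machine
threads 4, 4  # min and max threads per worker

# Elsewhere in the app, the Sequel pool is sized to match the thread count:
# DB = Sequel.connect(ENV['DATABASE_URL'], max_connections: 4)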

I'll be talking about YARV/MRI 2.4 throughout the rest of this report unless otherwise specified. (There is a JRuby/TorqueBox benchmark but it has some other undiagnosed performance issues.)

The Good

On most of the tests, Puma wins handily thanks to its increased concurrency, even on CPU-bound tasks. For example, JSON serialization under Sinatra:

[benchmark chart: sinatra-json]

And single SELECT queries against PostgreSQL under Roda/Sequel (note the FE column):

[benchmark chart: roda-sequel-postgres-db]

We can see in that last example that Puma pulls ahead despite the overhead of Sequel automatically checking connections out of and back into its pool. The route being called looks like this (Roda automatically serializes the resulting Hash object to JSON):

# Test type 2: Single database query
static_get '/db' do
  World.with_pk(rand(10_000).succ).values
end

The Bad

However, the wheels start to come off as soon as we start executing more than one query in a route. For example, multiple SELECT queries against PostgreSQL under Roda/Sequel:

[benchmark chart: roda-sequel-postgres-query]

Oops! That's not good at all. Here's roughly what that code looks like:

# Test type 3: Multiple database queries
static_get '/queries' do
  Array.new(20) do
    World.with_pk(rand(10_000).succ).values
  end
end

Note: I am able to reproduce this "anomaly" locally, and wrapping a Sequel::Database#synchronize around the loop (to ensure a single database connection is checked out from the pool and used for all 20 queries) improves the performance of this route by 10%, but it is still 30% behind Unicorn and Passenger on my machine.
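
For reference, a sketch of that workaround (assuming DB is the app's Sequel::Database handle, as in the TFB source):

# Test type 3 with the workaround: one connection held for all 20 queries
static_get '/queries' do
  DB.synchronize do
    Array.new(20) do
      World.with_pk(rand(10_000).succ).values
    end
  end
end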

This performance disparity can't be explained by the overhead of Sequel's connection pool as shown in the prior example of single-query performance. We can see it in ActiveRecord, too; multiple SELECT and UPDATE queries performed against MySQL under Sinatra/ActiveRecord:

[benchmark chart: sinatra-activerecord-update]

Here's roughly the code for that route:

# Test type 5: Database updates
get '/updates' do
  worlds =
    ActiveRecord::Base.connection_pool.with_connection do
      Array.new(20) do
        world = World.find(rand(10_000).succ)
        world.update(:randomnumber=>rand(10_000).succ)
        world.attributes
      end
    end

  json worlds
end

The Ugly

It seems like the issue stems from a "tight" loop of queries occurring in a single request thread with a connection checked out from a pool. When the queries are distributed across multiple request threads, Puma plays nice with a connection pool. When Sequel or ActiveRecord is single-threaded, there are no performance issues whatsoever. That's true even for Puma: single-threaded Puma beats Unicorn and Passenger in these benchmarks (on my machine)!

This is particularly pernicious because it's exactly the use case where you would expect Puma's multithreading to help you, not hurt you: lots of IO-bound tasks. In the real world, we could easily solve this problem by rewriting these routes to perform a single SELECT .. WHERE .. IN query. However, I think these benchmarks raise an interesting question: Is this a pathological case that can't be fixed without changing the application, or does it demonstrate an edge case that could be quietly afflicting thousands of Ruby applications running under Puma?
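
For illustration, a hedged sketch of that rewrite in Sequel (it changes the application, so it would fall outside the TFB test rules, and duplicate random ids would yield fewer than 20 rows):

# One SELECT .. WHERE id IN (...) instead of 20 separate queries
static_get '/queries' do
  ids = Array.new(20) { rand(10_000).succ }
  World.where(:id => ids).all.map(&:values)
end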

Updates

  1. (25-Mar) Interestingly, a SELECT .. WHERE .. IN is also slower (fewer requests/second) on Puma—multi-threaded (1,443) cf. single-threaded (2,200). So the issue doesn't appear to be strictly related to a "tight" loop of IO-bound tasks.

  2. (25-Mar) I was able to reproduce my findings under Ruby 2.2.6 and Puma 2.15.3—multi-threaded (256) cf. single-threaded (387)—so the performance anomaly (if there truly is one) has been around for some time.

  3. (25-Mar) Using a connection pool with fewer slots than Puma's thread depth seems to give decent compromise numbers. I can get within 20% of single-threaded performance using a connection pool size of floor(2*ln(x)), but that's just a wild guess at a formula based on what I'm seeing at a variety of Puma thread depths.
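
To make that guess-formula concrete, a small sketch (MAX_THREADS and DATABASE_URL are assumed environment variables; x is the Puma thread depth):

require 'sequel'

# floor(2*ln(x)): e.g. 4 threads -> pool of 2, 16 threads -> pool of 5
threads   = Integer(ENV.fetch('MAX_THREADS', 4))
pool_size = [(2 * Math.log(threads)).floor, 1].max  # clamp so a single thread still gets a slot
DB = Sequel.connect(ENV['DATABASE_URL'], max_connections: pool_size)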

@jrafanie

@mwpastore Sorry, I didn't visit your links... Did you share your puma config (specifically the min and max values for the threads)? Also, what is the rails connection pool configuration? It's very interesting.

@jrafanie

@mwpastore is it easy to run the benchmark against a different adapter such as PostgreSQL? I'm curious if you'd see the exact same behavior... I guess I would assume so...

I don't have any answers but am curious to know what's happening here.

@mwpastore
Contributor Author

@jrafanie Tests were run under Puma, Unicorn, and Passenger, with Sequel and ActiveRecord, against MySQL and PostgreSQL. All the configuration is available at the linked GitHub repository, and all the benchmark data is available at the linked Round 14 Preview 1.1 results.

@evanphx
Member

evanphx commented Mar 29, 2017

@mwpastore Since you use that autotune script to set the workers and threads, I can't know their exact values on the machine. Could you confirm what they were? You said 100 workers and 4 threads per worker, is that correct?

@mwpastore
Contributor Author

mwpastore commented Mar 29, 2017 via email

@evanphx
Member

evanphx commented Mar 29, 2017

@mwpastore Ok. I suspect the issue isn't really something that puma can correct, sadly. What I believe is happening is that the workload is causing ruby threads to fight for the GVL because there is enough work being done in ruby, especially if the queries don't really interleave but rather all return results at the same time. In that case, all 4 threads would be asleep, waiting on IO, then all wake up at the same time to process the results.

Because wrk does not simulate a "normal" traffic pattern but just issues requests as fast as possible, you'll see results like this. You might consider trying to generate staggered requests, which better represent a normal traffic pattern. That situation should be better.

@mwpastore
Contributor Author

mwpastore commented Mar 31, 2017

@evanphx That's a pleasantly simple and elegant explanation of this issue, and hearing it from the master himself definitely eases my mind. 🙇 I was starting to question my sanity, or at least question my understanding of how computers work!

I will try to reproduce the anomaly using Siege instead of wrk and then fiddle with Siege's request staggering option to see if it "solves" the "problem." If it does, maybe I can try convincing wg to add a similar feature to wrk, or failing that, try convincing TechEmpower to use a different benchmarking tool (which isn't likely since they rely pretty heavily on wrk's Lua feature).

Thank you! Is it alright if we leave this issue open while I keep poking at it?

@evanphx
Member

evanphx commented Mar 31, 2017

Absolutely, leave it open. It's good for documentation. I'll try to think of a way to detect and verify my explanation as well.

@nateberkopec
Member

@mwpastore Couldn't TechEmpower just delay requests to create a more realistic arrival pattern?

@nateberkopec
Member

@mwpastore I'm interested to see how the threadpool changes in 3.9 affect our benchmark results

@mwpastore
Contributor Author

@nateberkopec Running some quick benchmarks just to get a feel for it:

In a database test (query) with Sequel, Unicorn 5.3.0 (three workers, single-threaded) outperforms Puma 3.9.1 (three workers, 86 threads/worker) by up to 60% (e.g. 3,454 req/sec cf. 2,152 at a wrk concurrency of 256) on my test system. That thread count is definitely oversized for this machine; when TFB R15-preview is available we can see the results on larger hardware. However, this is consistent with the results I've been seeing, and the results published in TFB R14. Interestingly, single-threaded Puma outperforms Unicorn (cf. above at 4,117 req/sec).

In a non-database test (json), a similar pattern can be observed. Single-threaded Puma clocks in at 8,385 req/sec, Unicorn at 6,520, multi-threaded Puma at 5,856.

@wjordan
Contributor

wjordan commented Sep 11, 2018

I'm currently testing Puma with a large, high-traffic Rails web-application, specifically for multi-threaded workers as a mitigation against cascading failures caused by I/O latency spikes. To compare performance on my production workload, I deployed a 'canary' frontend running Puma v3.12.0 configured with min 0, max 4 threads alongside a fleet of 'baseline' frontends running Unicorn. Both configurations fork 40 worker-processes with application preloaded and a connection-pool size of 5.
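
For reference, a sketch of that canary configuration (the file layout and comments are mine; only the values called out above come from the actual setup):

# config/puma.rb (canary) -- sketch of the settings described above
workers 40        # 40 forked worker processes
threads 0, 4      # min 0, max 4 threads per worker
preload_app!      # application preloaded before forking
# Rails database.yml: pool: 5  (per process, shared by up to 4 threads)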

I'm seeing ~2x higher average database-query latency, and significantly higher variation (std dev) using Puma compared to Unicorn, within metrics captured by New Relic:

Unicorn: [New Relic chart]

Puma: [New Relic chart]

This query latency results in a noticeably more spiky 99%ile for DB-query transactions overall:

Unicorn: [New Relic chart]

Puma: [New Relic chart]

I suspect the results I'm seeing are related to the benchmark results discussed in this issue; however, my use case is a 'normal' production workload (mostly CPU-bound, servers kept under 50% CPU utilization).

@evanphx
Member

evanphx commented Sep 11, 2018

The reason for the database latency spikes is that while the thread that issued the query was waiting, another thread began to run, and only once the data from the query was returned did the original thread go back into the run queue. Because your traffic load is causing all threads to run, there is just more perceived time waiting for the database to return, when really what you're seeing is thread scheduling delay. The question is: are you ok with that?

In the unicorn case, the worker process has the database connection all to itself, so when the database results come back, it begins working immediately.

@wjordan
Contributor

wjordan commented Sep 11, 2018

Sounds plausible that thread contention within a single worker process (made worse by Ruby's global VM lock, I assume) is behind the increased DB-query latency variation.

Correct me if I'm mistaken, but I don't think the performance degradation in the multi-threaded case is only a matter of traffic load being too high. Rather, it seems that requests don't get distributed across processes as evenly as in the single-threaded case, causing some requests to encounter thread/GVL contention on an already-busy process while other processes on the server sit idle without any requests.

Given my own use-case, I'm inclined to think that this scenario is not at all pathological, and may indeed be "quietly afflicting thousands of Ruby applications running under Puma." Further, I think the issue is something that can be corrected in Puma, but not necessarily easily.

For comparison, Passenger has implemented least-busy-process-first request load balancing to address this issue. A code-comment in Passenger's AcceptLoadBalancer implementation describes the issue and its solution:

Furthermore, it can also be very easy for threads to become unbalanced. If a burst of clients connect to the server socket, then it is very likely that a single Server accepts all of those clients. This can result in situations where, for example, thread 1 has 40 clients and thread 2 has only 3.
The AcceptLoadBalancer solves this problem by being the sole entity that listens on the server socket. All client sockets that it accepts are distributed to all registered Server objects, in a round-robin manner.

Does this sound like a plausible description of the issue underlying the performance degradation here? If so, one possible fix would be to implement a similar load balancer in Puma that accepts client connections and forwards them to individual worker threads in a least-busy, round-robin manner.
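
To make the pattern concrete, here is a minimal, self-contained Ruby sketch of the "sole listener, round-robin hand-off" idea (not Passenger's or Puma's actual code, and no real HTTP handling): the parent accepts every connection and passes each accepted socket to a forked worker over a UNIX socket pair.

require 'socket'

listener = TCPServer.new('0.0.0.0', 9292)

workers = 4.times.map do
  parent_io, child_io = UNIXSocket.pair
  fork do
    parent_io.close
    loop do
      client = child_io.recv_io(TCPSocket)  # receive the descriptor from the parent
      client.readpartial(4096)              # toy: read (and ignore) the request
      client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
      client.close
    end
  end
  child_io.close
  parent_io
end

# Sole-listener accept loop: each connection goes to the next worker in turn.
loop do
  client = listener.accept
  worker = workers.shift
  worker.send_io(client)   # pass the file descriptor to the chosen worker
  client.close             # close the parent's copy; the worker keeps its own
  workers.push(worker)
end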

There might also be a short-cut possible for Puma to distribute request load more evenly by adding a bit of cross-process synchronization to the existing thread auto-scaling behavior. I'm going to investigate this alternative a bit and see if I can come up with anything tangible.

@evanphx
Member

evanphx commented Sep 11, 2018

It's certainly possible that is happening, though less so because puma now attempts to not accept new sockets it cannot work on (pool.wait_until_not_full).

But that change doesn't solve the condition you're talking about, where one process keeps having 1 thread slot available and keeps filling it, while another process has 5 available and rarely fills them.

Implementing a mechanism for having only the least-loaded worker accept new connections would require some sort of communication between the workers (there is none atm), or having the master process accept and then distribute the socket to a worker it chooses. Even in that latter case, the worker processes would have to communicate their load to the master so it can make the proper decision.

I'll take a look at how Passenger solves this.

@evanphx
Member

evanphx commented Sep 11, 2018

It appears that Passenger attempts to solve this problem by having the master process accept sockets and hand them out to the workers, not based on load but in a simple round-robin fashion, with the hope that each request takes the same amount of work. Ergo it's still possible for Passenger to hand a socket to a backend that is not the most efficient choice. But knowing which one is most efficient requires that master <-> worker communication about load.

@wjordan
Contributor

wjordan commented Sep 12, 2018

I put together a first attempt at partially addressing this issue in PR #1646.

Rather than adding extra communication/synchronization between the workers or proxying requests through the master, I found that non-idle workers can do a non-blocking poll on the socket for queued work as a form of indirect communication, keeping the PR relatively simple compared to other alternatives.

This seems to help balance load across processes slightly and gets much closer to single-thread performance in my tests, but the caveat of this simple implementation is that it's only effective when the average thread depth is <= 1. Balancing requests across higher thread-depths might result in diminishing returns anyway, due to the GVL.
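
Roughly, the shape of the idea (a simplified sketch, not the PR's actual code; listener and busy_threads are stand-ins for Puma's internal state):

require 'socket'

# Each worker inherits the same listener socket. An idle worker blocks in
# accept as usual; a busy worker only does a zero-timeout poll, so it picks
# up a queued connection without ever blocking for one.
def try_accept(listener, busy_threads)
  if busy_threads.zero?
    listener.accept                              # idle: block until a client arrives
  elsif IO.select([listener], nil, nil, 0)       # busy: non-blocking poll
    listener.accept_nonblock(exception: false)   # may return :wait_readable if another worker won the race
  end
end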

@mwpastore I'd be interested in hearing if this PR has any impact on the benchmarks you've been checking in this issue. In my local tests against roda-sequel-postgres I'm seeing some improvement.

@wjordan
Contributor

wjordan commented Dec 18, 2019

One key detail about this TechEmpower Framework benchmark I didn't realize earlier: the wrk performance tests use keepalive on their HTTP connections. In Puma, incoming connections are routed to a worker process when first accepted and are stuck on that process until closed.

So changes in connection-balancing can only impact this benchmark by routing the initial connections across processes as evenly as possible. However, an optimal solution would be to move the Reactor (which buffers requests and manages keepalive connections) into the parent process (or a separate layer of workers), and then send individual, load-balanced HTTP requests across processes.

An alternative solution would be to modify Puma to have each worker process listen on a separate socket, and run an upstream proxy (nginx or haproxy) to manage the client connections and load-balance individual incoming requests across the set of Puma worker sockets directly.

@nateberkopec
Member

We're probably going to look a lot different on any benchmarks using wrk after 4.3.1 and the patch it included, as well.

@nateberkopec
Member

I'm probably going to close this with 5.0 and the various perf/balancing PRs that are going to be involved with that release.

@nateberkopec nateberkopec added this to the 5.0.0 milestone Mar 5, 2020
@nateberkopec
Member

Fixed in #2079.
