
Puma does not prefer idle-workers in cluster mode #2078

Closed
ayufan opened this issue Nov 21, 2019 · 12 comments

ayufan commented Nov 21, 2019

Describe the bug

We at GitLab.com are production-testing Puma in preparation for switching to it. However, we noticed a few oddities when running Puma vs Unicorn in mostly similar configurations.

A Puma worker running in cluster mode has no knowledge of its siblings and their utilisation. As a result, some sockets are accepted by sub-optimal workers that are already processing requests. This is statistically significant: in my local testing, and in GitLab.com testing of this workload, we see an increase of around 20-30% in durations at P60, which is a significant increase.

Capacity of Puma

The capacity of the Puma web server is defined by workers * threads. Each of these forms a slot that can accept a new request, and each slot can accept a new request at random (effectively round-robin).

Now, what happens if we have 2 requests waiting to be processed, two workers, and two threads? (A tiny simulation of this scenario follows the list.)

  1. Our total capacity is 4 (W2 * T2).
  2. Each request can be assigned to any worker at random, even a worker that is currently processing a request, as we do not control that.
  3. This means that if we have 2 requests, in the ideal scenario they should be assigned to two separate workers for optimal performance.
  4. Puma does not implement any mechanism for 3.; instead, first to accept wins. It is therefore plausible that the two requests will be assigned in a sub-optimal way: to a single worker, but multiple threads.
  5. If the two requests are processed by the same worker, they share CPU time, so their processing time is increased by the noisy neighbour due to the Ruby MRI GVL.
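Here is a tiny back-of-the-envelope simulation (not Puma code; the slot model is an assumption made just for illustration) showing how often random slot assignment puts both requests on the same worker in the W2/T2 case:

```ruby
# Model W2/T2 as four slots, two per worker, and assign 2 queued requests
# to random distinct slots, counting how often both land on the same worker.
workers = 2
threads = 2
slots = (0...workers).flat_map { |w| [w] * threads } # => [0, 0, 1, 1]

trials = 100_000
same_worker = trials.times.count do
  a, b = slots.sample(2) # two distinct slots picked at random
  a == b
end

puts "both requests on the same worker: #{(100.0 * same_worker / trials).round(1)}%"
# With W2/T2 this happens for 2 of the 6 possible slot pairs, i.e. ~33%.
```

So roughly a third of the time both requests end up sharing one worker's GVL even though an idle worker is available.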

Interestingly, the same latency impact is present in Sidekiq; we just don't see it, as we do not care as much about the real-time aspect of background processing.

However, the scheduling changes could improve Sidekiq performance too, if we targeted Sidekiq as well.

This is described in more detail here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173

Our Configuration

We run Puma and Unicorn in the following configurations (a sketch of the Puma config follows the list):

  1. Puma runs with W16/T2; the 16 corresponds to the 16 CPUs available, and the nodes do not process anything else,
  2. Unicorn runs with W30; the 30 seems to be an artificial factor chosen to reach some sort of node saturation,
  3. The nodes are lightly utilised, peaking at around 45% CPU usage,
  4. We seem to process around 70-80 req/sec.
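For reference, a sketch of the Puma side of that shape (just the numbers; all other settings omitted and the file path is an assumption):

```ruby
# config/puma.rb: the W16/T2 shape described above
workers 16      # one worker per CPU core on the 16-core nodes
threads 2, 2    # 2 threads per worker, so 32 request slots in total
```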

We noticed oddness in Puma scheduling: it sometimes hit workers that were already processing requests. This resulted in increased latency and request-processing duration, even though we had plenty of spare capacity on other workers that were simply idle.

The nodes are configured this way today, to support graceful restart/overwrite/shutdown of Unicorn, with spare capacity to handle as much as twice the regular W30. Because of that they are highly underutilised on average, but this is something we intend to change.

Data

The exact data are presented here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247824814 with the description of what we are seeing.

Expected behavior: an ideal scheduling algorithm in the multi-threaded scenario

In an ideal scenario we would always want to use our capacity efficiently (a rough sketch of 1. and 2. follows the list):

  1. Assign to free workers first, as they offer the best latency,
  2. Assign to the workers with the fewest other requests in flight, as they are the least busy,
  3. Assign to workers that are close to finishing the requests they are processing, as they will have free capacity soon (this is hard to implement, as it requires a fairly complex heuristic to estimate request completion time).
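A rough sketch of what 1. and 2. boil down to (a hypothetical helper, not Puma API; Puma workers do not currently share this state, which is exactly the problem):

```ruby
# Given per-worker counts of busy threads, prefer the least-loaded worker.
# Rules 1. and 2. collapse into "fewest in-flight requests wins"; rule 3.
# would need a completion-time estimate and is left out.
def pick_worker(busy_threads_per_worker)
  busy_threads_per_worker.min_by { |_worker_id, busy| busy }.first
end

pick_worker({ 0 => 2, 1 => 0, 2 => 1 }) # => 1, the idle worker
```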

Currently, Puma does none of this. For the round-robin behaviour we pay a performance penalty of around 20% due to sub-optimal assignment of request processing.

That said, the closer we get to 100% CPU usage, the less of a problem Puma's lack of scheduling is expected to be, as the node is saturated and threads are balanced by the kernel.

Workaround or Solution?

I started exploring the simplest solution to this in this issue:
https://gitlab.com/gitlab-org/gitlab/issues/36858.

It adds a very small sleep to busy workers. A busy worker thus delays accepting the socket and gives idle workers time to accept it first; they should be fairly quick to respond in such cases, as they are not processing anything significant.

This has surprising results: it seems to remove the performance penalty by preferring to distribute requests evenly across workers first, instead of the round-robin across all workers and all threads that happens today. This is described in more detail here: https://gitlab.com/gitlab-org/gitlab/issues/36858#note_247918506.

This is a proof of concept: ayufan-research@e069902.

It seems that a delay as small as 1ms does the trick. It is not really 1ms specifically, due to Ruby and kernel scheduler ticks, which can be as long as 10ms, but it is significant enough to make an idle worker always be the first responder. We should not really lose any performance, as the busy worker that is sleeping is already processing a request, so it is doing something. A minimal sketch of the idea follows.
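For illustration, a minimal runnable sketch of the idea (this is not the actual PoC code, see ayufan-research@e069902 for that; the port, the response, and the `busy` check are placeholders):

```ruby
require "socket"

DELAY_BEFORE_ACCEPT = 0.001 # ~1ms nominally; effectively longer due to scheduler ticks

server_socket = TCPServer.new(9292)
busy = -> { false } # placeholder: in Puma this would mean "all of this worker's threads are occupied"

loop do
  # Back off briefly only when this worker is already loaded, so that an idle
  # sibling blocked in accept wins the race for the next connection.
  sleep(DELAY_BEFORE_ACCEPT) if busy.call
  client = server_socket.accept
  client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
  client.close
end
```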

Some non-scientific tests are here: https://docs.google.com/spreadsheets/u/2/d/1y2YqrPPgZ-RtjKiCGplJ7YkwjMXkZx6vEPq7prFbUn4/edit#gid=0
Test script is here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173

ayufan commented Nov 21, 2019

Any other ideas on what we could do to make this better? :)

dentarg commented Nov 21, 2019

@ayufan have you seen #1920? (and #1254, #1646)

ayufan commented Nov 21, 2019

@dentarg

Thanks. This slipped past me. Yes, it describes almost the same problem.

ayufan commented Nov 21, 2019

OK @dentarg

Revisiting #1920, it does not seem to solve the idle-worker problem, as it assumes that new connections are delayed before being accepted until there is at least one thread available.

I will run #1920 through my performance tests to see exactly if and when it helps.

I did run it, and posted a comment in #1920 (comment). The proposed solution does not really solve the problem defined here.

travisbell commented Nov 21, 2019

Great writeup @ayufan. I recently switched from Unicorn to Puma on TMDb (decent traffic - 16,000+ r/s) and noticed the issues described in #1254 but didn't take the time to put everything together like you did here. I just took the utilization/latency hit and moved on.

It would be great if this could improve. I made the choice to stick with Puma for reasons that outweighed the difference I noticed. But I'd sure be happy to see it improve.

ayufan commented Nov 21, 2019

@travisbell Could you test this PoC: ayufan-research@e069902? I have not yet gotten around to running it in production, but I was able to reproduce the issue locally and see that this should improve things.

ayufan commented Nov 21, 2019

I created #2079, which implements the above proposal. Instead of using sleep, it uses a condition variable with a timeout, which is much better, as it gets signaled as soon as one of the threads finishes its work. A rough illustration is below.
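A rough sketch of that approach (names and timings are placeholders for illustration, not Puma internals or the actual #2079 code):

```ruby
mutex    = Mutex.new
not_busy = ConditionVariable.new

# Worker thread: after finishing a request, signal that capacity freed up.
worker = Thread.new do
  sleep 0.002 # pretend to finish a request
  mutex.synchronize { not_busy.signal }
end

# Accept loop of a fully-busy worker: wait up to 5ms before trying to accept,
# but wake up immediately if a thread signals completion.
mutex.synchronize { not_busy.wait(mutex, 0.005) }
worker.join
puts "would call accept now"
```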

nateberkopec (Member) commented:

It's an interesting solution to the problem. You're literally sleep-sorting Puma workers, which is diabolical, but if it works, it works, right?

I'll ask @evanphx what he thinks.

ayufan commented Dec 11, 2019

@nateberkopec I merged that into GitLab. We will be testing it in production in a few days. We had a great run in staging and canary showing that the patch helps significantly. Once I have data I will post it here with more explanation and a comparison of the different modes. I'm waiting for that before doing more work on this PR.

nateberkopec (Member) commented:

Love it. Eager to hear back.

ayufan commented Apr 22, 2020

I'm closing this issue, as we have PR #2079 with the full description moved there.

ayufan closed this as completed Apr 22, 2020
igorwwwwwwwwwwwwwwwwwwww commented:

One thing I came across was this LWN article on SO_REUSEPORT, specifically:

The problem with this technique, as Tom pointed out, is that when multiple threads are waiting in the accept() call, wake-ups are not fair, so that, under high load, incoming connections may be distributed across threads in a very unbalanced fashion. At Google, they have seen a factor-of-three difference between the thread accepting the most connections and the thread accepting the fewest connections; that sort of imbalance can lead to underutilization of CPU cores. By contrast, the SO_REUSEPORT implementation distributes connections evenly across all of the threads (or processes) that are blocked in accept() on the same port.

That seems to match the symptoms we saw here.

I wonder if SO_REUSEPORT would load balance accept more evenly. If so, it may make sense to re-evaluate #1307.
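For context, a sketch of what binding with SO_REUSEPORT can look like in Ruby (this is not current Puma behaviour; the idea is that each worker would open its own listener on the same port, with the port number here just an example, and the kernel would spread incoming connections across them):

```ruby
require "socket"

sock = Socket.new(:INET, :STREAM)
sock.setsockopt(:SOCKET, :REUSEPORT, true) # requires Linux >= 3.9
sock.bind(Addrinfo.tcp("0.0.0.0", 9292))
sock.listen(1024)
```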
