
Puma does not prefer idle-workers in cluster mode #2078

Closed
ayufan opened this issue Nov 21, 2019 · 12 comments

ayufan commented Nov 21, 2019

Describe the bug

We at GitLab.com are production-testing Puma in preparation for switching to it. However, we noticed a few oddities when running Puma vs Unicorn in mostly similar configurations.

A Puma worker running in cluster mode has no knowledge of its siblings and their utilisation. As a result, some sockets are accepted by sub-optimal workers that are already processing requests. This is statistically significant: in my local testing, and in GitLab.com testing of this workload, we see an increase of around 20-30% in durations at P60, which is a significant increase.

Capacity of Puma

The capacity of the Puma web server is defined by workers * threads. Each of these forms a slot that can accept a new request, and each slot can accept a new request at random (effectively round-robin).

Now, what happens if we have 2 requests waiting to be processed, two workers, and two threads? (A tiny simulation of this scenario follows the list.)

  1. Our total capacity is 4 (W2 * T2).
  2. Each request can be assigned to any worker at random, even a worker that is currently processing a request, as we do not control that.
  3. This means that if we have 2 requests, in the ideal scenario they should be assigned to two separate workers for optimal performance.
  4. Puma does not implement any mechanism for 3.; instead, first to accept wins. It is therefore plausible that the two requests will be assigned in a sub-optimal way: to a single worker, but multiple threads.
  5. If the two requests are processed by the same worker, they share CPU time, so their processing time is increased by the noisy neighbour due to the Ruby MRI GVL.
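Here is a tiny back-of-the-envelope simulation (not Puma code; the slot model is an assumption made just for illustration) showing how often random slot assignment puts both requests on the same worker in the W2/T2 case:

```ruby
# Model W2/T2 as four slots, two per worker, and assign 2 queued requests
# to random distinct slots, counting how often both land on the same worker.
workers = 2
threads = 2
slots = (0...workers).flat_map { |w| [w] * threads } # => [0, 0, 1, 1]

trials = 100_000
same_worker = trials.times.count do
  a, b = slots.sample(2) # two distinct slots picked at random
  a == b
end

puts "both requests on the same worker: #{(100.0 * same_worker / trials).round(1)}%"
# With W2/T2 this happens for 2 of the 6 possible slot pairs, i.e. ~33%.
```

So roughly a third of the time both requests end up sharing one worker's GVL even though an idle worker is available.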

Interestingly, the same latency impact is present in Sidekiq; we just don't see it, as we do not care as much about the real-time aspect of background processing.

However, the scheduling changes could improve Sidekiq performance too, if we targeted Sidekiq as well.

This is described in more detail here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173

Our Configuration

We run Puma and Unicorn in the following configurations (a sketch of the Puma config follows the list):

  1. Puma runs with W16/T2; the 16 corresponds to the 16 CPUs available, and the nodes do not process anything else,
  2. Unicorn runs with W30; the 30 seems to be an artificial factor chosen to reach some sort of node saturation,
  3. The nodes are lightly utilised, peaking at around 45% CPU usage,
  4. We seem to process around 70-80 req/sec.
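For reference, a sketch of the Puma side of that shape (just the numbers; all other settings omitted and the file path is an assumption):

```ruby
# config/puma.rb: the W16/T2 shape described above
workers 16      # one worker per CPU core on the 16-core nodes
threads 2, 2    # 2 threads per worker, so 32 request slots in total
```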

We noticed oddness in Puma scheduling: it sometimes hit workers that were already processing requests. This resulted in increased latency and request-processing duration, even though we had plenty of spare capacity on other workers that were simply idle.

The nodes are configured this way today, to support graceful restart/overwrite/shutdown of Unicorn, with spare capacity to handle as much as twice the regular W30. Because of that they are highly underutilised on average, but this is something we intend to change.

Data

The exact data are presented here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247824814 with the description of what we are seeing.

Expected behavior: an ideal scheduling algorithm in the multi-threaded scenario

In an ideal scenario we would always want to use our capacity efficiently (a rough sketch of 1. and 2. follows the list):

  1. Assign to free workers first, as they offer the best latency,
  2. Assign to the workers with the fewest other requests in flight, as they are the least busy,
  3. Assign to workers that are close to finishing the requests they are processing, as they will have free capacity soon (this is hard to implement, as it requires a fairly complex heuristic to estimate request completion time).
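A rough sketch of what 1. and 2. boil down to (a hypothetical helper, not Puma API; Puma workers do not currently share this state, which is exactly the problem):

```ruby
# Given per-worker counts of busy threads, prefer the least-loaded worker.
# Rules 1. and 2. collapse into "fewest in-flight requests wins"; rule 3.
# would need a completion-time estimate and is left out.
def pick_worker(busy_threads_per_worker)
  busy_threads_per_worker.min_by { |_worker_id, busy| busy }.first
end

pick_worker({ 0 => 2, 1 => 0, 2 => 1 }) # => 1, the idle worker
```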

Currently, Puma does none of this. For the round-robin behaviour we pay a performance penalty of around 20% due to sub-optimal assignment of request processing.

That said, the closer we get to 100% CPU usage, the less of a problem Puma's lack of scheduling is expected to be, as the node is saturated and threads are balanced by the kernel.

Workaround or Solution?

I started exploring the simplest solution to this in this issue:
https://gitlab.com/gitlab-org/gitlab/issues/36858.

It adds a very small sleep to busy workers. A busy worker thus delays accepting the socket and gives idle workers time to accept it first; they should be fairly quick to respond in such cases, as they are not processing anything significant.

This has surprising results: it seems to remove the performance penalty by preferring to distribute requests evenly across workers first, instead of the round-robin across all workers and all threads that happens today. This is described in more detail here: https://gitlab.com/gitlab-org/gitlab/issues/36858#note_247918506.

This is a proof of concept: ayufan-research@e069902.

It seems that a delay as small as 1ms does the trick. It is not really 1ms specifically, due to Ruby and kernel scheduler ticks, which can be as long as 10ms, but it is significant enough to make an idle worker always be the first responder. We should not really lose any performance, as the busy worker that is sleeping is already processing a request, so it is doing something. A minimal sketch of the idea follows.
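For illustration, a minimal runnable sketch of the idea (this is not the actual PoC code, see ayufan-research@e069902 for that; the port, the response, and the `busy` check are placeholders):

```ruby
require "socket"

DELAY_BEFORE_ACCEPT = 0.001 # ~1ms nominally; effectively longer due to scheduler ticks

server_socket = TCPServer.new(9292)
busy = -> { false } # placeholder: in Puma this would mean "all of this worker's threads are occupied"

loop do
  # Back off briefly only when this worker is already loaded, so that an idle
  # sibling blocked in accept wins the race for the next connection.
  sleep(DELAY_BEFORE_ACCEPT) if busy.call
  client = server_socket.accept
  client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
  client.close
end
```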

Some non-scientific tests are here: https://docs.google.com/spreadsheets/u/2/d/1y2YqrPPgZ-RtjKiCGplJ7YkwjMXkZx6vEPq7prFbUn4/edit#gid=0
Test script is here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173

ayufan commented Nov 21, 2019

Any other ideas on what we could do to make this better? :)

dentarg commented Nov 21, 2019

@ayufan have you seen #1920? (and #1254, #1646)

ayufan commented Nov 21, 2019

@dentarg

Thanks. This slipped past me. Yes, it describes almost the same problem.

ayufan commented Nov 21, 2019

OK @dentarg

Revisiting #1920, it does not seem to solve the idle-worker problem, as it assumes that new connections are delayed before being accepted until there is at least one thread available.

I will run #1920 through my performance tests to see exactly if and when it helps.

I did run it, and posted a comment in #1920 (comment). The proposed solution does not really solve the problem defined here.

travisbell commented Nov 21, 2019

Great writeup @ayufan. I recently switched from Unicorn to Puma on TMDb (decent traffic - 16,000+ r/s) and noticed the issues described in #1254 but didn't take the time to put everything together like you did here. I just took the utilization/latency hit and moved on.

It would be great if this could improve. I made the choice to stick with Puma for reasons that outweighed the difference I noticed. But I'd sure be happy to see it improve.

ayufan commented Nov 21, 2019

@travisbell Could you test this PoC: ayufan-research@e069902? I have not yet gotten around to running it in production, but I was able to reproduce the issue locally and see that this should improve things.

ayufan commented Nov 21, 2019

I created #2079, which implements the above proposal. Instead of using sleep, it uses a condition variable with a timeout, which is much better, as it gets signaled as soon as one of the threads finishes its work. A rough illustration is below.
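A rough sketch of that approach (names and timings are placeholders for illustration, not Puma internals or the actual #2079 code):

```ruby
mutex    = Mutex.new
not_busy = ConditionVariable.new

# Worker thread: after finishing a request, signal that capacity freed up.
worker = Thread.new do
  sleep 0.002 # pretend to finish a request
  mutex.synchronize { not_busy.signal }
end

# Accept loop of a fully-busy worker: wait up to 5ms before trying to accept,
# but wake up immediately if a thread signals completion.
mutex.synchronize { not_busy.wait(mutex, 0.005) }
worker.join
puts "would call accept now"
```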

nateberkopec (Member) commented:

It's an interesting solution to the problem. You're literally sleep-sorting Puma workers, which is diabolical, but if it works, it works, right?

I'll ask @evanphx what he thinks.

ayufan commented Dec 11, 2019

@nateberkopec I merged that into GitLab. We will be testing it in production in a few days. We had a great run in staging and canary showing that the patch helps significantly. Once I have data I will post it here with more explanation and a comparison of the different modes. I'm waiting for that before doing more work on this PR.

nateberkopec (Member) commented:

Love it. Eager to hear back.

ayufan commented Apr 22, 2020

I'm closing this issue, as we have PR #2079 with the full description moved there.

ayufan closed this as completed Apr 22, 2020
igorwwwwwwwwwwwwwwwwwwww commented:

One thing I came across was this LWN article on SO_REUSEPORT, specifically:

The problem with this technique, as Tom pointed out, is that when multiple threads are waiting in the accept() call, wake-ups are not fair, so that, under high load, incoming connections may be distributed across threads in a very unbalanced fashion. At Google, they have seen a factor-of-three difference between the thread accepting the most connections and the thread accepting the fewest connections; that sort of imbalance can lead to underutilization of CPU cores. By contrast, the SO_REUSEPORT implementation distributes connections evenly across all of the threads (or processes) that are blocked in accept() on the same port.

That seems to match the symptoms we saw here.

I wonder if SO_REUSEPORT would load balance accept more evenly. If so, it may make sense to re-evaluate #1307.
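For context, a sketch of what binding with SO_REUSEPORT can look like in Ruby (this is not current Puma behaviour; the idea is that each worker would open its own listener on the same port, with the port number here just an example, and the kernel would spread incoming connections across them):

```ruby
require "socket"

sock = Socket.new(:INET, :STREAM)
sock.setsockopt(:SOCKET, :REUSEPORT, true) # requires Linux >= 3.9
sock.bind(Addrinfo.tcp("0.0.0.0", 9292))
sock.listen(1024)
```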
