Puma does not prefer idle-workers in cluster mode #2078
Comments
Any other ideas what we could do to make it better? :)
Thanks. This slipped past me. Yes, it describes almost the same problem.
OK @dentarg. Re-visiting #1920: it does not seem to solve the idle-worker problem, as it assumes that new connections are delayed before being accepted until there's at least one thread available. I will run #1920 against my performance testing to see exactly if and when it helps. I did run it, and posted a comment in #1920 (comment). The solution proposed there does not really solve the problem defined here.
Great writeup @ayufan. I recently switched from Unicorn to Puma on TMDb (decent traffic - 16,000+ r/s) and noticed the issues described in #1254, but didn't take the time to put everything together like you did here. I just took the utilization/latency hit and moved on. It would be great if this could improve. I made the choice to stick with Puma for reasons that outweighed the difference I noticed. But I'd sure be happy to see it improve.
@travisbell Could you test this PoC ayufan-research@e069902? I have not yet gotten around to running it in production, but I was able to reproduce the issue locally and can see that this should improve things.
I created #2079, which implements the above proposal.
It's an interesting solution to the problem. You're literally sleep-sorting Puma workers, which is diabolical, but if it works, it works, right? I'll ask @evanphx what he thinks.
@nateberkopec I merged that into GitLab. We will be testing it in production in a few days. We had a great run in staging and canary showing that the patch helps significantly. Once I have data I will post it here with more explanation and a comparison of the different modes. I'm waiting for that before doing more work on this PR.
Love it. Eager to hear back. |
I'm closing this issue as we have PR #2079, with the full description moved there.
One thing I came across was this LWN article on SO_REUSEPORT, specifically:
That seems to match the symptoms we saw here. I wonder if |
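For context, this is roughly how a listening socket opts into SO_REUSEPORT in Ruby (a standalone illustration, not Puma's code; the port is arbitrary):

```ruby
require "socket"

# Each worker process creates its own listening socket on the same port;
# the kernel then distributes incoming connections between the listeners.
sock = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM)
sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_REUSEPORT, 1) # Linux 3.9+
sock.bind(Addrinfo.tcp("0.0.0.0", 9292))
sock.listen(Socket::SOMAXCONN)
client, _addr = sock.accept # blocks until the kernel routes a connection here
```

With one such socket per worker, the kernel's distribution policy, rather than an accept(2) race on a single shared socket, decides which worker gets each connection, which could explain the symptoms described in the article.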
Describe the bug
We at GitLab.com are production-testing Puma in preparation for switching to it.
However, we noticed some oddities when running Puma vs Unicorn in mostly similar configurations.
A Puma worker running in cluster mode has no knowledge of its siblings and their utilisation. This results in some connections being accepted by sub-optimal workers that are already processing requests. The effect is statistically significant: in both my local testing and GitLab.com testing of our workload, we see an increase of around 20-30% in durations at P60 (the 60th percentile), which is a significant increase.
Capacity of Puma
The capacity of the Puma web-server is defined by `workers * threads`. Each of these forms a slot that can accept a new request. This means that each slot can accept a new request at random (effectively round-robin).
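For reference, a minimal `config/puma.rb` sketch for a two-worker, two-thread setup (the port is illustrative):

```ruby
# config/puma.rb: a two-worker, two-thread configuration.
# Total capacity: workers * threads = 2 * 2 = 4 request slots.
workers 2     # forked worker processes, each with its own GVL
threads 2, 2  # min and max threads per worker
port 9292     # illustrative port
```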
Now, what happens if we have two requests waiting to be processed, two workers, and two threads per worker?

- Ideally (assuming no worker is currently processing a request, which we do not control), the two requests should be assigned to two separate workers.
- However, it is plausible that the two requests will be assigned in a sub-optimal way: to a single worker, but on multiple threads.
- In that case their processing time is increased by the noisy-neighbour effect of the Ruby MRI GVL (illustrated below).
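The GVL effect in that last point can be demonstrated in isolation. This toy benchmark (a sketch, not from the issue) shows that two CPU-bound jobs on two MRI threads take roughly as long as running them back to back:

```ruby
require "benchmark"

def cpu_work
  x = 0
  5_000_000.times { x += 1 }
  x
end

# Two jobs run back to back on one thread...
sequential = Benchmark.realtime { 2.times { cpu_work } }

# ...versus two threads "in parallel": on MRI the GVL serializes them,
# so the wall-clock time is roughly the same, not halved.
threaded = Benchmark.realtime do
  [Thread.new { cpu_work }, Thread.new { cpu_work }].each(&:join)
end

puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```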
Interestingly, the same latency impact is present in `Sidekiq`; we just don't see it, as we do not care as much about the real-time aspect of background processing. However, the scheduling changes could improve `Sidekiq` performance as well if we targeted it too. This is described in more detail here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173
Our Configuration
We run Puma and Unicorn in the following configurations:

- `W16/T2`: the `16` corresponds to the 16 CPUs available; the nodes do not process anything else,
- `W30`: the `30` seems like an artificial factor chosen to achieve some level of node saturation.

We did notice oddness in Puma scheduling: it sometimes hit workers that were already processing requests. This resulted in increased latency and request-processing duration, even though we had plenty of spare capacity on other workers that were simply idle.

The nodes are configured like that today, due to the graceful restart/overwrite/shutdown behaviour of Unicorn, to keep spare capacity to handle as much as twice the regular `W30` load. They are highly underutilized on average because of that, but this is something to change.

Data
The exact data is presented here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247824814, with a description of what we are seeing.
Expected behavior: Ideal-scheduling algorithm in multi-threaded scenario
In an ideal scenario we would always want to use our capacity efficiently: new requests should go to idle workers first, and only stack onto additional threads of busy workers when no idle worker is available.

Currently, Puma does none of this. With round-robin we pay roughly a 20% performance penalty due to sub-optimal request assignment.

It is expected that the closer we get to 100% CPU usage, the less of a problem Puma's lack of scheduling becomes, as the node is saturated and threads are balanced by the kernel.
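For illustration, a toy sketch of such an idle-worker-first policy (the names are hypothetical; Puma has no central scheduler, since each worker races to accept on its own):

```ruby
# Toy model of an idle-worker-first assignment policy. Worker and
# pick_worker are illustrative names, not Puma internals.
Worker = Struct.new(:id, :busy_threads, :max_threads) do
  def idle_threads
    max_threads - busy_threads
  end
end

# Prefer the worker with the most idle threads; ties go to whichever
# comes first (effectively round-robin among equally busy workers).
def pick_worker(workers)
  workers.max_by(&:idle_threads)
end

workers = [Worker.new(0, 2, 2), Worker.new(1, 0, 2)]
puts pick_worker(workers).id # => 1, the fully idle worker
```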
Workaround or Solution?
I started exploring the simplest solution to this in https://gitlab.com/gitlab-org/gitlab/issues/36858.

It adds a very small sleep to busy workers. A busy worker delays accepting from the socket, giving idle workers time to accept first; they should be fairly quick to respond in such cases, as they are not processing anything significant.

This has surprising results: it seems to remove the performance penalty by preferring an even distribution across workers first, instead of today's round-robin across all workers and all threads. This is described in more detail here: https://gitlab.com/gitlab-org/gitlab/issues/36858#note_247918506.
This is a proof of concept: ayufan-research@e069902.
It seems that a delay as small as `1ms` does the trick. It is not exactly `1ms` in practice: due to Ruby and kernel tick granularity, the sleep can be as long as 10ms. However, it is significant enough to make an idle worker always the first responder. We should not really lose any performance, as the busy worker that is sleeping is already processing a request, so it is doing something useful.
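A minimal sketch of the idea, simplified from the PoC commit (the method name, `pool_busy` flag, and delay constant are illustrative, not Puma internals):

```ruby
BUSY_WORKER_DELAY = 0.001 # ~1ms nominal; clock ticks can stretch it to ~10ms

# Called in each worker's accept loop. A worker whose threads are all busy
# backs off briefly, so an idle sibling (which skips the sleep) is almost
# always first to win the accept(2) race on the shared listen socket.
def accept_preferring_idle(listener, pool_busy)
  sleep(BUSY_WORKER_DELAY) if pool_busy
  # Returns a connection, or :wait_readable if another worker won the race.
  listener.accept_nonblock(exception: false)
end
```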
Some non-scientific tests are here: https://docs.google.com/spreadsheets/u/2/d/1y2YqrPPgZ-RtjKiCGplJ7YkwjMXkZx6vEPq7prFbUn4/edit#gid=0
Test script is here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173