Inject small delay for busy workers to improve requests distribution #2079
Conversation
Tiny typo note
This along with the original Issue are well researched and explained, thanks so much for that! Would you be willing to test a suite of values for latency and tick so we can get a feel for how they influence the average response time? Could you run your test against a request that only lasts 150ms so we can get a feel for whether your patch causes a meaningful slowdown in it? If it does, one option would be to sample the average time to complete a job and feed it back into the wait algorithm so that it only ever attempts to sleep for, say, 1% or less of the average request time.
I'll work on some tests on GitLab.com to validate it :)
(force-pushed from 073e545 to 30f2206)
This approach seems promising. I'll try to run it on my production workload and the benchmark mentioned in #1646 (comment) to separately validate the improvements.
I'm particularly interested in a comparison against the approach in #1646, which I'm currently running in production (note that #1920 is incorrectly implemented). If I understand correctly, the two main differences between this approach and the one in #1646 are:
- Each worker calls `@not_full.wait` after `IO.select` has completed (when an incoming request is available on the socket) and before `accept_nonblock` is called. By comparison, Balance incoming requests across processes #1646 calls `wait` after `accept_nonblock` and before `IO.select` is called again.
- `wait` is called a number of times equal to the current number of requests queued up on the worker, which implicitly sorts the workers so that the one with the shortest request queue should accept the pending request first. By comparison, Balance incoming requests across processes #1646 waits whenever no requests are immediately available.
I suspect this approach will do better (particularly in heavily-loaded scenarios at higher thread-depths), but it would be good to have data confirmation since the approaches are a bit different.
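To make the ordering concrete, here is a rough sketch of a simplified accept loop with this PR's wait in place. This is not Puma's actual code: the socket setup is illustrative, and `pool` is assumed to be a worker-local object exposing the hooks mentioned in the diff.

```ruby
# Simplified, hypothetical accept loop for one worker (not Puma's real code).
require 'socket'

server_socket = TCPServer.new(9292)

loop do
  IO.select([server_socket])                # wake up when a request is pending
  pool.wait_until_not_full                  # existing back-pressure check
  pool.wait_for_less_busy_worker(0.005)     # this PR: busy workers pause up to ~5ms here
  begin
    client = server_socket.accept_nonblock  # the least-busy worker wins the accept
    pool << client
  rescue IO::WaitReadable
    # another worker accepted the connection first; go back to select
  end
end
```

By contrast, #1646 places its wait after `accept_nonblock` and before the next `IO.select`, as described above.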
Yes, this approach particularly targets idle or semi-idle workers. I did not see the other one provide big benefits for our workloads and configuration. The other one just prevents workers from picking up new connections if they are already fully utilised (it seems), so the other one might be beneficial as well.
Let me check my notes @wjordan, but I believe I was also testing under different expectations :) I was testing #1920: https://docs.google.com/spreadsheets/d/1y2YqrPPgZ-RtjKiCGplJ7YkwjMXkZx6vEPq7prFbUn4/edit#gid=0. It showed no effect in my tests.
I tested out this PR on the Sinatra-MySQL TechEmpower benchmark mentioned in #1646 (comment):
Results (request/sec): 1 thread: 3917.86 5 threads:
Thanks @wjordan. Let me check what happens with your changes when I run against my testing suite :) I believe we test different scenarios:
We might basically need both changes.
Results were collected for three configurations: Puma baseline without patches, Puma with PR 2079, and Puma with PR 1646.
I also put the results in: https://docs.google.com/spreadsheets/d/1y2YqrPPgZ-RtjKiCGplJ7YkwjMXkZx6vEPq7prFbUn4/edit#gid=1614859337. I picked the best result recorded for each requests/concurrency combination. I deliberately tested against GitLab, as it has many layers. It is hard to achieve a throughput of 4k rq/sec, but that is not the intent either. I optimise towards the fastest processing time, not the highest throughput. Higher throughput can be achieved with more workers, but processing time is a function of the number of threads.
We are running Puma on
We are discussing that in: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_277452635. The results from today's run will be posted tomorrow, but so far the distributions look pretty much equal between Puma and Unicorn. It means that threading does not have an effect on request processing for us.
@nateberkopec @olleolleolle @wjordan It seems that this simple change has a big impact for us: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_279516488. Doing extensive tests on
We are now discussing doing additional testing to compare at a big scale: Puma without the patch, Puma with the patch, and Unicorn for a limited time (likely 1h, to reduce impact on customers). I will look more into understanding @wjordan's performance measurements. I need to understand the objective that we are optimising towards.
I'm caught up on your thread on GitLab now, but it looks like you still have some measurement issues re: the patch's effectiveness, correct?
I would say not really; comparing one node against another just gives wrong results for various reasons, like traffic distribution or variability in performance between nodes. This was also the reason why we started running at 23% of the fleet, to reduce this variability. Anyway, we will do another set of disables to get bigger sample sets. It now seems that the impact is substantial, and it makes all execution durations pretty consistent with Unicorn (at least for us).
As of today GitLab.com runs fully on Puma with this patch. Issue for the configuration change: https://gitlab.com/gitlab-com/gl-infra/production/issues/1684. I will now spend some time comparing and figuring out which solution to propose for upstreaming :) Thanks for your help!
@nateberkopec I asked our production team to change the params of the old patch: https://gitlab.com/gitlab-org/gitlab/-/issues/196002#note_330784539. The old patch is effectively the same, as we can control all aspects of it by configuring params. This PR is a simplified version with these values hardcoded.
Some results from running the benchmark (included): https://docs.google.com/spreadsheets/d/1y2YqrPPgZ-RtjKiCGplJ7YkwjMXkZx6vEPq7prFbUn4/edit#gid=1510215191. The exact numbers are fully dependent on the workload; this hopefully shows the interleaved workflow to some extent.
@nateberkopec WDYT? What would be the next steps? :)
I'm gonna take some time this weekend to review 👍 The code change is simple, so the main thing is figuring out the impact and benchmarking.
Agreed, that's a tall order. We could publish this in a pre-release and then solicit people to try it out, though finding the people and getting meaningful feedback there has always been an issue. We can promote via Twitter and Slack, etc.; it's better than nothing.
We can make it disabled by default, or even make the delay adaptable. Adjusting the default later allows this to be implicitly enabled.
Most def. Once the 5.0 milestone is complete I'll release 5.0.0.beta1
(force-pushed from deeef15 to e022667)
I rebased on
Do you think this is acceptable in this form to merge, so that we can continue testing this non-blocking/non-breaking change?
(force-pushed from ed006fb to 340b341)
Ruby MRI can process at most a single thread concurrently due to the GVL. This results in over-utilisation when an unfavourable distribution of connections happens. This change tries to prefer less-busy workers (i.e. those faster to accept the connection) to improve worker utilisation.
# Ruby MRI does GVL, this can result
# in processing contention when multiple threads
# (requests) are running concurrently
return unless Puma.mri?
This is adding two method calls to the hot path, but for now I'll leave them in. Just pointing out for future optimisation that we could metaprogram these out based on config.
It might be worth swapping the order of calls.
require "benchmark/ips"
def mri?
RUBY_ENGINE == 'ruby' || RUBY_ENGINE.nil?
end
num = rand(-10..10)
Benchmark.ips do |b|
b.report("mri?") { mri? }
b.report("delay_s > 0") { num > 0 }
b.report("delay_s > 0 with presence check") { num && num > 0 }
b.compare!
end
Warming up --------------------------------------
mri? 1.043M i/100ms
delay_s > 0 2.328M i/100ms
delay_s > 0 with presence check
2.281M i/100ms
Calculating -------------------------------------
mri? 11.013M (± 3.6%) i/s - 55.264M in 5.025004s
delay_s > 0 25.234M (± 3.0%) i/s - 128.063M in 5.079836s
delay_s > 0 with presence check
22.287M (± 4.6%) i/s - 111.772M in 5.026304s
Comparison:
delay_s > 0: 25233810.8 i/s
delay_s > 0 with presence check: 22286998.1 i/s - 1.13x (± 0.00) slower
mri?: 11013083.0 i/s - 2.29x (± 0.00) slower
(I'm not sure how hot this path runs / how much it matters.)
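For illustration, the suggested reordering would look roughly like this; a sketch only, with the guard shapes assumed from the diff and the benchmark labels above rather than quoted from the actual method:

```ruby
def wait_for_less_busy_worker(delay_s)
  # Cheap numeric check first, so the common case (no delay configured)
  # short-circuits before paying for the comparatively slower Puma.mri? call.
  return unless delay_s && delay_s > 0
  return unless Puma.mri?
  # ... rest unchanged
end
```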
Excellent work @ayufan. I'm going to write up a guide to 5.0 and the beta which will provide more guidance for using this feature, but I'm excited to see reports/data back when this goes live.
Thank you @nateberkopec!
Amazing! Thank you all!
@@ -283,6 +283,9 @@ def handle_servers
else
begin
pool.wait_until_not_full
pool.wait_for_less_busy_worker(
@options[:wait_for_less_busy_worker].to_f)
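For readers following along, here is a small self-contained toy (not Puma's implementation; the class name, helpers, and 5ms value are only a sketch of the idea, and the MRI-only guard shown in the diff is omitted). It shows the waiting pattern the PR relies on: a worker that already has work in flight sleeps for up to the configured delay, but can be woken earlier when one of its requests finishes, while an idle worker returns immediately and wins the race to accept.

```ruby
require 'thread'

class ToyPool
  def initialize
    @mutex    = Mutex.new
    @not_full = ConditionVariable.new
    @busy     = 0
  end

  def start_request
    @mutex.synchronize { @busy += 1 }
  end

  def finish_request
    @mutex.synchronize do
      @busy -= 1
      @not_full.broadcast # wake any worker that is currently delaying
    end
  end

  # Busy workers yield for up to delay_s before trying to accept a connection;
  # idle workers (@busy == 0) return immediately and win the accept race.
  def wait_for_less_busy_worker(delay_s)
    return unless delay_s && delay_s > 0

    @mutex.synchronize do
      @not_full.wait(@mutex, delay_s) if @busy > 0
    end
  end
end

pool = ToyPool.new

pool.start_request
t = Time.now
pool.wait_for_less_busy_worker(0.005)
puts format('busy worker delayed %.2f ms', (Time.now - t) * 1000)

pool.finish_request
t = Time.now
pool.wait_for_less_busy_worker(0.005)
puts format('idle worker delayed %.2f ms', (Time.now - t) * 1000)
```

Running this prints roughly 5 ms for the busy case and near 0 ms for the idle case, which is the whole effect of the patch: idle workers get a small head start at `accept_nonblock`.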
Describe the bug
We at GitLab.com are production-testing Puma in preparation for switching to it.
However, we noticed some oddities when running Puma vs Unicorn in mostly similar configurations.
A Puma worker running in cluster mode has no knowledge of its siblings and their utilisation. This results in some sockets being accepted by sub-optimal workers that are already processing requests. This is statistically significant: in my local testing, and in GitLab.com testing of the above workload, we see an increase of around 20-30% in durations at P60, which is a significant increase.
Capacity of Puma
The capacity of the Puma web server is defined by `workers * threads`. Each of these forms a slot that can accept a new request. It means that each slot can accept a new request at random (effectively round-robin).
Now, what happens if we have 2 requests waiting to be processed, two workers, and two threads:
- we do not control which worker is currently processing a request,
- ideally, they should be assigned to two separate workers.

It does mean that it is plausible that the two requests will be assigned in a sub-optimal way: to a single worker, but multiple threads. It means that their processing time is increased by the noisy neighbour due to the Ruby MRI GVL.
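As a rough illustration of how likely the sub-optimal assignment is, here is a toy simulation. The uniform-random assignment model is my assumption for illustration only, not a claim about Puma's actual accept behaviour:

```ruby
# Toy model: each of 2 pending requests lands on one of 2 workers uniformly at
# random (ignoring threads and timing). Counts how often both requests end up
# on the same worker and therefore contend under the MRI GVL.
workers = 2
trials  = 100_000

collisions = trials.times.count do
  Array.new(2) { rand(workers) }.uniq.size == 1
end

printf "both requests on the same worker: %.1f%%\n", 100.0 * collisions / trials
```

Under this naive model about half of such pairs collide; the real probability depends on worker/thread counts and accept timing.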
Interestingly, the same latency impact is present in Sidekiq; we just don't see it, as we do not care about the real-time aspect of background processing that much. However, the scheduling changes could improve Sidekiq performance as well if we targeted `sidekiq` too.
This is described in more detail here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173
Our Configuration
We run Puma and Unicorn in the following configurations:
- `W16/T2`, where the `16` corresponds to the 16 CPUs available; the nodes do not process anything else,
- `W30`, where the `30` seems like an artificial factor chosen to reach some sort of node saturation.

We did notice oddness in Puma scheduling, as it was sometimes hitting workers that were already processing requests. This resulted in increased latency and duration of request processing, even though we had a ton of spare capacity on other workers that were simply idle.

The nodes are configured like that today, due to graceful restart/overwrite/shutdown of Unicorn, to have spare capacity to handle as much as twice the regular `W30`. They are highly underutilised on average because of that, but this is something to change.
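For reference, the Puma side of the `W16/T2` shape would look roughly like this in `config/puma.rb` (a sketch; the Unicorn `W30` setup lives in its own config and is not shown here):

```ruby
# 16 worker processes (one per CPU), 2 threads each = 32 request slots.
workers 16
threads 2, 2
```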
Data
The exact data are presented here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247824814, with a description of what we are seeing.
Expected behavior: Ideal-scheduling algorithm in multi-threaded scenario
In an ideal scenario we would always want to use our capacity efficiently:
Currently, Puma does none of that. For round-robin we pay around a 20% performance penalty due to sub-optimal request-processing assignment.
It is expected that the closer we get to 100% CPU usage, the less of a problem Puma's lack of scheduling becomes, as the node is saturated and threads are balanced by the kernel.
Workaround
We can insert a very small delay for busy workers, on the assumption that they are already busy doing other work, and even if we schedule requests on them it will not really make them process those requests faster.
The value of `5ms` was tested to provide a compromise between throughput and delay, as we don't really want to delay forever. There's a test that exercises that value and provides a benchmark showing when this delay makes a difference.
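Since the diff reads the delay from `@options[:wait_for_less_busy_worker]`, tuning it from `config/puma.rb` would presumably look like the following; note the DSL method name is an assumption here, based on the options key in the diff rather than on documented configuration:

```ruby
# Hypothetical config sketch, assuming the option is exposed in the DSL
# under the same name as the options key used in the diff above.
wait_for_less_busy_worker 0.005  # the 5 ms compromise value discussed above
```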
Your checklist for this pull request
- I have added `[changelog skip]` to all commit messages.
- I have added `[ci skip]` to the title of the PR.
- I have added a reference to "#issue" to the PR description or my commit messages.