Inject small delay for busy workers to improve requests distribution #2079

Merged: 1 commit merged into puma:master on May 11, 2020

Conversation

@ayufan (Contributor) commented Nov 21, 2019

Describe the bug

We at GitLab.com are production-testing Puma in preparation for switching to it.
However, we noticed a few oddities when running Puma vs Unicorn in mostly similar configurations.

A Puma worker running in cluster mode has no knowledge of its siblings or their utilisation. This results in some of the sockets being accepted by sub-optimal workers that are already processing requests. This is statistically significant: in my local testing, and in GitLab.com testing of the above workload, we see an increase of around 20-30% in durations at P60, which is a significant increase.

Capacity of Puma

The capacity of the Puma web server is defined by workers * threads. Each of these
forms a slot that can accept a new request. Each slot accepts
new requests at random (effectively round-robin).

Now, what happens if we have 2 requests waiting to be processed, two workers,
and two threads?

  1. Our total capacity is 4 (W2 * T2).
  2. Each request can be assigned at random to any worker, even a worker that is
    currently processing a request, as we do not control that.
  3. It means that in the ideal scenario, if we have 2 requests, for optimal performance
    they should be assigned to two separate workers.
  4. Puma does not implement any mechanism for 3.; rather, the first worker to accept wins.
    It is therefore plausible that the two requests will be assigned in a sub-optimal
    way: to a single worker, but multiple threads.
  5. If the two requests are processed by the same worker, they share CPU time, so
    their processing time is increased by the noisy neighbour due to the Ruby MRI GVL
    (a toy illustration of this follows below).
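
To make point 5 concrete, here is a toy illustration (not Puma code; the loop size is arbitrary and all timings are machine-dependent). Two CPU-bound "requests" running in two threads of one MRI process contend on the GVL and roughly double in wall-clock time, while the same two requests in two separate processes run in parallel:

require 'benchmark'

# roughly 100 ms of pure-Ruby CPU work on a typical machine (adjust to taste)
work = -> { 3_000_000.times { |i| i * i } }

# Same worker, two threads: MRI's GVL serialises the Ruby work,
# so wall-clock time is roughly the sum of both requests.
puts Benchmark.realtime { [Thread.new(&work), Thread.new(&work)].each(&:join) }

# Two workers (separate processes; fork is Unix-only): the requests run
# truly in parallel, so wall-clock time is roughly a single request.
puts Benchmark.realtime { Array.new(2) { fork { work.call } }.each { |pid| Process.wait(pid) } }

This is the same effect behind the P60 duration increase described above, just reduced to its smallest form.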

Interestingly, the same latency impact is present in Sidekiq; we just don't see it,
as we do not care as much about the real-time aspect of background processing.

However, the scheduling changes could improve Sidekiq performance as well, if we
targeted Sidekiq too.

This is described in more detail here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173

Our Configuration

We run Puma and Unicorn in the following configurations:

  1. Puma runs at W16/T2; the 16 corresponds to the 16 CPUs available, and the nodes do not process anything else.
  2. Unicorn runs at W30; the 30 seems like an artificial factor chosen to achieve some sort of node saturation.
  3. The nodes are lightly utilised, peaking at around 45% CPU usage.
  4. We process around 70-80 req/sec.

We did notice oddness in Puma scheduling: it sometimes hit workers that were already processing requests. This resulted in increased latency and longer request processing durations, even though we had plenty of spare capacity on other workers that were simply idle.

The nodes are configured this way today, due to graceful restart/overwrite/shutdown of Unicorn, to have spare capacity to handle as much as twice the regular W30. They are highly underutilised on average because of that, but this is something to change.

Data

The exact data are presented here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247824814, along with a description of what we are seeing.

Expected behavior: ideal scheduling algorithm in a multi-threaded scenario

In an ideal scenario we would always want to use our capacity efficiently:

  1. Assign to free workers first, as they offer the best latency.
  2. Assign to workers with the fewest other requests being processed, as they are the least busy.
  3. Assign to workers that are close to finishing the requests they are processing, as they will have free capacity soon (this is hard to implement, as it requires a rather complex heuristic for estimating request completion time).

Currently, Puma implements none of these. For the effectively round-robin assignment we pay roughly a 20% performance penalty due to sub-optimal request assignment.

It is expected that the closer we get to 100% CPU usage, the less of a problem Puma's lack of scheduling becomes, since the node is saturated and the threads are balanced by the kernel.

Workaround

We can insert a very small delay for busy workers, on the assumption that they are already busy doing other work, so even if we schedule requests on them it will not really make them process requests faster.

The value of 5 ms was tested and provides a compromise between throughput and delay, as we don't really want to delay forever. There's a test that exercises that value and provides a benchmark showing when this delay makes a difference.
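
For readers skimming the diff, the workaround boils down to roughly the following (a minimal sketch, not the merged code; Puma.mri? and @not_full appear elsewhere in this thread, while busy_threads and with_mutex are assumed helpers of Puma's ThreadPool):

# Called by a worker's thread pool just before it tries to accept a connection.
# delay_s is the maximum back-off, e.g. 0.005 for 5 ms.
def wait_for_less_busy_worker(delay_s)
  return unless delay_s && delay_s > 0
  return unless Puma.mri?            # only MRI suffers GVL contention

  with_mutex do
    return if @shutdown
    return unless busy_threads > 0   # idle workers accept immediately

    # Busy workers back off briefly, giving idle siblings a head start on
    # accept(2); the condition variable is signalled when a request finishes,
    # so the actual wait can end earlier than delay_s.
    @not_full.wait @mutex, delay_s
  end
end

The effect is that when several workers race to accept the same pending connection, an idle worker tends to win, which approximates the "assign to free workers first" heuristic described above.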

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added an entry to History.md if this PR fixes a bug or adds a feature. If it doesn't need an entry to History.md, I have added [changelog skip] to all commit messages.
  • I have added appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@olleolleolle (Contributor) left a comment:

Tiny typo note

(Review thread on lib/puma/thread_pool.rb: outdated, resolved)
@evanphx (Member) commented Nov 27, 2019

This, along with the original issue, is well researched and explained; thanks so much for that!

Would you be willing to test a suite of values for latency and tick so we can get a feel for how they influence the average response time?

Could you run your test against a request that only lasts 150ms so we can get a feel for whether your patch causes a meaningful slowdown? If it does, one option would be to sample the average time to complete a job and feed it back into the wait algorithm so that it only ever attempts to sleep for, say, 1% or less of the average request time.

@ayufan (Contributor, Author) commented Nov 27, 2019 via email

@ayufan force-pushed the gitlab-latency branch 5 times, most recently from 073e545 to 30f2206, on December 11, 2019 10:26
@wjordan (Contributor) left a comment:

This approach seems promising. I'll try to run it on my production workload and the benchmark mentioned in #1646 (comment) to separately validate the improvements.

I'm particularly interested in a comparison against the approach in #1646, which I'm currently running in production (note that #1920 is incorrectly implemented). If I understand correctly, the two main differences between this approach and the one in #1646 are:

  • Each worker calls @not_full.wait after IO.select has completed (when an incoming request is available on the socket) and before accept_nonblock is called. By comparison, #1646 (Balance incoming requests across processes) calls wait after accept_nonblock and before IO.select is called again.
  • wait is called a number of times equal to the current number of requests queued on the worker, which implicitly sorts the workers so that the one with the shortest request queue accepts the pending request first. By comparison, #1646 waits whenever no requests are immediately available.

I suspect this approach will do better (particularly in heavily loaded scenarios at higher thread depths), but it would be good to have data to confirm that, since the approaches are a bit different.

(Review thread on lib/puma/server.rb: outdated, resolved)
@ayufan (Contributor, Author) commented Dec 11, 2019

Yes, this approach particularly targets idle or semi-idle workers. I did not see the other one having big benefits for our workloads and configuration.

The other one just prevents workers from picking up new connections if they are already fully utilised (it seems), so it may be beneficial as well.

@wjordan (Contributor) commented Dec 11, 2019

I did not see the other one to have big benefits for workloads and our configuration.

By 'the other one' did you mean specifically the code in #1646 or the different code in #1920? Note that, as I mentioned, the code in #1920 is incorrectly implemented and would have no effect.

@ayufan (Contributor, Author) commented Dec 11, 2019

Let me check my notes, @wjordan, but I was also testing under different expectations, I believe :)

So, I was testing #1920: https://docs.google.com/spreadsheets/d/1y2YqrPPgZ-RtjKiCGplJ7YkwjMXkZx6vEPq7prFbUn4/edit#gid=0. It shows no effect in my tests.

@wjordan (Contributor) commented Dec 12, 2019

I tested out this PR on the Sinatra-MySQL TechEmpower benchmark mentioned in #1646 (comment):

  • m4.10xlarge EC2 instance with 40 vCPUs
  • Ruby 2.5.7-p206
  • Puma configured with 40 workers, 5 threads (1 thread for comparison)
  • Test configured for 40 concurrent requests, 20 queries per request (./tfb --test sinatra --concurrency-levels 40 --type query --query-levels 20)

Results (request/sec):

1 thread: 3917.86

5 threads:

@wjordan mentioned this pull request Dec 12, 2019
@ayufan (Contributor, Author) commented Dec 12, 2019

Thanks @wjordan. Let me check what happens with your changes when I run against my testing suite :)

I believe we test different scenarios:

  • underutilisation: my patch focuses on that
  • oversaturation: your patch focuses on that

We might basically need both changes.

@ayufan (Contributor, Author) commented Dec 12, 2019

Puma BASELINE without patches (columns: requests, concurrency, total time in seconds, mean time per request in ms; see run-ab-projects below)

➜  gitlab-rails ./run-ab-projects
100	1	8.757	87.572
200	2	9.589	95.886
300	3	10.418	104.176
400	4	11.556	115.563
500	5	13.994	139.937
600	6	15.488	154.883
700	7	18.259	182.589
800	8	20.339	203.389
➜  gitlab-rails ./run-ab-projects
100	1	8.891	88.907
200	2	9.325	93.254
300	3	10.471	104.706
400	4	12.004	120.038
500	5	13.212	132.125
600	6	15.558	155.583
700	7	18.037	180.374
800	8	20.100	201.004
➜  gitlab-rails ./run-ab-projects
100	1	8.967	89.672
200	2	9.489	94.889
300	3	10.759	107.591
400	4	11.720	117.197
500	5	13.355	133.551
600	6	15.439	154.391
700	7	18.122	181.223
800	8	20.236	202.364

Puma with PR 2079

https://github.com/puma/puma/commit/30f2206bb9d31bc9527b697f6948f7963315df88.patch

curl https://github.com/puma/puma/commit/30f2206bb9d31bc9527b697f6948f7963315df88.patch | patch -p1 -d /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/puma-4.3.1/
➜  gitlab-rails ./run-ab-projects
100	1	8.932	89.320
200	2	9.015	90.146
300	3	9.749	97.495
400	4	10.317	103.170
500	5	12.891	128.913
600	6	15.084	150.837
700	7	17.667	176.668
800	8	20.239	202.388
➜  gitlab-rails ./run-ab-projects
100	1	9.185	91.852
200	2	9.243	92.427
300	3	9.557	95.567
400	4	10.269	102.689
500	5	12.625	126.253
600	6	15.184	151.837
700	7	17.795	177.945
800	8	19.925	199.245
➜  gitlab-rails ./run-ab-projects
100	1	8.879	88.789
200	2	9.056	90.557
300	3	9.596	95.961
400	4	10.370	103.698
500	5	13.024	130.237
600	6	15.276	152.763
700	7	17.490	174.896
800	8	20.303	203.026

Puma with PR 1646

https://patch-diff.githubusercontent.com/raw/puma/puma/pull/1646.patch

# I needed to fix it manually, as it is not compatible with Puma 4.3.1
curl https://patch-diff.githubusercontent.com/raw/puma/puma/pull/1646.patch | patch -p1 -d /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/puma-4.3.1/
➜  gitlab-rails ./run-ab-projects
100	1	8.861	88.609
200	2	9.608	96.078
300	3	11.007	110.065
400	4	11.523	115.233
500	5	13.487	134.872
600	6	15.591	155.913
700	7	18.424	184.245
800	8	20.288	202.876
➜  gitlab-rails ./run-ab-projects
100	1	8.918	89.179
200	2	9.255	92.552
300	3	10.456	104.563
400	4	11.964	119.635
500	5	13.714	137.145
600	6	16.010	160.098
700	7	17.923	179.234
800	8	20.582	205.822
➜  gitlab-rails ./run-ab-projects
100	1	8.928	89.279
200	2	9.604	96.035
300	3	10.933	109.333
400	4	11.741	117.414
500	5	13.128	131.279
600	6	15.790	157.905
700	7	18.070	180.696
800	8	20.592	205.915

run-ab-projects

#!/bin/bash
# Benchmark the projects API with ApacheBench and print one line per run:
# requests, concurrency, total time (s), mean time per request (ms).

# $1 = number of requests, $2 = concurrency level
run_ab() {
    result=$(ab -q -n "$1" -c "$2" http://localhost:4080/api/v4/projects)
    time_taken=$(echo "$result" | grep "Time taken for tests:" | cut -d" " -f7)
    time_per_req=$(echo "$result" | grep "Time per request:" | grep "(mean)" | cut -d" " -f10)

    echo -e "$1\t$2\t$time_taken\t$time_per_req"
}

for i in 1 2 3 4 5 6 7 8; do
    run_ab $((i*100)) $i
    sleep 1
done

@ayufan (Contributor, Author) commented Dec 12, 2019

I also put the results in: https://docs.google.com/spreadsheets/d/1y2YqrPPgZ-RtjKiCGplJ7YkwjMXkZx6vEPq7prFbUn4/edit#gid=1614859337.

I picked the best result recorded for each requests/concurrency pair.

I deliberately tested against GitLab, as it has many layers. It is hard to achieve a throughput of 4k req/sec, but that is not the intent either. I optimise towards the fastest processing time, not towards the highest throughput. Higher throughput can be achieved with more workers, but processing time is a function of the number of threads.

@ayufan (Contributor, Author) commented Dec 18, 2019

We are running the cond.wait patch (this PR) on GitLab.com and we do not see a performance penalty in our testing. This is a positive change compared to before running the patch.

@ayufan (Contributor, Author) commented Jan 29, 2020

We are running Puma on GitLab.com on 25% of the fleet. So far it really seems that the patch makes Puma equal to Unicorn in our workload.

We are discussing that in: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_277452635.

The results from today's run will be posted tomorrow, but so far the distributions look pretty much equal between Puma and Unicorn. It means that, for us, threading does not affect request processing.

@ayufan (Contributor, Author) commented Jan 30, 2020

@nateberkopec @olleolleolle @wjordan

It seems that the simple change has a big impact for us: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_279516488. Extensive tests on GitLab.com show no noticeable difference between Unicorn (W30/W40, depending on whether api/git/web) and Puma (W16/T4, all configured the same). I still expected some deficiency when running Puma, but it seems to be statistically insignificant.

We are now discussing doing additional testing to compare at a larger scale: Puma without the patch, Puma with the patch, and Unicorn for a limited time (likely 1h, to reduce impact on customers).

I will look more into understanding @wjordan's performance measurements. I need to understand the objective that we are optimising towards.

@nateberkopec (Member) commented:
I'm caught up on your thread on Gitlab now, but it looks like you still have some measurement issues re: the patch's effectiveness, correct?

@nateberkopec added the feature and waiting-for-changes (Waiting on changes from the requestor) labels and removed the question label on Feb 7, 2020
@ayufan (Contributor, Author) commented Feb 7, 2020

@nateberkopec

I would say not really; it's just that comparing one node with another gives wrong results for various reasons, like traffic distribution or variability in performance between nodes. This was also the reason we started running on 23% of the fleet, to reduce this variability.

Anyway, we will do another set of disables to get bigger sample sets. Right now it seems that the impact is substantial and makes all execution durations pretty consistent with Unicorn (at least for us).

@ayufan (Contributor, Author) commented Feb 19, 2020

As of today GitLab.com runs fully on Puma with this patch. Issue for the configuration change: https://gitlab.com/gitlab-com/gl-infra/production/issues/1684.

I will now spend some time comparing and figuring out which solution to propose for upstreaming :)

Thanks for your help!

@ayufan (Contributor, Author) commented Apr 24, 2020

@nateberkopec I asked our production team to change the params of the old patch: https://gitlab.com/gitlab-org/gitlab/-/issues/196002#note_330784539. The old patch is still effectively the same, as we can control all aspects of it by configuring params. This PR is a simplified version with those values hardcoded.

@ayufan (Contributor, Author) commented Apr 24, 2020

Some results from running the benchmark (included): https://docs.google.com/spreadsheets/d/1y2YqrPPgZ-RtjKiCGplJ7YkwjMXkZx6vEPq7prFbUn4/edit#gid=1510215191. The exact numbers depend entirely on the workload; this hopefully gives some picture of the interleaved workflow.

@ayufan (Contributor, Author) commented Apr 28, 2020

@nateberkopec WDYT? What would be the next steps? :)

@nateberkopec (Member) commented:
I'm gonna take some time this weekend to review 👍 The code change is simple, so the main thing is figuring out the impact and benchmarking.

@schneems (Contributor) commented:
figuring out the impact and benchmarking

Agreed, that's a tall order. We could publish this in a pre-release and then solicit people to try it out, though finding the people and getting meaningful feedback there has always been an issue. We can promote it via Twitter and Slack etc.; it's better than nothing.

@ayufan (Contributor, Author) commented Apr 29, 2020

@schneems @nateberkopec

We can make it disabled by default, or even make the delay adaptive. Adjusting the default later allows this to be enabled implicitly.

@nateberkopec (Member) commented:
Most def. Once the 5.0 milestone is complete I'll release 5.0.0.beta1

@ayufan force-pushed the gitlab-latency branch 2 times, most recently from deeef15 to e022667, on May 5, 2020 12:10
@ayufan (Contributor, Author) commented May 5, 2020

@nateberkopec

I rebased on master as there were some merged changes related to OOB:

  • I added a config setting where you specify the delay; it is off by default.
  • This means it is not the default yet, but it allows us to merge/test without impacting existing users, who only need a very simple addition to opt in (see the sketch below).
  • I reworked cpu_spin.rb to make it simpler.
  • I added an additional test.rb to test this very simply; since this is sleep-based I had to use small sleeps, but the test is pretty fast anyway: ~0.238589s.

Do you think this is acceptable in this form, to merge and continue testing this non-blocking/non-breaking change?
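
For illustration, opting in might look roughly like this in a Puma config file (a hypothetical sketch: the option key :wait_for_less_busy_worker is visible in the diff below, but the DSL method name and the surrounding settings are assumptions):

# config/puma.rb (hypothetical example of enabling the delay)
workers 16
threads 2, 2

# maximum back-off, in seconds, a busy worker waits before accepting;
# 0.005 (5 ms) is the value tested at GitLab.com; omit to keep it disabled
wait_for_less_busy_worker 0.005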

@ayufan force-pushed the gitlab-latency branch 3 times, most recently from ed006fb to 340b341, on May 5, 2020 14:35
Ruby MRI can process at most a single thread concurrently due to the GVL. This results in over-utilisation if an unfavourable distribution of connections happens.

This tries to prefer less-busy workers (i.e. faster to accept
the connection) to improve worker utilisation.
# Ruby MRI has a GVL; this can result
# in processing contention when multiple threads
# (requests) are running concurrently
return unless Puma.mri?
@nateberkopec (Member) commented:

This is adding two method calls to the hot path, but for now I'll leave them in. Just pointing out for future optimization that we could metaprogram these out based on config.

@ghiculescu (Contributor) commented Jul 7, 2020:

It might be worth swapping the order of calls.

require "benchmark/ips"

def mri?
  RUBY_ENGINE == 'ruby' || RUBY_ENGINE.nil?
end

num = rand(-10..10)

Benchmark.ips do |b|
  b.report("mri?") { mri? }

  b.report("delay_s > 0") { num > 0 }

  b.report("delay_s > 0 with presence check") { num && num > 0 }

  b.compare!
end
Warming up --------------------------------------
                mri?     1.043M i/100ms
         delay_s > 0     2.328M i/100ms
delay_s > 0 with presence check
                         2.281M i/100ms
Calculating -------------------------------------
                mri?     11.013M (± 3.6%) i/s -     55.264M in   5.025004s
         delay_s > 0     25.234M (± 3.0%) i/s -    128.063M in   5.079836s
delay_s > 0 with presence check
                         22.287M (± 4.6%) i/s -    111.772M in   5.026304s

Comparison:
         delay_s > 0: 25233810.8 i/s
delay_s > 0 with presence check: 22286998.1 i/s - 1.13x  (± 0.00) slower
                mri?: 11013083.0 i/s - 2.29x  (± 0.00) slower

(I'm not sure how hot this path runs / how much it matters.)

@nateberkopec merged commit 554b969 into puma:master May 11, 2020
@nateberkopec (Member) commented:

Excellent work @ayufan. I'm going to write up a guide to 5.0 and the beta which will provide more guidance for using this feature, but I'm excited to see reports/data back when this goes live.

@ayufan (Contributor, Author) commented May 11, 2020

Thank you @nateberkopec !

@mwpastore (Contributor) commented:
Amazing! Thank you all!

@@ -283,6 +283,9 @@ def handle_servers
           else
             begin
               pool.wait_until_not_full
+              pool.wait_for_less_busy_worker(
+                @options[:wait_for_less_busy_worker].to_f)

Labels: feature, perf, waiting-for-review
9 participants