Least-busy load balancing in cluster mode #2092

Closed

Conversation


@wjordan wjordan commented Dec 17, 2019

Description

Proof-of-concept HTTP request load-balancing using a 'least-busy' algorithm for cluster mode. Instead of having each worker-process run a select-loop concurrently on the listener and race to accept incoming connections, this PR runs a select-loop on the listener in the parent process and forwards individual client TCP connections to worker-processes by sending the file descriptor through a unix domain socket-pair unique to each worker. With this approach, the parent gets to pick which worker gets the connection using a specific routing algorithm.
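As a rough illustration of the handoff (a minimal, hypothetical single-worker sketch, not the PR's actual code):

```ruby
require 'socket'

# Parent creates one UNIX socketpair per worker before forking.
parent_end, worker_end = UNIXSocket.pair

fork do
  parent_end.close
  # Worker: receive each client connection as a passed file descriptor.
  loop do
    client = worker_end.recv_io(TCPSocket)
    client.write "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"
    client.close
  end
end

worker_end.close
listener = TCPServer.new('127.0.0.1', 9292)
# Parent: accept on the listener, then forward the socket to the chosen worker.
loop do
  client = listener.accept
  parent_end.send_io(client)  # pass the file descriptor over the socketpair
  client.close                # the parent no longer needs its copy
end
```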

To implement 'least-busy' routing, in which the parent process selects the worker that receives each client connection, each worker updates the parent with its current capacity whenever its threads start or finish processing, and the parent picks the worker reporting the highest capacity. (If two workers report the same capacity, the one with the oldest last request is chosen.)
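Sketched in isolation (with a hypothetical Worker struct rather than Puma's actual classes), the parent-side bookkeeping amounts to roughly:

```ruby
# Each worker writes "!c<pid>,<capacity>" to the control pipe whenever a
# thread starts or finishes work; the parent updates its view and routes.
Worker = Struct.new(:pid, :capacity, :last_request)

def handle_capacity_message(line, workers)
  return unless line.chomp =~ /\A!c(\d+),(\d+)\z/
  pid, capacity = $1.to_i, $2.to_i
  worker = workers.find { |w| w.pid == pid }
  worker.capacity = capacity if worker
end

def pick_worker(workers)
  # Highest reported capacity wins; ties go to the worker that has been
  # idle the longest (oldest last_request timestamp).
  workers.min_by { |w| [-w.capacity, w.last_request] }
end
```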

This PR was designed to solve #1254 / #2078 as an alternative approach to #1646, #1920 and #2079. The configuration of custom load-balancing algorithms via DSL could support other use-cases as well.

Performance

Running the benchmark noted in #2079 (comment) yields 4013 requests/sec.

[todo - will put the results of some other micro-benchmarks/comparisons here]

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added an entry to History.md if this PR fixes a bug or adds a feature. If it doesn't need an entry to HISTORY.md, I have added [changelog skip] to the pull request title.
  • I have added appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

Forward connections to workers based on current capacity,
using a pipe to track capacity across processes.
Customize load balancing algorithm in DSL.
@@ -267,6 +282,8 @@ def worker(index, master)
@launcher.config.run_hooks :before_worker_boot, index

server = start_server
server.update_capacity = Proc.new {|c| @worker_write << "!c#{Process.pid},#{c}\n" }
Contributor Author

Note there is a bit of synchronization overhead in this operation, which starts to become visible at extremely low-latency response times / high request rates (e.g., ~1ms/req, ~10k req/sec). An alternative is to use shared memory (e.g., the raindrops gem); see 486b4f9 for a commit implementing that approach. The tradeoff is pulling in an extra dependency (or writing extra C-extension code) for that extra bit of performance.
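For reference, a minimal sketch of the shared-memory variant, assuming raindrops' counter API (this is not the code in 486b4f9):

```ruby
require 'raindrops'

WORKERS = 4
# One shared counter slot per worker, visible across fork via mmap'ed memory,
# so no pipe write or parent-side parsing is needed per request.
capacity = Raindrops.new(WORKERS)

index = 0  # this worker's slot (hypothetical)

# In the worker: bump the counter as threads become free or busy.
capacity.incr(index)  # a thread became available
capacity.decr(index)  # a thread picked up work

# In the parent: read the counters directly when routing a connection.
target = (0...WORKERS).max_by { |i| capacity[i] }
```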

@@ -75,9 +77,13 @@ def initialize(idx, pid, phase, options)
@last_checkin = Time.now
@last_status = '{}'
@term = false
@socket = socket
@capacity = 0
Member

Is there a reason we can't re-use information from the threadpool, I wonder, rather than setting a separate counter?

@@ -140,7 +140,7 @@ def test_stuck_phased_restart
def term_closes_listeners(unix: false)
skip_unless_signal_exist? :TERM

cli_server "-w #{WORKERS} -t 0:6 -q test/rackup/sleep_step.ru", unix: unix
cli_server "-w #{WORKERS} -t 1:6 -q test/rackup/sleep_step.ru", unix: unix
Member

Why was this req'd?


attr_writer :update_capacity

def update_capacity
Member

Not a huge fan of this. Would be better if we could leave the threadpool alone.

@nateberkopec nateberkopec added this to the 5.0.0 milestone Dec 18, 2019
Member

@evanphx evanphx left a comment

The approach is totally fine: having a top-level acceptor thread and handing off sockets to child processes.

But this won't work on Windows, and I suspect it won't work on JRuby either (though I'd love some guidance there from @headius because I could be wrong).

As such, this approach needs to be abstracted so that it only operates when send_io and recv_io are available.
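One way to abstract it would be a capability check along these lines (a sketch, not a committed Puma API):

```ruby
require 'socket'

# Only enable parent-side routing where file-descriptor passing works;
# otherwise fall back to the existing shared-listener accept loop.
def fd_passing_supported?
  UNIXSocket.public_method_defined?(:send_io) &&
    UNIXSocket.public_method_defined?(:recv_io) &&
    !Gem.win_platform?
end
```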

@@ -115,13 +122,15 @@ def spawn_thread
break
end

@waiting += 1
@waiting += 1; update_capacity
Member

Please move these calls to their own line

not_full.signal
not_empty.wait mutex
@waiting -= 1
end

work = todo.shift if continue
if continue
work = todo.shift; update_capacity
Member

Ditto, own line.

@@ -157,7 +166,7 @@ def <<(work)
raise "Unable to add work while shutting down"
end

@todo << work
@todo << work; update_capacity
Member

Ditto, own line.

Comment on lines +498 to +503
algorithms = {
least_busy: lambda {|workers| workers.min_by {|w| [-w.capacity, w.last_request]}},
round_robin: lambda {|workers| workers.min_by(&:last_request)},
pile_up: lambda {|workers| workers.select {|w| w.capacity > 0}.min_by {|w| w.index} || workers.min_by(&:last_request)},
random: lambda {|workers| workers.sample}
}.freeze
Member

This should be moved into a constant rather than hidden here, probably even in its own file.

Socket.accept_loop(*@launcher.binder.ios) do |connection, _|
routing.call(@workers.reject(&:term?)).accept(connection)
until @workers.any? {|w| w.capacity > 0 }
@mutex.synchronize {@not_full.wait(@mutex)}
Member

Please break the block out to a separate line with do

routing = @options[:load_balancing]
debug "Using load balancing algorithm: #{routing}"
routing = algorithms[routing] unless routing.is_a?(Proc)
raise "Invalid value for load_balancing - #{@options[:load_balancing]}" unless routing.is_a?(Proc)
Member

We should probably be checking this in the config DSL rather than at this level.

#
# @note Cluster mode only.
def load_balancing(algorithm=:least_busy)
@options[:load_balancing] = algorithm
Member

Here is where we should validate the argument and throw an error.

@@ -386,7 +389,8 @@ def handle_servers
break if handle_check
else
begin
if io = sock.accept_nonblock
receive_socket = sock.is_a?(UNIXSocket) && @update_capacity
if (io = receive_socket ? sock.recv_io(TCPSocket) : sock.accept_nonblock)
Member

While this will work because sock will be a UNIXServer when the user requested binding to a unix domain socket, it's a bit clunky. I'd rather we plumb through a flag to indicate that recv_io should be used, rather than relying on it being a UNIXSocket.


evanphx commented Dec 18, 2019

I fully realize this is a POC, btw, and that a lot of the comments I left are more about a delivered, prod-ready PR. I think the approach is totally valid and you should go ahead and turn this into a prod-ready PR.


ayufan commented Jan 16, 2020

I'm not sure this is OK.

This makes the master responsible for handing off connections to workers, and makes the master a single point of contention in that case. The current model behaves the same as Unicorn and is effectively a distributed solution, where multiple processes wait for connections instead of a single one.

This does add complexity, and might have unintended side-effects if, for any reason, anything else runs on the master.

@nateberkopec nateberkopec mentioned this pull request Feb 10, 2020
@nateberkopec nateberkopec added the waiting-for-changes label Feb 21, 2020
@nateberkopec

Closing because I think we have a prod-ready approach in @ayufan's PR, and there are two big drawbacks to this one:

  • Master process becomes an acceptor process
  • Probably no JRuby or Windows support.


sayap commented Feb 26, 2020

Just want to chime in that this latency issue is preventing us from migrating from passenger (OSS, single-threaded) to puma. As noted by @wjordan in #1254 (comment), passenger uses the exact same least-busy load balancing approach as this PR, and that has been working exceptionally well for us.

Regarding the drawback of making the master process a single point of failure, I think the fear is rather unfounded, as the master-workers approach works just fine for passenger.

As for the second drawback with cross-platform support, it looks like passenger does work fine on JRuby, but doesn't support Windows. So the concern is indeed valid.

Anyway, the condvar approach also has its own drawbacks, e.g. there can be a significant performance difference across Ruby versions: https://bugs.ruby-lang.org/issues/16255

Furthermore, the latency issue would be more apparent at P99 or P999. If we look at the graphs at https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334, the only one that covers the full range is "Zooming into just /jwt/auth calls", and it does look like the long tail latency is worse with puma + #2079 in that graph?


ayufan commented Feb 26, 2020

Furthermore, the latency issue would be more apparent at P99 or P999. If we look at the graphs at https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334, the only one that covers the full range is "Zooming into just /jwt/auth calls", and it does look like the long tail latency is worse with puma

From my perspective, moving all this logic to the master significantly changes the behaviour of Puma and forces us to re-learn how to run Puma effectively at scale. The biggest issue is that something could run on the master process that slows down request acceptance, and this could cause a performance regression for some.

After doing extensive tests across the board we see significantly better latency, consistent with Unicorn. We ran large-scale tests with and without the patch, and the conclusion was that we did not see any difference between Puma and Unicorn after applying the patch, on all percentiles we looked at (P50, P95, P99, etc.).

Anyway, it is still on my agenda for this cycle to re-try both of them. I'm quite certain that I will not be able to run this PR at scale, as we are that user "that has something running on master" :) I want to see if there's something we can improve with the cond.wait approach, because to be fair I have not focused on optimising that approach yet.

@sayap Did you maybe try my patch to see how it works for you?


wjordan commented Feb 26, 2020

I agree this PR isn't production-ready yet and should be shelved for now given the limitations mentioned. I thought I'd add some detailed notes in case I or anyone else wanted to revisit this approach in the future:

1. Master process becomes an acceptor process

While I agree with @sayap's comment that the concern is mostly theoretical at this point, and that other web servers use a similar request-balancing approach of a dedicated process passing accepted sockets to workers (see e.g. kfcgi and treasure-data/serverengine in addition to passenger), what's unique here is that Puma runs other code in the same master process (including user-defined hooks), and Ruby's global VM lock could cause one thread in the same process to block another. Forking a dedicated acceptor process (as opposed to an acceptor thread) could address this concern, and probably wouldn't be too difficult.

2. Probably no JRuby or Windows support

  • JRuby shouldn't have any issues; according to its test coverage, it supports passing file descriptors via send_io.
  • Windows support is possible but would take extra work. Although Windows doesn't support passing descriptors over a socket directly, it provides a WSADuplicateSocket API which creates a new descriptor for a shared socket that can then be passed through a pipe to the child process. Python wraps this up nicely via socket.share() / socket.fromshare(), but Ruby doesn't have any equivalent, so either a C extension or ffi/fiddle magic is required. treasure-data/serverengine includes a Windows-socket wrapper using fiddle to implement this approach which could probably be adapted.

I would add a third limitation: the synchronization overhead of passing busy counters through the existing pipe channel after each request, which I mentioned above in #2092 (comment). This should be replaced with shared-memory counters for best performance, but that also adds further complexity to the implementation.

For high-performance use-cases, another possibly simpler approach would be to have each worker listen on a separate socket/port, and use a separate load-balancing proxy such as nginx or haproxy to balance requests across them. This offers the best performance, and doesn't add any complexity for setups that already have nginx in their application stack anyway. I'll open a PR demonstrating this approach. (Update: see #2128)
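For example (assuming four workers that could each bind to their own port, which stock Puma doesn't do today), the nginx side might look roughly like:

```nginx
upstream puma_workers {
    least_conn;               # route each request to the least-busy backend
    server 127.0.0.1:9292;
    server 127.0.0.1:9293;
    server 127.0.0.1:9294;
    server 127.0.0.1:9295;
}

server {
    listen 80;
    location / {
        proxy_pass http://puma_workers;
    }
}
```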


ayufan commented Feb 26, 2020

For high-performance use-cases, another possibly simpler approach would be to have each worker listen to a separate socket/port, and use a separate load-balancing proxy such as nginx or haproxy to balance requests across them. This offers best performance, and doesn't add any complexity for setups which already have nginx in their application stack anyway. I'll open a PR demonstrating this approach.

This seems like an interesting idea, @wjordan. If we could allow each worker to have its own socket, the load balancing could happen externally.

This should be replaced with shared-memory counters for best performance but that also adds further complexity to the implementation.

This would be useful as well.

Labels: perf, waiting-for-changes