Least-busy load balancing in cluster mode #2092

Closed

Conversation


@wjordan wjordan commented Dec 17, 2019

Description

Proof-of-concept HTTP request load-balancing using a 'least-busy' algorithm for cluster mode. Instead of having each worker-process run a select-loop concurrently on the listener and race to accept incoming connections, this PR runs a select-loop on the listener in the parent process and forwards individual client TCP connections to worker-processes by sending the file descriptor through a unix domain socket-pair unique to each worker. With this approach, the parent gets to pick which worker gets the connection using a specific routing algorithm.
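As a rough illustration of the handoff (a minimal, hypothetical single-worker sketch, not the PR's actual code):

```ruby
require 'socket'

# Parent creates one UNIX socketpair per worker before forking.
parent_end, worker_end = UNIXSocket.pair

fork do
  parent_end.close
  # Worker: receive each client connection as a passed file descriptor.
  loop do
    client = worker_end.recv_io(TCPSocket)
    client.write "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"
    client.close
  end
end

worker_end.close
listener = TCPServer.new('127.0.0.1', 9292)
# Parent: accept on the listener, then forward the socket to the chosen worker.
loop do
  client = listener.accept
  parent_end.send_io(client)  # pass the file descriptor over the socketpair
  client.close                # the parent no longer needs its copy
end
```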

To implement 'least-busy' routing, in which the parent process selects the worker that receives each client connection, each worker updates the parent with its current capacity whenever its threads start or finish processing, and the parent picks the worker reporting the highest capacity. (If two workers report the same capacity, the one with the oldest last request is chosen.)
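Sketched in isolation (with a hypothetical Worker struct rather than Puma's actual classes), the parent-side bookkeeping amounts to roughly:

```ruby
# Each worker writes "!c<pid>,<capacity>" to the control pipe whenever a
# thread starts or finishes work; the parent updates its view and routes.
Worker = Struct.new(:pid, :capacity, :last_request)

def handle_capacity_message(line, workers)
  return unless line.chomp =~ /\A!c(\d+),(\d+)\z/
  pid, capacity = $1.to_i, $2.to_i
  worker = workers.find { |w| w.pid == pid }
  worker.capacity = capacity if worker
end

def pick_worker(workers)
  # Highest reported capacity wins; ties go to the worker that has been
  # idle the longest (oldest last_request timestamp).
  workers.min_by { |w| [-w.capacity, w.last_request] }
end
```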

This PR was designed to solve #1254 / #2078 as an alternative approach to #1646, #1920 and #2079. The configuration of custom load-balancing algorithms via DSL could support other use-cases as well.

Performance

Running the benchmark noted in #2079 (comment) yields 4013 requests/sec.

[todo - will put the results of some other micro-benchmarks/comparisons here]

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added an entry to History.md if this PR fixes a bug or adds a feature. If it doesn't need an entry to HISTORY.md, I have added [changelog skip] to the pull request title.
  • I have added appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

Forward connections to workers based on current capacity,
using a pipe to track capacity across processes.
Customize load balancing algorithm in DSL.
@@ -267,6 +282,8 @@ def worker(index, master)
@launcher.config.run_hooks :before_worker_boot, index

server = start_server
server.update_capacity = Proc.new {|c| @worker_write << "!c#{Process.pid},#{c}\n" }
Contributor Author

Note there is a bit of synchronization overhead in this operation, which starts to become visible at extremely low-latency response times / high request rates (e.g., ~1ms/req, ~10k req/sec). An alternative is to use shared memory (e.g., the raindrops gem); see 486b4f9 for a commit implementing that approach. The tradeoff is pulling in an extra dependency (or writing extra C-extension code) for that extra bit of performance.
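For reference, a minimal sketch of the shared-memory variant, assuming raindrops' counter API (this is not the code in 486b4f9):

```ruby
require 'raindrops'

WORKERS = 4
# One shared counter slot per worker, visible across fork via mmap'ed memory,
# so no pipe write or parent-side parsing is needed per request.
capacity = Raindrops.new(WORKERS)

index = 0  # this worker's slot (hypothetical)

# In the worker: bump the counter as threads become free or busy.
capacity.incr(index)  # a thread became available
capacity.decr(index)  # a thread picked up work

# In the parent: read the counters directly when routing a connection.
target = (0...WORKERS).max_by { |i| capacity[i] }
```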

@@ -75,9 +77,13 @@ def initialize(idx, pid, phase, options)
@last_checkin = Time.now
@last_status = '{}'
@term = false
@socket = socket
@capacity = 0
Member

Is there a reason we can't re-use information from the threadpool, I wonder, rather than setting a separate counter?

@@ -140,7 +140,7 @@ def test_stuck_phased_restart
def term_closes_listeners(unix: false)
skip_unless_signal_exist? :TERM

cli_server "-w #{WORKERS} -t 0:6 -q test/rackup/sleep_step.ru", unix: unix
cli_server "-w #{WORKERS} -t 1:6 -q test/rackup/sleep_step.ru", unix: unix
Member

Why was this req'd?


attr_writer :update_capacity

def update_capacity
Member

Not a huge fan of this. Would be better if we could leave the threadpool alone.

@nateberkopec nateberkopec added this to the 5.0.0 milestone Dec 18, 2019
Member

@evanphx evanphx left a comment

The approach is totally fine: having a top-level acceptor thread and handing off sockets to child processes.

But this won't work on Windows, and I suspect it won't work on JRuby either (though I'd love some guidance there from @headius because I could be wrong).

As such, this approach needs to be abstracted so that it only operates when send_io and recv_io are available.
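One way to abstract it would be a capability check along these lines (a sketch, not a committed Puma API):

```ruby
require 'socket'

# Only enable parent-side routing where file-descriptor passing works;
# otherwise fall back to the existing shared-listener accept loop.
def fd_passing_supported?
  UNIXSocket.public_method_defined?(:send_io) &&
    UNIXSocket.public_method_defined?(:recv_io) &&
    !Gem.win_platform?
end
```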

@@ -115,13 +122,15 @@ def spawn_thread
break
end

@waiting += 1
@waiting += 1; update_capacity
Member

Please move these calls to their own line

not_full.signal
not_empty.wait mutex
@waiting -= 1
end

work = todo.shift if continue
if continue
work = todo.shift; update_capacity
Member

Ditto, own line.

@@ -157,7 +166,7 @@ def <<(work)
raise "Unable to add work while shutting down"
end

@todo << work
@todo << work; update_capacity
Member

Ditto, own line.

Comment on lines +498 to +503
algorithms = {
least_busy: lambda {|workers| workers.min_by {|w| [-w.capacity, w.last_request]}},
round_robin: lambda {|workers| workers.min_by(&:last_request)},
pile_up: lambda {|workers| workers.select {|w| w.capacity > 0}.min_by {|w| w.index} || workers.min_by(&:last_request)},
random: lambda {|workers| workers.sample}
}.freeze
Member

This should be moved into a constant rather than hidden here, probably even in its own file.

Socket.accept_loop(*@launcher.binder.ios) do |connection, _|
routing.call(@workers.reject(&:term?)).accept(connection)
until @workers.any? {|w| w.capacity > 0 }
@mutex.synchronize {@not_full.wait(@mutex)}
Member

Please break the block out to a separate line with do

routing = @options[:load_balancing]
debug "Using load balancing algorithm: #{routing}"
routing = algorithms[routing] unless routing.is_a?(Proc)
raise "Invalid value for load_balancing - #{@options[:load_balancing]}" unless routing.is_a?(Proc)
Member

We should probably be checking this in the config DSL rather than at this level.

#
# @note Cluster mode only.
def load_balancing(algorithm=:least_busy)
@options[:load_balancing] = algorithm
Member

Here is where we should validate the argument and throw an error.

@@ -386,7 +389,8 @@ def handle_servers
break if handle_check
else
begin
if io = sock.accept_nonblock
receive_socket = sock.is_a?(UNIXSocket) && @update_capacity
if (io = receive_socket ? sock.recv_io(TCPSocket) : sock.accept_nonblock)
Member

While this will work because sock will be a UNIXServer when the user requested binding to a unix domain socket, it's a bit clunky. I'd rather we plumb through a flag to indicate that recv_io should be used, rather than relying on it being a UNIXSocket.


evanphx commented Dec 18, 2019

I fully realize this is a POC, btw, and that a lot of the comments I left are more about a delivered, prod-ready PR. I think the approach is totally valid and you should go ahead and turn this into a prod-ready PR.


ayufan commented Jan 16, 2020

I'm not sure this is OK.

This makes the master responsible for handing off connections to workers, and makes the master a single point of contention in that case. The current model behaves the same as Unicorn and is effectively a distributed solution, where multiple processes wait for connections instead of a single one.

This does add complexity, and might have unintended side-effects if, for any reason, anything else runs on the master.

@nateberkopec nateberkopec mentioned this pull request Feb 10, 2020
@nateberkopec nateberkopec added the waiting-for-changes label Feb 21, 2020
@nateberkopec

Closing because I think we have a prod-ready approach in @ayufan's PR, and there are two big drawbacks to this one:

  • Master process becomes an acceptor process
  • Probably no JRuby or Windows support.


sayap commented Feb 26, 2020

Just want to chime in that this latency issue is preventing us from migrating from passenger (OSS, single-threaded) to puma. As noted by @wjordan in #1254 (comment), passenger uses the exact same least-busy load balancing approach as this PR, and that has been working exceptionally well for us.

Regarding the drawback of making the master process a single point of failure, I think the fear is rather unfounded, as the master-workers approach works just fine for passenger.

As for the second drawback with cross-platform support, it looks like passenger does work fine on JRuby, but doesn't support Windows. So the concern is indeed valid.

Anyway, the condvar approach also has its own drawbacks, e.g. there can be a significant performance difference across Ruby versions: https://bugs.ruby-lang.org/issues/16255

Furthermore, the latency issue would be more apparent at P99 or P999. If we look at the graphs at https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334, the only one that covers the full range is "Zooming into just /jwt/auth calls", and it does look like the long tail latency is worse with puma + #2079 in that graph?


ayufan commented Feb 26, 2020

Furthermore, the latency issue would be more apparent at P99 or P999. If we look at the graphs at https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334, the only one that covers the full range is "Zooming into just /jwt/auth calls", and it does look like the long tail latency is worse with puma

From my perspective, moving all this logic to the master significantly changes the behaviour of Puma and forces us to re-learn how to run Puma effectively at scale. The biggest issue is that something could run on the master process that slows down request acceptance, and this could cause a performance regression for some.

After doing extensive tests across the board we see significantly better latency, consistent with Unicorn. We ran large-scale tests with and without the patch, and the conclusion was that we did not see any difference between Puma and Unicorn after applying the patch, on all percentiles we looked at (P50, P95, P99, etc.).

Anyway, it is still on my agenda for this cycle to re-try both of them. I'm quite certain that I will not be able to run this PR at scale, as we are that user "that has something running on master" :) I want to see if there's something we can improve with the cond.wait approach, because to be fair I have not focused on optimising that approach yet.

@sayap Did you maybe try my patch to see how it works for you?


wjordan commented Feb 26, 2020

I agree this PR isn't production-ready yet and should be shelved for now given the limitations mentioned. I thought I'd add some detailed notes in case I or anyone else wanted to revisit this approach in the future:

1. Master process becomes an acceptor process

While I agree with @sayap's comment that the concern is mostly theoretical at this point, and that other web servers use a similar request-balancing approach of a dedicated process passing accepted sockets to workers (see e.g. kfcgi and treasure-data/serverengine in addition to passenger), what's unique here is that Puma runs other code in the same master process (including user-defined hooks), and Ruby's global VM lock could cause one thread in the same process to block another. Forking a dedicated acceptor process (as opposed to an acceptor thread) could address this concern, and probably wouldn't be too difficult.

2. Probably no JRuby or Windows support

  • JRuby shouldn't have any issues; according to its test coverage, it supports passing file descriptors via send_io.
  • Windows support is possible but would take extra work. Although Windows doesn't support passing descriptors over a socket directly, it provides a WSADuplicateSocket API which creates a new descriptor for a shared socket that can then be passed through a pipe to the child process. Python wraps this up nicely via socket.share() / socket.fromshare(), but Ruby doesn't have any equivalent, so either a C extension or ffi/fiddle magic is required. treasure-data/serverengine includes a Windows-socket wrapper using fiddle to implement this approach which could probably be adapted.

I would add a third limitation: the synchronization overhead of passing busy counters through the existing pipe channel after each request, which I mentioned above in #2092 (comment). This should be replaced with shared-memory counters for best performance, but that also adds further complexity to the implementation.

For high-performance use-cases, another possibly simpler approach would be to have each worker listen on a separate socket/port, and use a separate load-balancing proxy such as nginx or haproxy to balance requests across them. This offers the best performance, and doesn't add any complexity for setups that already have nginx in their application stack anyway. I'll open a PR demonstrating this approach. (Update: see #2128)
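For example (assuming four workers that could each bind to their own port, which stock Puma doesn't do today), the nginx side might look roughly like:

```nginx
upstream puma_workers {
    least_conn;               # route each request to the least-busy backend
    server 127.0.0.1:9292;
    server 127.0.0.1:9293;
    server 127.0.0.1:9294;
    server 127.0.0.1:9295;
}

server {
    listen 80;
    location / {
        proxy_pass http://puma_workers;
    }
}
```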


ayufan commented Feb 26, 2020

For high-performance use-cases, another possibly simpler approach would be to have each worker listen to a separate socket/port, and use a separate load-balancing proxy such as nginx or haproxy to balance requests across them. This offers best performance, and doesn't add any complexity for setups which already have nginx in their application stack anyway. I'll open a PR demonstrating this approach.

This seems like an interesting idea, @wjordan. If we could allow each worker to have its own socket, the load balancing could happen externally.

This should be replaced with shared-memory counters for best performance but that also adds further complexity to the implementation.

This would be useful as well.

Labels: perf, waiting-for-changes