Least-busy load balancing in cluster mode #2092
Conversation
Forward connections to workers based on current capacity, using a pipe to track capacity across processes. The load-balancing algorithm can be customized in the DSL.
```diff
@@ -267,6 +282,8 @@ def worker(index, master)
   @launcher.config.run_hooks :before_worker_boot, index
 
+  server = start_server
+  server.update_capacity = Proc.new {|c| @worker_write << "!c#{Process.pid},#{c}\n" }
```
Note there is a bit of synchronization overhead in this operation, which starts to become visible at extremely low-latency response times / high request rates (e.g., ~1ms/req, ~10k req/sec). An alternative is to use shared memory (e.g. the raindrops gem), see 486b4f9 for a commit implementing this alternative approach. The tradeoff is pulling in an extra dependency (or writing extra C-extension code) for that extra bit of performance.
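For reference, the pipe protocol under discussion is the one in the diff above: each worker writes a `!c<pid>,<capacity>` line into the shared pipe, and the master parses it to update that worker's capacity. A self-contained sketch (single process, purely for illustration):

```ruby
# Worker side writes its current capacity tagged with its pid; the master
# side parses the message. Both ends run in one process here for illustration.
r, w = IO.pipe

capacity = 5
w << "!c#{Process.pid},#{capacity}\n"   # worker -> master

line = r.gets                           # master reads one message
pid = cap = nil
if line =~ /\A!c(\d+),(\d+)\n\z/
  pid, cap = $1.to_i, $2.to_i
end
```

Every such round trip goes through the pipe and the master's select loop, which is where the synchronization overhead mentioned above comes from; a shared-memory counter (e.g. raindrops) avoids the message passing entirely.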
```diff
@@ -75,9 +77,13 @@ def initialize(idx, pid, phase, options)
   @last_checkin = Time.now
   @last_status = '{}'
   @term = false
+  @socket = socket
+  @capacity = 0
```
Is there a reason we can't re-use information from the threadpool, I wonder, rather than setting a separate counter?
```diff
@@ -140,7 +140,7 @@ def test_stuck_phased_restart
 def term_closes_listeners(unix: false)
   skip_unless_signal_exist? :TERM
 
-  cli_server "-w #{WORKERS} -t 0:6 -q test/rackup/sleep_step.ru", unix: unix
+  cli_server "-w #{WORKERS} -t 1:6 -q test/rackup/sleep_step.ru", unix: unix
```
Why was this req'd?
```ruby
attr_writer :update_capacity

def update_capacity
```
Not a huge fan of this. Would be better if we could leave the threadpool alone.
The approach is totally fine: having a toplevel accepter thread and handing off sockets to child processes.
But this won't work on Windows, and I suspect it won't work on JRuby either (though I'd love some guidance there from @headius because I could be wrong).
As such, this approach needs to be abstracted so that it only operates when `send_io` and `recv_io` are available.
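One possible shape for that abstraction is a simple capability check before enabling the accepter-thread path. This helper is an assumption for illustration, not code from the PR:

```ruby
require 'socket'

# FD passing needs UNIXSocket#send_io / #recv_io. Windows has no UNIXSocket
# at all, and JRuby support for fd passing has historically been incomplete,
# so probe for the methods rather than sniffing the platform.
def fd_passing_supported?
  !!(defined?(UNIXSocket) &&
     UNIXSocket.method_defined?(:send_io) &&
     UNIXSocket.method_defined?(:recv_io))
end
```

A cluster could then fall back to the existing shared accept loop whenever this returns false.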
```diff
@@ -115,13 +122,15 @@ def spawn_thread
         break
       end
 
-      @waiting += 1
+      @waiting += 1; update_capacity
```
Please move these calls to their own line
```diff
       not_full.signal
       not_empty.wait mutex
       @waiting -= 1
     end
 
-    work = todo.shift if continue
+    if continue
+      work = todo.shift; update_capacity
```
Ditto, own line.
```diff
@@ -157,7 +166,7 @@ def <<(work)
       raise "Unable to add work while shutting down"
     end
 
-    @todo << work
+    @todo << work; update_capacity
```
Ditto, own line.
```ruby
algorithms = {
  least_busy: lambda {|workers| workers.min_by {|w| [-w.capacity, w.last_request]}},
  round_robin: lambda {|workers| workers.min_by(&:last_request)},
  pile_up: lambda {|workers| workers.select {|w| w.capacity > 0}.min_by {|w| w.index} || workers.min_by(&:last_request)},
  random: lambda {|workers| workers.sample}
}.freeze
```
This should be moved out of here into a constant rather than hidden here, probably even in its own file.
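For illustration, this is how the `least_busy` lambda behaves against some stand-in workers (the `Worker` struct here is a hypothetical stand-in, not Puma's actual worker class):

```ruby
Worker = Struct.new(:index, :capacity, :last_request)

# Same sort key as in the PR: highest capacity first,
# ties broken by the oldest (smallest) last_request.
least_busy = lambda { |workers| workers.min_by { |w| [-w.capacity, w.last_request] } }

workers = [
  Worker.new(0, 1, 100),  # least free capacity
  Worker.new(1, 3, 200),  # ties on capacity with worker 2...
  Worker.new(2, 3, 150),  # ...but worker 2 has the older last_request
]

least_busy.call(workers).index  # => 2
```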
```ruby
Socket.accept_loop(*@launcher.binder.ios) do |connection, _|
  routing.call(@workers.reject(&:term?)).accept(connection)
  until @workers.any? {|w| w.capacity > 0 }
    @mutex.synchronize {@not_full.wait(@mutex)}
```
Please break the block out to a separate line with `do`
```ruby
routing = @options[:load_balancing]
debug "Using load balancing algorithm: #{routing}"
routing = algorithms[routing] unless routing.is_a?(Proc)
raise "Invalid value for load_balancing - #{@options[:load_balancing]}" unless routing.is_a?(Proc)
```
We should probably be checking this in the config DSL rather than at this level.
```ruby
#
# @note Cluster mode only.
def load_balancing(algorithm=:least_busy)
  @options[:load_balancing] = algorithm
```
Here is where we should validate the argument and throw an error.
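A hedged sketch of what that DSL-level validation might look like. The `ALGORITHMS` constant and the `ConfigDSL` class are assumptions standing in for Puma's internals:

```ruby
# Hypothetical list of built-in algorithm names, mirroring the hash in this PR.
ALGORITHMS = [:least_busy, :round_robin, :pile_up, :random].freeze

class ConfigDSL  # stand-in for Puma's config DSL object
  attr_reader :options

  def initialize
    @options = {}
  end

  # Validate eagerly at configuration time, instead of when the cluster boots.
  def load_balancing(algorithm = :least_busy)
    unless algorithm.is_a?(Proc) || ALGORITHMS.include?(algorithm)
      raise ArgumentError, "Invalid value for load_balancing - #{algorithm.inspect}"
    end
    @options[:load_balancing] = algorithm
  end
end
```

With this in place the cluster code could assume `@options[:load_balancing]` is always valid.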
```diff
@@ -386,7 +389,8 @@ def handle_servers
           break if handle_check
         else
           begin
-            if io = sock.accept_nonblock
+            receive_socket = sock.is_a?(UNIXSocket) && @update_capacity
+            if (io = receive_socket ? sock.recv_io(TCPSocket) : sock.accept_nonblock)
```
While this will work, since `sock` will be a `UNIXServer` in the case of binding on a unix domain socket as the user requested, it's a bit clunky. I'd rather we plumb through a flag to indicate that `recv_io` should be used rather than relying on it being a `UNIXSocket`.
I fully realize this is a POC btw and that a lot of the comments I left are more about a delivered, prod-ready PR. I think the approach is totally valid and you should go ahead and turn this into a prod-ready PR.
I'm not sure if this is OK. This makes the master accept all connections. This does add complexity, and might have unintended side-effects, if for any reason anything else works on the master.
Closing because I think we have a prod-ready approach in @ayufan's PR and there are two big drawbacks to this approach:
Just want to chime in that this latency issue is preventing us from migrating from passenger (OSS, single-threaded) to puma. As noted by @wjordan in #1254 (comment), passenger uses the exact same least-busy load balancing approach as this PR, and that has been working exceptionally well for us.

Regarding the drawback about making the master process a single point of failure, I think the fear is rather unfounded, as the master-workers approach works just fine for passenger.

As for the second drawback with cross-platform support, it looks like passenger does work fine on JRuby, but doesn't support Windows. So the concern is indeed valid.

Anyway, the condvar approach also has its own drawback, e.g. there could be a significant performance difference across ruby versions: https://bugs.ruby-lang.org/issues/16255

Furthermore, the latency issue would be more apparent at P99 or P999. If we look at the graphs at https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334, the only one that covers the full range is "Zooming into just /jwt/auth calls", and it does look like the long tail latency is worse with puma + #2079 in that graph?
From my perspective, moving all logic to the master significantly changes the behaviour of Puma, and makes us re-learn how to run Puma effectively at scale. The biggest issue is that something can run on master.

After doing extensive tests across the board we see significantly better latency, consistent with Unicorn. We ran big scale tests with the patch enabled and disabled, and the conclusion was that we did not see any difference between Puma and Unicorn after applying the patch, on any of the percentiles that we looked at (P50, P95, P99, etc.).

Anyway, it is still on my agenda for this cycle to re-try both of them. I'm quite certain that I will not be able to run this PR at scale, as we are this user "that has something running on master" :) I want to see if there's something that we can improve with

@sayap Did you maybe try my patch to see how it works for you?
I agree this PR isn't production-ready yet and should be shelved for now given the limitations mentioned. I thought I'd add some detailed notes in case I or anyone else wanted to revisit this approach in the future:

**1. Master process becomes an acceptor process**

While I agree with @sayap's comment that the concern is mostly theoretical at this point and that other web-servers use a similar request-balancing approach of using a dedicated process to pass accepted sockets to workers (see e.g.,

**2. Probably no JRuby or Windows support**
I would add a third limitation: the synchronization overhead of passing busy counters through the existing pipe channel after each request, which I mentioned above in #2092 (comment). This should be replaced with shared-memory counters for best performance, but that also adds further complexity to the implementation.

For high-performance use-cases, another possibly simpler approach would be to have each worker listen on a separate socket/port, and use a separate load-balancing proxy such as nginx or haproxy to balance requests across them. This offers the best performance, and doesn't add any complexity for setups which already have nginx in their application stack anyway. I'll open a PR demonstrating this approach. (Update: see #2128)
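A minimal sketch of that per-worker-socket setup in nginx. The socket paths and the three-worker count are assumptions for illustration, not part of the PR; `least_conn` is nginx's own least-busy strategy:

```nginx
# Hypothetical upstream: one entry per puma worker,
# each worker bound to its own unix domain socket.
upstream puma_workers {
    least_conn;
    server unix:/var/run/puma/worker0.sock;
    server unix:/var/run/puma/worker1.sock;
    server unix:/var/run/puma/worker2.sock;
}

server {
    listen 80;
    location / {
        proxy_pass http://puma_workers;
    }
}
```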
This seems like an interesting idea @wjordan. If we could allow each worker to have its own socket, that could be load-balanced externally.
This would be useful as well.
Description
Proof-of-concept HTTP request load-balancing using a 'least-busy' algorithm for cluster mode. Instead of having each worker-process run a select-loop concurrently on the listener and race to accept incoming connections, this PR runs a select-loop on the listener in the parent process and forwards individual client TCP connections to worker-processes by sending the file descriptor through a unix domain socket-pair unique to each worker. With this approach, the parent gets to pick which worker gets the connection using a specific routing algorithm.
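To make the file-descriptor hand-off concrete, here is a self-contained sketch of forwarding an accepted connection over a UNIX socketpair with `send_io`/`recv_io` (MRI on a POSIX platform assumed; all names are illustrative, not from the PR):

```ruby
require 'socket'

master_end, worker_end = UNIXSocket.pair

# "Master" side: accept a client connection on the listener socket.
server = TCPServer.new('127.0.0.1', 0)
client = TCPSocket.new('127.0.0.1', server.addr[1])
conn   = server.accept

# Forward the accepted connection's file descriptor to the chosen worker.
master_end.send_io(conn)
conn.close # the master's copy is no longer needed once the fd is in flight

# "Worker" side: receive a handle to the very same open connection.
io = worker_end.recv_io(TCPSocket)
client.write('ping')
client.close
data = io.read # reads "ping" then EOF, since the client closed
```

In the real PR the two ends live in different processes, with one socketpair per worker, but the kernel-level mechanism is the same.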
In order to implement 'least busy' routing for the parent process to select the worker to send each client connection, each worker updates the parent with its current capacity whenever its threads start or finish processing, and the parent selects the worker with the highest capacity. (If two workers report the same capacity, it chooses the one with the oldest last-request.)
This PR was designed to solve #1254 / #2078 as an alternative approach to #1646, #1920 and #2079. The configuration of custom load-balancing algorithms via DSL could support other use-cases as well.
Performance
Running the benchmark noted in #2079 (comment) yields 4013 requests/sec.
[todo - will put the results of some other micro-benchmarks/comparisons here]