
Explore: fairer load balancing #71

Open

casperisfine opened this issue Sep 29, 2023 · 1 comment

@casperisfine
Contributor

Linux's epoll+accept queue is fundamentally LIFO (see a good writeup at https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/).

Because of this, neither Unicorn nor Pitchfork properly balances load between workers: unless the deployment is at capacity, the first workers will handle disproportionately more work.

In some ways this behavior can be useful, but in others it may be undesirable. Most notably, it can create a situation where some of the workers are only used when there is a spike of traffic, and when that spike happens, it hits colder workers.

Pitchfork helps with that cold worker issue thanks to reforking; however, the first few requests after reforking are likely to hit page faults, so there is still a (smaller) cold worker problem.

We could explore opening multiple TCP servers with SO_REUSEPORT to split the load evenly between subgroups of workers. The downside is that this would create a round-robin between the groups, so if one group gets multiple much slower requests, it may spike latency.
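For illustration only (this is not existing pitchfork configuration), opening several listeners on the same port with SO_REUSEPORT could look roughly like this; the port and group count are made up:

```ruby
require "socket"

# Hypothetical sketch: open several listeners on the same port with
# SO_REUSEPORT so the kernel hashes incoming connections across them.
# PORT and GROUPS are illustrative values, not pitchfork settings.
PORT = 8080
GROUPS = 8

listeners = GROUPS.times.map do
  server = Socket.new(:INET, :STREAM)
  server.setsockopt(:SOL_SOCKET, :SO_REUSEADDR, true)
  server.setsockopt(:SOL_SOCKET, :SO_REUSEPORT, true)
  server.bind(Addrinfo.tcp("0.0.0.0", PORT))
  server.listen(1024)
  server
end
```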

One way I'd like to explore would be to have intertwined groups, e.g.:

  • Say 32 workers (0, 1, ...)
  • 8 SO_REUSEPORT sockets (A, B, C, ...)

We could do something where:

  • Workers 0..7 listen to A
  • Workers 4..11 listen to B
  • Workers 8..15 listen to C

This way each worker listens on multiple request pools (2 in the example, but it could be more).

I think such a setup could be a good compromise between fairness and tail latency.
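To make the intertwined mapping concrete, here is a small sketch assuming 32 workers and 8 groups, as in the example above; the wrap-around for the first workers is an assumption (only groups A..C are spelled out), and none of this is existing pitchfork API:

```ruby
# Each group is shared by 8 workers, consecutive groups are offset by 4
# workers, and each worker ends up polling 2 overlapping request pools.
WORKERS = 32
GROUPS = 8
STRIDE = WORKERS / GROUPS # 4

def groups_for(worker_id)
  home = worker_id / STRIDE   # e.g. worker 5 -> group 1 ("B")
  prev = (home - 1) % GROUPS  # e.g. worker 5 -> group 0 ("A"), wrapping around
  [prev, home]
end

WORKERS.times do |w|
  puts "worker #{w} polls groups #{groups_for(w).inspect}"
end
```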

@dalehamel
Member

I did some prior work here, and the key is finding the balance. Here's a bit of a braindump:

  • If you have one socket per worker, and each worker only polls on one socket, you have the fairest possible load balancing. The kernel will do consistent hashing of the incoming requests, and you'll see that the load is basically equal across workers.
    • Drawbacks:
      • Sub-optimal latency. It is possible for a request to land on a queue that has work ahead of it while another queue is empty. This is described in the Cloudflare article.
      • You need to preserve these file descriptors and the count cannot change - the master process must create them, then share them with the children (see the sketch after this list). Otherwise, if you lose a worker, it messes up the consistent hashing. This is easy enough to work around, but if you have dynamic worker counts, it can be a problem.
      • If a worker dies or reboots, requests will still pile up in that worker's queue.
  • At the other extreme, if you have every worker poll every socket, you basically end up back at the default LIFO behaviour.
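The one-socket-per-worker extreme might look roughly like this: the parent opens one SO_REUSEPORT listener per worker before forking, so the set of fds (and the kernel's hashing across them) stays fixed even if a worker is replaced. Port, worker count, and the request handling are placeholders:

```ruby
require "socket"

PORT = 8080
WORKER_COUNT = 4

# Parent creates all listeners up front; children inherit them across fork.
listeners = WORKER_COUNT.times.map do
  s = Socket.new(:INET, :STREAM)
  s.setsockopt(:SOL_SOCKET, :SO_REUSEPORT, true)
  s.bind(Addrinfo.tcp("0.0.0.0", PORT))
  s.listen(128)
  s
end

WORKER_COUNT.times do |i|
  fork do
    listener = listeners[i] # each child accepts from exactly one queue
    loop do
      conn, _addr = listener.accept
      conn.close # a real worker would serve the request here
    end
  end
end

Process.waitall
```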

The key is finding some balance here and it is a bit tricky. Between these two poles, you basically have two sliders you can tune:

  • The number of fds that each worker polls
  • The number of fds (specifically, the ratio of fds to workers)

You can end up with "fairer" load balancing if you adjust it such that each worker polls a few sockets. This mitigates some of the issues:

  • It is less likely you'll hit the suboptimal latency case, as you have multiple workers able to pick up work from each socket
  • When one worker restarts, other workers polling the same fd can pick up the slack for it.

I seem to recall that for values like:

  • 32 workers
  • 32 file descriptors
  • 3-4 file descriptors per worker

We got pretty good latency characteristics and pretty good load balancing. You still get "localized pockets" of LIFO, but the extremes are diminished. I.e., you might have 8 hot workers, 8 colder workers, and 16 "medium warm" workers, but the extremes are much less pronounced than what you see with the default pure LIFO.
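For what it's worth, a strided layout is one plausible way to get that 32-worker / 32-fd / 4-fds-per-worker shape; the exact mapping used back then isn't recorded here, so the following is only an illustration:

```ruby
# Illustrative only: give each of 32 workers 4 of 32 fds, with a fixed
# stride so neighbouring workers share most (but not all) of their queues,
# producing the "localized pockets" of LIFO mentioned above.
WORKERS = 32
FDS = 32
FDS_PER_WORKER = 4

def fds_for(worker_id)
  FDS_PER_WORKER.times.map { |k| (worker_id + k) % FDS }
end

WORKERS.times do |w|
  puts "worker #{w} polls fds #{fds_for(w).inspect}"
end
```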
