
Meter accept(2) while taking into account races around thread pool mutex #1563

Merged
merged 1 commit into puma:master from dannyfallon:df/threadpool-race-fix on May 1, 2018

Conversation

dannyfallon
Contributor

@dannyfallon dannyfallon commented Apr 17, 2018

As part of the work in #1278 we refactored the conditional in
`ThreadPool#wait_until_not_full` because we found it difficult to reason about
at the RailsConf '17 roundtable. We broke it up a bit and everyone went away
happier. However, it turns out that you need to account for both `@waiting` and
`@todo.size` together when deciding whether to accept more work from the
socket, and the refactor was not equivalent to the old code.

There are no guarantees that request threads that have been signalled due to
work being added to the thread pool's queue will become active, take the mutex
and decrement `@waiting` before the worker's main thread runs
`ThreadPool#wait_until_not_full`. If we run that before the request thread
decrements `@waiting`, we end up hitting the early exit of `return @waiting > 0`,
taking more work off the socket than the pool's capacity.

The correct way to determine the number of outstanding requests queued in the
pool is `@todo.size - @waiting`, so I've reverted to the old behaviour and
augmented the comment above it.

Fixes #1471
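
For reference, the restored check is roughly of the following shape (a paraphrase of the description above, not the literal diff; `@spawned` here is the number of threads already running and `@max` the configured maximum):

```ruby
# Rough shape of the restored check (paraphrased, not the literal diff).
def wait_until_not_full
  @mutex.synchronize do
    while true
      return if @shutdown

      # @todo.size - @waiting is the queued work that no currently-waiting
      # thread is about to pick up; only accept another connection while
      # that still fits within the threads we could spawn (@max - @spawned).
      return if @todo.size - @waiting < @max - @spawned

      @not_full.wait @mutex
    end
  end
end
```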

@dannyfallon
Contributor Author

dannyfallon commented Apr 17, 2018

/cc @nateberkopec @schneems @evanphx

We encountered this issue in development and it slowed down requests on a four-worker, single-threaded setup. Adding to the impact, it even caused deadlocks when a request made a request to the same API endpoint and that internal request was picked up by the same worker 😬. We monkeypatched it to the old behaviour and it's been working for the past few months without issue. If you wanna chat about this one, I'm at RailsConf again this year 👍


To debug this, I turned into a `puts` debugger and put loads of lines around the 3 operations - `#<<`, `#wait_until_not_full` and the request thread's code in `#spawn_thread`. The debug stuff lives in a different branch over here.
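
The instrumentation was roughly of this shape (illustrative only; the actual debug branch differs):

```ruby
# Illustrative helper matching the "[timestamp, PID: n, TID n] message" format
# in the output below; the actual debug branch may differ.
def debug_log(msg)
  puts format("[%.9f, PID: %d, TID %d] %s",
              Time.now.to_f, Process.pid, Thread.current.object_id, msg)
end

# Sprinkled around the three operations, e.g. (conceptually) in ThreadPool#<<:
#   debug_log "Main process has the mutex for adding work"
#   debug_log "Added work, todo is now #{@todo.size}"
#   debug_log "Signalled not_empty"
```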

Using the test case from @ivanpesin in this comment, I got this output when everything worked normally:

[96338] Puma starting in cluster mode...
[96338] * Version 3.11.4 (ruby 2.4.3-p205), codename: Love Song
[96338] * Min threads: 1, max threads: 1
[96338] * Environment: development
[96338] * Process workers: 2
[96338] * Phased restart available
[96338] * Listening on tcp://0.0.0.0:4567
[96338] Use Ctrl-C to stop
[96338] - Worker 0 (pid: 96352) booted, phase: 0
[1523990354.618174000, PID: 96352, TID 70345126232300] Request thread has the mutex
[1523990354.618414000, PID: 96352, TID 70345126232300] Waiting incremented, now: 1. Signalling not_full, releasing mutex on not_empty
[96338] - Worker 1 (pid: 96353) booted, phase: 0
[1523990354.619208000, PID: 96353, TID 70345126330160] Request thread has the mutex
[1523990354.619330000, PID: 96353, TID 70345126330160] Waiting incremented, now: 1. Signalling not_full, releasing mutex on not_empty

<... send requests ...> 


[1523990364.892947000, PID: 96353, TID 70345126329660] Main process has the mutex for adding work
[1523990364.893070000, PID: 96353, TID 70345126329660] Added work, todo is now 1
[1523990364.893170000, PID: 96353, TID 70345126329660] Signalled not_empty
[1523990364.893613000, PID: 96353, TID 70345126329660] wait_until_not_full called
[1523990364.893731000, PID: 96353, TID 70345126330160] Request thread is awake after not_empty signal received, has the mutex
[1523990364.893830000, PID: 96353, TID 70345126330160] Waiting decremented, now: 0
[1523990364.893905000, PID: 96353, TID 70345126330160] Work taken from todo. Todo is now 0. Releasing mutex
[1523990364.894161000, PID: 96353, TID 70345126329660] Main process has the mutex for wait_until_not_full
[1523990364.894275000, PID: 96353, TID 70345126329660] Todo is 0
[1523990364.894370000, PID: 96353, TID 70345126329660] Full, waiting
[1523990365.908539000, PID: 96352, TID 70345115057660] Main process has the mutex for adding work
[1523990365.908680000, PID: 96352, TID 70345115057660] Added work, todo is now 1
[1523990365.908825000, PID: 96352, TID 70345115057660] Signalled not_empty
[1523990365.908971000, PID: 96352, TID 70345115057660] wait_until_not_full called
[1523990365.909109000, PID: 96352, TID 70345126232300] Request thread is awake after not_empty signal received, has the mutex
[1523990365.909216000, PID: 96352, TID 70345126232300] Waiting decremented, now: 0
[1523990365.909311000, PID: 96352, TID 70345126232300] Work taken from todo. Todo is now 0. Releasing mutex
[1523990365.909444000, PID: 96352, TID 70345115057660] Main process has the mutex for wait_until_not_full
[1523990365.909668000, PID: 96352, TID 70345115057660] Todo is 0
[1523990365.909769000, PID: 96352, TID 70345115057660] Full, waiting
127.0.0.1 - - [17/Apr/2018:14:39:30 -0400] "GET /sleep/5 HTTP/1.1" 200 24 5.0109
[96352] 127.0.0.1 - - [17/Apr/2018:14:39:30 -0400] "GET /sleep/5 HTTP/1.1" 200 24 5.0188
[1523990370.928931000, PID: 96352, TID 70345126232300] Request thread has the mutex
[1523990370.929086000, PID: 96352, TID 70345126232300] Waiting incremented, now: 1. Signalling not_full, releasing mutex on not_empty
[1523990370.929289000, PID: 96352, TID 70345115057660] Main process is awake after not_full signal received, has the mutex
[1523990370.929413000, PID: 96352, TID 70345115057660] Todo is 0
[1523990370.929536000, PID: 96352, TID 70345115057660] Waiting count is 1, returning
[1523990370.929738000, PID: 96352, TID 70345115057660] Main process has the mutex for adding work
[1523990370.929847000, PID: 96352, TID 70345115057660] Added work, todo is now 1
[1523990370.929973000, PID: 96352, TID 70345115057660] Signalled not_empty
[1523990370.930132000, PID: 96352, TID 70345115057660] wait_until_not_full called
[1523990370.930344000, PID: 96352, TID 70345115057660] Main process has the mutex for wait_until_not_full
[1523990370.930512000, PID: 96352, TID 70345115057660] Todo is 1
[1523990370.930659000, PID: 96352, TID 70345115057660] Waiting count is 1, returning
[1523990370.930869000, PID: 96352, TID 70345126232300] Request thread is awake after not_empty signal received, has the mutex
[1523990370.931026000, PID: 96352, TID 70345126232300] Waiting decremented, now: 0
[1523990370.931181000, PID: 96352, TID 70345126232300] Work taken from todo. Todo is now 0. Releasing mutex
127.0.0.1 - - [17/Apr/2018:14:39:32 -0400] "GET /sleep/2 HTTP/1.1" 200 24 2.0031
[96352] 127.0.0.1 - - [17/Apr/2018:14:39:32 -0400] "GET /sleep/2 HTTP/1.1" 200 24 2.0038
[1523990372.935631000, PID: 96352, TID 70345126232300] Request thread has the mutex
[1523990372.935878000, PID: 96352, TID 70345126232300] Waiting incremented, now: 1. Signalling not_full, releasing mutex on not_empty
127.0.0.1 - - [17/Apr/2018:14:39:34 -0400] "GET /sleep/10 HTTP/1.1" 200 25 10.0065
[96353] 127.0.0.1 - - [17/Apr/2018:14:39:34 -0400] "GET /sleep/10 HTTP/1.1" 200 25 10.0146
[1523990374.909421000, PID: 96353, TID 70345126330160] Request thread has the mutex
[1523990374.909596000, PID: 96353, TID 70345126330160] Waiting incremented, now: 1. Signalling not_full, releasing mutex on not_empty
[1523990374.909753000, PID: 96353, TID 70345126329660] Main process is awake after not_full signal received, has the mutex
[1523990374.909898000, PID: 96353, TID 70345126329660] Todo is 0
[1523990374.910038000, PID: 96353, TID 70345126329660] Waiting count is 1, returning

And then I ran the same producer of requests again and encountered the bug: one of the worker processes took the first two requests and handed the third one off to the second process 🤔. Here's the different output:

[1523990379.791620000, PID: 96353, TID 70345126329660] Main process has the mutex for adding work
[1523990379.791774000, PID: 96353, TID 70345126329660] Added work, todo is now 1
[1523990379.791902000, PID: 96353, TID 70345126329660] Signalled not_empty
[1523990379.792004000, PID: 96353, TID 70345126329660] wait_until_not_full called
[1523990379.792306000, PID: 96353, TID 70345126329660] Main process has the mutex for wait_until_not_full
[1523990379.792582000, PID: 96353, TID 70345126329660] Todo is 1
[1523990379.792886000, PID: 96353, TID 70345126329660] Waiting count is 1, returning
[1523990379.793293000, PID: 96353, TID 70345126330160] Request thread is awake after not_empty signal received, has the mutex
[1523990379.793583000, PID: 96353, TID 70345126330160] Waiting decremented, now: 0
[1523990379.793861000, PID: 96353, TID 70345126330160] Work taken from todo. Todo is now 0. Releasing mutex
[1523990380.806674000, PID: 96353, TID 70345126329660] Main process has the mutex for adding work
[1523990380.806799000, PID: 96353, TID 70345126329660] Added work, todo is now 1
[1523990380.806897000, PID: 96353, TID 70345126329660] Signalled not_empty
[1523990380.806992000, PID: 96353, TID 70345126329660] wait_until_not_full called
[1523990380.807089000, PID: 96353, TID 70345126329660] Main process has the mutex for wait_until_not_full
[1523990380.807188000, PID: 96353, TID 70345126329660] Todo is 1
[1523990380.807288000, PID: 96353, TID 70345126329660] Full, waiting
[1523990381.815731000, PID: 96352, TID 70345115057660] Main process has the mutex for adding work
[1523990381.815854000, PID: 96352, TID 70345115057660] Added work, todo is now 1
[1523990381.815944000, PID: 96352, TID 70345115057660] Signalled not_empty
[1523990381.816142000, PID: 96352, TID 70345115057660] wait_until_not_full called
[1523990381.816246000, PID: 96352, TID 70345126232300] Request thread is awake after not_empty signal received, has the mutex
[1523990381.816399000, PID: 96352, TID 70345126232300] Waiting decremented, now: 0
[1523990381.816499000, PID: 96352, TID 70345126232300] Work taken from todo. Todo is now 0. Releasing mutex
[1523990381.816655000, PID: 96352, TID 70345115057660] Main process has the mutex for wait_until_not_full
[1523990381.817067000, PID: 96352, TID 70345115057660] Todo is 0
[1523990381.817182000, PID: 96352, TID 70345115057660] Full, waiting
127.0.0.1 - - [17/Apr/2018:14:39:43 -0400] "GET /sleep/2 HTTP/1.1" 200 24 2.0017
[96352] 127.0.0.1 - - [17/Apr/2018:14:39:43 -0400] "GET /sleep/2 HTTP/1.1" 200 24 2.0023
[1523990383.819630000, PID: 96352, TID 70345126232300] Request thread has the mutex
[1523990383.819893000, PID: 96352, TID 70345126232300] Waiting incremented, now: 1. Signalling not_full, releasing mutex on not_empty
[1523990383.820209000, PID: 96352, TID 70345115057660] Main process is awake after not_full signal received, has the mutex
[1523990383.820488000, PID: 96352, TID 70345115057660] Todo is 0
[1523990383.820707000, PID: 96352, TID 70345115057660] Waiting count is 1, returning
127.0.0.1 - - [17/Apr/2018:14:39:49 -0400] "GET /sleep/10 HTTP/1.1" 200 25 10.0024
[96353] 127.0.0.1 - - [17/Apr/2018:14:39:49 -0400] "GET /sleep/10 HTTP/1.1" 200 25 10.0029
[1523990389.797607000, PID: 96353, TID 70345126330160] Request thread has the mutex
[1523990389.797875000, PID: 96353, TID 70345126330160] Work taken from todo. Todo is now 0. Releasing mutex
127.0.0.1 - - [17/Apr/2018:14:39:54 -0400] "GET /sleep/5 HTTP/1.1" 200 24 5.0010
[96353] 127.0.0.1 - - [17/Apr/2018:14:39:54 -0400] "GET /sleep/5 HTTP/1.1" 200 24 5.0015
[1523990394.800282000, PID: 96353, TID 70345126330160] Request thread has the mutex
[1523990394.800630000, PID: 96353, TID 70345126330160] Waiting incremented, now: 1. Signalling not_full, releasing mutex on not_empty
[1523990394.800954000, PID: 96353, TID 70345126329660] Main process is awake after not_full signal received, has the mutex
[1523990394.801151000, PID: 96353, TID 70345126329660] Todo is 0
[1523990394.801352000, PID: 96353, TID 70345126329660] Waiting count is 1, returning

The difference here is that we ran `wait_until_not_full` before the request thread woke up and got the mutex:

[1523990379.791620000, PID: 96353, TID 70345126329660] Main process has the mutex for adding work
[1523990379.791774000, PID: 96353, TID 70345126329660] Added work, todo is now 1
[1523990379.791902000, PID: 96353, TID 70345126329660] Signalled not_empty
[1523990379.792004000, PID: 96353, TID 70345126329660] wait_until_not_full called
[1523990379.792306000, PID: 96353, TID 70345126329660] Main process has the mutex for wait_until_not_full
[1523990379.792582000, PID: 96353, TID 70345126329660] Todo is 1
[1523990379.792886000, PID: 96353, TID 70345126329660] Waiting count is 1, returning
[1523990379.793293000, PID: 96353, TID 70345126330160] Request thread is awake after not_empty signal received, has the mutex
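
Here's the same interleaving as a minimal, self-contained Ruby sketch (this is not Puma's ThreadPool; the variables just mirror the ones discussed, and the capacity comparison against `max - spawned` is my reading of the restored check):

```ruby
# Not Puma's ThreadPool - just the interleaving above in miniature.
mutex     = Mutex.new
not_empty = ConditionVariable.new
todo      = []
waiting   = 0
max = spawned = 1 # single-threaded pool, as in the logs

worker = Thread.new do
  mutex.synchronize do
    waiting += 1
    not_empty.wait(mutex) # parked, like the request thread above
    waiting -= 1
    todo.shift            # only now is the queued request claimed
  end
end

sleep 0.1 # let the worker park itself on not_empty

mutex.synchronize do
  todo << :request
  not_empty.signal # the worker is runnable but cannot run: we hold the mutex

  # Both checks see the stale waiting count of 1:
  puts "early-exit check accepts more work: #{waiting > 0}"                         # => true
  puts "restored check accepts more work:   #{todo.size - waiting < max - spawned}" # => false
end

worker.join
```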

@josler

josler commented Apr 19, 2018

cc @jasquat if you wanted to try this out

@jasquat

jasquat commented Apr 19, 2018

Thanks @josler and @dannyfallon. This does seem to have fixed my issue.

@schneems
Contributor

schneems commented Apr 26, 2018 via email

@dannyfallon
Contributor Author

dannyfallon commented Apr 26, 2018

👋 This should have a tiny effect on the backlog size, and mainly in a positive way: it should become more accurate. When a running process falls foul of this behaviour, the backlog size will be > 0 despite all threads (one in this example) being busy processing work (`@waiting = 0`).

We don't want to take requests off the socket and put them into the thread pool unless we have capacity in the pool, but because of the mutex sequencing in this race condition (caused by threads not waking up fast enough, i.e. before the server enters `#wait_until_not_full`), this absolutely happens.

So in a single-threaded worker with a naive request shape (i.e. the request is ready to be served and doesn't get put into the reactor after being added to the pool), I expect that we'll never have `@waiting > 0` and backlog > 0 at the same time, which I believe was the goal of the original PR 👍

@schneems
Contributor

Thanks for the work and the explanation!

It looks like #1471 provides a test case for this behavior. Can you confirm the issue is gone with this patch?

I expect that we'll never have @waiting > 0 and backlog > 0 which I believe was the goal of the original PR

Ahh, so if there is a free thread to process requests, there shouldn't ever be a request waiting in the backlog. Makes sense. I'm currently using backlog as a proxy for "are we keeping up with the work". It sounds like this won't impact that?

@dannyfallon
Contributor Author

dannyfallon commented Apr 27, 2018

It looks like #1471 provides a test case for this behavior. Can you confirm the issue is gone with this patch?

The owner of that issue confirms this PR fixes it in the above comment 👍

Ahh, so if there is a free thread to process requests, there shouldn't ever be a request waiting in the backlog. Makes sense.

Yep, and when the thread pool is full with no available threads the backlog builds on the socket not in the process.

I'm currently using backlog as a proxy for "are we keeping up with the work". It sounds like this won't impact that?

Just for clarity, here we're speaking about Puma's backlog metric, obtained via `pumactl stats` for example? If that's correct, then right now this metric is possibly giving you bad data: the backlog can be > 0 while all threads are actually busy, because the server process took too much work off the socket. This shouldn't happen anymore thanks to this PR.
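
For concreteness, here's a rough sketch of reading that stat from Puma's control app (assuming the control app is enabled with a token, e.g. via `activate_control_app` in the config; the URL and token below are examples and the exact JSON fields vary between Puma versions and between single and cluster mode):

```ruby
# Sketch only: the control app URL and token are examples; exact stats fields
# vary across Puma versions (cluster mode nests per-worker stats).
require "json"
require "net/http"

stats = JSON.parse(Net::HTTP.get(URI("http://127.0.0.1:9293/stats?token=example")))
puts "backlog: #{stats['backlog']}"
```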

You should continue to monitor the backlog stat because the server process can still end up with a backlog. For example, if the server takes requests off the socket that weren't fully ready, they end up in the Reactor; the request threads then free up, take more work off the socket and become busy. Meanwhile, the Reactor-based requests become ready and end up in the backlog, and now you're not keeping up with the work. This is documented in #1453.

As a bit of a tangent on backlog monitoring: if you want to know whether you're keeping up with the work, you also need to monitor the socket's backlog. If the socket backlog grows you're under capacity, not just when the thread pool(s) have backlogs.

@chewi
Contributor

chewi commented Apr 27, 2018

In a somewhat desperate attempt for better performance, we put this into production on Monday and it's been working great.

@schneems
Contributor

One issue to consider is the "slow client" attack problem. If we're only ever allowing the same number of connections into the reactor as there are threads in the current thread pool, then it will be easy to exhaust that and force the server to become unresponsive.

@chewi
Contributor

chewi commented Apr 30, 2018

@schneems, does using nginx mitigate that?

@schneems
Contributor

Yes. Nginx has its own reactor that will handle slow clients. However, many people are using Puma over Unicorn explicitly for the slow-client protection (so they get it without also having to run Nginx). We could consider vendoring in Nginx like Passenger does. Also, I think there was a proposal to run Puma via Passenger a while ago. I do not think this is the kind of thing we can just quietly drop support for.

@dannyfallon
Contributor Author

dannyfallon commented Apr 30, 2018

Requests spinning in the Reactor until they're ready don't affect or inflate the thread pool's backlog. It's my understanding that when the pool thread tries to perform work on the client, the client throws an exception which is caught and examined to see if the client needs to go into the Reactor at that point, freeing up the pool's thread to do work on other clients.

Absolutely agree with you on slow clients, but I don't believe that is something either the original #1278 PR or this one is trying to solve?

Unless I'm mistaken, this PR doesn't affect any of these client/slow-client interactions. I'm just trying to restore the thread pool's full-capacity check to the one that was in Puma < 3.9.0 because, due to the way Puma puts request threads that are waiting for work to sleep, we can accidentally suck up more requests from the server than we can process after the thread(s) are woken up.

@schneems
Contributor

schneems commented May 1, 2018

Okay, I've read through a lot of the Puma source code and added docs in #1575 and #1576. So now I think I actually understand what's going on in Puma (in general), at least enough to say more comprehensible things about this PR.

Modifying this does not impact the ability to deal with slow clients at all, yay! That's because the `@todo` size will only increase when a request with a fully formed body makes its way through the reactor. While this is happening, `wait_until_not_full` will continue to not wait, which continues to allow the `Puma::Server` to add new requests to the reactor. This will happen until one of those requests is fully buffered and is then passed to the thread pool.

Based on my newfound understanding of how Puma works, I think this PR should be merged. 🎉🎉🎉

Honestly this is a tangent, but since I'm thinking out loud I'll mention it here: this does present a problem for measuring the backpressure of the server, as with this patch the backlog will always be at or close to zero. I don't think we can get metrics from the TCP socket as to how many requests are waiting, and even if we could, they might not be fully buffered requests, so that wouldn't really be a good indicator anyway. I'll open up another issue.

Thanks for the PR. I'll see what I can do to get this merged.

@schneems schneems merged commit 3f0b725 into puma:master May 1, 2018
@schneems
Contributor

schneems commented May 1, 2018

Thanks a ton for the PR and bearing with me while I figure out how Puma works!!

@dannyfallon dannyfallon deleted the df/threadpool-race-fix branch May 1, 2018 22:06
@dannyfallon
Contributor Author

🙌 Thanks for not leaving this one in purgatory, @schneems, and for documenting all the gnarly bits you didn't understand while you got enough context - I wish I'd done it myself after my own voyage of discovery too 😄

@wjordan
Contributor

wjordan commented Aug 21, 2018

I don't think we can get metrics from the TCP socket as to how many requests are waiting.

Thought I'd jump in here and note that it is technically possible to get info on waiting requests in the TCP socket through the `sock_diag` interface (or indirectly via the `ss` utility, reading `/proc/net/tcp`, etc.). The raindrops Rack middleware provides stats on the total number of queued clients on each TCP listener. It collects its info from a C extension (`linux_inet_diag.c`).
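
A minimal `config.ru` sketch of wiring that middleware up might look like this (the listener address is an example; raindrops serves the stats at `/_raindrops` by default):

```ruby
# config.ru - sketch only; the listener address is an example.
require "raindrops"

# Reports active/queued client counts for the listed TCP listeners,
# served at the /_raindrops path by default.
use Raindrops::Middleware, listeners: %w[0.0.0.0:4567]

run ->(env) { [200, { "Content-Type" => "text/plain" }, ["ok\n"]] }
```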

I've found that monitoring this metric closely provides a very good early-warning for potential cascading failures in a production system.
