Prevent connections from entering Reactor after shutdown begins #2377
Conversation
When a `Puma::Server` instance starts to shut down, it sets the instance variable `@queue_requests` to `false` and then shuts down the Reactor. The intent is that after the Reactor shuts down, threads no longer have the option of sending their connections to the Reactor: the connection must either be closed (assuming we've written a response to the socket), or the thread must wait until the client has finished writing its request, then write a response, then finally close the connection.

This works fine most of the time. The problem is that there are races in the implementation of the `ThreadPool` where a thread can see that `@queue_requests` is `true` before the `Server` shutdown has started, then later try to add the request to the Reactor, not knowing that it's already shutting down. Depending on the precise order of operations, one possible outcome is that the `ThreadPool` executes `@reactor.add` after the `@reactor` has closed its `@trigger` pipe, resulting in the error `Error reached top of thread-pool: closed stream (IOError)`. Clients experience connection reset errors as a result.

The fix introduced in this commit adds a `Mutex` that makes it impossible for a thread in the `ThreadPool` to add a client connection to the `@reactor` after `@queue_requests` has been set to `false`.

Co-Authored-By: Leo Belyi <leonid.belyi@mycase.com>
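A minimal sketch of the guard described above, not Puma's actual code: the class and method names here (`QueueGuard`, `queue_to_reactor`, `stop_queueing`) are hypothetical. The idea is that one `Mutex` makes "check `@queue_requests`, then call `@reactor.add`" atomic with respect to the shutdown path, closing the race window:

```ruby
require 'thread'

class QueueGuard
  def initialize(reactor)
    @reactor        = reactor
    @queue_requests = true
    @lock           = Mutex.new
  end

  # Worker-thread path: park a partially-read connection in the reactor.
  # Returns false if shutdown has already begun, so the caller must
  # finish the request inline instead of queueing it.
  def queue_to_reactor(client)
    @lock.synchronize do
      return false unless @queue_requests
      @reactor.add(client) # safe: the reactor's trigger pipe is still open
      true
    end
  end

  # Shutdown path: once this returns, no thread can reach @reactor.add,
  # so the reactor can close its trigger pipe without racing.
  def stop_queueing
    @lock.synchronize { @queue_requests = false }
  end
end
```

Without the mutex, a thread could pass the `@queue_requests` check, get preempted while shutdown flips the flag and closes the pipe, and then call `@reactor.add` on a dead reactor.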
It looks like the solution in this PR ended up quite similar to the portion of #2271 involving
I think I triggered this same race condition when running portions of the test suite more quickly. Based on the comments in that PR, if I recall correctly, the refactoring in #2279 may also fix this same race condition at its source in the Reactor, rather than in the Server through an extra mutex around its operations.
That would be great, but that PR is a bigger project. I'll merge this for now; if we can get rid of the extra mutex eventually, great.
* Test adding connection to Reactor after shutdown: modifies `TestPumaServer#shutdown_requests` to pause `Reactor#add` until after shutdown begins, to ensure requests are handled correctly for this edge case. Adds unit-test coverage for the fix introduced in #2377 and updated in #2279.
* Fix `Queue#close` implementation for Ruby 2.2: allow `ClosedQueueError` to be raised when `Queue#<<` is called.
* Pass `@block` directly instead of `@block.method(:call)`.
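For context on the second item: `Queue#close` and `ClosedQueueError` are only built into Ruby 2.3 and later, so older Rubies needed a compatibility shim. A small sketch of the behavior such a shim must match, run here on a modern Ruby where these are native:

```ruby
q = Queue.new
q << :job          # pushing to an open queue works
q.close

begin
  q << :late_job   # pushing after close must fail loudly...
rescue ClosedQueueError => e
  puts e.class     # ...by raising ClosedQueueError, not silently dropping
end

puts q.pop         # draining items queued before close still works
```

The important subtlety is the asymmetry: `<<` raises after close, while `pop` continues to drain already-queued items (and returns `nil` once the closed queue is empty) so in-flight work isn't lost during shutdown.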
Description
This change addresses one source of connection reset errors related to the shutdown process of `Puma::Server`. Details about the particular problem are described in #2337 (comment) and in the commit messages.

Why does this PR not include tests?
The most straightforward way to test this change is to add an integration test that repeatedly fires requests at a server that is restarting via hot restarts (`SIGUSR2`). The expectation is that no client connection experiences a connection reset error, or better, that every client request gets a successful response.

I've written such an integration test, but unfortunately there are other sources of connection reset errors that prevent the test from passing reliably. I suggest we continue to document and catalog those errors in #2337.
My claim is that this change demonstrably fixes one problem that causes connection reset errors and gets us closer to the point where puma reliably handles all client connections it accepted before shutting down, even if the change doesn't fix every kind of problem. It is possible to use the reproduction steps in #2337 to verify that the error `Error reached top of thread-pool: closed stream (IOError)` is resolved by this PR.

Does this change have an effect on puma's maximum throughput?
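To make the quoted error concrete: the Reactor wakes its event loop by writing a byte to an internal trigger pipe, and once shutdown has closed that pipe, any late `add` attempts that write raise `IOError` with exactly the message seen above. A self-contained illustration of that failure mode:

```ruby
r, w = IO.pipe

w << "*"          # the normal wake-up write succeeds
r.read(1)         # the event loop consumes the wake-up byte
w.close           # shutdown closes the trigger pipe

begin
  w << "*"        # the race: a wake-up attempted after shutdown
rescue IOError => e
  puts e.message  # "closed stream", as in the error this PR fixes
end
r.close
```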
Not that I've measured. I did a quick test locally by spinning up a puma server (`bundle exec ruby -Ilib bin/puma -w 4 test/rackup/hello.ru`), then firing off a bunch of requests, but the results seem to be within the margin of error.