
Reactor refactor #2279

Merged
merged 3 commits from the reactor_refactor branch into puma:master on Oct 6, 2020

Conversation

@wjordan (Contributor) commented May 21, 2020

Description

The Reactor class was getting pretty complicated and hard to reason about, and it had a tricky bug I was working on fixing (#2282), so this PR is a refactoring pass focused on simplicity and on more carefully separating concerns between the related classes.

The Reactor has a simple purpose: run a select loop on a collection of IOs, with the added feature of waking up an IO when a specified timeout has been reached. With the help of SortedSet and Queue, and using the built-in Selector#wakeup feature, the Reactor class can focus on this one task while being a bit easier to understand.
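To make that purpose concrete, here is a rough, self-contained sketch of the idea (not Puma's actual Reactor: the class name, the `#to_io`/`#timeout_at` interface on clients, and the callback shape are all illustrative, and it assumes nio4r for `NIO::Selector#wakeup`):

```ruby
require 'nio' # nio4r, providing NIO::Selector and Selector#wakeup

# Monitors IO-like clients (anything responding to #to_io and #timeout_at) and
# calls the supplied block when a client becomes readable or its timeout passes.
class MiniReactor
  def initialize(&block)
    @selector = NIO::Selector.new
    @input    = Queue.new   # clients waiting to be registered
    @timeouts = []          # registered clients, ordered by timeout
    @block    = block
  end

  def run
    @thread = Thread.new do
      until @input.closed? && @input.empty?
        # Sleep until the earliest timeout, or until #wakeup interrupts select.
        interval = @timeouts.empty? ? nil : [@timeouts.first.timeout_at - Time.now, 0].max
        @selector.select(interval) { |monitor| wakeup!(monitor.value) }

        # Register clients added since the last pass.
        until @input.empty?
          client = @input.pop
          @selector.register(client.to_io, :r).value = client
          @timeouts << client
        end
        @timeouts.sort_by!(&:timeout_at)

        # Wake any clients whose timeout has passed.
        wakeup!(@timeouts.first) while @timeouts.any? && @timeouts.first.timeout_at < Time.now
      end
      # Hand all remaining clients back on shutdown.
      @timeouts.dup.each { |client| wakeup!(client) }
    end
  end

  # Returns false once the reactor has shut down, so the caller can keep
  # handling the client itself.
  def add(client)
    @input << client
    @selector.wakeup
    true
  rescue ClosedQueueError, IOError
    false
  end

  def shutdown
    @input.close
    begin
      @selector.wakeup
    rescue IOError
      # selector already closed
    end
    @thread&.join
  end

  private

  def wakeup!(client)
    @selector.deregister(client.to_io)
    @timeouts.delete(client)
    @block.call(client)
  end
end
```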

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added an entry to History.md if this PR fixes a bug or adds a feature. If it doesn't need an entry to History.md, I have added [changelog skip] to the pull request title.
  • I have added appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@nateberkopec nateberkopec requested a review from evanphx May 21, 2020 07:28
@nateberkopec nateberkopec added the waiting-for-review Waiting on review from anyone label May 21, 2020
@nateberkopec (Member)

I'm gonna wait for Evan on this. One comment from me: the Reactor class had a lot of docs before, and I'd like to keep a similar level of documentation.

@evanphx (Member) left a comment

At a minimum there need to be some simplifications. But even before that, I agree with @nateberkopec that it's probably too much to lose all that great documentation. You should re-add docs describing the algorithm and how it works.

Resolved (outdated) review threads on:
  • lib/puma/reactor.rb (2)
  • test/test_puma_server.rb
@wjordan wjordan force-pushed the reactor_refactor branch 3 times, most recently from 42baa55 to 9b32780 Compare May 27, 2020 23:37
@wjordan (Contributor, Author) commented May 27, 2020

Finished another pass, please take another look:

  • Moved the request-buffering logic that was passed to the Reactor constructor-block into Server#reactor_wakeup, which calls try_to_finish and leaves the connection in the Reactor if a request isn't ready yet (matching the original behavior).
  • Refactored Client I/O error-handling into Server#client_error so it could be reused by both #reactor_wakeup and #process_client.
  • Moved code passed to the ThreadPool constructor-block into #process_client to simplify the client request-processing flow and avoid duplicating the client I/O error-handling logic.

If the client error-handling consolidation is too much for this PR, I could try to break it out into a separate PR; it's related to simplifying the request-buffering code path in general.
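For orientation, a rough sketch of the combined flow just described (a hypothetical simplification, not the actual diff; the real Server#reactor_wakeup also deals with shutdown and keep-alive cases not shown here):

```ruby
# Sketch only: the block the Server gives the Reactor. It buffers more of the
# request and tells the Reactor whether to stop monitoring the client.
def reactor_wakeup(client)
  if client.try_to_finish          # request fully buffered?
    @thread_pool << client         # hand the complete request to the pool
    true                           # stop monitoring; the pool owns it now
  else
    false                          # not ready yet; leave it in the Reactor
  end
rescue StandardError => e
  client_error(e, client)          # shared client I/O error handling
  client.close
  true
end
```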

I also did another pass on the documentation to preserve as much of the existing, relevant details as possible. Note however:

  1. Because the Reactor implementation is now much shorter/simpler, the parts of documentation describing the implementation internals (wakeup pipe, timeout-array sorting, sleep-timeout calculation, etc) became shorter/simpler as well.
  2. Since the Reactor no longer contains any of the Server/Client request-buffering logic, the parts of the documentation describing the request-buffering process were moved to the Server#reactor_wakeup method alongside that logic. (I also edited it down quite a bit; if anything, I think it's probably still more verbose than it needs to be.)

Finally, it's probably easier to review reactor.rb directly instead of looking at the line-by-line diff for that file.

@nateberkopec nateberkopec requested a review from evanphx May 31, 2020 08:11
[@timeout_at - Time.now, 0].max
end

def <=>(other)
Member commented on this diff:

This feels weird to me conceptually. Clients are the same if they time out at the same time?

wjordan (Contributor, Author) replied:

This method was used by SortedSet (which calls #sort! under the hood to arrange its items); it wasn't meant to convey identity, only ordering. To avoid confusion I changed @timeouts back to an Array that calls #sort_by! to arrange its items in 5175792.
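Put differently, a simplified illustration of the change (not the exact diff): the ordering now lives at the call site via #sort_by! instead of in a Client#<=> method that SortedSet would rely on.

```ruby
require 'set' # SortedSet lives in the set library

Client = Struct.new(:name, :timeout_at)

a = Client.new('a', Time.now + 5)
b = Client.new('b', Time.now + 1)

# Before (roughly): SortedSet needs Client#<=>, which reads like an identity claim.
#   class Client; def <=>(other) = timeout_at <=> other.timeout_at; end
#   timeouts = SortedSet.new([a, b])

# After (roughly): a plain Array, sorted explicitly by the timeout key.
timeouts = [a, b]
timeouts.sort_by!(&:timeout_at)
timeouts.first.name # => "b" (the client that times out soonest)
```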

@nateberkopec (Member)

Left a quick comment, but I'm gonna need more time to read the changes to reactor + server.

@wjordan (Contributor, Author) commented Jun 10, 2020

Updated:

  • Replaced SortedSet with Array based on feedback

  • Tweaked the return code path of Server#process_client to avoid introducing a return from an ensure block, which I've learned is dangerous (it silently discards any exception passing through, including the fatal kill-thread signal); see the short example after this list.

  • Rebased and resolved conflicts with Add unified detailed error logging #2250, but I have one open question related to this: a call to Events#connection_error was added to just one of the three ConnectionError rescue clauses (raised when a client disconnects or times out while the server is reading the request), when there was no logging on any of them previously. This PR refactors those three code paths into a single unified #client_error implementation (specifically to avoid/clarify inconsistencies such as this).
    Note a comment in Reactor (one of the three ConnectionError rescues) explaining why logging isn't recommended in that case:

    puma/lib/puma/reactor.rb

    Lines 229 to 231 in f7b09dd

    # Don't report these to the lowlevel_error handler, otherwise
    # will be flooding them with errors when persistent connections
    # are closed.

    The open question is: Should we add a call to #connection_error for all client ConnectionError exceptions (seems like this isn't recommended), only in certain specific cases (to match the exact behavior introduced by Add unified detailed error logging #2250), or none of them (to match the original behavior)?
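A minimal standalone demonstration of the ensure hazard mentioned in the second bullet above (plain Ruby, not Puma code):

```ruby
def risky
  raise 'boom'
ensure
  return :ok # the explicit return discards the in-flight RuntimeError
end

risky # => :ok, and the caller never sees the exception
      # (the PR notes the same applies to the fatal kill-thread signal)
```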

@wjordan (Contributor, Author) commented Jun 14, 2020

Two related refactoring thoughts that would go in slightly different directions from this PR:

  • If adding a package dependency on timers is acceptable, Timers::Group could be leveraged to handle the timeout tracking/sorting logic. See 2530b45 for an example of what that would look like on top of this PR (a minimal sketch of the Timers::Group API follows after this list).

  • I also noticed that Async::Reactor (from the async gem) overlaps a bit with the purpose of Puma's Reactor and integrates timers in its own very similar select loop. With a few upstream tweaks it might be possible to use Async::Reactor directly and either eliminate this class from Puma entirely or make it a much smaller/simpler wrapper.
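For reference, a minimal sketch of the timers gem API the first bullet points at (illustrative usage only; see the linked commit for how it would actually be wired into the Reactor):

```ruby
require 'timers' # the `timers` gem

timers = Timers::Group.new

# One-shot timers; the group keeps them ordered internally.
timers.after(5)  { puts 'client A timed out' }
timers.after(10) { puts 'client B timed out' }

# Simplest form: block until the earliest timer is due, then run it.
timers.wait

# In a select loop you would instead use the interval as the select timeout:
#   interval = timers.wait_interval   # seconds until the next timer, or nil
#   selector.select(interval) { |monitor| ... }
#   timers.fire                       # run any timers that are now due
```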

@wjordan (Contributor, Author) commented Jun 17, 2020

It turns out it's quite possible to use Async::Reactor here with minimal changes; here's a proof of concept: ce53007 (code is in reactor.rb and Server#queue_client).

I like how simple and readable the extension code ended up, though the underlying async library is a much heavier dependency to take on. It's probably too drastic a change for now, but I thought it was an interesting alternative to check out in any case.

@nateberkopec (Member)

> I like how simple and readable the extension code ended up, though the underlying async library is a much heavier dependency to take on

Yes, agreed (sometimes I find myself regretting nio4r, although it doesn't cause any new issues).

Will come back to review everything else soon

@wjordan (Contributor, Author) commented Sep 23, 2020

> Will come back to review everything else soon

Let me know if you have a chance to review everything else, and if there's anything else I can do to help the next review pass on this.

@wjordan (Contributor, Author) commented Sep 23, 2020

> Rebased and resolved conflicts with #2250, but I have one open question related to this: a call to Events#connection_error was added to just one of the three ConnectionError rescue clauses [...]

Based on the #2371 issue reports I'm assuming the answer to this question is that we want to revert to the original consistent behavior on this 😅

@nateberkopec (Member)

@wjordan This is good to go, but needs a rebase after #2377.

@nateberkopec nateberkopec added waiting-for-changes Waiting on changes from the requestor and removed waiting-for-review Waiting on review from anyone labels Sep 30, 2020
@wjordan wjordan force-pushed the reactor_refactor branch 2 times, most recently from b830a1c to 093f43c Compare October 1, 2020 21:49
Refactor Reactor into a more generic IO-with-timeout monitor,
using a Queue to simplify the implementation.
Move request-buffering logic into Server#reactor_wakeup.
Fixes bug in managing timeouts on clients.
Move, update and rewrite documentation to match updated class structure.
@wjordan (Contributor, Author) commented Oct 1, 2020

OK, rebased and ready for another review. Some changes done as part of the rebase:

@cjlarose (Member) commented Oct 1, 2020

@wjordan I found that this branch (tested on de2f108) actually re-introduces a problem previously fixed by your own #2122: Unable to add work while shutting down

/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:186:in `block in <<'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:179:in `synchronize'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:179:in `with_mutex'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:184:in `<<'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/server.rb:290:in `reactor_wakeup'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/reactor.rb:47:in `rescue in add'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/reactor.rb:43:in `add'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/server.rb:416:in `process_client'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:145:in `block in spawn_thread'
2020-10-01 22:51:38 +0000 Read: #<RuntimeError: Unable to add work while shutting down>
curl: (22) The requested URL returned error: 500 Internal Server Error

It seems like it's possible now for a thread in the ThreadPool to add a client to the Reactor after the Reactor has started to shut down. That alone isn't a problem (there's even explicit code to handle this case), but it is possible for Server#reactor_wakeup to try to add the client to the ThreadPool after the Server has called ThreadPool#shutdown. In that case, the exception Unable to add work while shutting down is raised and the client gets a 500 error response. In practice, this affects availability during phased restarts, hot restarts, and shutdowns.

I have a reproducible test case here: https://github.com/cjlarose/puma-phased-restart-errors/tree/reactor_reactor

This uses MRI on Linux in a Docker container. You might have to run it a while in order to produce the failure.

I think re-introducing the @shutdown_mutex from #2377 might fix the problem, but of course I'm open to other ideas, too.
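For illustration, a rough sketch of that kind of guard (hypothetical and simplified, not the actual #2377 change): make the shutdown check and the hand-off a single atomic step, so a late add is detected and the calling thread keeps the client.

```ruby
class Reactor
  def initialize
    @shutdown_mutex = Mutex.new
    @shutdown = false
    @input = Queue.new
  end

  # Returns false if shutdown has already begun, so the calling thread keeps
  # responsibility for the client instead of re-queuing it on a pool that may
  # itself be shutting down.
  def add(client)
    @shutdown_mutex.synchronize do
      return false if @shutdown
      @input << client
      true
    end
  end

  def shutdown
    @shutdown_mutex.synchronize { @shutdown = true }
    @input.close
    # ... wake the select loop and drain any remaining clients ...
  end
end
```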

@cjlarose (Member) commented Oct 1, 2020

Unrelated: This branch also fixes flakiness in a few tests like test_halt_unix and test_stop_unix on TruffleRuby. The old Reactor didn't quite handle reading from the @ready pipe reliably on that platform, but the new Reactor basically replaced all of that pipe-reading code. Nice work!

@nateberkopec nateberkopec added waiting-for-review Waiting on review from anyone and removed waiting-for-changes Waiting on changes from the requestor labels Oct 2, 2020
@wjordan (Contributor, Author) commented Oct 2, 2020

@cjlarose thanks for catching this! It may be a subtly different bug from the one in #2122, since I'm pretty sure there's a still-passing test that covers the issue described in that PR.

I have a couple ideas on how to fix the issue (one of which is to leave the mutex in place), but I'll also spend some time on writing a test that might reliably trigger this bug to prevent future regressions.

@cjlarose (Member) commented Oct 2, 2020

> I'll also spend some time on writing a test that might reliably trigger this bug to prevent future regressions.

That'd be awesome. I've also written a test that does something similar: it just performs a bunch of hot restarts on a single-mode puma server while concurrently performing a bunch of requests. The expectation is that all clients eventually get a successful response.

cjlarose@c203014

It doesn't pass on all platforms just yet because of various issues in puma, but I'm working on fixing those problems. If you come up with a way to test the Unable to add work while shutting down behavior more precisely, though, that'd be great!

@nateberkopec nateberkopec added waiting-for-changes Waiting on changes from the requestor and removed waiting-for-review Waiting on review from anyone labels Oct 2, 2020
- In `Reactor#shutdown`, `@selector` can be closed before the call to `#wakeup`, so catch/ignore the `IOError` that may be thrown.
- `Reactor#wakeup!` can delete elements from the `@timeouts` array, so calling it from an `#each` block can cause the array iteration to miss elements (illustrated below). Call `@block` directly instead.
- Change `Reactor#add` to return `false` if the reactor is already shut down instead of invoking the block immediately, so a client request currently being processed can continue, rather than being re-added to the thread pool (which may already be shutting down and unable to accept new work).
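A tiny illustration of the iteration hazard behind the second bullet (generic Ruby, not Puma code):

```ruby
timeouts = [1, 2, 3, 4]

# Deleting from an Array while iterating it with #each shifts later elements
# over the current index, so some of them are never yielded.
timeouts.each { |t| timeouts.delete(t) }

timeouts # => [2, 4]  (half the elements were skipped and never "woken up")
```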
@cjlarose (Member) left a comment

The most recent concurrency fixes look good. Just a minor comment.

end
# Wakeup all remaining objects on shutdown.
@timeouts.each(&@block.method(:call))
cjlarose (Member) commented on this diff:

I think we can pass the block directly, no?

      @timeouts.each(&@block)

wjordan (Contributor, Author) replied:

Yes of course, silly me! Harmless enough since this has been merged, but worth slipping a fix into a future PR.

@nateberkopec nateberkopec added waiting-for-review Waiting on review from anyone and removed waiting-for-changes Waiting on changes from the requestor labels Oct 6, 2020
@nateberkopec (Member)

This is really top-notch work. I'm so happy this can be merged.

@nateberkopec nateberkopec merged commit a76d390 into puma:master Oct 6, 2020
wjordan added a commit to wjordan/puma that referenced this pull request Oct 7, 2020
Modifies `TestPumaServer#shutdown_requests` to pause `Reactor#add` until after
shutdown begins, to ensure requests are handled correctly for this edge case.
Adds unit-test coverage for the fix introduced in puma#2377 and updated in puma#2279.
nateberkopec pushed a commit that referenced this pull request Oct 8, 2020

* Test adding connection to Reactor after shutdown
Modifies `TestPumaServer#shutdown_requests` to pause `Reactor#add` until after
shutdown begins, to ensure requests are handled correctly for this edge case.
Adds unit-test coverage for the fix introduced in #2377 and updated in #2279.

* Fix Queue#close implementation for Ruby 2.2
Allow `ClosedQueueError` to be raised when `Queue#<<` is called.

* Pass `@block` directly instead of `@block.method(:call)`
MSP-Greg added a commit to MSP-Greg/puma that referenced this pull request Oct 14, 2020
MSP-Greg added a commit to MSP-Greg/puma that referenced this pull request Oct 15, 2020
MSP-Greg added a commit to MSP-Greg/puma that referenced this pull request Oct 15, 2020
nateberkopec pushed a commit that referenced this pull request Oct 15, 2020