
Reactor refactor #2279

Merged
merged 3 commits from the reactor_refactor branch into puma:master on Oct 6, 2020

Conversation

@wjordan (Contributor) commented May 21, 2020

Description

The Reactor class was getting pretty complicated and hard to reason about, and it had a tricky bug I was working on fixing (#2282), so this PR is a refactoring pass focused on simplicity and on more carefully separating concerns between the related classes.

The Reactor has a simple purpose: run a select loop on a collection of IOs, with the added feature of waking up an IO when a specified timeout has been reached. With the help of SortedSet and Queue, and using the built-in Selector#wakeup feature, the Reactor class can focus on this one task while being a bit easier to understand.
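To make that purpose concrete, here is a rough, self-contained sketch of the idea (not Puma's actual Reactor: the class name, the `#to_io`/`#timeout_at` interface on clients, and the callback shape are all illustrative, and it assumes nio4r for `NIO::Selector#wakeup`):

```ruby
require 'nio' # nio4r, providing NIO::Selector and Selector#wakeup

# Monitors IO-like clients (anything responding to #to_io and #timeout_at) and
# calls the supplied block when a client becomes readable or its timeout passes.
class MiniReactor
  def initialize(&block)
    @selector = NIO::Selector.new
    @input    = Queue.new   # clients waiting to be registered
    @timeouts = []          # registered clients, ordered by timeout
    @block    = block
  end

  def run
    @thread = Thread.new do
      until @input.closed? && @input.empty?
        # Sleep until the earliest timeout, or until #wakeup interrupts select.
        interval = @timeouts.empty? ? nil : [@timeouts.first.timeout_at - Time.now, 0].max
        @selector.select(interval) { |monitor| wakeup!(monitor.value) }

        # Register clients added since the last pass.
        until @input.empty?
          client = @input.pop
          @selector.register(client.to_io, :r).value = client
          @timeouts << client
        end
        @timeouts.sort_by!(&:timeout_at)

        # Wake any clients whose timeout has passed.
        wakeup!(@timeouts.first) while @timeouts.any? && @timeouts.first.timeout_at < Time.now
      end
      # Hand all remaining clients back on shutdown.
      @timeouts.dup.each { |client| wakeup!(client) }
    end
  end

  # Returns false once the reactor has shut down, so the caller can keep
  # handling the client itself.
  def add(client)
    @input << client
    @selector.wakeup
    true
  rescue ClosedQueueError, IOError
    false
  end

  def shutdown
    @input.close
    begin
      @selector.wakeup
    rescue IOError
      # selector already closed
    end
    @thread&.join
  end

  private

  def wakeup!(client)
    @selector.deregister(client.to_io)
    @timeouts.delete(client)
    @block.call(client)
  end
end
```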

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added an entry to History.md if this PR fixes a bug or adds a feature. If it doesn't need an entry to History.md, I have added [changelog skip] to the pull request title.
  • I have added appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@nateberkopec nateberkopec requested a review from evanphx May 21, 2020 07:28
@nateberkopec nateberkopec added the waiting-for-review Waiting on review from anyone label May 21, 2020
@nateberkopec (Member)

I'm gonna wait for Evan on this. One comment from me: the Reactor class had a lot of docs before, and I'd like to keep a similar level of documentation.

@evanphx (Member) left a comment

At a minimum there need to be some simplifications. But even before that, I agree with @nateberkopec that it's probably too much to lose all that great documentation. You should re-add docs describing the algorithm and how it works.

Resolved (outdated) review threads on:
  • lib/puma/reactor.rb (2)
  • test/test_puma_server.rb
@wjordan wjordan force-pushed the reactor_refactor branch 3 times, most recently from 42baa55 to 9b32780 Compare May 27, 2020 23:37
@wjordan (Contributor, Author) commented May 27, 2020

Finished another pass, please take another look:

  • Moved the request-buffering logic that was passed to the Reactor constructor-block into Server#reactor_wakeup, which calls try_to_finish and leaves the connection in the Reactor if a request isn't ready yet (matching the original behavior).
  • Refactored Client I/O error-handling into Server#client_error so it could be reused by both #reactor_wakeup and #process_client.
  • Moved code passed to the ThreadPool constructor-block into #process_client to simplify the client request-processing flow and avoid duplicating the client I/O error-handling logic.

If the client error-handling consolidation is too much for this PR, I could try to break it out into a separate PR; it's related to simplifying the request-buffering code path in general.
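For orientation, a rough sketch of the combined flow just described (a hypothetical simplification, not the actual diff; the real Server#reactor_wakeup also deals with shutdown and keep-alive cases not shown here):

```ruby
# Sketch only: the block the Server gives the Reactor. It buffers more of the
# request and tells the Reactor whether to stop monitoring the client.
def reactor_wakeup(client)
  if client.try_to_finish          # request fully buffered?
    @thread_pool << client         # hand the complete request to the pool
    true                           # stop monitoring; the pool owns it now
  else
    false                          # not ready yet; leave it in the Reactor
  end
rescue StandardError => e
  client_error(e, client)          # shared client I/O error handling
  client.close
  true
end
```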

I also did another pass on the documentation to preserve as much of the existing, relevant details as possible. Note however:

  1. Because the Reactor implementation is now much shorter/simpler, the parts of documentation describing the implementation internals (wakeup pipe, timeout-array sorting, sleep-timeout calculation, etc) became shorter/simpler as well.
  2. Since the Reactor no longer contains any of the Server/Client request-buffering logic, the parts of the documentation describing the request-buffering process were moved to the Server#reactor_wakeup method alongside that logic. (I also edited it down quite a bit; if anything, I think it's probably still more verbose than it needs to be.)

Finally, it's probably easier to review reactor.rb directly instead of looking at the line-by-line diff for that file.

@nateberkopec nateberkopec requested a review from evanphx May 31, 2020 08:11
[@timeout_at - Time.now, 0].max
end

def <=>(other)
Member commented on this diff:

This feels weird to me conceptually. Clients are the same if they time out at the same time?

wjordan (Contributor, Author) replied:

This method was used by SortedSet (which calls #sort! under the hood to arrange its items); it wasn't meant to convey identity, only ordering. To avoid confusion I changed @timeouts back to an Array that calls #sort_by! to arrange its items in 5175792.
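Put differently, a simplified illustration of the change (not the exact diff): the ordering now lives at the call site via #sort_by! instead of in a Client#<=> method that SortedSet would rely on.

```ruby
require 'set' # SortedSet lives in the set library

Client = Struct.new(:name, :timeout_at)

a = Client.new('a', Time.now + 5)
b = Client.new('b', Time.now + 1)

# Before (roughly): SortedSet needs Client#<=>, which reads like an identity claim.
#   class Client; def <=>(other) = timeout_at <=> other.timeout_at; end
#   timeouts = SortedSet.new([a, b])

# After (roughly): a plain Array, sorted explicitly by the timeout key.
timeouts = [a, b]
timeouts.sort_by!(&:timeout_at)
timeouts.first.name # => "b" (the client that times out soonest)
```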

@nateberkopec (Member)

Left a quick comment, but I'm gonna need more time to read the changes to reactor + server.

@wjordan (Contributor, Author) commented Jun 10, 2020

Updated:

  • Replaced SortedSet with Array based on feedback

  • Tweaked the return code path of Server#process_client to avoid introducing a return from an ensure block, which I've learned is dangerous (it silently discards any exception passing through, including the fatal kill-thread signal); see the short example after this list.

  • Rebased and resolved conflicts with Add unified detailed error logging #2250, but I have one open question related to this: a call to Events#connection_error was added to just one of the three ConnectionError rescue clauses (raised when a client disconnects or times out while the server is reading the request), when there was no logging on any of them previously. This PR refactors those three code paths into a single unified #client_error implementation (specifically to avoid/clarify inconsistencies such as this).
    Note a comment in Reactor (one of the three ConnectionError rescues) explaining why logging isn't recommended in that case:

    puma/lib/puma/reactor.rb

    Lines 229 to 231 in f7b09dd

    # Don't report these to the lowlevel_error handler, otherwise
    # will be flooding them with errors when persistent connections
    # are closed.

    The open question is: Should we add a call to #connection_error for all client ConnectionError exceptions (seems like this isn't recommended), only in certain specific cases (to match the exact behavior introduced by Add unified detailed error logging #2250), or none of them (to match the original behavior)?
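A minimal standalone demonstration of the ensure hazard mentioned in the second bullet above (plain Ruby, not Puma code):

```ruby
def risky
  raise 'boom'
ensure
  return :ok # the explicit return discards the in-flight RuntimeError
end

risky # => :ok, and the caller never sees the exception
      # (the PR notes the same applies to the fatal kill-thread signal)
```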

@wjordan (Contributor, Author) commented Jun 14, 2020

Two related refactoring thoughts that would go in slightly different directions from this PR:

  • If adding a package dependency on timers is acceptable, Timers::Group could be leveraged to handle the timeout tracking/sorting logic. See 2530b45 for an example of what that would look like on top of this PR (a minimal sketch of the Timers::Group API follows after this list).

  • I also noticed that Async::Reactor (from the async gem) overlaps a bit with the purpose of Puma's Reactor and integrates timers in its own very similar select loop. With a few upstream tweaks it might be possible to use Async::Reactor directly and either eliminate this class from Puma entirely or make it a much smaller/simpler wrapper.
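For reference, a minimal sketch of the timers gem API the first bullet points at (illustrative usage only; see the linked commit for how it would actually be wired into the Reactor):

```ruby
require 'timers' # the `timers` gem

timers = Timers::Group.new

# One-shot timers; the group keeps them ordered internally.
timers.after(5)  { puts 'client A timed out' }
timers.after(10) { puts 'client B timed out' }

# Simplest form: block until the earliest timer is due, then run it.
timers.wait

# In a select loop you would instead use the interval as the select timeout:
#   interval = timers.wait_interval   # seconds until the next timer, or nil
#   selector.select(interval) { |monitor| ... }
#   timers.fire                       # run any timers that are now due
```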

@wjordan (Contributor, Author) commented Jun 17, 2020

It turns out it's quite possible to use Async::Reactor here with minimal changes; here's a proof of concept: ce53007 (code is in reactor.rb and Server#queue_client).

I like how simple and readable the extension code ended up, though the underlying async library is a much heavier dependency to take on. It's probably too drastic a change for now, but I thought it was an interesting alternative to check out in any case.

@nateberkopec (Member)

> I like how simple and readable the extension code ended up, though the underlying async library is a much heavier dependency to take on

Yes, agreed (sometimes I find myself regretting nio4r, although it doesn't cause any new issues).

Will come back to review everything else soon

@wjordan (Contributor, Author) commented Sep 23, 2020

> Will come back to review everything else soon

Let me know if you have a chance to review everything else, and if there's anything else I can do to help the next review pass on this.

@wjordan (Contributor, Author) commented Sep 23, 2020

> Rebased and resolved conflicts with #2250, but I have one open question related to this: a call to Events#connection_error was added to just one of the three ConnectionError rescue clauses [...]

Based on the #2371 issue reports I'm assuming the answer to this question is that we want to revert to the original consistent behavior on this 😅

@nateberkopec (Member)

@wjordan This is good to go, but needs a rebase after #2377.

@nateberkopec nateberkopec added waiting-for-changes Waiting on changes from the requestor and removed waiting-for-review Waiting on review from anyone labels Sep 30, 2020
@wjordan wjordan force-pushed the reactor_refactor branch 2 times, most recently from b830a1c to 093f43c Compare October 1, 2020 21:49
Refactor Reactor into a more generic IO-with-timeout monitor,
using a Queue to simplify the implementation.
Move request-buffering logic into Server#reactor_wakeup.
Fixes bug in managing timeouts on clients.
Move, update and rewrite documentation to match updated class structure.
@wjordan (Contributor, Author) commented Oct 1, 2020

OK, rebased and ready for another review. Some changes done as part of the rebase:

@cjlarose (Member) commented Oct 1, 2020

@wjordan I found that this branch (tested on de2f108) actually re-introduces a problem previously fixed by your own #2122: Unable to add work while shutting down

/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:186:in `block in <<'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:179:in `synchronize'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:179:in `with_mutex'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:184:in `<<'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/server.rb:290:in `reactor_wakeup'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/reactor.rb:47:in `rescue in add'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/reactor.rb:43:in `add'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/server.rb:416:in `process_client'
/usr/src/app/vendor/cache/puma-de2f108fcb0a/lib/puma/thread_pool.rb:145:in `block in spawn_thread'
2020-10-01 22:51:38 +0000 Read: #<RuntimeError: Unable to add work while shutting down>
curl: (22) The requested URL returned error: 500 Internal Server Error

It seems like it's possible now for a thread in the ThreadPool to add a client to the Reactor after the Reactor has started to shut down. That alone isn't a problem (there's even explicit code to handle this case), but it is possible for Server#reactor_wakeup to try to add the client to the ThreadPool after the Server has called ThreadPool#shutdown. In that case, the exception Unable to add work while shutting down is raised and the client gets a 500 error response. In practice, this affects availability during phased restarts, hot restarts, and shutdowns.

I have a reproducible test case here: https://github.com/cjlarose/puma-phased-restart-errors/tree/reactor_reactor

This uses MRI on Linux in a Docker container. You might have to run it a while in order to produce the failure.

I think re-introducing the @shutdown_mutex from #2377 might fix the problem, but of course I'm open to other ideas, too.
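For illustration, a rough sketch of that kind of guard (hypothetical and simplified, not the actual #2377 change): make the shutdown check and the hand-off a single atomic step, so a late add is detected and the calling thread keeps the client.

```ruby
class Reactor
  def initialize
    @shutdown_mutex = Mutex.new
    @shutdown = false
    @input = Queue.new
  end

  # Returns false if shutdown has already begun, so the calling thread keeps
  # responsibility for the client instead of re-queuing it on a pool that may
  # itself be shutting down.
  def add(client)
    @shutdown_mutex.synchronize do
      return false if @shutdown
      @input << client
      true
    end
  end

  def shutdown
    @shutdown_mutex.synchronize { @shutdown = true }
    @input.close
    # ... wake the select loop and drain any remaining clients ...
  end
end
```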

@cjlarose (Member) commented Oct 1, 2020

Unrelated: This branch also fixes flakiness in a few tests like test_halt_unix and test_stop_unix on TruffleRuby. The old Reactor didn't quite handle reading from the @ready pipe reliably on that platform, but the new Reactor basically replaced all of that pipe-reading code. Nice work!

@nateberkopec nateberkopec added waiting-for-review Waiting on review from anyone and removed waiting-for-changes Waiting on changes from the requestor labels Oct 2, 2020
@wjordan (Contributor, Author) commented Oct 2, 2020

@cjlarose thanks for catching this! It may be a subtly different bug from the one in #2122, since I'm pretty sure there's a still-passing test that covers the issue described in that PR.

I have a couple ideas on how to fix the issue (one of which is to leave the mutex in place), but I'll also spend some time on writing a test that might reliably trigger this bug to prevent future regressions.

@cjlarose (Member) commented Oct 2, 2020

> I'll also spend some time on writing a test that might reliably trigger this bug to prevent future regressions.

That'd be awesome. I've also written a test that does something similar: it just performs a bunch of hot restarts on a single-mode puma server while concurrently performing a bunch of requests. The expectation is that all clients eventually get a successful response.

cjlarose@c203014

It doesn't pass on all platforms just yet because of various issues in puma, but I'm working on fixing those problems. If you come up with a way to test the Unable to add work while shutting down behavior more precisely, though, that'd be great!

@nateberkopec nateberkopec added waiting-for-changes Waiting on changes from the requestor and removed waiting-for-review Waiting on review from anyone labels Oct 2, 2020
- In `Reactor#shutdown`, `@selector` can be closed before the call to `#wakeup`, so catch/ignore the `IOError` that may be thrown.
- `Reactor#wakeup!` can delete elements from the `@timeouts` array, so calling it from an `#each` block can cause the array iteration to miss elements (illustrated below). Call `@block` directly instead.
- Change `Reactor#add` to return `false` if the reactor is already shut down instead of invoking the block immediately, so a client request currently being processed can continue, rather than being re-added to the thread pool (which may already be shutting down and unable to accept new work).
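A tiny illustration of the iteration hazard behind the second bullet (generic Ruby, not Puma code):

```ruby
timeouts = [1, 2, 3, 4]

# Deleting from an Array while iterating it with #each shifts later elements
# over the current index, so some of them are never yielded.
timeouts.each { |t| timeouts.delete(t) }

timeouts # => [2, 4]  (half the elements were skipped and never "woken up")
```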
@cjlarose (Member) left a comment

The most recent concurrency fixes look good. Just a minor comment.

end
# Wakeup all remaining objects on shutdown.
@timeouts.each(&@block.method(:call))
cjlarose (Member) commented on this diff:

I think we can pass the block directly, no?

      @timeouts.each(&@block)

wjordan (Contributor, Author) replied:

Yes of course, silly me! Harmless enough since this has been merged, but worth slipping a fix into a future PR.

@nateberkopec nateberkopec added waiting-for-review Waiting on review from anyone and removed waiting-for-changes Waiting on changes from the requestor labels Oct 6, 2020
@nateberkopec (Member)

This is really top-notch work. I'm so happy this can be merged.

@nateberkopec nateberkopec merged commit a76d390 into puma:master Oct 6, 2020
wjordan added a commit to wjordan/puma that referenced this pull request Oct 7, 2020
Modifies `TestPumaServer#shutdown_requests` to pause `Reactor#add` until after
shutdown begins, to ensure requests are handled correctly for this edge case.
Adds unit-test coverage for the fix introduced in puma#2377 and updated in puma#2279.
nateberkopec pushed a commit that referenced this pull request Oct 8, 2020

* Test adding connection to Reactor after shutdown
Modifies `TestPumaServer#shutdown_requests` to pause `Reactor#add` until after
shutdown begins, to ensure requests are handled correctly for this edge case.
Adds unit-test coverage for the fix introduced in #2377 and updated in #2279.

* Fix Queue#close implementation for Ruby 2.2
Allow `ClosedQueueError` to be raised when `Queue#<<` is called.

* Pass `@block` directly instead of `@block.method(:call)`
MSP-Greg added a commit to MSP-Greg/puma that referenced this pull request Oct 14, 2020
MSP-Greg added a commit to MSP-Greg/puma that referenced this pull request Oct 15, 2020
MSP-Greg added a commit to MSP-Greg/puma that referenced this pull request Oct 15, 2020
nateberkopec pushed a commit that referenced this pull request Oct 15, 2020