
Add hot_restart_does_not_drop_connections tests [changelog skip] #2423

Merged
merged 1 commit into puma:master from MSP-Greg:restart-ok Oct 13, 2020

Conversation

@MSP-Greg (Member) commented Oct 11, 2020

Description

Building on PR #2417, this adds testing for whether restart/USR2 drops connections.

Tests

  1. Adds hot_restart_does_not_drop_connections to helpers/integration.rb. The method is used in both test_integration_cluster.rb and test_integration_single.rb. Each file runs the test two ways: first with all connections created in one thread, then with connections created in five threads.

  2. Currently the test adds output to the bottom 'debug' section, showing resets, restart count, and connections completed after the first restart.

  3. Adds a file rackup/hello_with_delay.ru, used in the test. It is a normal 'Hello World' app with a delay of 0.001 seconds before returning; the delay can be increased to whatever is appropriate. A sketch of the app is shown after this list.

  4. I'm not sure whether the test could also be used to test rolling restart (USR1).
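
A minimal sketch of what rackup/hello_with_delay.ru could look like, based on the description in item 3 (the exact body in the PR may differ):

run lambda { |env|
  sleep 0.001  # small delay so requests are still in flight during a restart
  [200, { "Content-Type" => "text/plain" }, ["Hello World"]]
}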

Integration Tests Misc

  1. Moved cli_pumactl from test_integration_pumactl.rb to helpers/integration.rb. This was done so it can be used in other integration tests when signals aren't supported (Windows).

  2. Moved thread_run_refused from test_integration_cluster.rb to helpers/integration.rb. It's being used by hot_restart_does_not_drop_connections in both 'single' and 'cluster'.

  3. Added if ::Process.respond_to?(:fork) to the end of test_integration_cluster.rb. This omits the skips shown for the tests on platforms that do not support fork. Tired of looking through them on non-MRI failures...

  4. Small changes to the integration tests in how 'worker' is used when starting Puma. This was needed to allow hot_restart_does_not_drop_connections to work in both 'single' and 'cluster' tests.

  5. Added a method fast_write to helpers/integration.rb. It's essentially Puma::Request#fast_write with different error handling. It uses syswrite instead of write, which is probably only beneficial in tests that write to a large number of sockets, as in hundreds or more. A sketch appears after this list.
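
A rough sketch of the fast_write helper described in item 5. The PR only says it is Puma::Request#fast_write with different error handling, so the retry and timeout behavior below is an assumption:

def fast_write(io, str)
  while true
    begin
      n = io.syswrite str
    rescue Errno::EAGAIN, Errno::EWOULDBLOCK
      # socket buffer is full; wait until the socket is writable, then retry
      IO.select(nil, [io], nil, 5) or raise IOError, "write timed out"
      retry
    end
    return if n >= str.bytesize
    str = str.byteslice(n..-1)  # resend whatever syswrite didn't accept
  end
end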

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added an entry to History.md if this PR fixes a bug or adds a feature. If it doesn't need an entry to History.md, I have added [changelog skip] or [ci skip] to the pull request title.
  • I have added appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@dentarg (Member) commented Oct 11, 2020

3. A file rackup/hello_dly.ru has been added

Suggestion: name it rackup/hello_with_delay.ru (I didn't understand what dly was referring to until I read the whole paragraph above)

@MSP-Greg (Member Author)

Suggestion: name it rackup/hello_with_delay.ru

Good point. Renamed...

@MSP-Greg (Member Author) commented Oct 11, 2020

In the test 'socket' loop, the sockets were closed inside a conditional of:

socket.close if socket.is_a?(IO) && !socket.closed?

Apparently, that wasn't enough (see the failed Actions job), so it was changed to:

if socket.is_a?(IO) && !socket.closed?
  begin
    socket.close
  rescue Errno::EBADF
    # the descriptor may already be invalid (e.g. closed out from under us); ignore
  end
end

@nateberkopec (Member)

paging @cjlarose

@nateberkopec added the waiting-for-review label Oct 12, 2020
@MSP-Greg (Member Author)

@cjlarose

Apologies, I thought I pinged you somewhere over the weekend, but I did not. So, I'd appreciate a review when you can.

@cjlarose (Member) commented Oct 12, 2020

Nice work! Just left some minor comments.

pumactl
end

def hot_restart_does_not_drop_connections(num_threads: 1, ttl_requests: 500)
@cjlarose (Member)

Just a nitpick: I read ttl_requests as time-to-live requests, which didn't really make sense.

total_requests would read a little easier.

@MSP-Greg (Member Author)

@cjlarose & @dentarg

Note to self - no more abbreviations in naming. Fixed...

def hot_restart_does_not_drop_connections(num_threads: 1, ttl_requests: 500)
skipped = true
skip_on :jruby, suffix: <<-MSG
- file descriptors are not preserved on exec on JRuby; connection reset errors are expected during restarts
@cjlarose (Member) commented Oct 12, 2020

Seems like the behavior on JRuby should be the same as on Windows: Since exec doesn't preserve file descriptors on those platforms, puma has to unbind-and-rebind the listening socket, so we lose some connections during restart.

Is there any reason why we run the test on Windows (with the expectation that the number of connection reset errors per thread is bounded above by the number of restarts), but skip the test on JRuby? Seems like we could make the same decision in both places.

Either we can skip on both platforms (connection reset errors are expected on both, so maybe not worth having explicit tests for), or we can run it on both.
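
The "bounded above" expectation mentioned here could be asserted roughly as follows; resets_per_thread and restart_count are assumed names, not the test's actual variables:

# each thread's reset count should not exceed the number of restarts
resets_per_thread.each do |resets|
  assert_operator resets, :<=, restart_count
end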

@cjlarose (Member)

I think I lean towards just skipping it on both. Since we expect some connection reset errors on Windows, when this test runs on Windows, it's basically just testing that we can make requests to a running Puma server on that platform and get a successful response; it doesn't seem like it adds a lot of value on top of existing tests.

@MSP-Greg (Member Author)

@cjlarose

I don't have strong feelings either way. I doubt there are many publicly facing Windows Puma instances.

Running it locally, I found the reset errors were bounded and a very small percentage of total requests, and I think resets will normally be retried by most clients. Hence, I included Windows in the tests: if they start failing, something has changed for the worse. Again, I'm not bothered by removing the Windows tests.

Re JRuby and/or TruffleRuby, I'm not sure if I tried them, especially when the code was close to finished...

@MSP-Greg (Member Author)

I just added JRuby in my fork; it definitely did not pass the new tests. See:
https://github.com/MSP-Greg/puma/actions/runs/302733028

Also, see comment below ('I hate intermittent tests') for more info.

@MSP-Greg (Member Author)

These tests are somewhat dependent on timing. Restarting on JRuby probably takes a bit longer than on MRI Rubies, and the gap during restart will be difficult to account for.

@cjlarose (Member)

Makes sense. In my fork, I had changed the test to run for a max number of seconds instead of doing a particular number of requests. cjlarose@691a5e4

Just something to consider
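
A rough sketch of that time-bounded approach; the helper name, port, and duration below are assumptions (see cjlarose@691a5e4 for the actual change):

require "socket"

def request_until_deadline(host: "127.0.0.1", port: 10001, seconds: 5)
  ok = 0
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    TCPSocket.open(host, port) do |socket|
      socket.write "GET / HTTP/1.1\r\nHost: #{host}\r\nConnection: close\r\n\r\n"
      ok += 1 if socket.read.include? "Hello World"
    end
  end
  ok  # number of successful responses within the window
end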

@MSP-Greg (Member Author)

I kind of felt that it was working; along with more runs to see if they're stable, some of the quantities in the tests may be able to change.

I started with 'single' tests, added the threading, then added the 'cluster' tests. I added the debug output at the end of the test log, and noticed that on (probably) the threaded cluster test, all the requests were being processed before the first restart. That's when I kicked the counts up, and later added the small delay in hello_with_delay.ru.

They're all finishing fairly quickly; the threaded cluster test completes 3k requests in less than 3 seconds, so it seems that could stay as it is, even though the 3k number could probably drop to 2k.

After seeing some runs during the busy CI time (Thurs & Fri late afternoon), we might be able to add a 'something slowed Puma down' assert (a rough sketch follows this comment). We really don't have anything like that in the test suite.

BTW, thanks for taking a look at it, and your PRs...
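
That 'something slowed Puma down' assert might look like the following; the 10-second threshold is a pure assumption:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# ... issue the test's requests here ...
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
assert_operator elapsed, :<, 10, "something slowed Puma down"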

@MSP-Greg (Member Author)

I hate intermittent tests.

With the change from ttl_requests to total_requests, the MRI workflow passed here but failed in my fork, one job only.

I added a modified version of Puma::Request#fast_write to integration.rb, and did the writes using it for the new tests. Currently these are the only tests creating a lot of sockets, so it may help.

I removed all tests other than the new ones, which should cause them to run in parallel. Let me run it in my fork a few times...

Co-authored-by: Chris LaRose <cjlarose@gmail.com>
@MSP-Greg (Member Author)

I ran the reduced test set twelve times. All Ubuntu/macOS MRI jobs passed. After six runs, one Windows job failed; I made a small rescue adjustment, and the next six Windows runs all passed. Hence, until the next intermittent failure, this seems stable.

@nateberkopec added the waiting-for-changes label and removed the waiting-for-review label Oct 13, 2020
@MSP-Greg merged commit 54f6911 into puma:master Oct 13, 2020
@MSP-Greg deleted the restart-ok branch Oct 13, 2020
Labels: maintenance, waiting-for-changes