
Add hot_restart_does_not_drop_connections tests [changelog skip] #2423

Merged
merged 1 commit into puma:master from MSP-Greg:restart-ok Oct 13, 2020

Conversation

@MSP-Greg (Member) commented Oct 11, 2020

Description

Building on PR #2417, this adds testing for whether restart/USR2 drops connections.

Tests

  1. Adds hot_restart_does_not_drop_connections to helpers/integration.rb. The method is used in both test_integration_cluster.rb and test_integration_single.rb. Each file runs the test two ways: first with all connections created in one thread, then with connections created in five threads.

  2. Currently the test adds output to the bottom 'debug' section, showing resets, restart count, and connections completed after the first restart.

  3. Adds a file rackup/hello_with_delay.ru, used in the test. It is a normal 'Hello World' app with a delay of 0.001 seconds before returning; the delay can be increased to whatever is appropriate. A sketch of the app is shown after this list.

  4. I'm not sure whether the test could also be used to test rolling restart (USR1).
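
A minimal sketch of what rackup/hello_with_delay.ru could look like, based on the description in item 3 (the exact body in the PR may differ):

run lambda { |env|
  sleep 0.001  # small delay so requests are still in flight during a restart
  [200, { "Content-Type" => "text/plain" }, ["Hello World"]]
}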

Integration Tests Misc

  1. Moved cli_pumactl from test_integration_pumactl.rb to helpers/integration.rb. This was done so it can be used in other integration tests when signals aren't supported (Windows).

  2. Moved thread_run_refused from test_integration_cluster.rb to helpers/integration.rb. It's being used by hot_restart_does_not_drop_connections in both 'single' and 'cluster'.

  3. Added if ::Process.respond_to?(:fork) to the end of test_integration_cluster.rb. This omits the skips shown for the tests on platforms that do not support fork. Tired of looking through them on non-MRI failures...

  4. Small changes to the integration tests in how 'worker' is used when starting Puma. This was needed to allow hot_restart_does_not_drop_connections to work in both 'single' and 'cluster' tests.

  5. Added a method fast_write to helpers/integration.rb. It's essentially Puma::Request#fast_write with different error handling. It uses syswrite instead of write, which is probably only beneficial in tests that write to a large number of sockets, as in hundreds or more. A sketch appears after this list.
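
A rough sketch of the fast_write helper described in item 5. The PR only says it is Puma::Request#fast_write with different error handling, so the retry and timeout behavior below is an assumption:

def fast_write(io, str)
  while true
    begin
      n = io.syswrite str
    rescue Errno::EAGAIN, Errno::EWOULDBLOCK
      # socket buffer is full; wait until the socket is writable, then retry
      IO.select(nil, [io], nil, 5) or raise IOError, "write timed out"
      retry
    end
    return if n >= str.bytesize
    str = str.byteslice(n..-1)  # resend whatever syswrite didn't accept
  end
end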

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added an entry to History.md if this PR fixes a bug or adds a feature. If it doesn't need an entry to History.md, I have added [changelog skip] or [ci skip] to the pull request title.
  • I have added appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@dentarg (Member) commented Oct 11, 2020

3. A file rackup/hello_dly.ru has been added

Suggestion: name it rackup/hello_with_delay.ru (I didn't understand what dly was referring to until I read the whole paragraph above)

@MSP-Greg (Member Author)

Suggestion: name it rackup/hello_with_delay.ru

Good point. Renamed...

@MSP-Greg (Member Author) commented Oct 11, 2020

In the test 'socket' loop, the sockets were closed inside a conditional of:

socket.close if socket.is_a?(IO) && !socket.closed?

Apparently, that wasn't enough (see the failed Actions job), so it was changed to:

if socket.is_a?(IO) && !socket.closed?
  begin
    socket.close
  rescue Errno::EBADF
    # the descriptor may already be invalid (e.g. closed out from under us); ignore
  end
end

@nateberkopec (Member)

paging @cjlarose

@nateberkopec added the waiting-for-review label Oct 12, 2020
@MSP-Greg (Member Author)

@cjlarose

Apologies, I thought I pinged you somewhere over the weekend, but I did not. So, I'd appreciate a review when you can.

@cjlarose (Member) commented Oct 12, 2020

Nice work! Just left some minor comments.

pumactl
end

def hot_restart_does_not_drop_connections(num_threads: 1, ttl_requests: 500)
@cjlarose (Member)

Just a nitpick: I read ttl_requests as time-to-live requests, which didn't really make sense.

total_requests would read a little easier.

@MSP-Greg (Member Author)

@cjlarose & @dentarg

Note to self - no more abbreviations in naming. Fixed...

def hot_restart_does_not_drop_connections(num_threads: 1, ttl_requests: 500)
skipped = true
skip_on :jruby, suffix: <<-MSG
- file descriptors are not preserved on exec on JRuby; connection reset errors are expected during restarts
@cjlarose (Member) commented Oct 12, 2020

Seems like the behavior on JRuby should be the same as on Windows: Since exec doesn't preserve file descriptors on those platforms, puma has to unbind-and-rebind the listening socket, so we lose some connections during restart.

Is there any reason why we run the test on Windows (with the expectation that the number of connection reset errors per thread is bounded above by the number of restarts), but skip the test on JRuby? Seems like we could make the same decision in both places.

Either we can skip on both platforms (connection reset errors are expected on both, so maybe not worth having explicit tests for), or we can run it on both.
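
The "bounded above" expectation mentioned here could be asserted roughly as follows; resets_per_thread and restart_count are assumed names, not the test's actual variables:

# each thread's reset count should not exceed the number of restarts
resets_per_thread.each do |resets|
  assert_operator resets, :<=, restart_count
end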

@cjlarose (Member)

I think I lean towards just skipping it on both. Since we expect some connection reset errors on Windows, when this test runs on Windows, it's basically just testing that we can make requests to a running Puma server on that platform and get a successful response; it doesn't seem like it adds a lot of value on top of existing tests.

@MSP-Greg (Member Author)

@cjlarose

I don't have strong feelings either way. I doubt there are many publicly facing Windows Puma instances.

Running it locally, I found the reset errors were bounded and a very small percentage of total requests, and I think resets will normally be retried by most clients. Hence, I included Windows in the tests: if they start failing, something has changed for the worse. Again, I'm not bothered by removing the Windows tests.

Re JRuby and/or TruffleRuby, I'm not sure if I tried them, especially when the code was close to finished...

@MSP-Greg (Member Author)

I just added JRuby in my fork; it definitely did not pass the new tests. See:
https://github.com/MSP-Greg/puma/actions/runs/302733028

Also, see comment below ('I hate intermittent tests') for more info.

@MSP-Greg (Member Author)

These tests are somewhat dependent on timing. Restarting on JRuby probably takes a bit longer than on MRI Rubies, and the gap during restart will be difficult to account for.

@cjlarose (Member)

Makes sense. In my fork, I had changed the test to run for a max number of seconds instead of doing a particular number of requests. cjlarose@691a5e4

Just something to consider
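
A rough sketch of that time-bounded approach; the helper name, port, and duration below are assumptions (see cjlarose@691a5e4 for the actual change):

require "socket"

def request_until_deadline(host: "127.0.0.1", port: 10001, seconds: 5)
  ok = 0
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    TCPSocket.open(host, port) do |socket|
      socket.write "GET / HTTP/1.1\r\nHost: #{host}\r\nConnection: close\r\n\r\n"
      ok += 1 if socket.read.include? "Hello World"
    end
  end
  ok  # number of successful responses within the window
end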

@MSP-Greg (Member Author)

I kind of felt that it was working; along with more runs to see if they're stable, some of the quantities in the tests may be able to change.

I started with 'single' tests, added the threading, then added the 'cluster' tests. I added the debug output at the end of the test log, and noticed that on (probably) the threaded cluster test, all the requests were being processed before the first restart. That's when I kicked the counts up, and later added the small delay in hello_with_delay.ru.

They're all finishing fairly quickly; the threaded cluster test completes 3k requests in less than 3 seconds, so it seems that could stay as it is, even though the 3k number could probably drop to 2k.

After seeing some runs during the busy CI time (Thurs & Fri late afternoon), we might be able to add a 'something slowed Puma down' assert (a rough sketch follows this comment). We really don't have anything like that in the test suite.

BTW, thanks for taking a look at it, and your PRs...
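
That 'something slowed Puma down' assert might look like the following; the 10-second threshold is a pure assumption:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# ... issue the test's requests here ...
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
assert_operator elapsed, :<, 10, "something slowed Puma down"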

@MSP-Greg (Member Author)

I hate intermittent tests.

With the change from ttl_requests to total_requests, the MRI workflow passed here but failed in my fork, one job only.

I added a modified version of Puma::Request#fast_write to integration.rb, and did the writes using it for the new tests. Currently these are the only tests creating a lot of sockets, so it may help.

I removed all tests other than the new ones, which should cause them to run in parallel. Let me run it in my fork a few times...

Co-authored-by: Chris LaRose <cjlarose@gmail.com>
@MSP-Greg (Member Author)

I ran the reduced test set twelve times. All Ubuntu/macOS MRI jobs passed. After six runs, one Windows job failed; I made a small rescue adjustment, and the next six Windows runs all passed. Hence, until the next intermittent failure, this seems stable.

@nateberkopec added the waiting-for-changes label and removed the waiting-for-review label Oct 13, 2020
@MSP-Greg merged commit 54f6911 into puma:master Oct 13, 2020
@MSP-Greg deleted the restart-ok branch Oct 13, 2020
Labels: maintenance, waiting-for-changes