Fix performance of server-side SSL connection close. #2675

Merged
merged 1 commit into puma:master on Sep 8, 2021

Conversation

devwout (Contributor) commented Aug 9, 2021

When the server wants to close a persistent SSL connection
because it was idle for persistent_timeout, the call stack is:

Reactor.wakeup!
  Server#reactor_wakeup
    Client#close
      MiniSSL::Socket#close

The close method is called from within the reactor thread.

In this case, should_drop_bytes? is true, because @engine.shutdown
is false the first time it is called.

Then, read_and_drop(1) is called, which starts by selecting on the
socket. Now because this is a server-initiated close of an idle
connection, in almost all cases there will be nothing to select,
and hence the thread will just wait for 1 second.

Since this is called by the reactor, the reactor will halt for 1 second
and not be able to buffer any data during that time, which has huge
effects on overall worker throughput.
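
For reference, a rough sketch of that old close path, reconstructed from this description (simplified and approximated, not the exact code in lib/puma/minissl.rb):

    # Approximate shape of the old MiniSSL::Socket#close path.
    def close
      # On a server-initiated close of an idle connection the engine has not
      # seen a close_notify yet, so this loop runs at least once.
      while should_drop_bytes?
        return if read_and_drop(1) == :timeout
      end
    ensure
      @socket.close
    end

    def should_drop_bytes?
      !@engine.shutdown    # simplified: stays true until a shutdown completes
    end

    def read_and_drop(timeout = 1)
      # The idle client has sent nothing, so this select usually just sleeps
      # for the full timeout -- inside the reactor thread.
      return :timeout unless IO.select([@socket], nil, nil, timeout)
      @socket.read_nonblock(1024, exception: false)
      :drop
    end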

Now, I'm not sure what the use of reading from the socket is:

  • From the docs:

    It is acceptable for an application to only send its shutdown alert and
    then close the underlying connection without waiting for the peer's
    response.
    https://www.openssl.org/docs/man1.1.1/man3/SSL_shutdown.html

  • The existing code did not seem to send any shutdown alert,
    because the shutdown method was called only on the engine, but the
    engine is not connected to the actual TCP socket. The resulting data
    still needed to be copied to the socket.

  • If the server wants to wait for the client's close_notify shutdown alert,
    then this waiting needs to happen with a non-blocking select in the reactor,
    so other work can be done in the meantime. This is not trivial to
    implement, though.

Note that when the client initiated the close and the data was already
read into the engine, @engine.shutdown will return true immediately,
hence this is only a problem when the server wants to close a
connection.
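
In code, the change described above amounts to something like the following sketch (simplified, not the exact merged diff): ask the engine for the close_notify alert, write whatever it produced to the TCP socket, and close without waiting for the peer's reply.

    # Sketch of the new close: a unidirectional SSL shutdown.
    def close
      @engine.shutdown                      # generate our close_notify alert in the engine
      while alert_data = @engine.extract    # drain the engine's outgoing buffer
        @socket.write alert_data            # actually put the alert on the wire
      end
    rescue IOError, SystemCallError
      # the peer is already gone; nothing left to do
    ensure
      @socket.close                         # close immediately, as the OpenSSL docs above allow
    end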

misdoro (Contributor) commented Aug 26, 2021

👍 The code from this merge request solves the issue of occasional 502 responses with SSL and an Amazon load balancer, as described in #1735, without increasing the persistent_timeout setting.

It also completely solves the occasional 502s on worker restarts with the same SSL + Amazon load balancer configuration.

misdoro (Contributor) commented Aug 27, 2021

Looks like the original code was introduced in #1334 to fix a puma shutdown issue with SSL.

I confirm that with the code from this PR, Puma shuts down without any issue when there are established SSL keep-alive connections.

misdoro (Contributor) commented Aug 27, 2021

🤔 Restarting the truffleruby check or rebasing should solve the failing check; I can't reproduce the failure in my fork or locally.

nateberkopec (Member) commented

This definitely looks like a bit of old whale legs.

If I'm reading this correctly, the close method before did absolutely nothing except, potentially, cause a 1-second wait before always closing the socket in the ensure block.

I think this makes sense; it needs a test.

nateberkopec added the waiting-for-changes label and removed the waiting-for-review label on Sep 6, 2021

nateberkopec (Member) commented

Marking as waiting-on-changes; let us know if you need help writing the test, @devwout.

MSP-Greg (Member) commented Sep 7, 2021

@devwout

I've got a test for this, passing here, failing on master. I'll try to get a PR soon showing the master failure... Thanks for the PR.

MSP-Greg merged commit 319f84d into puma:master on Sep 8, 2021
MSP-Greg removed the waiting-for-changes label on Sep 8, 2021

devwout (Contributor, Author) commented Sep 28, 2021

Completely lost track of this, thank you for adding the tests!

respire (Contributor) commented Oct 6, 2021

Thanks for the fix!

I'm the original author of #1334.
I noticed a change to read_and_drop that replaced the MiniSSL::Socket#read_nonblock call with the Ruby stdlib ::Socket#read_nonblock, without injecting the read bytes back into the SSL engine.
This might be the major reason the IO#select keeps timing out.

The fix by devwout mostly performs a unidirectional shutdown.
SSL_shutdown only closes the write side, not the read side.
To perform a bidirectional one instead, we could call
SSL_read (at MiniSSL#read) to wait for the peer to return close_notify.
Of course, we should guard it with wait_readable to make it non-blocking.
This is what MiniSSL::Socket#read_nonblock does.
So, you could call read_nonblock instead of

while alert_data = @engine.extract
  @socket.write alert_data
end

Though unidirectional shutdown might be enough in most use cases of Puma, I guess.

To prevent a timeout on select (slow client), we could also introduce an array of pending-close sockets, which would make it possible to use a single IO#select call for all of them.
But considering most people probably connect to Puma over a high-speed VPC, blocking IO might not be a big problem.
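
To illustrate the bidirectional variant sketched above: the method name close_bidirectional and the 0.1 s bound below are hypothetical, and read_nonblock stands for MiniSSL::Socket#read_nonblock, which feeds received bytes back into the engine; this is an illustration, not code from this PR.

    # Hypothetical bidirectional close: send our close_notify, then give the
    # peer a short, bounded chance to answer with its own close_notify.
    def close_bidirectional(timeout = 0.1)
      @engine.shutdown                      # queue our close_notify alert
      while alert_data = @engine.extract
        @socket.write alert_data            # flush it to the TCP socket
      end
      # Bounded wait for the peer's reply; a real implementation would hand
      # this wait back to the reactor instead of blocking here.
      read_nonblock(1024) if IO.select([@socket], nil, nil, timeout)
    rescue IOError, SystemCallError
      # the peer closed abruptly; fine, we are closing anyway
    ensure
      @socket.close
    end

A fuller version would keep waiting until the engine reports the shutdown as complete, but even this bounded form avoids a fixed 1-second stall inside the reactor.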

Successfully merging this pull request may close these issues: Lack of SSL max_version can cause excessive slow negotiation, 502 on AWS w/ELB