Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transport: stop always closing connections when loopy returns #6110

Merged
merged 4 commits into from
Mar 14, 2023

Conversation

dfawley
Copy link
Member

@dfawley dfawley commented Mar 9, 2023

Fixes #6019

As discovered in #6019, closing a connection upon encountering an I/O error can result in loss of data that was sent before the connection was closed. We previously believed that any data in the TCP connection would still be available to the reader indefinitely, but it appears that is not the case. This change makes it so we only close the connection from the loopy writer in non-I/O error situations. If an I/O error causes the loopy writer to exit, we don't need to close the connection, as the reader goroutine will also encounter an I/O error once it is done consuming all the data.

RELEASE NOTES:

  • transport: do not close connections when we encounter I/O errors until after all data is consumed

for {
it, err := l.cbuf.get(true)
if err != nil {
l.closeConnection()
return err
}
if err = l.handle(it); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle() returns an error only when it encounters an unknown control message type. Is this an I/O error? Shouldn't we close the connection here? Same applies to the handle() call down below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I see that processData() today returns an error only when writing of data or headers fails. But how can we guarantee that in the future? Should we at least document that handle() and processData() should return errors only for I/O related events. And also document loopy.run() saying it will close the connection only when it sees a non-I/O error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle returns errors from the handlers themselves which has a lot of I/O error possibilities. We'll need to do that closing in handle itself. Or we could make the errors carry a type with a bool to indicate whether they are I/O errors, but that felt messier.

It seems there's two ways to do this. Commit 1 is wrapping in a lot of different places which feels finnicky, and commit 2 is wrapping in the writer which might not work if the http2 framer decides to start wrapping errors without supporting Unwrap (but that seems very unlikely and we could deal with it if it ever happens).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the ioError option as well, and the PR looks good to me.

@@ -846,7 +856,8 @@ func (l *loopyWriter) handle(i interface{}) error {
case *outFlowControlSizeRequest:
l.outFlowControlSizeRequestHandler(i)
case closeConnection:
return l.closeConnectionHandler()
l.closeConnection()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only current usage of the closeConnection type is from the server-side keepalive code (when the grace period expires). Why do we need this separate type? Why can't we instead simply close the transport and let is call Close() on the underlying connection. Is this intended to be way to close the connection after completing all pending tasks in the controlbuf, while closing the transport will immediately close the underlying connection without completing pending tasks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to flush any pending writes that may have made it legal to close the transport at this time (vs waiting longer for streams to finish). This was added recently: #5821

@easwars easwars assigned dfawley and unassigned easwars Mar 11, 2023
@dfawley dfawley assigned easwars and unassigned dfawley Mar 13, 2023
@easwars easwars assigned dfawley and unassigned easwars Mar 13, 2023
@dfawley dfawley merged commit b458a4f into grpc:master Mar 14, 2023
1 check passed
@dfawley dfawley deleted the cbufcleanup branch March 14, 2023 20:32
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Why gRPC Server closes a connection at the 2nd GOAWAY.
2 participants