Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: propagate connection error causes to RPC statuses #4311

Merged
merged 12 commits into from Apr 13, 2021

Conversation

apolcyn
Copy link
Contributor

@apolcyn apolcyn commented Apr 1, 2021

Motivated mainly to debug internal issue: b/182572215

This PR is a replacement for a subset of the behavior added in #4190. In particular, note that this follows the first idea described in the design in #4163 (comment) about passing connection close errors to the transport's Close method. However, this PR doesn't try to propagate errors up further to the LB policy or ClientConn. My thinking is that the error propagation added here (for RPCs that have picked connections) is still useful on its own, and we can save further plumbing for followup changes, but please let me know.

@apolcyn apolcyn marked this pull request as ready for review April 1, 2021 21:18
@dfawley dfawley self-requested a review April 8, 2021 20:39
@dfawley dfawley self-assigned this Apr 8, 2021
@dfawley dfawley added this to the 1.38 Release milestone Apr 8, 2021
Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up. Looks great, just a few minor comments.

clientconn.go Show resolved Hide resolved
internal/transport/http2_client.go Outdated Show resolved Hide resolved
internal/transport/http2_client.go Show resolved Hide resolved
internal/transport/http2_client.go Outdated Show resolved Hide resolved
internal/transport/http2_client.go Outdated Show resolved Hide resolved
tc := testpb.NewTestServiceClient(cc)
ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
defer cancel()
possibleConnResetMsg := "connection reset by peer"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When and how would this error occur?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I think I actually left this in while I was developing the test and before I had the rpcStartedOnServer channel to wait for the server handler to receive the RPC (preventing us from stopping the server before headers had been sent from client to server). When we remove that channel and the synchronization it provides, stopping the server can easily race with the client's attempt to send RPC headers to the server, and we can see frequent RST packets being sent from server to client in this case (I guess, if a socket closes while they're still un-processed data in the socket buffer, then the kernel will generate an RST to send back to the peer) - in this case, the error message will be "connection reset by peer".

That said, I think that the "synchronization" provided by the rpcStartedOnServer channel is fundamentally brittle, because AFAICT the client can still e.g. send a BDP ping to the server - i.e., this channel doesn't give us an actual guarantee that no more data will be sent from to the client, so I think the test is actually more robust if we remove this channel entirely.

test/end2end_test.go Outdated Show resolved Hide resolved
test/end2end_test.go Outdated Show resolved Hide resolved
@dfawley dfawley assigned apolcyn and unassigned dfawley Apr 9, 2021
Comment on lines 1344 to 1345
// Use WithBlock() to guarantee that the RPC will be able to pick a healthy connection to
// go out on before we call Stop on the server.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default "wait for ready" behavior of RPCs should be fine for this, too. Are you sure this is necessary (and why)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this in because NI was still looking for a way to make sure that RPCs failed strictly after the client conn had found a healthy TCP connection for them to go out on, because I was seeing some flakes where the RPC failed with:

end2end_test.go:1372: &{0xc0000ba360}.Recv() = _, rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:37309: connect: connection refused", want _, rpc error containing substring: |connection reset by peer| OR |error reading from server: EOF|

i.e., in some test flakes, the call to ss.S.Stop() caused server shutdown apparently before the RPC picked a connection.

I dealt with this instead by just adding back the rpcStartedOnServer channel.

BTW, AFAICS RPCs aren't actually using the WaitForReady call option, when using the stub server, right? I do see

if err := waitForReady(cc); err != nil {
in the stub server, but it looks like that basically accomplishes the same thing as WithBlock, rather than the WaitForReady call option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, AFAICS RPCs aren't actually using the WaitForReady call option, when using the stub server, right?

Oops.. "wait for ready" is not what I meant - the default is not WaitForReady. But by default, RPCs should wait for the channel to go from connecting->ready (or transient failure, whis is not expected in tests).

I'm not sure why that waitForReady call is in the stub server; we should probably remove it and do it in the tests that need it for whatever reason.

But since you are using ss.Start which does that waitForReady thing, I'm not sure why you'd be getting that RPC error with "connection refused". That seems like something we should look into.

ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
defer cancel()
// The precise behavior of this test is subject to raceyness around the timing of when TCP packets
// or sent from client to server, and when we tell the server to stop, so we need to account for both
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or->are

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
ss.S.Stop()
if _, err := stream.Recv(); err == nil || (!strings.Contains(err.Error(), possibleConnResetMsg) && !strings.Contains(err.Error(), possibleEOFMsg)) {
t.Fatalf("%v.Recv() = _, %v, want _, rpc error containing substring: |%v| OR |%v|", stream, err, possibleConnResetMsg, possibleEOFMsg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: %q instead of |%v|

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good as-is. One optional thing if you think it sounds like a good idea, and one thing to follow-up on later (the "connection refused" error - maybe file a bug if you don't mind?).

rpcDoneOnClient := make(chan struct{})
ss := &stubserver.StubServer{
FullDuplexCallF: func(stream testpb.TestService_FullDuplexCallServer) error {
close(rpcStartedOnServer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, another option here would be to send a message, and have the client receive it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior of this test is IMO easier to reason about as is, just since there's slightly fewer moving parts and less happening on the wire. So for the purposes of this test, I feel the existing approach is simpler so I'll keep as is.

@apolcyn
Copy link
Contributor Author

apolcyn commented Apr 13, 2021

This is good as-is. One optional thing if you think it sounds like a good idea, and one thing to follow-up on later (the "connection refused" error - maybe file a bug if you don't mind?).

Thanks for the review. I filed #4338 about the "connection refused" error. Otherwise, if this PR looks good, can you please merge? I don't have write access.

@dfawley
Copy link
Member

dfawley commented Apr 13, 2021

Otherwise, if this PR looks good, can you please merge? I don't have write access.

No problem, thank you for the PR!

@dfawley dfawley changed the title Propagate errors causing connection close to RPC statuses client: propagate connection error causes to RPC statuses Apr 13, 2021
@dfawley dfawley merged commit c229922 into grpc:master Apr 13, 2021
apolcyn added a commit to apolcyn/grpc-go that referenced this pull request Apr 14, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants