client: propagate connection error causes to RPC statuses #4311

apolcyn · 2021-04-01T00:41:38Z

Motivated mainly to debug internal issue: b/182572215

This PR is a replacement for a subset of the behavior added in #4190. In particular, note that this follows the first idea described in the design in #4163 (comment) about passing connection close errors to the transport's Close method. However, this PR doesn't try to propagate errors up further to the LB policy or ClientConn. My thinking is that the error propagation added here (for RPCs that have picked connections) is still useful on its own, and we can save further plumbing for followup changes, but please let me know.
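To illustrate the idea of handing the close reason to the transport's Close method, here is a minimal, stdlib-only sketch. The `transport` and `stream` types and the `errEOF` value are hypothetical stand-ins, not the real grpc-go types; the only point is that the error passed to Close ends up wrapped in the error each in-flight RPC sees.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errEOF is a hypothetical close reason used only for this illustration.
var errEOF = errors.New("error reading from server: EOF")

// stream is a hypothetical stand-in for an in-flight RPC on a connection.
type stream struct {
	done chan struct{}
	err  error // written before done is closed, read only after
}

// transport is a hypothetical, heavily simplified client transport.
type transport struct {
	mu      sync.Mutex
	closed  bool
	streams []*stream
}

// newStream registers an RPC on the transport.
func (tr *transport) newStream() *stream {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	s := &stream{done: make(chan struct{})}
	tr.streams = append(tr.streams, s)
	return s
}

// Close shuts the transport down and fails every active stream with an
// error that wraps the reason the connection was closed.
func (tr *transport) Close(cause error) {
	tr.mu.Lock()
	if tr.closed {
		tr.mu.Unlock()
		return
	}
	tr.closed = true
	streams := tr.streams
	tr.streams = nil
	tr.mu.Unlock()

	for _, s := range streams {
		s.err = fmt.Errorf("connection error: %w", cause)
		close(s.done)
	}
}

// wait blocks until the stream finishes and returns its final error.
func (s *stream) wait() error {
	<-s.done
	return s.err
}

func main() {
	tr := &transport{}
	s := tr.newStream()
	tr.Close(errEOF)
	fmt.Println(s.wait())
}
```

In this sketch a nil cause would produce an uninformative message, which is one reason a caller would want to always pass a concrete error to Close.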

dfawley

Thanks for picking this up. Looks great, just a few minor comments.

clientconn.go

internal/transport/http2_client.go

dfawley · 2021-04-09T22:44:45Z

test/end2end_test.go

+	tc := testpb.NewTestServiceClient(cc)
+	ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
+	defer cancel()
+	possibleConnResetMsg := "connection reset by peer"


When and how would this error occur?

Good question. I think I actually left this in from while I was developing the test, before I had the rpcStartedOnServer channel to wait for the server handler to receive the RPC (which prevents us from stopping the server before headers have been sent from client to server). When we remove that channel and the synchronization it provides, stopping the server can easily race with the client's attempt to send RPC headers to the server, and we frequently see RST packets sent from server to client in this case (I guess that if a socket closes while there is still unprocessed data in the socket buffer, the kernel generates an RST to send back to the peer). In this case, the error message will be "connection reset by peer".

That said, I think that the "synchronization" provided by the rpcStartedOnServer channel is fundamentally brittle, because AFAICT the client can still e.g. send a BDP ping to the server; i.e., this channel doesn't give us an actual guarantee that no more data will be sent from the client to the server, so I think the test is actually more robust if we remove this channel entirely.
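Since either outcome of the race is legitimate, a check has to tolerate both. The sketch below is a stdlib-only illustration of that idea; `isExpectedCloseError` and `samples` are hypothetical names, and the two substrings are the ones that appear in the test output quoted in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Substrings for the two acceptable outcomes of the race.
const (
	possibleConnResetMsg = "connection reset by peer"
	possibleEOFMsg       = "error reading from server: EOF"
)

// samples holds made-up errors used only to exercise the check.
var samples = []error{
	errors.New("read tcp 127.0.0.1:1234->127.0.0.1:5678: read: connection reset by peer"),
	errors.New("connection error: error reading from server: EOF"),
	errors.New("dial tcp 127.0.0.1:37309: connect: connection refused"),
}

// isExpectedCloseError reports whether err matches either acceptable
// outcome: the kernel sent an RST, or the client read a clean EOF.
func isExpectedCloseError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, possibleConnResetMsg) || strings.Contains(msg, possibleEOFMsg)
}

func main() {
	for _, err := range samples {
		fmt.Printf("%v -> %t\n", err, isExpectedCloseError(err))
	}
}
```

Matching on substrings is fragile in general; it is used here only because the test under discussion does the same.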

test/end2end_test.go

error

dfawley · 2021-04-13T16:26:56Z

test/end2end_test.go

+	// Use WithBlock() to guarantee that the RPC will be able to pick a healthy connection to
+	// go out on before we call Stop on the server.


The default "wait for ready" behavior of RPCs should be fine for this, too. Are you sure this is necessary (and why)?

I added this in because I was still looking for a way to make sure that RPCs failed strictly after the client conn had found a healthy TCP connection for them to go out on; I was seeing some flakes where the RPC failed with:

end2end_test.go:1372: &{0xc0000ba360}.Recv() = _, rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:37309: connect: connection refused", want _, rpc error containing substring: |connection reset by peer| OR |error reading from server: EOF|

i.e., in some test flakes, the call to ss.S.Stop() caused server shutdown apparently before the RPC picked a connection.

I dealt with this instead by just adding back the rpcStartedOnServer channel.

BTW, AFAICS RPCs aren't actually using the WaitForReady call option, when using the stub server, right? I do see

grpc-go/internal/stubserver/stubserver.go

Line 120 in 950ddd3

if err := waitForReady(cc); err != nil {

in the stub server, but it looks like that basically accomplishes the same thing as WithBlock, rather than the WaitForReady call option.

BTW, AFAICS RPCs aren't actually using the WaitForReady call option, when using the stub server, right?

Oops.. "wait for ready" is not what I meant - the default is not WaitForReady. But by default, RPCs should wait for the channel to go from connecting->ready (or transient failure, which is not expected in tests).

I'm not sure why that waitForReady call is in the stub server; we should probably remove it and do it in the tests that need it for whatever reason.

But since you are using ss.Start which does that waitForReady thing, I'm not sure why you'd be getting that RPC error with "connection refused". That seems like something we should look into.

dfawley · 2021-04-13T16:28:30Z

test/end2end_test.go

+	ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
+	defer cancel()
+	// The precise behavior of this test is subject to raciness around the timing of when TCP packets
+	// are sent from client to server, and when we tell the server to stop, so we need to account for both


dfawley · 2021-04-13T16:30:27Z

test/end2end_test.go

+	}
+	ss.S.Stop()
+	if _, err := stream.Recv(); err == nil || (!strings.Contains(err.Error(), possibleConnResetMsg) && !strings.Contains(err.Error(), possibleEOFMsg)) {
+		t.Fatalf("%v.Recv() = _, %v, want _, rpc error containing substring: |%v| OR |%v|", stream, err, possibleConnResetMsg, possibleEOFMsg)


Nit: %q instead of |%v|
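A small sketch of what the nit buys, using only the standard `fmt` package. `describeWant` is a hypothetical helper; `%q` quotes and escapes each string, so empty values and stray whitespace stay visible without hand-written delimiters.

```go
package main

import "fmt"

// describeWant formats the two acceptable substrings with %q so that each
// one is quoted and escaped by fmt rather than wrapped in ad hoc delimiters.
func describeWant(a, b string) string {
	return fmt.Sprintf("want rpc error containing substring: %q OR %q", a, b)
}

func main() {
	fmt.Println(describeWant("connection reset by peer", "error reading from server: EOF"))
	// Empty strings and surrounding whitespace remain unambiguous.
	fmt.Println(describeWant("", " trailing space "))
}
```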

dfawley

This is good as-is. One optional thing if you think it sounds like a good idea, and one thing to follow-up on later (the "connection refused" error - maybe file a bug if you don't mind?).

dfawley · 2021-04-13T18:27:36Z

test/end2end_test.go

 	rpcDoneOnClient := make(chan struct{})
 	ss := &stubserver.StubServer{
 		FullDuplexCallF: func(stream testpb.TestService_FullDuplexCallServer) error {
+			close(rpcStartedOnServer)


FWIW, another option here would be to send a message, and have the client receive it.

The behavior of this test is IMO easier to reason about as is, since there are slightly fewer moving parts and less happening on the wire. So for the purposes of this test, I feel the existing approach is simpler, and I'll keep it as is.
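For readers unfamiliar with the pattern, the channel-based ordering can be sketched without any gRPC machinery. This is an assumption-laden illustration: `runScenario` is hypothetical, the goroutine stands in for the server handler, and the handler blocking on rpcDoneOnClient is assumed rather than shown in the diff above.

```go
package main

import "fmt"

// runScenario simulates the ordering the two channels enforce and returns
// the sequence of events as observed by the "test" goroutine.
func runScenario() []string {
	var events []string
	rpcStartedOnServer := make(chan struct{})
	rpcDoneOnClient := make(chan struct{})
	handlerExited := make(chan struct{})

	// Stand-in for the server-side stream handler.
	go func() {
		defer close(handlerExited)
		close(rpcStartedOnServer) // signal: the RPC has reached the server
		<-rpcDoneOnClient         // keep the RPC open until the client is done
	}()

	// The test body: stop the server only after the handler has started.
	<-rpcStartedOnServer
	events = append(events, "rpc started on server")
	events = append(events, "server stopped")        // where the test would call ss.S.Stop()
	events = append(events, "client observed error") // where the test would call stream.Recv()
	close(rpcDoneOnClient)
	<-handlerExited
	events = append(events, "handler exited")
	return events
}

func main() {
	for _, e := range runScenario() {
		fmt.Println(e)
	}
}
```

Closing a channel rather than sending on it means the signal never blocks the handler and can be observed by any number of waiters.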

apolcyn · 2021-04-13T19:24:11Z

This is good as-is. One optional thing if you think it sounds like a good idea, and one thing to follow-up on later (the "connection refused" error - maybe file a bug if you don't mind?).

Thanks for the review. I filed #4338 about the "connection refused" error. Otherwise, if this PR looks good, can you please merge? I don't have write access.

dfawley · 2021-04-13T20:05:28Z

Otherwise, if this PR looks good, can you please merge? I don't have write access.

No problem, thank you for the PR!

apolcyn added 3 commits March 31, 2021 17:39

Propagate errors causing connection close to RPC statuses

ba8e17b

fix transport tests

0640203

Make test deterministic

38da1fd

apolcyn marked this pull request as ready for review April 1, 2021 21:18

apolcyn mentioned this pull request Apr 2, 2021

client: include details about GOAWAYs in status messages #4316

Merged

dfawley self-requested a review April 8, 2021 20:39

dfawley self-assigned this Apr 8, 2021

dfawley added the Type: Feature label Apr 8, 2021

dfawley added this to the 1.38 Release milestone Apr 8, 2021

dfawley requested changes Apr 9, 2021


dfawley assigned apolcyn and unassigned dfawley Apr 9, 2021

apolcyn added 7 commits April 12, 2021 13:21

Address most review comments; TODO: find specific reason for ECONNRST

05425bc

error

don't pass nil error to Close

2096b3c

Don't pass nil to Close in keepalive test

29e6854

fix keepalive test

622a2be

Fix keepalive test

05ced4d

improve comment and test

83679d0

use WithBlock

bb3a1ee

dfawley reviewed Apr 13, 2021


apolcyn added 2 commits April 13, 2021 10:25

Revert use of WithBlock, go back to rpcStartedOnServer channel

1c4fb15

Address comments

6db7094

dfawley approved these changes Apr 13, 2021


apolcyn mentioned this pull request Apr 13, 2021

Investigate cause of intermittent "connection refused" errors in TestDetailedConnectionCloseErrorPropagatesToRpcError #4338

Closed

dfawley changed the title ~~Propagate errors causing connection close to RPC statuses~~ client: propagate connection error causes to RPC statuses Apr 13, 2021

dfawley merged commit c229922 into grpc:master Apr 13, 2021

apolcyn added a commit to apolcyn/grpc-go that referenced this pull request Apr 14, 2021

address comment from grpc#4311 (comment)

9277ca5

dfawley mentioned this pull request May 25, 2021

Surface TLS errors to RPC errors #4163

Closed

jefferai mentioned this pull request Jun 15, 2021

Update grpc, grpc-gateway, and a few other deps hashicorp/boundary#1325

Merged

github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2021
