
Tunnel auth clients appear to become stuck in bad state on restart #9655

Closed
fspmarshall opened this issue Jan 5, 2022 · 3 comments · Fixed by #9656
@fspmarshall (Contributor)
While investigating the high failure rate of the TwoClustersTunnel test, rj discovered that this call to GetNodes appears to block nearly indefinitely. Upon further investigation, we found that this occurs when the cache is unhealthy and the call to GetNodes is forwarded to the leaf cluster's auth server. The test could be "fixed" by applying a very short (<=5s) timeout here. This solution can't work in production, since real GetNodes calls can take quite a while in very large clusters.

Our working theory is that the gRPC client is blocking on the old unhealthy tunnel connection instead of erroring out and eventually receiving a new healthy tunnel connection. The dialer used by the gRPC client is here, which is probably where an investigation ought to begin. Ideally, we want the gRPC client to error out and re-dial as soon as possible after the leaf cluster is restarted.

@rosstimothy (Contributor)

This appears to be caused by a bug in gRPC. We are currently using a version of grpc-go that was released on Apr 23, 2020:

google.golang.org/grpc v1.29.1

https://github.com/grpc/grpc-go/releases/tag/v1.29.1

Running the test with gRPC logging enabled, it seems that when the clusters are restarted, the channel connectivity state immediately transitions into TRANSIENT_FAILURE any time a connection attempt is made to the other cluster. This behavior repeats until the test times out.

INFO: 2022/01/06 09:22:25 Subchannel Connectivity change to CONNECTING
INFO: 2022/01/06 09:22:25 Subchannel picks a new address "teleport.cluster.local" to connect
INFO: 2022/01/06 09:22:25 pickfirstBalancer: HandleSubConnStateChange: 0xc0002e38a0, {CONNECTING <nil>}
INFO: 2022/01/06 09:22:25 Channel Connectivity change to CONNECTING
2022-01-06T09:22:25-05:00 DEBU [CLIENT]    Client  is connecting to auth server on cluster "site-A". client/client.go:840
WARNING: 2022/01/06 09:22:25 grpc: addrConn.createTransport failed to connect to {teleport.cluster.local  <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing failed to dial: read tcp 127.0.0.1:60898->127.0.0.1:24996: use of closed network connection". Reconnecting...
WARNING: 2022/01/06 09:22:25 grpc: addrConn.createTransport failed to connect to {teleport.cluster.local  <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing failed to dial: read tcp 127.0.0.1:60898->127.0.0.1:24996: use of closed network connection". Reconnecting...
INFO: 2022/01/06 09:22:25 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2022/01/06 09:22:25 pickfirstBalancer: HandleSubConnStateChange: 0xc0002e38a0, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing failed to dial: read tcp 127.0.0.1:60898->127.0.0.1:24996: use of closed network connection"}
INFO: 2022/01/06 09:22:25 Channel Connectivity change to TRANSIENT_FAILURE

I created a branch that updates grpc-go to v1.43.0, and the issue is no longer present. I ran the integration tests via CI for several days and the TwoClustersTunnel tests have not failed once.

Additionally, the gRPC logs no longer show a transition into the TRANSIENT_FAILURE state. Instead the channel transitions to the IDLE state and shortly after is able to connect successfully.
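Assuming a standard Go module setup, the dependency bump described above amounts to:

```shell
go get google.golang.org/grpc@v1.43.0
go mod tidy
```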

@espadolini (Contributor)

After testing TwoClustersTunnel repeatedly with multiple versions of grpc-go, it seems that the bug was fixed in https://github.com/grpc/grpc-go/releases/tag/v1.41.0, so the likely culprit is "client: fix transparent retries when per-RPC credentials are in use" (grpc/grpc-go#4785).

rosstimothy added a commit that referenced this issue Jan 10, 2022
Update grpc dependency to the latest version. Needed to fix the client side hang that
prevents TwoClustersTunnel from running successfully, see #9655.
@ibeckermayer (Contributor)

I just encountered an error related to this:

    integration_test.go:1950:
        	Error Trace:	integration_test.go:1950
        	            				integration_test.go:1775
        	Error:      	Condition never satisfied
        	Test:       	TestIntegrations/TwoClustersTunnel/node
        	Messages:   	Failed to find 3 events on helpers.Site A after 5s

https://console.cloud.google.com/cloud-build/builds/3be16660-ed72-4c9f-8e13-73e8dc0e811e?project=ci-account
