
Tunnel auth clients appear to become stuck in bad state on restart #9655

Closed
fspmarshall opened this issue Jan 5, 2022 · 3 comments · Fixed by #9656
@fspmarshall (Contributor)
While investigating the high failure rate of the TwoClustersTunnel test, rj discovered that this call to GetNodes appears to block nearly indefinitely. Upon further investigation, we found that this occurs when the cache is unhealthy and the call to GetNodes is forwarded to the leaf cluster's auth server. The test could be "fixed" by applying a very short (<=5s) timeout here. This solution can't work in production, since real GetNodes calls can take quite a while in very large clusters.

Our working theory is that the gRPC client is blocking on the old unhealthy tunnel connection instead of erroring out and eventually receiving a new healthy tunnel connection. The dialer used by the gRPC client is here, which is probably where an investigation ought to begin. Ideally, we want the gRPC client to error out and re-dial as soon as possible after the leaf cluster is restarted.

@rosstimothy (Contributor)

This appears to be caused by a bug in gRPC. We are currently using a version of grpc-go that was released on Apr 23, 2020:

google.golang.org/grpc v1.29.1

https://github.com/grpc/grpc-go/releases/tag/v1.29.1

Running the test with gRPC logging enabled, it seems that when the clusters are restarted, the channel connectivity state immediately transitions into TRANSIENT_FAILURE any time a connection attempt is made to the other cluster. This behavior repeats until the test times out.

INFO: 2022/01/06 09:22:25 Subchannel Connectivity change to CONNECTING
INFO: 2022/01/06 09:22:25 Subchannel picks a new address "teleport.cluster.local" to connect
INFO: 2022/01/06 09:22:25 pickfirstBalancer: HandleSubConnStateChange: 0xc0002e38a0, {CONNECTING <nil>}
INFO: 2022/01/06 09:22:25 Channel Connectivity change to CONNECTING
2022-01-06T09:22:25-05:00 DEBU [CLIENT]    Client  is connecting to auth server on cluster "site-A". client/client.go:840
WARNING: 2022/01/06 09:22:25 grpc: addrConn.createTransport failed to connect to {teleport.cluster.local  <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing failed to dial: read tcp 127.0.0.1:60898->127.0.0.1:24996: use of closed network connection". Reconnecting...
WARNING: 2022/01/06 09:22:25 grpc: addrConn.createTransport failed to connect to {teleport.cluster.local  <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing failed to dial: read tcp 127.0.0.1:60898->127.0.0.1:24996: use of closed network connection". Reconnecting...
INFO: 2022/01/06 09:22:25 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2022/01/06 09:22:25 pickfirstBalancer: HandleSubConnStateChange: 0xc0002e38a0, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing failed to dial: read tcp 127.0.0.1:60898->127.0.0.1:24996: use of closed network connection"}
INFO: 2022/01/06 09:22:25 Channel Connectivity change to TRANSIENT_FAILURE

I created a branch that updates grpc-go to v1.43.0, and the issue is no longer present. I ran the integration tests via CI for several days and the TwoClustersTunnel tests have not failed once.

Additionally, the gRPC logs no longer show a transition into the TRANSIENT_FAILURE state. Instead the channel transitions to the IDLE state and shortly after is able to connect successfully.
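Assuming a standard Go module setup, the dependency bump described above amounts to:

```shell
go get google.golang.org/grpc@v1.43.0
go mod tidy
```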

@espadolini (Contributor)

After testing TwoClustersTunnel repeatedly with multiple versions of grpc-go, it seems that the bug was fixed in https://github.com/grpc/grpc-go/releases/tag/v1.41.0, so the likely culprit is "client: fix transparent retries when per-RPC credentials are in use" (grpc/grpc-go#4785).

rosstimothy added a commit that referenced this issue Jan 10, 2022
Update grpc dependency to the latest version. Needed to fix the client side hang that
prevents TwoClustersTunnel from running successfully, see #9655.
@ibeckermayer (Contributor)

I just encountered an error related to this:

    integration_test.go:1950:
        	Error Trace:	integration_test.go:1950
        	            				integration_test.go:1775
        	Error:      	Condition never satisfied
        	Test:       	TestIntegrations/TwoClustersTunnel/node
        	Messages:   	Failed to find 3 events on helpers.Site A after 5s

https://console.cloud.google.com/cloud-build/builds/3be16660-ed72-4c9f-8e13-73e8dc0e811e?project=ci-account
