Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the subConn state is stuck in CONNECTING if connection is lose after preface is received #5688

Closed
fuweid opened this issue Oct 5, 2022 · 0 comments · Fixed by #5714
Closed
Assignees

Comments

@fuweid
Copy link
Contributor

fuweid commented Oct 5, 2022

What version of gRPC are you using?

v1.47.0

What version of Go are you using (go version)?

go1.19.1

What operating system (Linux, Windows, …) and version?

➜  ~ uname -a
Linux wenchang 5.19.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.19.6-1 (2022-09-01) x86_64 GNU/Linux

What did you do?

The issue is shown in etcd-io/etcd#14487, which is to try to verify the gRPC reconnect functionality for the ETCD lease client's cache.

I create a repo https://github.com/fuweid/etcd-issue-14487 to reproduce the issue, which is more easier to understand.
In the testcase, there is bridge server as proxy between gRPC client and gRPC server.

gRPC client  <-- (connC) --> bridge(as proxy) <-- (connS) --> gRPC server

When gRPC client is filing unaryCall, the other goroutine keeps disconnecting the connection in a loop until the unaryCall succeeds.
gRPC client will connect to server only if the subConn is idle. If the transport.NewClientTransport return nil and the preface is received and then connection is closed, the (ac *addrConn) createTransport will have a chance to reach the select branch case <-connClosed.Done().

// https://github.com/grpc/grpc-go/blob/v1.47.0/clientconn.go#L1323

// createTransport creates a connection to addr. It returns an error if the
// address was not successfully connected, or updates ac appropriately with the
// new transport.
func (ac *addrConn) createTransport(addr resolver.Address, copts transport.ConnectOptions, connectDeadline time.Time) error {
        ...
	case <-connClosed.Done():
		// The transport has already closed.  If we received the preface, too,
		// this is not an error.
		select {
		case <-prefaceReceived.Done():
			return nil
		default:
			return errors.New("connection closed before server preface received")
		}
	}
}

If so, the subConn will be stuck in CONNECTING. And if there is only one subConn, the gRPC client will be stuck in pick state.
It is about timing. In order to make it easier, I add the time.Sleep after transport.NewClientTransport in fuweid/etcd-issue-14487@a8c3e6f.

 //
 // Notice: This API is EXPERIMENTAL and may be changed or removed in a
 // later release.
@@ -1276,6 +1276,8 @@ func (ac *addrConn) createTransport(addr resolver.Address, copts transport.Conne
                return err
        }

+       time.Sleep(10 * time.Millisecond)
+
        select {
        case <-connectCtx.Done():
                // We didn't get the preface in time.
@@ -1325,6 +1327,7 @@ func (ac *addrConn) createTransport(addr resolver.Address, copts transport.Conne
                // this is not an error.
                select {
                case <-prefaceReceived.Done():
+                       channelz.Warningf(logger, ac.channelzID, "grpc: addrConn.createTransport connClosed and prefaceReceived: oops")
                        return nil
                default:
                        return errors.New("connection closed before server preface received")

The reproduce steps are

cd /tmp
git clone https://github.com/fuweid/etcd-issue-14487.git
cd etcd-issue-14487
go test --cpu=4 -p=2 -v -count=1 --timeout=3m --race=false ./

What did you expect to see?

The gRPC client should handle reconnect well.

What did you see instead?

The test will show the following log

2022/10/05 14:36:02 WARNING: [core] [Channel #2 SubChannel #4] grpc: addrConn.createTransport connClosed and prefaceReceived: oops`

And timeout

goroutine 6 [select, 2 minutes]:
google.golang.org/grpc.(*pickerWrapper).pick(0xc0001aebd0, {0x9f9e20, 0xc00017e120}, 0x1, {{0x941473?, 0x0?}, {0x9f9e20?, 0xc00017e120?}})
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/picker_wrapper.go:103 +0x215
google.golang.org/grpc.(*ClientConn).getTransport(0x41bec6?, {0x9f9e20?, 0xc00017e120?}, 0x5f?, {0x941473?, 0x941473?})
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/clientconn.go:929 +0x35
google.golang.org/grpc.(*csAttempt).getTransport(0xc000238000)
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/stream.go:430 +0x48
google.golang.org/grpc.newClientStreamWithParams.func2(0xc000238000)
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/stream.go:315 +0x2b
google.golang.org/grpc.(*clientStream).withRetry(0xc000236000, 0xc0002240e0, 0xc00012fb18)
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/stream.go:709 +0xd3
google.golang.org/grpc.newClientStreamWithParams({0x9f9db0, 0xc00002a0f8}, 0xd051a0, 0xc000117c00, {0x941473, 0x1c}, {0x0, 0x0, 0x0, 0x0, ...}, ...)
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/stream.go:327 +0xa0e
google.golang.org/grpc.newClientStream.func2({0x9f9db0?, 0xc00002a0f8?}, 0xc00002a0f8?)
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/stream.go:192 +0x9f
google.golang.org/grpc.newClientStream({0x9f9db0, 0xc00002a0f8}, 0xd051a0, 0xc000117c00, {0x941473, 0x1c}, {0x0, 0x0, 0x0})
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/stream.go:220 +0x4c2
google.golang.org/grpc.invoke({0x9f9db0?, 0xc00002a0f8?}, {0x941473?, 0x30?}, {0x8d7160, 0xc000232000}, {0x8d70a0, 0xc000232040}, 0x0?, {0x0, ...})
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/call.go:66 +0x7d
google.golang.org/grpc.(*ClientConn).Invoke(0x7f23ce519a68?, {0x9f9db0?, 0xc00002a0f8?}, {0x941473?, 0x0?}, {0x8d7160?, 0xc000232000?}, {0x8d70a0?, 0xc000232040?}, {0x0, ...})
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/call.go:37 +0x265
google.golang.org/grpc/examples/helloworld/helloworld.(*greeterClient).SayHello(0xc000171f08, {0x9f9db0, 0xc00002a0f8}, 0xf?, {0x0, 0x0, 0x0})
        /home/fuwei/workspace/etcd-issue-14487/vendor/google.golang.org/grpc/examples/helloworld/helloworld/helloworld_grpc.pb.go:53 +0xce
github.com/fuweid/etcd-issue-14487.TestETCDIssue11487(0xc00012d040)
        /home/fuwei/workspace/etcd-issue-14487/main_test.go:66 +0x478
testing.tRunner(0xc00012d040, 0x970d20)
        /usr/local/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:1493 +0x35f

I think the select branch case <-connClosed.Done() should be like case <-prefaceReceived.Done(): branch which turns the state into IDLE if the connClosed.HasFired().

fuweid added a commit to fuweid/grpc-go that referenced this issue Oct 12, 2022
The select branch will be selected randomly, if there are several ready
branches. If the preface has been received and then connection is
closed, the `createTransport` might hit the `connClosed.Done()` branch.
Ideally, the subConn should go idle and reconnect.

Fixes: grpc#5688

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/grpc-go that referenced this issue Oct 14, 2022
The select branch will be selected randomly, if there are several ready
branches. If the preface has been received and then connection is
closed, the `createTransport` might hit the `connClosed.Done()` branch.
Ideally, the subConn should go idle and reconnect.

Fixes: grpc#5688

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@dfawley dfawley assigned fuweid and unassigned dfawley Oct 18, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants