Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client tries to connect to remote loadbalancer using server name of the target #3217

Closed
krzysztofdrys opened this issue Nov 29, 2019 · 3 comments

Comments

@krzysztofdrys
Copy link
Contributor

krzysztofdrys commented Nov 29, 2019

After updating grpc version from 1.24.0 to 1.25.0 (and 1.25.11) we have a problem with connecting to the external load balancer.

I am labeling this as a question, not issue because in the past it turned out that we have been using grpclb feature the wrong way (#2838). All in all, my goal is to understand whether we are doing something wrong, or is it a regression in grpc.

Here is an example error from our logs. This is when a client tries to connect to service-xxx using an external load balancer at service-grpclb:5000. It showed up after updating to the the newest grpc:

WARNING: 2019/11/29 11:54:30 grpc: addrConn.createTransport failed to connect to {service-grpclb:5000 0 service-grpclb <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for service-grpclb, not service-xxx". Reconnecting..

Looking at the logs grpc@v1.25.0, I understand that client application cannot connect to service-grcplb, because it assumes it's server name should be service-xxx. What is strange is that one part of the log says that server name should be service-grpclb ({service-grpclb:5000 0 service-grpclb <nil>}), while other says it should be service-xxx (certificate is valid for service-grpclb, not service-xxx").

I was trying to debug line, where this warning is emitted (https://github.com/grpc/grpc-go/blob/master/clientconn.go#L1285):

newTr, err := transport.NewClientTransport(connectCtx, ac.cc.ctx, target, copts, onPrefaceReceipt, onGoAway, onClose)
	if err != nil {
		// newTr is either nil, or closed.
		grpclog.Warningf("grpc: addrConn.createTransport failed to connect to %v. Err :%v. Reconnecting...", addr, err)
		return nil, nil, err
	}

In this line target.Authority is service-grpclb, but copts.TransportCredentials.config.ServerName is service-xxx. It seems to me that NewClientTransport uses ServerName from copts rather than target.

I tried to recreate this using a minimal example. Consider the following setup:

  • service-xxx, listening on localhost:6000,
  • service-grpclb listening on localhost:5000. This is mock service. When it asked to resolve address name:port, it will answer with locahost:port,
  • client trying to connect to service-xxx using service-grpclb as a load balancer. Client is defined here: https://gist.github.com/krzysztofdrys/fa7df8352e667decf0449311fcaf2481 .

When I run the client with grpc@v1.25.0, I get the following log:

INFO: 2019/11/29 12:00:40 ccResolverWrapper: sending update to cc: {[{localhost:5000 1 service-grpclb <nil>}] <nil>}
INFO: 2019/11/29 12:00:40 ClientConn switching balancer to "grpclb"
INFO: 2019/11/29 12:00:40 lbBalancer: UpdateClientConnState: {ResolverState:{Addresses:[{Addr:localhost:5000 Type:1 ServerName:service-grpclb Metadata:<nil>}] ServiceConfig:<nil>} BalancerConfig:<nil>}
INFO: 2019/11/29 12:00:40 ccResolverWrapper: sending update to cc: {[{localhost:5000 0 service-grpclb <nil>}] <nil>}
INFO: 2019/11/29 12:00:40 ClientConn switching balancer to "pick_first"
INFO: 2019/11/29 12:00:40 blockingPicker: the picked transport is not ready, loop back to repick
INFO: 2019/11/29 12:00:40 pickfirstBalancer: HandleSubConnStateChange: 0xc00019c070, CONNECTING
INFO: 2019/11/29 12:00:40 CPU time info is unavailable on non-linux or appengine environment.
WARNING: 2019/11/29 12:00:40 grpc: addrConn.createTransport failed to connect to {localhost:5000 0 service-grpclb <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for service-grpclb.api-development.ingrid.com, service-grpclb, localhost, not service-xxx". Reconnecting...
WARNING: 2019/11/29 12:00:40 grpc: addrConn.createTransport failed to connect to {localhost:5000 0 service-grpclb <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for service-grpclb.api-development.ingrid.com, service-grpclb, localhost, not service-xxx". Reconnecting...
INFO: 2019/11/29 12:00:40 pickfirstBalancer: HandleSubConnStateChange: 0xc00019c070, TRANSIENT_FAILURE
INFO: 2019/11/29 12:00:41 pickfirstBalancer: HandleSubConnStateChange: 0xc00019c070, CONNECTING
WARNING: 2019/11/29 12:00:41 grpc: addrConn.createTransport failed to connect to {localhost:5000 0 service-grpclb <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for service-grpclb.api-development.ingrid.com, service-grpclb, localhost, not service-xxx". Reconnecting...
WARNING: 2019/11/29 12:00:41 grpc: addrConn.createTransport failed to connect to {localhost:5000 0 service-grpclb <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for service-grpclb.api-development.ingrid.com, service-grpclb, localhost, not service-xxx". Reconnecting...

When I change grpc to v1.24.0, I get the following log:

INFO: 2019/11/29 12:07:55 parsed scheme: "grpclb"
INFO: 2019/11/29 12:07:55 ccResolverWrapper: sending update to cc: {[{localhost:5000 1 service-grpclb <nil>}] <nil>}
INFO: 2019/11/29 12:07:55 ClientConn switching balancer to "grpclb"
INFO: 2019/11/29 12:07:55 lbBalancer: UpdateClientConnState: {ResolverState:{Addresses:[{Addr:localhost:5000 Type:1 ServerName:service-grpclb Metadata:<nil>}] ServiceConfig:<nil>} BalancerConfig:<nil>}
INFO: 2019/11/29 12:07:55 ccResolverWrapper: sending update to cc: {[{localhost:5000 0 service-grpclb <nil>}] <nil>}
INFO: 2019/11/29 12:07:55 pickfirstBalancer: HandleSubConnStateChange: 0xc0000dd120, CONNECTING
INFO: 2019/11/29 12:07:55 CPU time info is unavailable on non-linux or appengine environment.
INFO: 2019/11/29 12:07:55 pickfirstBalancer: HandleSubConnStateChange: 0xc0000dd120, READY
INFO: 2019/11/29 12:07:55 lbBalancer: processing server list: servers:<ip_address:"\177\000\000\001" port:6060 load_balance_token:"lb-service-xxx-127.0.0.1-1575025675669496000" > 
INFO: 2019/11/29 12:07:55 lbBalancer: server list entry[0]: ipStr:|127.0.0.1|, port:|6060|, load balancer token:|lb-service-xxx-127.0.0.1-1575025675669496000|
INFO: 2019/11/29 12:07:55 lbBalancer: handle SubConn state change: 0xc000020490, CONNECTING
INFO: 2019/11/29 12:07:55 lbBalancer: handle SubConn state change: 0xc000020490, READY
2019/11/29 12:08:05 Waiting for the connection
2019/11/29 12:08:05 Connected

My questions are:

  • Is it a bug in grpc? At least for us this is a change in behaviour.
  • If this is not a bug, this probably mean that we were using grpc in a wrong way. Could you guide me, what I am doing wrong?
@menghanl
Copy link
Contributor

menghanl commented Dec 4, 2019

For this specific problem, it seems the root cause is how the TLS credentials is created

creds, err := credentials.NewClientTLSFromFile(os.Getenv("CERT_PATH"), "service-xxx")

This sets "service-xxx" as serverNameOverride. The handshake will use "service-xxx" (instead of the real server name) to verify hostname on the returned certificates. (Note that you should never override server name in production code.)
When grpclb client tries to connect to grpclb server, server will send certificates valid for service-grpclb, but client handshake will verify it with service-xxx, and error.

Please try to create the credentials without overriding the name (leaving it as an empty string), and see if that works.


There was a behavior change in gRPC (#3073). The new behavior is, TLS handshake will use ServerName field from address if it's not empty.
But I failed to see how it affects your client, since you are setting serverNameOverride.

It also doesn't make sense to me how v1.24.0 works for you. The TLS handshake should fail with the same verification error when connecting to grpclb servers.

@menghanl
Copy link
Contributor

menghanl commented Dec 4, 2019

Ah, the behavior change that broke you is actually this: https://github.com/grpc/grpc-go/pull/3119/files#diff-887fcbbbde408218ce725b7de320c30bL332 (Sorry for the breakage)

Before, the grpclb client will always override the serverName in TLS to service-grpclb, so the TLS verification won't error. But that code is no longer necessary, because we have #3073 now.

I still believe if you don't set serverNameOverride where TLS credentials is created, everything will work.

Without serverNameOverride, TLS will use dial target or address.ServerName for hostname verification

  • connection to service backends will use "service-xxx", because of grpc.Dial("service-xxx")
  • connection to grpclb server will use "service-grpclb" because the name resolver returns Address{Addr: ..., ServerName: "service-grpclb", Type: resolver.GRPCLB}

@stale
Copy link

stale bot commented Dec 12, 2019

This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@stale stale bot added the stale label Dec 12, 2019
@stale stale bot closed this as completed Dec 20, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants