default client balancer only returns one address #1694

Closed
trusch opened this issue Nov 28, 2017 · 7 comments

@trusch
trusch commented Nov 28, 2017

What version of gRPC are you using?

glide version: ^1.7.2
ref: 5a9f7b4

What version of Go are you using (go version)?

1.9.1

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I created a service in docker swarm which serves gRPC requests with endpoint mode dnsrr (so the DNS returns multiple A records for that service).
Another service inside swarm calls this.
Dialing looks like this:

conn, err := grpc.Dial(target, grpc.WithCredentials(...))
if err != nil {
	return nil, err
}
client := btrfaasgrpc.NewFunctionRunnerClient(conn) // project specific

This client is then reused to serve rpc invocations.

What did you expect to see?

The calls should be dispatched round-robin to all available replicas of the target service out of the box, as documented in the Go docs (round-robin does not need to be registered, because it's the default).

What did you see instead?

Only the first replica is used to serve the requests.

Additional Notes

  • when using WithBalancer(balancer.RoundRobin(resolver.NewDNSResolver())) it gives me an error that no addresses are available

Do I need to set up load balancing manually for the moment?
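
Before looking at the balancer, it can help to confirm that the DNS name really returns several A records from inside the calling container. The following is a minimal sketch using only the Go standard library; the default host `localhost`, the port `50051`, and the helper name `withPort` are assumptions for illustration and do not come from the original report.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// withPort turns resolved IPs into dialable host:port targets.
func withPort(ips []string, port string) []string {
	out := make([]string, 0, len(ips))
	for _, ip := range ips {
		out = append(out, net.JoinHostPort(ip, port))
	}
	return out
}

func main() {
	host := "localhost" // placeholder; pass the swarm service name as the first argument
	if len(os.Args) > 1 {
		host = os.Args[1]
	}
	// LookupHost returns every address the resolver knows for the name.
	ips, err := net.LookupHost(host)
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	for _, target := range withPort(ips, "50051") { // 50051 is an assumed port
		fmt.Println(target)
	}
}
```

If this prints only one address for the service name, the problem is in the DNS setup rather than in the gRPC client.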

@menghanl
Contributor

menghanl commented Nov 28, 2017

Did you look at the new balancer package? Or the v1 balancer in grpc package?

Can you try to use the new balancer and resolver

rr := balancer.Get("round_robin")
grpc.Dial("dns:///your.target.name", // "dns:///" specifies the resolver to use
    grpc.WithCredentials(...),
    grpc.WithBalancerBuilder(rr), // use round_robin balancer
)

Note that WithBalancerBuilder is for testing only. I'm planning to add a dial option to set the balancer (#1697).
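
For readers unfamiliar with what a round-robin balancer does once the resolver has handed it several addresses, here is a conceptual sketch. It is not grpc-go's round_robin implementation; the type `roundRobin`, its methods, and the example addresses are all made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// roundRobin hands out addresses in rotation. It is an illustrative
// stand-in, not grpc-go's round_robin balancer.
type roundRobin struct {
	mu    sync.Mutex
	addrs []string
	next  int
}

// update replaces the address set, for example after a new DNS answer.
func (r *roundRobin) update(addrs []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.addrs = append([]string(nil), addrs...) // copy so callers cannot mutate our state
	r.next = 0
}

// pick returns the next address in rotation, or false when none are known.
func (r *roundRobin) pick() (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.addrs) == 0 {
		return "", false
	}
	addr := r.addrs[r.next]
	r.next = (r.next + 1) % len(r.addrs)
	return addr, true
}

func main() {
	rr := &roundRobin{}
	rr.update([]string{"10.0.0.2:50051", "10.0.0.3:50051", "10.0.0.4:50051"})
	// Four picks over three addresses: the fourth wraps around to the first.
	for i := 0; i < 4; i++ {
		addr, _ := rr.pick()
		fmt.Println(addr)
	}
}
```

The empty case matters here: a picker with no addresses has nothing to return, which is roughly the situation behind a "no addresses are available" error when the resolver has not delivered anything yet.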

@trusch
Author

trusch commented Nov 29, 2017

Thanks for that hint, @menghanl! This seems to work now, but unfortunately it doesn't seem to query the DNS server that often. I saw the requery frequency in another package set to 30 minutes. Is this configurable?

edit:
It was not another package; it is the requery frequency of the DNS resolver, and it doesn't seem to be configurable. I think it would be useful to make this configurable. I know that frequent polling is generally a bad idea, but I'm currently in the comfortable position of building a completely stateless system, and I would not like to introduce something big like etcd or ZooKeeper into my stack just for load balancing.

Perhaps a WithResolveNowInterval(time.Duration) option and an independent goroutine in the ClientConn that calls ResolveNow() in a loop when this option is set?

I could write that if it would help. I think it would be a good, small task to get started ;)

Please let me know if I can help @menghanl
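
The loop proposed above could look roughly like the sketch below. It uses only the standard library: `resolveLoop` is a hypothetical helper, and `resolveNow` is a stand-in callback for a resolver's ResolveNow method. Nothing here is an existing grpc-go option.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// resolveLoop calls resolveNow once per interval until done is closed.
// resolveNow is a hypothetical callback standing in for a resolver's
// ResolveNow method. interval must be positive, or time.NewTicker panics.
func resolveLoop(done <-chan struct{}, interval time.Duration, resolveNow func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			resolveNow()
		}
	}
}

func main() {
	// Stop after roughly 350ms so the example terminates on its own.
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()
	resolveLoop(ctx.Done(), 100*time.Millisecond, func() {
		fmt.Println("re-resolving")
	})
}
```

In a real client this would run in its own goroutine, with the context cancelled when the connection is closed.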

@menghanl
Contributor

The resolve interval is decided by each resolver implementation. There are resolvers that do pushing instead of polling.
So a WithResolveNowInterval(time.Duration) DialOption doesn't look like a good idea IMO.
A possible solution would be to create another DNS resolver with a custom resolve interval, as I mentioned in #1663 (comment).

From your comment in #1388, you mentioned dead connections will still be retried. This can be solved by #1679. The resolver will re-resolve whenever a connection is down. If the dead server was removed in DNS, the re-resolve will notice that and will remove it from ClientConn.

MAX_CONNECTION_AGE plus #1679 would also cause the resolver to re-resolve and discover new servers.

Let me know what you think about this solution.
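
To illustrate the re-resolve step described above, here is a small sketch of how a fresh DNS answer can be compared with the previous one to find servers that were removed or newly added. The function name `diffAddrs` and the addresses are hypothetical; this is not how ClientConn is implemented internally.

```go
package main

import (
	"fmt"
	"sort"
)

// diffAddrs compares the previous address set with a freshly resolved one
// and reports which backends appeared and which disappeared.
func diffAddrs(prev, next []string) (added, removed []string) {
	prevSet := make(map[string]bool, len(prev))
	for _, a := range prev {
		prevSet[a] = true
	}
	nextSet := make(map[string]bool, len(next))
	for _, a := range next {
		if nextSet[a] {
			continue // ignore duplicate records in the new answer
		}
		nextSet[a] = true
		if !prevSet[a] {
			added = append(added, a)
		}
	}
	for a := range prevSet {
		if !nextSet[a] {
			removed = append(removed, a)
		}
	}
	// Sort for stable output; map iteration order is random.
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

func main() {
	prev := []string{"10.0.0.2:50051", "10.0.0.3:50051"}
	next := []string{"10.0.0.3:50051", "10.0.0.4:50051"}
	added, removed := diffAddrs(prev, next)
	fmt.Println("added:", added)
	fmt.Println("removed:", removed)
}
```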

@trusch
Author

trusch commented Dec 1, 2017

I tried the MAX_CONNECTION_AGE plus #1679 approach, but it doesn't trigger the re-resolving. I don't know whether the MAX_CONNECTION_AGE parameter in the server keepalive parameters of the workers is not respected, or whether the resolver is not invoked when the connection closes normally (without error). I could imagine that this happens. I do not even see SubConn state changes in the logs.

When killing one of the worker pods everything works fine and the resolver returns the new address set.

@dfawley
Member

dfawley commented Dec 7, 2017

Have you turned on info logging by importing the glogger package or using the environment variable GRPC_GO_LOG_SEVERITY_LEVEL="INFO"?

If killing the server manually works, however, my guess is MAX_CONNECTION_AGE isn't configured correctly or is not working correctly -- it should kill the connection and appear the same as an error to the client.

@menghanl
Contributor

Does the max age problem still exist? Did you get more logs for this issue?

@trusch
Author

trusch commented Dec 27, 2017

I tried today, but cannot reproduce the issue anymore!

@trusch trusch closed this as completed Dec 27, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018