Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom balancer to always remove subConns #15701

Merged
merged 8 commits into from
Dec 19, 2022

Conversation

kisunji
Copy link
Contributor

@kisunji kisunji commented Dec 6, 2022

Fixes #10603

Description

Remove the need to call UpdateState twice by changing the pick_first balancer's behavior to close existing connections (almost) every time.

Testing & Reproduction steps

No net-new behavior; existing test TestClientConnPool_IntegrationWithGRPCResolver_Rebalance is still passing.

Manually tested with frequent rebalancing

PR Checklist

  • updated test coverage
  • external facing docs updated
  • not a security concern

@kisunji kisunji force-pushed the kisunji/patched-pick-first-balancer branch 4 times, most recently from a79dbc3 to 4b43407 Compare December 7, 2022 17:28
@kisunji kisunji marked this pull request as ready for review December 7, 2022 17:50
@kisunji kisunji requested a review from boxofrad December 7, 2022 17:51
@kisunji kisunji force-pushed the kisunji/patched-pick-first-balancer branch from cc1a028 to eed6393 Compare December 7, 2022 19:52
Copy link
Contributor

@boxofrad boxofrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, @kisunji 🙌🏻

I'm hoping to have the 1.15+ balancer (PR in enterprise repository) ready to merge in the next couple of days. Should we merge yours into main and then have my PR remove it, or somehow just merge yours into the 1.13 and 1.14 release branches?

@kisunji kisunji force-pushed the kisunji/patched-pick-first-balancer branch from b5f20f6 to 0c4d7f5 Compare December 9, 2022 16:58
//
// [1]: https://github.com/grpc/grpc-go/blob/v1.49.x/pickfirst.go

/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just present here for reference, or is it executed?
If the former you could add a build tag to make it more obvious +build reference-only for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's executed unfortunately

@kisunji
Copy link
Contributor Author

kisunji commented Dec 9, 2022

Nice work, @kisunji 🙌🏻

I'm hoping to have the 1.15+ balancer (PR in enterprise repository) ready to merge in the next couple of days. Should we merge yours into main and then have my PR remove it, or somehow just merge yours into the 1.13 and 1.14 release branches?

Thanks! I think I should merge to main first and backport so we have traceability if someone happens to git log -S for it.

@kisunji kisunji requested a review from banks December 9, 2022 17:17
@kisunji kisunji requested a review from mkeeler December 9, 2022 21:20
}

if b.subConn != nil {
b.cc.UpdateAddresses(b.subConn, state.ResolverState.Addresses)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context for reviewers: this line does not close a connection if the active subConn is in the list of state.ResolverState.Addresses. This is the behavior that resolver.go tried to work around by calling UpdateState with a single address to close the connection (unless it happened to be the active one) and again to re-add the rest of the addresses.

@huikang
Copy link
Collaborator

huikang commented Dec 16, 2022

I am new to load balancing of gRPC and having two basic question

  1. Given most of the customized pick_first policy is copied from grpc, do we have a plan to send the improvement to upstream gRPC?
  2. I also noticed the WARN message contains a nil pointer (not sure where it happens)
Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->192.168.1.101:8300: operation was canceled"

will the PR resolve this nil pointer?

@kisunji
Copy link
Contributor Author

kisunji commented Dec 16, 2022

I am new to load balancing of gRPC and having two basic question

  1. Given most of the customized pick_first policy is copied from grpc, do we have a plan to send the improvement to upstream gRPC?
  2. I also noticed the WARN message contains a nil pointer (not sure where it happens)
Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->192.168.1.101:8300: operation was canceled"

will the PR resolve this nil pointer?

  1. It's not an improvement but a patch to get around gRPC behavior; I'm not sure if it's valuable upstream for others
  2. Are you seeing that in this branch? I'm not sure how that error string is formatted but I assume <nil> means that there was no initial connection. Will leave this branch running overnight and see if I encounter it.

@kisunji kisunji force-pushed the kisunji/patched-pick-first-balancer branch from 03a0aa5 to c2b35e7 Compare December 16, 2022 15:12
@vercel
Copy link

vercel bot commented Dec 16, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
consul-ui-staging 🔄 Building (Inspect) Dec 19, 2022 at 3:14PM (UTC)

@huikang
Copy link
Collaborator

huikang commented Dec 16, 2022

Are you seeing that in this branch? I'm not sure how that error string is formatted but I assume means that there was no initial connection. Will leave this branch running overnight and see if I encounter it.

@kisunji , thanks for the explanation.

The error message containing was observed in the main. I will give this branch a try today.

@kisunji
Copy link
Contributor Author

kisunji commented Dec 16, 2022

While testing I encountered rare WARN logs at startup and occasionally if I leave a cluster running overnight. After some debugging I raised a separate issue #15821 which I consider out of scope for now. This PR should eliminate the biggest source of WARN logs but not all.

@david-yu
Copy link
Contributor

@kisunji Could we add labels to backport to 1.14, 1.13 and 1.12?

@kisunji kisunji added backport-inactive/1.12 This release series is no longer active backport-inactive/1.13 This release series is no longer active backport-inactive/1.14 This release series is no longer active labels Dec 16, 2022
@kisunji kisunji force-pushed the kisunji/patched-pick-first-balancer branch from c2b35e7 to aa2d5fd Compare December 19, 2022 15:14
@kisunji kisunji enabled auto-merge (squash) December 19, 2022 15:18
kisunji added a commit that referenced this pull request Dec 19, 2022
The new balancer is a patched version of gRPC's default pick_first balancer
which removes the behavior of preserving the active subconnection if
a list of new addresses contains the currently active address.
kisunji added a commit that referenced this pull request Dec 19, 2022
The new balancer is a patched version of gRPC's default pick_first balancer
which removes the behavior of preserving the active subconnection if
a list of new addresses contains the currently active address.
kisunji pushed a commit that referenced this pull request Dec 19, 2022
…e/1.12.x (#15834)

* Add custom balancer to always remove subConns (#15701)

The new balancer is a patched version of gRPC's default pick_first balancer
which removes the behavior of preserving the active subconnection if
a list of new addresses contains the currently active address.
kisunji pushed a commit that referenced this pull request Dec 19, 2022
…e/1.13.x (#15835)

* Add custom balancer to always remove subConns (#15701)

The new balancer is a patched version of gRPC's default pick_first balancer
which removes the behavior of preserving the active subconnection if
a list of new addresses contains the currently active address.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-inactive/1.12 This release series is no longer active backport-inactive/1.13 This release series is no longer active backport-inactive/1.14 This release series is no longer active
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New WARN in 1.10.0 caused by shuffling the servers in the gRPC ClientConn pool
6 participants