grpc: perform graceful switching of LB policies in the ClientConn by default #5285

Merged: 16 commits into grpc:master on Apr 1, 2022

Conversation

@easwars (Contributor) commented Mar 31, 2022

This PR changes the default behavior of ClientConn when it ends up switching the LB policy.

  • Earlier, when it received a service config update pointing to a new LB policy, it closed the old policy and started the new one.
  • Now, it uses the gracefulswitch.Balancer to gracefully switch from the old balancer to the new one.
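
For context, here is a minimal sketch of how a wrapper might drive that switch. Only balancer.Get, gracefulswitch.NewBalancer, and gracefulswitch.Balancer.SwitchTo are real grpc-go APIs; the wrapper type is illustrative, and the gracefulswitch package lives under internal/, so it is only importable from within the grpc-go module:

```go
// Sketch: driving a graceful LB policy switch. The old policy keeps serving
// RPCs until the new one produces a usable picker; only then is it closed.
package sketch

import (
	"fmt"

	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/internal/balancer/gracefulswitch"
)

type wrapperSketch struct {
	balancer *gracefulswitch.Balancer // built via gracefulswitch.NewBalancer(cc, opts)
}

// switchTo swaps in the named LB policy gracefully.
func (w *wrapperSketch) switchTo(name string) error {
	builder := balancer.Get(name)
	if builder == nil {
		return fmt.Errorf("LB policy %q not registered", name)
	}
	return w.balancer.SwitchTo(builder)
}
```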

Summary of code changes:

  • Use a gracefulswitch.Balancer in ccBalancerWrapper instead of balancer.Balancer
    • Add a switchTo method to the balancer wrapper which calls SwitchTo on the gracefulswitch.Balancer
    • Ensure calls from grpc to the underlying balancers are ordered
      • For every type of update from grpc, we now have (see the sketch after this list):
        • a method which pushes the update into the update channel
        • a method which handles the update, when processed by the watcher goroutine
      • An enum to represent the type of grpc update pushed into the update channel
    • The watcher goroutine is now merely a skeleton, with functionality moved to appropriate handlers
    • Add more comments to the balancer wrapper code
  • The following functionality has been moved out of ClientConn:
    • Falling back to the default LB policy when the specified LB policy is not registered is moved to the ccBalancerWrapper
    • Filtering of addresses of type GRPCLB is now handled by the ccBalancerWrapper
    • Manually calling Connect on the subConns if the underlying balancer does not support ExitIdle is moved to the gracefulswitch.Balancer
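
A condensed sketch of that push/handle split follows. The names (watcherUpdate, updateCh, the handlers) are paraphrased from snippets quoted later in this review, not copied from the merged code; buffer.Unbounded and grpcsync.Event are real grpc-go internal types:

```go
// Sketch of the ordered-update pattern: inbound calls from grpc only enqueue
// work, and a single watcher goroutine dispatches each update to its handler,
// so all calls into the underlying balancer are serialized and ordered.
package sketch

import (
	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/internal/buffer"
	"google.golang.org/grpc/internal/grpcsync"
)

type updateType int

const (
	updateTypeClientConnState updateType = iota
	updateTypeSwitchTo
)

type watcherUpdate struct {
	typ    updateType  // which kind of grpc update this is
	update interface{} // payload, e.g. *balancer.ClientConnState
}

type ccBalancerWrapper struct {
	updateCh *buffer.Unbounded // updates from grpc, consumed by watcher()
	resultCh *buffer.Unbounded // results of UpdateClientConnState calls
	closed   *grpcsync.Event   // fired when the wrapper is closed
}

// switchTo enqueues the request; it never calls into the balancer directly.
func (ccb *ccBalancerWrapper) switchTo(name string) {
	ccb.updateCh.Put(watcherUpdate{typ: updateTypeSwitchTo, update: name})
}

// watcher is the only goroutine that invokes the underlying balancer.
func (ccb *ccBalancerWrapper) watcher() {
	for {
		select {
		case u := <-ccb.updateCh.Get():
			ccb.updateCh.Load() // allow the next buffered item to be delivered
			switch w := u.(watcherUpdate); w.typ {
			case updateTypeClientConnState:
				ccb.handleClientConnStateChange(w.update.(*balancer.ClientConnState))
			case updateTypeSwitchTo:
				ccb.handleSwitchTo(w.update.(string))
			}
		case <-ccb.closed.Done():
			return
		}
	}
}

// Handler stubs; the real handlers call into the gracefulswitch.Balancer.
func (ccb *ccBalancerWrapper) handleClientConnStateChange(ccs *balancer.ClientConnState) {}
func (ccb *ccBalancerWrapper) handleSwitchTo(name string)                                {}
```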

Summary of test changes:

  • Added a test in test/balancer_switching_test.go to exercise the graceful switching functionality e2e
  • While adding the above test, I realized the stubBalancer can wrap a real balancer. Updated an existing test in test/resolver_update_test.go, which was using a custom balancer for the same purpose, to use the stubBalancer instead.
  • Added ParseConfig functionality to the stubBalancer
  • Updated a test in clientconn_test.go to fail within a reasonable amount of time. (This test was timing out while I was making my changes, so it had to be fixed.)

Fixes #5270

RELEASE NOTES:

  • perform graceful switching of LB policies in the ClientConn by default

Review thread (balancer_conn_wrappers.go):

// With the use of gracefulswitch.Balancer in ccBalancerWrapper, handling this
// asynchronously is probably not required anymore since the switchTo() method
// handles the balancer switch by pushing the update onto the channel.
// TODO(easwars): Handle this inline.

Member: Is it worthwhile? This would be the only thing done inline, then? Seems better to keep everything going through the channel unless there's a compelling reason to skip it for this case.

easwars (Author):
This is the only call from the underlying balancer which goes through the channel instead of being handled inline. And this was being handled inline before we discovered a couple of deadlocks which forced us to release cc.mu before calling ccb.close(). But now, ccb.close() will only be called from cc.Close() and without holding cc.mu. So, it should be safe to handle this inline.

Contributor:
We might be able to move all the outgoing calls to the goroutine, and do all the incoming calls inline.

easwars (Author):
I have already moved all outgoing calls to the goroutine. Do you want me to switch this to be done inline in this PR?

Review thread (clientconn.go, outdated diff):

	cc.curBalancerName = builder.Name()
	cc.balancerWrapper = newCCBalancerWrapper(cc, builder, cc.balancerBuildOpts)
}

func (cc *ClientConn) handleSubConnStateChange(sc balancer.SubConn, s connectivity.State, err error) {
	cc.mu.Lock()
	if cc.conns == nil {

Member:
Maybe this is unnecessary too? Add a TODO to remove if you agree, perhaps.

easwars (Author):
Removed this.

@dfawley assigned easwars and unassigned dfawley on Mar 31, 2022
@easwars (Author) left a review comment:

Thanks for your review.


@easwars (Author) commented Mar 31, 2022

I was able to run the test 1000 times with the race detector without failures. Ready to be looked at again.

@easwars assigned dfawley and unassigned easwars on Mar 31, 2022
Review thread (gracefulswitch.Balancer):

@@ -178,6 +178,10 @@ func (gsb *Balancer) ResolverError(err error) {
}

// ExitIdle forwards the call to the latest balancer created.
//
// If the latest balancer does not support ExitIdle, the subConns need to be
// re-connected manually. This used to be done in the ClientConn earlier, but

Contributor:
I would delete the sentence about how it used to be. Or at least it doesn't belong here.

easwars (Author):
Done.
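
For reference, the fallback being discussed reads roughly as follows. This is a paraphrase of the method under review; the latestBalancer helper and subconns field are assumptions based on the surrounding discussion, not verbatim merged code:

```go
// ExitIdle forwards the call to the latest balancer created. If that balancer
// does not implement balancer.ExitIdler, fall back to re-connecting each of
// its SubConns manually by calling Connect on them.
func (gsb *Balancer) ExitIdle() {
	balToUpdate := gsb.latestBalancer()
	if balToUpdate == nil {
		return
	}
	if ei, ok := balToUpdate.Balancer.(balancer.ExitIdler); ok {
		ei.ExitIdle()
		return
	}
	gsb.mu.Lock()
	defer gsb.mu.Unlock()
	for sc := range balToUpdate.subconns {
		sc.Connect()
	}
}
```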

Review thread (balancer_conn_wrappers.go):

type exitIdle struct{}

- // ccBalancerWrapper is a wrapper on top of cc for balancers.
+ // ccBalancerWrapper is a wrapper on top of cc for balancers. It ensures that

Contributor:
Hmm it's actually both a cc wrapper and a balancer wrapper.

easwars (Author):
How about now?

Review thread (balancer_conn_wrappers.go):

	updateCh *buffer.Unbounded // Updates written on this channel are processed by watcher().
	resultCh *buffer.Unbounded // Results of calls to UpdateClientConnState() are pushed here.
	closed   *grpcsync.Event   // Indicates if close has been called.
	done     *grpcsync.Event   // Indicates if close has completed its work.

	mu       sync.Mutex
	subConns map[*acBalancerWrapper]struct{}

Contributor:
Do we still need this? Doesn't look like we use it for anything now.
And the gracefulswitch balancer also keeps maps of SubConns. Is that good enough?

easwars (Author):
Yes, you are right. I was able to get rid of this.

Review thread (balancer_conn_wrappers.go):

// back to grpc which propagates that to the resolver.
func (ccb *ccBalancerWrapper) updateClientConnState(ccs *balancer.ClientConnState) error {
	ccb.updateCh.Put(watcherUpdate{typ: updateTypeClientConnState, update: ccs})
	res := <-ccb.resultCh.Get()

Contributor:
Any chance this will block forever? When things are closing?

easwars (Author):
Thanks. I added a select and made it exit if close is called.
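
The resulting shape might look like this (a sketch using the same paraphrased names as the sketch after the code-changes summary above; error plumbing simplified):

```go
// updateClientConnState enqueues the update and blocks for the result, but
// gives up if the wrapper is closed so callers cannot hang during shutdown.
func (ccb *ccBalancerWrapper) updateClientConnState(ccs *balancer.ClientConnState) error {
	ccb.updateCh.Put(watcherUpdate{typ: updateTypeClientConnState, update: ccs})

	var res interface{}
	select {
	case res = <-ccb.resultCh.Get():
		ccb.resultCh.Load()
	case <-ccb.closed.Done():
		// The watcher goroutine has exited; treat the update as a no-op.
		return nil
	}
	if res == nil {
		return nil
	}
	return res.(error)
}
```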

Review thread (balancer_conn_wrappers.go):

// and the selected LB policy is not "grpclb", these addresses will be filtered
// out and ccs will be modified with the updated address list.
func (ccb *ccBalancerWrapper) handleClientConnStateChange(ccs *balancer.ClientConnState) {
	ccb.balancerMu.Lock()

Contributor:
Why does this hold balancerMu?

easwars (Author):
Because it calls ccb.balancer.UpdateClientConnState in the last line. Would you rather that I grab the lock around that statement?

Review thread (balancer_conn_wrappers.go):

// forward to the resolver.
func (ccb *ccBalancerWrapper) updateClientConnState(ccs *balancer.ClientConnState) error {
// handleSubConnStateChange handles a SubConnState update from the update
// channel and invokes the appropriate method on the underlying balancer.

Contributor:
What's the reason to not push this to the channel?
And document that.

And if we do all the outgoing calls in the watcher goroutine, maybe we don't need the balancerMu?
There were two methods that need to happen inline: Close() and ClientConnStateChange() (because of the return error). But both are handled now.

easwars (Author):

> What's the reason to not push this to the channel?

SubConn state change is pushed onto the channel in updateSubConnState(). This is the method which is called when the watcher reads that update from the channel.

> And if we do all the outgoing calls in the watcher goroutine, maybe we don't need the balancerMu?

Woohoo!!


@easwars assigned menghanl and unassigned easwars on Mar 31, 2022
@easwars (Author) commented Apr 1, 2022

I got rid of the trace events check in the tests. All tests are happy now. This time it's definitely ready for another look.
@dfawley @menghanl

@easwars (Author) commented Apr 1, 2022

Thank you very much for the review @dfawley @menghanl

@easwars merged commit 0066bf6 into grpc:master on Apr 1, 2022
@easwars deleted the graceful_switch_final branch on July 15, 2022
The github-actions bot locked this conversation as resolved and limited it to collaborators on Jan 14, 2023
Successfully merging this pull request may close these issues.

  • grpc: use gracefulswitch.Balancer by default in the ClientConn (#5270)