core: throw away subchannel references after round_robin is shutdown #8132

Merged


voidzcy
Contributor

@voidzcy voidzcy commented Apr 30, 2021

After a Subchannel is shut down, its state listener receives (after a 5s delay) a connectivity state update with SHUTDOWN state. round_robin picks up that state update and triggers a balancing state update to its upstream with TRANSIENT_FAILURE and an empty picker that buffers RPCs.

This can cause extremely hard-to-debug problems. For example, when round_robin is shut down (e.g., when switching to another LB policy, or in complex cases like xDS where endpoint-level load balancing is turned off because the EDS resource was revoked) while its replacement has not yet produced a picker, the Channel updates its picker to one that buffers RPCs (instead of continuing to use the current one or failing RPCs).

Note that the same thing is fairly safe in pick_first: any subchannel state update that arrives after the LB's view of the subchannel's state has become SHUTDOWN is ignored.
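
To make the failure mode concrete, here is a minimal, self-contained sketch in plain Java. It does not use grpc-java's classes: `Policy`, `RecordingUpstream`, the `clearOnShutdown` flag, and the address string are hypothetical stand-ins, and the state aggregation is deliberately simplified. It only illustrates the idea described above: once the policy throws away its subchannel references at shutdown, a late SHUTDOWN notification no longer matches a tracked subchannel and is ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model for illustration only; none of these types are grpc-java classes.
final class RoundRobinShutdownSketch {
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN }

  /** Hypothetical upstream that records every balancing state it receives. */
  static final class RecordingUpstream {
    final List<State> updates = new ArrayList<>();

    void updateBalancingState(State state) {
      updates.add(state);
    }
  }

  /** Hypothetical round_robin-like policy that tracks subchannels by address. */
  static final class Policy {
    private final Map<String, State> subchannels = new LinkedHashMap<>();
    private final RecordingUpstream upstream;
    private final boolean clearOnShutdown;

    Policy(RecordingUpstream upstream, boolean clearOnShutdown) {
      this.upstream = upstream;
      this.clearOnShutdown = clearOnShutdown;
    }

    void addSubchannel(String address) {
      subchannels.put(address, State.IDLE);
    }

    /** Invoked by a subchannel's state listener, possibly after shutdown(). */
    void onSubchannelState(String address, State newState) {
      if (!subchannels.containsKey(address)) {
        return; // Not a subchannel we still track: ignore the update.
      }
      subchannels.put(address, newState);
      upstream.updateBalancingState(aggregate());
    }

    void shutdown() {
      // The SHUTDOWN notifications for the subchannels arrive later.
      if (clearOnShutdown) {
        subchannels.clear(); // Drop references so late updates are ignored.
      }
    }

    private State aggregate() {
      if (subchannels.containsValue(State.READY)) {
        return State.READY;
      }
      if (subchannels.containsValue(State.CONNECTING)
          || subchannels.containsValue(State.IDLE)) {
        return State.CONNECTING;
      }
      return State.TRANSIENT_FAILURE;
    }
  }

  /** Runs one scenario and returns what the upstream observed. */
  static List<State> run(boolean clearOnShutdown) {
    RecordingUpstream upstream = new RecordingUpstream();
    Policy policy = new Policy(upstream, clearOnShutdown);
    policy.addSubchannel("10.0.0.1:443");
    policy.onSubchannelState("10.0.0.1:443", State.READY);
    policy.shutdown();
    // Late notification delivered after the policy was shut down.
    policy.onSubchannelState("10.0.0.1:443", State.SHUTDOWN);
    return upstream.updates;
  }

  public static void main(String[] args) {
    System.out.println("without clearing: " + run(false)); // [READY, TRANSIENT_FAILURE]
    System.out.println("with clearing:    " + run(true));  // [READY]
  }
}
```

In this model, the run without clearing leaks a TRANSIENT_FAILURE update to the upstream after shutdown, while the run with clearing stops at the last update produced before shutdown.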

…own. This avoids receiving balancing state updates after the LB policy is shutdown.
@ejona86
Member

ejona86 commented Apr 30, 2021

> such as when round_robin is shut down while its replacement has not yet produced a picker, the Channel updates the picker to buffer RPCs (instead of keep using the current one or fail RPCs).

It sounds like round_robin shouldn't have been shut down in that case. If the picker is still being used, the LB policy shouldn't be shut down. The policy shutdown will shut down all its subchannels, so a buffering picker is actually pretty apt. The only other option (and not a bad one, except that updating the picker after shutdown should be a noop) is an erroring picker.

This change looks quite fair, but not for the reason presented.

@voidzcy
Contributor Author

voidzcy commented Apr 30, 2021

> It sounds like round_robin shouldn't have been shut down in that case. If the picker is still being used, the LB policy shouldn't be shut down.

This sounds fair, and it is what AutoConfiguredLoadBalancerFactory does today: it first replaces the picker and then shuts down the current LB policy. We should change xDS to do the same (today it first shuts down the downstream LB policy and then replaces the picker).

But still, if we do not prevent round_robin from triggering a balancing state update after it has been shut down, the Channel's picker would still be swapped by it.
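
The ordering difference mentioned in this comment can be illustrated with a small plain-Java model. `Channel`, `Policy`, and the `"placeholder"` picker name are hypothetical and do not reflect how any grpc-java class is actually implemented. The sketch only records which picker the channel holds at the moment the old policy is shut down.

```java
// Simplified model for illustration only; none of these types are grpc-java classes.
final class PolicySwitchOrderSketch {
  /** Hypothetical channel that remembers who produced its current picker. */
  static final class Channel {
    String pickerOwner;

    void updatePicker(String owner) {
      pickerOwner = owner;
    }
  }

  /** Hypothetical LB policy that notes the channel's picker when it is shut down. */
  static final class Policy {
    private final String name;
    private final Channel channel;
    String pickerOwnerSeenAtShutdown;

    Policy(String name, Channel channel) {
      this.name = name;
      this.channel = channel;
    }

    void publishPicker() {
      channel.updatePicker(name);
    }

    void shutdown() {
      pickerOwnerSeenAtShutdown = channel.pickerOwner;
    }
  }

  /** Returns the owner of the channel's picker at the time the old policy shuts down. */
  static String pickerOwnerAtShutdown(boolean replacePickerFirst) {
    Channel channel = new Channel();
    Policy oldPolicy = new Policy("old", channel);
    oldPolicy.publishPicker();
    if (replacePickerFirst) {
      channel.updatePicker("placeholder"); // Swap the picker, then shut down.
      oldPolicy.shutdown();
    } else {
      oldPolicy.shutdown();                // Shut down while its picker is still in use.
      channel.updatePicker("placeholder");
    }
    return oldPolicy.pickerOwnerSeenAtShutdown;
  }

  public static void main(String[] args) {
    System.out.println("replace first:  " + pickerOwnerAtShutdown(true));
    System.out.println("shutdown first: " + pickerOwnerAtShutdown(false));
  }
}
```

With the picker replaced first, the old policy's picker is no longer in use when the policy shuts down. This model does not cover updates that arrive later, which is the remaining concern raised in the last paragraph of the comment.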

@ejona86
Member

ejona86 commented Apr 30, 2021

> But still, if we do not prevent round_robin triggering balancing state update after it being shutdown, the Channel's picker would still be swapped by it.

Yes, we should prevent that. Fixing round_robin doesn't "fix" that issue; only fixing all policies and making sure new ones are okay and keeping existing ones from regressing would avoid the need. Much better to address the problem in the caller, since it seems it would be easy.
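
One way a caller could guard against this, sketched in plain Java under stated assumptions: the parent gives each child policy its own helper wrapper and stops forwarding updates once it has shut that child down. `Helper`, `RecordingHelper`, `ChildHelper`, and `markChildShutdown()` are hypothetical names for this sketch, not grpc-java APIs, and the sketch is single-threaded on the assumption that all calls happen on one serialized context.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model for illustration only; none of these types are grpc-java classes.
final class GuardedHelperSketch {
  /** Hypothetical, trimmed-down helper interface. */
  interface Helper {
    void updateBalancingState(String state, String picker);
  }

  /** Hypothetical parent-side helper that records what reaches the channel. */
  static final class RecordingHelper implements Helper {
    final List<String> updates = new ArrayList<>();

    @Override
    public void updateBalancingState(String state, String picker) {
      updates.add(state + "/" + picker);
    }
  }

  /** Wrapper handed to a child policy; drops updates after the child is shut down. */
  static final class ChildHelper implements Helper {
    private final Helper parent;
    private boolean childShutdown;

    ChildHelper(Helper parent) {
      this.parent = parent;
    }

    /** The caller invokes this when it shuts down the child policy. */
    void markChildShutdown() {
      childShutdown = true;
    }

    @Override
    public void updateBalancingState(String state, String picker) {
      if (childShutdown) {
        return; // Late update from a shut-down child: ignore it.
      }
      parent.updateBalancingState(state, picker);
    }
  }

  /** Runs the scenario and returns the updates that reached the parent. */
  static List<String> run() {
    RecordingHelper parent = new RecordingHelper();
    ChildHelper childHelper = new ChildHelper(parent);
    childHelper.updateBalancingState("READY", "round_robin picker");
    childHelper.markChildShutdown();
    // Late update emitted by the child after it was shut down.
    childHelper.updateBalancingState("TRANSIENT_FAILURE", "buffering picker");
    return parent.updates;
  }

  public static void main(String[] args) {
    System.out.println(run()); // [READY/round_robin picker]
  }
}
```

Placing the guard in the caller means it applies to every child policy, regardless of whether a given policy stops emitting updates after shutdown.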

@voidzcy
Contributor Author

voidzcy commented Apr 30, 2021

> Yes, we should prevent that. Fixing round_robin doesn't "fix" that issue; only fixing all policies and making sure new ones are okay and keeping existing ones from regressing would avoid the need. Much better to address the problem in the caller, since it seems it would be easy.

Right, I totally agree. That's why I was saying "multiple things not working perfectly well, then causes the whole thing being broken". I am going through all callers (or LBs that could potentially run into something like this) to fix problematic or risky usages. This PR would just fix RR. Does this sound good to you?

Member

@ejona86 ejona86 left a comment


We can make this change to preserve the normal invariant that shutdown/unused subchannels are not present in subchannels. But this should not be considered a fix for any problem anybody is noticing.

@voidzcy voidzcy merged commit 368c43a into grpc:master Apr 30, 2021
voidzcy added a commit to voidzcy/grpc-java that referenced this pull request May 10, 2021
…rpc#8132)

Triggering balancing state updates after round_robin has already been shut down can be confusing for its upstream. When a caller does not manage round_robin's lifecycle correctly (e.g., it does not ignore updates after shutting down round_robin, which it should), this can make the problem much worse, especially because round_robin actually propagates TRANSIENT_FAILURE with a picker that buffers RPCs.

This change only polishes round_robin by always preserving its invariant. Callers/LBs should not rely on this and should still manage the balancing updates from their downstream correctly based on the downstream's lifetime.
voidzcy added a commit that referenced this pull request May 10, 2021
…8132) (#8155)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 30, 2021