core: throw away subchannel references after round_robin is shutdown #8132

voidzcy · 2021-04-30T19:15:09Z

After a Subchannel is shut down, its state listener receives (after a 5s delay) a connectivity state update with the SHUTDOWN state. round_robin picks up that update and triggers a balancing state update to its upstream with TRANSIENT_FAILURE and an empty picker that buffers RPCs.

This can cause extremely hard-to-debug problems. For example, when round_robin is shut down (e.g., when switching to another LB policy, or in complex cases like xDS where endpoint-level load balancing is turned off because the EDS resource was revoked) while its replacement has not yet produced a picker, the Channel updates its picker to one that buffers RPCs (instead of continuing to use the current one or failing RPCs).

Note that the same situation is fairly safe in pick_first: any subchannel state update that arrives after the LB's view of the subchannel's state has become SHUTDOWN is ignored.
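To make the sequence above concrete, here is a minimal, self-contained sketch. It is a toy model and not the grpc-java code: `ToyRoundRobin`, `ToySubchannel`, `RecordingUpstream`, and the `clearOnShutdown` flag are hypothetical names, and the state aggregation is simplified to "READY if any subchannel is READY, otherwise TRANSIENT_FAILURE". It only illustrates why dropping the subchannel references on shutdown makes the late SHUTDOWN notification a no-op.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: every type here is a hypothetical stand-in, not a grpc-java class. */
final class RoundRobinShutdownSketch {
  enum State { IDLE, READY, TRANSIENT_FAILURE, SHUTDOWN }

  /** Stand-in for the upstream (Channel or parent LB) that receives balancing state updates. */
  static final class RecordingUpstream {
    final List<State> updates = new ArrayList<>();

    void updateBalancingState(State state) {
      updates.add(state);
    }
  }

  static final class ToySubchannel {
    final String address;
    State state = State.IDLE;

    ToySubchannel(String address) {
      this.address = address;
    }
  }

  static final class ToyRoundRobin {
    private final RecordingUpstream upstream;
    private final boolean clearOnShutdown;
    private final Map<String, ToySubchannel> subchannels = new LinkedHashMap<>();

    ToyRoundRobin(RecordingUpstream upstream, boolean clearOnShutdown) {
      this.upstream = upstream;
      this.clearOnShutdown = clearOnShutdown;
    }

    ToySubchannel addSubchannel(String address) {
      ToySubchannel subchannel = new ToySubchannel(address);
      subchannels.put(address, subchannel);
      return subchannel;
    }

    /** Invoked when a subchannel reports a new connectivity state. */
    void onSubchannelState(ToySubchannel subchannel, State newState) {
      // Updates for subchannels the policy no longer tracks are dropped.
      if (subchannels.get(subchannel.address) != subchannel) {
        return;
      }
      subchannel.state = newState;
      // Simplified aggregation: READY if any subchannel is READY, else TRANSIENT_FAILURE.
      boolean anyReady = false;
      for (ToySubchannel s : subchannels.values()) {
        anyReady |= s.state == State.READY;
      }
      upstream.updateBalancingState(anyReady ? State.READY : State.TRANSIENT_FAILURE);
    }

    void shutdown() {
      // Shutting down the subchannels themselves is omitted in this toy model.
      if (clearOnShutdown) {
        subchannels.clear(); // throw away the references
      }
    }
  }

  public static void main(String[] args) {
    for (boolean clear : new boolean[] {false, true}) {
      RecordingUpstream upstream = new RecordingUpstream();
      ToyRoundRobin policy = new ToyRoundRobin(upstream, clear);
      ToySubchannel subchannel = policy.addSubchannel("10.0.0.1:443");
      policy.shutdown();
      // The delayed SHUTDOWN notification arrives after the policy has been shut down.
      policy.onSubchannelState(subchannel, State.SHUTDOWN);
      System.out.println("clearOnShutdown=" + clear + " late updates=" + upstream.updates);
    }
  }
}
```

In this model, the run with `clearOnShutdown=false` records one late TRANSIENT_FAILURE update, while the run with `clearOnShutdown=true` records none.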

ejona86 · 2021-04-30T19:30:01Z

> such as when round_robin is shut down while its replacement has not yet produced a picker, the Channel updates the picker to buffer RPCs (instead of keep using the current one or fail RPCs).

It sounds like round_robin shouldn't have been shut down in that case. If the picker is still being used, the LB policy shouldn't be shut down. The policy shutdown will shut down all its subchannels, so a buffering picker is actually pretty apt. The only other option (and not a bad one, except that updating the picker after shutdown should be a no-op) is an erroring picker.

This change looks quite fair, but not for the reason presented.

voidzcy · 2021-04-30T19:42:28Z

> It sounds like round_robin shouldn't have been shut down in that case. If the picker is still being used, the LB policy shouldn't be shut down.

This sounds fair, and it is what AutoConfiguredLoadBalancerFactory does today: it first replaces the picker and then shuts down the current LB policy. We should change xDS to do the same (today it first shuts down the downstream LB policy and then replaces the picker).

But still, if we do not prevent round_robin from triggering a balancing state update after it has been shut down, the Channel's picker would still be swapped by it.
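The ordering described above can be sketched as follows. This is a toy model under stated assumptions, not the AutoConfiguredLoadBalancerFactory or xDS code: `ToyChannel`, `ToyPolicy`, `switchPolicy`, and the "placeholder" picker name are all hypothetical. The point is only that the caller installs its own picker before shutting down the old policy, so the old policy's picker is never in use while that policy is shut down.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: hypothetical types illustrating the order of operations. */
final class SwitchOrderSketch {
  /** Stand-in for the Channel: holds the picker in use and logs what happens. */
  static final class ToyChannel {
    String picker = "none";
    final List<String> events = new ArrayList<>();

    void updatePicker(String newPicker) {
      picker = newPicker;
      events.add("picker=" + newPicker);
    }
  }

  /** Stand-in for an LB policy; shutdown only records which picker was in use. */
  static final class ToyPolicy {
    final String name;
    final ToyChannel channel;

    ToyPolicy(String name, ToyChannel channel) {
      this.name = name;
      this.channel = channel;
    }

    void shutdown() {
      channel.events.add("shutdown " + name + " (picker in use: " + channel.picker + ")");
    }
  }

  /** Replace the picker first, then shut down the old policy. */
  static void switchPolicy(ToyChannel channel, ToyPolicy oldPolicy) {
    channel.updatePicker("placeholder"); // hypothetical caller-owned picker
    oldPolicy.shutdown();
  }

  public static void main(String[] args) {
    ToyChannel channel = new ToyChannel();
    ToyPolicy roundRobin = new ToyPolicy("round_robin", channel);
    channel.updatePicker("round_robin");
    switchPolicy(channel, roundRobin);
    for (String event : channel.events) {
      System.out.println(event);
    }
  }
}
```

Ordering alone does not stop a late update from the old policy, which is the concern raised in the last paragraph above.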

ejona86 · 2021-04-30T19:46:44Z

> But still, if we do not prevent round_robin triggering balancing state update after it being shutdown, the Channel's picker would still be swapped by it.

Yes, we should prevent that. Fixing round_robin doesn't "fix" that issue; only fixing all policies and making sure new ones are okay and keeping existing ones from regressing would avoid the need. Much better to address the problem in the caller, since it seems it would be easy.
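A caller-side guard of the kind suggested here might look like the sketch below. It is a simplified model, not the grpc-java `LoadBalancer.Helper` API: `ToyHelper`, `RecordingHelper`, `ChildHelper`, and `markChildShutdown` are hypothetical names, and the sketch assumes all calls happen on a single thread (so the boolean flag needs no synchronization).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: hypothetical types, not the grpc-java LoadBalancer.Helper API. */
final class IgnoreAfterShutdownSketch {
  /** Stand-in for the callback a child policy uses to publish its state and picker. */
  interface ToyHelper {
    void updateBalancingState(String state, String picker);
  }

  /** Stand-in for the Channel side; records every update that reaches it. */
  static final class RecordingHelper implements ToyHelper {
    final List<String> updates = new ArrayList<>();

    @Override
    public void updateBalancingState(String state, String picker) {
      updates.add(state + " with " + picker);
    }
  }

  /** Wrapper owned by the caller; forwards updates only while the child is alive. */
  static final class ChildHelper implements ToyHelper {
    private final ToyHelper delegate;
    private boolean childShutdown;

    ChildHelper(ToyHelper delegate) {
      this.delegate = delegate;
    }

    /** The caller invokes this when it shuts the child policy down. */
    void markChildShutdown() {
      childShutdown = true;
    }

    @Override
    public void updateBalancingState(String state, String picker) {
      if (childShutdown) {
        return; // late update from a child that has been shut down: ignore it
      }
      delegate.updateBalancingState(state, picker);
    }
  }

  public static void main(String[] args) {
    RecordingHelper channelSide = new RecordingHelper();
    ChildHelper childHelper = new ChildHelper(channelSide);
    childHelper.updateBalancingState("READY", "round-robin picker");
    childHelper.markChildShutdown();
    // A misbehaving child publishes after shutdown; the wrapper drops it.
    childHelper.updateBalancingState("TRANSIENT_FAILURE", "buffering picker");
    System.out.println(channelSide.updates);
  }
}
```

With a wrapper like this, the guard lives in one place in the caller and does not depend on every child policy behaving correctly after shutdown.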

voidzcy · 2021-04-30T19:54:00Z

> Yes, we should prevent that. Fixing round_robin doesn't "fix" that issue; only fixing all policies and making sure new ones are okay and keeping existing ones from regressing would avoid the need. Much better to address the problem in the caller, since it seems it would be easy.

Right, I totally agree. That's why I was saying "multiple things not working perfectly well, which then causes the whole thing to break". I am going through all callers (or LBs that could potentially run into something like this) to fix problematic/risky usages. This PR only fixes RR. Does this sound good to you?

ejona86

We can make this change to preserve the normal invariant that shutdown/unused subchannels are not present in subchannels. But this should not be considered a fix for any problem anybody is noticing.

…rpc#8132) Triggering balancing state updates after it has already been shut down can be confusing for the upstream of round_robin. In cases where the callers do not manage round_robin's lifecycle properly (e.g., not ignoring updates after shutting down round_robin, which they should), this can make the problem much worse, especially since round_robin actually propagates TRANSIENT_FAILURE with a picker that buffers RPCs. This change only polishes round_robin by always preserving its invariant. Callers/LBs should not rely on this and should still manage the balancing updates from their downstream correctly based on the downstream's lifetime.

…8132) (#8155) Triggering balancing state updates after it has already been shut down can be confusing for the upstream of round_robin. In cases where the callers do not manage round_robin's lifecycle properly (e.g., not ignoring updates after shutting down round_robin, which they should), this can make the problem much worse, especially since round_robin actually propagates TRANSIENT_FAILURE with a picker that buffers RPCs. This change only polishes round_robin by always preserving its invariant. Callers/LBs should not rely on this and should still manage the balancing updates from their downstream correctly based on the downstream's lifetime.

Throw away subchannel references after round_robin LB policy is shutd…

188149d

…own. This avoids receiving balancing state updates after the LB policy is shutdown.

voidzcy requested review from dapengzhang0 and ejona86 April 30, 2021 19:15

ejona86 approved these changes Apr 30, 2021

voidzcy merged commit 368c43a into grpc:master Apr 30, 2021

This was referenced May 3, 2021

xds: ignore balancing state update from downstream after LB shutdown #8134

Merged

xds: throw away subchannel references after ring_hash is shutdown #8140

Merged

voidzcy mentioned this pull request May 10, 2021

core: throw away subchannel references after round_robin is shutdown (v1.37.x backport) #8155

Merged

github-actions bot locked as resolved and limited conversation to collaborators Jul 30, 2021
