core: make subchannel creation timing restriction stricter #7790

voidzcy · 2021-01-08T01:31:58Z

Throw on subchannel creation if the channel is shutting down and the delayed transport has terminated (i.e., all retry calls have finished).

This enforces a tighter timing restriction on load balancer implementations, ensuring that no subchannel is created after the balancer itself has been shut down. A load balancer implementation may not always check its own state before creating new subchannels, and the two can get out of sync when subchannel creation is queued and executed later. We ran into the exceptions below in xDS, where PriorityLoadBalancer enqueues address propagation to its child policy (to avoid reentrancy from updateBalancingState() upcalls) and the Channel (and therefore the LB tree) is shut down before the queued task executes. Although we've fixed the existing LB implementations to not delay address propagation (and instead enqueue upcalls if necessary), checking consistency at LoadBalancer.Helper and throwing if the load balancer has been shut down helps such bugs in load balancer implementations surface earlier.

~/grpc-java-1.34.1/examples/example-xds$ ./build/install/example-xds/bin/xds-hello-world-client "world"    xds:///helloworld-gce
Dec 29, 2020 6:42:40 PM io.grpc.examples.helloworldxds.XdsHelloWorldClient greet
INFO: Will try to greet world ...
Dec 29, 2020 6:42:43 PM io.grpc.examples.helloworldxds.XdsHelloWorldClient greet
INFO: Greeting: Hello world, from grpc-td-mig-us-central1-682b
Dec 29, 2020 6:42:43 PM io.grpc.internal.ManagedChannelImpl$2 uncaughtException
SEVERE: [Channel<1>: (xds:///helloworld-gce)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException
        at io.grpc.internal.ManagedChannelImpl$SubchannelImpl.requestConnection(ManagedChannelImpl.java:1914)
        at io.grpc.util.ForwardingSubchannel.requestConnection(ForwardingSubchannel.java:49)
        at io.grpc.util.RoundRobinLoadBalancer.handleResolvedAddresses(RoundRobinLoadBalancer.java:113)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HealthCheckingLoadBalancer.handleResolvedAddresses(HealthCheckingLoadBalancerFactory.java:189)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.xds.LrsLoadBalancer.handleResolvedAddresses(LrsLoadBalancer.java:87)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.xds.WeightedTargetLoadBalancer.handleResolvedAddresses(WeightedTargetLoadBalancer.java:88)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.xds.PriorityLoadBalancer$ChildLbState$1.run(PriorityLoadBalancer.java:263)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.EdsLoadBalancer2$EdsLbState$ChildLbState.onChanged(EdsLoadBalancer2.java:265)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.notifyWatcher(ClientXdsClient.java:856)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.onData(ClientXdsClient.java:815)
        at io.grpc.xds.ClientXdsClient.handleEdsResponse(ClientXdsClient.java:495)
        at io.grpc.xds.AbstractXdsClient$AbstractAdsStream.handleRpcResponse(AbstractXdsClient.java:493)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1$1.run(AbstractXdsClient.java:576)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:568)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:565)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:465)
        at io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:448)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:716)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:701)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Dec 29, 2020 6:42:43 PM io.grpc.internal.ManagedChannelImpl$2 uncaughtException
SEVERE: [Channel<1>: (xds:///helloworld-gce)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException
        at io.grpc.internal.ManagedChannelImpl$SubchannelImpl.getAllAddresses(ManagedChannelImpl.java:1921)
        at io.grpc.util.ForwardingSubchannel.getAllAddresses(ForwardingSubchannel.java:54)
        at io.grpc.LoadBalancer$Subchannel.getAddresses(LoadBalancer.java:1263)
        at io.grpc.util.RoundRobinLoadBalancer.processSubchannelState(RoundRobinLoadBalancer.java:139)
        at io.grpc.util.RoundRobinLoadBalancer.access$000(RoundRobinLoadBalancer.java:53)
        at io.grpc.util.RoundRobinLoadBalancer$1.onSubchannelState(RoundRobinLoadBalancer.java:109)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HealthCheckState.gotoState(HealthCheckingLoadBalancerFactory.java:332)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HealthCheckState.adjustHealthCheck(HealthCheckingLoadBalancerFactory.java:295)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HealthCheckState.onSubchannelState(HealthCheckingLoadBalancerFactory.java:275)
        at io.grpc.internal.ManagedChannelImpl$SubchannelImpl$1.run(ManagedChannelImpl.java:1771)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.EdsLoadBalancer2$EdsLbState$ChildLbState.onChanged(EdsLoadBalancer2.java:265)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.notifyWatcher(ClientXdsClient.java:856)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.onData(ClientXdsClient.java:815)
        at io.grpc.xds.ClientXdsClient.handleEdsResponse(ClientXdsClient.java:495)
        at io.grpc.xds.AbstractXdsClient$AbstractAdsStream.handleRpcResponse(AbstractXdsClient.java:493)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1$1.run(AbstractXdsClient.java:576)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:568)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:565)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:465)
        at io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:448)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:716)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:701)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
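
The failure mode described above can be reproduced in miniature. The following is only a sketch under stated assumptions: `SketchHelper`, the plain task queue, and the exception message are hypothetical stand-ins, not the actual `ManagedChannelImpl` or `SynchronizationContext` code. It shows a deferred address update running after shutdown and being rejected at creation time, rather than failing later with a NullPointerException.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Simplified stand-ins for illustration only; none of these are grpc-java classes. */
public class LateSubchannelCreationSketch {

  /** Hypothetical helper that refuses to create subchannels once shutdown has begun. */
  static final class SketchHelper {
    private boolean terminating;

    void shutdown() {
      terminating = true;
    }

    String createSubchannel(String address) {
      if (terminating) {
        // Fail fast so the misbehaving balancer is visible at the call site.
        throw new IllegalStateException("Channel is being terminated");
      }
      return "subchannel(" + address + ")";
    }
  }

  /** Runs a queued address update after shutdown and reports what happened. */
  static String runDelayedUpdateAfterShutdown() {
    SketchHelper helper = new SketchHelper();
    Queue<Runnable> pending = new ArrayDeque<>();
    String[] outcome = new String[1];

    // The parent policy defers address propagation instead of running it inline.
    pending.add(() -> outcome[0] = helper.createSubchannel("10.0.0.1:50051"));

    // The channel (and with it the LB tree) shuts down before the task runs.
    helper.shutdown();

    try {
      while (!pending.isEmpty()) {
        pending.poll().run();
      }
    } catch (IllegalStateException e) {
      return "rejected: " + e.getMessage();
    }
    return "created: " + outcome[0];
  }

  public static void main(String[] args) {
    System.out.println(runDelayedUpdateAfterShutdown());
  }
}
```

With the check in place, the stale task surfaces as an immediate, descriptive exception that points at the balancer which queued it.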


ejona86

This change is pretty safe for the createSubchannel(CreateSubchannelArgs) method. But it is racy for createSubchannel(List<EquivalentAddressGroup>, Attributes). You might investigate when we deprecated the old method; it might be an appropriate time to delete it.

voidzcy · 2021-01-08T23:57:30Z

> This change is pretty safe for the createSubchannel(CreateSubchannelArgs) method. But it is racy for createSubchannel(List<EquivalentAddressGroup>, Attributes). You might investigate when we deprecated the old method; it might be an appropriate time to delete it.

It doesn't seem to be racy, as terminating is volatile. But sure, I'm sending out #7793 to delete the old API (along with other ones); they've been deprecated for a long time.

ejona86 · 2021-01-09T00:02:51Z

> It doesn't seem to be racy, as terminating is volatile.

More like, "there is not a way for the caller of the API to use the method correctly." Yes, it will not be a data race. But it will still be a race because it will non-deterministically throw an exception.
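
The distinction between a data race and a race in the caller's logic can be sketched as follows. This is a minimal illustration, assuming the caller may run on a different thread from the one that shuts the channel down; `SketchChannel` and `guardedCreate` are hypothetical names, and the `betweenCheckAndAct` hook simulates the other thread's timing deterministically instead of using real threads.

```java
/** Illustration only; SketchChannel is a hypothetical stand-in, not a grpc-java class. */
public class CheckThenActRaceSketch {

  static final class SketchChannel {
    // volatile makes the flag visible across threads, so there is no data race.
    private volatile boolean terminating;

    void shutdown() {
      terminating = true;
    }

    boolean isTerminating() {
      return terminating;
    }

    void createSubchannel() {
      if (terminating) {
        throw new IllegalStateException("Channel is being terminated");
      }
    }
  }

  /**
   * The caller checks the state first, then acts. The hook stands in for whatever
   * another thread might do in the gap between the check and the call.
   */
  static String guardedCreate(SketchChannel channel, Runnable betweenCheckAndAct) {
    if (channel.isTerminating()) {
      return "skipped";
    }
    betweenCheckAndAct.run();
    try {
      channel.createSubchannel();
      return "created";
    } catch (IllegalStateException e) {
      return "threw";
    }
  }

  public static void main(String[] args) {
    SketchChannel quiet = new SketchChannel();
    System.out.println(guardedCreate(quiet, () -> { }));

    SketchChannel racing = new SketchChannel();
    // Shutdown lands after the caller's check but before the create call.
    System.out.println(guardedCreate(racing, racing::shutdown));
  }
}
```

Even a careful caller cannot close the gap between the check and the call, which is why the outcome depends on timing unless both run in the same serialized context.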


voidzcy · 2021-01-12T20:00:06Z

#7793 has been merged. PTAL.

ejona86

It would be really helpful to include additional details about why we are making this change. Like referencing back to the exception we saw. Why we make a change is generally at least as important as what change is being made.

core/src/main/java/io/grpc/internal/ManagedChannelImpl.java

ejona86 · 2021-01-12T21:46:15Z

It seems we should also simplify SubchannelImpl.

voidzcy · 2021-01-12T23:28:44Z

> It seems we should also simplify SubchannelImpl.

Hmm... That kind of cleanup could (should) have been done in #7793. Let's do that in a separate PR and make this change easily noticeable.

ejona86 · 2021-01-12T23:50:45Z

> Hmm... That kind of cleanup could (should) have been done in #7793.

How do you figure? It seems directly related to us changing terminated to terminating here. We can put a similar checkstate to prohibit calling start if terminating == true.


voidzcy · 2021-01-13T01:23:00Z

> > Hmm... That kind of cleanup could (should) have been done in #7793.
>
> How do you figure? It seems directly related to us changing terminated to terminating here. We can put a similar checkstate to prohibit calling start if terminating == true.

Oh you are talking about prohibiting start() (I thought you were saying what's being mentioned in the TODO comment to remove volatile).

Sure, we do the same for Subchannel's start(). This further syncs subchannel lifecycle with load balancer state.
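
A guard of the kind discussed here might look like the sketch below. It is not the actual `SubchannelImpl` code: `SketchSubchannel`, the local `checkState` helper, and the messages are assumptions made for illustration.

```java
import java.util.function.BooleanSupplier;

/** Illustration only; not the real SubchannelImpl. */
public class SubchannelStartSketch {

  static final class SketchSubchannel {
    private final BooleanSupplier channelTerminating;
    private boolean started;

    SketchSubchannel(BooleanSupplier channelTerminating) {
      this.channelTerminating = channelTerminating;
    }

    void start() {
      checkState(!started, "already started");
      // Same restriction as subchannel creation: no start once shutdown has begun.
      checkState(!channelTerminating.getAsBoolean(), "Channel is being terminated");
      started = true;
    }
  }

  /** Local stand-in for a precondition helper such as Guava's Preconditions.checkState. */
  static void checkState(boolean ok, String message) {
    if (!ok) {
      throw new IllegalStateException(message);
    }
  }

  /** Attempts to start a fresh subchannel and reports the outcome. */
  static String tryStart(boolean terminating) {
    SketchSubchannel subchannel = new SketchSubchannel(() -> terminating);
    try {
      subchannel.start();
      return "started";
    } catch (IllegalStateException e) {
      return "rejected: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(tryStart(false));
    System.out.println(tryStart(true));
  }
}
```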

voidzcy · 2021-01-13T01:25:33Z

I will create a separate PR for the cleanups left in #5015 (comment); then we can close that issue.

Throw on subchannel creation if the channel is shutting down and the delayed transport has terminated (i.e., all retry calls have finished). This forces load balancer implementations to avoid creating subchannels after being shut down.

Make the timing restriction of subchannel creation stricter. Throw at…

d6bf9e0

… subchannel creation if the channel is shutting down and the delayed transport has terminated (i.e., all retry calls have finished).

voidzcy requested review from ejona86 and dapengzhang0 January 8, 2021 01:32

ejona86 reviewed Jan 8, 2021


voidzcy mentioned this pull request Jan 8, 2021

api: delete LoadBalancer.Helper APIs that had been deprecated for a long time #7793

Merged

Merge branch 'master' of github.com:grpc/grpc-java into bugfix/enforc…

0ba1444

…e_stricter_subchannel_creation_timing

ejona86 approved these changes Jan 12, 2021



Minor cleanup.

666aebd

Simplify SubchannelImpl's start by removing handlings that should nev…

b775db0

…er happen.

voidzcy merged commit 389c540 into grpc:master Jan 14, 2021

voidzcy mentioned this pull request Jan 14, 2021

core: further clean up leftovers in ManagedChannelImpl's LoadBalancer.Helper and Subchannel implementations #7806

Merged

github-actions bot locked as resolved and limited conversation to collaborators Jun 3, 2021
