core: make subchannel creation timing restriction stricter #7790

Merged


voidzcy
Contributor

@voidzcy voidzcy commented Jan 8, 2021

Throw on subchannel creation if the channel is shutting down and the delayed transport has terminated (i.e., all retry calls have finished).

This enforces a tighter timing restriction on load balancer implementations, ensuring that no subchannel is created after the balancer itself has been shut down. A load balancer implementation may not always check its own state before creating new subchannels, and the two can fall out of sync when subchannel creation is queued and executed later. We ran into the exceptions below in xDS, where PriorityLoadBalancer enqueues address propagation to its child policy (to avoid reentrancy from updateBalancingState() upcalls) and the Channel (and therefore the LB tree) is shut down before the queued task executes. Although we've fixed the existing LB implementations to not delay address propagation (and to enqueue upcalls instead if necessary), checking for consistency in LoadBalancer.Helper and making it throw if the load balancer has been shut down helps such bugs in load balancer implementations surface earlier.

~/grpc-java-1.34.1/examples/example-xds$ ./build/install/example-xds/bin/xds-hello-world-client "world"    xds:///helloworld-gce
Dec 29, 2020 6:42:40 PM io.grpc.examples.helloworldxds.XdsHelloWorldClient greet
INFO: Will try to greet world ...
Dec 29, 2020 6:42:43 PM io.grpc.examples.helloworldxds.XdsHelloWorldClient greet
INFO: Greeting: Hello world, from grpc-td-mig-us-central1-682b
Dec 29, 2020 6:42:43 PM io.grpc.internal.ManagedChannelImpl$2 uncaughtException
SEVERE: [Channel<1>: (xds:///helloworld-gce)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException
        at io.grpc.internal.ManagedChannelImpl$SubchannelImpl.requestConnection(ManagedChannelImpl.java:1914)
        at io.grpc.util.ForwardingSubchannel.requestConnection(ForwardingSubchannel.java:49)
        at io.grpc.util.RoundRobinLoadBalancer.handleResolvedAddresses(RoundRobinLoadBalancer.java:113)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HealthCheckingLoadBalancer.handleResolvedAddresses(HealthCheckingLoadBalancerFactory.java:189)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.xds.LrsLoadBalancer.handleResolvedAddresses(LrsLoadBalancer.java:87)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.xds.WeightedTargetLoadBalancer.handleResolvedAddresses(WeightedTargetLoadBalancer.java:88)
        at io.grpc.util.ForwardingLoadBalancer.handleResolvedAddresses(ForwardingLoadBalancer.java:46)
        at io.grpc.xds.PriorityLoadBalancer$ChildLbState$1.run(PriorityLoadBalancer.java:263)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.EdsLoadBalancer2$EdsLbState$ChildLbState.onChanged(EdsLoadBalancer2.java:265)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.notifyWatcher(ClientXdsClient.java:856)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.onData(ClientXdsClient.java:815)
        at io.grpc.xds.ClientXdsClient.handleEdsResponse(ClientXdsClient.java:495)
        at io.grpc.xds.AbstractXdsClient$AbstractAdsStream.handleRpcResponse(AbstractXdsClient.java:493)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1$1.run(AbstractXdsClient.java:576)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:568)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:565)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:465)
        at io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:448)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:716)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:701)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Dec 29, 2020 6:42:43 PM io.grpc.internal.ManagedChannelImpl$2 uncaughtException
SEVERE: [Channel<1>: (xds:///helloworld-gce)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException
        at io.grpc.internal.ManagedChannelImpl$SubchannelImpl.getAllAddresses(ManagedChannelImpl.java:1921)
        at io.grpc.util.ForwardingSubchannel.getAllAddresses(ForwardingSubchannel.java:54)
        at io.grpc.LoadBalancer$Subchannel.getAddresses(LoadBalancer.java:1263)
        at io.grpc.util.RoundRobinLoadBalancer.processSubchannelState(RoundRobinLoadBalancer.java:139)
        at io.grpc.util.RoundRobinLoadBalancer.access$000(RoundRobinLoadBalancer.java:53)
        at io.grpc.util.RoundRobinLoadBalancer$1.onSubchannelState(RoundRobinLoadBalancer.java:109)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HealthCheckState.gotoState(HealthCheckingLoadBalancerFactory.java:332)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HealthCheckState.adjustHealthCheck(HealthCheckingLoadBalancerFactory.java:295)
        at io.grpc.services.HealthCheckingLoadBalancerFactory$HealthCheckState.onSubchannelState(HealthCheckingLoadBalancerFactory.java:275)
        at io.grpc.internal.ManagedChannelImpl$SubchannelImpl$1.run(ManagedChannelImpl.java:1771)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.EdsLoadBalancer2$EdsLbState$ChildLbState.onChanged(EdsLoadBalancer2.java:265)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.notifyWatcher(ClientXdsClient.java:856)
        at io.grpc.xds.ClientXdsClient$ResourceSubscriber.onData(ClientXdsClient.java:815)
        at io.grpc.xds.ClientXdsClient.handleEdsResponse(ClientXdsClient.java:495)
        at io.grpc.xds.AbstractXdsClient$AbstractAdsStream.handleRpcResponse(AbstractXdsClient.java:493)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1$1.run(AbstractXdsClient.java:576)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:568)
        at io.grpc.xds.AbstractXdsClient$AdsStreamV2$1.onNext(AbstractXdsClient.java:565)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:465)
        at io.grpc.internal.DelayedClientCall$DelayedListener.onMessage(DelayedClientCall.java:448)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:716)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:701)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
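
The ordering problem described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyHelper`, `runScenario`, the address string, and the exception message are all hypothetical names, not grpc-java classes or the actual code in this PR. It shows a queued task that tries to create a subchannel after shutdown has begun, and a helper that rejects it right away instead of failing later with a NullPointerException.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model only; none of these names are grpc-java classes. */
public class SubchannelTimingSketch {
  /** Hypothetical stand-in for the channel-side helper handed to a load balancer. */
  static final class ToyHelper {
    private boolean terminating;          // set once channel shutdown has begun
    private final List<String> subchannels = new ArrayList<>();

    void shutdown() {
      terminating = true;
    }

    String createSubchannel(String address) {
      // The stricter check: fail fast rather than hand out an unusable subchannel.
      if (terminating) {
        throw new IllegalStateException("Channel is being terminated");
      }
      subchannels.add(address);
      return address;
    }
  }

  /** Work is queued, the channel shuts down, and only then does the queue drain. */
  static String runScenario() {
    ToyHelper helper = new ToyHelper();
    Queue<Runnable> queue = new ArrayDeque<>();
    // A balancer defers address propagation instead of handling it inline.
    queue.add(() -> helper.createSubchannel("10.0.0.1:443"));
    helper.shutdown();                    // shutdown happens before the task runs
    try {
      while (!queue.isEmpty()) {
        queue.poll().run();
      }
      return "created";
    } catch (IllegalStateException e) {
      return "rejected: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(runScenario());
  }
}
```

In this model the failure points directly at the late createSubchannel call, which is the "surface earlier" behavior the description is after.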

Member

@ejona86 ejona86 left a comment

This change is pretty safe for the createSubchannel(CreateSubchannelArgs) method. But it is racy for createSubchannel(List<EquivalentAddressGroup>, Attributes). You might investigate when we deprecated the old method; it might be an appropriate time to delete it.

@voidzcy
Contributor Author

voidzcy commented Jan 8, 2021

> This change is pretty safe for the createSubchannel(CreateSubchannelArgs) method. But it is racy for createSubchannel(List<EquivalentAddressGroup>, Attributes). You might investigate when we deprecated the old method; it might be an appropriate time to delete it.

It doesn't seem to be racy, as terminating is volatile. But sure, I'm sending out #7793 to delete the old API (along with other ones); they've been deprecated for a long time.

@ejona86
Member

ejona86 commented Jan 9, 2021

> It doesn't seem to be racy, as terminating is volatile.

More like, "there is not a way for the caller of the API to use the method correctly." Yes, it will not be a data race. But it will still be a race because it will non-deterministically throw an exception.
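
A minimal sketch of that distinction, assuming a caller that tries to be careful by checking the state first. Everything here is hypothetical (`ToyChannel`, `carefulCreate`, `demo`); the `Runnable` hook stands in for whatever another thread might do between the check and the call, so the interleaving can be shown deterministically. It also assumes, purely for illustration, that the caller has no way to exclude a concurrent shutdown. The `volatile` flag rules out a data race, but the check-then-act gap remains.

```java
/** Toy model only; none of these names are grpc-java classes. */
public class CheckThenActSketch {
  static final class ToyChannel {
    private volatile boolean terminating; // visible across threads, so no data race

    boolean isTerminating() {
      return terminating;
    }

    void shutdown() {
      terminating = true;
    }

    void createSubchannel() {
      if (terminating) {
        throw new IllegalStateException("Channel is being terminated");
      }
    }
  }

  /** Checks first, then creates; the hook models activity in the gap between the two. */
  static String carefulCreate(ToyChannel channel, Runnable betweenCheckAndAct) {
    if (channel.isTerminating()) {
      return "skipped";                   // the check saw shutdown and backed off
    }
    betweenCheckAndAct.run();             // the gap the caller cannot close
    try {
      channel.createSubchannel();
      return "created";
    } catch (IllegalStateException e) {
      return "threw";                     // failed despite the earlier check
    }
  }

  /** Runs one case: with or without a shutdown landing inside the gap. */
  static String demo(boolean shutdownInGap) {
    ToyChannel channel = new ToyChannel();
    Runnable gap = shutdownInGap ? channel::shutdown : () -> { };
    return carefulCreate(channel, gap);
  }

  public static void main(String[] args) {
    System.out.println(demo(false));      // prints "created"
    System.out.println(demo(true));       // prints "threw"
  }
}
```

Whether the exception appears depends only on timing the caller does not control, which is the sense in which the call is a race without being a data race.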

@voidzcy
Contributor Author

voidzcy commented Jan 12, 2021

#7793 has been merged. PTAL.

Member

@ejona86 ejona86 left a comment

It would be really helpful to include additional details about why we are making this change. Like referencing back to the exception we saw. Why we make a change is generally at least as important as what change is being made.

@ejona86
Member

ejona86 commented Jan 12, 2021

It seems we should also simplify SubchannelImpl.

@voidzcy
Contributor Author

voidzcy commented Jan 12, 2021

> It seems we should also simplify SubchannelImpl.

Hmm... That kind of cleanup could (should) have been done in #7793. Let's do that in a separate PR and make this change easily noticeable.

@ejona86
Member

ejona86 commented Jan 12, 2021

> Hmm... That kind of cleanup could (should) have been done in #7793.

How do you figure? It seems directly related to us changing terminated to terminating here. We can put a similar checkstate to prohibit calling start if terminating == true.

@voidzcy
Contributor Author

voidzcy commented Jan 13, 2021

> > Hmm... That kind of cleanup could (should) have been done in #7793.
>
> How do you figure? It seems directly related to us changing terminated to terminating here. We can put a similar checkstate to prohibit calling start if terminating == true.

Oh, you are talking about prohibiting start() (I thought you meant what's mentioned in the TODO comment about removing volatile).

Sure, we do the same for Subchannel's start(). This further syncs subchannel lifecycle with load balancer state.

@voidzcy
Contributor Author

voidzcy commented Jan 13, 2021

I will create a separate PR for the cleanups left in #5015 (comment), and then we can close that issue.

@voidzcy voidzcy merged commit 389c540 into grpc:master Jan 14, 2021
dfawley pushed a commit to dfawley/grpc-java that referenced this pull request Jan 15, 2021
Throw on subchannel creation if the channel is shutting down and the delayed transport has terminated (i.e., all retry calls have finished). This forces load balancer implementations to avoid creating subchannels after being shut down.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 3, 2021