Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rls: fix child lb leak when client channel is shutdown #8750

Merged
merged 8 commits into from
Jan 12, 2022

Conversation

dapengzhang0
Copy link
Member

@dapengzhang0 dapengzhang0 commented Dec 9, 2021

When client channel is shutting down, the RlsLoadBalancer is shutting down. However, the child loadbalancers of RlsLoadBalancer are not shut down. This is causing the issue b/209831670

This PR contains two commits and it will be easier to review separately. The 1st commit change the lifecycle of child lb to make it easier to manage. The 2nd commit is to shut down all child lbs when the parent lb is shut down.

@dapengzhang0
Copy link
Member Author

@sergiitk A gentle ping.

@sergiitk
Copy link
Member

sergiitk commented Jan 5, 2022

Per IRL discussion, this needs more testing/fixing, so the review is on hold.

@dapengzhang0
Copy link
Member Author

dapengzhang0 commented Jan 5, 2022

Per IRL discussion, this needs more testing/fixing, so the review is on hold.

Oh, I misunderstood you. I just meant it's higher priority than the other RLS PR. It's ready for review. The test was added (https://github.com/grpc/grpc-java/pull/8750/files#diff-14b2d9691a6f8fc60385a0bcf960a20b51b7415d9119e02e22d17cccc8103349R176).

Copy link
Member

@sergiitk sergiitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions

rls/src/main/java/io/grpc/rls/LbPolicyConfiguration.java Outdated Show resolved Hide resolved
rls/src/main/java/io/grpc/rls/LbPolicyConfiguration.java Outdated Show resolved Hide resolved
rls/src/main/java/io/grpc/rls/LinkedHashLruCache.java Outdated Show resolved Hide resolved
@@ -346,6 +392,7 @@ public ChildPolicyWrapper returnObject(Object object) {
long newCnt = refCnt.decrementAndGet();
checkState(newCnt != -1, "Cannot return never pooled childPolicyWrapper");
if (newCnt == 0) {
childPolicyWrapper.shutdown();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this possible that in concurrent environment one thread sets childPolicyWrapper = null, and another will call childPolicyWrapper.shutdown() causing NPE?

I'm not sure if AtomicLong volatility visibility guarantee is sufficient. IRRC because childPolicyWrapper = null comes after the write to the volatile variable, it's not guaranteed to flush immediately.
Should returnObject() be fully synchronized?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the concurrency issue in 3ae4e3d

new Runnable() {
@Override
public void run() {
lb.shutdown();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care if this operation competes before returning a ChildPolicyWrapper object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! There's a (pre-existing) concurrency issue with RefCountedChildPolicyWrapperFactory.createOrGet() and release(). As long as they become thread-safe, it doesn't matter when shutdown() operation completes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the concurrency issue in 3ae4e3d

rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java Outdated Show resolved Hide resolved
rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java Outdated Show resolved Hide resolved
Copy link
Member

@sergiitk sergiitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Plug those pesky leaks 💧 😄

@dapengzhang0
Copy link
Member Author

@sergiitk Thank you so much for the review. I know this PR as well as RLS is so complex and hard to read.

@dapengzhang0 dapengzhang0 merged commit 7a23fb2 into grpc:master Jan 12, 2022
@dapengzhang0 dapengzhang0 deleted the fix-rls-child-lb-leak branch January 12, 2022 22:58
@dapengzhang0 dapengzhang0 added the TODO:backport PR needs to be backported. Removed after backport complete label Jan 12, 2022
dapengzhang0 added a commit to dapengzhang0/grpc-java that referenced this pull request Jan 12, 2022
When client channel is shutting down, the RlsLoadBalancer is shutting down. However, the child loadbalancers of RlsLoadBalancer are not shut down. This is causing the issue b/209831670
@dapengzhang0 dapengzhang0 removed the TODO:backport PR needs to be backported. Removed after backport complete label Jan 12, 2022
dapengzhang0 added a commit that referenced this pull request Jan 14, 2022
When client channel is shutting down, the RlsLoadBalancer is shutting down. However, the child loadbalancers of RlsLoadBalancer are not shut down. This is causing the issue b/209831670
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants