
client_channel: allow LB policy to communicate update errors to resolver #30809

Merged — 16 commits merged into grpc:master from lb_feedback_to_resolver on Sep 13, 2022

Conversation

markdroth (Member)

Fixes #30803.

This fixes a bug where, when the resolver returns an empty address list or an error for addresses on the first resolution attempt, we put the channel in TRANSIENT_FAILURE, but we never do any re-resolutions, so the channel never recovers from that state, even when valid DNS data becomes available.

@ejona86 and @dfawley and I had already agreed on the desired way to handle problems like this, but I hadn't implemented it yet, since I (mistakenly) believed that there was no case in which we'd actually encounter this problem in practice. So this PR fills in that missing piece of the client_channel architecture, which is to provide a mechanism whereby the LB policy can return some feedback to the resolver to indicate whether the resolver result was accepted. The resolver can then use this feedback to determine whether to go into backoff.

The specific changes here are:

  • The resolver result can now include an optional result_health_callback. If non-null, it will be invoked by the channel with the status returned by the LB policy for that result.
  • The LB policy's UpdateLocked() method now returns a status, which can be non-OK in the case that the LB policy rejects the update. All of the leaf LB policies have been changed to return non-OK when they get an empty address list or an error for the addresses. (Note: There are some edge cases where a policy may not know for sure that there is a problem with the config until after UpdateLocked() has returned, so there will need to be some follow-up changes to address those cases.)
  • In PollingResolver, we return a result_health_callback with every result, and we use that to determine whether to schedule another attempt or reset the backoff state.
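The feedback loop described in these bullets can be sketched as follows. Everything here (the bare `Status` struct, `ResolveOnce`, etc.) is a simplified stand-in for illustration, not gRPC's actual interfaces:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for absl::Status.
struct Status {
  bool ok;
  std::string message;
};

struct ResolverResult {
  std::vector<std::string> addresses;
  // Optional: if set, the channel invokes it with the LB policy's verdict.
  std::function<void(Status)> result_health_callback;
};

// Stand-in for a leaf LB policy: rejects an empty address list.
Status UpdateLocked(const ResolverResult& result) {
  if (result.addresses.empty()) return {false, "empty address list"};
  return {true, ""};
}

// Stand-in for the channel: applies the update, then reports its health.
void ReportResult(ResolverResult result) {
  Status status = UpdateLocked(result);
  if (result.result_health_callback) result.result_health_callback(status);
}

// Stand-in for PollingResolver: returns true if this result should send
// the resolver into backoff (i.e. the LB policy rejected it).
bool ResolveOnce(std::vector<std::string> addresses) {
  bool backoff = false;
  ResolverResult result;
  result.addresses = std::move(addresses);
  result.result_health_callback = [&backoff](Status s) { backoff = !s.ok; };
  ReportResult(std::move(result));
  return backoff;
}
```

In this toy model, an empty address list is rejected by the policy, the non-OK status flows back through the callback, and the resolver knows to go into backoff rather than staying silent forever.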

@daichij, can you please try this out and see if it fixes the problem for you?

@apolcyn (Contributor) left a comment

Main general comment: can we simplify this change by making ReportResult return an absl::Status, and have resolvers read that?

Since result_health_callback is only ever called inline from ReportResult, I don't think we would lose anything.

@markdroth (Member, Author)

Even though the code in this PR always invokes the callback synchronously, we'll need it to be async in the future to support some of the use-cases where the LB policies don't know whether or not the update is good at the time that UpdateLocked() is called. That's why the API is designed this way.

@apolcyn (Contributor) commented Sep 8, 2022

> Even though the code in this PR always invokes the callback synchronously, we'll need it to be async in the future to support some of the use-cases where the LB policies don't know whether or not the update is good at the time that UpdateLocked() is called. That's why the API is designed this way.

Makes sense. For these async cases, would it be simpler to handle them by adding another method to the Resolver API?

The complexity I'm thinking about is that by storing result_health_callback in the client channel, we'll be adding another external ref to resolver_, which seems a little abnormal for OrphanablePtr usage.

@markdroth (Member, Author)

The callback is created and owned by the resolver, so it's an internally-created ref, which is exactly the intended usage pattern for InternallyRefCounted<> objects like the resolver.

@apolcyn (Contributor) commented Sep 8, 2022

> The callback is created and owned by the resolver, so it's an internally-created ref, which is exactly the intended usage pattern for InternallyRefCounted<> objects like the resolver.

I'm a little confused. When the client channel invokes the callback asynchronously from ReportResult, it will be holding a second ref on the Resolver via the callback. So, for example, resolver_.reset() will no longer be enough to ensure eventual destruction; the client channel will also need to clear its result_health_callback.

@markdroth (Member, Author)

The channel isn't holding a ref on the resolver; it's holding a callback, and the callback is holding a ref. The callback is opaque from the perspective of the channel; it's created by the resolver, and only the resolver knows that it's holding a ref.

To say this another way, I think there are two conceptually independent things going on here:

  1. The resolver is expecting a callback and needs the callback to hold a ref to itself to ensure that it is not destroyed before the callback runs. This is exactly the use-case that InternallyRefCounted<> was designed for. (Internally, see go/grpc-c-core-ref-counted-types#slide=id.g2fcb01e5a8_0_91; this is case 1 as described there.)
  2. The channel is holding a callback from the resolver. You're right that the channel needs to be sure to eventually delete this callback, but that's true of basically any callback ever taken by any piece of code. From the channel's perspective, the callback is opaque; it has no idea that the callback is holding a ref to the resolver, and it does not care.

So I really don't think there's anything abnormal here; I think both of these uses are exactly the normal way of handling things from the perspective of each object. The fact that the object holding the callback (and thus transitively holding an internal ref to the resolver) happens to be the same object holding the external ref on the resolver is essentially just an implementation detail; I don't think it actually changes any of the principles here.

In addition, note that there are functional reasons why we want the callback to actually be associated with an individual resolver result. There are some potential future use-cases where the resolver may be getting data from a control plane and needs to provide feedback to the control plane as to whether that individual result was accepted or not.
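The two-refs situation being debated can be illustrated with a small sketch that uses std::shared_ptr/std::weak_ptr in place of gRPC's InternallyRefCounted<>/Orphanable machinery. All names here are hypothetical stand-ins, not actual gRPC code:

```cpp
#include <functional>
#include <memory>

struct Resolver : std::enable_shared_from_this<Resolver> {
  int results_acked = 0;
  // The resolver creates the callback and bakes a self-ref into it, so it
  // stays alive until the callback is run and released, even if the
  // channel's "external ref" is dropped first.
  std::function<void(bool)> MakeHealthCallback() {
    auto self = shared_from_this();  // internally-created ref
    return [self](bool ok) { if (ok) self->results_acked++; };
  }
};

// Returns 3 iff the resolver was still alive after the channel dropped
// its own ref, and was destroyed once the callback itself was released.
int Demo() {
  std::function<void(bool)> cb;
  std::weak_ptr<Resolver> weak;
  {
    auto resolver = std::make_shared<Resolver>();  // channel's external ref
    weak = resolver;
    cb = resolver->MakeHealthCallback();
  }  // channel drops its ref; resolver survives via the callback's ref
  bool alive_before = !weak.expired();
  cb(true);      // opaque to the channel; runs against a live resolver
  cb = nullptr;  // channel deletes the callback -> last ref released
  bool alive_after = !weak.expired();
  return (alive_before ? 1 : 0) + (alive_after ? 0 : 2);
}
```

The point of the sketch: the channel never knowingly holds a second ref; it holds an opaque closure, and releasing that closure is what releases the internal ref.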

        result_handler_->ReportResult(std::move(result));
      }
      Unref(DEBUG_LOCATION, "OnRequestComplete");
    }

    void PollingResolver::GetResultStatus(absl::Status status) {
Contributor:

Do we have a guarantee that an LB policy's UpdateLocked method won't be called between the time that ReportResult is called and GetResultStatus is called?

(from looking at the code, I think this won't happen, but I'm not sure if we have a more formal guarantee)

Member Author:

I'm not sure I understand this question. The LB policy's UpdateLocked() method will always be called between when the resolver calls ReportResult() and when the result-health callback is invoked. It has to be that way, because the result of the UpdateLocked() call is what determines the status to be passed to the result-health callback.

Contributor:

Sorry! I meant to ask: what's our guarantee that an LB policy's UpdateLocked method won't cause RequestReResolutionLocked to be invoked synchronously, i.e. before the result-health callback is invoked?

Member Author:

Ah, okay -- good question!

It looks like this PR did actually add such a case in pick_first. I could change pick_first to avoid that, but in the future, when we have more async cases for this, it probably won't be possible to do so. So instead, I've changed PollingResolver to handle that case properly, basically by deferring the re-resolution until after the result-health callback has been invoked.

Thanks for catching this!
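The deferral described above can be modeled roughly like this. This is a toy state machine with invented names, not the actual PollingResolver code: a re-resolution request that arrives while a result's status is still pending is remembered and acted on only after the result-health status is delivered.

```cpp
// Hypothetical sketch of deferring a synchronous re-resolution request
// until after the result-health callback has run.
struct PollingResolverSketch {
  bool result_status_pending = false;
  bool reresolution_requested = false;
  int attempts_started = 0;

  // Resolver hands a result to the channel; feedback is now pending.
  void ReportResult() { result_status_pending = true; }

  // May be called synchronously from inside the LB policy's update.
  void RequestReResolutionLocked() {
    if (result_status_pending) {
      reresolution_requested = true;  // defer until feedback arrives
      return;
    }
    StartAttempt();
  }

  // The result-health callback: feedback has arrived, so any deferred
  // re-resolution request can now be honored.
  void GetResultStatus() {
    result_status_pending = false;
    if (reresolution_requested) {
      reresolution_requested = false;
      StartAttempt();
    }
  }

  void StartAttempt() { ++attempts_started; }
};
```

Under this model, a re-resolution requested mid-update never races ahead of the health feedback; it starts exactly one new attempt, after the status has been delivered.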

@markdroth markdroth merged commit 9ff943b into grpc:master Sep 13, 2022
@markdroth markdroth deleted the lb_feedback_to_resolver branch September 13, 2022 00:17
gnossen added a commit that referenced this pull request Sep 13, 2022
ctiller pushed a commit that referenced this pull request Sep 13, 2022
@copybara-service copybara-service bot added the "imported" label (Specifies if the PR has been imported to the internal repository) Sep 14, 2022
markdroth added a commit to markdroth/grpc that referenced this pull request Sep 14, 2022
markdroth added a commit that referenced this pull request Sep 14, 2022
@chi-jams

@markdroth Back-porting this change appears to have solved our issue!

@markdroth (Member, Author)

Super, I'm glad to hear it!

copybara-service bot pushed a commit that referenced this pull request May 1, 2024
This fixes some TODOs added in #30809 for cases where LB policies lazily create child policies.  Credit to @ejona86 for pointing out that simply calling `RequestReresolution()` in this case will ultimately result in the exponential backoff behavior we want.

This also adds some missing plumbing in code added as part of the dualstack work (in the endpoint_list library and in ring_hash) to propagate non-OK statuses from `UpdateLocked()`.  When I first made the dualstack changes, I didn't bother with this plumbing, because there are no cases today where these code-paths will actually see a non-OK status (`EndpointAddresses` won't allow creating an endpoint with 0 addresses, and that's the only case where pick_first will return a non-OK status), and I wasn't sure if we would stick with the approach of returning status from `UpdateLocked()` due to the aforementioned lazy creation case.  However, now that we have a good solution for the lazy creation case, I've added the necessary plumbing, just so that we don't have a bug if in the future pick_first winds up returning non-OK status in some other case.
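The status plumbing this paragraph describes can be sketched as follows. `ChildPolicy` and `ParentUpdateLocked` are invented stand-ins (with a plain string as a status, empty meaning OK), not the actual endpoint_list or ring_hash code:

```cpp
#include <string>
#include <vector>

// Stand-in child policy: rejects an update with no addresses, the way
// pick_first rejects an empty address list.
struct ChildPolicy {
  std::string UpdateLocked(const std::vector<std::string>& addresses) {
    return addresses.empty() ? "empty address list" : "";
  }
};

// Stand-in parent policy: forwards per-child address lists to all
// children, and propagates the first non-OK child status to its caller
// (and ultimately back toward the resolver).
std::string ParentUpdateLocked(
    std::vector<ChildPolicy>& children,
    const std::vector<std::vector<std::string>>& per_child_addresses) {
  std::string first_error;
  for (size_t i = 0; i < children.size(); ++i) {
    std::string status = children[i].UpdateLocked(per_child_addresses[i]);
    // Keep updating the remaining children, but remember the first error.
    if (!status.empty() && first_error.empty()) first_error = status;
  }
  return first_error;
}
```

As the commit notes, today's code paths should never actually produce a non-OK status here; the plumbing exists so a future non-OK case in pick_first isn't silently swallowed.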

I have not bothered to fix the propagation in the grpclb policy, since that looked like it would be slightly more work than it's really worth at this point.

Closes #36463

COPYBARA_INTEGRATE_REVIEW=#36463 from markdroth:lb_reresolve_for_lazy_child_creation 49043b2
PiperOrigin-RevId: 629755047
Labels: bloat/low, imported (Specifies if the PR has been imported to the internal repository), lang/c++, lang/core, per-call-memory/neutral, per-channel-memory/neutral, release notes: yes (Indicates if PR needs to be in release notes)
Development

Successfully merging this pull request may close these issues.

Pick First load balancer does not retry to re-resolve if the first address resolution has 0 addresses
4 participants