client_channel: allow LB policy to communicate update errors to resolver #30809
Conversation
Automated fix for refs/heads/lb_feedback_to_resolver
…initial resolution
Main general comment: can we simplify this change by making `ReportResult` return an `absl::Status`, and have resolvers read that? Since `result_health_callback` is only ever called inline from `ReportResult`, I don't think we would lose anything.
Even though the code in this PR always invokes the callback synchronously, we'll need it to be async in the future to support some of the use-cases where the LB policies don't know whether or not the update is good at the time that `UpdateLocked()` returns.
Makes sense. For these async cases, would it be simpler to handle them by adding another method to the …? The complexity I'm thinking about is, by storing …
The callback is created and owned by the resolver, so it's an internally-created ref, which is exactly the intended usage pattern for …
I'm a little confused. When the client channel invokes the callback asynchronously from …
The channel isn't holding a ref on the resolver; it's holding a callback, and the callback is holding a ref. The callback is opaque from the perspective of the channel; it's created by the resolver, and only the resolver knows that it's holding a ref. To say this another way, I think there are two conceptually independent things going on here: the channel holding an opaque callback, and the resolver holding an internal ref via that callback.

So I really don't think there's anything abnormal here; I think both of these uses are exactly the normal way of handling things from the perspective of each object. The fact that the object holding the callback (and thus transitively holding an internal ref to the resolver) happens to be the same object holding the external ref on the resolver is essentially just an implementation detail; I don't think it actually changes any of the principles here. In addition, note that there are functional reasons why we want the callback to actually be associated with an individual resolver result. There are some potential future use-cases where the resolver may be getting data from a control plane and needs to provide feedback to the control plane as to whether that individual result was accepted or not.
```cpp
    result_handler_->ReportResult(std::move(result));
  }
  Unref(DEBUG_LOCATION, "OnRequestComplete");
}

void PollingResolver::GetResultStatus(absl::Status status) {
```
Do we have a guarantee that an LB policy's `UpdateLocked` method won't be called between the time that `ReportResult` is called and `GetResultStatus` is called? (From looking at the code, I think this won't happen, but I'm not sure if we have a more formal guarantee.)
I'm not sure I understand this question. The LB policy's `UpdateLocked()` method will always be called between when the resolver calls `ReportResult()` and when the result-health callback is invoked. It has to be that way, because the result of the `UpdateLocked()` call is what determines the status to be passed to the result-health callback.
Sorry! I meant to ask: what's our guarantee that an LB policy's `UpdateLocked` method won't cause `RequestReResolutionLocked` to be invoked synchronously, i.e. before the health-check callback is invoked?
Ah, okay -- good question!

It looks like this PR did actually add such a case in pick_first. I could change pick_first to avoid that, but in the future, when we have more async cases for this, it probably won't be possible to do so. So instead, I've changed `PollingResolver` to handle that case properly, basically by deferring the re-resolution until after the result-health callback has been invoked.

Thanks for catching this!
Revert "Revert "client_channel: allow LB policy to communicate update errors to resolver (grpc#30809)" (grpc#30970)". This reverts commit 1648bc0.
@markdroth Back-porting this change appears to have solved our issue!

Super, I'm glad to hear it!
This fixes some TODOs added in #30809 for cases where LB policies lazily create child policies. Credit to @ejona86 for pointing out that simply calling `RequestReresolution()` in this case will ultimately result in the exponential backoff behavior we want.

This also adds some missing plumbing in code added as part of the dualstack work (in the endpoint_list library and in ring_hash) to propagate non-OK statuses from `UpdateLocked()`. When I first made the dualstack changes, I didn't bother with this plumbing, because there are no cases today where these code-paths will actually see a non-OK status (`EndpointAddresses` won't allow creating an endpoint with 0 addresses, and that's the only case where pick_first will return a non-OK status), and I wasn't sure whether we would stick with the approach of returning status from `UpdateLocked()`, due to the aforementioned lazy-creation case. However, now that we have a good solution for the lazy-creation case, I've added the necessary plumbing, just so that we don't have a bug if pick_first winds up returning non-OK status in some other case in the future.

I have not bothered to fix the propagation in the grpclb policy, since that looked like it would be slightly more work than it's really worth at this point.

Closes #36463

COPYBARA_INTEGRATE_REVIEW=#36463 from markdroth:lb_reresolve_for_lazy_child_creation 49043b2

PiperOrigin-RevId: 629755047
Fixes #30803.
This fixes a bug where, when the resolver returns an empty address list or an error for addresses on the first resolution attempt, we put the channel in TRANSIENT_FAILURE, but we never do any re-resolutions, so the channel never recovers from that state, even when valid DNS data becomes available.
@ejona86 and @dfawley and I had already agreed on the desired way to handle problems like this, but I hadn't implemented it yet, since I (mistakenly) believed that there was no case in which we'd actually encounter this problem in practice. So this PR fills in that missing piece of the client_channel architecture, which is to provide a mechanism whereby the LB policy can return some feedback to the resolver to indicate whether the resolver result was accepted. The resolver can then use this feedback to determine whether to go into backoff.
The specific changes here are:

- The resolver result now contains an optional `result_health_callback`. If non-null, it will be invoked by the channel with the resulting status of the result from the LB policy.
- The LB policy's `UpdateLocked()` method now returns a status, which can be non-OK in the case that the LB policy rejects the update. All of the leaf LB policies have been changed to return non-OK when they get an empty address list or an error for the addresses. (Note: There are some edge cases where a policy may not know for sure that there is a problem with the config until after `UpdateLocked()` has returned, so there will need to be some follow-up changes to address those cases.)
- In `PollingResolver`, we return a `result_health_callback` with every result, and we use that to determine whether to schedule another attempt or reset the backoff state.

@daichij, can you please try this out and see if it fixes the problem for you?