
client_channel: allow LB policy to communicate update errors to resolver #30809

Merged — 16 commits merged into grpc:master from lb_feedback_to_resolver on Sep 13, 2022

Conversation

markdroth (Member)

Fixes #30803.

This fixes a bug where, when the resolver returns an empty address list or an error for addresses on the first resolution attempt, we put the channel in TRANSIENT_FAILURE, but we never do any re-resolutions, so the channel never recovers from that state, even when valid DNS data becomes available.

@ejona86 and @dfawley and I had already agreed on the desired way to handle problems like this, but I hadn't implemented it yet, since I (mistakenly) believed that there was no case in which we'd actually encounter this problem in practice. So this PR fills in that missing piece of the client_channel architecture, which is to provide a mechanism whereby the LB policy can return some feedback to the resolver to indicate whether the resolver result was accepted. The resolver can then use this feedback to determine whether to go into backoff.

The specific changes here are:

  • The resolver result can now include an optional result_health_callback. If non-null, it will be invoked by the channel with the status returned by the LB policy for that result.
  • The LB policy's UpdateLocked() method now returns a status, which can be non-OK in the case that the LB policy rejects the update. All of the leaf LB policies have been changed to return non-OK when they get an empty address list or an error for the addresses. (Note: There are some edge cases where a policy may not know for sure that there is a problem with the config until after UpdateLocked() has returned, so there will need to be some follow-up changes to address those cases.)
  • In PollingResolver, we return a result_health_callback with every result, and we use that to determine whether to schedule another attempt or reset the backoff state.
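The feedback loop described in these bullets can be sketched as follows. Everything here (the bare `Status` struct, `ResolveOnce`, etc.) is a simplified stand-in for illustration, not gRPC's actual interfaces:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for absl::Status.
struct Status {
  bool ok;
  std::string message;
};

struct ResolverResult {
  std::vector<std::string> addresses;
  // Optional: if set, the channel invokes it with the LB policy's verdict.
  std::function<void(Status)> result_health_callback;
};

// Stand-in for a leaf LB policy: rejects an empty address list.
Status UpdateLocked(const ResolverResult& result) {
  if (result.addresses.empty()) return {false, "empty address list"};
  return {true, ""};
}

// Stand-in for the channel: applies the update, then reports its health.
void ReportResult(ResolverResult result) {
  Status status = UpdateLocked(result);
  if (result.result_health_callback) result.result_health_callback(status);
}

// Stand-in for PollingResolver: returns true if this result should send
// the resolver into backoff (i.e. the LB policy rejected it).
bool ResolveOnce(std::vector<std::string> addresses) {
  bool backoff = false;
  ResolverResult result;
  result.addresses = std::move(addresses);
  result.result_health_callback = [&backoff](Status s) { backoff = !s.ok; };
  ReportResult(std::move(result));
  return backoff;
}
```

In this toy model, an empty address list is rejected by the policy, the non-OK status flows back through the callback, and the resolver knows to go into backoff rather than staying silent forever.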

@daichij, can you please try this out and see if it fixes the problem for you?

@apolcyn (Contributor) left a comment

Main general comment: can we simplify this change by making ReportResult return an absl::Status, and have resolvers read that?

Since result_health_callback is only ever called inline from ReportResult, I don't think we would lose anything.

@markdroth (Member, Author)

Even though the code in this PR always invokes the callback synchronously, we'll need it to be async in the future to support some of the use-cases where the LB policies don't know whether or not the update is good at the time that UpdateLocked() is called. That's why the API is designed this way.

@apolcyn (Contributor) commented Sep 8, 2022

> Even though the code in this PR always invokes the callback synchronously, we'll need it to be async in the future to support some of the use-cases where the LB policies don't know whether or not the update is good at the time that UpdateLocked() is called. That's why the API is designed this way.

Makes sense. For these async cases, would it be simpler to handle them by adding another method to the Resolver API?

The complexity I'm thinking about is that by storing result_health_callback in the client channel, we'll be adding another external ref to resolver_, which seems a little abnormal for OrphanablePtr usage.

@markdroth (Member, Author)

The callback is created and owned by the resolver, so it's an internally-created ref, which is exactly the intended usage pattern for InternallyRefCounted<> objects like the resolver.

@apolcyn (Contributor) commented Sep 8, 2022

> The callback is created and owned by the resolver, so it's an internally-created ref, which is exactly the intended usage pattern for InternallyRefCounted<> objects like the resolver.

I'm a little confused. When the client channel invokes the callback asynchronously from ReportResult, it will be holding a second ref on the Resolver via the callback. So, for example, resolver_.reset() will no longer be enough to ensure eventual destruction; the client channel will also need to clear its result_health_callback.

@markdroth (Member, Author)

The channel isn't holding a ref on the resolver; it's holding a callback, and the callback is holding a ref. The callback is opaque from the perspective of the channel; it's created by the resolver, and only the resolver knows that it's holding a ref.

To say this another way, I think there are two conceptually independent things going on here:

  1. The resolver is expecting a callback and needs the callback to hold a ref to itself to ensure that it is not destroyed before the callback runs. This is exactly the use-case that InternallyRefCounted<> was designed for. (Internally, see go/grpc-c-core-ref-counted-types#slide=id.g2fcb01e5a8_0_91; this is case 1 as described there.)
  2. The channel is holding a callback from the resolver. You're right that the channel needs to be sure to eventually delete this callback, but that's true of basically any callback ever taken by any piece of code. From the channel's perspective, the callback is opaque; it has no idea that the callback is holding a ref to the resolver, and it does not care.

So I really don't think there's anything abnormal here; I think both of these uses are exactly the normal way of handling things from the perspective of each object. The fact that the object holding the callback (and thus transitively holding an internal ref to the resolver) happens to be the same object holding the external ref on the resolver is essentially just an implementation detail; I don't think it actually changes any of the principles here.

In addition, note that there are functional reasons why we want the callback to actually be associated with an individual resolver result. There are some potential future use-cases where the resolver may be getting data from a control plane and needs to provide feedback to the control plane as to whether that individual result was accepted or not.
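The two-refs situation being debated can be illustrated with a small sketch that uses std::shared_ptr/std::weak_ptr in place of gRPC's InternallyRefCounted<>/Orphanable machinery. All names here are hypothetical stand-ins, not actual gRPC code:

```cpp
#include <functional>
#include <memory>

struct Resolver : std::enable_shared_from_this<Resolver> {
  int results_acked = 0;
  // The resolver creates the callback and bakes a self-ref into it, so it
  // stays alive until the callback is run and released, even if the
  // channel's "external ref" is dropped first.
  std::function<void(bool)> MakeHealthCallback() {
    auto self = shared_from_this();  // internally-created ref
    return [self](bool ok) { if (ok) self->results_acked++; };
  }
};

// Returns 3 iff the resolver was still alive after the channel dropped
// its own ref, and was destroyed once the callback itself was released.
int Demo() {
  std::function<void(bool)> cb;
  std::weak_ptr<Resolver> weak;
  {
    auto resolver = std::make_shared<Resolver>();  // channel's external ref
    weak = resolver;
    cb = resolver->MakeHealthCallback();
  }  // channel drops its ref; resolver survives via the callback's ref
  bool alive_before = !weak.expired();
  cb(true);      // opaque to the channel; runs against a live resolver
  cb = nullptr;  // channel deletes the callback -> last ref released
  bool alive_after = !weak.expired();
  return (alive_before ? 1 : 0) + (alive_after ? 0 : 2);
}
```

The point of the sketch: the channel never knowingly holds a second ref; it holds an opaque closure, and releasing that closure is what releases the internal ref.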

        result_handler_->ReportResult(std::move(result));
      }
      Unref(DEBUG_LOCATION, "OnRequestComplete");
    }

    void PollingResolver::GetResultStatus(absl::Status status) {
Contributor:

Do we have a guarantee that an LB policy's UpdateLocked method won't be called between the time that ReportResult is called and GetResultStatus is called?

(from looking at the code, I think this won't happen, but I'm not sure if we have a more formal guarantee)

Member Author:

I'm not sure I understand this question. The LB policy's UpdateLocked() method will always be called between when the resolver calls ReportResult() and when the result-health callback is invoked. It has to be that way, because the result of the UpdateLocked() call is what determines the status to be passed to the result-health callback.

Contributor:

Sorry! I meant to ask: what's our guarantee that an LB policy's UpdateLocked method won't cause RequestReResolutionLocked to be invoked synchronously, i.e. before the result-health callback is invoked?

Member Author:

Ah, okay -- good question!

It looks like this PR did actually add such a case in pick_first. I could change pick_first to avoid that, but in the future, when we have more async cases for this, it probably won't be possible to do so. So instead, I've changed PollingResolver to handle that case properly, basically by deferring the re-resolution until after the result-health callback has been invoked.

Thanks for catching this!
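The deferral described above can be modeled roughly like this. This is a toy state machine with invented names, not the actual PollingResolver code: a re-resolution request that arrives while a result's status is still pending is remembered and acted on only after the result-health status is delivered.

```cpp
// Hypothetical sketch of deferring a synchronous re-resolution request
// until after the result-health callback has run.
struct PollingResolverSketch {
  bool result_status_pending = false;
  bool reresolution_requested = false;
  int attempts_started = 0;

  // Resolver hands a result to the channel; feedback is now pending.
  void ReportResult() { result_status_pending = true; }

  // May be called synchronously from inside the LB policy's update.
  void RequestReResolutionLocked() {
    if (result_status_pending) {
      reresolution_requested = true;  // defer until feedback arrives
      return;
    }
    StartAttempt();
  }

  // The result-health callback: feedback has arrived, so any deferred
  // re-resolution request can now be honored.
  void GetResultStatus() {
    result_status_pending = false;
    if (reresolution_requested) {
      reresolution_requested = false;
      StartAttempt();
    }
  }

  void StartAttempt() { ++attempts_started; }
};
```

Under this model, a re-resolution requested mid-update never races ahead of the health feedback; it starts exactly one new attempt, after the status has been delivered.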

@markdroth markdroth merged commit 9ff943b into grpc:master Sep 13, 2022
@markdroth markdroth deleted the lb_feedback_to_resolver branch September 13, 2022 00:17
gnossen added a commit that referenced this pull request Sep 13, 2022
ctiller pushed a commit that referenced this pull request Sep 13, 2022
@copybara-service copybara-service bot added the "imported" label (Specifies if the PR has been imported to the internal repository) Sep 14, 2022
markdroth added a commit to markdroth/grpc that referenced this pull request Sep 14, 2022
markdroth added a commit that referenced this pull request Sep 14, 2022
@chi-jams

@markdroth Back-porting this change appears to have solved our issue!

@markdroth (Member, Author)

Super, I'm glad to hear it!

copybara-service bot pushed a commit that referenced this pull request May 1, 2024
This fixes some TODOs added in #30809 for cases where LB policies lazily create child policies.  Credit to @ejona86 for pointing out that simply calling `RequestReresolution()` in this case will ultimately result in the exponential backoff behavior we want.

This also adds some missing plumbing in code added as part of the dualstack work (in the endpoint_list library and in ring_hash) to propagate non-OK statuses from `UpdateLocked()`.  When I first made the dualstack changes, I didn't bother with this plumbing, because there are no cases today where these code-paths will actually see a non-OK status (`EndpointAddresses` won't allow creating an endpoint with 0 addresses, and that's the only case where pick_first will return a non-OK status), and I wasn't sure if we would stick with the approach of returning status from `UpdateLocked()` due to the aforementioned lazy creation case.  However, now that we have a good solution for the lazy creation case, I've added the necessary plumbing, just so that we don't have a bug if in the future pick_first winds up returning non-OK status in some other case.
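The status plumbing this paragraph describes can be sketched as follows. `ChildPolicy` and `ParentUpdateLocked` are invented stand-ins (with a plain string as a status, empty meaning OK), not the actual endpoint_list or ring_hash code:

```cpp
#include <string>
#include <vector>

// Stand-in child policy: rejects an update with no addresses, the way
// pick_first rejects an empty address list.
struct ChildPolicy {
  std::string UpdateLocked(const std::vector<std::string>& addresses) {
    return addresses.empty() ? "empty address list" : "";
  }
};

// Stand-in parent policy: forwards per-child address lists to all
// children, and propagates the first non-OK child status to its caller
// (and ultimately back toward the resolver).
std::string ParentUpdateLocked(
    std::vector<ChildPolicy>& children,
    const std::vector<std::vector<std::string>>& per_child_addresses) {
  std::string first_error;
  for (size_t i = 0; i < children.size(); ++i) {
    std::string status = children[i].UpdateLocked(per_child_addresses[i]);
    // Keep updating the remaining children, but remember the first error.
    if (!status.empty() && first_error.empty()) first_error = status;
  }
  return first_error;
}
```

As the commit notes, today's code paths should never actually produce a non-OK status here; the plumbing exists so a future non-OK case in pick_first isn't silently swallowed.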

I have not bothered to fix the propagation in the grpclb policy, since that looked like it would be slightly more work than it's really worth at this point.

Closes #36463

COPYBARA_INTEGRATE_REVIEW=#36463 from markdroth:lb_reresolve_for_lazy_child_creation 49043b2
PiperOrigin-RevId: 629755047
Labels: bloat/low, imported (Specifies if the PR has been imported to the internal repository), lang/c++, lang/core, per-call-memory/neutral, per-channel-memory/neutral, release notes: yes (Indicates if PR needs to be in release notes)
Development

Successfully merging this pull request may close these issues.

Pick First load balancer does not retry to re-resolve if the first address resolution has 0 addresses
4 participants