grpclb: implement subchannel caching #27657
Conversation
```cpp
// Deleted subchannel caching.
const grpc_millis subchannel_cache_interval_ms_;
std::map<grpc_millis /*deletion time*/,
```
What if two subchannels are deleted at the same millisecond?
And why are we reimplementing a timer queue here?
I'm assuming what we ultimately want to do is just keep the subchannel reference around for a period of time, so ultimately we could write:
```cpp
void Cache(RefCountedPtr<SubchannelInterface> p) {
  event_engine->RunAt(now() + 10s, [p]() {});
}
```
Maybe we could find a way to write it similar to that with the current API?
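For concreteness, here is a runnable toy version of that idea. `FakeEventEngine`, its manually-advanced clock, and the use of `std::shared_ptr` in place of `RefCountedPtr` are all illustrative assumptions, not the real EventEngine API:

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Toy stand-in for an EventEngine-style RunAt() API: callbacks are stored
// with their deadlines and fired manually by AdvanceTo(). Illustrative only.
class FakeEventEngine {
 public:
  void RunAt(int64_t when_ms, std::function<void()> cb) {
    pending_.emplace(when_ms, std::move(cb));
  }
  // Runs (and discards) every callback whose deadline is <= now_ms.
  void AdvanceTo(int64_t now_ms) {
    while (!pending_.empty() && pending_.begin()->first <= now_ms) {
      auto cb = std::move(pending_.begin()->second);
      pending_.erase(pending_.begin());
      cb();
    }
  }

 private:
  std::multimap<int64_t, std::function<void()>> pending_;
};

struct Subchannel {};

// The pattern from the comment above: capturing the ref in a no-op callback
// keeps the subchannel alive until the deadline, with no explicit cache.
void Cache(FakeEventEngine* engine, std::shared_ptr<Subchannel> p,
           int64_t now_ms) {
  engine->RunAt(now_ms + 10000, [p]() {});  // ref dropped when lambda dies
}
```

The subchannel's lifetime is extended purely by the closure holding the last reference; destroying the closure after it runs is what releases the subchannel.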
The value of the map is a vector, so if there are multiple subchannels deleted in the same millisecond, they'll go in the same map entry. That's intentional, and in fact I expect it to happen very frequently due to the cached value of "now" in the `ExecCtx`.
The workflow is basically this:
- The `grpclb` policy gets an update from the balancer that does not include one or more addresses that were in the previous update.
- The `grpclb` policy sends the updated address list to the `round_robin` child policy.
- The `round_robin` policy calls the helper's `CreateSubchannel()` for every address in the new list, and then unrefs the subchannels from the previous list as soon as it is done. (Note that the same cached value of "now" in `ExecCtx` is used for all of these unrefs.)
- As each subchannel is unreffed, it gets added to `cached_subchannels_` (in the same bucket, because of the same value of "now"), and a timer is started when the first one is added.
The idea here is to minimize the number of timers, and therefore the amount of memory used for the cache. We know that multiple subchannels can be removed in the same update from the balancer, and we know that another update is likely to come in (which may remove another set of subchannels) before the timer fires for the subchannels removed in the previous update: the balancer may send updates as often as every 1s, but we cache subchannels for 10s. This way, we basically have just one timer pending at any given time, no matter how many subchannels are cached.
I could instead structure this using a separate timer for each subchannel, but that would increase the amount of memory I'd have to store for each cached subchannel: instead of just the ref to the subchannel, I'd also need to store a timer and a closure.
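To make the bucketing concrete, here is a minimal, self-contained sketch of the structure described above. `SubchannelCache`, `Millis`, and the use of `std::shared_ptr` in place of `RefCountedPtr` are simplifications for illustration; the real implementation lives in `grpclb.cc` and starts and cancels an actual `grpc_timer`:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-ins for grpc-core types; names are illustrative only.
using Millis = int64_t;
struct Subchannel {};
using SubchannelRef = std::shared_ptr<Subchannel>;

// Models the single-timer cache described above: all subchannels unreffed
// with the same cached value of "now" land in the same bucket, so one timer
// (for the earliest bucket) suffices no matter how many are cached.
class SubchannelCache {
 public:
  explicit SubchannelCache(Millis interval_ms) : interval_ms_(interval_ms) {}

  // Called as each subchannel is unreffed; buckets by deletion deadline.
  void CacheDeletedSubchannel(SubchannelRef subchannel, Millis now) {
    cached_subchannels_[now + interval_ms_].push_back(std::move(subchannel));
    // The real code starts the timer here iff no timer is already pending.
  }

  // Models the timer callback: releases every bucket whose deadline has
  // passed; the real code then restarts the timer for the next bucket.
  void OnTimerFired(Millis now) {
    auto it = cached_subchannels_.begin();
    while (it != cached_subchannels_.end() && it->first <= now) {
      it = cached_subchannels_.erase(it);  // drops the cached refs
    }
  }

  size_t num_buckets() const { return cached_subchannels_.size(); }

 private:
  const Millis interval_ms_;
  std::map<Millis /*deletion time*/, std::vector<SubchannelRef>>
      cached_subchannels_;
};
```

Using an ordered `std::map` keyed by deadline means the earliest bucket is always `begin()`, which is what makes the "one pending timer for the front bucket" scheme cheap.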
[none of this should be blocking]
Yup, ok... I'm mostly trying to make the EventEngine conversion easier, since I expect the code I wrote above would be preferred there; but as written, we'll probably end up keeping the infrastructure here and carrying the more complicated code forward.
I'm not sure that conservation of timers or memory warrants the additional long term complexity.
Given how expensive memory is right now, it seemed worth the optimization. But I acknowledge that I have absolutely no data to justify it; it's just sort of a hunch. It didn't seem that hard to do it this way, so I figured I might as well do it. But if at some point it is causing problems, it also isn't hard to change it to work the other way.
I don't think this will affect the EventEngine conversion either way.
```cpp
}

void GrpcLb::OnSubchannelCacheTimerLocked(grpc_error_handle error) {
  if (subchannel_cache_timer_pending_ && error == GRPC_ERROR_NONE) {
```
Looks like we need to reset `subchannel_cache_timer_pending_` in here?
Good catch. Done.
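For reference, a minimal model of the flag handling that was fixed here. `TimerOwner` is a toy: the real callback takes a `grpc_error_handle` and processes the cache buckets, but the reset-the-flag pattern is the same:

```cpp
// Toy model of the timer-pending flag pattern discussed above; names are
// illustrative, and the error handle is simplified to a `cancelled` bool.
class TimerOwner {
 public:
  void StartTimer() { timer_pending_ = true; }

  // The fix: clear the flag once the callback actually runs, so shutdown
  // does not later try to cancel a timer that has already fired.
  void OnTimerFired(bool cancelled) {
    if (timer_pending_ && !cancelled) {
      timer_pending_ = false;
      // ...release expired cache buckets; restart the timer if any remain...
    }
  }

  bool timer_pending() const { return timer_pending_; }

 private:
  bool timer_pending_ = false;
};
```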
```cpp
// subchannel caching
//

void GrpcLb::CacheDeletedSubchannel(
```
Looks like `StartSubchannelCacheTimer` and this method need to be synchronized the same way as `OnSubchannelCacheTimerLocked`, in order to safely access `cached_subchannels_`. So let's suffix these methods with `Locked`?
Done.
```cpp
      lb_token_(std::move(lb_token)),
      client_stats_(std::move(client_stats)) {}

~SubchannelWrapper() override {
  if (!lb_policy_->shutting_down_) {
    lb_policy_->CacheDeletedSubchannel(wrapped_subchannel());
```
We could get rid of the timer loop in the grpclb policy if we just had this dtor allocate its own object holding a closure and a timer that fires `GRPC_GRPCLB_DEFAULT_SUBCHANNEL_DELETION_DELAY_MS` from now (this object would destroy itself and unref the subchannel when the timer fired).
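A self-contained sketch of that alternative, assuming a toy `TimerQueue` in place of `grpc_timer` (all names here are illustrative, and the per-holder timer and closure are exactly the extra per-subchannel memory weighed in the reply below):

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <utility>

struct Subchannel {};

// Toy timer queue standing in for grpc_timer; Fire() runs every callback
// due at or before `now`. Illustrative only.
class TimerQueue {
 public:
  void Add(int64_t deadline, std::function<void()> cb) {
    pending_.emplace(deadline, std::move(cb));
  }
  void Fire(int64_t now) {
    while (!pending_.empty() && pending_.begin()->first <= now) {
      auto cb = std::move(pending_.begin()->second);
      pending_.erase(pending_.begin());
      cb();
    }
  }

 private:
  std::multimap<int64_t, std::function<void()>> pending_;
};

// Models the per-subchannel alternative: a heap-allocated holder keeps the
// ref and deletes itself (dropping the ref) when its own timer fires.
class CachedSubchannelHolder {
 public:
  static void Cache(std::shared_ptr<Subchannel> sub, TimerQueue* timers,
                    int64_t deadline) {
    auto* holder = new CachedSubchannelHolder(std::move(sub));
    timers->Add(deadline, [holder]() { delete holder; });  // self-destroys
  }

 private:
  explicit CachedSubchannelHolder(std::shared_ptr<Subchannel> s)
      : sub_(std::move(s)) {}
  std::shared_ptr<Subchannel> sub_;
};
```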
We could do that, but we'd still need to keep track of all of the cached subchannels that are pending deletion, because we need to be able to cancel any pending timers when the LB policy shuts down. So I think this would increase memory usage without any real benefit, as per my reply to Craig below.
> we need to be able to cancel any pending timers when the LB policy shuts down
Just for the thought experiment, what happens if we don't try to shut down these timers when the LB policy shuts down? I wonder if the per-subchannel timer approach could be made simpler this way.
That would cause memory leaks on grpc shutdown.
grpc shutdown should cancel all pending timers globally, though, right? Is that not sufficient to prevent this?
The other thing I'm thinking about here is that one or more of these subchannels may be in the process of setting up a TCP connection, and TCP connection setup can't be cancelled anyway, AFAIK.
Is this potential leak with the cached subchannel timers different from the case where TCP connections are still in the process of establishing and grpc shutdown is called?
I think the correct behavior for any code is to cancel any async work that it has pending when it shuts down. I would consider not doing that to be a bug.
I'm not sure what happens in the case where the subchannel has a pending TCP connection setup; it may be that we have a bug there. But even if we do, I don't think that's a justification for adding a new bug here.
```cpp
        << "backend " << i;
  }
}
// TODO(roth): This should ideally check that backend 1 never lost its
```
I think we can check that backend 1 never lost its connection by checking that it only received RPCs from one peer IP:port, like above?

```cpp
EXPECT_EQ(1UL, backends_[1]->service_.clients().size());
```
Good idea. Done.
```cpp
                   DEBUG_LOCATION);
}

void GrpcLb::OnSubchannelCacheTimerLocked(grpc_error_handle error) {
```
Looks like `error` is missing an unref here.
Good catch! Fixed.
Known issues: #27711
* grpclb: implement subchannel caching
* code review changes
* fix clang tidy
* code review changes
This is something that we always should have done but never quite got around to, and we have reports of the subchannel churn causing problems for internal users.