
xDS status notifier #25321

Merged · 28 commits · Mar 9, 2021

Conversation


@yashykt yashykt commented Feb 2, 2021

  1. Added a new type XdsServerBuilderServingStatusNotifierInterface and modified XdsServerConfigFetcher to invoke the notifier callback on serving status changes (see the usage sketch after this list).
  2. Added logic for the server to stop accepting new connections when the resource gets deleted, and to send a GOAWAY on existing connections. For this, the server needs to keep track of all the connections as a list of ConnectionState.
  3. Modified grpc_chttp2_transport_start_reading to add another parameter notify_on_close so that the ConnectionState struct can be maintained for the lifetime of the connection.
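
For context, here is a minimal usage sketch of the notifier described in item 1, using the r1 naming from this PR. The callback signature and the set_status_notifier() registration call are approximations for illustration only, not the final API:

#include <iostream>
#include <memory>

#include <grpcpp/grpcpp.h>
#include <grpcpp/xds_server_builder.h>

// Approximation of the r1 interface; the real signature may differ.
class LoggingNotifier : public grpc::experimental::
                            XdsServerBuilderServingStatusNotifierInterface {
 public:
  void OnServingStatusChange(std::string uri, grpc::Status status) override {
    std::cout << "serving status for " << uri << ": "
              << status.error_message() << std::endl;
  }
};

int main() {
  grpc::experimental::XdsServerBuilder builder;
  LoggingNotifier notifier;
  builder.set_status_notifier(&notifier);  // assumed registration method
  builder.AddListeningPort("0.0.0.0:50051", grpc::InsecureServerCredentials());
  // RegisterService(&service) would go here.
  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
  server->Wait();
}

The server flips between serving and not serving as the xDS Listener resource is obtained, errors out, or is deleted; the notifier is the application's hook into those transitions.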


@yashykt yashykt changed the title Initial commit xDS status notifier Feb 3, 2021
@yashykt yashykt added the release notes: no Indicates if PR should not be in release notes label Feb 3, 2021
@yashykt yashykt marked this pull request as ready for review February 3, 2021 08:28
@markdroth markdroth left a comment

This looks like a great start!

Please let me know if you have any questions. Thanks!

Reviewed 23 of 23 files at r1.
Reviewable status: all files reviewed, 25 unresolved discussions (waiting on @yashykt)


include/grpc/grpc.h, line 414 at r1 (raw file):

                                                   void* reserved);

// There might be more methods added later, so users should take care to memset

Unlike the C++ API, we can make non-backward-compatible changes to the C-core API, so this isn't such a big deal. In fact, we could probably drop this struct entirely and instead just pass the function pointer and the user_data arg as separate parameters to grpc_server_config_fetcher_xds_create().

Just a thought -- this is fine as-is too.


include/grpcpp/xds_server_builder.h, line 29 at r1 (raw file):

namespace experimental {

class XdsServerBuilderServingStatusNotifierInterface {

Suggest removing the "Builder" from this name, since this object will actually outlive the builder.


include/grpcpp/xds_server_builder.h, line 68 at r1 (raw file):

                                    grpc_status_code code,
                                    const char* error_message) {
    XdsServerBuilderServingStatusNotifierInterface* ptr =

Suggest calling this notifier.


include/grpcpp/xds_server_builder.h, line 70 at r1 (raw file):

    XdsServerBuilderServingStatusNotifierInterface* ptr =
        static_cast<XdsServerBuilderServingStatusNotifierInterface*>(user_data);
    ptr->OnServingStatusChange(

If the user hasn't set a notifier, this will be null.


include/grpcpp/xds_server_builder.h, line 74 at r1 (raw file):

  }

  grpc_server_xds_status_notifier c_notifier_;

As currently written, the C-core API passes the grpc_server_xds_status_notifier struct to grpc_server_config_fetcher_xds_create() by value, which means that there actually isn't any reason we need a copy of the struct here in the C++ object; all we really need here is an XdsServerBuilderServingStatusNotifierInterface*. We can construct the grpc_server_xds_status_notifier on the stack when we call grpc_server_config_fetcher_xds_create().


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 128 at r1 (raw file):

      MutexLock lock(&listener_->mu_);
      listener_->stop_serving_ = true;
      for (auto& iterator : listener_->connections_) {

This isn't an iterator, it's actually a ConnectionState*.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 170 at r1 (raw file):

    grpc_pollset_set* const interested_parties_;
    std::list<ConnectionState*>::iterator position_;
    bool stop_serving_ = false;

The double-negative here is a little confusing. Maybe call this is_serving_ and negate the value?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 193 at r1 (raw file):

  grpc_channel_args* args_;  // guarded by mu_
  ConfigFetcherWatcher* config_fetcher_watcher_ = nullptr;
  bool stop_serving_ = false;

Same comment as above regarding the double-negative.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 195 at r1 (raw file):

  bool stop_serving_ = false;
  bool shutdown_ = true;
  std::list<ConnectionState*> connections_;

If we change this to use std::set<>, then I don't think we need the ConnectionState object to hold an iterator to its own position, because the lookup should be fairly quick just based on its address.

I suppose we could also consider something like std::unordered_set<> or maybe one of the absl containers if we think this could get really large. But I suspect that shutdown and startup are going to be rare enough that we don't need to optimize for run-time here; minimizing memory usage is probably more important.

Speaking of which, have we considered how much memory this will add to really busy servers (e.g., those with 10K+ connections)? Are there alternatives that might not require us to keep this list at all? I don't see any obvious alternative, but maybe you can think of something I can't.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 242 at r1 (raw file):

  stop_serving_ = true;
  grpc_transport_op* op = grpc_make_transport_op(nullptr);
  op->goaway_error = GRPC_ERROR_CREATE_FROM_STATIC_STRING(

How does this interact with #24237 and internal issue b/135458602? Are we going to do the right GOAWAY behavior here?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 244 at r1 (raw file):

  op->goaway_error = GRPC_ERROR_CREATE_FROM_STATIC_STRING(
      "Server is stopping to serve requests.");
  grpc_transport_perform_op(&transport_->base, op);

What happens if the connection is still handshaking and has not yet created the transport?

I think in that case, we just need to shut down the endpoint and let the handshaking fail. Note that we also probably need to check stop_serving_ in OnHandshakeDone() to handle the race condition where handshaking is already complete but OnHandshakeDone() has not yet been invoked.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 337 at r1 (raw file):

          self->transport_ =
              reinterpret_cast<grpc_chttp2_transport*>(transport);
          // TODO(yashkt) : Check if this ref is unnecessary since we are

Is it possible that the transport closes before the initial SETTINGS frame is received? If so, then I think we need both refs.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 340 at r1 (raw file):

          // holding a ref for OnClose() anyway.
          self->Ref().release();  // Held by OnReceiveSettings().
          self->Ref().release();  // Held by OnClose().

It's worth noting that this basically means that this ConnectionState object is no longer just about handshaking; it actually lives until the transport is closed. But the ConnectionState object has a lot of data members that aren't relevant once handshaking is complete, so this seems like it will unnecessarily increase memory usage for servers.

Instead, can we create a separate object that contains just the state we need, and let that object share the lifetime of the connection, so that we can destroy the ConnectionState object when the handshaking is complete? I think after that we don't need to keep anything other than the transport itself and maybe a closure (although the latter might be eliminated by using std::function<> instead).
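
A rough sketch of the split being suggested here (generic names; the PR later implements this idea as an ActiveConnection object plus a HandshakingState object):

#include <functional>

// Lives for the whole lifetime of the connection; kept deliberately small.
struct ConnectionLifetimeState {
  void* transport = nullptr;       // grpc_chttp2_transport* in the real code
  std::function<void()> on_close;  // or a grpc_closure
};

// Lives only until handshaking completes and is then destroyed, so its
// handshake-only members stop costing memory on busy servers.
struct HandshakingState {
  // endpoint, handshake manager, acceptor, deadline timer, ...
  ConnectionLifetimeState* connection;  // outlives this object
};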


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 527 at r1 (raw file):

    MutexLock lock(&self->mu_);
    if (self->stop_serving_) {
      goto error;

I'd prefer to avoid goto in new code. Instead, I suggest moving the body of this function into a helper function that returns a bool to indicate if it failed, in which case we can do the error cleanup.

See my next comment about moving some of this code into CreateHandshakeManager(). It may be that that function is the helper function we want, and that we just need to move some more code into it.
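
A minimal sketch of that pattern with placeholder names (not the PR's actual code): the body moves into a helper that reports failure, and cleanup happens at the single call site instead of behind a goto.

// Hypothetical shape; the real helper would take the endpoint, args, etc.
bool TryStartConnection(bool stop_serving /* checked under the lock */) {
  if (stop_serving) return false;
  // ... create the handshake manager and kick off handshaking ...
  return true;
}

void OnAccept(bool stop_serving) {
  if (!TryStartConnection(stop_serving)) {
    // Single error-cleanup path, no goto needed:
    // destroy the endpoint / release resources here.
    return;
  }
  // Normal path continues.
}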


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 535 at r1 (raw file):

  }
  {
    MutexLock lock(&self->mu_);

It looks like we're now grabbing and releasing this same lock three times in a row (once to check stop_serving_, once in CreateHandshakeManager(), and once here).

Instead, how about moving the stop_serving_ check and the channel args copy into CreateHandshakeManager()? I think this would also eliminate the need for the goto.


src/core/ext/xds/xds_server_config_fetcher.cc, line 51 at r1 (raw file):

        listening_address);
    auto* listener_watcher_ptr = listener_watcher.get();
    // TODO(yashykt): Get the resource name id from bootstrap

This reminds me: We still need to take care of this.

This doesn't need to be done as part of this PR, but please make sure you go through gRFC A36 in detail to make sure we're covering everything. I want to make sure there's nothing else we're forgetting about here.


src/core/ext/xds/xds_server_config_fetcher.cc, line 93 at r1 (raw file):

          xds_client_(std::move(xds_client)),
          serving_status_notifier_(serving_status_notifier),
          listening_address_(listening_address) {}

std::move()


src/core/ext/xds/xds_server_config_fetcher.cc, line 126 at r1 (raw file):

              GRPC_STATUS_OK, "");
        } else {
          gpr_log(GPR_INFO, "Server will begin serving.");

Suggest having this message say something like "xDS Listener resource obtained, will start serving on ${listening_address}".


src/core/ext/xds/xds_server_config_fetcher.cc, line 148 at r1 (raw file):

              GRPC_STATUS_UNAVAILABLE, grpc_error_string(error));
        } else {
          gpr_log(GPR_INFO, "Server will not serve.");

Similar to above: "error obtaining xDS Listener resource; not serving on ${listening_address}".


src/core/ext/xds/xds_server_config_fetcher.cc, line 170 at r1 (raw file):

            GRPC_STATUS_NOT_FOUND, "Requested listener does not exist");
      } else {
        gpr_log(GPR_INFO, "Server will not serve.");

Similar to above: "xDS Listener resource does not exist; not serving on ${listening_address}".


src/core/ext/xds/xds_server_config_fetcher.cc, line 273 at r1 (raw file):

    RefCountedPtr<grpc_tls_certificate_provider> identity_certificate_provider_;
    RefCountedPtr<XdsCertificateProvider> xds_certificate_provider_;
    bool updated_once_ = false;

Suggest renaming this to something like have_resource_, to reflect what we're now actually using it for.


src/core/lib/surface/server.h, line 420 at r1 (raw file):

    // Ownership of \a args is transferred.
    virtual void UpdateConfig(grpc_channel_args* args) = 0;
    virtual void StopServing() = 0;

Please document this new API. It's not obvious from reading this how it's expected to be used, when it will be called, when we start serving again, etc.


test/cpp/end2end/xds_end2end_test.cc, line 6816 at r1 (raw file):

      absl::StrCat("127.0.0.1:", backends_[0]->port()), grpc::StatusCode::OK);
  SendRpc([this]() { return CreateInsecureChannel(); }, {}, {});
  // Invalid update does not lead to a change in the serving status.

In this test, how do we know that the update has actually been seen by the client by the time the test is complete? Could this test fail to catch a failure?


test/cpp/end2end/xds_end2end_test.cc, line 6838 at r1 (raw file):

}

// Alternative name - ServingStatusToNonServingStatusTransition

Why not use that name?


test/cpp/end2end/xds_end2end_test.cc, line 6876 at r1 (raw file):

      absl::StrCat("127.0.0.1:", backends_[0]->port()), grpc::StatusCode::OK);
  constexpr int kNumChannels = 10;
  std::shared_ptr<Channel> channel[kNumChannels];

Instead of having three different arrays here, please declare a struct and make a single array of the struct.

Alternatively, maybe use something like LongRunningRpc that already exists above?
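
A small sketch of what that could look like, reusing types already available in this test file and the kNumChannels constant above (the member list here is a guess; the struct the PR ends up with appears later in this review as streaming_rpcs[kNumChannels]):

struct StreamingRpc {
  std::shared_ptr<Channel> channel;
  std::unique_ptr<grpc::testing::EchoTestService::Stub> stub;
  ClientContext context;
  EchoResponse response;
  std::unique_ptr<ClientWriter<EchoRequest>> writer;
};
StreamingRpc streaming_rpcs[kNumChannels];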

@yashykt yashykt left a comment

Reviewable status: all files reviewed, 25 unresolved discussions (waiting on @markdroth and @yashykt)


include/grpcpp/xds_server_builder.h, line 70 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If the user hasn't set a notifier, this will be null.

If the user hasn't set a notifier, OnServingStatusChange will not be invoked.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 195 at r1 (raw file):
I chose std::list instead of std::set primarily to avoid a lookup, without adding a lot of additional state (which would be the case with an unordered_set).

I suppose we could also consider something like std::unordered_set<> or maybe one of the absl containers if we think this could get really large. But I suspect that shutdown and startup are going to be rare enough that we don't need to optimize for run-time here; minimizing memory usage is probably more important.

I am considering servers where the connections themselves are short-lived, something like a client starts a connection, performs a unary RPC and deletes the channel. Using a std::list where we can just save the position seems like a good match.

Speaking of which, have we considered how much memory this will add to really busy servers (e.g., those with 10K+ connections)? Are there alternatives that might not require us to keep this list at all? I don't see any obvious alternative, but maybe you can think of something I can't.

I haven't found one yet. It seems that we need some way to delete all active connections for a particular listener, so it seems like we won't be able to avoid storing this association somehow.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 242 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

How does this interact with #24237 and internal issue b/135458602? Are we going to do the right GOAWAY behavior here?

If I am not making a mistake, the server will send a GOAWAY immediately. Nice gRPC clients will eventually respect that GOAWAY and not try to create new RPCs, but this assumes that we are talking to a nice client. A malicious/bad client might very well ignore the GOAWAY and continue creating new RPCs forever. Due to the existing bug, our server will keep accepting those new streams and the connection will never close.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 244 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

What happens if the connection is still handshaking and has not yet created the transport?

I think in that case, we just need to shut down the endpoint and let the handshaking fail. Note that we also probably need to check stop_serving_ in OnHandshakeDone() to handle the race condition where handshaking is already complete but OnHandshakeDone() has not yet been invoked.

I thought I had added that. Evidently not.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 337 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Is it possible that the transport closes before the initial SETTINGS frame is received? If so, then I think we need both refs.

If the transport closes before the settings frame is received, both of the closures are invoked, so if we are sure they are invoked from the same thread one after another, we might not need the ref.

@markdroth markdroth left a comment

Reviewable status: all files reviewed, 24 unresolved discussions (waiting on @markdroth and @yashykt)


include/grpcpp/xds_server_builder.h, line 70 at r1 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

If the user hasn't set a notifier, OnServingStatusChange will not be invoked.

I don't think that's true. The notifier struct has two fields: on_serving_status_change and user_data. The code here always sets the first to non-null and sets the second to non-null only if the caller sets a notifier. However, the code in xds_server_config_fetcher.cc checks that the first field is non-null before invoking the callback, which means that this will always be invoked.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 195 at r1 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

I chose to use a std::list instead of an std::set primarily to avoid a lookup without adding a lot of additional state which would be the case with an unordered_set.

I suppose we could also consider something like std::unordered_set<> or maybe one of the absl containers if we think this could get really large. But I suspect that shutdown and startup are going to be rare enough that we don't need to optimize for run-time here; minimizing memory usage is probably more important.

I am considering servers where the connections themselves are short-lived, something like a client starts a connection, performs a unary RPC and deletes the channel. Using a std::list where we can just save the position seems like a good match.

Speaking of which, have we considered how much memory this will add to really busy servers (e.g., those with 10K+ connections)? Are there alternatives that might not require us to keep this list at all? I don't see any obvious alternative, but maybe you can think of something I can't.

I haven't found one yet. It seems that we need some way to delete all active connections for a particular listener, so it seems like we won't be able to avoid storing this association somehow.

I don't think we should optimize for the case of short-lived connections, since that's not the common case, and any application doing that is going to behave very inefficiently anyway. I think it's more important to optimize for the common case of long-lived connections and servers that need to scale up to a large number of clients.

I don't understand what you mean by "adding a lot of additional state" for a set. It seems to me that the std::list<> approach requires additional state, because it means that the ConnectionState object has to store the iterator. That becomes unnecessary with the std::set<> approach; the ConnectionState object can just do connections_.find(this). I think that approach should use less memory.
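
A self-contained illustration of that point (not the PR's code): with a std::set keyed on the object's address, a connection can remove itself without storing an iterator.

#include <cassert>
#include <set>

struct ConnectionState {};

int main() {
  std::set<ConnectionState*> connections;
  ConnectionState conn;
  connections.insert(&conn);
  // No per-connection iterator needed; erasing by address is an O(log n) lookup.
  connections.erase(&conn);
  assert(connections.empty());
}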


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 242 at r1 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

If I am not making a mistake, the server will send a GOAWAY immediately. Nice gRPC clients will eventually respect that GOAWAY and not try to create new RPCs, but this assumes that we are talking to a nice client. A malicious/bad client might very well ignore the GOAWAY and continue creating new RPCs forever. Due to the existing bug, our server will keep accepting those new streams and the connection will never close.

Okay, I see -- so we're depending on the client to close the connection when the last in-flight RPC finishes.

I guess that could work, but it seems a little sub-optimal, because it allows the existing connection to remain open for as long as the existing RPCs are still in flight. If those are long-running streams, the connection could remain around indefinitely. It seems like what we really ought to do here is to do a safe shutdown -- i.e., send a GOAWAY, wait for a grace period, and then close the connection.

As I said on the internal bug, it would be really nice to change our transport op semantics to allow callers to do this without setting their own timers, since there are a lot of places we trigger this kind of behavior from.
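
For illustration only (this is not gRPC's transport-op or timer API): the "safe shutdown" described above is essentially a drain with a deadline; in gRPC itself the wait would be a timer closure rather than a polling loop.

#include <chrono>
#include <functional>
#include <thread>

// Hypothetical connection hooks standing in for transport ops.
struct Connection {
  std::function<void()> send_goaway;
  std::function<bool()> all_rpcs_done;
  std::function<void()> force_close;
};

// Tell the peer to stop creating RPCs, give in-flight RPCs a grace period
// to finish, then close the connection unconditionally.
void DrainWithGracePeriod(Connection& conn, std::chrono::seconds grace) {
  conn.send_goaway();
  const auto deadline = std::chrono::steady_clock::now() + grace;
  while (!conn.all_rpcs_done() &&
         std::chrono::steady_clock::now() < deadline) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
  }
  conn.force_close();
}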


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 337 at r1 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

If the transport closes before the settings frame is received, both of the closures are invoked, and so if we are sure about them being invoked from the same thread one after another, we might not need the ref

Ah, right, good point -- I had forgotten that OnReceiveSettings() will be invoked regardless.

Yeah, I guess this does come down to a question of whether there's a guarantee that the two closures will be invoked in the right order. But it seems like we don't really have any such guarantee; even if the current code happens to behave that way, there's no guarantee it will stay that way in the future, so depending on it seems brittle.

Also, this problem may be moot if you take my suggestion below about splitting the OnClose() functionality into a different object.

@yashykt yashykt left a comment

Reviewable status: 16 of 23 files reviewed, 23 unresolved discussions (waiting on @markdroth and @yashykt)


include/grpcpp/xds_server_builder.h, line 29 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest removing the "Builder" from this name, since this object will actually outlive the builder.

Done.


include/grpcpp/xds_server_builder.h, line 68 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this notifier.

Done.


include/grpcpp/xds_server_builder.h, line 70 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think that's true. The notifier struct has two fields: on_serving_status_change and user_data. The code here always sets the first to non-null and sets the second to non-null only if the caller sets a notifier. However, the code in xds_server_config_fetcher.cc checks that the first field is non-null before invoking the callback, which means that this will always be invoked.

Ah you are right. Added a null check.
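
For reference, a sketch of what the null-checked trampoline could look like; the names follow the r1 code quoted above, but the exact parameter list and the Status conversion are assumptions:

static void OnServingStatusChange(void* user_data, const char* uri,
                                  grpc_status_code code,
                                  const char* error_message) {
  auto* notifier =
      static_cast<XdsServerBuilderServingStatusNotifierInterface*>(user_data);
  if (notifier == nullptr) return;  // no notifier registered by the user
  notifier->OnServingStatusChange(
      uri, grpc::Status(static_cast<grpc::StatusCode>(code), error_message));
}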


include/grpcpp/xds_server_builder.h, line 74 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As currently written, the C-core API passes the grpc_server_xds_status_notifier struct to grpc_server_config_fetcher_xds_create() by value, which means that there actually isn't any reason we need a copy of the struct here in the C++ object; all we really need here is an XdsServerBuilderServingStatusNotifierInterface*. We can construct the grpc_server_xds_status_notifier on the stack when we call grpc_server_config_fetcher_xds_create().

Done


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 128 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This isn't an iterator, it's actually a ConnectionState*.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 195 at r1 (raw file):

I don't understand what you mean by "adding a lot of additional state" for a set.

That was about an unordered_set, which would use more memory. I do agree, though, that std::set is better than std::list in terms of memory usage, and if we want to optimize for long-lived connections then it makes sense.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 244 at r1 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

I thought I had added that. Evidently not.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 337 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Ah, right, good point -- I had forgotten that OnReceiveSettings() will be invoked regardless.

Yeah, I guess this does come down to a question of whether there's a guarantee that the two closures will be invoked in the right order. But it seems like we don't really have any such guarantee; even if the current code happens to behave that way, there's no guarantee it will stay that way in the future, so depending on it seems brittle.

Also, this problem may be moot if you take my suggestion below about splitting the OnClose() functionality into a different object.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 340 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It's worth noting that this basically means that this ConnectionState object is no longer just about handshaking; it actually lives until the transport is closed. But the ConnectionState object has a lot of data members that aren't relevant once handshaking is complete, so this seems like it will unnecessarily increase memory usage for servers.

Instead, can we create a separate object that contains just the state we need, and let that object share the lifetime of the connection, so that we can destroy the ConnectionState object when the handshaking is complete? I think after that we don't need to keep anything other than the transport itself and maybe a closure (although the latter might be eliminated by using std::function<> instead).

I like this idea! Done!


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 527 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I'd prefer to avoid goto in new code. Instead, I suggest moving the body of this function into a helper function that returns a bool to indicate if it failed, in which case we can do the error cleanup.

See my next comment about moving some of this code into CreateHandshakeManager(). It may be that that function is the helper function we want, and that we just need to move some more code into it.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 535 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It looks like we're now grabbing and releasing this same lock three times in a row (once to check stop_serving_, once in CreateHandshakeManager(), and once here.

Instead, how about moving the stop_serving_ check and the channel args copy into CreateHandshakeManager()? I think this would also eliminate the need for the goto.

Moved most of it to CreateHandshakeManager


src/core/ext/xds/xds_server_config_fetcher.cc, line 51 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This reminds me: We still need to take care of this.

This doesn't need to be done as part of this PR, but please make sure you go through gRFC A36 in detail to make sure we're covering anything. I want to make sure there's nothing else we're forgetting about here.

Right


src/core/ext/xds/xds_server_config_fetcher.cc, line 93 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.


src/core/ext/xds/xds_server_config_fetcher.cc, line 126 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest having this message say something like "xDS Listener resource obtained, will start serving on ${listening_address}".

Done.


src/core/ext/xds/xds_server_config_fetcher.cc, line 148 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Similar to above: "error obtaining xDS Listener resource; not serving on ${listening_address}".

Done.


src/core/ext/xds/xds_server_config_fetcher.cc, line 170 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Similar to above: "xDS Listener resource does not exist; not serving on ${listening_address}".

Done.


src/core/ext/xds/xds_server_config_fetcher.cc, line 273 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest renaming this to something like have_resource_, to reflect what we're now actually using it for.

Done.


src/core/lib/surface/server.h, line 420 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please document this new API. It's not obvious from reading this how it's expected to be used, when it will be called, when we start serving again, etc.

Done.


test/cpp/end2end/xds_end2end_test.cc, line 6838 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why not use that name?

Both of these names seem useful :)


test/cpp/end2end/xds_end2end_test.cc, line 6876 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of having three different arrays here, please declare a struct and make a single array of the struct.

Alternatively, maybe use something like LongRunningRpc that already exists above?

Created a struct out of it

@yashykt yashykt left a comment

Reviewable status: 16 of 23 files reviewed, 23 unresolved discussions (waiting on @markdroth)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 170 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The double-negative here is a little confusing. Maybe call this is_serving_ and negate the value?

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 193 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same comment as above regarding the double-negative.

Done.


test/cpp/end2end/xds_end2end_test.cc, line 6816 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

In this test, how do we know that the update has actually been seen by the client by the time the test is complete? Could this test fail to catch a failure?

You are right. There is a chance that the test could fail to catch the failure.
One way I can increase the confidence is to run it in a loop 10 times.
Alternatively, I can simply add the OnWarning() callback for such cases.

@markdroth markdroth left a comment

This is getting closer! The only significant issue is that we're still not cancelling handshakes the way we should be.

Please let me know if you have any questions. Thanks!

Reviewed 6 of 7 files at r2, 1 of 1 files at r3.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @yashykt)


include/grpcpp/xds_server_builder.h, line 51 at r3 (raw file):

  std::unique_ptr<Server> BuildAndStart() override {
    grpc_server_xds_status_notifier c_notifier;

Could just say = { OnServingStatusChange, notifier_ };.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 128 at r1 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

Done.

Please also use ActiveConnection* instead of auto&.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 168 at r3 (raw file):

    void StopServing();
    static void OnClose(void* arg, grpc_error* error);

This can be private.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 203 at r3 (raw file):

  Chttp2ServerArgsModifier args_modifier_;
  Mutex mu_;
  grpc_channel_args* args_;  // guarded by mu_

While you're in here, maybe start using some of the absl lock annotations, so that the compiler can enforce that certain fields are accessed only while the mutex is held? @veblush just recently enabled this, so he can probably advise if you have questions about this.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 377 at r3 (raw file):

    Chttp2ServerListener* listener, grpc_chttp2_transport* transport)
    : listener_(listener), transport_(transport) {
  // The constructor should be called while holding a lock on listener_->mu

Here's another case where a lock annotation would be good.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 388 at r3 (raw file):

  op->goaway_error = GRPC_ERROR_CREATE_FROM_STATIC_STRING(
      "Server is stopping to serve requests.");
  grpc_transport_perform_op(&transport_->base, op);

As per my comment below about creating the ActiveConnection object as soon as the TCP connection is established instead of waiting until handshaking is complete, I think this needs to handle the case where StopServing() is called while we are still in the process of handshaking. In that case, the transport hasn't been created yet (transport_ will be null), so sending a transport op to generate a GOAWAY isn't the right thing. Instead, I think we need to call the HandshakeManager's Shutdown() method to stop handshaking. (I think I previously said that we need to shut down the endpoint, but I realized that we can't do that directly; while handshaking is going on, the endpoint is owned by the handshake manager, so we have to shut down the handshake manager and let it shut down the endpoint for us.)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 521 at r3 (raw file):

}

bool Chttp2ServerListener::CreateHandshakeManager(

Suggest renaming this to StartHandshaking(), to better reflect what it now does.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 543 at r3 (raw file):

  }
  // Deletes itself when done.
  new ActiveConnection::HandshakingState(this, accepting_pollset, acceptor,

I think we need to create the ActiveConnection object at this point and register it in connections_, rather than waiting until handshaking is complete to do so. Otherwise, if StopListening() is called while handshaking is going on, we won't cancel the handshake. (We will still fail the connection once handshaking is complete, but we shouldn't just let it run to completion when we can cancel it.)

Note that if we do this, then this may eliminate the need for pending_handshake_mgrs_. As this PR is currently written, we are referencing the connection in pending_handshake_mgrs_ until it finishes handshaking, and then moving it to connections_. Instead, I think we can store all connections in connections_, and when we shut down a connection, we can have the ActiveConnection object decide whether it needs to kill its handshake or tell the transport to send a GOAWAY. This would also allow us to eliminate the prev_ and next_ fields in the HandshakeManager class.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 553 at r3 (raw file):

                                    grpc_tcp_server_acceptor* acceptor) {
  Chttp2ServerListener* self = static_cast<Chttp2ServerListener*>(arg);
  RefCountedPtr<HandshakeManager> handshake_mgr;

This is no longer needed.


src/core/ext/xds/xds_server_config_fetcher.cc, line 150 at r3 (raw file):

              GRPC_STATUS_UNAVAILABLE, grpc_error_string(error));
        } else {
          gpr_log(GPR_INFO,

I just noticed that we're already generating a GPR_ERROR log at the top of this method. Maybe just change that to include the listening address, and then remove this one?


src/core/ext/xds/xds_server_config_fetcher.cc, line 174 at r3 (raw file):

            GRPC_STATUS_NOT_FOUND, "Requested listener does not exist");
      } else {
        gpr_log(GPR_INFO,

Similar comment here -- maybe just combine this with the GPR_ERROR log at the top of the method?


test/cpp/end2end/xds_end2end_test.cc, line 6816 at r1 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

You are right. There is a chance that the test could fail to catch the failure.
One way I can increase the confidence is to do it in a loop for 10 times.
Alternatively, I can simply add the OnWarning() callback for such cases

How about sending the RPC before calling WaitOnServingStatusChange()? That may at least give it more time to register.

Can we at least manually inspect the logs and verify that the test is running long enough for the change to register?


test/cpp/end2end/xds_end2end_test.cc, line 7514 at r3 (raw file):

    std::unique_ptr<ClientWriter<EchoRequest>> writer;
  } streaming_rpcs[kNumChannels];

Please remove blank line.

@yashykt yashykt left a comment

Reviewable status: 19 of 25 files reviewed, 13 unresolved discussions (waiting on @markdroth and @yashykt)


include/grpcpp/xds_server_builder.h, line 51 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Could just say = { OnServingStatusChange, notifier_ };.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 388 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As per my comment below about creating the ActiveConnection object as soon as the TCP connection is established instead of waiting until handshaking is complete, I think this needs to handle the case where StopServing() is called while we are still in the process of handshaking. In that case, the transport hasn't been created yet (transport_ will be null), so sending a transport op to generate a GOAWAY isn't the right thing. Instead, I think we need to call the HandshakeManager's Shutdown() method to stop handshaking. (I think I previously said that we need to shut down the endpoint, but I realized that we can't do that directly; while handshaking is going on, the endpoint is owned by the handshake manager, so we have to shut down the handshake manager and let it shut down the endpoint for us.)

Updated.

To make sure that refs are being handled properly and that ActiveConnection does not go away before HandshakingState, I made ActiveConnection a refcounted object.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 521 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest renaming this to StartHandshaking(), to better reflect what it now does.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 543 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think we need to create the ActiveConnection object at this point and register it in connections_, rather than waiting until handshaking is complete to do so. Otherwise, if StopListening() is called while handshaking is going on, we won't cancel the handshake. (We will still fail the connection once handshaking is complete, but we shouldn't just let it run to completion when we can cancel it.)

Note that if we do this, then this may eliminate the need for pending_handshake_mgrs_. As this PR is currently written, we are referencing the connection in pending_handshake_mgrs_ until it finishes handshaking, and then moving it to connections_. Instead, I think we can store all connections in connections_, and when we shut down a connection, we can have the ActiveConnection object decide whether it needs to kill its handshake or tell the transport to send a GOAWAY. This would also allow us to eliminate the prev_ and next_ fields in the HandshakeManager class.

Done. prev_ and next_ fields removed


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 553 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is no longer needed.

Done.


src/core/ext/xds/xds_server_config_fetcher.cc, line 150 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I just noticed that we're already generating a GPR_ERROR log at the top of this method. Maybe just change that to include the listening address, and then remove this one?

In this case, the two might not necessarily mean the same thing. The difference would be the same as the difference between OnWarning and NotServing


src/core/ext/xds/xds_server_config_fetcher.cc, line 174 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Similar comment here -- maybe just combine this with the GPR_ERROR log at the top of the method?

It makes sense here. Done.


test/cpp/end2end/xds_end2end_test.cc, line 6816 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

How about sending the RPC before calling WaitOnServingStatusChange()? That may at least give it more time to register.

Can we at least manually inspect the logs and verify that the test is running long enough for the change to register?

Adding another RPC did not help. Adding a 100ms sleep does help


test/cpp/end2end/xds_end2end_test.cc, line 7514 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please remove blank line.

Done.

@yashykt yashykt left a comment

Reviewable status: 19 of 25 files reviewed, 13 unresolved discussions (waiting on @markdroth and @veblush)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 168 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can be private.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 203 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

While you're in here, maybe start using some of the absl lock annotations, so that the compiler can enforce that certain fields are accessed only while the mutex is held? @veblush just recently enabled this, so he can probably advise if you have questions about this.

I was able to add lock annotations, but I've not been able to fix one remaining warning:

src/core/ext/transport/chttp2/server/chttp2_server.cc:555:17: warning: calling function 'StopServingLocked' requires holding mutex 'connection->listener_->mu_' exclusively [-Wthread-safety-precise]
    connection->StopServingLocked();
                ^
src/core/ext/transport/chttp2/server/chttp2_server.cc:555:17: note: found near match 'mu_'

Unless I'm making a mistake, I think it does not realize that mu_ is the same as connection->listener_->mu_ and I'm not sure how to get it to understand that.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 377 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Here's another case where a lock annotation would be good.

Updated

@markdroth markdroth left a comment

This is looking pretty good structurally! My main noteworthy comments are about the need to clean up the locking semantics in the chttp2 listener code.

Please let me know if you have any questions. Thanks!

Reviewed 4 of 6 files at r4, 4 of 4 files at r6.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @veblush and @yashykt)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 203 at r3 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

I was able to add lock annotations, but i've not been able to fix one remaining warning

src/core/ext/transport/chttp2/server/chttp2_server.cc:555:17: warning: calling function 'StopServingLocked' requires holding mutex 'connection->listener_->mu_' exclusively [-Wthread-safety-precise]
    connection->StopServingLocked();
                ^
src/core/ext/transport/chttp2/server/chttp2_server.cc:555:17: note: found near match 'mu_'

Unless I'm making a mistake, I think it does not realize that mu_ is the same as connection->listener_->mu_ and I'm not sure how to get it to understand that.

Try using ABSL_EXCLUSIVE_LOCKS_REQUIRED(&Chttp2ServerListener::mu_).
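
A small self-contained sketch of that suggestion (generic class names, absl::Mutex, and the clang thread-safety annotations; not the PR's actual code):

#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

class Listener {
 public:
  class Connection {
   public:
    // Qualifying the mutex with the enclosing class name (&Listener::mu_)
    // expresses "some Listener's mu_ must be held", which is the intent
    // when the caller holds listener_->mu_ rather than a member named mu_.
    void StopServingLocked() ABSL_EXCLUSIVE_LOCKS_REQUIRED(&Listener::mu_) {
      // ... send a GOAWAY or shut down the handshake ...
    }
  };

  void StopConnection(Connection* connection) {
    absl::MutexLock lock(&mu_);
    connection->StopServingLocked();
  }

 private:
  absl::Mutex mu_;
};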


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 134 at r6 (raw file):

  };

  class ActiveConnection : public RefCounted<ActiveConnection> {

Can this be InternallyRefCounted<>? I think there's only one external owner, right?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 136 at r6 (raw file):

  class ActiveConnection : public RefCounted<ActiveConnection> {
   public:
    class HandshakingState : public RefCounted<HandshakingState> {

Similar question here: Can this be InternallyRefCounted<>?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 174 at r6 (raw file):

    // does not go away before we expect.
    Chttp2ServerListener* const listener_;
    RefCountedPtr<HandshakeManager> handshake_mgr_

It seems strange to have the handshake manager owned by ActiveConnection instead of by HandshakingState, since the latter is the object that exists while handshaking is in progress. Instead, I suggest moving this to HandshakingState and adding a data member here in ActiveConnection that holds a ref to HandshakingState for cancellation purposes.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 176 at r6 (raw file):

    RefCountedPtr<HandshakeManager> handshake_mgr_
        ABSL_GUARDED_BY(listener_->mu_);
    // Guarding with atomics instead of mutex to avoid running into deadlocks.

What was the deadlock issue that required use of an atomic here?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 181 at r6 (raw file):

    Atomic<grpc_chttp2_transport*> transport_;
    grpc_closure on_close_;
    bool is_serving_ ABSL_GUARDED_BY(listener_->mu_) = true;

Is this field actually needed? It seems like it will always have the same value as listener_->is_serving_.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 244 at r6 (raw file):

                             OnHandshakeDone, this);
  MutexLock lock(&connection_->listener_->mu_);
  // If the listener's stopped serving, then shutdown the handshake early.

I think this check should actually be moved to Chttp2ServerListener::StartHandshake(). If we already know that we're not serving when we accept the connection, we should not bother to create the ActiveConnection or HandshakingState objects to begin with.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 267 at r6 (raw file):

  // Note that we may be called with GRPC_ERROR_NONE when the timer fires
  // or with an error indicating that the timer system is being shut down.
  if (error != GRPC_ERROR_CANCELLED) {

I think this also needs to check connection_->is_shutdown_ to handle race conditions.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 349 at r6 (raw file):

                            self, grpc_schedule_on_exec_ctx);
          // Refs helds by OnClose()
          self->connection_->Ref().release();

Can we avoid registering the on-close callback in the case where there is no config fetcher? That way, we won't increase the memory usage for each connection in non-xDS use-cases.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 402 at r6 (raw file):

    : listener_(listener) {
  {
    MutexLock lock(&listener_->mu_);

It looks like this code is now acquiring and releasing this same lock 3 times for every incoming connection: once in Chttp2ServerListener::StartHandshake(), once here in the ActiveConnection ctor, and again in the HandshakingState ctor. This seems pretty sub-optimal.

What was the reason for this change? When I last reviewed (snapshot 3 in reviewable), we were acquiring it just once. Can we change this to acquire the lock just once?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 546 at r6 (raw file):

  } else {
    {
      MutexLock lock(&mu_);

Here we're acquiring the lock, releasing it, and then immediately reacquiring it in StartListening(). Can we change this to acquire the lock just once?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 575 at r6 (raw file):

                                          grpc_pollset* accepting_pollset,
                                          grpc_tcp_server_acceptor* acceptor) {
  RefCountedPtr<HandshakeManager> handshake_mgr;

This is no longer needed.


src/core/ext/xds/xds_server_config_fetcher.cc, line 150 at r3 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

In this case, the two might not necessarily mean the same thing. The difference would be the same as the difference between OnWarning and NotServing

Sure, but the cases where this log occurs are always a strict subset of the cases where the other one occurs. This doesn't seem very friendly to users, because it means a single event will cause two different log messages at two different log levels.

How about this:

  • Change the log at the top of this function to occur only if have_resource_ is true.
  • Change this log message to use GPR_ERROR.

test/cpp/end2end/xds_end2end_test.cc, line 6816 at r1 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

Adding another RPC did not help. Adding a 100ms sleep does help

That still seems pretty fragile.

Hmm... how about checking to see that the xDS server sees a NACK first, so that we know the client has seen the new resource?

@yashykt yashykt left a comment

Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @markdroth, @veblush, and @yashykt)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 174 at r6 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It seems strange to have the handshake manager owned by ActiveConnection instead of by HandshakingState, since the latter is the object that exists while handshaking is in progress. Instead, I suggest moving this to HandshakingState and adding a data member here in ActiveConnection that holds a ref to HandshakingState for cancellation purposes.

I wanted to avoid adding a data member. So, keeping this in one location made sense.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 181 at r6 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Is this field actually needed? It seems like it will always have the same value as listener_->is_serving_.

No, it is possible that the listener was in the NOT SERVING state for a very short time, which required all existing connections to go down. When the listener goes back to the SERVING state, we want to avoid any inconsistent state here from the connection's point of view.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 244 at r6 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this check should actually be moved to Chttp2ServerListener::StartHandshake(). If we already know that we're not serving when we accept the connection, we should not bother to create the ActiveConnection or HandshakingState objects to begin with.

So, the reason for this is that I wanted to avoid calling DoHandshake from within the critical region of mu_. We could do a single check at the start of StartHandshake, but then the whole thing would need to be in a single critical region.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 267 at r6 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this also needs to check connection_->is_shutdown_ to handle race conditions.

Which race condition? If you are talking about the transport receiving two separate ops, then the transport is capable of handling it.

@yashykt yashykt left a comment

Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @markdroth, @veblush, and @yashykt)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 402 at r6 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It looks like this code is now acquiring and releasing this same lock 3 times for every incoming connection: once in Chttp2ServerListener::StartHandshake(), once here in the ActiveConnection ctor, and again in the HandshakingState ctor. This seems pretty sub-optimal.

What was the reason for this change? When I last reviewed (snapshot 3 in reviewable), we were acquiring it just once. Can we change this to acquire the lock just once?

I basically wanted to avoid long critical regions especially the region where we start performing a handshake.

@markdroth markdroth left a comment

Thanks for making all of those changes! The overall structure of this code makes a lot more sense now.

Please let me know if you have any questions. Thanks!

Reviewed 1 of 2 files at r11, 1 of 1 files at r12.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @jtattermusch, @nicolasnoble, and @yashykt)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 402 at r6 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

To further explain: HandshakingState and OnClose are the only two owners of refs to ActiveConnection. We could add an external ref for connections_ and use an OrphanablePtr as the alternative. That way we can avoid a StopServingLocked method on ActiveConnection and use Orphan instead to disconnect the transport if needed.

This looks really good now!


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 546 at r6 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

No; is_serving_ and started_ are different in that started_ is set only once, when grpc_tcp_server_start has been called, while is_serving_ can be reset multiple times during the lifetime of the server.

It looks to me like there's still a race condition between this and Orphan(). Consider the following case:

  • Thread 1 invokes Start(), which acquires the lock, sets started_ to true, and releases the lock.
  • Thread 2 invokes Orphan(), which acquires the lock, sets shutdown_ to true, releases the lock, and then calls grpc_tcp_server_shutdown_listeners() and grpc_tcp_server_unref().
  • Thread 1 then calls grpc_tcp_server_start().

src/core/ext/transport/chttp2/server/chttp2_server.cc, line 486 at r8 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

That seems very odd. Can you show me the code and the error?

Created #25632 to show the compiler issue on Windows:
https://source.cloud.google.com/results/invocations/6dad388f-ece4-46e9-a4d7-b5f66e66efc8/log

Yeah, that does look like a compiler bug of some sort.

However, now that we're explicitly exposing the Ref() method with the using statement above, I think this should no longer be an issue. So let's go ahead and get rid of this method.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 646 at r8 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

I think simply changing it from Orphanable to InternallyRefCounted would not require any internal changes, but even if it did, it should be fine.
Also, I think adding an additional atomic per listener is fine since we don't expect a process to have many listeners.

I think it will actually require internal changes, since we'll need to change the Orphan() method of existing ListenerInterface implementations to call Unref().

But I'm fine with this.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 111 at r12 (raw file):

        }
        {
          MutexLock lock(&listener_->channel_args_mu_);

Can we wait until after we release mu_ to acquire channel_args_mu_, just to minimize lock contention?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 167 at r12 (raw file):

      void Start(grpc_endpoint* endpoint, grpc_channel_args* args);

      using InternallyRefCounted<HandshakingState>::Ref;

Please add a comment explaining that this is needed for grabbing the ref externally in ActiveConnection::Start().


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 195 at r12 (raw file):

    void Start(grpc_endpoint* endpoint, grpc_channel_args* args);

    using InternallyRefCounted<ActiveConnection>::Ref;

Please add a comment that this is needed for grabbing the ref externally in Chttp2ServerListener::OnAccept().


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 210 at r12 (raw file):

    grpc_chttp2_transport* transport_ ABSL_GUARDED_BY(&mu_) = nullptr;
    grpc_closure on_close_;
    bool is_serving_ ABSL_GUARDED_BY(&mu_) = true;

I think this can be called shutdown_, because the connection will never exit this state once it occurs.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 230 at r12 (raw file):

  Chttp2ServerArgsModifier args_modifier_;
  ConfigFetcherWatcher* config_fetcher_watcher_ = nullptr;
  Mutex channel_args_mu_ ACQUIRED_AFTER(&mu_);

I'd prefer to not acquire the two mutexes at the same time if possible. The only place I see where this is done is in UpdateConfig() above, where I suggested changing the code to avoid that overlap. If we do that, then this annotation can go away.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 236 at r12 (raw file):

  bool started_ ABSL_GUARDED_BY(mu_) = false;
  // Signals whether grpc_tcp_server_start() is in progress.
  bool start_in_progress_ ABSL_GUARDED_BY(mu_) = false;

Do we need both started_ and start_in_progress_? When are they set differently, and does anything need to know when they are set differently?

Maybe we could assume the start is in progress if is_serving_ is true but started_ is not yet true?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 367 at r12 (raw file):

                resource_user);
        if (channel_init_err == GRPC_ERROR_NONE) {
          grpc_closure* on_close = nullptr;

I would move this back down to line 386, since it isn't needed before then.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 425 at r12 (raw file):

  if (cleanup_connection) {
    MutexLock listener_lock(&self->connection_->listener_->mu_);
    connection = std::move(

For efficiency, I suggest writing this as:

auto it = self->connection_->listener_->connections_.find(self->connection_.get());
if (it != self->connection_->listener_->connections_.end()) {
  connection = std::move(it->second);
  self->connection_->listener_->connections_.erase(it);
}

This way, we're only doing the map lookup once instead of twice.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 499 at r12 (raw file):

  {
    MutexLock listener_lock(&self->listener_->mu_);
    MutexLock connection_lock(&self->mu_);

Would it be more performant to grab the connection lock first, since it's less likely to be contended?


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 503 at r12 (raw file):

    // is not serving.
    if (self->is_serving_) {
      connection = std::move(self->listener_->connections_[self]);

Similar to my comment on line 425 above, suggest rewriting this to avoid doing the map lookup twice.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 644 at r12 (raw file):

    args = grpc_channel_args_copy(self->args_);
  }
  OrphanablePtr<ActiveConnection> connection = MakeOrphanable<ActiveConnection>(

Can just use auto here, since the type is obvious from the RHS.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 647 at r12 (raw file):

      self->Ref(), accepting_pollset, acceptor, args);
  RefCountedPtr<ActiveConnection> connection_ref;
  bool shutting_down = false;

This variable isn't necessary, because you can instead check whether connection is null.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 666 at r12 (raw file):

        // Hold a ref to connection to allow starting handshake outside the
        // critical region
        connection_ref = connection->Ref();

This ref can be taken before acquiring the lock.

Member Author

@yashykt yashykt left a comment

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @jtattermusch, @markdroth, @nicolasnoble, and @yashykt)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 546 at r6 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It looks to me like there's still a race condition between this and Orphan(). Consider the following case:

  • Thread 1 invokes Start(), which acquires the lock, sets started_ to true, and releases the lock.
  • Thread 2 invokes Orphan(), which acquires the lock, sets shutdown_ to true, releases the lock, and then calls grpc_tcp_server_shutdown_listeners() and grpc_tcp_server_unref().
  • Thread 1 then calls grpc_tcp_server_start().

Chttp2ServerListener::Start() happens as part of grpc_server_start() which happens as part of ServerBuilder::BuildAndStart. In other words, we don't expect Shutdown to be called before Start returns.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 111 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Can we wait until after we release mu_ to acquire channel_args_mu_, just to minimize lock contention?

That can cause races where the listener state is different from the channel_args state.
Consider the case where the listener has started accepting connections, i.e., tcp_server_start has been called, but it has stopped serving because the resource was deleted. In that case, if it receives a good update and a new connection arrives before we update the channel args, we would start that connection with the older channel args.

But I think if we modify the channel args before updating the listener state, we should be fine.
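
A minimal standalone sketch of that ordering, using std::mutex and simple fields as stand-ins rather than the actual Chttp2ServerListener members: the args are swapped under their own mutex first, and only then is the listener state updated, so the two mutexes are never held at the same time and a connection accepted in between already sees the new args.

#include <mutex>

struct ListenerSketch {
  std::mutex channel_args_mu;
  std::mutex mu;
  int args = 0;          // stand-in for grpc_channel_args*
  bool serving = false;  // stand-in for the listener serving state

  void UpdateConfig(int new_args) {
    {
      std::lock_guard<std::mutex> lock(channel_args_mu);
      args = new_args;  // step 1: publish the new channel args
    }
    {
      std::lock_guard<std::mutex> lock(mu);
      serving = true;  // step 2: flip the serving state / start listening
    }
  }
};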


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 236 at r12 (raw file):

Do we need both started_ and start_in_progress_? When are they set differently, and does anything need to know when they are set differently?

started_ is used just in UpdateConfig to know whether we need to start listening or not. start_in_progress_ denotes the window in which we are about to start, i.e., while tcp_server_start is being called.

Maybe we could assume the start is in progress if is_serving_ is true but started_ is not yet true?
Yes, we can if we want. Since this is just state on the listener, though, I wasn't really concerned about adding another bool. Done


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 499 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Would it be more performant to grab the connection lock first, since it's less likely to be contended?

I'm not sure which is more performant, but the common approach seems to be to grab the higher lock first. (For example https://g3doc.corp.google.com/borg/borglet/g3doc/developers/containermgr_locks.md#lock-ordering)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 647 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This variable isn't necessary, because you can instead check whether connection is null.

Done, but code readability took a hit with that, since this is not immediately obvious.

Member Author

@yashykt yashykt left a comment

Reviewable status: 26 of 27 files reviewed, 15 unresolved discussions (waiting on @apolcyn, @jtattermusch, @markdroth, and @nicolasnoble)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 486 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Yeah, that does look like a compiler bug of some sort.

However, now that we're explicitly exposing the Ref() method with the using statement above, I think this should no longer be an issue. So let's go ahead and get rid of this method.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 167 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add a comment explaining that this is needed for grabbing the ref externally in ActiveConnection::Start().

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 195 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add a comment that this is needed for grabbing the ref externally in Chttp2ServerListener::OnAccept().

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 210 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this can be called shutdown_, because the connection will never exit this state once it occurs.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 230 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I'd prefer to not acquire the two mutexes at the same time if possible. The only place I see where this is done is in UpdateConfig() above, where I suggested changing the code to avoid that overlap. If we do that, then this annotation can go away.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 236 at r12 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

Do we need both started_ and start_in_progress_? When are they set differently, and does anything need to know when they are set differently?

started_ is used just in UpdateConfig to know whether we need to start listening or not. start_in_progress_ denotes the window in which we are about to start, i.e., while tcp_server_start is being called.

Maybe we could assume the start is in progress if is_serving_ is true but started_ is not yet true?
Yes, we can if we want. Since this is just state on the listener, though, I wasn't really concerned about adding another bool. Done

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 367 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I would move this back down to line 386, since it isn't needed before then.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 425 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

For efficiency, I suggest writing this as:

auto it = self->connection_->listener_->connections_.find(self->connection_.get());
if (it != self->connection_->listener_->connections_.end()) {
  connection = std::move(it->second);
  self->connection_->listener_->connections_.erase(it);
}

This way, we're only doing the map lookup once instead of twice.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 503 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Similar to my comment on line 425 above, suggest rewriting this to avoid doing the map lookup twice.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 644 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Can just use auto here, since the type is obvious from the RHS.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 666 at r12 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This ref can be taken before acquiring the lock.

Done.

Member

@markdroth markdroth left a comment

This looks fabulous!

Remaining comments are all minor, so I'll go ahead and approve.

Thanks for doing this!

Reviewed 1 of 1 files at r13.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @apolcyn, @jtattermusch, @nicolasnoble, and @yashykt)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 546 at r6 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

Chttp2ServerListener::Start() happens as part of grpc_server_start() which happens as part of ServerBuilder::BuildAndStart. In other words, we don't expect Shutdown to be called before Start returns.

Ah, you're right! Okay, then this seems fine.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 499 at r12 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

I'm not sure which is more performant, but the common approach seems to be to grab the higher lock first. (For example https://g3doc.corp.google.com/borg/borglet/g3doc/developers/containermgr_locks.md#lock-ordering)

That page looks like it's just describing the rule established for that particular piece of code, not a general style rule.

Logically, if there is contention when acquiring the second lock, then we're going to hold the first lock longer than we need to. It seems better to hold the connection lock longer than we need to than to hold the top-level lock longer than we need to, since it's better to hold up work for an individual connection than it is to hold up work for all connections. That's why I was thinking the opposite order would be more performant.

It might be a good idea to check with @soheilhy about this too. But I'll go ahead and resolve, since this is something we can easily change later.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 96 at r13 (raw file):

        : listener_(std::move(listener)) {}

    void UpdateConfig(grpc_channel_args* args) override {

The implementation of these methods is now long enough that it would probably make sense to just declare them here and then define them below, just like we're doing for the other classes.
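
A shape-only sketch of the declare-then-define layout being suggested; the real UpdateConfig is an override on the watcher and does much more, and the forward declaration below is only there to keep the snippet self-contained.

struct grpc_channel_args;  // a forward declaration is enough for a pointer

class ConfigFetcherWatcher {
 public:
  void UpdateConfig(grpc_channel_args* args);  // declared in the class body...
};

// ...and defined out of line below the class, keeping the class body compact.
void ConfigFetcherWatcher::UpdateConfig(grpc_channel_args* /*args*/) {
  // the long implementation lives here
}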


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 176 at r13 (raw file):

      grpc_pollset* const accepting_pollset_;
      grpc_tcp_server_acceptor* const acceptor_;
      RefCountedPtr<HandshakeManager> handshake_mgr_;

Please add appropriate lock annotations for these data members.
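
For illustration, the annotation style being requested, sketched with a stand-in class and member; the real HandshakingState members and the mutex guarding them differ.

#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

class HandshakingStateSketch {
 private:
  absl::Mutex mu_;
  // The ABSL_GUARDED_BY annotation lets the static analysis verify that this
  // member is only touched while mu_ is held.
  void* handshake_mgr_ ABSL_GUARDED_BY(mu_) = nullptr;  // stand-in member
};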


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 390 at r13 (raw file):

            // Refs helds by OnClose()
            self->connection_->Ref().release();
            GRPC_CLOSURE_INIT(

This should move to the ActiveConnection ctor.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 654 at r13 (raw file):

  {
    MutexLock lock(&self->mu_);
    // Shutdown the the conncetion if listener's stopped serving.

s/conncetion/connection/

@yashykt
Member Author

yashykt commented Mar 9, 2021

Interop seems to be failing on master due to build_docker_dart

Member Author

@yashykt yashykt left a comment

Reviewable status: 22 of 27 files reviewed, 4 unresolved discussions (waiting on @apolcyn, @jtattermusch, @markdroth, and @nicolasnoble)


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 96 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The implementation of these methods is now long enough that it would probably make sense to just declare them here and then define them below, just like we're doing for the other classes.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 176 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add appropriate lock annotations for these data members.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 390 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should move to the ActiveConnection ctor.

Done.


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 654 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/conncetion/connection/

Done.

@yashykt yashykt added release notes: yes Indicates if PR needs to be in release notes lang/c++ lang/core and removed release notes: no Indicates if PR should not be in release notes labels Mar 9, 2021
@yashykt
Member Author

yashykt commented Mar 9, 2021

Thanks for reviewing!

Member

@markdroth markdroth left a comment

One quick comment to address in a follow-up PR. Thanks!

Reviewed 5 of 5 files at r14.
Reviewable status: all files reviewed, 1 unresolved discussion


src/core/ext/transport/chttp2/server/chttp2_server.cc, line 303 at r14 (raw file):

void Chttp2ServerListener::ActiveConnection::HandshakingState::Start(
    grpc_endpoint* endpoint,
    grpc_channel_args* args) ABSL_NO_THREAD_SAFETY_ANALYSIS {

I'm not too keen on disabling the lock analysis. Let's just go ahead and acquire the connection lock here. Nothing else in the connection is running yet at this point, so the lock should not be contended, and I don't think this will affect performance much.
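
A minimal sketch of the requested follow-up, with std::mutex standing in for the connection's mutex (not the actual ActiveConnection code): take the uncontended lock in Start() instead of suppressing the analysis with ABSL_NO_THREAD_SAFETY_ANALYSIS.

#include <mutex>

struct ConnectionSketch {
  std::mutex mu;
  bool shutdown = false;

  void Start() {
    // Nothing else touches the connection yet, so this lock is uncontended
    // and effectively free, while keeping the thread-safety analysis intact.
    std::lock_guard<std::mutex> lock(mu);
    if (shutdown) return;
    // ... kick off the handshake while holding the lock ...
  }
};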

yashykt added a commit to yashykt/grpc that referenced this pull request Mar 12, 2021
yashykt added a commit to yashykt/grpc that referenced this pull request Mar 12, 2021
yashykt added a commit that referenced this pull request Mar 12, 2021
@yashykt
Member Author

yashykt commented Mar 14, 2021

This PR was reverted because it introduced an issue where, on servers with no server config set, transports were not completely cleaned up until shutdown.

yashykt added a commit to yashykt/grpc that referenced this pull request Mar 15, 2021
yashykt added a commit that referenced this pull request Mar 15, 2021
* Revert "Revert "xDS status notifier (#25321)" (#25702)"

This reverts commit 3c9f397.

* Remove connection from map when OnClose is not registered

* Reviewer comments
@yashykt yashykt deleted the xdsstatusnotifier branch May 18, 2023 20:41