Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Custom Post-handshake Verification in TlsCredentials #25631

Merged
merged 37 commits into from
Nov 10, 2021

Conversation

ZhenLian
Copy link
Contributor

@ZhenLian ZhenLian commented Mar 5, 2021

This change is Reviewable

@ZhenLian ZhenLian changed the title custom verification proposed C core changes Support Custom Post-handshake Verification in TlsCredentials Mar 5, 2021
@ZhenLian ZhenLian requested a review from markdroth March 5, 2021 08:09
Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Mark, this PR is far from ready yet, but could you please take a first glance(especially at grpc_tls_custom_verification_check_request and the hostname verifier) since it contains a lot of changes not reflected in the design? Just wanted to make sure I am not going too far away from the correct direction...
Thank you so much!

Reviewable status: 0 of 13 files reviewed, all discussions resolved (waiting on @markdroth)

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark, the latest push contains all the changes and tests I'd like to make in core. I manually tested all the tests, and they all passed locally. This PR contains many changes that we haven't covered in the design doc, so I will push out for review first and then start to work on the C++ part.
There is one particular thing I want to mention here: in security_handshaker.cc, we will have mutex deadlock problem in e2e tests if we don't comment out the mutex locking code as I did in this PR. I think that was due to the fact that in OnHandshakeDataReceivedFromPeerFn and OnHandshakeDataSentToPeerFn we hold the lock, and later in OnPeerCheckedInner we were trying to grab the lock again. One possible way to solve this is to grant finer locks, but I am not sure how to do that exactly.
I've attached the log: https://gist.github.com/ZhenLian/c8a8ad5630f1aab85f274ff397f88217.
Feel free to take a look and let me know if you have any questions. Thank you so much!

Reviewable status: 0 of 18 files reviewed, all discussions resolved (waiting on @jiangtaoli2016 and @markdroth)

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start!

Please let me know if you have any questions. Thanks!

Reviewed 7 of 13 files at r1, 11 of 11 files at r2.
Reviewable status: all files reviewed, 68 unresolved discussions (waiting on @jiangtaoli2016 and @ZhenLian)


include/grpc/grpc_security.h, line 915 at r2 (raw file):

/**
 * The request information exposed in a verification call.
 * It would only stay within the check_peer() call of the security connector.

The security connector is an internal implementation detail, not something that should be described as part of this API. I think this line can just be removed.


include/grpc/grpc_security.h, line 920 at r2 (raw file):

 * object at the end of custom verification.
 */
struct grpc_tls_custom_verification_check_request {

Please note on every new struct, type, and function that they are experimental.


include/grpc/grpc_security.h, line 924 at r2 (raw file):

  /* The target name of the server when the client initiates the connection. */
  /* This field will be nullptr if on the server side. */
  const char* target_name = nullptr;

Unfortunately, I don't think it's legal in old-style C to initialize data members where they are defined. We'll need to set these all manually inside of core when we create an instance of this struct.


include/grpc/grpc_security.h, line 975 at r2 (raw file):

 */
struct grpc_tls_certificate_verifier_external {
  void* user_data;

Why do we need this? I thought we decided in the design that we don't need it, because implementations can use the address of the grpc_tls_certificate_verifier_external struct instead.

If we keep this, please document what it's for.


include/grpc/grpc_security.h, line 982 at r2 (raw file):

   * external_verifier: a pointer to the struct itself
   * request: request information exposed to the function implementer.
   * user_data: used for binding any caller-specified states. Whatever passed

If user_data is stored in grpc_tls_certificate_verifier_external and we're already passing a pointer to that struct to these methods, then why do we need to pass user_data as a separate parameter? Can't the implementation just access it as external_verifier->user_data?

Or better yet, if we actually do need user_data, then why pass the grpc_tls_certificate_verifier_external struct to these methods at all? Why not just pass user_data instead, since that's the only field from the struct that the implementations will ever actually use?

Whichever of these two arguments we keep should be the first argument to all three methods, since they're basically the "self" pointer.


include/grpc/grpc_security.h, line 993 at r2 (raw file):

   * grpc_tls_certificate_verifier_external during construction time can be
   * retrieved later here.
   * return: return a non-zero value if |verify| is expected to be executed

We need to explicitly document the requirement that the implementation of this method must not invoke the async callback in the thread in which it's invoked before it returns. That can lead to deadlocks.


include/grpc/grpc_security.h, line 1004 at r2 (raw file):

   * verifier is still running but the whole connection got cancelled. This
   * could happen when the verifier is doing some async operations, and the
   * whole handshaker object got destroyed because of connetion time limit is

s/connetion/connection/


include/grpc/grpc_security.h, line 1009 at r2 (raw file):

   *
   * external_verifier: a pointer to the struct itself.
   * request: request information exposed to the function implementer.

We need to document that this will be the same request object that was passed to verify(), and that it tells the verifier which request to cancel.


include/grpc/grpc_security.h, line 1046 at r2 (raw file):

 * check.
 */
grpc_tls_certificate_verifier* grpc_tls_certificate_verifier_host_name_create();

Do we also need to declare the one for the SAN verifier? Or is that not included in this PR?


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 22 at r2 (raw file):

#include <grpc/support/port_platform.h>

#include <grpc/grpc_security.h>

This include should be moved down to after the absl include.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 47 at r2 (raw file):

  // Otherwise, populate the results synchronously and return false.
  // The caller is expected to populate verification results by setting request.
  virtual bool Verify(grpc_tls_custom_verification_check_request* request,

The internal API was supposed to take a struct that contained both this request struct and a ref to the security connector. Why didn't we do that?

We do need to make sure that we're holding a ref to the security connector until this request is complete.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 48 at r2 (raw file):

  // The caller is expected to populate verification results by setting request.
  virtual bool Verify(grpc_tls_custom_verification_check_request* request,
                      std::function<void()> callback) = 0;

It occurs to me that since we're using a std::function<> both here in the internal API and in the C++-layer CertificateVerifier API, it would probably make more sense to have the callback return the status instead of trying to encode it in the request struct. Here, we can use std::function<absl::Status()>, and in the C++ layer, we can use std::function<grpc::Status()>. (We aren't currently allowed to use absl types in the C++ API.)

The only place we need to do something different is in the boundary between external verifiers and C-core. To solve that problem, we could pass the status code and message as two additional parameters to grpc_tls_on_custom_verification_check_done_cb.

Note that this approach would also eliminate the requirement to use gpr_malloc() for the error string, because now C-core will not take ownership of it. Instead, it can copy the string internally before the callback returns.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 51 at r2 (raw file):

  // Operations that will be performed when a request is cancelled.
  // This is only needed when in async mode.
  // TODO(ZhenLian): find out the place to invoke this...

It looks like the problem is that the grpc_security_connector API does not have a cancel_check_peer() method, so we'll need to add that. The method should look like this:

// Cancels the pending check_peer() request associated with on_peer_checked.
// If there is no such request pending, this is a no-op.
virtual void cancel_check_peer(grpc_closure* on_peer_checked) = 0;

In TlsChannelSecurityConnector, we'll need a map of pending requests, similar to the one in ExternalCertificateVerifier, but keyed by on_peer_checked. The map value will be a struct containing the grpc_tls_custom_verification_check_request struct, a ref to the security connector, and any other state we may need. When cancel_check_peer() is called, we'll look up the closure in the map to find the corresponding request struct, so that we can invoke the verifier's Cancel() method.

Then we'll need to change the security handshaker to call cancel_check_peer() when it gets cancelled or times out.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 55 at r2 (raw file):

  // A utility function to help clean up the request.
  // Note the request pointer itself won't be deleted.
  static void CertificateVerificationRequestDestroy(

This does not belong in this class. The verifier is not responsible for cleaning up the request object, because the request object needs to be owned by the security connector.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 74 at r2 (raw file):

      external_verifier_->destruct(external_verifier_, external_verifier_->user_data);
    }
    delete external_verifier_;

This needs to be done in the implementation's destruct() method, not here inside of C-core. This is because on Windows, each DLL has its own heap, and gRPC and the application may be in different DLLs. If the struct was allocated by the application, it can't be freed here.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 90 at r2 (raw file):

                           void* user_data);
  // Guards members below.
  grpc_core::Mutex mu_;

No need for grpc_core:: prefix, since we're already in that namespace.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 103 at r2 (raw file):

  void Cancel(grpc_tls_custom_verification_check_request* request) override {}

 private:

No need for this, since the section is empty.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 75 at r2 (raw file):

    grpc_tls_custom_verification_check_request* request,
    std::function<void()> callback) {
  grpc_core::ExecCtx exec_ctx;

I don't think this is necessary. This method should always be invoked from within core.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 92 at r2 (raw file):

}

void ExternalCertificateVerifier::OnVerifyDone(

We do need to declare an ExecCtx at the start of this method, because this will be called from outside of core.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 100 at r2 (raw file):

    std::function<void()> callback = std::move(it->second);
    self->request_map_.erase(it->first);
    callback();

I don't think we should actually invoke the callback until after we've released the lock, because once we've invoked the callback, the callback will give up the ref it's holding on the security connector, so the security connector might release its ref on the verifier and destroy the verifier. If the verifier is destroyed before we try to release the lock, that will be a use-after-free bug.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 106 at r2 (raw file):

namespace internal {

// TODO(ZhenLian): probably we can use VerifySubjectAlternativeName in

Why not do that?


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 159 at r2 (raw file):

    request->status = GRPC_STATUS_UNAUTHENTICATED;
    request->error_details = gpr_strdup("Target name is not specified.");
    return false; /*synchronous check*/

Please use C++-style comments.

Same thing throughout.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 176 at r2 (raw file):

  // Perform the hostname check.
  // First check the DNS field. We allow prefix or suffix wildcard matching.
  char** dns_names = request->peer_info.san_names.dns_names;

Why are we checking the SAN names here? I thought that was going to be a different verifier implementation.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 192 at r2 (raw file):

  if (ip_names != nullptr && ip_names_size > 0) {
    for (int i = 0; i < ip_names_size; ++i) {
      std::string target_ip(allocated_name);

This allocation is unnecessary. Instead, just change the check below to:

if (allocated_name == ip_name) {

src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 217 at r2 (raw file):

/** -- Wrapper APIs declared in grpc_security.h -- **/

int grpc_tls_certificate_verifier_verify(

This function will be an entry point to C-core, so it will need to declare an ExecCtx at the start of the function.

Same for the next 3 functions.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 249 at r2 (raw file):

                 (verifier));
  grpc_core::ExecCtx exec_ctx;
  // This should work fine as-is, because the C++ delete operator will

I don't think this comment is necessary.


src/core/lib/security/credentials/tls/grpc_tls_credentials_options.cc, line 48 at r2 (raw file):

    grpc_tls_credentials_options* options, int verify_server_cert) {
  GPR_ASSERT(options != nullptr);
  if (verify_server_cert == 0) {

This can just be options->set_verify_server_cert(verify_server_cert). The int will be automatically converted to a bool.


src/core/lib/security/credentials/tls/tls_credentials.cc, line 57 at r2 (raw file):

  // In the following conditions, there could be severe security issues.
  if (is_client && options->certificate_verifier() == nullptr) {
    // If no verifier is specified on the client side, use the hostname verifier

Let's log something here indicating that we're defaulting to the hostname verifier.


src/core/lib/security/credentials/tls/tls_credentials.cc, line 65 at r2 (raw file):

    hostname_verifier->Unref();
  }
  /*if (!is_client && options->certificate_verifier() == nullptr) {

I think this can just be removed.


src/core/lib/security/credentials/xds/xds_credentials.cc, line 74 at r2 (raw file):

              std::function<void()> callback) override {
    GPR_ASSERT(request != nullptr);
    if (XdsVerifySubjectAlternativeNames(

Should XdsVerifySubjectAlternativeNames() take a parameter that tells it which type of SANs it's being given? @yashykt


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 211 at r2 (raw file):

  *auth_context =
      grpc_ssl_peer_to_auth_context(&peer, GRPC_TLS_TRANSPORT_SECURITY_TYPE);
  // GPR_ASSERT(options_->certificate_verifier() != nullptr);

Please remove.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 214 at r2 (raw file):

  // Parse tsi_peer and feed in the values in the check request.
  auto* request = new grpc_tls_custom_verification_check_request();
  request->target_name = gpr_strdup(target_name);

I don't think we actually need to allocate a copy of this particular string. We're going to need to hold a ref to the security connector until this verification request is completed, and target_name points into a member of the security connector, so we know it will continue to exist until after the request is complete. So I think we can just pass a pointer to the existing string to the verifier.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 222 at r2 (raw file):

    if (strcmp(prop->name, TSI_X509_SUBJECT_COMMON_NAME_PEER_PROPERTY) == 0) {
      char* common_name =
          static_cast<char*>(gpr_malloc(prop->value.length + 1));

As I mentioned elsewhere, the security connector will need to maintain a map of pending check_peer() requests. If you store the TSI peer object in that map entry, then we may not need to copy any of these string fields into the cert verifier request structure. Instead, they can just be pointers into the TSI peer object, which we know will live until after the request has completed. (Although we may need to make copies anyway in cases where we need to add nul termination.)


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 297 at r2 (raw file):

        request);
    delete request;
    grpc_core::ExecCtx exec_ctx;

We don't need this here. Instead, it should be in ExternalCertificateVerifier::OnVerifyDone(), as I suggested elsewhere.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 298 at r2 (raw file):

    delete request;
    grpc_core::ExecCtx exec_ctx;
    grpc_core::ExecCtx::Run(DEBUG_LOCATION, on_peer_checked, error);

In this case, I think it would actually be safe to just use Closure::Run() instead of ExecCtx::Run().


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 322 at r2 (raw file):

    return c;
  }
  // Question: I think we shall also compare other fields, right?

Yes, any field that affects the behavior of the security connector needs to be checked here. Failure to do so can cause a security problem, because two security connectors that differ based on a field not checked here will be considered identical, which means that in a channel that's supposed to use one value, we may reuse a subchannel with a different value.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

    absl::string_view host, grpc_auth_context* auth_context,
    grpc_closure* /*on_call_host_checked*/, grpc_error** error) {
  // Question: shall we apply the verifier logic here as well?

Hmm... This is an interesting question.

The check_call_host() check is a per-call check, not a per-connection check. gRPC has a seldom-used feature to support virtual hosting, where each individual call on a channel can have a different host associated with it. So this check is intended to verify that the host specified for the individual call is covered by the cert that the server presented.

Previously, this check was inhibited if the connection-time verification did not check the hostname, because if the hostname was not verified upon connection, then it also doesn't need to be checked for the individual calls.

We may need to add some way for the verifier to tell us whether it did hostname verification. We could then store that state in the auth context, so that we can use that to decide what to check here.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 339 at r2 (raw file):

void TlsChannelSecurityConnector::cancel_check_call_host(
    grpc_closure* /*on_call_host_checked*/, grpc_error* error) {
  // Question: any special treatment we should do here if we are also doing

No matter what we do in check_call_host(), we won't be doing async work, so I don't think we need anything here.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 505 at r2 (raw file):

}

void TlsServerSecurityConnector::check_peer(

Same comments as I had for the client side apply here as well.


src/core/lib/security/transport/security_handshaker.cc, line 422 at r2 (raw file):

                                                           grpc_error* error) {
  RefCountedPtr<SecurityHandshaker> h(static_cast<SecurityHandshaker*>(arg));
  //MutexLock lock(&h->mu_);

Commenting this out is clearly incorrect: this code needs to be holding this lock, or it is not thread-safe.

Looking at the mutex deadlock log you shared, it looks like the problem might be that two threads are taking two locks in the opposite order, although it's hard to tell from the output exactly which two mutexes. You might try running under tsan to see if it will give you more information.


src/core/lib/security/transport/security_handshaker.cc, line 456 at r2 (raw file):

                                                     grpc_error* error) {
  RefCountedPtr<SecurityHandshaker> h(static_cast<SecurityHandshaker*>(arg));
  // MutexLock lock(&h->mu_);

Same here -- this needs to be uncommented.


test/core/end2end/generate_tests.bzl, line 56 at r2 (raw file):

# maps fixture name to whether it requires the security library
END2END_FIXTURES = {
    "h2_compress": _fixture_options(),

Please restore all of these deletions. If your goal is to just run one particular test suite, you can do that by selecting just the test you want on the command line when you invoke bazel. You don't need to change the build config this way.


test/core/end2end/fixtures/h2_tls.cc, line 82 at r2 (raw file):

//

static int SyncExternalVerifierVerify(

Let's structure these the way we would expect them to be structured in a real-world example:

class SyncExternalVerifier {
 public:
  SyncExternalVerifier()
      : base_{Verify,Cancel,Destruct} {}

  grpc_tls_certificate_verifier_external* base() const { return &base_; }

 private:
  static int Verify(
      grpc_tls_certificate_verifier_external* external_verifier,
      grpc_tls_custom_verification_check_request* request,
      grpc_tls_on_custom_verification_check_done_cb callback, void* callback_arg,
      void* user_data)  {
    request->status = GRPC_STATUS_OK;
    return false;
  }

  static void Cancel(
      grpc_tls_certificate_verifier_external* external_verifier,
      grpc_tls_custom_verification_check_request* request,
      void* user_data)  {}

  static void Destruct(
      grpc_tls_certificate_verifier_external* external_verifier,
      void* user_data)  {
    auto* self = static_cast<SyncExternalVerifier*>(user_data);
    delete self;
  }

  grpc_tls_certificate_verifier_external base_;
};

Then we can use it like this:

auto* sync_verifier = new SyncExternalVerifier();
ffd->client_verifier =
    grpc_tls_certificate_verifier_external_create(sync_verifier->base());

test/core/end2end/fixtures/h2_tls.cc, line 93 at r2 (raw file):

// This is the arg we will pass in when creating the thread, and retrieve it
// later in the thread callback.
struct ThreadArgs {

Same thing for the async verifier. ThreadArgs, AsyncExternalVerifierVerifyCb(), and AsyncExternalVerifierVerify() should all be in the same class.


test/core/end2end/fixtures/h2_tls.cc, line 431 at r2 (raw file):

}

static uint32_t h2_tls_feature_mask =

This should be const. Also, please name it kFeatureMask.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 51 at r2 (raw file):

  }

  void CreateExternalCertificateVerifier(

Same comment as in h2_tls.cc: Each verifier implementation should be in its own class.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 93 at r2 (raw file):

  // For Async external verifier, we will take two extra parameters, compared to
  // the sync verifier:
  // 1. thread_ptr_ptr: main test thread will create a dummy thread object, and

As I mentioned in one of the other tests, this seems unnecessarily complicated. I think it will make much more sense to just have a verifier class that spawns a thread for each request, or perhaps has a dedicated worker thread that processes requests out of a queue populated by the verifier.


test/core/security/grpc_tls_credentials_options_test.cc, line 65 at r2 (raw file):

  }

  void CreateExternalCertificateVerifier(

Same comment as in h2_tls.cc: Each verifier implementation should be in its own class.

Also, please consider having a separate library to provide these verifier implementations rather than duplicating them in so many places.


test/core/security/grpc_tls_credentials_options_test.cc, line 114 at r2 (raw file):

        static_cast<void*>(thread_args));
    thread.Start();
    thread.Join();

This is going to cause a deadlock, and it may be related to the mutex problem you were seeing. The API requirement for a verifier is that if it is going to return asynchronously, it can't invoke the callback until after it has returned from verify(), because verify() is called while holding the lock that the callback will try to reacquire. Normally, spawning another thread is fine, because that thread will block until verify() returns and the lock is released. But here you are joining the thread you just spawned, which means that verify() won't return until after the call is finished, which will never happen.


test/core/security/grpc_tls_credentials_options_test.cc, line 569 at r2 (raw file):

//

TEST_F(GrpcTlsCredentialsOptionsTest, ClientOptionsWithGoodExternalVerifier) {

Suggest just calling this ClientOptionsWithExternalVerifier.


test/core/security/grpc_tls_credentials_options_test.cc, line 569 at r2 (raw file):

//

TEST_F(GrpcTlsCredentialsOptionsTest, ClientOptionsWithGoodExternalVerifier) {

Shouldn't we also have a variant of this test for the server side?


test/core/security/grpc_tls_credentials_options_test.cc, line 587 at r2 (raw file):

}

TEST_F(GrpcTlsCredentialsOptionsTest, ServerOptionsWithBadExternalVerifier) {

Suggest calling this ServerOptionsWithVerifierButDontRequestClientCert.


test/core/security/tls_security_connector_test.cc, line 96 at r2 (raw file):

  }

  void CreateExternalCertificateVerifier(

As mentioned in the other tests, each verifier should be in its own class.


test/core/security/tls_security_connector_test.cc, line 117 at r2 (raw file):

  }

  static void grpcClosureCb(void* arg, grpc_error* error) {

As per C++ naming rules, this should be called GprcClosureCb().

Although in this case, a better name would be VerifyExpectedErrorCallback().


test/core/security/tls_security_connector_test.cc, line 118 at r2 (raw file):

  static void grpcClosureCb(void* arg, grpc_error* error) {
    grpc_error* expected_error = static_cast<grpc_error*>(arg);

Instead of passing in the expected error as a grpc_error*, how about passing it in as a const char*? If it's null, you can expect GRPC_ERROR_NONE; otherwise, you can check the error's description against the expected string.


test/core/security/tls_security_connector_test.cc, line 122 at r2 (raw file):

      EXPECT_EQ(expected_error, GRPC_ERROR_NONE);
    } else {
      EXPECT_STREQ(GetErrorMsg(expected_error).c_str(),

STREQ is needed only if the first argument is a C-style string. You can just use EQ for C++ string objects (including both std::string and absl::string_view).


test/core/security/tls_security_connector_test.cc, line 128 at r2 (raw file):

  }

  static std::string GetErrorMsg(grpc_error* error) {

This can return an absl::string_view to avoid making an unnecessary copy.


test/core/security/tls_security_connector_test.cc, line 146 at r2 (raw file):

  // For Async external verifier, we will take two extra parameters, compared to
  // the sync verifier:
  // 1. thread_ptr_ptr: main test thread will create a dummy thread object, and

This is really confusing. Why do we need this level of indirection? Why can't we simply have a verifier object that spawns a thread for each request?


test/core/security/tls_security_connector_test.cc, line 515 at r2 (raw file):

TEST_F(TlsSecurityConnectorTest,
       ChannelSCWithSyncGoodExternalVerifierAndGoodRequestParam) {

Please replace SC with SecurityConnector in all of these test names. More understandable is better, even if that means longer.


test/core/security/tls_security_connector_test.cc, line 515 at r2 (raw file):

TEST_F(TlsSecurityConnectorTest,
       ChannelSCWithSyncGoodExternalVerifierAndGoodRequestParam) {

What does GoodRequestParam mean? It's not clear to me which parameter these names are referring to.


test/core/security/tls_security_connector_test.cc, line 542 at r2 (raw file):

                 &peer.properties[1]) == TSI_OK);
  grpc_core::RefCountedPtr<grpc_auth_context> auth_context;
  grpc_error* expected_error = GRPC_ERROR_NONE;

No need for this local variable; just pass GRPC_ERROR_NONE directly to GRPC_CLOSURE_CREATE().


test/core/security/tls_security_connector_test.cc, line 592 at r2 (raw file):

  grpc_core::ExecCtx exec_ctx;
  grpc_closure* on_peer_checked = GRPC_CLOSURE_CREATE(
      grpcClosureCb, (void*)expected_error, grpc_schedule_on_exec_ctx);

I think this cast is unnecessary.


test/core/security/tls_security_connector_test.cc, line 595 at r2 (raw file):

  tls_connector->check_peer(peer, nullptr, &auth_context, on_peer_checked);
  grpc_channel_args_destroy(new_args);
  /*GRPC_ERROR_UNREF(expected_error);*/

This can be removed. The callback function is doing this, so it's not needed here.


test/core/security/tls_security_connector_test.cc, line 600 at r2 (raw file):

      &event, gpr_time_add(gpr_now(GPR_CLOCK_MONOTONIC),
                           gpr_time_from_seconds(5, GPR_TIMESPAN)));
  EXPECT_NE(value, nullptr);

The value doesn't actually matter here, so there's no point in checking it. We don't even need the value variable in the first place.


test/core/security/tls_security_connector_test.cc, line 602 at r2 (raw file):

  EXPECT_NE(value, nullptr);
  // Join the thread otherwise there will be memory leaks.
  (*thread_ptr).Join();

Instead of (*obj).Method(), please say obj->Method().


test/core/security/tls_security_connector_test.cc, line 608 at r2 (raw file):

TEST_F(TlsSecurityConnectorTest,
       ChannelSCWithHostnameVerifierAndGoodRequestParamOnGoodMatch) {

Suggest just calling this ChannelSecurityConnectorHostnameVerifierSucceeds.


test/core/security/tls_security_connector_test.cc, line 658 at r2 (raw file):

TEST_F(TlsSecurityConnectorTest,
       ChannelSCWithHostnameVerifierAndGoodRequestParamOnBadMatch) {

Suggest just calling this ChannelSecurityConnectorHostnameVerifierFails.


test/core/security/tls_security_connector_test.cc, line 880 at r2 (raw file):

//
// Tests for Certificate Verifier in ServerSecurityConnector.

Please add one more line with // after this.

Copy link
Member

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 68 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, and @ZhenLian)


src/core/lib/security/credentials/xds/xds_credentials.cc, line 74 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Should XdsVerifySubjectAlternativeNames() take a parameter that tells it which type of SANs it's being given? @yashykt

Sorry I missed this earlier. Yes, we would ideally want to know whether the SAN has a DNS type or not. The SAN matcher should only be doing a DNS style verification (for example, wildcard matching) for DNS names.

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @markdroth for the carefully review, and sorry for the delay of this PR! It took me some time to resolve some of the threading issues while using the shared verifier classes for all the tests. The tests look much cleaner now.
I've replied to almost all the comments, but let me make a quick summary:

These things will go into another PR:

  1. (optional) slight change the SAN name verification logic to match the old verification implementation
  2. security connector's cancel() function
  3. plumbings of more SAN fields and their types in TSI

I will do 2 next, and depending on the reply will probably do 1. For 3, Ashitha had a PR(#25874) to plumb DNS field which should meet xds's case, but the way we plumb is not very clean and in our case here we need all other fields(ip address, etc). I will need to talk to @jiangtaoli2016 to see how we want to proceed.
This PR still has some pending comments in the main code that might change the overall structure significantly. I will be working on the C++ implementation once those comments are solved.
Thanks again for the guidance!

Reviewable status: 4 of 17 files reviewed, 68 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, and @ZhenLian)


include/grpc/grpc_security.h, line 915 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The security connector is an internal implementation detail, not something that should be described as part of this API. I think this line can just be removed.

Done.


include/grpc/grpc_security.h, line 920 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please note on every new struct, type, and function that they are experimental.

Sounds good! I changed all the TLS cred API to the new form with a EXPERIMENTAL API - Subject to change header. I think it might be more concise.


include/grpc/grpc_security.h, line 924 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Unfortunately, I don't think it's legal in old-style C to initialize data members where they are defined. We'll need to set these all manually inside of core when we create an instance of this struct.

Done. I added one CertificateVerificationRequestInit function in grpc_tls_certificate_verifier.h for the set up work.


include/grpc/grpc_security.h, line 975 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why do we need this? I thought we decided in the design that we don't need it, because implementations can use the address of the grpc_tls_certificate_verifier_external struct instead.

If we keep this, please document what it's for.

Yeah, this is mainly used for the tests to pass some extra data when constructing grpc_tls_certificate_verifier_external, and then retrieve it in verify or destruct functions.
For example in grpc_tls_certificate_verifier_test.cc, I want to start a thread in the verify function and return immediately(kind of like simulating how the async mode works), and later in the destruct or main thread to join this thread to avoid memory leaks. I don't have a good idea how to achieve this using the old design, since there is no way to make a connection among verify, destruct and the main thread. So my solution is to have this user_data to be the bridge of three of them. In that test, we will create a dummy thread memory in main thread, and pass it through user_data; in verify, we will replace that thread memory with the actual memory we want to run, and then when back to the main thread we will join that thread. This is a bit ugly but is the best way I could think of.
And it becomes even more useful when it comes to e2e tests. in h2_tls.cc, we are passing fullstack_secure_fixture_data to the user_data so that we can access it in any of the member functions of the external verifier.


include/grpc/grpc_security.h, line 982 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If user_data is stored in grpc_tls_certificate_verifier_external and we're already passing a pointer to that struct to these methods, then why do we need to pass user_data as a separate parameter? Can't the implementation just access it as external_verifier->user_data?

Or better yet, if we actually do need user_data, then why pass the grpc_tls_certificate_verifier_external struct to these methods at all? Why not just pass user_data instead, since that's the only field from the struct that the implementations will ever actually use?

Whichever of these two arguments we keep should be the first argument to all three methods, since they're basically the "self" pointer.

Sure, I updated the user_data to be the first parameter and removed the grpc_tls_certificate_verifier_external parameter.


include/grpc/grpc_security.h, line 993 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We need to explicitly document the requirement that the implementation of this method must not invoke the async callback in the thread in which it's invoked before it returns. That can lead to deadlocks.

SG. I added some lines at the beginning of this comment.


include/grpc/grpc_security.h, line 1004 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/connetion/connection/

Done.


include/grpc/grpc_security.h, line 1009 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We need to document that this will be the same request object that was passed to verify(), and that it tells the verifier which request to cancel.

Done.


include/grpc/grpc_security.h, line 1046 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Do we also need to declare the one for the SAN verifier? Or is that not included in this PR?

I will include the SAN changes in a later PR, since grpc_tls_custom_verification_check_request.peer_info struct is not complete at this moment(it needs some more plumbings from the TSI layer)


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 22 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This include should be moved down to after the absl include.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 47 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The internal API was supposed to take a struct that contained both this request struct and a ref to the security connector. Why didn't we do that?

We do need to make sure that we're holding a ref to the security connector until this request is complete.

Ah ok. Originally, the ref to the security connector was designed for a different purpose, and that purpose was not valid anymore based on the new implementation. But I think this(holding a ref until this request is complete) is also a valid reason. I've updated the code.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 48 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It occurs to me that since we're using a std::function<> both here in the internal API and in the C++-layer CertificateVerifier API, it would probably make more sense to have the callback return the status instead of trying to encode it in the request struct. Here, we can use std::function<absl::Status()>, and in the C++ layer, we can use std::function<grpc::Status()>. (We aren't currently allowed to use absl types in the C++ API.)

The only place we need to do something different is in the boundary between external verifiers and C-core. To solve that problem, we could pass the status code and message as two additional parameters to grpc_tls_on_custom_verification_check_done_cb.

Note that this approach would also eliminate the requirement to use gpr_malloc() for the error string, because now C-core will not take ownership of it. Instead, it can copy the string internally before the callback returns.

Currently the callback will only be invoked when executed in the async mode(when we return true for the Verify() function). Just to make sure I understand, we will need to also invoke the callback in sync mode if we change this way, right?


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 51 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It looks like the problem is that the grpc_security_connector API does not have a cancel_check_peer() method, so we'll need to add that. The method should look like this:

// Cancels the pending check_peer() request associated with on_peer_checked.
// If there is no such request pending, this is a no-op.
virtual void cancel_check_peer(grpc_closure* on_peer_checked) = 0;

In TlsChannelSecurityConnector, we'll need a map of pending requests, similar to the one in ExternalCertificateVerifier, but keyed by on_peer_checked. The map value will be a struct containing the grpc_tls_custom_verification_check_request struct, a ref to the security connector, and any other state we may need. When cancel_check_peer() is called, we'll look up the closure in the map to find the corresponding request struct, so that we can invoke the verifier's Cancel() method.

Then we'll need to change the security handshaker to call cancel_check_peer() when it gets cancelled or times out.

Sure. I will make another PR for this change.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 55 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This does not belong in this class. The verifier is not responsible for cleaning up the request object, because the request object needs to be owned by the security connector.

Done. Putting the init and destroy function to the security connector.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 74 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This needs to be done in the implementation's destruct() method, not here inside of C-core. This is because on Windows, each DLL has its own heap, and gRPC and the application may be in different DLLs. If the struct was allocated by the application, it can't be freed here.

Ah I see. I think we talked about this before when designing the error_details field in grpc_tls_custom_verification_check_request. I updated the code based on the suggestions.
Just to confirm, in core API, that will mean we are actually not able to ask the callers to pass their ownership and we later clean up the resource for them, without the help of gpr_* functions, right? From my understanding this kind of ownership design will always have similar problems. We will need to either ask them to allocate the object in a common grpc space trough gpr_*, or ask them to write a clean-up callback function(just like the way we are doing here).


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 90 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for grpc_core:: prefix, since we're already in that namespace.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 103 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for this, since the section is empty.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 75 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think this is necessary. This method should always be invoked from within core.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 92 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We do need to declare an ExecCtx at the start of this method, because this will be called from outside of core.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 100 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think we should actually invoke the callback until after we've released the lock, because once we've invoked the callback, the callback will give up the ref it's holding on the security connector, so the security connector might release its ref on the verifier and destroy the verifier. If the verifier is destroyed before we try to release the lock, that will be a use-after-free bug.

Sure. I switched the order between this line and its last line.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 106 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why not do that?

Ah, I forgot to mention that the code was copied and rewrote based on the old verification logic. There is one slight difference between the old and the new: the old will treat ".com" as an invalid case while the new will treat it as valid(while both the old and the new will treat "." as an invalid case). I kept the old logic since I thought it might not be safer from a security's point of view(since it will be very rare and unsafe to have a cert covering the whole root domain), but I then realized we could also change the new logic if we want to match the old one. So I updated the code to use the new verify logic.
Do you want me to update the new logic to be a bit more strict? I am ok with either way.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 159 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use C++-style comments.

Same thing throughout.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 176 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why are we checking the SAN names here? I thought that was going to be a different verifier implementation.

For the default hostname check verification, we will need the specific DNS type in the SAN names, and is performed without caller explicitly specifying anything. The SAN name verifier that we planned to add later is to allow callers to specify a list of SAN names, and we try to match against that list. It will cover more SAN types, such as DNS, URI, IP address, etc.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 192 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This allocation is unnecessary. Instead, just change the check below to:

if (allocated_name == ip_name) {

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 217 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This function will be an entry point to C-core, so it will need to declare an ExecCtx at the start of the function.

Same for the next 3 functions.

Done. I noticed that we didn't do that for the provider core API. Do you want me to add that as well?


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 249 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think this comment is necessary.

Done.


src/core/lib/security/credentials/tls/grpc_tls_credentials_options.cc, line 48 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can just be options->set_verify_server_cert(verify_server_cert). The int will be automatically converted to a bool.

Done.


src/core/lib/security/credentials/tls/tls_credentials.cc, line 57 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Let's log something here indicating that we're defaulting to the hostname verifier.

Done.


src/core/lib/security/credentials/tls/tls_credentials.cc, line 65 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this can just be removed.

Done.


src/core/lib/security/credentials/xds/xds_credentials.cc, line 74 at r2 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

Sorry I missed this earlier. Yes, we would ideally want to know whether the SAN has a DNS type or not. The SAN matcher should only be doing a DNS style verification (for example, wildcard matching) for DNS names.

Yeah, I am planning to deal with that later. Plumbing through the SAN type across TSI is a bit tricky here. We can do some tricks in the tsi to plumb the type, but that way is not very ideal.
(update: Ashitha has a PR to plumb DNS #25874, which should meet the requirement here. While for custom verification, we will need all other fields.)


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 211 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please remove.

Sorry, I commented out this line for debugging purpose, and forgot to make it back.
I think for channel security connector, at this point, there should always be a verifier, because for those who didn't specify the verifier we would use the default host name verifier; I removed this line for the server security connector though since we allow not specifying a verifier on the server side.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 222 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As I mentioned elsewhere, the security connector will need to maintain a map of pending check_peer() requests. If you store the TSI peer object in that map entry, then we may not need to copy any of these string fields into the cert verifier request structure. Instead, they can just be pointers into the TSI peer object, which we know will live until after the request has completed. (Although we may need to make copies anyway in cases where we need to add nul termination.)

Sorry, I didn't see anywhere else you wanted me to have a map of pending check_peer() requests in the security connector. Did I miss it?
In my original design, I made a copy of every peer information string in grpc_tls_custom_verification_check_request because tsi_peer is a bit hard to parse and extract information from(from the user's perspective).
grpc_tls_custom_verification_check_request is in the core API(grpc_security.h). If we store TSI peer pointers directly in grpc_tls_custom_verification_check_request, that will mean the callers will need to extract information from tsi_peer*. Are we OK with that?


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 297 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We don't need this here. Instead, it should be in ExternalCertificateVerifier::OnVerifyDone(), as I suggested elsewhere.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 298 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

In this case, I think it would actually be safe to just use Closure::Run() instead of ExecCtx::Run().

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 322 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Yes, any field that affects the behavior of the security connector needs to be checked here. Failure to do so can cause a security problem, because two security connectors that differ based on a field not checked here will be considered identical, which means that in a channel that's supposed to use one value, we may reuse a subchannel with a different value.

Sounds good. I updated the code to compare credential values and pointers stored as class members. We have one grpc_core::RefCountedPtr<grpc_tls_credentials_options> left though. It seems we didn't define compare() function for grpc_tls_credentials_options. Is it ok to not compare it?


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Hmm... This is an interesting question.

The check_call_host() check is a per-call check, not a per-connection check. gRPC has a seldom-used feature to support virtual hosting, where each individual call on a channel can have a different host associated with it. So this check is intended to verify that the host specified for the individual call is covered by the cert that the server presented.

Previously, this check was inhibited if the connection-time verification did not check the hostname, because if the hostname was not verified upon connection, then it also doesn't need to be checked for the individual calls.

We may need to add some way for the verifier to tell us whether it did hostname verification. We could then store that state in the auth context, so that we can use that to decide what to check here.

Thanks for the clarification!
Just to make sure I understand, here we will need to do this per-call check whenever we did the per-connection hostname verification, right?
We can access the verifier in this check_call_host(). Maybe we can add a function in the verifier to indicate if we are using hostname verification? Or maybe it's even better to expose a general-purpose function telling whether this kind of per-call check is needed?


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 339 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No matter what we do in check_call_host(), we won't be doing async work, so I don't think we need anything here.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 505 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same comments as I had for the client side apply here as well.

Done.


src/core/lib/security/transport/security_handshaker.cc, line 422 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Commenting this out is clearly incorrect: this code needs to be holding this lock, or it is not thread-safe.

Looking at the mutex deadlock log you shared, it looks like the problem might be that two threads are taking two locks in the opposite order, although it's hard to tell from the output exactly which two mutexes. You might try running under tsan to see if it will give you more information.

Running TSAN just gave me a bunch of tests that failed because of timeout...
I think the problem is because we are holding one lock and trying to acquire it again in the same thread. I guess grpc_core::mutex is not a re-entrant lock? I examined the code in security_handshaker.cc, and found the call trace is roughly like this:

OnHandshakeDataSentToPeerFn -> CheckPeerLocked -> on_peer_checked -> OnPeerCheckedFn -> OnPeerCheckedInner
OnHandshakeDataReceivedFromPeerFn -> DoHandshakeNextLocked -> OnHandshakeNextDoneLocked -> CheckPeerLocked -> on_peer_checked -> OnPeerCheckedFn -> OnPeerCheckedInner

We are trying to hold the same lock in OnHandshakeDataSentToPeerFn/OnHandshakeDataReceivedFromPeerFn and OnPeerCheckedInner. Commenting either the former ones or the latter will make the problem go away. I've updated the code to not acquire lock in OnPeerCheckedInner. Is it ok to do that?

(Or I think a better solution might be to have finer granularity of locks, since it seems the former ones and the latter need the lock for protecting different members(right now we only have one mutex for the whole handshaker class). I tried to update with finer-grained locks multiple times, but always ended up having some threading issues(But I didn't look very carefully on what are the causes). I am wondering if this is a direction that worths trying?)


test/core/end2end/generate_tests.bzl, line 56 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please restore all of these deletions. If your goal is to just run one particular test suite, you can do that by selecting just the test you want on the command line when you invoke bazel. You don't need to change the build config this way.

Done.


test/core/end2end/fixtures/h2_tls.cc, line 82 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Let's structure these the way we would expect them to be structured in a real-world example:

class SyncExternalVerifier {
 public:
  SyncExternalVerifier()
      : base_{Verify,Cancel,Destruct} {}

  grpc_tls_certificate_verifier_external* base() const { return &base_; }

 private:
  static int Verify(
      grpc_tls_certificate_verifier_external* external_verifier,
      grpc_tls_custom_verification_check_request* request,
      grpc_tls_on_custom_verification_check_done_cb callback, void* callback_arg,
      void* user_data)  {
    request->status = GRPC_STATUS_OK;
    return false;
  }

  static void Cancel(
      grpc_tls_certificate_verifier_external* external_verifier,
      grpc_tls_custom_verification_check_request* request,
      void* user_data)  {}

  static void Destruct(
      grpc_tls_certificate_verifier_external* external_verifier,
      void* user_data)  {
    auto* self = static_cast<SyncExternalVerifier*>(user_data);
    delete self;
  }

  grpc_tls_certificate_verifier_external base_;
};

Then we can use it like this:

auto* sync_verifier = new SyncExternalVerifier();
ffd->client_verifier =
    grpc_tls_certificate_verifier_external_create(sync_verifier->base());

Thanks for the suggestion! The code is much more cleaner now!


test/core/end2end/fixtures/h2_tls.cc, line 93 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same thing for the async verifier. ThreadArgs, AsyncExternalVerifierVerifyCb(), and AsyncExternalVerifierVerify() should all be in the same class.

Done.


test/core/end2end/fixtures/h2_tls.cc, line 431 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should be const. Also, please name it kFeatureMask.

Done.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 51 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same comment as in h2_tls.cc: Each verifier implementation should be in its own class.

Done.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 93 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As I mentioned in one of the other tests, this seems unnecessarily complicated. I think it will make much more sense to just have a verifier class that spawns a thread for each request, or perhaps has a dedicated worker thread that processes requests out of a queue populated by the verifier.

Done.


test/core/security/grpc_tls_credentials_options_test.cc, line 65 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same comment as in h2_tls.cc: Each verifier implementation should be in its own class.

Also, please consider having a separate library to provide these verifier implementations rather than duplicating them in so many places.

Done.


test/core/security/grpc_tls_credentials_options_test.cc, line 114 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is going to cause a deadlock, and it may be related to the mutex problem you were seeing. The API requirement for a verifier is that if it is going to return asynchronously, it can't invoke the callback until after it has returned from verify(), because verify() is called while holding the lock that the callback will try to reacquire. Normally, spawning another thread is fine, because that thread will block until verify() returns and the lock is released. But here you are joining the thread you just spawned, which means that verify() won't return until after the call is finished, which will never happen.

Sounds good. Updated to use the verifiers used in other tests.


test/core/security/grpc_tls_credentials_options_test.cc, line 569 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Shouldn't we also have a variant of this test for the server side?

Done.


test/core/security/grpc_tls_credentials_options_test.cc, line 569 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest just calling this ClientOptionsWithExternalVerifier.

Done.


test/core/security/grpc_tls_credentials_options_test.cc, line 587 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this ServerOptionsWithVerifierButDontRequestClientCert.

The behavior I wanted to test here is that the security connector can be successfully created even when fed in with a bad verifier(since the verifier's check function is not executed during the security connector creation). Is this too obvious that we don't actually need a test?
I think ideally, we will need tests for sync good verifier, sync bad verifier, async good verifier, async bad verifier for both client and server side. That made the tests super long, so I originally chose client sync good verifier and server bad async verifier to test. Do you want me to add all of them?


test/core/security/tls_security_connector_test.cc, line 96 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As mentioned in the other tests, each verifier should be in its own class.

Done.


test/core/security/tls_security_connector_test.cc, line 117 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As per C++ naming rules, this should be called GprcClosureCb().

Although in this case, a better name would be VerifyExpectedErrorCallback().

Done. Updated to VerifyExpectedErrorCallback.


test/core/security/tls_security_connector_test.cc, line 118 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of passing in the expected error as a grpc_error*, how about passing it in as a const char*? If it's null, you can expect GRPC_ERROR_NONE; otherwise, you can check the error's description against the expected string.

Done.


test/core/security/tls_security_connector_test.cc, line 122 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

STREQ is needed only if the first argument is a C-style string. You can just use EQ for C++ string objects (including both std::string and absl::string_view).

Done.


test/core/security/tls_security_connector_test.cc, line 128 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can return an absl::string_view to avoid making an unnecessary copy.

Done.


test/core/security/tls_security_connector_test.cc, line 146 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is really confusing. Why do we need this level of indirection? Why can't we simply have a verifier object that spawns a thread for each request?

Done with using the new testing helper classes.


test/core/security/tls_security_connector_test.cc, line 515 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please replace SC with SecurityConnector in all of these test names. More understandable is better, even if that means longer.

Done.


test/core/security/tls_security_connector_test.cc, line 515 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

What does GoodRequestParam mean? It's not clear to me which parameter these names are referring to.

Yeah let's just remove it.


test/core/security/tls_security_connector_test.cc, line 542 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for this local variable; just pass GRPC_ERROR_NONE directly to GRPC_CLOSURE_CREATE().

Done.


test/core/security/tls_security_connector_test.cc, line 592 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this cast is unnecessary.

It seems this is needed, otherwise the compiler will complain with the message:

candidate function not viable: no known conversion from 'const char *' to 'void *' for 4th argument; take the address of the argument with &
inline grpc_closure* grpc_closure_create(const char* file, int line,

(I've changed expected_error from type grpc_error* to const char*)


test/core/security/tls_security_connector_test.cc, line 595 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can be removed. The callback function is doing this, so it's not needed here.

Done.


test/core/security/tls_security_connector_test.cc, line 600 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The value doesn't actually matter here, so there's no point in checking it. We don't even need the value variable in the first place.

Done.


test/core/security/tls_security_connector_test.cc, line 602 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of (*obj).Method(), please say obj->Method().

Done.


test/core/security/tls_security_connector_test.cc, line 608 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest just calling this ChannelSecurityConnectorHostnameVerifierSucceeds.

Done.


test/core/security/tls_security_connector_test.cc, line 658 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest just calling this ChannelSecurityConnectorHostnameVerifierFails.

Done.


test/core/security/tls_security_connector_test.cc, line 880 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add one more line with // after this.

Done.

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 19 files reviewed, 68 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, and @ZhenLian)


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 222 at r2 (raw file):

Previously, ZhenLian wrote…

Sorry, I didn't see anywhere else you wanted me to have a map of pending check_peer() requests in the security connector. Did I miss it?
In my original design, I made a copy of every peer information string in grpc_tls_custom_verification_check_request because tsi_peer is a bit hard to parse and extract information from(from the user's perspective).
grpc_tls_custom_verification_check_request is in the core API(grpc_security.h). If we store TSI peer pointers directly in grpc_tls_custom_verification_check_request, that will mean the callers will need to extract information from tsi_peer*. Are we OK with that?

Oh, for the first part, I just saw the place where you wanted to me have a map in the security connector, as I was adding a cancel_check_peer function in another PR. So I am now clear on that part.
I still have some concerns on exposing tsi_peer* directly in grpc_security.h. Hope I can get some clarifications on that. Thank you so much!

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 19 files reviewed, 68 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, and @ZhenLian)


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 51 at r2 (raw file):

virtual void cancel_check_peer(grpc_closure* on_peer_checked) = 0;
BTW, I noticed that the cancel_check_host function was defined with an extra grpc_error param.
Maybe we can also have something like:
virtual void cancel_check_peer(grpc_closure* on_peer_checked, grpc_error* error) = 0;?

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moving in the right direction!

Please let me know if you have any questions. Thanks!

Reviewed 9 of 14 files at r3, 7 of 7 files at r4.
Reviewable status: all files reviewed, 41 unresolved discussions (waiting on @jiangtaoli2016 and @ZhenLian)


include/grpc/grpc_security.h, line 924 at r2 (raw file):

Previously, ZhenLian wrote…

Done. I added one CertificateVerificationRequestInit function in grpc_tls_certificate_verifier.h for the set up work.

Still need to remove the = nullptr or = 0 on all of the fields here. That's not legal in C89.


include/grpc/grpc_security.h, line 975 at r2 (raw file):

Previously, ZhenLian wrote…

Yeah, this is mainly used for the tests to pass some extra data when constructing grpc_tls_certificate_verifier_external, and then retrieve it in verify or destruct functions.
For example in grpc_tls_certificate_verifier_test.cc, I want to start a thread in the verify function and return immediately(kind of like simulating how the async mode works), and later in the destruct or main thread to join this thread to avoid memory leaks. I don't have a good idea how to achieve this using the old design, since there is no way to make a connection among verify, destruct and the main thread. So my solution is to have this user_data to be the bridge of three of them. In that test, we will create a dummy thread memory in main thread, and pass it through user_data; in verify, we will replace that thread memory with the actual memory we want to run, and then when back to the main thread we will join that thread. This is a bit ugly but is the best way I could think of.
And it becomes even more useful when it comes to e2e tests. in h2_tls.cc, we are passing fullstack_secure_fixture_data to the user_data so that we can access it in any of the member functions of the external verifier.

I don't see why any of this requires a separate user_data pointer. Anything you can do by having user_data here you can also do without user_data by storing that same pointer in the wrapping-layer object that contains the grpc_tls_verifier_external struct.

That having been said, I'm fine with this approach. It does allow the caller to not have to play pointer games with the grpc_tls_certificate_external_verifier struct being the first data member of the wrapping-layer object.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 48 at r2 (raw file):

Previously, ZhenLian wrote…

Currently the callback will only be invoked when executed in the async mode(when we return true for the Verify() function). Just to make sure I understand, we will need to also invoke the callback in sync mode if we change this way, right?

No, it won't work to use the callback in sync mode, because the callback function would not know whether it was being called synchronously or asynchronously, and its behavior would need to be different in the two cases (e.g., in terms of avoiding mutex deadlocks and the need to create an ExecCtx).

Instead, I think we would need to add an absl::Status* output parameter to the function. For sync returns, the verifier will return the status via the output parameter and return true; for async returns, it will return false and then pass the status via the callback later.

Note that this is consistent with what we do in existing APIs such as grpc_channel_security_connector::check_call_host() (although that uses grpc_error instead of absl::Status).


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 51 at r2 (raw file):

Previously, ZhenLian wrote…

virtual void cancel_check_peer(grpc_closure* on_peer_checked) = 0;
BTW, I noticed that the cancel_check_host function was defined with an extra grpc_error param.
Maybe we can also have something like:
virtual void cancel_check_peer(grpc_closure* on_peer_checked, grpc_error* error) = 0;?

Yes, passing in the error for cancellation is a good idea. Looks like you're doing that in #25941.

Is the plan here is to merge #25941 first, then merge those changes into this PR? Or will you handle the cancellation code in a third PR that depends on both this PR and #25941?


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 74 at r2 (raw file):

Previously, ZhenLian wrote…

Ah I see. I think we talked about this before when designing the error_details field in grpc_tls_custom_verification_check_request. I updated the code based on the suggestions.
Just to confirm, in core API, that will mean we are actually not able to ask the callers to pass their ownership and we later clean up the resource for them, without the help of gpr_* functions, right? From my understanding this kind of ownership design will always have similar problems. We will need to either ask them to allocate the object in a common grpc space trough gpr_*, or ask them to write a clean-up callback function(just like the way we are doing here).

Yes. To say this another way, any time something needs to be allocated by the application and then owned by core, there are only two options:

  1. Require the caller to allocate it via core using gpr_malloc(), so that core can internally delete it via gpr_free().
  2. Allow the caller to allocate it externally however they want, and provide a destruction callback so that core can tell the caller when to destroy it.

Option 1 does the allocation and deallocation on the core side; option 2 does the allocation and deallocation on the caller side. It doesn't really matter which side they happen on, as long as they both happen on the same side.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 45 at r4 (raw file):

  }

  CertificateVerificationRequest(RefCountedPtr<grpc_security_connector> sc)

Single-argument ctors need to be declared explicit, as per:

https://google.github.io/styleguide/cppguide.html#Implicit_Conversions


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 56 at r4 (raw file):

  // This is needed because C-core API doesn't allow default data member
  // initialization. It should be called every time when a request is created.
  static void CertificateVerificationRequestInit(

Why do we need these as separate static methods? Can't we just do this initialization and destruction in the ctor and dtor directly?


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 82 at r4 (raw file):

  virtual ~grpc_tls_certificate_verifier() = default;
  // Verifies the specific request. It can be processed in sync or async mode.
  // If the caller want it to be processed asynchronously, return true and

I think this is actually backwards relative to our normal convention. For most of the other APIs we have where a function can return either synchronously or asynchronously, it returns true for the sync case, not the async case (e.g., grpc_channel_security_connector::check_call_host()). The idea is that the caller should be able to do something like:

bool is_done = verifier->Verify(&request, callback);
if (is_done) {
  // ...process result...
} else {
  // ...wait for callback...
}

So let's flip this, to be consistent with what we do elsewhere.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 100 at r2 (raw file):

Previously, ZhenLian wrote…

Sure. I switched the order between this line and its last line.

I think you misunderstood. The problem here is not the ordering of invoking the callback and removing the entry from the map. The problem is the ordering of invoking the callback and releasing the lock.

In other words, the code needs to do something like this:

std::function<void()> callback;
{
  MutexLock lock(&self->mu_);
  auto it = self->request_map_.find(internal_request);
  if (it != self->request_map_.end()) {
    callback = std::move(it->second);
    self->request_map_.erase(it);
  }
}
if (callback != nullptr) callback();

That way, lock goes out of scope before we invoke the callback.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 106 at r2 (raw file):

Previously, ZhenLian wrote…

Ah, I forgot to mention that the code was copied and rewrote based on the old verification logic. There is one slight difference between the old and the new: the old will treat ".com" as an invalid case while the new will treat it as valid(while both the old and the new will treat "." as an invalid case). I kept the old logic since I thought it might not be safer from a security's point of view(since it will be very rare and unsafe to have a cert covering the whole root domain), but I then realized we could also change the new logic if we want to match the old one. So I updated the code to use the new verify logic.
Do you want me to update the new logic to be a bit more strict? I am ok with either way.

I think the markdown formatting didn't work correctly for part of what you wrote here (probably it interpretted the * character -- try putting that character in backquotes instead), so I don't quite understand the question.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 217 at r2 (raw file):

Previously, ZhenLian wrote…

Done. I noticed that we didn't do that for the provider core API. Do you want me to add that as well?

Yeah, all functions declared in the include tree should declare an ExecCtx as the first line of the function. But please do that in a separate PR.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 113 at r4 (raw file):

  CertificateVerificationRequest* internal_request =
      reinterpret_cast<CertificateVerificationRequest*>(request);
  grpc_core::MutexLock lock(&self->mu_);

No need for grpc_core:: prefix here, since we're already in that namespace.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 118 at r4 (raw file):

    std::function<void()> callback = std::move(it->second);
    callback();
    self->request_map_.erase(it->first);

This should just use it instead of it->first. No need to lookup the key again when we already have an iterator pointing right at the entry we want to erase.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 124 at r4 (raw file):

namespace internal {

static bool VerifyIdentityAsHostname(absl::string_view identity_name,

If this function is just calling the existing VerifySubjectAlternativeName() function, why do we need this wrapper function at all? Why not just have callers invoke VerifySubjectAlternativeName() directly instead of calling this?


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 124 at r4 (raw file):

namespace internal {

static bool VerifyIdentityAsHostname(absl::string_view identity_name,

If we do keep this function for some reason, please use an anonymous namespace instead of declaring it static.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 128 at r4 (raw file):

  // We are using the target name sent from the client as a matcher to match
  // against identity name on the peer cert.
  return grpc_core::VerifySubjectAlternativeName(identity_name,

No need for grpc_core:: prefix.


src/core/lib/security/credentials/xds/xds_credentials.cc, line 74 at r2 (raw file):

Previously, ZhenLian wrote…

Yeah, I am planning to deal with that later. Plumbing through the SAN type across TSI is a bit tricky here. We can do some tricks in the tsi to plumb the type, but that way is not very ideal.
(update: Ashitha has a PR to plumb DNS #25874, which should meet the requirement here. While for custom verification, we will need all other fields.)

Okay. I'll resolve this comment on the assumption that we'll clean this up in a follow-up PR.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 75 at r4 (raw file):

  // This is needed because C-core API doesn't allow default data member
  // initialization. It should be called every time when a request is created.
  static void CertificateVerificationRequestInit(

It doesn't look like these are actually defined anywhere, and the declarations look like duplicates of the methods you have in the CertificateVerificationRequest struct. I think these should probably just be removed.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 214 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think we actually need to allocate a copy of this particular string. We're going to need to hold a ref to the security connector until this verification request is completed, and target_name points into a member of the security connector, so we know it will continue to exist until after the request is complete. So I think we can just pass a pointer to the existing string to the verifier.

Doesn't look like this was addressed.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 222 at r2 (raw file):

Previously, ZhenLian wrote…

Oh, for the first part, I just saw the place where you wanted to me have a map in the security connector, as I was adding a cancel_check_peer function in another PR. So I am now clear on that part.
I still have some concerns on exposing tsi_peer* directly in grpc_security.h. Hope I can get some clarifications on that. Thank you so much!

I am not actually proposing that we expose tsi_peer in the C-core API, only that we use it as the backing store for these strings to avoid making copies. In other words, I think we should use a data structure similar to this:

struct PendingVerifierRequest {
  CertificateVerificationRequest request;
  tsi_peer peer;
};
std::map<grpc_closure* /*on_peer_checked*/, PendingVerifierRequest*> pending_verifier_requests_;

There will be one entry in the map for each pending request. In the entry for a given request, we can populate the fields of entry.request to point into the strings stored in entry.peer. We won't destroy the tsi_peer until after the request completes. This will avoid the need for making copies of any field stored in the tsi_peer.

Hmmm.... Actually, looking at this struct, it occurs to me that we should probably combine all of the state we need for a pending request into one struct. Currently, we have some state in CertificateVerificationRequest and some state in pre-bound parameters for the lambda callback, and if we add this PendingVerifierRequest struct, we would be adding yet a third place to store state for the same thing. Instead, we could combine all of this state into one object, using something like this:

class PendingVerifierRequest {
 public:
  PendingVerifierRequest(
      RefCountedPtr<TlsChannelSecurityConnector> security_connector,
      grpc_closure* on_peer_checked, tsi_peer peer)
      : security_connector_(std::move(security_connector)),
        on_peer_checked_(on_peer_checked),
        peer_(peer) {
    // ...initialize request_ based on data in security_connector_ and peer_...
  }

  ~PendingVerifierRequest() {
    tsi_peer_destruct(&peer_);
  }

  void Start() {
    grpc_tls_certificate_verifier* verifier = security_connector_->options_->certificate_verifier();
    bool is_async = verifier->Verify(&request_, [this]() { OnVerifyDone(true); });
    if (!is_async) OnVerifyDone(false);
  }

 private:
  void OnVerifyDone(bool run_callback_inline) {
    {
      MutexLock lock(&security_connector_->mu_);
      security_connector_->pending_verifier_requests_.erase(on_peer_checked_);
    }
    grpc_error* error = GRPC_ERROR_NONE;
    if (request_.status != GRPC_STATUS_OK) {
      error = GRPC_ERROR_CREATE_FROM_COPIED_STRING(
          absl::StrCat("Custom verification check failed with error: ",
                       request_.error_details)
              .c_str());
    }
    if (run_callback_inline) {
      Closure::Run(DEBUG_LOCATION, on_peer_checked_, error);
    } else {
      ExecCtx::Run(DEBUG_LOCATION, on_peer_checked_, error);
    }
    delete this;
  }

  RefCountedPtr<TlsChannelSecurityConnector> security_connector_;
  grpc_closure* on_peer_checked_;
  tsi_peer peer_;
  grpc_tls_custom_verification_check_request request_;
};

Then the code here becomes something like this:

auto* pending_request = new PendingVerifierRequest(Ref(), on_peer_checked, peer);
{
  MutexLock lock(&security_connector_->mu_);
  pending_verifier_requests_.emplace(on_peer_checked, pending_request);
}
pending_request->Start();

Note that this would mean that the methods of grpc_tls_certificate_verifier would go back to taking a parameter of type grpc_tls_custom_verification_check_request* instead of CertificateVerificationRequest.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 322 at r2 (raw file):

Previously, ZhenLian wrote…

Sounds good. I updated the code to compare credential values and pointers stored as class members. We have one grpc_core::RefCountedPtr<grpc_tls_credentials_options> left though. It seems we didn't define compare() function for grpc_tls_credentials_options. Is it ok to not compare it?

The goal here is to compare the set of fields that affect the configured behavior of the security connector, not the current data that it happens to have as a result of that configured behavior.

For example, each security connector is configured to use a specific certificate provider instance, and each security connector will contain the certificate data that it was most recently given by that certificate provider. We want to compare based on the configured certificate provider, not based on the certificate data, for two reasons:

  1. At one particular moment in time, two security connectors that are using the same certificate provider might actually have different certificate data, because we might happen to check at a moment when the certificate provider has just finished updating one of the security connectors but has not yet updated the second one. If we compare based on certificate data, then the two would compare as not being equal, even though they are functionally the same and should compare as equal.

  2. At one particular moment in time, two security connectors that are using different certificate providers might happen to have the same certificate data if both certificate providers happen to return the same certs at one point. However, if the certificate providers are different, then one of them could return different data in the future. So if we compare the two security connectors as equal based on the certificate data, it could actually lead to a security problem where we wind up using the wrong security connector for a given subchannel, which means we would use the wrong certificate data.

So basically, I think we do want to compare most of the fields in grpc_tls_credentials_options, and we don't want to compare currently cached data such as pem_root_certs_ and pem_key_cert_pair_list_.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

Previously, ZhenLian wrote…

Thanks for the clarification!
Just to make sure I understand, here we will need to do this per-call check whenever we did the per-connection hostname verification, right?
We can access the verifier in this check_call_host(). Maybe we can add a function in the verifier to indicate if we are using hostname verification? Or maybe it's even better to expose a general-purpose function telling whether this kind of per-call check is needed?

I'm honestly not that enthusiastic about any of the options here. The information we're trying to expose here is actually very specific to the hostname verifier, so pretty much any way of exposing it in the verifier API seems somewhat hacky to me.

Right off the cuff, I can think of the following approaches:

  1. We could go with the option you suggested of having a method for the verifier to indicate if we are using hostname verification. The down-side of that is that this method will be a no-op for the vast majority of verifiers, but any verifier that delegates to another one will have to implement the delegation for this method, so it adds a fair amount of work where it's not really needed.

  2. Instead of adding a new method, we could add a new output from the Verify() method (passed to callback for async verifiers or returned from Verify() for sync verifiers). This might be a little less ugly, since it would not require almost every verifier implementation to implement a new method they don't care about, but it would still require every implementation to pass along another parameter.

  3. Similar to the previous option, we could indicate this from Verify(), but instead of using an output parameter, we could just have a mutable boolean field in the request struct, and the hostname verifier can set that field directly. This would not require any changes in any other verifiers, but it would mean that we'd have to go back to a mode where the request struct is modified by the verifier implementation instead of being immutable. I really prefer keeping the request struct immutable (which it will be if you take my suggestion elsewhere to pass the status to the callback instead of storing it in the request struct), because that makes it much clearer that the request struct is all input parameters and the parameters passed to the callback (or returned from Verify() in the sync case) are output parameters.

  4. We could use a hack that is specific to the hostname verifier that is not exposed to the API in any way. For example, we could have a boolean field in the PendingVerifierRequest class (see my suggestion above about that), and we could have the hostname verifier reach inside that class (maybe by using the trick where the request struct is the first data member of PendingVerifierRequest) and set the boolean field. The problem with this approach is that it's a hack that's specific to our internal implementation of the hostname verifier; it means that if someone wanted to implement their own hostname verification as an external verifier, there would be no way for them to enable this per-call check with that verifier.

Of these options, I guess option 2 seems the least bad to me. But feel free to suggest other alternatives.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 205 at r4 (raw file):

  grpc_error* error = grpc_ssl_check_alpn(&peer);
  if (error != GRPC_ERROR_NONE) {
    Closure::Run(DEBUG_LOCATION, on_peer_checked, error);

I don't think it's safe to run this callback inline. This will cause a deadlock if the security handshaker is holding the lock when it calls check_peer(), because the on_peer_checked callback will wind up trying to acquire the same lock. (This may be what's causing the TSAN problems you're seeing.)

See my comment in the security handshaker code for an alternative approach. But if you don't take that alternative approach, then I think this needs to go back to using ExecCtx::Run() instead of Closure::Run().


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 318 at r4 (raw file):

  }
  delete internal_request;
  Closure::Run(DEBUG_LOCATION, on_peer_checked, error);

Same problem here as on line 205 above: this callback needs to use ExecCtx::Run() instead of Closure::Run(), unless you take the alternative approach I mention in the security handshaker code.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 521 at r4 (raw file):

  grpc_error* error = grpc_ssl_check_alpn(&peer);
  if (error != GRPC_ERROR_NONE) {
    Closure::Run(DEBUG_LOCATION, on_peer_checked, error);

Same problem as on line 205 above: This needs to use ExecCtx::Run() instead of Closure::Run() unless you take the alternative approach I describe in the security handshaker code.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 632 at r4 (raw file):

  }
  delete internal_request;
  Closure::Run(DEBUG_LOCATION, on_peer_checked, error);

Same problem as on line 205 above: This needs to use ExecCtx::Run() instead of Closure::Run() unless you take the alternative approach I describe in the security handshaker code.


src/core/lib/security/transport/security_handshaker.cc, line 422 at r2 (raw file):

Previously, ZhenLian wrote…

Running TSAN just gave me a bunch of tests that failed because of timeout...
I think the problem is because we are holding one lock and trying to acquire it again in the same thread. I guess grpc_core::mutex is not a re-entrant lock? I examined the code in security_handshaker.cc, and found the call trace is roughly like this:

OnHandshakeDataSentToPeerFn -> CheckPeerLocked -> on_peer_checked -> OnPeerCheckedFn -> OnPeerCheckedInner
OnHandshakeDataReceivedFromPeerFn -> DoHandshakeNextLocked -> OnHandshakeNextDoneLocked -> CheckPeerLocked -> on_peer_checked -> OnPeerCheckedFn -> OnPeerCheckedInner

We are trying to hold the same lock in OnHandshakeDataSentToPeerFn/OnHandshakeDataReceivedFromPeerFn and OnPeerCheckedInner. Commenting either the former ones or the latter will make the problem go away. I've updated the code to not acquire lock in OnPeerCheckedInner. Is it ok to do that?

(Or I think a better solution might be to have finer granularity of locks, since it seems the former ones and the latter need the lock for protecting different members(right now we only have one mutex for the whole handshaker class). I tried to update with finer-grained locks multiple times, but always ended up having some threading issues(But I didn't look very carefully on what are the causes). I am wondering if this is a direction that worths trying?)

As I mentioned in the security connector code, the problem is that you changed that code to invoke on_peer_checked from inside check_peer(), before that method returns. The current API requires that the security connector always invoke on_peer_checked after check_peer() returns, because otherwise it leads to exactly this deadlock: we are already holding the lock here in the security handshaker when we call check_peer(), and we need to acquire the same lock in the on_peer_checked callback (in OnPeerCheckedInner()). I noted 4 places in the security connector code where you are now invoking on_peer_checked via Closure::Run() instead of ExecCtx::Run(); if you change those back to ExecCtx::Run(), that should eliminate this new problem you've introduced.

An alternative approach here would be to change the security connector check_peer() API to return a bool indicating whether it is going to return synchronously or asynchronously, much like we're doing for the verifier's Verify() method; it would invoke the callback only in the async case. Then we can move the MutexLock lock(&mu_); line from OnPeerCheckedInner() into OnPeerCheckedFn(). In the sync case, we can directly invoke OnPeerCheckedInner(), since we're already holding the lock.


test/core/end2end/fixtures/h2_tls.cc, line 78 at r4 (raw file):

};

class SyncExternalVerifier {

Can this test just use the shared verifier implementations you've added in tls_utils instead of duplicating them here?


test/core/end2end/fixtures/h2_tls.cc, line 80 at r4 (raw file):

class SyncExternalVerifier {
 public:
  SyncExternalVerifier() : base_{(void*)this, Verify, Cancel, Destruct} {}

I don't think the cast to void* is needed here. But if it is, please use a C++-style cast.


test/core/end2end/fixtures/h2_tls.cc, line 117 at r4 (raw file):

  // This arg is used to carry information more than just the verifier self
  // pointer. The reason we don't store self in ffd and use ffd as user_data is

Why are we storing the thread list in ffd to begin with? Why don't we just store the threads here in the AsyncExternalVerifier object?

Actually, do we even need to store the threads at all? We're not supporting cancellation here, so why don't we just create a new non-joinable thread for each request? That way, there's nothing to clean up when the thread finishes processing the request -- the threads can just be fire-and-forget.


test/core/end2end/fixtures/h2_tls.cc, line 462 at r4 (raw file):

}

const static uint32_t kH2TLSFeatureMask =

Please use static const instead of const static.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 46 at r4 (raw file):

};

TEST_F(GrpcTlsCertificateVerifierTest, SyncExternalVerifierGood) {

For all of these tests, instead of "Good" and "Bad", how about saying "Succeeds" or "Fails"? That will make the intent much clearer.

Same thing in all of these test files.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 69 at r4 (raw file):

      &internal_request_, [] { gpr_log(GPR_INFO, "Callback is invoked."); });
  // Deleting the verifier will wait for the async thread to be completed.
  // We need to make sure it is completed before checking request's information.

Why not just check the result from inside the callback lambda?


test/core/security/grpc_tls_certificate_verifier_test.cc, line 81 at r4 (raw file):

      &internal_request_, [] { gpr_log(GPR_INFO, "Callback is invoked."); });
  // Deleting the verifier will wait for the async thread to be completed.
  // We need to make sure it is completed before checking request's information.

Same here.


test/core/security/grpc_tls_credentials_options_test.cc, line 569 at r2 (raw file):

Previously, ZhenLian wrote…

Done.

I don't see this change.


test/core/security/grpc_tls_credentials_options_test.cc, line 587 at r2 (raw file):

Previously, ZhenLian wrote…

The behavior I wanted to test here is that the security connector can be successfully created even when fed in with a bad verifier(since the verifier's check function is not executed during the security connector creation). Is this too obvious that we don't actually need a test?
I think ideally, we will need tests for sync good verifier, sync bad verifier, async good verifier, async bad verifier for both client and server side. That made the tests super long, so I originally chose client sync good verifier and server bad async verifier to test. Do you want me to add all of them?

I don't think we need to test that the security connector can be created when fed with a verifier that will fail requests. If we're not actually handling requests, then we're not actually testing the verifier.

I do think we should test all combinations of success/failure, sync/async, client/server. Please name the tests appropriately.


test/core/security/tls_security_connector_test.cc, line 592 at r2 (raw file):

Previously, ZhenLian wrote…

It seems this is needed, otherwise the compiler will complain with the message:

candidate function not viable: no known conversion from 'const char *' to 'void *' for 4th argument; take the address of the argument with &
inline grpc_closure* grpc_closure_create(const char* file, int line,

(I've changed expected_error from type grpc_error* to const char*)

Instead of (void*), please use const_cast<char*>(expected_error).

Same thing throughout.


test/core/security/tls_security_connector_test.cc, line 95 at r4 (raw file):

      EXPECT_EQ(error, GRPC_ERROR_NONE);
    } else {
      EXPECT_EQ(std::string(expected_error_msg), GetErrorMsg(error));

If you reverse the two parameters here, then there will be no need to make a copy by converting to std::string.


test/core/util/tls_utils.h, line 51 at r4 (raw file):

  SyncExternalVerifier(bool is_good);

  struct UserData {

Why is this needed? Why not just store is_good as a data member of SyncExternalVerifier?

If this is needed, please make it private.


test/core/util/tls_utils.h, line 85 at r4 (raw file):

  AsyncExternalVerifier(bool is_good, gpr_event* event_ptr);

  struct UserData {

Why is this needed? I think is_good and event_ptr should be data members of AsyncExternalVerifier. And we shouldn't need to store thread at all; just make the thread non-joinable, so it can be fire-and-forget.

If this is needed, it should be private.


test/core/util/tls_utils.h, line 94 at r4 (raw file):

  // This is the arg we will pass in when creating the thread, and retrieve it
  // later in the thread callback.
  struct ThreadArgs {

This should also be private.


test/core/util/tls_utils.cc, line 78 at r4 (raw file):

  user_data->self = this;
  user_data->is_good = is_good;
  base_.user_data = (void*)user_data;

No need for the cast here.


test/core/util/tls_utils.cc, line 110 at r4 (raw file):

  user_data->is_good = is_good;
  user_data->event_ptr = event_ptr;
  base_.user_data = (void*)user_data;

This cast is not needed.


test/core/util/tls_utils.cc, line 128 at r4 (raw file):

  thread_args->event_ptr = data->event_ptr;
  if (data->is_good) {
    data->thread = new grpc_core::Thread("AsyncExternalVerifierGoodVerify",

Instead of having a separate callback for the two cases, I suggest adding an is_good field to ThreadArgs and having a single callback that sets the result based on that field.


test/core/util/tls_utils.cc, line 143 at r4 (raw file):

  auto* data = static_cast<UserData*>(user_data);
  if (data->thread != nullptr) {
    data->thread->Join();

As I mentioned elsewhere, if you mark the thread as non-joinable, then this will not be necessary, and there will be no need to actually store a pointer to the thread anywhere.

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed review, Mark!
Please see my replies in comments. There is still one outstanding question(whether we want to make a copy of request field, or simply point to tsi_peer) that might change the core interface. I will work on the C++ implementation once that is addressed.

Reviewable status: 8 of 20 files reviewed, 41 unresolved discussions (waiting on @jiangtaoli2016 and @markdroth)


include/grpc/grpc_security.h, line 924 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Still need to remove the = nullptr or = 0 on all of the fields here. That's not legal in C89.

Sorry, I forgot to clean this up. Updated.


include/grpc/grpc_security.h, line 975 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't see why any of this requires a separate user_data pointer. Anything you can do by having user_data here you can also do without user_data by storing that same pointer in the wrapping-layer object that contains the grpc_tls_verifier_external struct.

That having been said, I'm fine with this approach. It does allow the caller to not have to play pointer games with the grpc_tls_certificate_external_verifier struct being the first data member of the wrapping-layer object.

Yeah, you are right. I think I was not used to design classes utilizing that first-data-member memory tricks. Having a user_data makes me understand things much easier :)


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 48 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No, it won't work to use the callback in sync mode, because the callback function would not know whether it was being called synchronously or asynchronously, and its behavior would need to be different in the two cases (e.g., in terms of avoiding mutex deadlocks and the need to create an ExecCtx).

Instead, I think we would need to add an absl::Status* output parameter to the function. For sync returns, the verifier will return the status via the output parameter and return true; for async returns, it will return false and then pass the status via the callback later.

Note that this is consistent with what we do in existing APIs such as grpc_channel_security_connector::check_call_host() (although that uses grpc_error instead of absl::Status).

OK, I think I get the idea. Just one more thing to confirm: is the parameter going to be an absl::Status**instead of absl::Status*(just like in check_call_host() we are using grpc_error**)?
I could be wrong, but my current understanding is like this: if the parameter is defined as absl::Status*, then it will mean the caller will only need to feed in an absl::Status on stack and the Verify(...., absl::Status*) function should finish before the stack variable goes out of scope; but in our case we cannot use a stack variable because of the async nature; we have to create one with new and in Verify(...., absl::Status**) clean up the old memory and change the pointer to point to a new address. Is that right?

(If absl::Status** is needed, then would grpc_error** be better? Since it seems there are very rare cases which use a pointer to a pointer toabsl::Status...)


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 51 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Yes, passing in the error for cancellation is a good idea. Looks like you're doing that in #25941.

Is the plan here is to merge #25941 first, then merge those changes into this PR? Or will you handle the cancellation code in a third PR that depends on both this PR and #25941?

Sounds good! I plan to merge #25941 first, and then merge those changes into this PR.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 74 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Yes. To say this another way, any time something needs to be allocated by the application and then owned by core, there are only two options:

  1. Require the caller to allocate it via core using gpr_malloc(), so that core can internally delete it via gpr_free().
  2. Allow the caller to allocate it externally however they want, and provide a destruction callback so that core can tell the caller when to destroy it.

Option 1 does the allocation and deallocation on the core side; option 2 does the allocation and deallocation on the caller side. It doesn't really matter which side they happen on, as long as they both happen on the same side.

Got it! Thank you so much for the great summary!


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 45 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Single-argument ctors need to be declared explicit, as per:

https://google.github.io/styleguide/cppguide.html#Implicit_Conversions

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 56 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why do we need these as separate static methods? Can't we just do this initialization and destruction in the ctor and dtor directly?

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 82 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this is actually backwards relative to our normal convention. For most of the other APIs we have where a function can return either synchronously or asynchronously, it returns true for the sync case, not the async case (e.g., grpc_channel_security_connector::check_call_host()). The idea is that the caller should be able to do something like:

bool is_done = verifier->Verify(&request, callback);
if (is_done) {
  // ...process result...
} else {
  // ...wait for callback...
}

So let's flip this, to be consistent with what we do elsewhere.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 100 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think you misunderstood. The problem here is not the ordering of invoking the callback and removing the entry from the map. The problem is the ordering of invoking the callback and releasing the lock.

In other words, the code needs to do something like this:

std::function<void()> callback;
{
  MutexLock lock(&self->mu_);
  auto it = self->request_map_.find(internal_request);
  if (it != self->request_map_.end()) {
    callback = std::move(it->second);
    self->request_map_.erase(it);
  }
}
if (callback != nullptr) callback();

That way, lock goes out of scope before we invoke the callback.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 106 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think the markdown formatting didn't work correctly for part of what you wrote here (probably it interpretted the * character -- try putting that character in backquotes instead), so I don't quite understand the question.

Haha, got it. I think this is just a minor difference. The old logic will treat *.com as an invalid case, but the new logic will treat it as valid. .* is invalid in both cases, though. So I think the question is actually if we want to allow * in the second level domain.
I thought in a real production environment, it would be rare(and a bit suspicious) to have a cert with *.com, but the application should be fine if designed with proper authentication and revocation infrastructure.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 217 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Yeah, all functions declared in the include tree should declare an ExecCtx as the first line of the function. But please do that in a separate PR.

Sure! I created #26031.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 113 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for grpc_core:: prefix here, since we're already in that namespace.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 118 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should just use it instead of it->first. No need to lookup the key again when we already have an iterator pointing right at the entry we want to erase.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 124 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If this function is just calling the existing VerifySubjectAlternativeName() function, why do we need this wrapper function at all? Why not just have callers invoke VerifySubjectAlternativeName() directly instead of calling this?

Yeah, we don't need this.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 124 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If we do keep this function for some reason, please use an anonymous namespace instead of declaring it static.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 128 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for grpc_core:: prefix.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 75 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It doesn't look like these are actually defined anywhere, and the declarations look like duplicates of the methods you have in the CertificateVerificationRequest struct. I think these should probably just be removed.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 214 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Doesn't look like this was addressed.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 222 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I am not actually proposing that we expose tsi_peer in the C-core API, only that we use it as the backing store for these strings to avoid making copies. In other words, I think we should use a data structure similar to this:

struct PendingVerifierRequest {
  CertificateVerificationRequest request;
  tsi_peer peer;
};
std::map<grpc_closure* /*on_peer_checked*/, PendingVerifierRequest*> pending_verifier_requests_;

There will be one entry in the map for each pending request. In the entry for a given request, we can populate the fields of entry.request to point into the strings stored in entry.peer. We won't destroy the tsi_peer until after the request completes. This will avoid the need for making copies of any field stored in the tsi_peer.

Hmmm.... Actually, looking at this struct, it occurs to me that we should probably combine all of the state we need for a pending request into one struct. Currently, we have some state in CertificateVerificationRequest and some state in pre-bound parameters for the lambda callback, and if we add this PendingVerifierRequest struct, we would be adding yet a third place to store state for the same thing. Instead, we could combine all of this state into one object, using something like this:

class PendingVerifierRequest {
 public:
  PendingVerifierRequest(
      RefCountedPtr<TlsChannelSecurityConnector> security_connector,
      grpc_closure* on_peer_checked, tsi_peer peer)
      : security_connector_(std::move(security_connector)),
        on_peer_checked_(on_peer_checked),
        peer_(peer) {
    // ...initialize request_ based on data in security_connector_ and peer_...
  }

  ~PendingVerifierRequest() {
    tsi_peer_destruct(&peer_);
  }

  void Start() {
    grpc_tls_certificate_verifier* verifier = security_connector_->options_->certificate_verifier();
    bool is_async = verifier->Verify(&request_, [this]() { OnVerifyDone(true); });
    if (!is_async) OnVerifyDone(false);
  }

 private:
  void OnVerifyDone(bool run_callback_inline) {
    {
      MutexLock lock(&security_connector_->mu_);
      security_connector_->pending_verifier_requests_.erase(on_peer_checked_);
    }
    grpc_error* error = GRPC_ERROR_NONE;
    if (request_.status != GRPC_STATUS_OK) {
      error = GRPC_ERROR_CREATE_FROM_COPIED_STRING(
          absl::StrCat("Custom verification check failed with error: ",
                       request_.error_details)
              .c_str());
    }
    if (run_callback_inline) {
      Closure::Run(DEBUG_LOCATION, on_peer_checked_, error);
    } else {
      ExecCtx::Run(DEBUG_LOCATION, on_peer_checked_, error);
    }
    delete this;
  }

  RefCountedPtr<TlsChannelSecurityConnector> security_connector_;
  grpc_closure* on_peer_checked_;
  tsi_peer peer_;
  grpc_tls_custom_verification_check_request request_;
};

Then the code here becomes something like this:

auto* pending_request = new PendingVerifierRequest(Ref(), on_peer_checked, peer);
{
  MutexLock lock(&security_connector_->mu_);
  pending_verifier_requests_.emplace(on_peer_checked, pending_request);
}
pending_request->Start();

Note that this would mean that the methods of grpc_tls_certificate_verifier would go back to taking a parameter of type grpc_tls_custom_verification_check_request* instead of CertificateVerificationRequest.

Thanks for the guidance! I updated using the new structure as suggested.
I didn't change it to be directly pointing to the strings in tsi_peer, though, since as I was implementing, I realized that the strings in tsi_peer are not null-terminated. That will mean for every single const char* field in grpc_tls_custom_verification_check_request we will need another size_t length field, and for every array fieldchar**, we will need another size_t* array. This makes the code a bit complicated, and since grpc_tls_custom_verification_check_request is a struct that many callers will use, it will make the potential clean-up a bit harder if we later decide to reduce the use of non-null-terminated-string in c core.
I am wondering if we can keep it as it is right now? I also thought this extra copy was a bit redundant and unnecessary, but it is the best way i could think of to "stop" the spread of this non-standard string use.
@jiangtaoli2016 This(using the null-terminated string) might be another item that we should do when cleaning the core code.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 322 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The goal here is to compare the set of fields that affect the configured behavior of the security connector, not the current data that it happens to have as a result of that configured behavior.

For example, each security connector is configured to use a specific certificate provider instance, and each security connector will contain the certificate data that it was most recently given by that certificate provider. We want to compare based on the configured certificate provider, not based on the certificate data, for two reasons:

  1. At one particular moment in time, two security connectors that are using the same certificate provider might actually have different certificate data, because we might happen to check at a moment when the certificate provider has just finished updating one of the security connectors but has not yet updated the second one. If we compare based on certificate data, then the two would compare as not being equal, even though they are functionally the same and should compare as equal.

  2. At one particular moment in time, two security connectors that are using different certificate providers might happen to have the same certificate data if both certificate providers happen to return the same certs at one point. However, if the certificate providers are different, then one of them could return different data in the future. So if we compare the two security connectors as equal based on the certificate data, it could actually lead to a security problem where we wind up using the wrong security connector for a given subchannel, which means we would use the wrong certificate data.

So basically, I think we do want to compare most of the fields in grpc_tls_credentials_options, and we don't want to compare currently cached data such as pem_root_certs_ and pem_key_cert_pair_list_.

Hmmm ok I see, thanks for the explanation.
I will add a compare() method to grpc_tls_credentials_options then, in another PR. Let's keep this comment open as a reminder.
grpc_tls_certificate_verifier and grpc_tls_certificate_provider are two fields in grpc_tls_credentials_options. How are we going to compare them? From my point, we can simply compare if they are pointing to the same instance, and no need to add compare() for these two, right?


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I'm honestly not that enthusiastic about any of the options here. The information we're trying to expose here is actually very specific to the hostname verifier, so pretty much any way of exposing it in the verifier API seems somewhat hacky to me.

Right off the cuff, I can think of the following approaches:

  1. We could go with the option you suggested of having a method for the verifier to indicate if we are using hostname verification. The down-side of that is that this method will be a no-op for the vast majority of verifiers, but any verifier that delegates to another one will have to implement the delegation for this method, so it adds a fair amount of work where it's not really needed.

  2. Instead of adding a new method, we could add a new output from the Verify() method (passed to callback for async verifiers or returned from Verify() for sync verifiers). This might be a little less ugly, since it would not require almost every verifier implementation to implement a new method they don't care about, but it would still require every implementation to pass along another parameter.

  3. Similar to the previous option, we could indicate this from Verify(), but instead of using an output parameter, we could just have a mutable boolean field in the request struct, and the hostname verifier can set that field directly. This would not require any changes in any other verifiers, but it would mean that we'd have to go back to a mode where the request struct is modified by the verifier implementation instead of being immutable. I really prefer keeping the request struct immutable (which it will be if you take my suggestion elsewhere to pass the status to the callback instead of storing it in the request struct), because that makes it much clearer that the request struct is all input parameters and the parameters passed to the callback (or returned from Verify() in the sync case) are output parameters.

  4. We could use a hack that is specific to the hostname verifier that is not exposed to the API in any way. For example, we could have a boolean field in the PendingVerifierRequest class (see my suggestion above about that), and we could have the hostname verifier reach inside that class (maybe by using the trick where the request struct is the first data member of PendingVerifierRequest) and set the boolean field. The problem with this approach is that it's a hack that's specific to our internal implementation of the hostname verifier; it means that if someone wanted to implement their own hostname verification as an external verifier, there would be no way for them to enable this per-call check with that verifier.

Of these options, I guess option 2 seems the least bad to me. But feel free to suggest other alternatives.

Sorry, I have some more questions about the sequence of each operation:

  1. is check_call_host() guaranteed to happen after check_peer()?
  2. for option 2, is the plan to first invoke Verify() in check_peer(), store a if_host_name_verifier bool in the security connector, and then call check_call_host() based on the bool? If so, is the check_call_host() guaranteed to happen after if_host_name_verifier is set, even when Verify()is an async operation?

If so, I think I still slightly prefer option 1, but also fine with 2. In option 1, would it be better if we rename the function as something like usePerCallVirtualHosting() instead of isHostNameVerifier? Or we could have an default implementation in base class, and the hostname verifier will in particular modify it?
Another option might be to leave this option completely to users by adding this usePerCallVirtualHosting field to grpc_tls_credentials_options. Instead of making the decision for the users, we will let them to decide whether to perform this check. If they use a non-hostname-verifier but choose this per-call check, we will do as they want - maybe in some corner cases that's exactly what they want...


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 205 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think it's safe to run this callback inline. This will cause a deadlock if the security handshaker is holding the lock when it calls check_peer(), because the on_peer_checked callback will wind up trying to acquire the same lock. (This may be what's causing the TSAN problems you're seeing.)

See my comment in the security handshaker code for an alternative approach. But if you don't take that alternative approach, then I think this needs to go back to using ExecCtx::Run() instead of Closure::Run().

Sure, changed it back to ExecCtx::Run()


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 318 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same problem here as on line 205 above: this callback needs to use ExecCtx::Run() instead of Closure::Run(), unless you take the alternative approach I mention in the security handshaker code.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 521 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same problem as on line 205 above: This needs to use ExecCtx::Run() instead of Closure::Run() unless you take the alternative approach I describe in the security handshaker code.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 632 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same problem as on line 205 above: This needs to use ExecCtx::Run() instead of Closure::Run() unless you take the alternative approach I describe in the security handshaker code.

Done.


src/core/lib/security/transport/security_handshaker.cc, line 422 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As I mentioned in the security connector code, the problem is that you changed that code to invoke on_peer_checked from inside check_peer(), before that method returns. The current API requires that the security connector always invoke on_peer_checked after check_peer() returns, because otherwise it leads to exactly this deadlock: we are already holding the lock here in the security handshaker when we call check_peer(), and we need to acquire the same lock in the on_peer_checked callback (in OnPeerCheckedInner()). I noted 4 places in the security connector code where you are now invoking on_peer_checked via Closure::Run() instead of ExecCtx::Run(); if you change those back to ExecCtx::Run(), that should eliminate this new problem you've introduced.

An alternative approach here would be to change the security connector check_peer() API to return a bool indicating whether it is going to return synchronously or asynchronously, much like we're doing for the verifier's Verify() method; it would invoke the callback only in the async case. Then we can move the MutexLock lock(&mu_); line from OnPeerCheckedInner() into OnPeerCheckedFn(). In the sync case, we can directly invoke OnPeerCheckedInner(), since we're already holding the lock.

Ah yes, the problem is indeed eliminated if we change back to ExecCtx::Run(). I will use ExecCtx::Run() then, since it turns out to be a relative easy solution to implement. Thanks for the suggestions though!


test/core/end2end/fixtures/h2_tls.cc, line 78 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Can this test just use the shared verifier implementations you've added in tls_utils instead of duplicating them here?

Yeah, the only thing prevent us from doing that is the thd_list in fullstack_secure_fixture_data. We will have to keep it otherwise there will be memory leak issues. Feel free to see my replies under another comment in the same file for the details.


test/core/end2end/fixtures/h2_tls.cc, line 80 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think the cast to void* is needed here. But if it is, please use a C++-style cast.

Done.


test/core/end2end/fixtures/h2_tls.cc, line 117 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why are we storing the thread list in ffd to begin with? Why don't we just store the threads here in the AsyncExternalVerifier object?

Actually, do we even need to store the threads at all? We're not supporting cancellation here, so why don't we just create a new non-joinable thread for each request? That way, there's nothing to clean up when the thread finishes processing the request -- the threads can just be fire-and-forget.

Sure, I've changed this to be fire-and-forgot. But it seems instead of creating the thread in Verify() and destroy it in Destruct()(which is exactly the way we did in the shared verifier implementation), we would need to keep the thread list in ffd, otherwise ~10 tests would fail with a memory leak issue - that's actually the reason why I didn't use the shared verifier implementation, but chose to write new ones here.

My random guess is, using Destruct() to clean-up the tests here is a bit unreliable here, since we don't support cancellation, and this core e2e tests covers many corner cases, like disappearing server, etc. I've included the logs of failing tests(https://source.cloud.google.com/results/invocations/0e0384bc-03d4-451a-a927-897702b75dc7/targets). And the test file causing this issue is https://gist.github.com/ZhenLian/4fc7632e9a2b0615af02c963e336d4e9.


test/core/end2end/fixtures/h2_tls.cc, line 462 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use static const instead of const static.

Done.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 46 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

For all of these tests, instead of "Good" and "Bad", how about saying "Succeeds" or "Fails"? That will make the intent much clearer.

Same thing in all of these test files.

Done.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 69 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why not just check the result from inside the callback lambda?

Done.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 81 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same here.

Done.


test/core/security/grpc_tls_credentials_options_test.cc, line 569 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't see this change.

Done.


test/core/security/grpc_tls_credentials_options_test.cc, line 587 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think we need to test that the security connector can be created when fed with a verifier that will fail requests. If we're not actually handling requests, then we're not actually testing the verifier.

I do think we should test all combinations of success/failure, sync/async, client/server. Please name the tests appropriately.

OK, then I think we can just keep two basic tests in this file, since this only tests the initial creation.
All the combination of test cases are already covered in tls_security_connector_test.cc, which is the place we invoke check_peer().


test/core/security/tls_security_connector_test.cc, line 592 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of (void*), please use const_cast<char*>(expected_error).

Same thing throughout.

Done.


test/core/security/tls_security_connector_test.cc, line 95 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If you reverse the two parameters here, then there will be no need to make a copy by converting to std::string.

Done.


test/core/util/tls_utils.h, line 51 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why is this needed? Why not just store is_good as a data member of SyncExternalVerifier?

If this is needed, please make it private.

Both Verify() and Destruct() are static functions, which have no access to non-static data members. So I used UserData as a bridge connecting these static functions.
Done making them private.


test/core/util/tls_utils.h, line 85 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why is this needed? I think is_good and event_ptr should be data members of AsyncExternalVerifier. And we shouldn't need to store thread at all; just make the thread non-joinable, so it can be fire-and-forget.

If this is needed, it should be private.

OK, I changed the thread to be non-joinable. But we will still need to store the thread, since it is created on heap will be cleaned up in Destruct.
This is based on my previous comment about static function. I might overlook something, and feel free to point it out if that's not the case. Thanks!


test/core/util/tls_utils.h, line 94 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should also be private.

Done.


test/core/util/tls_utils.cc, line 78 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for the cast here.

Done.


test/core/util/tls_utils.cc, line 110 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This cast is not needed.

Done.


test/core/util/tls_utils.cc, line 128 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of having a separate callback for the two cases, I suggest adding an is_good field to ThreadArgs and having a single callback that sets the result based on that field.

Done.


test/core/util/tls_utils.cc, line 143 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As I mentioned elsewhere, if you mark the thread as non-joinable, then this will not be necessary, and there will be no need to actually store a pointer to the thread anywhere.

Done.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moving in the right direction.

Please let me know if you have any questions. Thanks!

Reviewed 1 of 6 files at r5, 4 of 10 files at r6, 8 of 8 files at r7.
Reviewable status: all files reviewed, 45 unresolved discussions (waiting on @jiangtaoli2016, @yashykt, and @ZhenLian)


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 48 at r2 (raw file):

Previously, ZhenLian wrote…

OK, I think I get the idea. Just one more thing to confirm: is the parameter going to be an absl::Status**instead of absl::Status*(just like in check_call_host() we are using grpc_error**)?
I could be wrong, but my current understanding is like this: if the parameter is defined as absl::Status*, then it will mean the caller will only need to feed in an absl::Status on stack and the Verify(...., absl::Status*) function should finish before the stack variable goes out of scope; but in our case we cannot use a stack variable because of the async nature; we have to create one with new and in Verify(...., absl::Status**) clean up the old memory and change the pointer to point to a new address. Is that right?

(If absl::Status** is needed, then would grpc_error** be better? Since it seems there are very rare cases which use a pointer to a pointer toabsl::Status...)

absl::Status replaces grpc_error*. So while check_call_host() uses grpc_error**, here we would use absl::Status*.

There is no need to dynamically allocate the absl::Status. In the async case, the local status variable will not be used; instead, the status will be passed to the callback. This is exactly how check_call_host() works: the grpc_error** parameter is used only in the sync case; in the async case, the error is passed to the callback.

The idea here is that we should be able to do this:

absl::Status status;
bool is_done = certificate_verifier->Verify(&request, cb, &status);
if (is_done) {
  if (!status.ok()) {
    // ...verification failed...
  } else {
    // ...verification succeeded...
  }
} else {
  // async, wait for callback
  // (local status variable not used; status will be passed to callback)
}

Note that we're ultimately going to be completely replacing grpc_error* with absl::Status everywhere in our code-base, so this is consistent with our long-term direction.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 51 at r2 (raw file):

Previously, ZhenLian wrote…

Sounds good! I plan to merge #25941 first, and then merge those changes into this PR.

Looks like #25941 has been merged, so you should be able to make these changes now.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 37 at r7 (raw file):

// The abstract class of a pending request.
class PendingVerifierRequest {

I think this can now be moved into the security connector code. Now that it's not part of the grpc_tls_certificate_verifier API, it shouldn't be needed outside of the security connector.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 44 at r7 (raw file):

  // Starts the synchronous or asynchronous verification.
  virtual void Start() = 0;

Why is this virtual? We should not be subclassing this class.

If we are going to subclassing this, then the dtor needs to be virtual as well.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 53 at r7 (raw file):

      grpc_tls_custom_verification_check_request* request);

 protected:

Data members must be private, as per the style guide:

https://google.github.io/styleguide/cppguide.html#Access_Control

It's not clear to me why these are protected in the first place. As I mentioned above, I don't think we should be creating any subclasses of this class.

If these do need to be accessed in a subclass, then you'll need to add protected accessor methods.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 106 at r2 (raw file):

Previously, ZhenLian wrote…

Haha, got it. I think this is just a minor difference. The old logic will treat *.com as an invalid case, but the new logic will treat it as valid. .* is invalid in both cases, though. So I think the question is actually if we want to allow * in the second level domain.
I thought in a real production environment, it would be rare(and a bit suspicious) to have a cert with *.com, but the application should be fine if designed with proper authentication and revocation infrastructure.

I could go either way on this, but I suspect that we should be consistent between the xDS and non-xDS use-cases. @yashykt may have insight as to the requirements in the xDS case.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 51 at r7 (raw file):

  // We will make a copy of each field and destroy them when request_ is
  // destroyed.
  // TODO(ZhenLian): avoid the copy when the underlying core implementation used

If we don't actually do this, then there's no point in storing the tsi_peer inside this object in the first place.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 77 at r7 (raw file):

      char* dns = CopyCoreString(prop->value.data, prop->value.length);
      dns_names.emplace_back(dns);
      continue;

No need for this.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 81 at r7 (raw file):

      // Not supported fields.
      // TODO(ZhenLian): populate IP Address and other fields here as well.
      continue;

No need for this.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 84 at r7 (raw file):

    }
  }
  GPR_ASSERT(request_.peer_info.san_names.uri_names == nullptr);

This assert doesn't gain anything, since we're in the ctor here -- we know that no other function can previously have left this with a different value.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 95 at r7 (raw file):

    }
  }
  GPR_ASSERT(request_.peer_info.san_names.dns_names == nullptr);

Same here.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 116 at r7 (raw file):

    grpc_tls_custom_verification_check_request* request) {
  GPR_ASSERT(request != nullptr);
  request->target_name = nullptr;

If the only place this is being called is in the ctor, then we probably don't need to initialize everything to null. We can just do that for the fields that are not actually set when we iterate through the tsi_peer properties.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 133 at r7 (raw file):

    grpc_tls_custom_verification_check_request* request) {
  GPR_ASSERT(request != nullptr);
  /*if (request->target_name != nullptr) {

Please remove.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 141 at r7 (raw file):

  if (request->peer_info.san_names.uri_names_size > 0) {
    for (size_t i = 0; i < request->peer_info.san_names.uri_names_size; ++i) {
      delete[] request->peer_info.san_names.uri_names[i];

This should be gpr_free(), because it's being allocated with gpr_malloc() in CopyCoreString().


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 147 at r7 (raw file):

  if (request->peer_info.san_names.ip_names_size > 0) {
    for (size_t i = 0; i < request->peer_info.san_names.ip_names_size; ++i) {
      delete[] request->peer_info.san_names.ip_names[i];

Same here (never actually populated, but I assume it will be handled the same way).


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 153 at r7 (raw file):

  if (request->peer_info.san_names.dns_names_size > 0) {
    for (size_t i = 0; i < request->peer_info.san_names.dns_names_size; ++i) {
      delete[] request->peer_info.san_names.dns_names[i];

Same here.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 213 at r7 (raw file):

    return true;  // synchronous check
  }
  absl::string_view allocated_name;

This isn't allocating anything, so I don't think this is a good name. I suggest calling this target_host instead.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 115 at r7 (raw file):

        grpc_closure* on_peer_checked, tsi_peer peer, const char* target_name)
        : PendingVerifierRequest(on_peer_checked, std::move(peer)),
          security_connector_(security_connector) {

std::move()


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 119 at r7 (raw file):

      // the verifier holds a ref to the security connector until this
      // verification request is completed.
      this->request_.target_name = target_name;

Why reach into the base class like this? Why not just pass the target name as a parameter to the base class?

(I still don't understand why we need a separate base class in the first place -- I think the data and methods from there should just be moved into this class directly, so we don't need any inheritence.)


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 144 at r7 (raw file):

  grpc_core::Mutex mu_;
  // We need a separate mutex for |pending_verifier_requests_|, otherwise there

Why? What causes the deadlock?

I actually think it makes sense to use a separate lock here anyway, since the existing lock is basically to guard a disjoint set of fields against an orthogonal set of interactions. But I don't understand why there would be any deadlock here even if we didn't.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 146 at r7 (raw file):

  // We need a separate mutex for |pending_verifier_requests_|, otherwise there
  // would be deadlock errors.
  grpc_core::Mutex verifier_request_map_mu_;

No need for grpc_core::.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 157 at r7 (raw file):

  absl::optional<grpc_core::PemKeyCertPairList> pem_key_cert_pair_list_;
  std::map<grpc_closure* /*on_peer_checked*/, PendingVerifierRequest*>
      pending_verifier_requests_;

Please add an ABSL_GUARDED_BY(verifier_request_map_mu_) annotation here. I've sent you #26044 as an example of how to do this.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 222 at r7 (raw file):

  // Use "new" to create a new instance, and no need to delete it later, since
  // it will be self-destroyed in |OnVerifyDone|.
  class ServerPendingVerifierRequest final : public PendingVerifierRequest {

Hmmm... Is the reason that you created a shared base class that you wanted to share the code between the client and server to (a) initialize the request from the tsi_peer and (b) free the request when verification is complete? If so, I think we should use composition instead of inheritence for this, as per the style guide:

https://google.github.io/styleguide/cppguide.html#Inheritance

In particular, I think it's fine to have a couple of standalone functions in an anonymous namespace in this file to do (a) and (b), and then call those functions from this class in each of the client and server security connectors.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 228 at r7 (raw file):

        grpc_closure* on_peer_checked, tsi_peer peer)
        : PendingVerifierRequest(on_peer_checked, std::move(peer)),
          security_connector_(security_connector) {}

std::move()


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 252 at r7 (raw file):

  grpc_core::Mutex mu_;
  // We need a separate mutex for |pending_verifier_requests_|, otherwise there

Same question as above about the deadlock.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 254 at r7 (raw file):

  // We need a separate mutex for |pending_verifier_requests_|, otherwise there
  // would be deadlock errors.
  grpc_core::Mutex verifier_request_map_mu_;

No need for grpc_core::.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 263 at r7 (raw file):

  absl::optional<grpc_core::PemKeyCertPairList> pem_key_cert_pair_list_;
  std::map<grpc_closure* /*on_peer_checked*/, PendingVerifierRequest*>
      pending_verifier_requests_;

Please add an ABSL_GUARDED_BY(verifier_request_map_mu_) annotation here.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 222 at r2 (raw file):

Previously, ZhenLian wrote…

Thanks for the guidance! I updated using the new structure as suggested.
I didn't change it to be directly pointing to the strings in tsi_peer, though, since as I was implementing, I realized that the strings in tsi_peer are not null-terminated. That will mean for every single const char* field in grpc_tls_custom_verification_check_request we will need another size_t length field, and for every array fieldchar**, we will need another size_t* array. This makes the code a bit complicated, and since grpc_tls_custom_verification_check_request is a struct that many callers will use, it will make the potential clean-up a bit harder if we later decide to reduce the use of non-null-terminated-string in c core.
I am wondering if we can keep it as it is right now? I also thought this extra copy was a bit redundant and unnecessary, but it is the best way i could think of to "stop" the spread of this non-standard string use.
@jiangtaoli2016 This(using the null-terminated string) might be another item that we should do when cleaning the core code.

If the underlying strings are not null-terminated, then I agree that we'll need to copy them. But in that case, there's no point in storing the tsi_peer in the PendingVerifierRequest object in the first place.

Note that even if we choose not to store the tsi_peer in PendingVerifierRequest, that would not affect the overall structure of PendingVerifierRequest.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 322 at r2 (raw file):

Previously, ZhenLian wrote…

Hmmm ok I see, thanks for the explanation.
I will add a compare() method to grpc_tls_credentials_options then, in another PR. Let's keep this comment open as a reminder.
grpc_tls_certificate_verifier and grpc_tls_certificate_provider are two fields in grpc_tls_credentials_options. How are we going to compare them? From my point, we can simply compare if they are pointing to the same instance, and no need to add compare() for these two, right?

Yes, I think it's fine to compare the certificate verifier and certificate provider by just the pointer value. If the application constructs two identical objects instead of using the same one twice, that would mean the resulting security connectors would compare non-equally, but that's fine -- the application really should be just creating a single instance of these objects instead.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

Previously, ZhenLian wrote…

Sorry, I have some more questions about the sequence of each operation:

  1. is check_call_host() guaranteed to happen after check_peer()?
  2. for option 2, is the plan to first invoke Verify() in check_peer(), store a if_host_name_verifier bool in the security connector, and then call check_call_host() based on the bool? If so, is the check_call_host() guaranteed to happen after if_host_name_verifier is set, even when Verify()is an async operation?

If so, I think I still slightly prefer option 1, but also fine with 2. In option 1, would it be better if we rename the function as something like usePerCallVirtualHosting() instead of isHostNameVerifier? Or we could have an default implementation in base class, and the hostname verifier will in particular modify it?
Another option might be to leave this option completely to users by adding this usePerCallVirtualHosting field to grpc_tls_credentials_options. Instead of making the decision for the users, we will let them to decide whether to perform this check. If they use a non-hostname-verifier but choose this per-call check, we will do as they want - maybe in some corner cases that's exactly what they want...

Yes, check_call_host() is guaranteed to happen after check_peer(), including any async verification. The former happens when the connection is established; the latter happens when calls are sent on that connection.

Having a separate method with a default implementation in the base class does not help for verifiers that delegate to other verifiers. If we have a separate method for this, then basically every single verifier implementation will have to implement this method, even if most of them don't care at all about this. That's why I prefer option 2.

We cannot call this option "use per-call virtual hosting", because it does not control whether per-call virtual hosting is enabled; any application can use that feature via the gRPC API, regardless of what this option is set to. We should call it something like "check per-call host against peer cert".


test/core/end2end/fixtures/h2_tls.cc, line 78 at r4 (raw file):

Previously, ZhenLian wrote…

Yeah, the only thing prevent us from doing that is the thd_list in fullstack_secure_fixture_data. We will have to keep it otherwise there will be memory leak issues. Feel free to see my replies under another comment in the same file for the details.

See below.


test/core/end2end/fixtures/h2_tls.cc, line 117 at r4 (raw file):

Previously, ZhenLian wrote…

Sure, I've changed this to be fire-and-forgot. But it seems instead of creating the thread in Verify() and destroy it in Destruct()(which is exactly the way we did in the shared verifier implementation), we would need to keep the thread list in ffd, otherwise ~10 tests would fail with a memory leak issue - that's actually the reason why I didn't use the shared verifier implementation, but chose to write new ones here.

My random guess is, using Destruct() to clean-up the tests here is a bit unreliable here, since we don't support cancellation, and this core e2e tests covers many corner cases, like disappearing server, etc. I've included the logs of failing tests(https://source.cloud.google.com/results/invocations/0e0384bc-03d4-451a-a927-897702b75dc7/targets). And the test file causing this issue is https://gist.github.com/ZhenLian/4fc7632e9a2b0615af02c963e336d4e9.

The problem with that code is that you're storing a single thread but overwriting it for each request. So any time you start a new request before the previous one ends, you will reset the thread to point to the new instance, thus leaking the memory for the old thread.

See my comments in tls_utils.h about the two options you can use to fix the async external verifier to clean up its threads. If you do that correctly, then you should be able to use the same implementation here.

There should be no need to store any threads in fullstack_secure_fixture_data.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 43 at r7 (raw file):

 protected:
  void SetUp() override {
    PendingVerifierRequest::PendingVerifierRequestInit(&request_);

You can probably just memset() the request to 0 here.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 47 at r7 (raw file):

  void TearDown() override {
    PendingVerifierRequest::PendingVerifierRequestDestroy(&request_);

In all of these tests, the request struct is not outliving the test itself. That means there's no need to dynamically allocate any of the fields; it's fine to have them point to local variables or static strings. That way, you don't need to do any cleanup here.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 74 at r7 (raw file):

  auto* async_verifier = new AsyncExternalVerifier(true, &event);
  auto* core_external_verifier =
      new ExternalCertificateVerifier(async_verifier->base());

Why does this need to be dynamically allocated? Why not just declare it on the stack, since you're unconditionally deleting at the end of the test?


test/core/security/grpc_tls_certificate_verifier_test.cc, line 80 at r7 (raw file):

  });
  // Wait for the async callback to be completed.
  gpr_event_wait(&event, gpr_time_add(gpr_now(GPR_CLOCK_MONOTONIC),

You're not actually setting this event anywhere via gpr_event_set(), so this call is functionally equivalent to just sleeping for 5 seconds. I think you need to call gpr_event_set() from within the callback.

If you're having to wait for this anyway, then I retract my earlier suggestion about checking the status within the callback. Instead, have the callback set a local variable to the status before it sets the event, and then have the code here check that local variable.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 90 at r7 (raw file):

  auto* async_verifier = new AsyncExternalVerifier(false, &event);
  auto* core_external_verifier =
      new ExternalCertificateVerifier(async_verifier->base());

Same question as above: Why does this need to be dynamically allocated?


test/core/security/grpc_tls_certificate_verifier_test.cc, line 98 at r7 (raw file):

  });
  // Wait for the async callback to be completed.
  gpr_event_wait(&event, gpr_time_add(gpr_now(GPR_CLOCK_MONOTONIC),

Same comment here as in the previous test.


test/core/util/tls_utils.h, line 51 at r4 (raw file):

Previously, ZhenLian wrote…

Both Verify() and Destruct() are static functions, which have no access to non-static data members. So I used UserData as a bridge connecting these static functions.
Done making them private.

Static methods don't automatically get a this pointer to one specific object of the class, but they do have access to private fields of an object of that class if you pass them a pointer to that object. So instead of having a separate UserData struct, just move the is_good field to the SyncExternalVerifier class, and have SyncExternalVerifier pass this as the user_data pointer to the external verifier.


test/core/util/tls_utils.h, line 85 at r4 (raw file):

Previously, ZhenLian wrote…

OK, I changed the thread to be non-joinable. But we will still need to store the thread, since it is created on heap will be cleaned up in Destruct.
This is based on my previous comment about static function. I might overlook something, and feel free to point it out if that's not the case. Thanks!

See above. There's is no reason why we need UserData here; everything should be stored directly in the AsyncExternalVerifier class.

If we need to store a separate thread object for each request, then there's no point in making it non-joinable. In that case, there are two options.

One option is that the AsyncExternalVerifier class can have a map of pending requests, where the thread is stored in the map value. This can basically be exactly the same pattern that we use everywhere else.

Another option would be to have a single dedicated thread that is used for all requests. You could do something like this:

class AsyncExternalVerifier {
 public:
  AsyncExternalVerifier(bool is_good, gpr_event* event_ptr)
      : is_good_(is_good),
        event_ptr_(event_ptr),
        thread_("AsyncExternalVerifierWorkerThread", WorkerThread, this),
        base_{this, Verify, Cancel, Destruct} {}

 private:
  // A request to pass to the worker thread.
  struct Request {
    grpc_tls_custom_verification_check_request* request;
    grpc_tls_on_custom_verification_check_done_cb callback;
    void* callback_arg;
    bool shutdown = false;  // If true, thread will exit.
  };

  static int Verify(void* user_data,
                    grpc_tls_custom_verification_check_request* request,
                    grpc_tls_on_custom_verification_check_done_cb callback,
                    void* callback_arg) {
    auto* self = static_cast<AsyncExternalVerifier*>(arg);
    // Add request to queue to be picked up by worker thread.
    MutexLock lock(&self->mu_);
    self->queue_.push_back(Request{request, callback, callback_arg});
    return false;  // async
  }

  static int Destruct(void* user_data) {
    auto* self = static_cast<AsyncExternalVerifier*>(arg);
    // Tell the thread to shut down.
    {
      MutexLock lock(&self->mu_);
      self->queue_.push_back(Request{nullptr, nullptr, nullptr, true});
    }
    // Wait for thread to exit.
    thread_.Join();
  }

  static void WorkerThread(void* arg) {
    auto* self = static_cast<AsyncExternalVerifier*>(arg);
    while (true) {
      // Check queue for work.
      bool got_request = false;
      Request request;
      {
        MutexLock lock(&self->mu_);
        if (!self->queue_.empty()) {
           got_request = true;
           request = self->queue_.front();
           self->queue_.pop_front();
        }
      }
      // If nothing found in the queue, sleep for a bit and try again.
      if (!got_request) {
        gpr_sleep_until(grpc_timeout_milliseconds_to_deadline(100));
        continue;
      }
      // If we're being told to shut down, return.
      if (request->shutdown) return;
      // Process the request.
      if (self->is_good) {
        request.request->status = GRPC_STATUS_OK;
      } else {
        request.request->status = GRPC_STATUS_UNAUTHENTICATED;
        request.request->error_details =
            gpr_strdup("AsyncExternalVerifierBadVerify failed");
      }
      request.callback(request.request, request.callback_arg);
      // Now we can notify the main testing thread that the thread object is set.
      if (self->event_ptr != nullptr) {
        gpr_event_set(self->event_ptr, reinterpret_cast<void*>(1));
      }
    }
  }

  bool is_good_;
  gpr_event* event_ptr_;
  grpc_core::Thread thread_;
  grpc_tls_certificate_verifier_external base_;
  Mutex mu_;
  std::deque<Request> queue_ ABSL_GUARDED_BY(mu_);
};

test/core/util/tls_utils.h, line 49 at r7 (raw file):

class SyncExternalVerifier {
 public:
  SyncExternalVerifier(bool is_good);

This needs to be explicit.


test/core/util/tls_utils.h, line 49 at r7 (raw file):

class SyncExternalVerifier {
 public:
  SyncExternalVerifier(bool is_good);

Suggest calling the parameter succeed.


test/core/util/tls_utils.h, line 56 at r7 (raw file):

  struct UserData {
    SyncExternalVerifier* self = nullptr;
    bool is_good = false;

Suggest calling this succeed.


test/core/util/tls_utils.h, line 83 at r7 (raw file):

  // don't need to be notified(e.g. in case when the check_peer() of the
  // security connector is not invoked), pass nullptr here.
  AsyncExternalVerifier(bool is_good, gpr_event* event_ptr);

Suggest renaming is_good to succeed. Same thing throughout this class.


test/core/util/tls_utils.h, line 100 at r7 (raw file):

    grpc_tls_on_custom_verification_check_done_cb callback;
    void* callback_arg = nullptr;
    bool is_good = false;

These two fields don't need to be here. Instead, they can be accessed from the parent AsyncExternalVerifier object.

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!
I haven't finished all the fixes, but I asked some follow-up questions and changed the code structure based on two significant problems mentioned. Could you please take a look at them first please? Thank you so much!

Reviewable status: all files reviewed, 45 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @yashykt, and @ZhenLian)


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 48 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

absl::Status replaces grpc_error*. So while check_call_host() uses grpc_error**, here we would use absl::Status*.

There is no need to dynamically allocate the absl::Status. In the async case, the local status variable will not be used; instead, the status will be passed to the callback. This is exactly how check_call_host() works: the grpc_error** parameter is used only in the sync case; in the async case, the error is passed to the callback.

The idea here is that we should be able to do this:

absl::Status status;
bool is_done = certificate_verifier->Verify(&request, cb, &status);
if (is_done) {
  if (!status.ok()) {
    // ...verification failed...
  } else {
    // ...verification succeeded...
  }
} else {
  // async, wait for callback
  // (local status variable not used; status will be passed to callback)
}

Note that we're ultimately going to be completely replacing grpc_error* with absl::Status everywhere in our code-base, so this is consistent with our long-term direction.

Sounds good, thank you so much for the detailed explanation! I've updated the code. Could you please take a look to see if it is what you want?
Looking at this new way of structuring the code, I am wondering if it would make more sense to completely remove the status and error_details in grpc_tls_custom_verification_check_request, and add two output parameters for verify(), so that it becomes:

int (*verify)(void* user_data,
                grpc_tls_custom_verification_check_request* request,
                grpc_tls_on_custom_verification_check_done_cb callback,
                void* callback_arg, grpc_status* sync_status, char** sync_error_details);
```.
In our current implementation, the two fields in `request` are merely for sync verification, and the semantics of how to pass error out is a bit different from the classes that could use `absl::Status* `. It might be better to make the idea of how we pass error out consistent?

src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 37 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this can now be moved into the security connector code. Now that it's not part of the grpc_tls_certificate_verifier API, it shouldn't be needed outside of the security connector.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 44 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why is this virtual? We should not be subclassing this class.

If we are going to subclassing this, then the dtor needs to be virtual as well.

I make this a pure virtual function because while implementing Start(), we will need to do security_connector_->options_. If in PendingVerifierRequest we decided to cache the base grpc_security_connector, we are not able to get the options_. We will need to cache concrete types TlsChannelSecurityConnector or TlsServerSecurityConnector.
I think there are many solutions. We could cache two of them, and in Start() we just use whichever is not empty. Or we could have a bool indicating whether on client side or server side, and in Start() we cast that to concrete types. Can I know what do you suggest?


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 51 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If we don't actually do this, then there's no point in storing the tsi_peer inside this object in the first place.

I think we could do this if it makes sense. @jiangtaoli2016 is considering cleaning up the some of the TSI code-base, and I think it would be great if we could eliminate many of the old C code that uses pointer arithmetic, non-null-terminated string, etc.
These type of work might be relatively easy; to make the code easier to maintain, we might want to restructure files like ssl_transport_security.cc to introduce class and inheritance design. Right now this file has so many components stuffed together, and becomes harder and harder to maintain.
If these fields one day become null-terminated, I think it makes sense to directly take the pointers, since it saves space and doesn't change the API.
It is just from resource-wise that we didn't find a person who has bandwidth to do that. But yeah it's in our radar.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moving in the right direction.

Please let me know if you have any questions. Thanks!

Reviewed 10 of 10 files at r8.
Reviewable status: all files reviewed, 54 unresolved discussions (waiting on @jiangtaoli2016, @yashykt, and @ZhenLian)


include/grpc/grpc_security.h, line 1019 at r8 (raw file):

   * false immediately, and then in the asynchronous thread invoke |callback|
   * with the verification result. The implementer MUST NOT invoke the async
   * |callback| before |verify| returns, otherwise it can lead to deadlocks.

Let's be a bit more precise here: The implementation must not invoke the async callback in the same thread before verify() returns. It's okay if the async callback is invoked in a different thread and happens to be called before verify() returns in the original thread.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 48 at r2 (raw file):

Previously, ZhenLian wrote…

Sounds good, thank you so much for the detailed explanation! I've updated the code. Could you please take a look to see if it is what you want?
Looking at this new way of structuring the code, I am wondering if it would make more sense to completely remove the status and error_details in grpc_tls_custom_verification_check_request, and add two output parameters for verify(), so that it becomes:

int (*verify)(void* user_data,
                grpc_tls_custom_verification_check_request* request,
                grpc_tls_on_custom_verification_check_done_cb callback,
                void* callback_arg, grpc_status* sync_status, char** sync_error_details);
```.
In our current implementation, the two fields in `request` are merely for sync verification, and the semantics of how to pass error out is a bit different from the classes that could use `absl::Status* `. It might be better to make the idea of how we pass error out consistent?


</blockquote></details>

Yes, I had intended that we remove the `status` and `error_details` fields from `grpc_tls_custom_verification_check_request` and instead return them from the `verify()` method, exactly as you described.

In other words, the semantics should be the same in the internal API and in the C-core API.  The only difference is that the C-core API uses a separate status code and string instead of using `absl::Status`.
___
*[src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 44 at r7](https://reviewable.io/reviews/grpc/grpc/25631#-MYpY7lmDajsXsh13ljX:-MZJtPps3bKMMzJfb_3O:b8zt884) ([raw file](https://github.com/grpc/grpc/blob/ea2caf27c77cf64e80535255769b9ef9a580c55c/src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h#L44)):*
<details><summary><i>Previously, ZhenLian wrote…</i></summary><blockquote>

I make this a pure virtual function because while implementing `Start()`, we will need to do `security_connector_->options_`. If in `PendingVerifierRequest` we decided to cache the base `grpc_security_connector`, we are not able to get the `options_`. We will need to cache concrete types `TlsChannelSecurityConnector` or `TlsServerSecurityConnector`.
I think there are many solutions. We could cache two of them, and in `Start()` we just use whichever is not empty. Or we could have a bool indicating whether on client side or server side, and  in `Start()` we cast that to concrete types. Can I know what do you suggest? 
</blockquote></details>

See my comment in tls_security_connector.h line 222.  I don't think it's helpful to share the interface between the client and server in the first place.  The only part of the code that you're actually sharing between the two is the code to initialize the fields in the request struct and then free those fields later, and I think whatever sharing is needed there can be done via standalone helper functions.

In other words, I think that `TlsChannelSecurityConnector::ChannelPendingVerifierRequest` and `TlsServerSecurityConnector::ServerPendingVerifierRequest` should be independent classes instead of inheritting from the same base class.
___
*[src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 51 at r7](https://reviewable.io/reviews/grpc/grpc/25631#-MYpbusq--88Lr0yoLzq:-MZJuJvY80tqen-Dduw_:b-b27c97) ([raw file](https://github.com/grpc/grpc/blob/ea2caf27c77cf64e80535255769b9ef9a580c55c/src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc#L51)):*
<details><summary><i>Previously, ZhenLian wrote…</i></summary><blockquote>

I think we could do this if it makes sense. @jiangtaoli2016 is considering cleaning up the some of the TSI code-base, and I think it would be great if we could eliminate many of the old C code that uses pointer arithmetic,  non-null-terminated string, etc.
These type of work might be relatively easy; to make the code easier to maintain, we might want to restructure files like `ssl_transport_security.cc` to introduce class and inheritance design. Right now this file has so many components stuffed together, and becomes harder and harder to maintain. 
If these fields one day become null-terminated, I think it makes sense to directly take the pointers, since it saves space and doesn't change the API.
It is just from resource-wise that we didn't find a person who has bandwidth to do that. But yeah it's in our radar.
</blockquote></details>

I agree with the long-term direction of changing the TSI API from C to C++.  However, I don't think that's actually a prerequisite for sharing these strings.  It seems like we should be able to change the existing C-based TSI implementation to generate nul-terminated strings even in the existing API.  Just change them to allocate one extra byte and populate it with nul.
___
*[src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 46 at r8](https://reviewable.io/reviews/grpc/grpc/25631#-MZJo2AL5ozRM5-ctNYm:-MZJo2AL5ozRM5-ctNYn:b-cwz1h) ([raw file](https://github.com/grpc/grpc/blob/addb7557e62172a862401814c32258c40f8f2cd4/src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc#L46)):*
> ```cpp
>   if (is_done) {
>     if (request->status != GRPC_STATUS_OK) {
>       *sync_status = absl::Status(absl::StatusCode::kUnauthenticated,
> ```

This should use `request->status` rather than hard-coding `kUnauthenticated`.
___
*[src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 73 at r8](https://reviewable.io/reviews/grpc/grpc/25631#-MZJoJaG0Cvqs-qJcqz6:-MZJoJaG0Cvqs-qJcqz7:b-jur4al) ([raw file](https://github.com/grpc/grpc/blob/addb7557e62172a862401814c32258c40f8f2cd4/src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc#L73)):*
> ```cpp
>     if (status != GRPC_STATUS_OK) {
>       return_status =
>           absl::Status(absl::StatusCode::kUnauthenticated, error_details);
> ```

This should use `status` instead of hard-coding `kUnauthenticated`.
___
*[src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 160 at r8](https://reviewable.io/reviews/grpc/grpc/25631#-MZJosXLC76_Qo4ti3sg:-MZJosXLC76_Qo4ti3sh:b-r12u7i) ([raw file](https://github.com/grpc/grpc/blob/addb7557e62172a862401814c32258c40f8f2cd4/src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc#L160)):*
> ```cpp
>           callback(request, callback_arg, GRPC_STATUS_OK, "");
>         } else {
>           callback(request, callback_arg, GRPC_STATUS_UNAUTHENTICATED,
> ```

This needs to use `async_status.code()` instead of hard-coding `GRPC_STATUS_UNAUTHENTICATED`.
___
*[src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 161 at r8](https://reviewable.io/reviews/grpc/grpc/25631#-MZJpdwzCs1i6QxQNdmk:-MZJpdwzCs1i6QxQNdml:bo9oof5) ([raw file](https://github.com/grpc/grpc/blob/addb7557e62172a862401814c32258c40f8f2cd4/src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc#L161)):*
> ```cpp
>         } else {
>           callback(request, callback_arg, GRPC_STATUS_UNAUTHENTICATED,
>                    async_status.ToString().c_str());
> ```

This should use `message()` instead of `ToString()`.
___
*[src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 168 at r8](https://reviewable.io/reviews/grpc/grpc/25631#-MZJp45ZAFxbJ6VtuirY:-MZJp45ZAFxbJ6VtuirZ:b-ytrpgm) ([raw file](https://github.com/grpc/grpc/blob/addb7557e62172a862401814c32258c40f8f2cd4/src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc#L168)):*
> ```cpp
>   if (is_done) {
>     if (!sync_status.ok()) {
>       request->status = GRPC_STATUS_UNAUTHENTICATED;
> ```

This should use `sync_status.code()` instead of hard-coding `GRPC_STATUS_UNAUTHENTICATED`.
___
*[src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 169 at r8](https://reviewable.io/reviews/grpc/grpc/25631#-MZJphTZC1mx-QoJ_FWh:-MZJphTZC1mx-QoJ_FWi:bo9oof5) ([raw file](https://github.com/grpc/grpc/blob/addb7557e62172a862401814c32258c40f8f2cd4/src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc#L169)):*
> ```cpp
>     if (!sync_status.ok()) {
>       request->status = GRPC_STATUS_UNAUTHENTICATED;
>       request->error_details = gpr_strdup(sync_status.ToString().c_str());
> ```

This should use `message()` instead of `ToString()`.
___
*[src/core/lib/security/security_connector/tls/tls_security_connector.h, line 157 at r8](https://reviewable.io/reviews/grpc/grpc/25631#-MZJr7B430nmRigtmt2r:-MZJr7B430nmRigtmt2s:bdvbvun) ([raw file](https://github.com/grpc/grpc/blob/addb7557e62172a862401814c32258c40f8f2cd4/src/core/lib/security/security_connector/tls/tls_security_connector.h#L157)):*
> ```objc
>       bool is_done = verifier->Verify(
>           &request_,
>           [this](absl::Status status) { OnVerifyDone(true, status); },
> ```

`absl::bind_front(&OnVerifyDone, this, true)`
___
*[src/core/lib/security/security_connector/tls/tls_security_connector.h, line 270 at r8](https://reviewable.io/reviews/grpc/grpc/25631#-MZJqx3n0H-lNXh8RSV6:-MZJqx3n0H-lNXh8RSV7:bdvbvun) ([raw file](https://github.com/grpc/grpc/blob/addb7557e62172a862401814c32258c40f8f2cd4/src/core/lib/security/security_connector/tls/tls_security_connector.h#L270)):*
> ```objc
>       bool is_done = verifier->Verify(
>           &request_,
>           [this](absl::Status status) { OnVerifyDone(true, status); },
> ```

`absl::bind_front(&OnVerifyDone, this, true)`
___
*[test/core/security/grpc_tls_certificate_verifier_test.cc, line 58 at r8](https://reviewable.io/reviews/grpc/grpc/25631#-MZJrfdx3QDCJxPCG122:-MZJrfdy9AqmnmM3BSQZ:b18syji) ([raw file](https://github.com/grpc/grpc/blob/addb7557e62172a862401814c32258c40f8f2cd4/test/core/security/grpc_tls_certificate_verifier_test.cc#L58)):*
> ```cpp
>   ExternalCertificateVerifier core_external_verifier(sync_verifier->base());
>   absl::Status sync_status;
>   core_external_verifier.Verify(
> ```

All of these tests should check the return value from `Verify()`.


<!-- Sent from Reviewable.io -->

@ZhenLian ZhenLian force-pushed the zhen_custom_verification_1 branch from addb755 to 77dbed7 Compare May 2, 2021 17:41
Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mark for the review so far, and for bringing up the awesome test change to use a thread loop! I've fixed most of them.
I also added the C++ implementations, but haven't got a time to work on the test yet(but I've made sure they are at least syntax right). Feel free to take a look and see if you have any early comments on that, or it's also fine to wait for more tests to show up.
Thanks again!

Reviewable status: 5 of 20 files reviewed, 54 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @yashykt, and @ZhenLian)


include/grpc/grpc_security.h, line 1019 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Let's be a bit more precise here: The implementation must not invoke the async callback in the same thread before verify() returns. It's okay if the async callback is invoked in a different thread and happens to be called before verify() returns in the original thread.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 48 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Yes, I had intended that we remove the status and error_details fields from grpc_tls_custom_verification_check_request and instead return them from the verify() method, exactly as you described.

In other words, the semantics should be the same in the internal API and in the C-core API. The only difference is that the C-core API uses a separate status code and string instead of using absl::Status.

Sounds good! I've made the change accordingly.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 51 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Looks like #25941 has been merged, so you should be able to make these changes now.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 44 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

See my comment in tls_security_connector.h line 222. I don't think it's helpful to share the interface between the client and server in the first place. The only part of the code that you're actually sharing between the two is the code to initialize the fields in the request struct and then free those fields later, and I think whatever sharing is needed there can be done via standalone helper functions.

In other words, I think that TlsChannelSecurityConnector::ChannelPendingVerifierRequest and TlsServerSecurityConnector::ServerPendingVerifierRequest should be independent classes instead of inheritting from the same base class.

Done. Please take a look. Thank you so much!


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 53 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Data members must be private, as per the style guide:

https://google.github.io/styleguide/cppguide.html#Access_Control

It's not clear to me why these are protected in the first place. As I mentioned above, I don't think we should be creating any subclasses of this class.

If these do need to be accessed in a subclass, then you'll need to add protected accessor methods.

No longer have this anymore.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 51 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I agree with the long-term direction of changing the TSI API from C to C++. However, I don't think that's actually a prerequisite for sharing these strings. It seems like we should be able to change the existing C-based TSI implementation to generate nul-terminated strings even in the existing API. Just change them to allocate one extra byte and populate it with nul.

Oh, sorry for causing the confusion. The thing we intended to do is to make the code inside TSI a bit more C++-like, instead of changing the TSI interface. Generating nul-terminated strings is one relatively simpler goal we wanted to achieve, but we also wanted to see if it is possible to restructure some of the code inside TSI layer to make them look like written in modern C++(using class, inheritance, reduce the use of pointer arithmetic, using std::* or absl::* when possible, etc).
The reason for this change was because the transport security code was written a long time ago, and possibly before the time we can use C++11 in core. That code seems harder to maintain and make changes, and with the possible future work of CRL and SPIFFE trust domain, we will probably tend to modify TSI code more frequently. So we are trying to see if we could re-write some parts of it, to at least mitigate the maintenance burden.
I will probably change to use nul-terminated strings after this PR goes in, so I think we can keep this TODO at this moment.

(while I think it's a good idea to change the TSI API from C to C++, I would imagine it could be a huge amount of work. We will need to change different transport protocols using TSI internally. That's why we didn't even think about it :) ).


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 77 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for this.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 81 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for this.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 84 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This assert doesn't gain anything, since we're in the ctor here -- we know that no other function can previously have left this with a different value.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 95 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same here.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 116 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If the only place this is being called is in the ctor, then we probably don't need to initialize everything to null. We can just do that for the fields that are not actually set when we iterate through the tsi_peer properties.

Sure. Right now, the only field that is not set through tsi_peer is the target_name, but it is also a bit special: on server, there wouldn't be target_nameavailable, so it will be nullptr in grpc_tls_custom_verification_check_request when on server side. So we might still need to initialize it to nullptr...


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 133 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please remove.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 141 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should be gpr_free(), because it's being allocated with gpr_malloc() in CopyCoreString().

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 147 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same here (never actually populated, but I assume it will be handled the same way).

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 153 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same here.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 213 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This isn't allocating anything, so I don't think this is a good name. I suggest calling this target_host instead.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 46 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should use request->status rather than hard-coding kUnauthenticated.

It seems we are not able to convert from 'grpc_status_code' to 'absl::StatusCode', so I will make a static_cast here.
Just to make sure, are we sure that the fields in these two enums can always stay the same?


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 73 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should use status instead of hard-coding kUnauthenticated.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 160 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This needs to use async_status.code() instead of hard-coding GRPC_STATUS_UNAUTHENTICATED.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 161 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should use message() instead of ToString().

Would it be better if we also include the status code in the debug message? It is also stated in message()'s documentation that It is not unusual for the error message to be the empty string. As a result, prefer Status::ToString() for debug logging....


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 168 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should use sync_status.code() instead of hard-coding GRPC_STATUS_UNAUTHENTICATED.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 115 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 119 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why reach into the base class like this? Why not just pass the target name as a parameter to the base class?

(I still don't understand why we need a separate base class in the first place -- I think the data and methods from there should just be moved into this class directly, so we don't need any inheritence.)

removed the base class.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 144 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why? What causes the deadlock?

I actually think it makes sense to use a separate lock here anyway, since the existing lock is basically to guard a disjoint set of fields against an orthogonal set of interactions. But I don't understand why there would be any deadlock here even if we didn't.

Thanks for asking! I almost forgot this. That was one thing I felt quite weird.
I've uploaded the failures to https://source.cloud.google.com/results/invocations/911d1a6d-654f-4f36-b0cc-94fe37dc8841/targets. Basically, it says it is holding the lock conflicting with TlsChannelSecurityConnector::TlsChannelCertificateWatcher::OnCertificatesChanged(), which in turn is invoked by TlsChannelSecurityConnector::CreateTlsChannelSecurityConnector().
I am not sure how this happens though. As my understanding, check_peer() is invoked when there is an ongoing request, which should happen after we successfully created the security connector. Do you have any ideas on what is happening here?


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 146 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for grpc_core::.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 157 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add an ABSL_GUARDED_BY(verifier_request_map_mu_) annotation here. I've sent you #26044 as an example of how to do this.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 222 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Hmmm... Is the reason that you created a shared base class that you wanted to share the code between the client and server to (a) initialize the request from the tsi_peer and (b) free the request when verification is complete? If so, I think we should use composition instead of inheritence for this, as per the style guide:

https://google.github.io/styleguide/cppguide.html#Inheritance

In particular, I think it's fine to have a couple of standalone functions in an anonymous namespace in this file to do (a) and (b), and then call those functions from this class in each of the client and server security connectors.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 228 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 254 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for grpc_core::.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 263 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add an ABSL_GUARDED_BY(verifier_request_map_mu_) annotation here.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 157 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

absl::bind_front(&OnVerifyDone, this, true)

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 270 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

absl::bind_front(&OnVerifyDone, this, true)

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 222 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If the underlying strings are not null-terminated, then I agree that we'll need to copy them. But in that case, there's no point in storing the tsi_peer in the PendingVerifierRequest object in the first place.

Note that even if we choose not to store the tsi_peer in PendingVerifierRequest, that would not affect the overall structure of PendingVerifierRequest.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 322 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Yes, I think it's fine to compare the certificate verifier and certificate provider by just the pointer value. If the application constructs two identical objects instead of using the same one twice, that would mean the resulting security connectors would compare non-equally, but that's fine -- the application really should be just creating a single instance of these objects instead.

I see, that's pretty clear now! I will add a compare() function in grpc_tls_credentials_options in another PR and then update code here.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Yes, check_call_host() is guaranteed to happen after check_peer(), including any async verification. The former happens when the connection is established; the latter happens when calls are sent on that connection.

Having a separate method with a default implementation in the base class does not help for verifiers that delegate to other verifiers. If we have a separate method for this, then basically every single verifier implementation will have to implement this method, even if most of them don't care at all about this. That's why I prefer option 2.

We cannot call this option "use per-call virtual hosting", because it does not control whether per-call virtual hosting is enabled; any application can use that feature via the gRPC API, regardless of what this option is set to. We should call it something like "check per-call host against peer cert".

Oh ok, sounds good. Just to confirm, you are fine with the way to add a field to grpc_tls_credentials_options, if we name it like check_per_call_host, right?


test/core/end2end/fixtures/h2_tls.cc, line 78 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

See below.

Changed to use the same shared verifier.


test/core/end2end/fixtures/h2_tls.cc, line 117 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The problem with that code is that you're storing a single thread but overwriting it for each request. So any time you start a new request before the previous one ends, you will reset the thread to point to the new instance, thus leaking the memory for the old thread.

See my comments in tls_utils.h about the two options you can use to fix the async external verifier to clean up its threads. If you do that correctly, then you should be able to use the same implementation here.

There should be no need to store any threads in fullstack_secure_fixture_data.

Ah I see! I think I ignored the fact that the verifier is a shared common object that multiple threads could pass by and access. That makes a lot of sense!
Changed to use the same shared verifier.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 43 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

You can probably just memset() the request to 0 here.

Done.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 47 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

In all of these tests, the request struct is not outliving the test itself. That means there's no need to dynamically allocate any of the fields; it's fine to have them point to local variables or static strings. That way, you don't need to do any cleanup here.

Done.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 74 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why does this need to be dynamically allocated? Why not just declare it on the stack, since you're unconditionally deleting at the end of the test?

Done.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 80 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

You're not actually setting this event anywhere via gpr_event_set(), so this call is functionally equivalent to just sleeping for 5 seconds. I think you need to call gpr_event_set() from within the callback.

If you're having to wait for this anyway, then I retract my earlier suggestion about checking the status within the callback. Instead, have the callback set a local variable to the status before it sets the event, and then have the code here check that local variable.

Yeah, the main reason for setting this gpr_event is to avoid shutting down the ExternalCertificateVerifier while there are still threads running. Some threads will invoke callbacks accessing values cached to ExternalCertificateVerifier, and in turn cause an use-after-free bug.
In our new event-loop test structure, I don't think we have this problem anymore, because we will wait for the thread to exist during destruct(). If a new request is added to the processing deque after the shutdown request is pushed, the thread will simply leave without processing the last newly added request, and there won't be any memory issues.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 90 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same question as above: Why does this need to be dynamically allocated?

Removed.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 98 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same comment here as in the previous test.

Done.


test/core/security/grpc_tls_certificate_verifier_test.cc, line 58 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

All of these tests should check the return value from Verify().

Done.


test/core/util/tls_utils.h, line 51 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Static methods don't automatically get a this pointer to one specific object of the class, but they do have access to private fields of an object of that class if you pass them a pointer to that object. So instead of having a separate UserData struct, just move the is_good field to the SyncExternalVerifier class, and have SyncExternalVerifier pass this as the user_data pointer to the external verifier.

Ah I see, that sounds good!


test/core/util/tls_utils.h, line 85 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

See above. There's is no reason why we need UserData here; everything should be stored directly in the AsyncExternalVerifier class.

If we need to store a separate thread object for each request, then there's no point in making it non-joinable. In that case, there are two options.

One option is that the AsyncExternalVerifier class can have a map of pending requests, where the thread is stored in the map value. This can basically be exactly the same pattern that we use everywhere else.

Another option would be to have a single dedicated thread that is used for all requests. You could do something like this:

class AsyncExternalVerifier {
 public:
  AsyncExternalVerifier(bool is_good, gpr_event* event_ptr)
      : is_good_(is_good),
        event_ptr_(event_ptr),
        thread_("AsyncExternalVerifierWorkerThread", WorkerThread, this),
        base_{this, Verify, Cancel, Destruct} {}

 private:
  // A request to pass to the worker thread.
  struct Request {
    grpc_tls_custom_verification_check_request* request;
    grpc_tls_on_custom_verification_check_done_cb callback;
    void* callback_arg;
    bool shutdown = false;  // If true, thread will exit.
  };

  static int Verify(void* user_data,
                    grpc_tls_custom_verification_check_request* request,
                    grpc_tls_on_custom_verification_check_done_cb callback,
                    void* callback_arg) {
    auto* self = static_cast<AsyncExternalVerifier*>(arg);
    // Add request to queue to be picked up by worker thread.
    MutexLock lock(&self->mu_);
    self->queue_.push_back(Request{request, callback, callback_arg});
    return false;  // async
  }

  static int Destruct(void* user_data) {
    auto* self = static_cast<AsyncExternalVerifier*>(arg);
    // Tell the thread to shut down.
    {
      MutexLock lock(&self->mu_);
      self->queue_.push_back(Request{nullptr, nullptr, nullptr, true});
    }
    // Wait for thread to exit.
    thread_.Join();
  }

  static void WorkerThread(void* arg) {
    auto* self = static_cast<AsyncExternalVerifier*>(arg);
    while (true) {
      // Check queue for work.
      bool got_request = false;
      Request request;
      {
        MutexLock lock(&self->mu_);
        if (!self->queue_.empty()) {
           got_request = true;
           request = self->queue_.front();
           self->queue_.pop_front();
        }
      }
      // If nothing found in the queue, sleep for a bit and try again.
      if (!got_request) {
        gpr_sleep_until(grpc_timeout_milliseconds_to_deadline(100));
        continue;
      }
      // If we're being told to shut down, return.
      if (request->shutdown) return;
      // Process the request.
      if (self->is_good) {
        request.request->status = GRPC_STATUS_OK;
      } else {
        request.request->status = GRPC_STATUS_UNAUTHENTICATED;
        request.request->error_details =
            gpr_strdup("AsyncExternalVerifierBadVerify failed");
      }
      request.callback(request.request, request.callback_arg);
      // Now we can notify the main testing thread that the thread object is set.
      if (self->event_ptr != nullptr) {
        gpr_event_set(self->event_ptr, reinterpret_cast<void*>(1));
      }
    }
  }

  bool is_good_;
  gpr_event* event_ptr_;
  grpc_core::Thread thread_;
  grpc_tls_certificate_verifier_external base_;
  Mutex mu_;
  std::deque<Request> queue_ ABSL_GUARDED_BY(mu_);
};

Wow this event loop looks pretty nice! It's far more cleaner than my original way of passing different args around. I am happy to adopt it.


test/core/util/tls_utils.h, line 49 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This needs to be explicit.

Done.


test/core/util/tls_utils.h, line 49 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling the parameter succeed.

Done.


test/core/util/tls_utils.h, line 56 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this succeed.

Done.


test/core/util/tls_utils.h, line 83 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest renaming is_good to succeed. Same thing throughout this class.

Done.


test/core/util/tls_utils.h, line 100 at r7 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

These two fields don't need to be here. Instead, they can be accessed from the parent AsyncExternalVerifier object.

Changed to use the new structure so this doesn't exist anymore.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting closer!

Please let me know if you have any questions. Thanks!

Reviewed 8 of 15 files at r9, 13 of 13 files at r10.
Reviewable status: all files reviewed, 32 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @yashykt, and @ZhenLian)


include/grpc/grpc_security.h, line 1022 at r10 (raw file):

   * mainly used as an argument in |callback|, if want to invoke |callback| in
   * async mode.
   * status: indicates if a connection should be allowed. This should only be

s/status/sync_status/


include/grpc/grpc_security.h, line 1023 at r10 (raw file):

   * async mode.
   * status: indicates if a connection should be allowed. This should only be
   * used if the verification check is done synchronously. error_details: the

Please move "error_details:" to the start of the next line, since it's a separate parameter. Also change it to "sync_error_details", to match the parameter name.


include/grpc/grpc_security.h, line 1024 at r10 (raw file):

   * status: indicates if a connection should be allowed. This should only be
   * used if the verification check is done synchronously. error_details: the
   * error generated while verifying a connection. This should only be used if

Please document the fact that the implementation must allocate the string via gpr_malloc() or gpr_strdup().


include/grpc/grpc_security.h, line 1025 at r10 (raw file):

   * used if the verification check is done synchronously. error_details: the
   * error generated while verifying a connection. This should only be used if
   * the verification check is done synchronously. return: return 0 if |verify|

Please move "return:" to the start of the next line.


include/grpcpp/security/tls_certificate_verifier.h, line 70 at r10 (raw file):

 private:
  grpc_tls_custom_verification_check_request* c_request_ = nullptr;
  std::string target_name_;

Instead of allocating these at construction time, it might make more sense to store only c_request_ and construct these return values dynamically in the accessor methods. That way, we won't bother allocating the fields that aren't actually going to be used. I would expect the common case here to be that the verifier only looks at a subset of the fields, so that seems more optimal.

@veblush is currently investigating whether it will be possible to use absl in the C++ API. If we can do that, then we can use absl::string_view for all of the non-vector accessor methods, which would avoid the allocation entirely.


include/grpcpp/security/tls_certificate_verifier.h, line 80 at r10 (raw file):

};

class CertificateVerifier {

This needs comments explaining the purpose of this class and showing how to use it.


include/grpcpp/security/tls_certificate_verifier.h, line 101 at r10 (raw file):

  grpc_tls_certificate_verifier* verifier_ = nullptr;
  std::mutex mu_;

Please use grpc::internal::Mutex.


include/grpcpp/security/tls_certificate_verifier.h, line 107 at r10 (raw file):

};

class ExternalCertificateVerifier {

This needs comments explaining the purpose of this class and showing how to use it.


include/grpcpp/security/tls_certificate_verifier.h, line 120 at r10 (raw file):

  // bind its lifetime to the core objects.
  template <typename Subclass, typename... Args>
  static std::shared_ptr<CertificateVerifier> Create(Args&&... args);

This is effectively a factory function, so it should be the first one declared, as per the style guide:

https://google.github.io/styleguide/cppguide.html#Declaration_Order


include/grpcpp/security/tls_certificate_verifier.h, line 128 at r10 (raw file):

 private:
  struct AsyncCallbackProfile {

Suggest calling this AsyncRequestState.


include/grpcpp/security/tls_certificate_verifier.h, line 134 at r10 (raw file):

        : callback(cb), callback_arg(arg), cpp_request(request) {}

    grpc_tls_on_custom_verification_check_done_cb callback = nullptr;

No reason to initialize these fields to nullptr if they are always going to be set by the ctor.


include/grpcpp/security/tls_certificate_verifier.h, line 150 at r10 (raw file):

  static void DestructInCoreExternalVerifier(void* user_data);

  // We have to use a pointer here, because

This is really unfortunate. But I guess we can make this a TODO for now, assigned to @yihuazhang, and we can clean it up as part of eliminating insecure builds.


include/grpcpp/security/tls_certificate_verifier.h, line 155 at r10 (raw file):

  // to bypass this?
  grpc_tls_certificate_verifier_external* base_ = nullptr;
  std::mutex mu_;

grpc::internal::Mutex


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 46 at r8 (raw file):

Previously, ZhenLian wrote…

It seems we are not able to convert from 'grpc_status_code' to 'absl::StatusCode', so I will make a static_cast here.
Just to make sure, are we sure that the fields in these two enums can always stay the same?

Yes, they will always be the same. Using a cast is fine.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 161 at r8 (raw file):

Previously, ZhenLian wrote…

Would it be better if we also include the status code in the debug message? It is also stated in message()'s documentation that It is not unusual for the error message to be the empty string. As a result, prefer Status::ToString() for debug logging....

No, there's no need to include the code in the message. This is not for logging; it's a strict conversion from one type to another.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 42 at r10 (raw file):

  // Invoke the caller-specified verification logic embedded in
  // external_verifier_.
  grpc_status_code sync_status_c = GRPC_STATUS_OK;

Suggest calling this status_code.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 43 at r10 (raw file):

  // external_verifier_.
  grpc_status_code sync_status_c = GRPC_STATUS_OK;
  // Question: this probably is a dumb question, but is there any way to

There's no need to allocate anything here. Just set it to nullptr.

Note that gpr_free() is a no-op if its argument is nullptr.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 178 at r10 (raw file):

    if (!sync_status_cpp.ok()) {
      *sync_status = static_cast<grpc_status_code>(sync_status_cpp.code());
      gpr_free(*sync_error_details);

This is not necessary, as per above.


src/core/lib/security/security_connector/tls/tls_security_connector.h, line 144 at r7 (raw file):

Previously, ZhenLian wrote…

Thanks for asking! I almost forgot this. That was one thing I felt quite weird.
I've uploaded the failures to https://source.cloud.google.com/results/invocations/911d1a6d-654f-4f36-b0cc-94fe37dc8841/targets. Basically, it says it is holding the lock conflicting with TlsChannelSecurityConnector::TlsChannelCertificateWatcher::OnCertificatesChanged(), which in turn is invoked by TlsChannelSecurityConnector::CreateTlsChannelSecurityConnector().
I am not sure how this happens though. As my understanding, check_peer() is invoked when there is an ongoing request, which should happen after we successfully created the security connector. Do you have any ideas on what is happening here?

Ah, okay. I think this can happen when the certificate distributor already has the certificate when the security connector is created. When the watcher is registered with the distributor, it will immediately return the certificate, before WatchTlsCertificates() returns.

In any case, as I said, using two different locks here makes sense, so this code looks good.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

Previously, ZhenLian wrote…

Oh ok, sounds good. Just to confirm, you are fine with the way to add a field to grpc_tls_credentials_options, if we name it like check_per_call_host, right?

No, I don't think we need to add anything to grpc_tls_credentials_options. I think we just need to add a new output to the Verify() method (which will be returned directly for sync verifiers and passed to the callback for async verifiers).


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 56 at r10 (raw file):

}

void PendingVerifierRequestInit(

I don't think this function is needed. In both the client and server cases, we are calling this to set all of the fields to null, and then immediately calling ParseTsiPeer() to reset the fields to the proper value. In most cases, this means that we're setting the fields twice, which is pointless. Instead, please change ParseTsiPeer() to set any field that it is not setting to null (e.g., after the if (!uri_names.empty()) { block, add else { request->peer_info.san_names.uri_names = nullptr; }).

Also, please rename ParseTsiPeer() to PendingVerifierRequestInit(), since that's really what we're doing there.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 385 at r10 (raw file):

    }
    if (pending_verifier_request != nullptr) {
      verifier->Cancel(pending_verifier_request->request());

I think there's a race condition here: If the async request completes after we release the lock but before we execute this, then pending_verifier_request may have been destroyed already, which would cause a crash.

To avoid this, I think we need to call pending_verifier_request->request() while holding the mutex, and then use that to call Cancel() here.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 674 at r10 (raw file):

    }
    if (pending_verifier_request != nullptr) {
      verifier->Cancel(pending_verifier_request->request());

Same race condition here as above.


src/cpp/common/tls_certificate_verifier.cc, line 64 at r10 (raw file):

    request_map_.emplace(request->c_request(), std::move(callback));
  }
  grpc_status_code sync_status_c = GRPC_STATUS_OK;

Please call this status_code.


src/cpp/common/tls_certificate_verifier.cc, line 65 at r10 (raw file):

  }
  grpc_status_code sync_status_c = GRPC_STATUS_OK;
  char* error_details = gpr_strdup("");

This should be initialized to nullptr.


src/cpp/common/tls_certificate_verifier.cc, line 111 at r10 (raw file):

template <typename Subclass, typename... Args>
std::shared_ptr<CertificateVerifier> ExternalCertificateVerifier::Create(

The definition of this function needs to be moved to the .h file, because it's a template. The compiler needs to be able to expand the template at the call site.


src/cpp/common/tls_certificate_verifier.cc, line 158 at r10 (raw file):

        if (callback != nullptr) {
          grpc_status_code return_status = GRPC_STATUS_OK;
          char* return_error_details = gpr_strdup("");

There's no need for this variable or for making a copy of this string. Just pass status.error_details().c_str() directly to the callback below.


src/cpp/common/tls_certificate_verifier.cc, line 173 at r10 (raw file):

      *sync_status = static_cast<grpc_status_code>(
          sync_current_verifier_status.error_code());
      gpr_free(*sync_error_details);

This is not needed.


test/core/util/tls_utils.cc, line 82 at r10 (raw file):

  auto* self = static_cast<SyncExternalVerifier*>(user_data);
  if (self->success_) {
    return true;  // Synchronous call

We should set *sync_status = GRPC_STATUS_OK here.


test/core/util/tls_utils.cc, line 85 at r10 (raw file):

  }
  *sync_status = GRPC_STATUS_UNAUTHENTICATED;
  gpr_free(*sync_error_details);

This is not needed.


test/core/util/tls_utils.cc, line 108 at r10 (raw file):

void AsyncExternalVerifier::Destruct(void* user_data) {
  auto* self = static_cast<AsyncExternalVerifier*>(user_data);
  // Tell the thread to shut down.

Suggest moving this code to the dtor, so that all we have to do here is delete self.

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review so far! Finally, I think I've completed most of the code required in this PR. I added more tests in C++, and have one more question left to resolve. Once it's resolved, I can start to re-build the whole project and fix any issues in the pre-submit tests.

Reviewable status: 19 of 35 files reviewed, 32 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @veblush, @yashykt, and @yihuazhang)


include/grpc/grpc_security.h, line 1022 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/status/sync_status/

Done.


include/grpc/grpc_security.h, line 1023 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please move "error_details:" to the start of the next line, since it's a separate parameter. Also change it to "sync_error_details", to match the parameter name.

Done.


include/grpc/grpc_security.h, line 1024 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please document the fact that the implementation must allocate the string via gpr_malloc() or gpr_strdup().

Done.


include/grpc/grpc_security.h, line 1025 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please move "return:" to the start of the next line.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 70 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of allocating these at construction time, it might make more sense to store only c_request_ and construct these return values dynamically in the accessor methods. That way, we won't bother allocating the fields that aren't actually going to be used. I would expect the common case here to be that the verifier only looks at a subset of the fields, so that seems more optimal.

@veblush is currently investigating whether it will be possible to use absl in the C++ API. If we can do that, then we can use absl::string_view for all of the non-vector accessor methods, which would avoid the allocation entirely.

Sounds good. If so, I think it's better to return std::string instead of std::string& for the getter functions, because now we don't cache strings to the class?


include/grpcpp/security/tls_certificate_verifier.h, line 80 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This needs comments explaining the purpose of this class and showing how to use it.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 101 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use grpc::internal::Mutex.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 107 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This needs comments explaining the purpose of this class and showing how to use it.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 120 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is effectively a factory function, so it should be the first one declared, as per the style guide:

https://google.github.io/styleguide/cppguide.html#Declaration_Order

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 128 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this AsyncRequestState.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 134 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No reason to initialize these fields to nullptr if they are always going to be set by the ctor.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 150 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is really unfortunate. But I guess we can make this a TODO for now, assigned to @yihuazhang, and we can clean it up as part of eliminating insecure builds.

Done


include/grpcpp/security/tls_certificate_verifier.h, line 155 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

grpc::internal::Mutex

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 161 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No, there's no need to include the code in the message. This is not for logging; it's a strict conversion from one type to another.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 169 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should use message() instead of ToString().

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 42 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this status_code.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 43 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

There's no need to allocate anything here. Just set it to nullptr.

Note that gpr_free() is a no-op if its argument is nullptr.

Done.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.cc, line 178 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is not necessary, as per above.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No, I don't think we need to add anything to grpc_tls_credentials_options. I think we just need to add a new output to the Verify() method (which will be returned directly for sync verifiers and passed to the callback for async verifiers).

OK got it. Actually, I think we can just use one more output parameter like this:

virtual bool Verify(grpc_tls_custom_verification_check_request* request,
                      std::function<void(absl::Status)> callback,
                      absl::Status* sync_status, bool* check_per_call_host) = 0;

to cover both the sync and async case, right? Because even for async it can know whether it needs per-call host check beforehand. It doesn't have to put off setting that value until when the async operation is done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 56 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think this function is needed. In both the client and server cases, we are calling this to set all of the fields to null, and then immediately calling ParseTsiPeer() to reset the fields to the proper value. In most cases, this means that we're setting the fields twice, which is pointless. Instead, please change ParseTsiPeer() to set any field that it is not setting to null (e.g., after the if (!uri_names.empty()) { block, add else { request->peer_info.san_names.uri_names = nullptr; }).

Also, please rename ParseTsiPeer() to PendingVerifierRequestInit(), since that's really what we're doing there.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 385 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think there's a race condition here: If the async request completes after we release the lock but before we execute this, then pending_verifier_request may have been destroyed already, which would cause a crash.

To avoid this, I think we need to call pending_verifier_request->request() while holding the mutex, and then use that to call Cancel() here.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 674 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same race condition here as above.

Done.


src/cpp/common/tls_certificate_verifier.cc, line 64 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please call this status_code.

Done.


src/cpp/common/tls_certificate_verifier.cc, line 65 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should be initialized to nullptr.

Done.


src/cpp/common/tls_certificate_verifier.cc, line 111 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The definition of this function needs to be moved to the .h file, because it's a template. The compiler needs to be able to expand the template at the call site.

Ah! I should really read the comment first next time when trying something new(this my first time using a template) :) It took me some time to realize this :) Thanks for the suggestions!


src/cpp/common/tls_certificate_verifier.cc, line 158 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

There's no need for this variable or for making a copy of this string. Just pass status.error_details().c_str() directly to the callback below.

Got it. I realized the code can be simplified a lot, after knowing gpr_free can also handle nullptr case.
But I think it is still worthwhile to create a copy, though? In callback(), we might fire another thread, which can exist when the current block goes out of scope. It that case, since status has gone out of scope, we might want to keep another copy in that thread.
I know in current implementation this case doesn't exist, but shall we take this into account for future cases?


src/cpp/common/tls_certificate_verifier.cc, line 173 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is not needed.

Done.


test/core/util/tls_utils.cc, line 82 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We should set *sync_status = GRPC_STATUS_OK here.

Done


test/core/util/tls_utils.cc, line 85 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is not needed.

Done.


test/core/util/tls_utils.cc, line 108 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest moving this code to the dtor, so that all we have to do here is delete self.

Done.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is continuing to get closer!

Please let me know if you have any questions. Thanks!

Reviewed 16 of 16 files at r11.
Reviewable status: all files reviewed, 46 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @veblush, @yashykt, and @ZhenLian)


include/grpcpp/security/tls_certificate_verifier.h, line 70 at r10 (raw file):

Previously, ZhenLian wrote…

Sounds good. If so, I think it's better to return std::string instead of std::string& for the getter functions, because now we don't cache strings to the class?

Yes, that looks right.

Please add a TODO for veblush to change the return type of these methods to absl::string_view if we wind up being able to use absl in the C++ API before we wind up de-experimentalizing this API. (If we de-experimentalize this API first, then we'll be stuck with std::string here.)


include/grpcpp/security/tls_certificate_verifier.h, line 53 at r11 (raw file):

// request. Users should not directly create or destroy this request object, but
// shall interact with it through CertificateVerifier's Verify() and Cancel().
// The object of this class should not outlive the underlying c object.

This line is an implementation detail, not something users need to know, so I don't think it should be here.


include/grpcpp/security/tls_certificate_verifier.h, line 77 at r11 (raw file):

// The base class of all internal verifier implementations, and the ultimate
// class that all external verifiers will eventually be transformed into.
// - If you are a gRPC contributor and want to add an internal gRPC verifier,

This bullet point can be removed. This API documentation is for users, but this comment is for our internal use.


include/grpcpp/security/tls_certificate_verifier.h, line 92 at r11 (raw file):

  // verifier. The check on each internal verifier could be either synchronous
  // or asynchronous, and we will need to use return value to know.
  // Internally we invoke the verify() function of the c verifier object.

This line is not necessary; it's an implementation detail.


include/grpcpp/security/tls_certificate_verifier.h, line 93 at r11 (raw file):

  // or asynchronous, and we will need to use return value to know.
  // Internally we invoke the verify() function of the c verifier object.
  // One typical usage of this function is to compose external verifiers with

I think the comment about composability actually belongs on ExternalCertificateVerifier, not here.


include/grpcpp/security/tls_certificate_verifier.h, line 95 at r11 (raw file):

  // One typical usage of this function is to compose external verifiers with
  // internal verifiers. For example, users can compose their verifiers with a
  // |HostNameCertificateVerifier|. See "tls_test_utils.h" for some examples.

The reference to tls_test_utils.h should be removed, since that file is not part of the public API.


include/grpcpp/security/tls_certificate_verifier.h, line 111 at r11 (raw file):

              grpc::Status* sync_status);

  // Cancels a connection request. This is usually used to cleans up the

The wording here is a little confusing. I suggest the following:

Cancels a verification request previously started via Verify(). Used when the connection attempt times out or is cancelled while an async verification request is pending. The implementation should abort whatever async operation it is waiting for and quickly invoke the callback that was passed to Verify() with a status indicating the cancellation.


include/grpcpp/security/tls_certificate_verifier.h, line 138 at r11 (raw file):

// The base class of all external, user-specified verifiers. Users should
// inherit this class if a custom verifier is expected to be used.

s/if a custom verifier is expected to be used/to implement a custom verifier/


include/grpcpp/security/tls_certificate_verifier.h, line 144 at r11 (raw file):

  // the user-implemented verifiers should use this function to be converted to
  // verifiers compatible with |TlsCredentialsOptions|.
  // Subclass is created here, but it will be destroyed when the core

I think the rest of this comment can just say "The resulting CertificateVerifier takes ownership of the newly instantiated Subclass."


include/grpcpp/security/tls_certificate_verifier.h, line 158 at r11 (raw file):

  // The verification logic that will be performed after the TLS handshake
  // completes. Implementers can choose to do their checks synchronously or
  // asynchronously, and return different values.

I think the "and return different values" can be removed. It's a bit confusing to read, and I think what it's actually referring to is already covered below.


include/grpcpp/security/tls_certificate_verifier.h, line 176 at r11 (raw file):

                      grpc::Status* sync_status) = 0;

  // The cancellation logic that will be performed when the verifier is still

Suggest the same wording here as on line 111 above.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

Previously, ZhenLian wrote…

OK got it. Actually, I think we can just use one more output parameter like this:

virtual bool Verify(grpc_tls_custom_verification_check_request* request,
                      std::function<void(absl::Status)> callback,
                      absl::Status* sync_status, bool* check_per_call_host) = 0;

to cover both the sync and async case, right? Because even for async it can know whether it needs per-call host check beforehand. It doesn't have to put off setting that value until when the async operation is done.

Just because we don't have a case right now where we'll want to determine this dynamically doesn't mean that there will never be any such case in the future. I would allow it to be set via both sync and async APIs.

Bleh... I still really don't like having this in the verifier API at all. I guess an alternative might be to have a separate option in grpc_tls_credentials_options, as you mentioned above. The reason I had originally not wanted to do that is that it basically puts the work on the application instead of on the verifier implementation, which in general is the wrong trade-off, since there are going to be a lot more applications than there will be verifier implementations. But given how invasive this change is likely to be, I wonder if it would be better to just make the application set this. If we default it to true, then only applications that configure a verifier that does not include hostname verification will need to set this option, which might not be too bad if we think that will be a rare case. But if we think that will be common in the future (e.g., if people move more to spiffe and away from hostname-based certs), then we probably want to stick with making the verifier impl deal with this.

It might be useful to get some input from @jiangtaoli2016 on this when he gets back.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 385 at r10 (raw file):

Previously, ZhenLian wrote…

Done.

The way you've written this doesn't actually solve the problem, because you're acquiring the lock to do the map lookup, then releasing the lock, then acquiring the lock again to look inside of the map entry you looked up. But the map entry could be deleted between the first lock and the second lock.

I think what you really want to do is just change the type of pending_verifier_request from ChannelPendingVerifierRequest* to grpc_tls_custom_verification_check_request*, and set it to it->second->request() when you do the map lookup. Then we're only acquiring the lock once.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 674 at r10 (raw file):

Previously, ZhenLian wrote…

Done.

Same problem here as above.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 59 at r11 (raw file):

    tsi_peer peer, grpc_tls_custom_verification_check_request* request) {
  GPR_ASSERT(request != nullptr);
  request->target_name = nullptr;

Suggest passing in the target name and setting it here, rather than setting it to null here and then having the caller immediately reset it to something else after we return.


src/cpp/common/tls_certificate_verifier.cc, line 138 at r10 (raw file):

    auto pair = self->request_map_.emplace(
        request, AsyncCallbackProfile(callback, callback_arg, request));
    GPR_ASSERT(pair.second);

Why remove this assertion? If this is not true, there's a fairly serious bug somewhere, so I think it's good to have this. We don't want to fail to add the new entry if there's a conflict.


src/cpp/common/tls_certificate_verifier.cc, line 158 at r10 (raw file):

Previously, ZhenLian wrote…

Got it. I realized the code can be simplified a lot, after knowing gpr_free can also handle nullptr case.
But I think it is still worthwhile to create a copy, though? In callback(), we might fire another thread, which can exist when the current block goes out of scope. It that case, since status has gone out of scope, we might want to keep another copy in that thread.
I know in current implementation this case doesn't exist, but shall we take this into account for future cases?

The callback is going to make its own copy anyway by converting it to absl::Status. There is no benefit to making a copy of the C string here and passing ownership to the callback, because the callback will never actually keep that C string. This is just a needless allocation and deallocation.

More generally, even if there were cases where the callback did not want to make its own copy, it would still be more flexible for the callback API to not take ownership. That way, the callback can decide for itself what to do instead of being required to take ownership of a copy that it doesn't need and will immediately free.

This entire block should be just one statement:

callback(request, callback_arg, static_cast<grpc_status_code>(status.error_code()),
         status.error_message().c_str());

src/cpp/common/tls_certificate_verifier.cc, line 184 at r11 (raw file):

            callback = it->second.callback;
            callback_arg = it->second.callback_arg;
            // Question: would it be better if we erase this after the callback

The code in the callback belongs to a different class. It cannot see this map, which is internal to this class. Nor can it depend on any internal state of this class that may be in the map. So I don't see how the callback would ever care about this.

I think this is correct as-is.


test/core/util/tls_utils.cc, line 86 at r11 (raw file):

  }
  *sync_status = GRPC_STATUS_UNAUTHENTICATED;
  *sync_error_details = gpr_strdup("SyncExternalVerifierBadVerify failed");

s/SyncExternalVerifierBadVerify/SyncExternalVerifier/


test/core/util/tls_utils.cc, line 149 at r11 (raw file):

      request.callback(request.request, request.callback_arg,
                       GRPC_STATUS_UNAUTHENTICATED,
                       "AsyncExternalVerifierBadVerify failed");

s/AsyncExternalVerifierBadVerify/AsyncExternalVerifier/


test/cpp/security/tls_certificate_verifier_test.cc, line 47 at r11 (raw file):

      ExternalCertificateVerifier::Create<SyncCertificateVerifier>(true);
  TlsCustomVerificationCheckRequest cpp_request(&request);
  std::function<void(grpc::Status)> empty_callback;

Instead of declaring this, just pass nullptr to Verify().

Same for all the sync tests.


test/cpp/security/tls_certificate_verifier_test.cc, line 50 at r11 (raw file):

  grpc::Status sync_status;
  verifier->Verify(&cpp_request, empty_callback, &sync_status);
  EXPECT_TRUE(sync_status.ok());

Please add << sync_status.error_code() << " " << sync_status.error_message(), so that if the test fails, the actual error is logged.

Same thing for all tests where we're expecting an OK status.


test/cpp/security/tls_certificate_verifier_test.cc, line 65 at r11 (raw file):

  grpc::Status sync_status;
  verifier->Verify(&cpp_request, empty_callback, &sync_status);
  EXPECT_FALSE(sync_status.ok());

Instead of checking that ok() is false, please check the value of sync_status.error_code().

Same thing for all tests expecting a non-OK status.


test/cpp/security/tls_certificate_verifier_test.cc, line 87 at r11 (raw file):

TEST(TlsCertificateVerifierTest,
     SyncCertificateVerifierFailsOnHostnameAndAdditionalCheck) {

This test doesn't seem useful. It's actually testing the behavior of the fake verifier, whose only purpose is to help us test the production code.

In practice, any composing verifier will not bother to call any subsequent verifier if a previous one fails; it will just return immediately after the first failure.


test/cpp/security/tls_certificate_verifier_test.cc, line 129 at r11 (raw file):

  TlsCustomVerificationCheckRequest cpp_request(&request);
  std::function<void(grpc::Status)> callback = [](grpc::Status async_status) {
    EXPECT_EQ(async_status.error_message(),

Please also check async_status.error_code().


test/cpp/security/tls_certificate_verifier_test.cc, line 146 at r11 (raw file):

  TlsCustomVerificationCheckRequest cpp_request(&request);
  std::function<void(grpc::Status)> callback = [](grpc::Status async_status) {
    EXPECT_EQ(async_status.error_message(),

Please also check async_status.error_code().


test/cpp/security/tls_certificate_verifier_test.cc, line 154 at r11 (raw file):

TEST(TlsCertificateVerifierTest,
     AsyncCertificateVerifierFailsOnHostnameAndAdditionalCheck) {

Same as above: I don't think this test is actually useful.


test/cpp/security/tls_certificate_verifier_test.cc, line 202 at r11 (raw file):

  grpc::Status sync_status;
  verifier->Verify(&cpp_request, empty_callback, &sync_status);
  EXPECT_FALSE(sync_status.ok());

Please check the actual status code and message here.


test/cpp/util/tls_test_utils.h, line 38 at r11 (raw file):

class SyncCertificateVerifier : public ExternalCertificateVerifier {
 public:
  SyncCertificateVerifier(bool success)

Single-argument ctors need to be explicit.


test/cpp/util/tls_test_utils.h, line 39 at r11 (raw file):

 public:
  SyncCertificateVerifier(bool success)
      : ExternalCertificateVerifier(),

You don't need to explicitly call the base class's ctor if you're not passing any parameters to it.


test/cpp/util/tls_test_utils.h, line 41 at r11 (raw file):

      : ExternalCertificateVerifier(),
        success_(success),
        hostname_verifier_(std::make_shared<HostNameCertificateVerifier>()) {}

It seems odd to me that these verifier implementations are composing with the hostname verifier, whereas the corresponding verifier implementations used in the core tests (SyncExternalVerifier and AsyncExternalVerifier) don't do that. Why is it useful to test this composition for C++ but not C-core? If we do want to test this composition, it seems like we should do it in both places.

Even if there are some cases where we do actually care about testing composition with the hostname verifier, I suspect that the majority of tests will be simpler without involving the hostname verifier. So if there are cases where we want this composition, maybe what we really want is the option to control whether or not we do hostname verification. We could do that by having two different versions of these verifiers, one of which calls the hostname verifier and one which doesn't, or we could stick with a single version of these verifiers but give them an option to control whether the hostname verifier is called.


test/cpp/util/tls_test_utils.h, line 58 at r11 (raw file):

class AsyncCertificateVerifier : public ExternalCertificateVerifier {
 public:
  AsyncCertificateVerifier(bool success)

Same as above: Single-argument ctors need to be explicit.


test/cpp/util/tls_test_utils.h, line 59 at r11 (raw file):

 public:
  AsyncCertificateVerifier(bool success)
      : ExternalCertificateVerifier(),

Same as above: You don't need to explicitly call the base class's ctor if you're not passing any parameters to it.


test/cpp/util/tls_test_utils.h, line 61 at r11 (raw file):

      : ExternalCertificateVerifier(),
        success_(success),
        hostname_verifier_(std::make_shared<HostNameCertificateVerifier>()) {}

Same as above with regard to making the composition with the hostname verifier optional.


test/cpp/util/tls_test_utils.h, line 84 at r11 (raw file):

  bool success_ = false;
  std::shared_ptr<HostNameCertificateVerifier> hostname_verifier_;
  std::mutex mu_;

Please use grpc::internal::Mutex, and use a lock annotation on the map below.


test/cpp/util/tls_test_utils.cc, line 31 at r11 (raw file):

                                     std::function<void(grpc::Status)> callback,
                                     grpc::Status* sync_status) {
  bool is_done = hostname_verifier_->Verify(

The whole point of this class is to be a sync verifier. If the hostname verifier was ever changed to be async, then using it here would defeat the purpose of this class by turning it into an async verifier.

Given that, I think this can be changed to pass nullptr as the callback and then assert that is_done is true. That way, the test will fail if the hostname verifier is ever changed to be async, which will be an indication that this class and any tests using it will need to be refactored.

I think this can just be:

GPR_ASSERT(hostname_verifier_->Verify(request, nullptr, sync_status));
AdditionalSyncCheck(success_, sync_status);
return true;

Also, now that there's only one place where we're calling AdditionalSyncCheck(), I suggest just inlining that code here instead of having a separate function.


test/cpp/util/tls_test_utils.cc, line 44 at r11 (raw file):

}

void SyncCertificateVerifier::AdditionalSyncCheck(bool success,

The success parameter is not necessary here, because this method can just directly access the success_ data member.


test/cpp/util/tls_test_utils.cc, line 46 at r11 (raw file):

void SyncCertificateVerifier::AdditionalSyncCheck(bool success,
                                                  grpc::Status* sync_status) {
  if (!success) {

If success is true, then we should explicitly set *sync_status to grpc::Status::OK.


test/cpp/util/tls_test_utils.cc, line 49 at r11 (raw file):

    if (!sync_status->ok()) {
      *sync_status =
          grpc::Status(sync_status->error_code(),

It doesn't seem useful to merge the two statuses this way. Instead, if the hostname verifier fails, I would just use its status directly and not bother looking at success_.


test/cpp/util/tls_test_utils.cc, line 61 at r11 (raw file):

AsyncCertificateVerifier::~AsyncCertificateVerifier() {
  while (true) {

This approach seems pretty ugly. Instead, I suggest changing this to just use the same single-worker-thread approach as in AsyncExternalVerifier.


test/cpp/util/tls_test_utils.cc, line 62 at r11 (raw file):

AsyncCertificateVerifier::~AsyncCertificateVerifier() {
  while (true) {
    if (request_map_.empty()) {

This will require grabbing the lock, or else it won't be thread-safe.


test/cpp/util/tls_test_utils.cc, line 79 at r11 (raw file):

  thread_args->self = this;
  thread_args->request = request;
  bool is_done = hostname_verifier_->Verify(

Same as above: Assert that this returns true, and pass nullptr as the callback.


test/cpp/util/tls_test_utils.cc, line 118 at r11 (raw file):

      if (!thread_args->status.ok()) {
        return_status =
            grpc::Status(thread_args->status.error_code(),

Same as above: It doesn't seem useful to merge the two statuses this way. Instead, if the hostname verifier fails, I would just use its status directly and not bother looking at success_.


test/cpp/util/tls_test_utils.cc, line 135 at r11 (raw file):

    auto it = self->request_map_.find(request);
    if (it != self->request_map_.end()) {
      self->request_map_.erase(it);

This should be done where we do the map lookup above. There's no reason to reacquire the lock later and redo the lookup.

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the comments and re-built the project. I am stuck in fixing a bug caused by introducing new changes to xds_end2end_test.cc(some tests would just go timeout without any errors printed after the change). I will keep working on fixing that, but feel free to let me know any suspicious spot in the code that might cause the problem. Thank you so much!

Reviewable status: all files reviewed, 46 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @veblush, and @yashykt)


include/grpcpp/security/tls_certificate_verifier.h, line 70 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Yes, that looks right.

Please add a TODO for veblush to change the return type of these methods to absl::string_view if we wind up being able to use absl in the C++ API before we wind up de-experimentalizing this API. (If we de-experimentalize this API first, then we'll be stuck with std::string here.)

OK, I added a TODO to myself. I will check with @veblush before trying to de-experimentalizing this API.


include/grpcpp/security/tls_certificate_verifier.h, line 53 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This line is an implementation detail, not something users need to know, so I don't think it should be here.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 77 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This bullet point can be removed. This API documentation is for users, but this comment is for our internal use.

Sure, I've revised the wording a bit. PTAL, thanks!


include/grpcpp/security/tls_certificate_verifier.h, line 92 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This line is not necessary; it's an implementation detail.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 93 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think the comment about composability actually belongs on ExternalCertificateVerifier, not here.

Hmm...actually I think both CertificateVerifier.Verify() and ExternalCertificateVerifier.Verify() can be composed.
My original intention for commenting was that, users might wonder what they could do with CertificateVerifier.Verify(), since it is public, but CertificateVerifier's direct use is not involved with this Verify() API(users just need to create it, set it in the options, and then they are good to go). The answer for that is, there exists an indirect use ofCertificateVerifier.Verify() to compose internal verifiers into the user-implemented ones. I just wanted to point that out.

(And also, the semantics of CertificateVerifier.Verify() and ExternalCertificateVerifier.Verify() are a bit different. CertificateVerifier.Verify()is a function users can invoke during their composition, while ExternalCertificateVerifier.Verify() is a function users should implement while making their own verifiers. I wanted to make this clear as well, but not sure if the comment is clear enough. Feel free to make any suggestions on how we can improve the comments. Thank you so much!)


include/grpcpp/security/tls_certificate_verifier.h, line 95 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The reference to tls_test_utils.h should be removed, since that file is not part of the public API.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 111 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The wording here is a little confusing. I suggest the following:

Cancels a verification request previously started via Verify(). Used when the connection attempt times out or is cancelled while an async verification request is pending. The implementation should abort whatever async operation it is waiting for and quickly invoke the callback that was passed to Verify() with a status indicating the cancellation.

Thanks for the suggestion! Just one minor thing, I think the sentence The implementation should ... is speaking to users who might want to implement CertificateVerifier.Cancel(), while in CertificateVerifier.Cancel() users mostly likely won't need to implement it. They would just need to invoke it in a composition scenario. So it seems not appropriate to put it here.
But I find it a perfect statement for ExternalCertificateVerifier.Cancel().


include/grpcpp/security/tls_certificate_verifier.h, line 138 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/if a custom verifier is expected to be used/to implement a custom verifier/

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 144 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think the rest of this comment can just say "The resulting CertificateVerifier takes ownership of the newly instantiated Subclass."

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 158 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think the "and return different values" can be removed. It's a bit confusing to read, and I think what it's actually referring to is already covered below.

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 176 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest the same wording here as on line 111 above.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 385 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The way you've written this doesn't actually solve the problem, because you're acquiring the lock to do the map lookup, then releasing the lock, then acquiring the lock again to look inside of the map entry you looked up. But the map entry could be deleted between the first lock and the second lock.

I think what you really want to do is just change the type of pending_verifier_request from ChannelPendingVerifierRequest* to grpc_tls_custom_verification_check_request*, and set it to it->second->request() when you do the map lookup. Then we're only acquiring the lock once.

Ah yeah, that's right. Thanks for the catch!


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 674 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same problem here as above.

Done.


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 59 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest passing in the target name and setting it here, rather than setting it to null here and then having the caller immediately reset it to something else after we return.

Done.


src/cpp/common/tls_certificate_verifier.cc, line 138 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why remove this assertion? If this is not true, there's a fairly serious bug somewhere, so I think it's good to have this. We don't want to fail to add the new entry if there's a conflict.

Done.


src/cpp/common/tls_certificate_verifier.cc, line 158 at r10 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The callback is going to make its own copy anyway by converting it to absl::Status. There is no benefit to making a copy of the C string here and passing ownership to the callback, because the callback will never actually keep that C string. This is just a needless allocation and deallocation.

More generally, even if there were cases where the callback did not want to make its own copy, it would still be more flexible for the callback API to not take ownership. That way, the callback can decide for itself what to do instead of being required to take ownership of a copy that it doesn't need and will immediately free.

This entire block should be just one statement:

callback(request, callback_arg, static_cast<grpc_status_code>(status.error_code()),
         status.error_message().c_str());

That's reasonable. Sounds good!


src/cpp/common/tls_certificate_verifier.cc, line 184 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The code in the callback belongs to a different class. It cannot see this map, which is internal to this class. Nor can it depend on any internal state of this class that may be in the map. So I don't see how the callback would ever care about this.

I think this is correct as-is.

OK got it!


test/core/util/tls_utils.cc, line 86 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/SyncExternalVerifierBadVerify/SyncExternalVerifier/

Done.


test/core/util/tls_utils.cc, line 149 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/AsyncExternalVerifierBadVerify/AsyncExternalVerifier/

Done.


test/cpp/security/tls_certificate_verifier_test.cc, line 47 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of declaring this, just pass nullptr to Verify().

Same for all the sync tests.

Done.


test/cpp/security/tls_certificate_verifier_test.cc, line 50 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add << sync_status.error_code() << " " << sync_status.error_message(), so that if the test fails, the actual error is logged.

Same thing for all tests where we're expecting an OK status.

Done.


test/cpp/security/tls_certificate_verifier_test.cc, line 65 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of checking that ok() is false, please check the value of sync_status.error_code().

Same thing for all tests expecting a non-OK status.

Done.


test/cpp/security/tls_certificate_verifier_test.cc, line 87 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This test doesn't seem useful. It's actually testing the behavior of the fake verifier, whose only purpose is to help us test the production code.

In practice, any composing verifier will not bother to call any subsequent verifier if a previous one fails; it will just return immediately after the first failure.

Done.


test/cpp/security/tls_certificate_verifier_test.cc, line 129 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please also check async_status.error_code().

Done.


test/cpp/security/tls_certificate_verifier_test.cc, line 146 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please also check async_status.error_code().

Done.


test/cpp/security/tls_certificate_verifier_test.cc, line 154 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same as above: I don't think this test is actually useful.

Done.


test/cpp/security/tls_certificate_verifier_test.cc, line 202 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please check the actual status code and message here.

Done.


test/cpp/util/tls_test_utils.h, line 38 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Single-argument ctors need to be explicit.

Done.


test/cpp/util/tls_test_utils.h, line 39 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

You don't need to explicitly call the base class's ctor if you're not passing any parameters to it.

Done.


test/cpp/util/tls_test_utils.h, line 41 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It seems odd to me that these verifier implementations are composing with the hostname verifier, whereas the corresponding verifier implementations used in the core tests (SyncExternalVerifier and AsyncExternalVerifier) don't do that. Why is it useful to test this composition for C++ but not C-core? If we do want to test this composition, it seems like we should do it in both places.

Even if there are some cases where we do actually care about testing composition with the hostname verifier, I suspect that the majority of tests will be simpler without involving the hostname verifier. So if there are cases where we want this composition, maybe what we really want is the option to control whether or not we do hostname verification. We could do that by having two different versions of these verifiers, one of which calls the hostname verifier and one which doesn't, or we could stick with a single version of these verifiers but give them an option to control whether the hostname verifier is called.

Yeah, composing host name verifier into another verifier won't have too much actual meanings - I guess people would seldom do that in a real production. My original intention was to have a test to cover the composition cases, to showcase how to do that in a real example. But I guess a test is not a good place for demonstrating something... let's just keep them simple and easy to understand.
I've changed the C++ tests to match the core tests. They should now be very similar...


test/cpp/util/tls_test_utils.h, line 58 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same as above: Single-argument ctors need to be explicit.

Done.


test/cpp/util/tls_test_utils.h, line 59 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same as above: You don't need to explicitly call the base class's ctor if you're not passing any parameters to it.

Done.


test/cpp/util/tls_test_utils.h, line 61 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same as above with regard to making the composition with the hostname verifier optional.

Done.


test/cpp/util/tls_test_utils.h, line 84 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use grpc::internal::Mutex, and use a lock annotation on the map below.

Done.


test/cpp/util/tls_test_utils.cc, line 31 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The whole point of this class is to be a sync verifier. If the hostname verifier was ever changed to be async, then using it here would defeat the purpose of this class by turning it into an async verifier.

Given that, I think this can be changed to pass nullptr as the callback and then assert that is_done is true. That way, the test will fail if the hostname verifier is ever changed to be async, which will be an indication that this class and any tests using it will need to be refactored.

I think this can just be:

GPR_ASSERT(hostname_verifier_->Verify(request, nullptr, sync_status));
AdditionalSyncCheck(success_, sync_status);
return true;

Also, now that there's only one place where we're calling AdditionalSyncCheck(), I suggest just inlining that code here instead of having a separate function.

Done.


test/cpp/util/tls_test_utils.cc, line 44 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The success parameter is not necessary here, because this method can just directly access the success_ data member.

Done.


test/cpp/util/tls_test_utils.cc, line 46 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If success is true, then we should explicitly set *sync_status to grpc::Status::OK.

Done.


test/cpp/util/tls_test_utils.cc, line 49 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It doesn't seem useful to merge the two statuses this way. Instead, if the hostname verifier fails, I would just use its status directly and not bother looking at success_.

Changed the test structure to use a single processing thread for the async verifier.


test/cpp/util/tls_test_utils.cc, line 61 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This approach seems pretty ugly. Instead, I suggest changing this to just use the same single-worker-thread approach as in AsyncExternalVerifier.

Done.


test/cpp/util/tls_test_utils.cc, line 62 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This will require grabbing the lock, or else it won't be thread-safe.

Changed the test structure to use a single processing thread for the async verifier.


test/cpp/util/tls_test_utils.cc, line 79 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same as above: Assert that this returns true, and pass nullptr as the callback.

Changed the test structure to use a single processing thread for the async verifier.


test/cpp/util/tls_test_utils.cc, line 118 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same as above: It doesn't seem useful to merge the two statuses this way. Instead, if the hostname verifier fails, I would just use its status directly and not bother looking at success_.

Changed the test structure to use a single processing thread for the async verifier.


test/cpp/util/tls_test_utils.cc, line 135 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should be done where we do the map lookup above. There's no reason to reacquire the lock later and redo the lookup.

That's actually because we have an infinite loop in dtor that will break if the map is empty. So if the map is cleared before we do callback(), there could still be chances that the thread is running while the whole class is destroyed. It will cause flaky use-after-free bugs.
But yeah, I've changed to use the single processing thread.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can point me at which tests are failing, I might be able to suggest where to look.

Otherwise, this looks really good! What's left is mainly fixing the cmp() method on the security connectors and addressing the problem of the per-call host check.

Please let me know if you have any questions. Thanks!

Reviewed 31 of 31 files at r12.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @yashykt, and @ZhenLian)


include/grpcpp/security/tls_certificate_verifier.h, line 77 at r11 (raw file):

Previously, ZhenLian wrote…

Sure, I've revised the wording a bit. PTAL, thanks!

Suggest the following wording instead:

"""
To implement a custom verifier, do not extend this class; instead, implement a subclass of ExternalCertificateVerifier. Note that custom verifier implementations can compose their functionality with existing implementations of this interface, such as HostnameVerifier, by delegating to an instance of that class.
"""


include/grpcpp/security/tls_certificate_verifier.h, line 93 at r11 (raw file):

Previously, ZhenLian wrote…

Hmm...actually I think both CertificateVerifier.Verify() and ExternalCertificateVerifier.Verify() can be composed.
My original intention for commenting was that, users might wonder what they could do with CertificateVerifier.Verify(), since it is public, but CertificateVerifier's direct use is not involved with this Verify() API(users just need to create it, set it in the options, and then they are good to go). The answer for that is, there exists an indirect use ofCertificateVerifier.Verify() to compose internal verifiers into the user-implemented ones. I just wanted to point that out.

(And also, the semantics of CertificateVerifier.Verify() and ExternalCertificateVerifier.Verify() are a bit different. CertificateVerifier.Verify()is a function users can invoke during their composition, while ExternalCertificateVerifier.Verify() is a function users should implement while making their own verifiers. I wanted to make this clear as well, but not sure if the comment is clear enough. Feel free to make any suggestions on how we can improve the comments. Thank you so much!)

I don't think we need to talk about composition here. We don't need to justify the existence of this method, we just need to document how to use it. The composition mechanism is covered by the comment above about custom implementations being able to delegate to other CertificateVerfier instances.


include/grpcpp/security/tls_certificate_verifier.h, line 111 at r11 (raw file):

Previously, ZhenLian wrote…

Thanks for the suggestion! Just one minor thing, I think the sentence The implementation should ... is speaking to users who might want to implement CertificateVerifier.Cancel(), while in CertificateVerifier.Cancel() users mostly likely won't need to implement it. They would just need to invoke it in a composition scenario. So it seems not appropriate to put it here.
But I find it a perfect statement for ExternalCertificateVerifier.Cancel().

You're right, it looks good that way.

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log for one of the failing tests is here: https://gist.github.com/ZhenLian/e47f5cbb031d7127c70f98d50b3c2745. I noticed that for the failing e2e tests, the channel security connector is not destructed, and the error is a Handshake timed out, so I guess something is wrongly holding a ref to the security connector when not needed and hence causing the problem.

I didn't push the updated changes for xds_end2end_test.cc yet(because it has this bug), but I can push it if you want me to do that.

BTW, this e2e test is super long, and since the error is a timeout issue, it takes very long to get the log result. I am wondering if there is any way to run a particular test? Take the above failing test(XdsTest/XdsSecurityTest.TestMtlsConfigurationWithRootPluginUpdate/XdsResolverV3XdsCreds) for example, I tried to run bazel test //test/cpp/end2end:xds_end2end_test --test_output=all --config=asan --test_filter="TestMtlsConfigurationWithRootPluginUpdate", but it just passed. Did I do anything wrong here?

Any help is very much appreciated!

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @yashykt, and @ZhenLian)

Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed the latest changes, and the most recent commit includes the part for xds_end2end_test.cc. It will generate the error linked above if I ran bazel test //test/cpp/end2end:xds_end2end_test --test_output=all --config=asan. Any guess/suggestion on how to fix the problem is really appreciated. Thanks!

Reviewable status: 53 of 57 files reviewed, 5 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @yashykt, and @ZhenLian)

@ZhenLian ZhenLian force-pushed the zhen_custom_verification_1 branch from 1b1d778 to 12fe512 Compare May 22, 2021 21:35
Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I've fixed some small issues in code which solves the previous Handshake timed out error, but the test would still time out after all the handshakes are complete(See updated logs https://gist.github.com/ZhenLian/e47f5cbb031d7127c70f98d50b3c2745).

From the logs, it seems the security connector and all the external verifiers, including grpc::experimental::ExternalCertificateVerifier and grpc_core::ExternalCertificateVerifier, are never destroyed. grpc::experimental::CertificateVerifieris created and then immediately destroyed, though. Most of the failing tests seem to be related toXdsSecurityTest`.

I am not very familiar with how the Xds are structured on top. The only thing I could tell that might be relevant to this ownership problem is the way we manage our C++ classes. We use Create(...) to transfer the subclass of grpc::experimental::ExternalCertificateVerifier to grpc::experimental::CertificateVerifier, but we don't want the former to be destroyed when the latter get destroyed; instead, we associate the ownership of the former to the core objects. But when would the core objects be destroyed then? Without a bond to the C++ classes, we can't rely on users to tell us when to destroy the object. If we will have to do it ourselves, what is the right timing to destroy the core objects, and hence grpc::experimental::ExternalCertificateVerifier? I didn't find anywhere in code we are doing this right now.

I am almost out of way for how to debug this. So any way we could even get closer to the solution would be very much appreciated. Thank you so much!

Reviewable status: 53 of 57 files reviewed, 5 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @yashykt, and @ZhenLian)

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that we don't link the lifetime of the C++ object to the lifetime of the core object is intentional. We discussed this in the design doc: we need to do it that way, because we can't pass the C++ shared_ptr<> into core. That's why they need to be separate objects in the first place.

For the core object, each channel that the object is used for should be holding a ref to it. When those channels shut down, they should drop the refs, and the object should be destroyed at that point.

I've added a suggestion below for how to debug the ref-counting problem. If you can't figure it out, I suggest asking @yashykt to help, since he wrote all of the xDS security code and the corresponding tests.

Reviewed 1 of 10 files at r14.
Reviewable status: 47 of 57 files reviewed, 15 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, @yashykt, and @ZhenLian)


include/grpcpp/security/tls_certificate_verifier.h, line 87 at r14 (raw file):

 public:
  CertificateVerifier(grpc_tls_certificate_verifier* v) : verifier_(v) {
    gpr_log(GPR_ERROR,

It looks like you've added a bunch of these log messages, presumably for debugging. Please make sure to remove these before merging.


src/core/lib/security/credentials/tls/grpc_tls_certificate_verifier.h, line 40 at r14 (raw file):

    : public grpc_core::RefCounted<grpc_tls_certificate_verifier> {
 public:
  grpc_tls_certificate_verifier() = default;

For ref-count debugging, I suggest temporarily changing this to:

grpc_tls_certificate_verifier() : grpc_core::RefCounted("core cert verifier") {}

This will provide trace logging messages whenever the refs of this object change, so you can trace where the refs and unrefs are happening and see where we're failing to drop a ref.


src/core/lib/security/credentials/xds/xds_credentials.cc, line 71 at r14 (raw file):

      : cluster_name_(std::move(cluster_name)) {
    gpr_log(GPR_ERROR, "XdsCertificateVerifier ctor is called");
    xds_matcher_ = xds_certificate_provider->GetSanMatchers(cluster_name_);

I don't think it's correct to get the SAN matchers from the XdsCertificateProvider only at instantiation time, because they can change over the lifetime of the object. We need to use the current value each time there is a connection attempt.

I assume that you made this change to avoid holding a ref to the XdsCertificateProvider here, but I don't see why that would be necessary. The certificate provider isn't holding a ref to the certificate verifier, so there's no circular reference here that would prevent everything from being shut down.


test/cpp/end2end/xds_end2end_test.cc, line 97 at r14 (raw file):

#include "test/cpp/util/tls_test_utils.h"

#include "test/core/util/tls_utils.h"

Doesn't look like this include is actually needed.


test/cpp/end2end/xds_end2end_test.cc, line 1544 at r14 (raw file):

      ReadFile(kCaCertPath), identity_key_cert_pairs);
  grpc::experimental::TlsChannelCredentialsOptions options;
  options.set_certificate_provider(certificate_provider);

std::move()


test/cpp/end2end/xds_end2end_test.cc, line 1550 at r14 (raw file):

      ExternalCertificateVerifier::Create<SyncCertificateVerifier>(true);
  options.set_verify_server_certs(true);
  options.set_certificate_verifier(verifier);

std::move()


test/cpp/end2end/xds_end2end_test.cc, line 7598 at r14 (raw file):

        ReadFile(kCaCertPath), identity_key_cert_pairs);
    grpc::experimental::TlsChannelCredentialsOptions options;
    options.set_certificate_provider(certificate_provider);

std::move()


test/cpp/end2end/xds_end2end_test.cc, line 7604 at r14 (raw file):

        ExternalCertificateVerifier::Create<SyncCertificateVerifier>(true);
    options.set_verify_server_certs(true);
    options.set_certificate_verifier(verifier);

std::move()


test/cpp/end2end/xds_end2end_test.cc, line 7622 at r14 (raw file):

        std::make_shared<StaticDataCertificateProvider>(ReadFile(kCaCertPath));
    grpc::experimental::TlsChannelCredentialsOptions options;
    options.set_certificate_provider(certificate_provider);

std::move()


test/cpp/end2end/xds_end2end_test.cc, line 7627 at r14 (raw file):

        ExternalCertificateVerifier::Create<SyncCertificateVerifier>(true);
    options.set_verify_server_certs(true);
    options.set_certificate_verifier(verifier);

std::move()

Copy link

@jiangtaoli2016 jiangtaoli2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 47 of 57 files reviewed, 15 unresolved discussions (waiting on @markdroth, @yashykt, and @ZhenLian)


src/core/lib/security/security_connector/tls/tls_security_connector.cc, line 331 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Just because we don't have a case right now where we'll want to determine this dynamically doesn't mean that there will never be any such case in the future. I would allow it to be set via both sync and async APIs.

Bleh... I still really don't like having this in the verifier API at all. I guess an alternative might be to have a separate option in grpc_tls_credentials_options, as you mentioned above. The reason I had originally not wanted to do that is that it basically puts the work on the application instead of on the verifier implementation, which in general is the wrong trade-off, since there are going to be a lot more applications than there will be verifier implementations. But given how invasive this change is likely to be, I wonder if it would be better to just make the application set this. If we default it to true, then only applications that configure a verifier that does not include hostname verification will need to set this option, which might not be too bad if we think that will be a rare case. But if we think that will be common in the future (e.g., if people move more to spiffe and away from hostname-based certs), then we probably want to stick with making the verifier impl deal with this.

It might be useful to get some input from @jiangtaoli2016 on this when he gets back.

I am fine with adding a separate option field to specify if application wants check_per_call_host.

@ZhenLian ZhenLian force-pushed the zhen_custom_verification_1 branch from 12fe512 to a655210 Compare May 28, 2021 22:58
Copy link
Contributor Author

@ZhenLian ZhenLian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions so far! I've fixed the minor comments and made the temporary changes to debug the ref-counting issue, but didn't get time to take a look at the logs. Just FYI, I will be OOO next week. Once I am back, I will continue on debugging this issue and maybe have a meeting with @yashykt.
Thanks everyone for the contribution so far!

Reviewable status: 47 of 57 files reviewed, 15 unresolved discussions (waiting on @markdroth, @yashykt, and @ZhenLian)


include/grpcpp/security/tls_certificate_verifier.h, line 77 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest the following wording instead:

"""
To implement a custom verifier, do not extend this class; instead, implement a subclass of ExternalCertificateVerifier. Note that custom verifier implementations can compose their functionality with existing implementations of this interface, such as HostnameVerifier, by delegating to an instance of that class.
"""

Done.


include/grpcpp/security/tls_certificate_verifier.h, line 93 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think we need to talk about composition here. We don't need to justify the existence of this method, we just need to document how to use it. The composition mechanism is covered by the comment above about custom implementations being able to delegate to other CertificateVerfier instances.

OK, sounds good. Removed the paragraph talking about composition.


include/grpcpp/security/tls_certificate_verifier.h, line 87 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It looks like you've added a bunch of these log messages, presumably for debugging. Please make sure to remove these before merging.

Yeah, they are mainly used for debugging purpose. Let's keep this open, and I've reply to this comment again once all of them get removed.


src/core/lib/security/credentials/xds/xds_credentials.cc, line 71 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think it's correct to get the SAN matchers from the XdsCertificateProvider only at instantiation time, because they can change over the lifetime of the object. We need to use the current value each time there is a connection attempt.

I assume that you made this change to avoid holding a ref to the XdsCertificateProvider here, but I don't see why that would be necessary. The certificate provider isn't holding a ref to the certificate verifier, so there's no circular reference here that would prevent everything from being shut down.

Yeah, that was my original intention. It did seem have nothing to do with the timeout error. I will change the code back.


test/cpp/end2end/xds_end2end_test.cc, line 97 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Doesn't look like this include is actually needed.

Done.


test/cpp/end2end/xds_end2end_test.cc, line 1544 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.


test/cpp/end2end/xds_end2end_test.cc, line 1550 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.


test/cpp/end2end/xds_end2end_test.cc, line 7598 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.


test/cpp/end2end/xds_end2end_test.cc, line 7604 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.


test/cpp/end2end/xds_end2end_test.cc, line 7622 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.


test/cpp/end2end/xds_end2end_test.cc, line 7627 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.

@ZhenLian
Copy link
Contributor Author

The Windows tests are failing because of a known issue, and the error logs seems related to the setup settings, not this PR.
The Interop cloud-to-cloud tests failing is a known issue: #24375
I am going to merge this soon.

@ZhenLian ZhenLian merged commit 2e14f6f into grpc:master Nov 10, 2021
@copybara-service copybara-service bot added the imported Specifies if the PR has been imported to the internal repository label Nov 10, 2021
krestofur added a commit to krestofur/grpc that referenced this pull request Nov 24, 2021
* Revert low Java throughput hotfix; implement permanent fix (grpc#27919)

* xds/interop: Completely disable principal-present authz test (grpc#27933)

A broken fix for the server-side bug is producing invalid configuration,
causing the server to reject all the configuration. Disable the
configuration and tests until fix is fixed.

* Add EventEngine::CancelConnect (grpc#27764)

* Fix crash in bloat diff if diff_size != 0 (grpc#27935)

* Update bloat_diff.py

* Automated change: Fix sanity tests (grpc#27936)

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* ugh (grpc#27937)

* [BinderTransport] Avoid depending on NdkBinder at compile time (grpc#27912)

* [BinderTransport] Avoid depending on NdkBinder at compile time

We would like to make it possible to use BinderTransport in a APK that
has min sdk version lower than 29 (NdkBinder was introduced at 29)

We copies constants and type definitions from Ndk headers, creates a
same name wrapper for every NdkBinder API we use in
grpc_binder::ndk_util namespace.

We will try to load libbinder_ndk.so and resolve the symbol when the
NdkBinder API wrappers are invoked.

* regenerate projects

* Add GRPC_NO_BINDER guard

* Appnet schema changes (grpc#27874)

* Add back references and scope field

* Set scope in router

* Reverse order of cleanup

* Add router_scope flag

* Use router_scope flag to create Router

* I apparently don't know how to brain

* Yapf

* Yeah, that can't be the default

* Remove debug print

* Remove impossible todos

* And another

* Switch from router-scope to config-scope

* Implement schema changes

* Use backend service URL

* Use CLH reference format to backend service

* I am an idiot

* *internal screaming*

* Try project number

* Why is this all awful

* Go back to trying project name

* Try cleaning things up

* Agh

* Address review comments

* Remove superfluous Optional type

* Tweak bloat thresholds (grpc#27942)

* Add Schedule and DrainQueue to WorkSerializer (grpc#27902)

* Add Schedule and DrainQueue to WorkSerializer

* Fix comments

* Reviewer comments

* Use acq_rel semantics instead of seq_cst

* Reviewer comments

* s/dummy/no-op

* Reviewer comments

* Fix

* Reviewer comments

* Reviewer comments

* xds_end2end_test: Only start backends when needed (grpc#27911)

* xds_end2end_test: Fix flakiness on WaitForLdsNack

* xds_end2end_test: Only start the server when we want

* Revert WaitForNack changes

* Fixes

* Fix CsdsShortAdsTimeoutTest

* Fix sanity

* Revert grpc#27780 (grpc#27944)

* bump version on master to 1.43-dev (grpc#27930)

* bump version on master to 1.43-dev

* bump core version

* update g_stands_for.md

* Add support for abstract unix domain sockets (grpc#27906)

* Add failing end2end test for inconsistent percent-decoding of URIs

* Add passing h2_local_abstract_uds end2end tests

* null-safe string_view

* mac doesn't UDS

* mac doesn't do *abstract* UDS

* Add an experimental binder transport API for pre-finding Java class (grpc#27939)

Due to limitation of JVM, when user want to create binder channel in
threads created in unmanaged native code, they will need to call this
new API first to make sure Java helper classes can correctly be found.

Unused code in jni_utils.cc are also cleaned up in this commit

* Add grpc-java 1.42.0 to client_matrix.py (grpc#27949)

* Early exit BackUp() on count == 0 (grpc#27946)

* First pass IWYU tooling (grpc#27857)

* First pass IWYU tooling

* fix

* Automated change: Fix sanity tests

* Update iwyu.sh

* Update iwyu.sh

* Update Dockerfile.template

* Update iwyu.sh

* Automated change: Fix sanity tests

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* OpenCensus: Move metadata storage to arena (grpc#27948)

* fix LB policy call status notification interface, and other improvements (grpc#27947)

* fix LB policy call status notification interface, and other improvements

* fix build

* address review comments

* Use PUT not POST to avoid duplicate bloat labels (grpc#27952)

* tcp_client_custom: fix socket leak (grpc#27408)

* Delete happy-pancakes.yml (grpc#27955)

I'm not going to get a chance to finish this for a while, so delete for now.

* Address reviewer comments on grpc#27906 (grpc#27954)

* Address reviewer comments on grpc#27906

* fixes

* absl::ConsumePrefix FTW!

* Assert Android API >= v21 (grpc#27943)

* Assert Android API >= v21

This precedes a change that would otherwise break on older Android APIs,
but in a more obvious way.

* error if __ANDROID_API__ is not defined

* update all Andriod minSdkVersions to 21

* csharp experimental: android-19 to 21

* Api fuzzer extensions to support simulating traffic congestion (grpc#27820)

* adding api_fuzzer changes

* adding api fuzzer changes

* updating some typos

* updating api_fuzzer and corpus entries

* updating api_fuzzer to fix crash due to two successive receive_op batches

* adding some reverted fixes to api_fuzzer.cc

* updating api_fuzzer and corpus as per initial comments

* fix some typos

* fix memory leaks and timeout issues

* adding some comments and removing debug strings

* updating api_fuzzer to remove previous edits to always add recv initial metadata ops for client calls

* updating passthru endpoint to account for erroneous initialization when channel effects are not simulated

* tidying up code

* Migrate Infrastructure Scripts to Python 3 (grpc#27135)

* Run 2to3 on tools directory

* Delete github_stats_tracking

* Re-run 2to3

* Remove unused script

* Remove unused script

* Remove unused line count utility

* Yapf. Isort

* Remove accidentally included file

* Migrate tools/distrib directory to python 3

* Remove unnecessary shebang

* Restore line_count directory

* Immediately convert subprocess.check_output output to string

* Take care of Python 2 shebangs

* Invoke scripts using a Python 3 interpreter

* Yapf. Isort

* Try installing Python 3 first

* See if we have any Python 3 versions installed

* Add Python 3.7 to Windows path

* Try adding a symlink

* Try to symlink differently

* Install six for Python 3

* Run run_interop_tests with python 3

* Try installing six in python3.7 explicitly

* Revert "Try installing six in python3.7 explicitly"

This reverts commit 2cf60d7.

* And debug some more

* Fix issue with jobset.py

* Add debug for CI failure

* Revert microbenchmark changes

* Fix RBE upload (grpc#27969)

* Fix relative imports for Python 3 (grpc#27971)

* Fix relative imports for Python 3

* Remove space

* isort

* Revert "Api fuzzer extensions to support simulating traffic congestion (grpc#27820)" (grpc#27973)

This reverts commit c1089d2.

* Repo manager to Donna (grpc#27967)

* Remove `from . import` from benchmarking scripts. (grpc#27977)

* Remove non-loadbearing argument (grpc#27968)

* Reintroduce the EventEngine default factory (grpc#27920)

* Reintroduce the EventEngine default factory

An application can provide an EventEngine factory function that allows
gRPC internals to create EventEngines as needed. This factory would be
used when no EventEngine is provided for some given channel or server,
and where an EventEngine otherwise could not be provided by the
application. Note that there currently is no API to provide an
EventEngine per channel or per server.

I've also deleted some previous iterations on global EventEngine and
EventEngine factory ideas. This new code lives in a public API, and
coexists with iomgr instead of being isolated to an EventEngine-specific
iomgr implementation.

* add proper namespaces, and fix description

* put factory functions in their own file (for replaceability)

* add synchronization

* generate_projects.sh

* extract event_engine_base and event_engine_factory targets

Also separate iomgr/event_engine files in the BUILD, with comments

* gpr_platform

* move all EE factory declarations to event_engine_base

Makes internal hackery easier.

* add missing deps

* reorder dep alphabetically

* comment style change

* Fix rbe_upload SSL issue (grpc#27982)

* Attempt to fix rbe_upload SSL issue

* Fix comparison

* Rename the source files for ChannelArgsEndpointConfig (grpc#27972)

* Rename the source files for ChannelArgsEndpointConfig

Previous naming was non-descript

* generate_projects

* add missing file to BUILD, and generate_projects.sh

* correct include guards

* xds/interop: Delay to drain queued RPCs in authz test (grpc#27991)

The authz test flaked as no RPCs of the expected type had completed
within the sampling window. Server logs showed authz logs completing
batch of 276 RPCs back-to-back, without the expected 40 ms separation
(qps=25). It took a bit over 1 second to process through the backlog.
With the sample duration of 500 ms and there being a polling delay
between when the channel is READY and when the test driver polls
channelz, it makes sense that we can get lucky much of the time.

Obviously, adding a sleep isn't great either, but measuring the queue
length indirectly is more complex than really appropriate here. The real
solution is to stop using this continuous-qps test client.

```
Traceback (most recent call last):
  File "/tmp/work/grpc/tools/run_tests/xds_k8s_test_driver/tests/authz_test.py", line 252, in test_tls_allow
    grpc.StatusCode.OK)
  File "/tmp/work/grpc/tools/run_tests/xds_k8s_test_driver/tests/authz_test.py", line 183, in configure_and_assert
    method=rpc_type)
  File "/tmp/work/grpc/tools/run_tests/xds_k8s_test_driver/framework/xds_k8s_testcase.py", line 284, in assertRpcStatusCodes
    self.assertGreater(stats.result[status_code.value[0]], 0)
AssertionError: 0 not greater than 0
```

* Support Custom Post-handshake Verification in TlsCredentials (grpc#25631)

* custom verification refactoring - post-handshake verification

* Fix the packaging.version issue on newer Python (grpc#27999)

* Add PSM security to the list of xDS features in the grpc_xds_features.md file (grpc#28001)

* Fix python 3 encoding issues in release notes script (grpc#28002)

* Correct the wait time for url-map propagation (grpc#28004)

* Remove trickle benchmarks (grpc#28000)

* Remove trickle benchmarks

* Automated change: Fix sanity tests

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* [BinderTransport] Lower some log level from ERROR to INFO (grpc#27988)

We also plan to introduce a new tracer for tracing binder wireformat
logic.

* New resource quota integration (grpc#27643)

* new resource quota integration

* Automated change: Fix sanity tests

* fix

* fix

* fixes

* fixes

* fixes

* Automated change: Fix sanity tests

* fixes

* fixes

* Automated change: Fix sanity tests

* fixes

* fix

* fixes

* windows-fix

* fixes

* fixes

* fix

* fix-asan

* banned

* banned

* fixes

* clang-tidy-fix

* Automated change: Fix sanity tests

* fix-cronet

* review feedback

* review feedback

* Automated change: Fix sanity tests

* fixes

* bug fix

* fixes

* compile fix

* exclude megabyte size payloads from 1byte tests

* windows fix

* start moving ios

* keep moving windows

* Get windows compilation working.

* Automated change: Fix sanity tests

* better

* fixes

* remove slice buffer from memory_allocator.h

* Revert "remove slice buffer from memory_allocator.h"

This reverts commit 234a63b.

* ugh

* #fixtests

* pthread tls fixes

* Automated change: Fix sanity tests

* fixfixfix

* xxx

* add reset

* review feedback

* fix

* fix

* fixes

* fix

* mac progress

* cpp-impl-of

* rename ptr

* Automated change: Fix sanity tests

* memory-owner-is-a-memory-allocator

* fixes

* fix

* fix from prod

* fix

* Fix issue leading to bad pointers being returned on Windows.

* Automated change: Fix sanity tests

* fix multislice bug

* argh

* hyrums law fixes

* hyrums law fixes

* clang-format

* hyrums law fixes

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* Start port server separately (grpc#28005)

* Label microbenchmark differences similarly to bloat (grpc#27998)

* benchmark differences as a label

* debug

* Automated change: Fix sanity tests

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* fix AWS arm64 C++ build (grpc#27981)

* [Aio] Validate the input type for set_trailing_metadata and abort (grpc#27958)

* [Aio] Validate the input type for set_trailing_metadata and abort

* Correct the checking of sequence type

* Add feature examples with continuous integration (grpc#27917)

* Add failing end2end test for inconsistent percent-decoding of URIs

* Add passing h2_local_abstract_uds end2end tests

* Add basic abstract UDS example

* add test runner

* add proper bazel build path

* first attempt at CI configuration

* cleanup

* rename CI files for better readability

* Revert "New resource quota integration (grpc#27643)" (grpc#28014)

This reverts commit 39f0877.

* Allow API fuzzer to send multiple slices (grpc#27993)

* Allow API fuzzer to send multiple slices

* fixes

* Use strict buildifier in sanitize (grpc#28018)

Ensure that we use tools consistently everywhere!

* Fix typo (grpc#28019)

* Set BinderTransport connectivity state tracker initial state to READY (grpc#27979)

By default the state is set to IDLE, which is not desired because

1. The actual connectivity state when transport is created is READY
2. The binder tansport channel won't be able to recover from IDLE state
for some reason and cause the channel to stuck when used on multiple
thread. We have an internal bug tracking this issue.

* [BinderTransport] Add more info to class not found error msg (grpc#28009)

* Add note about starting port server out of band (grpc#28012)

* Increase the timeout of xds_k8s test to 180 (grpc#28027)

* Bloat reporting improvements (grpc#28013)

* Bloat reporting improvements

* typo

* fix

* fix

* fix

* XdsClient: fix resource timeout behavior (grpc#27860)

* XdsClient: fix resource timeout behavior

* fix clang-tidy

* more clang-tidy fixes

* yet more clang-tidy

* Fix name of feature example tests CI config file (grpc#28028)

* AVL implementation in C++ (grpc#27782)

* avl

* move-code,add-fuzzer

* done

* fix

* Automated change: Fix sanity tests

* buildifier

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* Revert "XdsClient: fix resource timeout behavior (grpc#27860)" (grpc#28032)

This reverts commit 7fdb40d.

* Use WorkSerializer in XdsClient to propagate updates in a synchronized manner (grpc#27975)

* Use WorkSerializer in XdsClient to propagate updates

* Fix breakage

* Add missing Drain

* More fixes

* Get around msvc issue

* Fix asan leak

* Reviewer comments

* Get around TSAN annotations

* Remove notes

* Clang-format

* Reviewer comments

* [BinderTransport] Send correct version to server and verify it (grpc#27990)

We support wireformat version 1.

* [BinderTransport] Print error message when API level is too low (grpc#27989)

Also removes an obsolete API

* Reland resource quota work (grpc#28017)

* Check if memory owner available prior to polling it

The transport may drop the memory owner during its destruction sequence

* tcp_fix

* Revert "Revert "New resource quota integration (grpc#27643)" (grpc#28014)"

This reverts commit 0ea2c37.

* clang-format

* fix-path

* fix

* Remove extra ';' after member function definition (grpc#28038)

Some user of gRPC library have [-Werror,-Wextra-semi] set and this extra
';' makes the code uncompilable

* fix missing header (grpc#28087)

* Revert "[BinderTransport] Send correct version to server and verify it (grpc#27990)" (grpc#28090)

This reverts commit 92c34b8.

* Exclude debug sections from bloat severity calculations (grpc#28089)

* Ensure JSON parser can consume dumped JSON (grpc#27994)

* Ensure JSON parser can consume dumped JSON

* fixes

* fixes

* Update fuzzer.cc

* internal_ci: rename grpc_xds_k8s to psm-security as part of tech-debt cleanup and name clarity (grpc#28034)

* Upgrade upb to 0e0de7d9 (grpc#27984)

* Remove upb first

* Squashed 'third_party/upb/' content from commit 0e0de7d9f9

git-subtree-dir: third_party/upb
git-subtree-split: 0e0de7d9f927aa888d9a0baeaf6576bbbb23bf0b

* Update bazel deps

* Regen upb files

* Fix build

* Second attempt: XdsClient: fix resource timeout behavior (grpc#28088)

* Revert "Revert "XdsClient: fix resource timeout behavior (grpc#27860)" (grpc#28032)"

This reverts commit 817eed0.

* use the right status code enum

* Tooling to remove redundant grpc_core:: namespace references (grpc#28030)

* Tooling to remove redundant grpc_core:: namespaces

These references tend to show up in our C++ code after C modules get
converted. Many get caught in review, many get missed.

* use it

* clang-format

* add gcr image for java release v1.42.1 (grpc#28094)

* add gcr image for java release v1.42.1

* replace java release v1.42.0 with v1.42.1

* Fix typo in bloat script (grpc#28104)

* setup-ios-bazel-test-to-run-c-core-ee-ut (grpc#28029)

* user-agent metadata trait, also: grpc_core::Slice is born (grpc#27770)

* new slice api

* storage-classes

* Automated change: Fix sanity tests

* tweaks

* refinement

* refinement

* compiles

* Automated change: Fix sanity tests

* better impl

* convert to gtest

* clean

* fixes

* tests

* Automated change: Fix sanity tests

* more-tests

* clarity

* comments

* comments

* fixes

* comment-updates

* fixes

* spam

* Automated change: Fix sanity tests

* move transport size into transport!

* mebbefix

* fix type

* x

* fix-merge

* review feedback

* review feedback

* Automated change: Fix sanity tests

* meh

* review feedback

* fix

* resolve compile issue

* Remove slice_refcount dependency on RefCounted

* update init for channel_data in http_client_filter

* Automated change: Fix sanity tests

* remove banned function

* fixes

* will it blend

* will it blend

* fix refcount init

* fix

* fix comment

* Fix typo in bloat script

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* Revert "user-agent metadata trait, also: grpc_core::Slice is born (grpc#27770)" (grpc#28108)

This reverts commit 7a40f50.

* Faster clang-format (grpc#28086)

* use parallelism to speed clang-format performance

* use parallelism to speed clang-format performance

* xDS: strip "type.googleapis.com/" prefix off of resource type constants (grpc#28024)

* Increase iOS test timeout from 1.5h to 2.0h (grpc#28110)

* Sync Protos with grpc-proto repo (grpc#27957)

* Sync with grpc_proto

* UPB gen

* Update csharp SDK to LTS versions (grpc#27966)

* update C# SDK

* regenerate dockerfiles

* install .NET6 and .NET Core 3.1

* regenerate dockerfiles

* change netcoreapp2.1 targets to netcoreapp3.1

* update installed SDKs in aarch64 C# job

* update run_tests scripts to use netcoreapp3.1 for C#

* use CppImplOf for grpc_server (grpc#28112)

* use CppImplOf for grpc_server

* fix build

* fix sanity

* enable clang-tidy readability-static-definition-in-anonymous-namespace check (grpc#28033)

* Passing repo manager to markdroth (grpc#28114)

* Passing repo manager to markdroth

* one more file

* Add Java v1.40.2 and v1.41.1 to the interop test client matrix. (grpc#27953)

* Adding prefix to authority map key (grpc#28106)

* Adding prefex to authority map key

* fixing according to code review suggestions

* fixing

* Guard against duplicate ops in the same batch (grpc#28118)

* Guard against duplicate ops in the same batch

* add test

* add fuzzer example

* better name

* Fix api_fuzzer found bug (grpc#28113)

* Don't limit bloaty output lines (grpc#28120)

* dont limit bloaty output lines

* Automated change: Fix sanity tests

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* Reland user-agent metadata work (grpc#28109)

* Revert "Revert "user-agent metadata trait, also: grpc_core::Slice is born (grpc#27770)" (grpc#28108)"

This reverts commit 89d08da.

* will it blend

* will it blend

* will it blend

* lcnagfmt

* sanity

* will it blend

* sleep @ end

* will it blend

* Revert "sleep @ end"

This reverts commit d6bca8e.

* review feedback

* review feedback

* Remove BUILD.gn (again) (grpc#28121)

* Revert "Revert "Api fuzzer extensions to support simulating traffic congestion (grpc#27820)" (grpc#27973)" (grpc#27974)

* Revert "Revert "Api fuzzer extensions to support simulating traffic congestion (grpc#27820)" (grpc#27973)"

This reverts commit 879f97e.

* updating passthru_endpoint file to fix windows breakages

* Automated change: Fix sanity tests

Co-authored-by: Vignesh2208 <Vignesh2208@users.noreply.github.com>

* Add v1.42.0 release of grpc-go to interop matrix (grpc#27985)

* remove RDS and EDS resource deletion inside of XdsClient (grpc#28128)

* Reduce OSS benchmarks polling time to 5s. (grpc#28131)

This change reduces total run time from ~86min to ~74min.

* Add missing exec ctx to public api (grpc#28134)

* Revert "use CppImplOf for grpc_server (grpc#28112)" (grpc#28130)

This reverts commit 2ea8e50.

* Fix xds_end2end_test dyld (grpc#28133)

* Support RDS updates on the server (grpc#27851)

* Port changes from grpc#27388

* Reviewer comments

* Fix resource timeout issue

* Cleanup

* Fix clang-tidy

* Revert benchmark

* Restructure

* clang-tidy

* Automated change: Fix sanity tests

* Partial commit

* Reviewer comments

* Fixes

* Reviewer comments

* Reviewer comments

* Reviewer comments

* Reviewer comments

* clang-format

* Fix FaultInjection tests

* clang-tidy

Co-authored-by: yashykt <yashykt@users.noreply.github.com>

* To LTS 20211102.0 (grpc#27916)

* internal_ci: rename grpc_xds_k8s_python to psm-security-python as part of tech-debt cleanup and name clarity (grpc#28136)

* Roll-forward grpc#27780 (grpc#27951)

* Revert "Revert grpc#27780 (grpc#27944)"

This reverts commit 26e7560.

* Fix google_api_upbdefs

* use WorkSerializer for subchannel connectivity state notifications (grpc#28111)

* ignore dynamic linker segments in bloat severity calculations (grpc#28149)

* Revert "Revert "use CppImplOf for grpc_server (grpc#28112)" (grpc#28130)" (grpc#28144)

This reverts commit eec0ca9.

* Fix C# nuget package build. (grpc#28152)

* fix nuget build by removing deprecated packageIconUrl field

* react to dotnet pack output fix in .NET

* Arena allocated promise (grpc#28023)

* Arena allocated promise

* finish

* Automated change: Fix sanity tests

* test

* Remove unused names

* fix

* older compiler fix

* fiiixes

* windows fixes?

* clangfmt

* fix windows?

* Document more

* fix destructors

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* [App Net] Switch Router to Mesh and Add unique string to Scope (grpc#28145)

* Add unique suffix to scope

* Actually add suffix

* Switch from Router to Mesh

* Yapf

* Review

* Fix bad reference

* Break circular dependency

* Add a dash

* Introduce empty targets to ease the internal merge of grpc#25586  (grpc#28122)

* add empty targets

* fix format error

* fix build format error

* address reviewer's comments

* fix test build rules

* remove server_auth_fiter from grpc_secure

* Remove unused constants (grpc#28156)

* Api fuzzer overflow bug (grpc#28161)

* fixing overflow error in api fuzzer

* fixing some sanity checks

* fix code style

* change repo mgr to nicolasnoble (grpc#28117)

* Revert "Arena allocated promise (grpc#28023)" (grpc#28171)

This reverts commit 77b4ade.

* Revert "Introduce empty targets to ease the internal merge of grpc#25586  (grpc#28122)" (grpc#28172)

This reverts commit 171c64e.

* Revert "[App Net] Switch Router to Mesh and Add unique string to Scope (grpc#28145)" (grpc#28176)

This reverts commit 7ac79c2.

* Reland arena based promises (grpc#28174)

* Revert "Revert "Arena allocated promise (grpc#28023)" (grpc#28171)"

This reverts commit 9b07a81.

* fix

* Automated change: Fix sanity tests

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* Update cxx docker images (grpc#28162)

* Channel args preconditioning (grpc#28132)

* Channel args preconditioning

* docs

* fixes

* Automated change: Fix sanity tests

* fix

* fix this again after merge error

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* Added workaround for gcc 6.3.0 (grpc#28184)

* Remove unused override, and the static metadata that supports it (grpc#28193)

* Fix cronet tests (grpc#28189)

* add --log_metadata_and_status feature to interop_client (grpc#28021)

* Boringssl update to 4fb1589 (grpc#28194)

* update submodule boringssl-with-bazel with origin/main-with-bazel

* update boringssl dependency to main-with-bazel commit SHA

* regenerate files

* generate boringssl prefix headers

* Increment podspec version

* Use more parallelism to windows portability tests (grpc#28179)

* Use more parallelism to windows portability tests

* Added /MP

* Changed it to gRPC_BUILD_MSVC_MP_COUNT

* Move a bunch of slice typed metadata to new system (grpc#28107)

* Eliminate most of grpc_message metadata handling

* Eliminate most of host metadata handling

* Remove more callouts without fixing code

* fiiixes

* typo

* Automated change: Fix sanity tests

* try-shrink

* Automated change: Fix sanity tests

* size tweaks

* less tricks

* deunique

* commonize

* commonize

* Automated change: Fix sanity tests

* size tuning, fixes

* Automated change: Fix sanity tests

* fix

* size tuning, fixes

* remove constexpr

* fix

* reuse code

* fix

* tweak code

* more tweaks

* tell no lies

* fixes

* fixes

* Automated change: Fix sanity tests

* fix

* fix

* fix

* fix

* fix?

* fix binder

* fix

* fix

* fixes

* Automated change: Fix sanity tests

* fix

Co-authored-by: ctiller <ctiller@users.noreply.github.com>

* Revert "use WorkSerializer for subchannel connectivity state notifications (grpc#28111)" (grpc#28206)

This reverts commit cfca3e5.

* maybe fixed merge?

* fix merge

Co-authored-by: brandonpaiz <brandonpaiz@users.noreply.github.com>
Co-authored-by: Eric Anderson <ejona@google.com>
Co-authored-by: AJ Heller <hork@google.com>
Co-authored-by: Craig Tiller <ctiller@google.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Co-authored-by: Ming-Chuan <mingcl@google.com>
Co-authored-by: Richard Belleville <rbellevi@google.com>
Co-authored-by: Yash Tibrewal <yashkt@google.com>
Co-authored-by: Esun Kim <veblush@google.com>
Co-authored-by: Mark D. Roth <roth@google.com>
Co-authored-by: Sergii Tkachenko <sergiitk@google.com>
Co-authored-by: Nayef Ghattas <nayef.ghattas@gmail.com>
Co-authored-by: Vignesh Babu <vigneshbabu@google.com>
Co-authored-by: Paulo Castello da Costa <6579971+paulosjca@users.noreply.github.com>
Co-authored-by: ZhenLian <zhenlian@google.com>
Co-authored-by: Lidi Zheng <lidiz@google.com>
Co-authored-by: sanjaypujare <sanjaypujare@users.noreply.github.com>
Co-authored-by: Jan Tattermusch <jtattermusch@users.noreply.github.com>
Co-authored-by: yifeizhuang <yifeizhuang@gmail.com>
Co-authored-by: Hannah Shi <hannahshisfb@gmail.com>
Co-authored-by: donnadionne <donnadionne@google.com>
Co-authored-by: Terry Wilson <tmwilson@google.com>
Co-authored-by: Vignesh2208 <Vignesh2208@users.noreply.github.com>
Co-authored-by: Zach Reyes <39203661+zasweq@users.noreply.github.com>
Co-authored-by: yashykt <yashykt@users.noreply.github.com>
Co-authored-by: yihuaz <yihuaz@google.com>
markdroth added a commit that referenced this pull request Sep 22, 2023
The `cancel_check_peer()` method is [always called with a non-OK
status](https://github.com/grpc/grpc/blob/866fc41067eb1a5ddd232093b9e2a95d19349354/src/core/lib/security/transport/security_handshaker.cc#L560),
since it's used only in cancellation cases. However, the implementation
of this method for TLS creds was bailing out if the status was non-OK,
meaning that `cancel_check_peer()` was never actually cancelling the
verification request. This bug seems to have been introduced back in
#25631, when the method was initially implemented.

I don't think we actually have any async verifier implementations today,
so this isn't actually causing a problem. I discovered this bug as part
of #34426, which was triggering the core e2e `no_logging` test to fail.
That test is designed to ensure that we don't generate any logs while
processing individual RPCs, since that would be bad for performance and
would flood logfiles. My PR caused a connection attempt to be cancelled
during the test, which triggered the error log that I am removing in
this PR.

Note that with this PR, the TLS creds `cancel_check_peer()` methods are
not actually doing anything with the status. Ideally, they should be
passing the status through to the verifier's `Cancel()` method, but we
apparently didn't add a parameter for that, which means that although
cancellation will work now, it will not properly pass through the right
error message. At some point, we should fix this and add tests covering
cancellation of async verifier requests to prove that the error message
is propagated correctly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/security bloat/medium imported Specifies if the PR has been imported to the internal repository lang/c++ lang/core release notes: yes Indicates if PR needs to be in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants