Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

c-ares DNS resolver: fix logical race between resolution timeout/cancellation and fd readability #31443

Merged
merged 4 commits into from Oct 28, 2022

Conversation

apolcyn
Copy link
Contributor

@apolcyn apolcyn commented Oct 24, 2022

This PR adds a test originally meant to hit the bug described in #31426. This test caught a more general, long-standing bug which is fixed here:

If we:

  1. Have more than one DNS query outstanding on the same file descriptor (for example, an A query and a AAAA query)
  2. Cancel the overall resolution (for example due to the overall timeout kicking in)
  3. Concurrently with 2), receive data from the DNS server for one of those queries (for example, just the A query)

... then, we can hit a bug where we run the on_readable callback for the A query and process the data as if the resolution hasn't been cancelled. Even though we've already shut down the file descriptor with the still-pending AAAA query, the c-ares library thinks we still want to wait for this AAAA query to finish. But it never will finish, because we'll stop polling I/O at this point without ever having told c-ares to cancel the query.

The fix is to explicitly check for overall cancellation in the I/O callbacks and tell c-ares to cancel everything in that case.


While we're here:

@@ -192,9 +192,9 @@ gpr_timespec TestDeadline(void) {

struct ArgsStruct {
gpr_event ev;
gpr_atm done_atm;
gpr_mu* mu;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we convert this to use grpc_core::Mutex instead, so that we can use lock annotations and allow the compiler to enforce the lock guard?

We can do the same thing in the tests that were fixed in #25398.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to do that, but we can't because this gpr_mu actually belongs to the pollset, so we'd need to migrate pollset to grpc_core::Mutex first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay. This is probably something that we'll be able to address as part of migrating this code to EventEngine.

// Shutdown and join stress thread
gpr_event_set(&done_ev, reinterpret_cast<void*>(1));
socket_stress_thread.join();
}

TEST(ResolverComponentTest, TestDoesntCrashOrHangWith1MsTimeout) {
// Queries in this test could either complete successfully or time out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there's no way to trigger exactly the sequence of events that we care about here without having a much more flexible fake DNS server, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we'd actually need some kind of mocked endpoint to do that. While it would be nice to have a test like that, it's not obvious to me how we could structure something like that, given the way that the c-ares library interacts directly with file descriptors.

While the current test doesn't take a deterministic code path, it at least won't flake in a way that yields a false positive. And it's been useful so far (since it quickly surfaced this bug while I was trying to repro the separate issue of #31426, I think it's giving some previously-missing but important coverage).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what we need is a fake c-ares implementation. We could implement an interface that wraps around c-ares and have the DNS resolver code use c-ares via that interface. We could then also implement a fake version of that interface, so that we can have it respond to the DNS resolver code however we want.

But in any case, I think this is a project for another day.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what we need is a fake c-ares implementation. We could implement an interface that wraps around c-ares and have the DNS resolver code use c-ares via that interface.

Perhaps. But IMO this code pretty tightly coupled to the behavior of the c-ares library, and I think most of the meat of this code is about properly handling the behavior of the c-ares library. So I'd be weary of mocking at the c-ares API layer for the risk that we'd just end up testing our mock.

While I'm not really sure how to do it at the moment, I'd still be a lot more excited about something that mocked out only the "untrusted" parts of the code, e.g. timing and the bytes received off the wire.

@@ -192,9 +192,9 @@ gpr_timespec TestDeadline(void) {

struct ArgsStruct {
gpr_event ev;
gpr_atm done_atm;
gpr_mu* mu;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay. This is probably something that we'll be able to address as part of migrating this code to EventEngine.

// Shutdown and join stress thread
gpr_event_set(&done_ev, reinterpret_cast<void*>(1));
socket_stress_thread.join();
}

TEST(ResolverComponentTest, TestDoesntCrashOrHangWith1MsTimeout) {
// Queries in this test could either complete successfully or time out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what we need is a fake c-ares implementation. We could implement an interface that wraps around c-ares and have the DNS resolver code use c-ares via that interface. We could then also implement a fake version of that interface, so that we can have it respond to the DNS resolver code however we want.

But in any case, I think this is a project for another day.

@apolcyn apolcyn merged commit e5f7b1b into grpc:master Oct 28, 2022
@copybara-service copybara-service bot added the imported Specifies if the PR has been imported to the internal repository label Oct 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bloat/none imported Specifies if the PR has been imported to the internal repository lang/c++ lang/core per-call-memory/neutral per-channel-memory/neutral release notes: yes Indicates if PR needs to be in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants