New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
c-ares DNS resolver: fix logical race between resolution timeout/cancellation and fd readability #31443
Conversation
@@ -192,9 +192,9 @@ gpr_timespec TestDeadline(void) { | |||
|
|||
struct ArgsStruct { | |||
gpr_event ev; | |||
gpr_atm done_atm; | |||
gpr_mu* mu; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we convert this to use grpc_core::Mutex
instead, so that we can use lock annotations and allow the compiler to enforce the lock guard?
We can do the same thing in the tests that were fixed in #25398.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to do that, but we can't because this gpr_mu
actually belongs to the pollset, so we'd need to migrate pollset to grpc_core::Mutex
first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay. This is probably something that we'll be able to address as part of migrating this code to EventEngine.
// Shutdown and join stress thread | ||
gpr_event_set(&done_ev, reinterpret_cast<void*>(1)); | ||
socket_stress_thread.join(); | ||
} | ||
|
||
TEST(ResolverComponentTest, TestDoesntCrashOrHangWith1MsTimeout) { | ||
// Queries in this test could either complete successfully or time out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there's no way to trigger exactly the sequence of events that we care about here without having a much more flexible fake DNS server, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we'd actually need some kind of mocked endpoint to do that. While it would be nice to have a test like that, it's not obvious to me how we could structure something like that, given the way that the c-ares library interacts directly with file descriptors.
While the current test doesn't take a deterministic code path, it at least won't flake in a way that yields a false positive. And it's been useful so far (since it quickly surfaced this bug while I was trying to repro the separate issue of #31426, I think it's giving some previously-missing but important coverage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe what we need is a fake c-ares implementation. We could implement an interface that wraps around c-ares and have the DNS resolver code use c-ares via that interface. We could then also implement a fake version of that interface, so that we can have it respond to the DNS resolver code however we want.
But in any case, I think this is a project for another day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe what we need is a fake c-ares implementation. We could implement an interface that wraps around c-ares and have the DNS resolver code use c-ares via that interface.
Perhaps. But IMO this code pretty tightly coupled to the behavior of the c-ares library, and I think most of the meat of this code is about properly handling the behavior of the c-ares library. So I'd be weary of mocking at the c-ares API layer for the risk that we'd just end up testing our mock.
While I'm not really sure how to do it at the moment, I'd still be a lot more excited about something that mocked out only the "untrusted" parts of the code, e.g. timing and the bytes received off the wire.
@@ -192,9 +192,9 @@ gpr_timespec TestDeadline(void) { | |||
|
|||
struct ArgsStruct { | |||
gpr_event ev; | |||
gpr_atm done_atm; | |||
gpr_mu* mu; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay. This is probably something that we'll be able to address as part of migrating this code to EventEngine.
// Shutdown and join stress thread | ||
gpr_event_set(&done_ev, reinterpret_cast<void*>(1)); | ||
socket_stress_thread.join(); | ||
} | ||
|
||
TEST(ResolverComponentTest, TestDoesntCrashOrHangWith1MsTimeout) { | ||
// Queries in this test could either complete successfully or time out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe what we need is a fake c-ares implementation. We could implement an interface that wraps around c-ares and have the DNS resolver code use c-ares via that interface. We could then also implement a fake version of that interface, so that we can have it respond to the DNS resolver code however we want.
But in any case, I think this is a project for another day.
This PR adds a test originally meant to hit the bug described in #31426. This test caught a more general, long-standing bug which is fixed here:
If we:
A
query and aAAAA
query)A
query)... then, we can hit a bug where we run the
on_readable
callback for theA
query and process the data as if the resolution hasn't been cancelled. Even though we've already shut down the file descriptor with the still-pendingAAAA
query, the c-ares library thinks we still want to wait for thisAAAA
query to finish. But it never will finish, because we'll stop polling I/O at this point without ever having told c-ares to cancel the query.The fix is to explicitly check for overall cancellation in the I/O callbacks and tell c-ares to cancel everything in that case.
While we're here:
resolver_component_test.cc
that was fixed for other tests in Fix races in resolve_address and resolve_address_posix tests #25398