GODRIVER-2464 Add timeout for RTT monitor hello operations. #994
According to the server monitoring spec, I think we should be using |
@benjirewis good find in the SDAM spec! I initially tried to use |
Interesting design! I have some questions, but it seems like a very elegant test design. It does look like the TestUnifiedSpecs/server-discovery-and-monitoring/integration/hello-network-error.json/Network_error_on_Monitor_check spec test has become flaky due to these changes, though.
Excellent job! It looks good to me.
Looks good to me, and thank you for all the explanations 😄
Any idea if the flakiness of TestUnifiedSpecs/server-discovery-and-monitoring/integration/hello-network-error.json/Network_error_on_Monitor_check is related to these changes, though?
Added the proposed changes from mongodb/specifications#1272 here to test them.
Thank you for investigating that test flakiness! I think I understand your diagnosis and the spec changes make sense to me (apart from that comment I made). I think we can merge this for now and use the GODRIVER ticket derived from DRIVERS-2386 to revise the tests with any potential changes.
count: 1
# We cannot assert the server was marked Unknown and pool was cleared an
# exact number of times because the RTT hello may have triggered this
# failpoint one or many times as well.
Can leave this up to the spec PR, but if this is not an assertion we plan on uncommenting eventually, I'm not sure I see why we don't just remove it entirely.
GODRIVER-2464
Currently the RTT monitor uses a context without a timeout to run the "hello" operation (see here). As a result, it's possible for RTT monitor "hello" operations to get stuck indefinitely if there is a network problem, preventing the monitor from recording more RTT samples. The motivation for this ticket comes from troubleshooting GODRIVER-2438.
Add a timeout to the RTT monitor "hello" operation to prevent network issues from causing the RTT monitor to get stuck. Add a test that confirms that the RTT monitor can recover from a stuck operation.
Also includes proposed SDAM spec test fixes from mongodb/specifications#1272.