Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GODRIVER-2464 Add timeout for RTT monitor hello operations. #994

Merged

Conversation

matthewdale
Copy link
Collaborator

@matthewdale matthewdale commented Jun 21, 2022

GODRIVER-2464

Currently the RTT monitor uses a context without timeout to run the "hello" operation (see here). As a result, it's possible for RTT monitor "hello" operations to get stuck indefinitely if there is a network problem, preventing the monitor from recording more RTT samples. The motivation for this ticket comes from troubleshooting GODRIVER-2438.

Add a timeout to the RTT monitor "hello" operation to prevent network issues from causing the RTT monitor to get stuck. Add a test that confirms that the RTT monitor can recover from a stuck operation.

Also includes proposed SDAM spec test fixes from mongodb/specifications#1272.

@benjirewis
Copy link
Contributor

According to the server monitoring spec, I think we should be using connectTimeoutMS as the socket timeout for hello operations run by the RTT monitor. Do we want to do that instead of adding a 1 minute timeout via context to the hello operation?

@matthewdale matthewdale force-pushed the godriver2464-fix-rtt-monitor-hang branch from 0d23583 to 4d5149e Compare June 22, 2022 23:36
@matthewdale
Copy link
Collaborator Author

@benjirewis good find in the SDAM spec! I initially tried to use connectTimeoutMS as the RTT monitor timeout, but it's a little wonky to get access to that information when the RTT monitor is initialized, so I decided to go with a flat 1-minute timeout. However, since the SDAM spec specifically calls for that value, I'll give it another shot.

@matthewdale matthewdale force-pushed the godriver2464-fix-rtt-monitor-hang branch from 4d5149e to 936e77b Compare June 27, 2022 23:09
Copy link
Contributor

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting design! I have some questions, but seems like very elegant test design. It does look like the TestUnifiedSpecs/server-discovery-and-monitoring/integration/hello-network-error.json/Network_error_on_Monitor_check spec test has become flaky due to these changes, though.

x/mongo/driver/topology/rtt_monitor.go Show resolved Hide resolved
x/mongo/driver/topology/rtt_monitor.go Show resolved Hide resolved
x/mongo/driver/topology/rtt_monitor_test.go Show resolved Hide resolved
x/mongo/driver/topology/rtt_monitor_test.go Show resolved Hide resolved
Copy link
Collaborator

@qingyang-hu qingyang-hu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent job! It looks good to me.

Copy link
Contributor

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, and thank you for all the explanations 😄

Any idea if the flakiness of TestUnifiedSpecs/server-discovery-and-monitoring/integration/hello-network-error.json/Network_error_on_Monitor_check is related to these changes, though?

@matthewdale matthewdale force-pushed the godriver2464-fix-rtt-monitor-hang branch from ec1c890 to 8baf25e Compare July 7, 2022 05:13
@matthewdale
Copy link
Collaborator Author

Added the proposed changes from mongodb/specifications#1272 here to test them.

Copy link
Contributor

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for investigating that test flakiness! I think I understand your diagnosis and the spec changes make sense to me (apart from that comment I made). I think we can merge this for now and use the GODRIVER ticket derived from DRIVERS-2386 to revise the tests with any potential changes.

count: 1
# We cannot assert the server was marked Unknown and pool was cleared an
# exact number of times because the RTT hello may have triggered this
# failpoint one or many times as well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can leave this up to the spec PR, but if this is not an assertion we plan on uncommenting eventually, I'm not sure I see why we don't just remove it entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants