Refactor DynamicConfigSlowPreJoinBouncingTest [HZ-978] #21255
Merged
There is 1 stable member (master), 3 bouncing members and 1 driver in the test. The stable member and the driver should never bounce in the test. The test was failing because master was kicking the driver out of the cluster before the driver could broadcast the dynamic changes.
The driver was kicked out because master wasn't able to handle the driver's heartbeat, which led master to conclude that the driver was dead.
Master isn't able to handle the driver's heartbeat because of ClusterHeartbeatManager.handleHeartbeat().

Master is under pressure because 4 members try to join it at the start of the test. Each member sends another join request for every second that master doesn't respond. Master needs at least one second to respond, because before responding it has to call NodeEngineImpl.getPreJoinOperations() at least once. All of these calls are made while holding ClusterJoinManager.clusterServiceLock, which is the same lock ClusterHeartbeatManager uses.

If master slows down for whatever reason, all members that want to join send more and more join requests, which in turn makes master even more congested: each of these operations takes 1 second (because of the sleep in the test) and they can't be executed in parallel because of the lock.
So master is either blocked by the join requests or can't acquire the lock for the ClusterHeartbeatManager. Master then kicks out the driver because it couldn't process the heartbeat.

This test doesn't cover any scenarios beyond DynamicConfigBouncingTest, hence it is removed with this PR.

Fixes #19785
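
The lock contention described above can be illustrated with a small standalone sketch. This is not Hazelcast code: the class name, the method name and the timings are hypothetical, and a plain ReentrantLock stands in for clusterServiceLock. Slow join handling (a sleep, like the one in the test) and heartbeat handling compete for the same lock, and the heartbeat gives up if it cannot get the lock in time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the failure mode; none of these names exist in Hazelcast.
class SharedLockContentionSketch {

    /**
     * Returns true if a heartbeat can take the shared lock within
     * heartbeatTimeoutMillis while pendingJoins slow join requests are queued
     * on the same lock, each holding it for preJoinMillis.
     */
    static boolean heartbeatProcessed(int pendingJoins, long preJoinMillis,
                                      long heartbeatTimeoutMillis) {
        // Stand-in for the single lock shared by join and heartbeat handling.
        ReentrantLock clusterServiceLock = new ReentrantLock();
        // Lets the heartbeat start only once a join request already holds the lock.
        CountDownLatch firstJoinHoldsLock = new CountDownLatch(pendingJoins > 0 ? 1 : 0);
        ExecutorService joinThreads = Executors.newFixedThreadPool(Math.max(1, pendingJoins));
        try {
            for (int i = 0; i < pendingJoins; i++) {
                joinThreads.submit(() -> {
                    clusterServiceLock.lock();
                    try {
                        firstJoinHoldsLock.countDown();
                        // Models the slow pre-join work done under the lock.
                        Thread.sleep(preJoinMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        clusterServiceLock.unlock();
                    }
                });
            }
            firstJoinHoldsLock.await();
            // Models heartbeat handling, which needs the same lock.
            if (!clusterServiceLock.tryLock(heartbeatTimeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // heartbeat not processed in time
            }
            clusterServiceLock.unlock();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            joinThreads.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Timings are scaled down from the 1-second sleep used in the test.
        System.out.println("idle master:      " + heartbeatProcessed(0, 200, 50));
        System.out.println("4 joining members: " + heartbeatProcessed(4, 200, 50));
    }
}
```

In this model the join requests serialize on the lock, so the heartbeat's wait grows with the number of queued joins, which is the congestion the description attributes to the repeated join requests.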
Checklist:
- Labels (Team:, Type:, Source:, Module:) and Milestone set
- Label Add to Release Notes or Not Release Notes content set