Refactor DynamicConfigSlowPreJoinBouncingTest [HZ-978] #21255

Merged
merged 3 commits into from May 30, 2022

ramizdundar
Contributor

@ramizdundar ramizdundar commented Apr 19, 2022

The test has 1 stable member (the master), 3 bouncing members, and 1 driver. The stable member and the driver should never bounce. The test was failing because the master kicked the driver out of the cluster before the driver could broadcast the dynamic configuration changes.

The driver was kicked out because the master wasn't able to handle the driver's heartbeat, which led the master to think that the driver was dead.

The master isn't able to handle the driver's heartbeat for one of two reasons:

  1. Master's operation threads are blocked.
  2. Master can't get the lock for ClusterHeartbeatManager.handleHeartbeat().

The master is under pressure because 4 members try to join it at the start of the test. Each member resends its join request every second for as long as the master doesn't respond. The master needs at least one second to respond, because it has to call NodeEngineImpl.getPreJoinOperations() at least once before responding. All of these calls are made while holding ClusterJoinManager.clusterServiceLock, which is the same lock ClusterHeartbeatManager uses.

If the master slows down for whatever reason, all members that want to join send more and more join requests, which in turn makes the master even more congested: each of these operations takes 1 second (because of the sleep in the test), and they can't be executed in parallel because of the lock.
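To put a rough number on the congestion, here is a small back-of-the-envelope model, not Hazelcast code. It assumes that every unanswered member enqueues one join request per second, that the master handles exactly one request per second (serialized by the lock), and that duplicate requests from an already answered member still cost a full second each. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

final class JoinBacklogSketch {

    /**
     * Simulated seconds until the master has worked through every join request.
     * Assumptions (not taken from Hazelcast): one retry per member per second,
     * one request handled per second, stale duplicates are still processed.
     */
    static int secondsToDrain(int joiningMembers) {
        Deque<Integer> queue = new ArrayDeque<>();   // pending join requests, by member id
        Set<Integer> answered = new HashSet<>();     // members that already got a response
        int seconds = 0;
        while (answered.size() < joiningMembers || !queue.isEmpty()) {
            // Every member still waiting for a response resends its request.
            for (int member = 0; member < joiningMembers; member++) {
                if (!answered.contains(member)) {
                    queue.addLast(member);
                }
            }
            // The master handles exactly one request in this second.
            answered.add(queue.pollFirst());
            seconds++;
        }
        return seconds;
    }

    public static void main(String[] args) {
        System.out.println("1 member:  " + secondsToDrain(1) + " s");
        System.out.println("4 members: " + secondsToDrain(4) + " s");
    }
}
```

Under these assumptions, 4 joining members keep the lock busy for 10 seconds instead of 4, because the retries pile up behind each other. The real retry and deduplication behavior may differ, so treat the numbers as an illustration only.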

So the master is either blocked by the join requests or can't acquire the lock for ClusterHeartbeatManager. Either way it can't process the driver's heartbeat, so it kicks the driver out.
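The lock contention can be reproduced in isolation with a minimal sketch like the one below. It is not the Hazelcast implementation: SharedLockSketch, its method bodies, and the timed tryLock in the heartbeat path are all assumptions made for illustration. The only thing taken from the description above is that join handling and heartbeat handling share one lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class SharedLockSketch {

    // Hypothetical stand-in for the one lock shared by join and heartbeat handling.
    private final ReentrantLock clusterServiceLock = new ReentrantLock();

    /** Holds the lock until 'release' is counted down, like a slow pre-join operation. */
    void handleJoinRequest(CountDownLatch lockHeld, CountDownLatch release) {
        clusterServiceLock.lock();
        try {
            lockHeld.countDown();   // signal that the lock is now taken
            release.await();        // stands in for the 1 second sleep in the test
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            clusterServiceLock.unlock();
        }
    }

    /** Returns true only if the heartbeat got the lock within the timeout (assumed behavior). */
    boolean handleHeartbeat(long timeoutMillis) {
        try {
            if (!clusterServiceLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false;       // heartbeat is effectively lost
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            return true;            // heartbeat would be recorded here
        } finally {
            clusterServiceLock.unlock();
        }
    }

    /** Returns {handled while a join holds the lock, handled after the join finished}. */
    static boolean[] run() {
        SharedLockSketch master = new SharedLockSketch();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread join = new Thread(() -> master.handleJoinRequest(lockHeld, release));
        join.start();
        try {
            lockHeld.await();                           // join request now owns the lock
            boolean during = master.handleHeartbeat(50);
            release.countDown();                        // let the join request finish
            join.join();
            boolean after = master.handleHeartbeat(50);
            return new boolean[] {during, after};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("heartbeat handled during join: " + result[0]);
        System.out.println("heartbeat handled after join:  " + result[1]);
    }
}
```

The heartbeat is dropped while the join request holds the lock and goes through once the lock is free. With a steady stream of one-second join requests, the lock is rarely free, which matches the failure described above.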

This test doesn't cover any scenario that DynamicConfigBouncingTest doesn't already cover, so this PR removes it.

Fixes #19785

Checklist:

  • Labels (Team:, Type:, Source:, Module:) and Milestone set
  • Label Add to Release Notes or Not Release Notes content set
  • Request reviewers if possible

@ramizdundar ramizdundar self-assigned this Apr 19, 2022
@ramizdundar ramizdundar added this to the 5.2 milestone Apr 19, 2022
@AyberkSorgun AyberkSorgun changed the title Refactor DynamicConfigSlowPreJoinBouncingTest Refactor DynamicConfigSlowPreJoinBouncingTest [HZ-978] May 9, 2022
@ramizdundar
Contributor Author

the-test-cant-fail-if-there-isnt-any

@ramizdundar ramizdundar marked this pull request as ready for review May 9, 2022 12:56
@ramizdundar ramizdundar merged commit 17b92e6 into hazelcast:master May 30, 2022
@ramizdundar ramizdundar deleted the replace_prejoin_test branch May 30, 2022 12:18
ramizdundar added a commit to ramizdundar/hazelcast that referenced this pull request Jun 21, 2022
* Refactor test

* Revert "Refactor test"

This reverts commit 8668208.

* Delete DynamicConfigSlowPreJoinBouncingTest

(cherry picked from commit 17b92e6)
ramizdundar added a commit that referenced this pull request Jun 23, 2022