Tests for CP Membership restart issues #24903

Open
wants to merge 7 commits into
base: master
from

lprimak
Contributor

@lprimak lprimak commented Jun 27, 2023

Demos CP member restart and rejoin issues
Relates to #24897

@hz-devops-test hz-devops-test added the Source: Community PR or issue was opened by a community user label Jun 27, 2023
@devOpsHazelcast
Collaborator

Can one of the admins verify this patch?

3 similar comments

@lprimak lprimak changed the title from "Tests for RAFT issues" to "Tests for CP Membership restart issues" Jun 27, 2023
@arodionov arodionov closed this Aug 16, 2023
@arodionov
Contributor

If a cluster loses the majority of its members, it is blocked and must be recovered manually: https://docs.hazelcast.com/hazelcast/5.3/cp-subsystem/management#handling-a-lost-majority
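
For readers unfamiliar with the term, "majority" here is the usual Raft-style quorum. The following is a minimal sketch of that arithmetic only, not Hazelcast code: the class `CpMajority` and its methods are hypothetical names introduced for illustration, and the group size of 3 is an assumption chosen because it is the smallest CP group size.

```java
// Hypothetical helper illustrating majority (quorum) arithmetic.
// Not part of the Hazelcast API.
class CpMajority {

    // Smallest number of members that forms a majority of groupSize.
    static int majority(int groupSize) {
        if (groupSize < 1) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        return groupSize / 2 + 1;
    }

    // True while enough members are reachable to form a majority.
    static boolean hasMajority(int groupSize, int reachable) {
        return reachable >= majority(groupSize);
    }

    public static void main(String[] args) {
        int groupSize = 3; // assumed group size for this example
        for (int reachable = groupSize; reachable >= 0; reachable--) {
            String state = hasMajority(groupSize, reachable)
                    ? "available"
                    : "blocked (majority lost)";
            System.out.println(reachable + " of " + groupSize + " reachable: " + state);
        }
    }
}
```

With a group of 3, losing one member still leaves a majority of 2, while losing a second member does not, which is the blocked state described above.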

@lprimak
Contributor Author

lprimak commented Aug 17, 2023

@arodionov I think it's premature to close this for the following reasons:

  • There is no event sent by Hazelcast when members shut down normally and majority is lost, thus there is no concrete way to find out when the "unsafe state" occurs.
  • There is no way (even manually) to recover the cluster to a working state unless at least 3 members exist and are functioning.
  • "Unrecoverable state" is dubious at best
  • 100% CPU usage is seen under certain circumstances
  • Requiring manual recovery is also dubious.

I would suggest reopening this PR.

@arodionov
Contributor

@lprimak thanks for your points!

Regarding,

There is no event sent by Hazelcast when members shut down normally and majority is lost, thus there is no concrete way to find out when the "unsafe state" occurs.

there are CP Group Availability Listeners: https://docs.hazelcast.com/hazelcast/5.3/cp-subsystem/management#cp-group-availability-listeners

Other points, I'll copy to #24912

@lprimak
Contributor Author

lprimak commented Aug 17, 2023

Thanks @arodionov. All of this is already described in #24897

there are CP Group Availability Listeners: https://docs.hazelcast.com/hazelcast/5.3/cp-subsystem/management#cp-group-availability-listeners

Just want to reiterate that those listeners are not called when members are shut down properly, only when they die / freeze unexpectedly. This is why the above cannot be relied upon currently.

If you run https://github.com/flowlogix/hazelcast-issues in 3 terminals (stop/restart, use Ctrl-z), in about 10 minutes you will see all those issues in action.

@devOpsHazelcast
Collaborator

PR closed by Hazelcast automation as no activity (>6 months). Please reopen with comments, if necessary. Thank you for using Hazelcast and your valuable contributions

@lprimak
Contributor Author

lprimak commented Apr 16, 2024

Please reopen. Still valid

@arodionov arodionov reopened this Apr 17, 2024
Labels
Automation: PR auto closed Source: Community PR or issue was opened by a community user

4 participants