New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed a data loss issue on lite member promotion #17644
Fixed a data loss issue on lite member promotion #17644
Conversation
Notify cluster members on lite member promotion to update memberGroupSize. Otherwise, a data member might be not aware of other promoted data members, and it may cause backup operations not being issued to other data members. Closes hazelcast#17621
@@ -367,6 +367,7 @@ private MemberImpl createNewMemberImplIfChanged(MemberInfo newMemberInfo, Member | |||
member = clusterService.promoteAndGetLocalMember(); | |||
} else { | |||
member = createMember(newMemberInfo, member.getAttributes()); | |||
node.partitionService.memberAdded(member); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be out of else
block? What if this is on local member?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the local member the node.partitionService.memberAdded(member);
is doing update only when the node is not local. For local case, it issues a special task only when the node is a master. But the master is notified about promotion without this change. This is why I decided that it is enough to put it in the else
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the problem is when the member is not aware of other members being promoted. That case is covered in the else
branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think issue is a non-master member is not aware of any member being promoted, including itself. So maybe we should modify node.partitionService.memberAdded(member)
to update `memberGroupSize even with local member.
See this modified test:
@org.junit.Test
public void lite_member_promotion_data_loss() throws InterruptedException {
int entryCount = 1000;
TestHazelcastInstanceFactory factory = new TestHazelcastInstanceFactory();
Config config = new Config().setLiteMember(true);
// start first hazelcast instance as a lite member
HazelcastInstance firstHazelcastInstance = factory.newHazelcastInstance(config);
// start second hazelcast instance as a lite member
HazelcastInstance secondHazelcastInstance = factory.newHazelcastInstance(config);
// promote all instances to data members
firstHazelcastInstance.getCluster().promoteLocalLiteMember();
secondHazelcastInstance.getCluster().promoteLocalLiteMember();
// check if cluster is in a good shape
assertTrueEventually(() -> assertTrue(firstHazelcastInstance.getPartitionService().isClusterSafe()));
// insert some dummy data into the testing map
String mapName = randomMapName();
IMap<String, String> testMap = firstHazelcastInstance.getMap(mapName);
for (int i = 0; i < entryCount; ++i) {
testMap.put("key" + i, "value" + i);
}
// check all data is correctly inserted
assertEquals(entryCount, testMap.size());
// kill second instance
secondHazelcastInstance.getLifecycleService().terminate();
// backup count for the map is set to 1
// even with 1 node down, no data loss is expected
assertTrueEventually(() -> assertEquals(entryCount, firstHazelcastInstance.getMap(mapName).size()));
}
hazelcast/src/main/java/com/hazelcast/internal/cluster/impl/MembershipManager.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
@Test | ||
public void lite_member_promotion_data_loss2() throws InterruptedException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
@@ -323,6 +326,86 @@ public void promotion_shouldFail_whenMasterIsSuspected_duringPromotion() throws | |||
} | |||
} | |||
|
|||
@Test | |||
public void lite_member_promotion_data_loss() throws InterruptedException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we give a better name to test? Seeing the state tested against and expected behavior in test method name would be helpful for the future readers.
@@ -342,6 +346,10 @@ void updateMembers(MembersView membersView) { | |||
|
|||
setMembers(MemberMap.createNew(membersView.getVersion(), members)); | |||
|
|||
if (updatedLiteMember) { | |||
updateMembersGroupSize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is hard to to follow similar named methods maybe we can make updateMemberGroupsSize()
method public and can call it here: node.partitionService.getPartitionStateManager().updateMemberGroupsSize()
for (MemberInfo memberInfo : membersView.getMembers()) { | ||
Address address = memberInfo.getAddress(); | ||
MemberImpl member = currentMemberMap.getMember(address); | ||
|
||
if (member != null && member.getUuid().equals(memberInfo.getUuid())) { | ||
if (member.isLiteMember()) { | ||
updatedLiteMember = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please add a short comment to the code to explain the meaning of updatedLiteMember
?
@petrpleshachkov: This fix should be backported 4.0.z and 3.12.z branches. |
@ahmetmircik , I've addressed the comments. |
Guys, thanks for the review, I'm merging the PR. |
@petrpleshachkov have you backported the PR? |
Notify cluster members on lite member promotion to update memberGroupSize. Otherwise, a data member might be not aware of other promoted data members, and it may cause backup operations not being issued to other data members. Closes hazelcast#17621
Notify cluster members on lite member promotion to update memberGroupSize. Otherwise, a data member might be not aware of other promoted data members, and it may cause backup operations not being issued to other data members. Closes hazelcast#17621
Notify cluster members on lite member promotion to update
memberGroupSize. Otherwise, a data member might be not aware of other
promoted data members, and it may cause backup operations
not being issued to other data members.
Closes #17621