Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed a data loss issue on lite member promotion #17644

Merged

Conversation

petrpleshachkov
Copy link
Contributor

Notify cluster members on lite member promotion to update
memberGroupSize. Otherwise, a data member might be not aware of other
promoted data members, and it may cause backup operations
not being issued to other data members.

Closes #17621

Notify cluster members on lite member promotion to update
memberGroupSize. Otherwise, a data member might be not aware of other
promoted data members, and it may cause backup operations
not being issued to other data members.

Closes hazelcast#17621
@@ -367,6 +367,7 @@ private MemberImpl createNewMemberImplIfChanged(MemberInfo newMemberInfo, Member
member = clusterService.promoteAndGetLocalMember();
} else {
member = createMember(newMemberInfo, member.getAttributes());
node.partitionService.memberAdded(member);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be out of else block? What if this is on local member?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the local member the node.partitionService.memberAdded(member); is doing update only when the node is not local. For local case, it issues a special task only when the node is a master. But the master is notified about promotion without this change. This is why I decided that it is enough to put it in the else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the problem is when the member is not aware of other members being promoted. That case is covered in the else branch

Copy link
Contributor

@mdogan mdogan Sep 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think issue is a non-master member is not aware of any member being promoted, including itself. So maybe we should modify node.partitionService.memberAdded(member) to update `memberGroupSize even with local member.

See this modified test:

    @org.junit.Test
    public void lite_member_promotion_data_loss() throws InterruptedException {
        int entryCount = 1000;

        TestHazelcastInstanceFactory factory = new TestHazelcastInstanceFactory();
        Config config = new Config().setLiteMember(true);

        // start first hazelcast instance as a lite member
        HazelcastInstance firstHazelcastInstance = factory.newHazelcastInstance(config);

        // start second hazelcast instance as a lite member
        HazelcastInstance secondHazelcastInstance = factory.newHazelcastInstance(config);

        // promote all instances to data members
        firstHazelcastInstance.getCluster().promoteLocalLiteMember();
        secondHazelcastInstance.getCluster().promoteLocalLiteMember();

        // check if cluster is in a good shape
        assertTrueEventually(() -> assertTrue(firstHazelcastInstance.getPartitionService().isClusterSafe()));

        // insert some dummy data into the testing map
        String mapName = randomMapName();
        IMap<String, String> testMap = firstHazelcastInstance.getMap(mapName);
        for (int i = 0; i < entryCount; ++i) {
            testMap.put("key" + i, "value" + i);
        }

        // check all data is correctly inserted
        assertEquals(entryCount, testMap.size());

        // kill second instance
        secondHazelcastInstance.getLifecycleService().terminate();

        // backup count for the map is set to 1
        // even with 1 node down, no data loss is expected
        assertTrueEventually(() -> assertEquals(entryCount, firstHazelcastInstance.getMap(mapName).size()));
    }

}

@Test
public void lite_member_promotion_data_loss2() throws InterruptedException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

@@ -323,6 +326,86 @@ public void promotion_shouldFail_whenMasterIsSuspected_duringPromotion() throws
}
}

@Test
public void lite_member_promotion_data_loss() throws InterruptedException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we give a better name to test? Seeing the state tested against and expected behavior in test method name would be helpful for the future readers.

@@ -342,6 +346,10 @@ void updateMembers(MembersView membersView) {

setMembers(MemberMap.createNew(membersView.getVersion(), members));

if (updatedLiteMember) {
updateMembersGroupSize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is hard to to follow similar named methods maybe we can make updateMemberGroupsSize() method public and can call it here: node.partitionService.getPartitionStateManager().updateMemberGroupsSize()

for (MemberInfo memberInfo : membersView.getMembers()) {
Address address = memberInfo.getAddress();
MemberImpl member = currentMemberMap.getMember(address);

if (member != null && member.getUuid().equals(memberInfo.getUuid())) {
if (member.isLiteMember()) {
updatedLiteMember = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add a short comment to the code to explain the meaning of updatedLiteMember?

@mdogan
Copy link
Contributor

mdogan commented Oct 8, 2020

@petrpleshachkov: This fix should be backported 4.0.z and 3.12.z branches.

@petrpleshachkov
Copy link
Contributor Author

@ahmetmircik , I've addressed the comments.

@petrpleshachkov
Copy link
Contributor Author

Guys, thanks for the review, I'm merging the PR.

@petrpleshachkov petrpleshachkov merged commit 82bd5e7 into hazelcast:master Oct 9, 2020
@mmedenjak
Copy link
Contributor

@petrpleshachkov have you backported the PR?

petrpleshachkov added a commit to petrpleshachkov/hazelcast that referenced this pull request Oct 23, 2020
Notify cluster members on lite member promotion to update
memberGroupSize. Otherwise, a data member might be not aware of other
promoted data members, and it may cause backup operations
not being issued to other data members.

Closes hazelcast#17621
petrpleshachkov added a commit to petrpleshachkov/hazelcast that referenced this pull request Oct 23, 2020
Notify cluster members on lite member promotion to update
memberGroupSize. Otherwise, a data member might be not aware of other
promoted data members, and it may cause backup operations
not being issued to other data members.

Closes hazelcast#17621
mmedenjak pushed a commit that referenced this pull request Dec 1, 2020
Fixed a data loss issue on lite member promotion (#17644)

Notify cluster members on lite member promotion to update
memberGroupSize. Otherwise, a data member might be not aware of other
promoted data members, and it may cause backup operations
not being issued to other data members.

Closes #17621
petrpleshachkov added a commit that referenced this pull request Dec 1, 2020
…17758)

Notify cluster members on lite member promotion to update
memberGroupSize. Otherwise, a data member might be not aware of other
promoted data members, and it may cause backup operations
not being issued to other data members.

Closes #17621
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Losing data after promoting from lite member to data member
4 participants