
Clustered eventbus is corrupted when a node rejoins after network interruption #109

Closed
arushi315 opened this issue Jan 5, 2019 · 26 comments
Comments

@arushi315

arushi315 commented Jan 5, 2019

A node fails to consume messages published on the event bus when another node rejoins the cluster after a network interruption, even though Hazelcast shows both nodes as active members (hazelcastInstance.getCluster().getMembers()).
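In rough terms, each node in the reproducer does something like the following (an illustrative sketch only; the address and payload are placeholders, the actual code is in the reproducer linked below):

HazelcastClusterManager mgr = new HazelcastClusterManager();
VertxOptions options = new VertxOptions().setClusterManager(mgr);
Vertx.clusteredVertx(options, ar -> {
  Vertx vertx = ar.result();
  // every node registers a consumer and periodically publishes on the same address
  vertx.eventBus().consumer("test.address", msg -> System.out.println("received: " + msg.body()));
  vertx.setPeriodic(5000, id -> vertx.eventBus().publish("test.address", "ping"));
  // Hazelcast keeps listing all nodes as active members even while one of them
  // no longer receives anything on the event bus:
  System.out.println(mgr.getHazelcastInstance().getCluster().getMembers());
});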

Here is a reproducer using vertx version 3.6.2,
https://github.com/arushi315/vertx_cluster_test/tree/vertx_cluster_3_6_2

Please refer to the README file for instructions to reproduce the event bus clustering issue.

Originally this issue was observed in vertx version 3.5.0.

Going through other reported issues, I stumbled upon #90 and tested the same behavior with Vert.x 3.5.4. It works fine and the cluster is maintained when a node rejoins: all nodes receive the published message on the event bus and all members are active in Hazelcast.
Here is the same reproducer using Vert.x 3.5.4:
https://github.com/arushi315/vertx_cluster_test/tree/vertx_cluster_3_5_4

Java version: 8

@arushi315
Copy link
Author

Any update on this ticket?

@vietj vietj self-assigned this Jan 9, 2019
@vietj
Contributor

vietj commented Jan 9, 2019

@tsegismont can you assign yourself this issue? (I'm unable to using my browser)

@tsegismont tsegismont assigned tsegismont and unassigned vietj Jan 9, 2019
@tsegismont
Contributor

@arushi315 I will look into it and hopefully find a solution for 3.6.3

@tsegismont tsegismont added this to the 3.6.3 milestone Jan 9, 2019
@arushi315
Author

Thank you @tsegismont and @vietj. I am investigating it further based on your changes for
#90 and will share more details if I find anything.

@arushi315
Author

Hi @tsegismont, just checking whether you were able to reproduce the issue with the reproducer, or whether you got a chance to analyze it?

@tsegismont
Contributor

tsegismont commented Jan 17, 2019 via email

@tsegismont
Contributor

@arushi315 I tried on my LAN and could not reproduce. I have two machines connected to the same wifi hotspot. As instructed in the README, I disabled the network on one machine (disabled wifi) and then switched it on again.

I changed this in the code:

Config config = clusterManager.loadConfig()

instead of:

Config config = new Config()

The problem with an empty config is that it does not create the objects Vert.x expects (see https://vertx.io/docs/vertx-hazelcast/java/#_using_an_existing_hazelcast_cluster).
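In context, the change looks roughly like this (a sketch against the reproducer; any further customisation would go before setConfig):

HazelcastClusterManager clusterManager = new HazelcastClusterManager();
// loadConfig() picks up the cluster.xml shipped with vertx-hazelcast, which declares the
// data structures Vert.x relies on, instead of starting from an empty Config
Config config = clusterManager.loadConfig();
clusterManager.setConfig(config);
VertxOptions options = new VertxOptions().setClusterManager(clusterManager);
Vertx.clusteredVertx(options, ar -> {
  // deploy verticles here once ar.succeeded()
});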

@tsegismont tsegismont removed this from the 3.6.3 milestone Jan 23, 2019
@arushi315
Author

Hi @tsegismont, I had some trouble reproducing it as well, but it did reproduce after a couple of attempts (suspend a node, then power it on).
Also, I noticed that once the suspended node rejoins the cluster, the Vert.x node ID and the Hazelcast node ID for that node are different, although I see that even when there is no issue with the event bus. Is that expected?

@tsegismont
Contributor

@arushi315 I tried multiple times to reproduce without success. It is expected that the Vert.x node ID and Hazelcast node ID are different after merging partitions. This is what the fix for #90 addresses.
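If you want to compare the two IDs yourself, something like this (just a sketch) logs them side by side:

// Sketch: compare the Vert.x node ID with the current Hazelcast member UUID
HazelcastInstance hz = clusterManager.getHazelcastInstance();
System.out.println("Vert.x node ID: " + clusterManager.getNodeID());
System.out.println("Hazelcast UUID: " + hz.getCluster().getLocalMember().getUuid());
// After a partition merge the two can legitimately differ; the fix for #90 lets Vert.x
// keep using its original ID internally.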

@arushi315
Author

arushi315 commented Jul 12, 2019

Hi @tsegismont, sorry it has been a while.
We actually ran into the same issue even with 3.5.4, where, per #90, it was supposed to be fixed.
So, to describe the issue again:
We have a cluster of 3 nodes,
node1 --->leader node
node2
node3

Published event from node1, it was consumed on node1 (itself), node2 and node3.

Then I suspended node1 and node3. Now node2 is the leader.

Released node1 and now node1 is back as the leader node.

Published event from node1, it was consumed on node1 (itself) but not on node2.

Published event from node2, it was consumed on node1 but not on node2 itself. Node3 was still suspended.

Reproducer with vertx version 3.5.4.
https://github.com/arushi315/vertx_cluster_test/tree/vertx_cluster_3_5_4

@ramtech123

Hi @tsegismont ,
I guess "suspending a node" actually means moving it to hibernate/sleep instead of just disconnecting the network (ref: Key Differences between VM suspend and shutdown and pause).

I'm not sure whether it results in different behavior internally with respect to node IDs, though.

@arushi315
Author

arushi315 commented Jul 15, 2019

Hi @tsegismont,

I tested this scenario with Vert.x 3.7.1 and 3.6.3 as well, and I am seeing the following error and thread-block warning.
Scenario:
We have a cluster of 3 nodes,
node1 --->leader node
node2
node3

Published event from node1, it was consumed on node1 (itself), node2 and node3.
Suspended node1 and node3. Now node2 is the leader.
Released node1 and node3 is still suspended.
Publishing event from node2 results in thread block warning on node1 and node2.
Active Nodes: node1 and node2
Suspended Nodes: node3

> Error on node1 (only with 3.7.1):

2019-07-15 13:51:12.326 ERROR (hz._hzInstance_1_7f3a343e-11ac-49d3-9074-1e428a5dfaad-master-1.event-2) [i.v.s.c.h.HazelcastClusterManager] - Failed to handle memberRemoved
com.hazelcast.core.OperationTimeoutException: GetOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2019-07-15 13:51:12.326. Start time: 2019-07-15 13:51:03.364. Total elapsed time: 8962 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2019-07-15 13:42:47.566. Invocation{op=com.hazelcast.map.impl.operation.GetOperation{serviceName='hz:impl:mapService', identityHash=1747809211, partitionId=143, replicaIndex=0, callId=-37422, invocationTime=1563212572905 (2019-07-15 13:42:52.905), waitTimeout=-1, callTimeout=8000, name=__vertx.haInfo}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=8000, firstInvocationTimeMs=1563213063364, firstInvocationTime='2019-07-15 13:51:03.364', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 19:00:00.000', target=[node3]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=2, /node1:5701->/node3:3906, endpoint=[node3]:5701, alive=true, type=MEMBER]}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:106)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:162)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:425)
at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:347)
at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:116)
at io.vertx.core.impl.HAManager.chooseHashedNode(HAManager.java:589)
at io.vertx.core.impl.HAManager.checkSubs(HAManager.java:518)
at io.vertx.core.impl.HAManager.nodeLeft(HAManager.java:304)
at io.vertx.core.impl.HAManager.access$100(HAManager.java:97)
at io.vertx.core.impl.HAManager$1.nodeLeft(HAManager.java:150)
at io.vertx.spi.cluster.hazelcast.HazelcastClusterManager.memberRemoved(HazelcastClusterManager.java:311)
at com.hazelcast.internal.cluster.impl.ClusterServiceImpl.dispatchEvent(ClusterServiceImpl.java:810)
at com.hazelcast.internal.cluster.impl.ClusterServiceImpl.dispatchEvent(ClusterServiceImpl.java:86)
at com.hazelcast.spi.impl.eventservice.impl.LocalEventDispatcher.run(LocalEventDispatcher.java:64)
at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:226)
at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:209)

> Thread block warning after releasing node1 (with 3.7.1 and 3.6.3)

2019-07-15 14:39:26.269 WARN (vertx-blocked-thread-checker) [i.v.c.i.BlockedThreadChecker] - Thread Thread[vert.x-worker-thread-16,5,main] has been blocked for 2607559 ms, time limit is 60000 ms
io.vertx.core.VertxException: Thread blocked
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:160)
at com.hazelcast.multimap.impl.MultiMapProxySupport.invoke(MultiMapProxySupport.java:266)
at com.hazelcast.multimap.impl.MultiMapProxySupport.putInternal(MultiMapProxySupport.java:69)
at com.hazelcast.multimap.impl.ObjectMultiMapProxy.put(ObjectMultiMapProxy.java:110)
at io.vertx.spi.cluster.hazelcast.impl.HazelcastAsyncMultiMap.lambda$add$1(HazelcastAsyncMultiMap.java:89)
at io.vertx.spi.cluster.hazelcast.impl.HazelcastAsyncMultiMap$$Lambda$337/1940055334.handle(Unknown Source)
at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$2(ContextImpl.java:272)
at io.vertx.core.impl.ContextImpl$$Lambda$255/1713833639.run(Unknown Source)
at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76)
at io.vertx.core.impl.TaskQueue$$Lambda$252/428696898.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Unknown Source)

cc: @ramtech123

@arushi315
Author

Hi @tsegismont @vietj any update or guidance on this?

@tsegismont
Contributor

@vbekiaris I'm out of ideas on this one. Do you know what issues could happen when one suspends/resumes a VM?

@vbekiaris
Contributor

@tsegismont VM suspension is expected to cause cluster members to be suspected as failed, since they miss heartbeat timeouts. On resume these members should rejoin the cluster. On the Hazelcast side, VM suspension should be handled the same way as network partitions or member crashes.

@arushi315 I noticed in the scenario above that, out of a 3-node cluster, two members are suspended. Are they suspended at the same time? What is your Hazelcast configuration like? To survive a 2-node crash, Hazelcast data structures like IMaps, MultiMaps, etc. must be configured with 2 backups. Under the default configuration of one backup, some partitions' data will be lost. For non-partitioned data structures that only support a single backup, if both replicas resided on the two nodes that were suspended, all data will be lost.
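For illustration, a minimal sketch of raising the backup count, assuming the config is loaded from the default cluster.xml (the names below are the internal Vert.x structures, e.g. __vertx.haInfo from the stack trace above):

// Sketch: two synchronous backups so partition data survives losing two members at once
Config config = clusterManager.loadConfig();
config.getMapConfig("__vertx.haInfo").setBackupCount(2);
config.getMultiMapConfig("__vertx.subs").setBackupCount(2);
clusterManager.setConfig(config);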

@arushi315
Author

Hi @vbekiaris,
Thank you for your response. I agree that if we set the backup count to 1, the partition data will be lost. For such scenarios we are implementing a PartitionLostListener and fetching all the data again, so that is not the issue we are seeing here.
What we see is that when we release Node1, Node1 and Node2 are back in the cluster, and Hazelcast shows both of them as cluster members. But when we publish an event from Node1, or even from Node2, only Node1 consumes the event and never Node2 (it never recovers).
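For reference, the listener is roughly shaped like this (a simplified sketch; the hazelcastPartitionLostListener referenced in the configuration below is our own implementation):

import com.hazelcast.partition.PartitionLostEvent;
import com.hazelcast.partition.PartitionLostListener;

public class ReloadOnPartitionLostListener implements PartitionLostListener {
  @Override
  public void partitionLost(PartitionLostEvent event) {
    // the partition lost more backups than were configured, so reload the affected business data
    System.out.println("Lost partition " + event.getPartitionId()
        + ", lost backup count " + event.getLostBackupCount() + "; reloading data");
    // ... re-fetch the data from the source of truth ...
  }
}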

This is our Hazelcast configuration,

public HazelcastClusterManager clusterManager() throws IOException {
        final HazelcastClusterManager clusterManager = new HazelcastClusterManager();
        final Config config = clusterManager.loadConfig().setNetworkConfig(new NetworkConfig());

        config.getNetworkConfig()
                .setPort(5701)
                .setPortAutoIncrement(false)
                //Multicast is disabled, we are using TCP join here. 
                .setJoin(createJoinConfig())
                .setPublicAddress(ipOrHostname)
                .setPortCount(100);

        final MapConfig mapConfig = new MapConfig()
                .setName("cache1")
                .setBackupCount(0)
                .setAsyncBackupCount(1)
                .setReadBackupData(mapReadBackupData)
                .setTimeToLiveSeconds(mapTimeToLiveSeconds)
                .setMaxIdleSeconds(mapMaxIdleSeconds)
                .setEvictionPolicy(EvictionPolicy.NONE)
                .setInMemoryFormat(mapInMemoryFormat)
                .setStatisticsEnabled(mapStatisticsEnabled)
                .setMaxSizeConfig(new MaxSizeConfig(mapMaxSizeConfig, mapMaxSizeConfigPolicy));

        config.getMapConfigs().put("cache1", mapConfig);
        
        config.getGroupConfig().setName("TestName");
        
        config.setManagementCenterConfig(
                new ManagementCenterConfig()
                        .setEnabled(false)
                        .setUrl(managementCenterUrl));
        
        config.setProperty("hazelcast.socket.connect.timeout.seconds", "0");
        config.setProperty("hazelcast.operation.call.timeout.millis", "8000");
        config.setProperty("hazelcast.io.input.thread.count", "3");
        config.setProperty("hazelcast.io.output.thread.count", "5");
        
        config.addListenerConfig(new ListenerConfig(hazelcastPartitionLostListener));
        clusterManager.setConfig(config);
        
        return clusterManager;
    }

@tsegismont
Contributor

For such scenarios, we are implementing PartitionLostListener and fetch all the data again, so that is not the issue we are seeing here.

With the listener you can recover business data, correct? In this case it doesn't include Vert.x clustering data.

Anyway, Vert.x members republish subscription data when a node joins or leaves.

@ramtech123

ramtech123 commented Aug 1, 2019

Hi @tsegismont , @vbekiaris ,

We added some additional logs in HazelcastClusterManager to understand the exact sequence of events. Below are our findings.

Step#1
Initially cluster is created with Node1, Node2 and Node3 joining the cluster one after another.

Node1 uuid - 51d43c83-08f4-41cb-8413-ce72f495a837 (master)
Node2 uuid - 5dd7ab63-37b1-4354-8880-5751304238b2
Node3 uuid - 0c29292-3701-4edb-9438-4fd4a8df914b

Step#2
Then Node1 goes into sleep, so its application memory is preserved. It still considers itself the master of a cluster with two more data members.
However, in the active cluster, only Node2 and Node3 remain with the former as master.

Node2 uuid - 5dd7ab63-37b1-4354-8880-5751304238b2 (master)
Node3 uuid - 0c29292-3701-4edb-9438-4fd4a8df914b

Step#3
Then Node3 also goes into sleep, with Node2 being the lone member in the active cluster

Node2 uuid - 5dd7ab63-37b1-4354-8880-5751304238b2 (master)

Step#4
After a delay larger than the heartbeat timeout, Node1 is powered on again. It resumes the application from its existing state (where it is the master of a cluster with two more members).
Node1 and Node2 exchange SplitBrainJoinMessage between them.
Node2 has the following log entry: [Node2]:5701 CANNOT merge to [Node1]:5701, because it thinks this-node as its member.
Node1 removes Node2 from its cluster of 3 members, with the following message: Removing [Node2]:5701, since it thinks it's already split from this cluster and looking to merge.

Cluster status as per Node1

Node1 uuid - 51d43c83-08f4-41cb-8413-ce72f495a837 (master)
Node3 uuid - 0c29292-3701-4edb-9438-4fd4a8df914b

Cluster status as per Node2

Node2 uuid - 5dd7ab63-37b1-4354-8880-5751304238b2 (master)

Step#5
With continued heartbeat exchanges between Node1 and Node2, they decide to merge the clusters. Node1 wins, as it presents itself as the master of a cluster with two data members against Node2's single member.
From the Node1 log: [Node2]:5701 should merge to us , because our data member count is bigger than theirs [2 > 1]

In the meantime, Node3 was removed from Node1's cluster due to lack of heartbeats, but Node2 had already decided to merge into the other cluster.
From the Node2 log: We are merging to [Node1]:5701

Cluster status as per Node1

Node1 uuid - 51d43c83-08f4-41cb-8413-ce72f495a837 (master)

Cluster status as per Node2

Node2 uuid - 5dd7ab63-37b1-4354-8880-5751304238b2 (master, but decided to merge with other cluster)

Step#6
Node2 initiates merging; below are the log entries from that node.

Locking cluster state. Initiator: [Node2]:5701, lease-time: 60000
Changing cluster state state to ClusterStateChange{type=class com.hazelcast.cluster.ClusterState, newState=FROZEN}
[Node2]:5701 is merging to [Node1]:5701, because: instructed by master [Node2]:5701
Setting new local member. old uuid: 5dd7ab63-37b1-4354-8880-5751304238b2 new uuid: 7324cb50-7f26-4818-84ee-d49a103f7b80

The last entry above, with the new UUID, comes from Hazelcast's Node#setNewLocalMember method. This method is called when Hazelcast "Resets the internal cluster-state of the Node to be able to make it ready to join a new cluster."

After Hazelcast resets Node2 with the new UUID, we observe the custom log entries below, which we added to Vert.x's HazelcastClusterManager for troubleshooting purposes.

HazelcastClusterManager.getNodes. Member NodeIdAttribute 51d43c83-08f4-41cb-8413-ce72f495a837, UUid 51d43c83-08f4-41cb-8413-ce72f495a837
HazelcastClusterManager.getNodes. Member NodeIdAttribute 5dd7ab63-37b1-4354-8880-5751304238b2, UUid 7324cb50-7f26-4818-84ee-d49a103f7b80
HazelcastClusterManager.getNodes final list [51d43c83-08f4-41cb-8413-ce72f495a837, 5dd7ab63-37b1-4354-8880-5751304238b2]

As we can see from the last set of custom log entries in HazelcastClusterManager.getNodes, the new UUID (from Hazelcast) of Node2 is completely ignored because an existing NodeId attribute was already set. This results in a mismatch between the unique node IDs used by Vert.x and Hazelcast. In addition, we see inconsistencies in the cluster starting from exactly this point in time.
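For context, the selection logic in HazelcastClusterManager.getNodes is roughly the following (our paraphrase, with the temporary logging omitted); the attribute stored at startup wins over the member's current UUID:

// paraphrase of the 3.x code, not the exact source
public List<String> getNodes() {
  List<String> list = new ArrayList<>();
  for (Member member : hazelcast.getCluster().getMembers()) {
    String attribute = member.getStringAttribute(NODE_ID_ATTRIBUTE);
    // if the attribute was set when the member first joined, it keeps being returned
    // even after a split-brain merge assigns the member a new UUID
    list.add(attribute != null ? attribute : member.getUuid());
  }
  return list;
}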

We have yet to verify further with more changes to handle this scenario, especially with respect to reconciliation of EventBus subscriptions, but feel free to guide us if you have any input on where we could make changes and test them out. If you would like to share any custom builds for testing, we are open to that as well.

We will share our logs with you via email shortly.

Thanks.

Cc: @arushi315

@vbekiaris
Contributor

@ramtech123 thanks for the very helpful analysis!

The nodeID maintenance logic in HazelcastClusterManager does not seem to accommodate Hazelcast member UUID changes. Here is a test that executes a Hazelcast node reset (as happens after the split-brain merge) and fails with current master:

@Test
  public void testNodeIdUpdated_afterSplitBrainMerge() {
    HazelcastInstance instance = Hazelcast.newHazelcastInstance(createConfig());
    HazelcastClusterManager manager = new HazelcastClusterManager(instance);

    VertxOptions options = new VertxOptions().setClusterManager(manager).setClustered(true).setClusterHost("127.0.0.1");

    AtomicReference<Vertx> vertx1 = new AtomicReference<>();

    Vertx.clusteredVertx(options, res -> {
      assertTrue(res.succeeded());
      assertEquals(instance.getCluster().getLocalMember().getStringAttribute(NODE_ID_ATTRIBUTE),
        instance.getCluster().getLocalMember().getUuid());
      assertEquals(instance.getCluster().getLocalMember().getUuid(),
        manager.getNodeID());
      vertx1.set(res.result());
    });
    assertWaitUntil(() -> vertx1.get() != null);

    vertx1.get().executeBlocking(future -> {
      // reset UUID (eg after split-brain merge)
      ((ManagedService) instance.getCluster()).reset();
      assertWaitUntil(() -> instance.getPartitionService().isClusterSafe());
      manager.stateChanged(new LifecycleEvent(LifecycleState.MERGED));
      future.complete(null);
    }, r -> {
      assertEquals(instance.getCluster().getLocalMember().getUuid(),
        manager.getNodeID());
    });

    vertx1.get().close(ar -> vertx1.set(null));
    assertWaitUntil(() -> vertx1.get() == null);
  }

There are two parts of missing logic in the lifecycle listener implementation:

  • both MERGED and MERGE_FAILED events occur after the local member has reset its node ID (currently the code in stateChanged only takes the MERGED event into account)
  • when handling an event that changes the node ID, the node ID must be updated both in HazelcastClusterManager (currently it is only set once at startup) and in the member's NODE_ID_ATTRIBUTE, which holds a copy of the UUID. TBH I don't understand why the member's UUID is also maintained in a string attribute on the Hazelcast member.

The following patch against master makes the above test pass.

Index: src/main/java/io/vertx/spi/cluster/hazelcast/HazelcastClusterManager.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- src/main/java/io/vertx/spi/cluster/hazelcast/HazelcastClusterManager.java	(revision b96f0ce978a6a25dfdf1a390244f06faa04f1881)
+++ src/main/java/io/vertx/spi/cluster/hazelcast/HazelcastClusterManager.java	(date 1564745771000)
@@ -322,7 +322,12 @@
     }
     multimaps.forEach(HazelcastAsyncMultiMap::clearCache);
     // Safeguard to make sure members list is OK after a partition merge
-    if(lifecycleEvent.getState() == LifecycleEvent.LifecycleState.MERGED) {
+    if(lifecycleEvent.getState() == LifecycleEvent.LifecycleState.MERGED
+      || lifecycleEvent.getState() == LifecycleEvent.LifecycleState.MERGE_FAILED) {
+      Member localMember = hazelcast.getCluster().getLocalMember();
+      this.nodeID = localMember.getUuid();
+      localMember.setStringAttribute(NODE_ID_ATTRIBUTE, nodeID);
+
       final List<String> currentNodes = getNodes();
       Set<String> newNodes = new HashSet<>(currentNodes);
       newNodes.removeAll(nodeIds);

@tsegismont are these changes in line with the Vert.x-side nodeID expectations?

@tsegismont
Contributor

@vbekiaris I will take a look, thanks for the analysis!

@ramtech123 can you try your scenario with the patch above?

@ramtech123

Thanks @vbekiaris for the patch.
I will verify with this patch applied and get back to you.

@ramtech123

Hi @vbekiaris , @tsegismont
We tried with the above patch; it now updates the node ID of the local member correctly, so Vert.x uses the latest UUID to identify the node. We believe the expectation of the patch was only to maintain consistency of the node ID between Vert.x and Hazelcast, and that works fine.

We still have the issue with EventBus subscriptions, so Node2 is unable to consume any messages via the EventBus.

Let us know if there is anything else you would like us to try. Thanks.

@tsegismont
Contributor

@ramtech123 can you please try again with this branch: https://github.com/tsegismont/vertx-hazelcast/tree/issue/109

tbh I don't understand why the member's UUID is also maintained in a string attribute on the Hazelcast member.

@vbekiaris actually we keep the original member UUID in an attribute so that, even when the node ID changes, we keep using the same ID internally.
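In other words, roughly this (a sketch, not the exact code):

// on join, the current UUID is copied into an attribute that later survives UUID changes
Member localMember = hazelcast.getCluster().getLocalMember();
nodeID = localMember.getUuid();
localMember.setStringAttribute(NODE_ID_ATTRIBUTE, nodeID);
// after a split-brain merge Hazelcast assigns the member a new UUID, but getNodes() keeps
// returning the value stored in NODE_ID_ATTRIBUTE, so the Vert.x node ID stays stable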

@ramtech123

Hi @tsegismont ,
We verified our cluster with the changes from the above branch. The issue persists; Node2 is cut off from any communication on the event bus.

@tsegismont
Contributor

@ramtech123 @arushi315 I took another look at your input in #109 (comment)

It seems that in the end the HZ cluster state is FROZEN. There is no chance event bus communication works if the Hazelcast cluster is not healthy.

As documented in http://vertx.io/docs/vertx-hazelcast/java/#_cluster_administration, when you use the HZ cluster manager, you are actually turning the Vert.x nodes into members of a Hazelcast cluster.

If you need to suspend/resume Vert.x nodes (perhaps for VM migration?), I would recommend creating a set of stable data-only nodes and marking the Vert.x nodes as lite members. This is documented in http://vertx.io/docs/vertx-hazelcast/java/#_using_lite_members
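A minimal sketch of the lite-member setup (assuming Hazelcast 3.x and a config loaded from cluster.xml; the data-only nodes are plain Hazelcast members with the same group config):

// on each Vert.x node
HazelcastClusterManager clusterManager = new HazelcastClusterManager();
Config config = clusterManager.loadConfig();
config.setLiteMember(true); // this member holds no partition data
clusterManager.setConfig(config);
Vertx.clusteredVertx(new VertxOptions().setClusterManager(clusterManager), ar -> {
  // deploy verticles
});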

This morning I spent some time on the reproducer again: https://github.com/arushi315/vertx_cluster_test/tree/vertx_cluster_3_6_2

The first time I looked into this issue I had used different machines at home and unplugged the cables. This time I used a Docker private network and used docker pause to get the same experience as you have. I still can't reproduce.

So I will close this issue. If you can provide a reproducer where event bus communication does not work while the HZ cluster is in a healthy state, please reopen.

@ramtech123

Hi @tsegismont
Some tips for reproducing the issue:

  1. The master node (Node1 in the case above) goes into sleep first, retaining its view of a 3-member cluster.
  2. Any other node (Node2) becomes the master of the active cluster. Then the other active member (Node3) also goes down, leaving only one node in the active cluster.
  3. The original master (Node1) resumes only after the heartbeat timeout has elapsed, so the active members have already considered it dead by then. That results in a race between Node1 and Node2 to become master of the merged cluster.

Having said that, we will go through the resources and recommendations you shared for further analysis and get back to you. Thanks.
