Clustered eventbus is corrupted when a node rejoins after network interruption #109
Comments
Any update on this ticket?
@tsegismont can you assign yourself this issue? (I'm unable to from my browser.)
@arushi315 I will look into it and hopefully find a solution for 3.6.3
Thank you @tsegismont and @vietj. I am investigating it further based on your changes.
Hi @tsegismont, just checking if you were able to see the issue using the reproducer or got a chance to analyze it.
I plan to look into it next week.
@arushi315 I tried on my LAN and could not reproduce. I have two machines connected to the same wifi hotspot. As instructed in the README, I disabled the network on one machine (disabled wifi) and then switched it on again. I changed this in the code: `Config config = clusterManager.loadConfig()` instead of creating an empty `Config`.
The problem with an empty config is that it does not create the objects Vert.x expects (see https://vertx.io/docs/vertx-hazelcast/java/#_using_an_existing_hazelcast_cluster).
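For reference, a minimal sketch of the difference being discussed here, assuming the vertx-hazelcast 3.x API already used in this thread:

```java
import com.hazelcast.config.Config;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class ClusterStarter {
  public static void main(String[] args) {
    HazelcastClusterManager clusterManager = new HazelcastClusterManager();
    // Loads the bundled cluster XML, which declares the __vertx.* structures.
    Config config = clusterManager.loadConfig();
    // Config config = new Config(); // empty config: the __vertx.* objects are missing
    clusterManager.setConfig(config);

    VertxOptions options = new VertxOptions().setClusterManager(clusterManager);
    Vertx.clusteredVertx(options, res -> {
      if (res.succeeded()) {
        System.out.println("Clustered Vert.x started");
      } else {
        res.cause().printStackTrace();
      }
    });
  }
}
```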
Hi @tsegismont, I had some trouble reproducing it as well, but after a couple of tries I was able to reproduce it: suspend a node and power it back on.
@arushi315 I tried multiple times to reproduce without success. It is expected that the Vert.x node ID and Hazelcast node ID are different after merging partitions. This is what the fix for #90 addresses.
Hi @tsegismont, sorry it has been a while. Published an event from node1; it was consumed on node1 (itself), node2, and node3. Then I suspended node1 and node3; node2 became the leader. Released node1, and node1 is back as the leader node. Published an event from node1; it was consumed on node1 (itself) but not on node2. Published an event from node2; it was consumed on node1 but not on node2. Node3 was still suspended. Reproducer uses Vert.x version 3.5.4.
Hi @tsegismont, I'm not sure if it results in different behavior internally with respect to node IDs though.
Hi @tsegismont, I tested this scenario with Vert.x 3.7.1 and 3.6.3 as well, and I am seeing the following error and thread block warning. Published an event from node1; it was consumed on node1 (itself), node2, and node3.

> Error on node1 (only with 3.7.1):
> 2019-07-15 13:51:12.326 ERROR (hz._hzInstance_1_7f3a343e-11ac-49d3-9074-1e428a5dfaad-master-1.event-2) [i.v.s.c.h.HazelcastClusterManager] - Failed to handle memberRemoved

> Thread block warning after releasing node1 (with 3.7.1 and 3.6.3):
> 2019-07-15 14:39:26.269 WARN (vertx-blocked-thread-checker) [i.v.c.i.BlockedThreadChecker] - Thread Thread[vert.x-worker-thread-16,5,main] has been blocked for 2607559 ms, time limit is 60000 ms

cc: @ramtech123
Hi @tsegismont @vietj, any update or guidance on this?
@vbekiaris I'm out of ideas on this one. Do you know what issues could happen when one suspends/resumes a VM?
@tsegismont VM suspension is expected to cause cluster members to be suspected as failed, as they miss heartbeat timeouts. On resume, these members should join back to the cluster. On the Hazelcast side, VM suspension should be handled the same as network partitions or member crashes.

@arushi315 I noticed in the scenario above that, out of a 3-node cluster, two members are suspended. Are they suspended at the same time? What is your Hazelcast configuration like? To survive a 2-node crash, Hazelcast data structures like IMap need to be configured with at least two backups.
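For illustration, a sketch of the kind of backup configuration being referred to, assuming the Hazelcast 3.x programmatic API (the map name here is just an example):

```java
import com.hazelcast.config.Config;

public class BackupConfigExample {
  public static Config twoBackups() {
    Config config = new Config();
    // Two synchronous backups: every entry is replicated to two other members,
    // so the map loses no data even if two members fail at the same time.
    config.getMapConfig("default").setBackupCount(2);
    return config;
  }
}
```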
Hi @vbekiaris, this is our Hazelcast configuration:
With the listener you can recover business data, correct? In this case it doesn't include Vert.x clustering data. Anyway, Vert.x members republish subscription data when a node joins/leaves.
Hi @tsegismont, @vbekiaris, we added some additional logs in HazelcastClusterManager to understand the exact sequence of events. Below are our findings.

Step#1
Step#2
Step#3
Step#4 Cluster status as per Node1
Cluster status as per Node2
Step#5 In the meantime, Node3 was removed from Node1's cluster due to lack of heartbeats, but Node2 had already decided to merge itself into the other cluster. Cluster status as per Node1
Cluster status as per Node2
Step#6
The last entry above with the new UUID is from Hazelcast. After Hazelcast resets Node2 with a new UUID, we can observe the custom log entries below, which we added to the Vert.x HazelcastClusterManager for troubleshooting purposes.
As we can see from the last set of custom log entries in HazelcastClusterManager, Node2's UUID changed after the merge. We have yet to verify any further with more changes to handle this scenario, especially with respect to reconciliation of EventBus subscriptions, but feel free to guide us if you have any inputs on where we can make changes and test them out. Or, if you would like to share any custom builds for testing, we are open to that as well. We will share our logs with you via email shortly. Thanks. Cc: @arushi315
@ramtech123 thanks for the very helpful analysis!
There are two parts of missing logic in the lifecycle listener implementation:
The following patch against the lifecycle listener implementation addresses both parts:
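The actual patch is not preserved here; below is only a hypothetical sketch of the shape such a lifecycle listener fix could take, assuming the Hazelcast 3.x `LifecycleListener` API:

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent;
import com.hazelcast.core.LifecycleListener;

// Hypothetical: reacts to split-brain merges, during which the local member
// is reset and receives a new UUID.
public class MergeAwareListener implements LifecycleListener {

  private final HazelcastInstance hazelcast;

  public MergeAwareListener(HazelcastInstance hazelcast) {
    this.hazelcast = hazelcast;
  }

  @Override
  public void stateChanged(LifecycleEvent event) {
    if (event.getState() == LifecycleEvent.LifecycleState.MERGED) {
      // The member UUID has changed; refresh any cached node ID and
      // re-register eventbus subscriptions against the merged cluster.
      String newUuid = hazelcast.getCluster().getLocalMember().getUuid();
      System.out.println("Merged into cluster, new UUID: " + newUuid);
    }
  }
}
```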
@tsegismont are these changes in line with the Vert.x side of things?
@vbekiaris I will take a look, thanks for the analysis! @ramtech123 can you try your scenario with the patch above? |
Thanks @vbekiaris for the patch. |
Hi @vbekiaris, @tsegismont, we still have the issue with EventBus subscriptions, so Node2 is unable to consume any messages via the EventBus. Let us know if there is anything else you want us to try. Thanks.
@ramtech123 can you please try again with this branch: https://github.com/tsegismont/vertx-hazelcast/tree/issue/109
@vbekiaris actually we keep the original member UUID in an attribute so that, even when the node ID changes, we keep using the same ID internally.
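A sketch of that approach, assuming Hazelcast 3.x member attributes (the attribute key is illustrative, not necessarily the one Vert.x uses internally):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import java.util.UUID;

public class StableNodeId {
  public static void main(String[] args) {
    String nodeId = UUID.randomUUID().toString();

    Config config = new Config();
    // Store the original ID as a member attribute; it stays readable even
    // after a merge gives the local member a new Hazelcast UUID.
    config.getMemberAttributeConfig().setStringAttribute("node-id", nodeId);

    HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    String stableId = hz.getCluster().getLocalMember().getStringAttribute("node-id");
    System.out.println("Stable node id: " + stableId);
  }
}
```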
Hi @tsegismont , |
@ramtech123 @arushi315 I took another look at your input in #109 (comment). It seems that, in the end, the HZ cluster state is not healthy.

As documented in http://vertx.io/docs/vertx-hazelcast/java/#_cluster_administration, when you use the HZ cluster manager, you are actually turning the Vert.x nodes into members of a Hazelcast cluster. If you need to suspend/resume Vert.x nodes (perhaps for VM migration?), I would recommend creating a set of stable data-only nodes and marking the Vert.x nodes as lite members. This is documented in http://vertx.io/docs/vertx-hazelcast/java/#_using_lite_members

This morning I spent some time on the reproducer again: https://github.com/arushi315/vertx_cluster_test/tree/vertx_cluster_3_6_2 The first time I looked into this issue, I used different machines at home and unplugged the cables. This time I used a Docker private network to simulate the interruption.

So I will close this issue. If you can provide a reproducer where eventbus communication does not work while the HZ cluster is in a healthy state, then reopen.
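Following that recommendation, a minimal sketch of marking a Vert.x node as a lite member (assuming Hazelcast 3.x and the vertx-hazelcast API linked above; the full config would also need the `__vertx.*` structures from the docs):

```java
import com.hazelcast.config.Config;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class LiteMemberNode {
  public static void main(String[] args) {
    Config config = new Config();
    // Lite members own no partitions: suspend/resume of this node cannot
    // lose cluster data, which lives on the stable data-only members.
    config.setLiteMember(true);

    HazelcastClusterManager mgr = new HazelcastClusterManager(config);
    Vertx.clusteredVertx(new VertxOptions().setClusterManager(mgr), res -> {
      if (res.succeeded()) {
        System.out.println("Lite member joined the cluster");
      }
    });
  }
}
```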
Hi @tsegismont
Having said that, we will go through the resources and recommendations you shared for further analysis and get back to you. Thanks.
A node fails to consume a published message on the eventbus when another node rejoins the cluster after a network interruption, even though Hazelcast shows both nodes as active members via `hazelcastInstance.getCluster().getMembers()`.
Here is a reproducer using Vert.x version 3.6.2:
https://github.com/arushi315/vertx_cluster_test/tree/vertx_cluster_3_6_2
Please refer to the readme file for instructions on how to reproduce the eventbus clustering issue.
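For context, the core eventbus pattern the reproducer exercises looks roughly like this (a sketch only; the address name and payload are illustrative, not taken from the linked repo, and `vertx` is a clustered instance started as in the docs):

```java
// Every clustered node registers a consumer on the same address.
vertx.eventBus().consumer("test.address", msg ->
  System.out.println("Received: " + msg.body()));

// One node publishes; every active member's consumer should receive it.
// After the rejoin described above, node2's consumer stops receiving.
vertx.eventBus().publish("test.address", "hello cluster");
```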
Originally this issue was observed in Vert.x version 3.5.0.
Going through other reported issues, I stumbled upon #90 and tested the same behavior with Vert.x version 3.5.4. It works fine and the cluster is maintained when a node rejoins. All nodes receive the published message on the eventbus and all members are active in Hazelcast.
Here is the same reproducer using Vert.x version 3.5.4:
https://github.com/arushi315/vertx_cluster_test/tree/vertx_cluster_3_5_4
Java version: 8