Update NodeId n event-bus subscribers #163

javastack59 · 2022-12-17T11:25:10Z

Motivation:
Hazelcast is sometimes resetting the UUID of a member.
(https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/java/com/hazelcast/internal/cluster/impl/ClusterServiceImpl.java#L342)
This is almost happening everytime in a container environment like Kubernetes where multiple pods with several verticles are all started and deployed at the same time.
Even after the UUID is reset, the ClusterManager (Vert.x HazelcastClusterManager) is still republishing the event-bus subscribers with the stale UUID of the member.
The problem does not immediately manifest , as the IP address of the member POD still remains the same and eventhough the Vert.x NodeSelector does a lookup with stale UUID, the correct IP address is still returned.

Issue creeps up the moment, a member POD is restarted either due to a delete or a crash and a new member POD comes up with a new IP address and UUID.
Hazelcast is now broadcasting a member removed event with the latest member UUID(of the deleted POD) to other members but since the subs-map and node-map is not updated during the reset (earlier) , it is not getting evicted which is resulting in a huge problem with eventBus timing out mostly during a reply as it keeps trying to reply to a member that is already gone.

Sample logs:

Service 1 (First POD)

=======
Line 144-155

jvm 1 | Members {size:1, ver:1} [
jvm 1 | Member [10.244.237.60]:5701 - 202d7784-d695-4d46-83ba-24541b6a2943 this
jvm 1 | ]
jvm 1 |
jvm 1 | 2022-12-16 12:12:18.265+0000 [] [vert.x-worker-thread-0] INFO com.hazelcast.core.LifecycleService - [10.244.237.60]:5701 [nimbus-v3] [4.2.4] [10.244.237.60]:5701 is STARTED
jvm 1 | 2022-12-16 12:12:18.265+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - Local Member : 10.244.237.60
jvm 1 | 2022-12-16 12:12:18.265+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - Local Member Node Id: 202d7784-d695-4d46-83ba-24541b6a2943
jvm 1 | 2022-12-16 12:12:18.295+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - Join complete
jvm 1 | 2022-12-16 12:12:18.296+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - [
jvm 1 | 2022-12-16 12:12:18.296+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - ]
jvm 1 | 2022-12-16 12:12:18.296+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - []
jvm 1 | 2022-12-16 12:12:18.356+0000 [] [vert.x-eventloop-thread-0] INFO i.v.c.e.i.c.ClusteredEventBus - Starting EventBus with Cluster Address : 10.244.237.60 and nodeId :202d7784-d695-4d46-83ba-24541b6a2943

Line 374:

jvm 1 | 2022-12-16 12:17:13.071+0000 [] [hz.happy_borg.cached.thread-8] WARN c.h.internal.cluster.ClusterService - [10.244.237.60]:5701 [nimbus-v3] [4.2.4] Resetting local member UUID. Previous: 202d7784-d695-4d46-83ba-24541b6a2943, new: e0a23b31-e32c-4ad1-9acc-d43d76c0dd01

Line 1077:

jvm 1 | 2022-12-16 12:17:20.227+0000 [] [vert.x-worker-thread-2] INFO i.v.s.c.hazelcast.impl.SubsMapHelper - republishOwnSubs::
jvm 1 | 2022-12-16 12:17:20.227+0000 [] [vert.x-worker-thread-2] INFO i.v.s.c.hazelcast.impl.SubsMapHelper - republish for adress: 202d7784-d695-4d46-83ba-24541b6a2943

After this POD is deleted:

Service 2 (2nd POD)

=====================
Line 5206 - 5216:

jvm 1 | 2022-12-16 12:26:54.853+0000 [] [hz.sharp_wilson.priority-generic-operation.thread-0] DEBUG c.h.i.p.InternalPartitionService - [10.244.237.57]:5701 [nimbus-v3] [4.2.4] Removing Member [10.244.237.60]:5701 - e0a23b31-e32c-4ad1-9acc-d43d76c0dd01
jvm 1 | 2022-12-16 12:26:54.854+0000 [] [hz.sharp_wilson.priority-generic-operation.thread-0] INFO c.h.internal.cluster.ClusterService - [10.244.237.57]:5701 [nimbus-v3] [4.2.4]
jvm 1 |
jvm 1 | Members {size:3, ver:5} [
jvm 1 | Member [10.244.237.8]:5701 - b12a46d5-1572-4bd3-afff-1e9915b27662
jvm 1 | Member [10.244.237.57]:5701 - d99db539-c9be-4aaf-a813-5e303c63fc5c this
jvm 1 | Member [10.244.237.9]:5701 - 9b4a6841-ceff-494c-a358-dcca647938de
jvm 1 | ]
jvm 1 |
jvm 1 | 2022-12-16 12:26:54.854+0000 [] [hz.sharp_wilson.event-3] INFO i.v.s.c.h.HazelcastClusterManager - New HZ Members removed event: [e0a23b31-e32c-4ad1-9acc-d43d76c0dd01]
jvm 1 | 2022-12-16 12:26:54.856+0000 [] [hz.sharp_wilson.cached.thread-5] INFO c.h.i.server.tcp.TcpServerConnector - [10.244.237.57]:5701 [nimbus-v3] [4.2.4] Could not connect to: /10.244.237.60:5701. Reason: IOException[Connect timed out to address /10.244.237.60:5701]
Conformance:

Your commits should be signed and you should have signed the Eclipse Contributor Agreement as explained in https://github.com/eclipse/vert.x/blob/master/CONTRIBUTING.md
Please also make sure you adhere to the code style guidelines: https://github.com/vert-x3/wiki/wiki/Vert.x-code-style-guidelines

tsegismont · 2022-12-21T18:01:38Z

Thank you for the PR. We actually had a similar issue reported against 3.9, and we lost the fix when rebuilding the CM for Vert.x 4.

I am going to reintegrate the fix in master and close this one. Thank you

javastack59 · 2022-12-22T10:33:11Z

Thank you so much. I can confirm that this issue did not occur in 3.9

Update NodeId n event-bus subscribers

fc255fa

javastack59 mentioned this pull request Dec 18, 2022

Hazelcast Cluster Manager (Vertx side) does not update nodeId of member when Hazelcast resets it #164

Closed

tsegismont added the wontfix label Dec 21, 2022

tsegismont closed this Dec 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update NodeId n event-bus subscribers #163

Update NodeId n event-bus subscribers #163

javastack59 commented Dec 17, 2022 •

edited

tsegismont commented Dec 21, 2022

javastack59 commented Dec 22, 2022

Update NodeId n event-bus subscribers #163

Update NodeId n event-bus subscribers #163

Conversation

javastack59 commented Dec 17, 2022 • edited

Service 1 (First POD)

======= Line 144-155

Line 374:

Line 1077:

Service 2 (2nd POD)

===================== Line 5206 - 5216:

tsegismont commented Dec 21, 2022

javastack59 commented Dec 22, 2022

javastack59 commented Dec 17, 2022 •

edited

=======
Line 144-155

=====================
Line 5206 - 5216: