Update NodeId n event-bus subscribers #163
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Motivation:
Hazelcast is sometimes resetting the UUID of a member.
(https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/java/com/hazelcast/internal/cluster/impl/ClusterServiceImpl.java#L342)
This is almost happening everytime in a container environment like Kubernetes where multiple pods with several verticles are all started and deployed at the same time.
Even after the UUID is reset, the ClusterManager (Vert.x HazelcastClusterManager) is still republishing the event-bus subscribers with the stale UUID of the member.
The problem does not immediately manifest , as the IP address of the member POD still remains the same and eventhough the Vert.x NodeSelector does a lookup with stale UUID, the correct IP address is still returned.
Issue creeps up the moment, a member POD is restarted either due to a delete or a crash and a new member POD comes up with a new IP address and UUID.
Hazelcast is now broadcasting a member removed event with the latest member UUID(of the deleted POD) to other members but since the subs-map and node-map is not updated during the reset (earlier) , it is not getting evicted which is resulting in a huge problem with eventBus timing out mostly during a reply as it keeps trying to reply to a member that is already gone.
Sample logs:
Service 1 (First POD)
=======
Line 144-155
jvm 1 | Members {size:1, ver:1} [
jvm 1 | Member [10.244.237.60]:5701 - 202d7784-d695-4d46-83ba-24541b6a2943 this
jvm 1 | ]
jvm 1 |
jvm 1 | 2022-12-16 12:12:18.265+0000 [] [vert.x-worker-thread-0] INFO com.hazelcast.core.LifecycleService - [10.244.237.60]:5701 [nimbus-v3] [4.2.4] [10.244.237.60]:5701 is STARTED
jvm 1 | 2022-12-16 12:12:18.265+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - Local Member : 10.244.237.60
jvm 1 | 2022-12-16 12:12:18.265+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - Local Member Node Id: 202d7784-d695-4d46-83ba-24541b6a2943
jvm 1 | 2022-12-16 12:12:18.295+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - Join complete
jvm 1 | 2022-12-16 12:12:18.296+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - [
jvm 1 | 2022-12-16 12:12:18.296+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - ]
jvm 1 | 2022-12-16 12:12:18.296+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - []
jvm 1 | 2022-12-16 12:12:18.356+0000 [] [vert.x-eventloop-thread-0] INFO i.v.c.e.i.c.ClusteredEventBus - Starting EventBus with Cluster Address : 10.244.237.60 and nodeId :202d7784-d695-4d46-83ba-24541b6a2943
Line 374:
jvm 1 | 2022-12-16 12:17:13.071+0000 [] [hz.happy_borg.cached.thread-8] WARN c.h.internal.cluster.ClusterService - [10.244.237.60]:5701 [nimbus-v3] [4.2.4] Resetting local member UUID. Previous: 202d7784-d695-4d46-83ba-24541b6a2943, new: e0a23b31-e32c-4ad1-9acc-d43d76c0dd01
Line 1077:
jvm 1 | 2022-12-16 12:17:20.227+0000 [] [vert.x-worker-thread-2] INFO i.v.s.c.hazelcast.impl.SubsMapHelper - republishOwnSubs::
jvm 1 | 2022-12-16 12:17:20.227+0000 [] [vert.x-worker-thread-2] INFO i.v.s.c.hazelcast.impl.SubsMapHelper - republish for adress: 202d7784-d695-4d46-83ba-24541b6a2943
After this POD is deleted:
Service 2 (2nd POD)
=====================
Line 5206 - 5216:
jvm 1 | 2022-12-16 12:26:54.853+0000 [] [hz.sharp_wilson.priority-generic-operation.thread-0] DEBUG c.h.i.p.InternalPartitionService - [10.244.237.57]:5701 [nimbus-v3] [4.2.4] Removing Member [10.244.237.60]:5701 - e0a23b31-e32c-4ad1-9acc-d43d76c0dd01
jvm 1 | 2022-12-16 12:26:54.854+0000 [] [hz.sharp_wilson.priority-generic-operation.thread-0] INFO c.h.internal.cluster.ClusterService - [10.244.237.57]:5701 [nimbus-v3] [4.2.4]
jvm 1 |
jvm 1 | Members {size:3, ver:5} [
jvm 1 | Member [10.244.237.8]:5701 - b12a46d5-1572-4bd3-afff-1e9915b27662
jvm 1 | Member [10.244.237.57]:5701 - d99db539-c9be-4aaf-a813-5e303c63fc5c this
jvm 1 | Member [10.244.237.9]:5701 - 9b4a6841-ceff-494c-a358-dcca647938de
jvm 1 | ]
jvm 1 |
jvm 1 | 2022-12-16 12:26:54.854+0000 [] [hz.sharp_wilson.event-3] INFO i.v.s.c.h.HazelcastClusterManager - New HZ Members removed event: [e0a23b31-e32c-4ad1-9acc-d43d76c0dd01]
jvm 1 | 2022-12-16 12:26:54.856+0000 [] [hz.sharp_wilson.cached.thread-5] INFO c.h.i.server.tcp.TcpServerConnector - [10.244.237.57]:5701 [nimbus-v3] [4.2.4] Could not connect to: /10.244.237.60:5701. Reason: IOException[Connect timed out to address /10.244.237.60:5701]
Conformance:
Your commits should be signed and you should have signed the Eclipse Contributor Agreement as explained in https://github.com/eclipse/vert.x/blob/master/CONTRIBUTING.md
Please also make sure you adhere to the code style guidelines: https://github.com/vert-x3/wiki/wiki/Vert.x-code-style-guidelines