Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update NodeId n event-bus subscribers #163

Closed
wants to merge 1 commit into from

Conversation

javastack59
Copy link

@javastack59 javastack59 commented Dec 17, 2022

Motivation:
Hazelcast is sometimes resetting the UUID of a member.
(https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/java/com/hazelcast/internal/cluster/impl/ClusterServiceImpl.java#L342)
This is almost happening everytime in a container environment like Kubernetes where multiple pods with several verticles are all started and deployed at the same time.
Even after the UUID is reset, the ClusterManager (Vert.x HazelcastClusterManager) is still republishing the event-bus subscribers with the stale UUID of the member.
The problem does not immediately manifest , as the IP address of the member POD still remains the same and eventhough the Vert.x NodeSelector does a lookup with stale UUID, the correct IP address is still returned.

Issue creeps up the moment, a member POD is restarted either due to a delete or a crash and a new member POD comes up with a new IP address and UUID.
Hazelcast is now broadcasting a member removed event with the latest member UUID(of the deleted POD) to other members but since the subs-map and node-map is not updated during the reset (earlier) , it is not getting evicted which is resulting in a huge problem with eventBus timing out mostly during a reply as it keeps trying to reply to a member that is already gone.

Sample logs:

Service 1 (First POD)

=======
Line 144-155

jvm 1 | Members {size:1, ver:1} [
jvm 1 | Member [10.244.237.60]:5701 - 202d7784-d695-4d46-83ba-24541b6a2943 this
jvm 1 | ]
jvm 1 |
jvm 1 | 2022-12-16 12:12:18.265+0000 [] [vert.x-worker-thread-0] INFO com.hazelcast.core.LifecycleService - [10.244.237.60]:5701 [nimbus-v3] [4.2.4] [10.244.237.60]:5701 is STARTED
jvm 1 | 2022-12-16 12:12:18.265+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - Local Member : 10.244.237.60
jvm 1 | 2022-12-16 12:12:18.265+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - Local Member Node Id: 202d7784-d695-4d46-83ba-24541b6a2943
jvm 1 | 2022-12-16 12:12:18.295+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - Join complete
jvm 1 | 2022-12-16 12:12:18.296+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - [
jvm 1 | 2022-12-16 12:12:18.296+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - ]
jvm 1 | 2022-12-16 12:12:18.296+0000 [] [vert.x-worker-thread-0] INFO i.v.s.c.h.HazelcastClusterManager - []
jvm 1 | 2022-12-16 12:12:18.356+0000 [] [vert.x-eventloop-thread-0] INFO i.v.c.e.i.c.ClusteredEventBus - Starting EventBus with Cluster Address : 10.244.237.60 and nodeId :202d7784-d695-4d46-83ba-24541b6a2943

Line 374:

jvm 1 | 2022-12-16 12:17:13.071+0000 [] [hz.happy_borg.cached.thread-8] WARN c.h.internal.cluster.ClusterService - [10.244.237.60]:5701 [nimbus-v3] [4.2.4] Resetting local member UUID. Previous: 202d7784-d695-4d46-83ba-24541b6a2943, new: e0a23b31-e32c-4ad1-9acc-d43d76c0dd01

Line 1077:

jvm 1 | 2022-12-16 12:17:20.227+0000 [] [vert.x-worker-thread-2] INFO i.v.s.c.hazelcast.impl.SubsMapHelper - republishOwnSubs::
jvm 1 | 2022-12-16 12:17:20.227+0000 [] [vert.x-worker-thread-2] INFO i.v.s.c.hazelcast.impl.SubsMapHelper - republish for adress: 202d7784-d695-4d46-83ba-24541b6a2943

After this POD is deleted:

Service 2 (2nd POD)

=====================
Line 5206 - 5216:

jvm 1 | 2022-12-16 12:26:54.853+0000 [] [hz.sharp_wilson.priority-generic-operation.thread-0] DEBUG c.h.i.p.InternalPartitionService - [10.244.237.57]:5701 [nimbus-v3] [4.2.4] Removing Member [10.244.237.60]:5701 - e0a23b31-e32c-4ad1-9acc-d43d76c0dd01
jvm 1 | 2022-12-16 12:26:54.854+0000 [] [hz.sharp_wilson.priority-generic-operation.thread-0] INFO c.h.internal.cluster.ClusterService - [10.244.237.57]:5701 [nimbus-v3] [4.2.4]
jvm 1 |
jvm 1 | Members {size:3, ver:5} [
jvm 1 | Member [10.244.237.8]:5701 - b12a46d5-1572-4bd3-afff-1e9915b27662
jvm 1 | Member [10.244.237.57]:5701 - d99db539-c9be-4aaf-a813-5e303c63fc5c this
jvm 1 | Member [10.244.237.9]:5701 - 9b4a6841-ceff-494c-a358-dcca647938de
jvm 1 | ]
jvm 1 |
jvm 1 | 2022-12-16 12:26:54.854+0000 [] [hz.sharp_wilson.event-3] INFO i.v.s.c.h.HazelcastClusterManager - New HZ Members removed event: [e0a23b31-e32c-4ad1-9acc-d43d76c0dd01]
jvm 1 | 2022-12-16 12:26:54.856+0000 [] [hz.sharp_wilson.cached.thread-5] INFO c.h.i.server.tcp.TcpServerConnector - [10.244.237.57]:5701 [nimbus-v3] [4.2.4] Could not connect to: /10.244.237.60:5701. Reason: IOException[Connect timed out to address /10.244.237.60:5701]
Conformance:

Your commits should be signed and you should have signed the Eclipse Contributor Agreement as explained in https://github.com/eclipse/vert.x/blob/master/CONTRIBUTING.md
Please also make sure you adhere to the code style guidelines: https://github.com/vert-x3/wiki/wiki/Vert.x-code-style-guidelines

@tsegismont
Copy link
Contributor

Thank you for the PR. We actually had a similar issue reported against 3.9, and we lost the fix when rebuilding the CM for Vert.x 4.

I am going to reintegrate the fix in master and close this one. Thank you

@tsegismont tsegismont closed this Dec 21, 2022
@javastack59
Copy link
Author

Thank you so much. I can confirm that this issue did not occur in 3.9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

None yet

2 participants