
Cluster manager is corrupted after merging of hazelcast partition #90

Closed
michalsida opened this issue Jun 26, 2018 · 12 comments

@michalsida

I think the problem is that Vert.x uses the Hazelcast local endpoint UUID as a constant, unique node identifier, see io.vertx.spi.cluster.hazelcast.HazelcastClusterManager#getNodeID and the initialization of the nodeID field: nodeID = hazelcast.getLocalEndpoint().getUuid() [in io.vertx.spi.cluster.hazelcast.HazelcastClusterManager#join, called from io.vertx.core.impl.VertxImpl#VertxImpl()].

This nodeID is used to register the node in the multimap of topic subscribers, "__vertx.subs", and it is used e.g. to remove subscribers of disconnected nodes (the lambda in io.vertx.core.eventbus.impl.clustered.ClusteredEventBus#setClusterViewChangedHandler).

But it looks like in some situations this UUID is regenerated, see com.hazelcast.instance.Node#setNewLocalMember, e.g. during merging of Hazelcast partitions.

After that, Hazelcast knows the new node UUID, but Vert.x still registers topics under the old value; I did not find any place where the nodeID would be updated. And the lambda from io.vertx.core.eventbus.impl.clustered.ClusteredEventBus#setClusterViewChangedHandler will remove the subscribers of this node from the subscriber multimap.
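To illustrate the cleanup behaviour described above with plain collections (this is not the actual Vert.x code, and the class and method names are mine): any subscriber whose nodeId is absent from the current cluster view gets dropped from the subs map, which is exactly what happens to a node whose UUID was silently regenerated.

```java
import java.util.*;

// Illustration only: the cluster-view-changed cleanup, modelled with plain
// collections instead of the Hazelcast MultiMap. Subscribers whose nodeId
// is no longer in the member list are removed.
public class SubsCleanup {

    static void removeStaleSubs(Map<String, List<String>> subsByTopic, Set<String> memberIds) {
        // Drop every subscriber entry whose nodeId is not a current member.
        subsByTopic.values().forEach(nodeIds -> nodeIds.removeIf(id -> !memberIds.contains(id)));
        // Drop topics that lost all their subscribers.
        subsByTopic.values().removeIf(List::isEmpty);
    }

    public static void main(String[] args) {
        Map<String, List<String>> subs = new HashMap<>();
        subs.put("topic-getCampaignMaterials", new ArrayList<>(Arrays.asList(
                "455989de-a9bc-4964-83d1-ec463bdda952",    // live node on 10.148.250.34
                "82ffa5f9-f059-48be-be16-7528c547fdd8"))); // stale pre-merge UUID
        // After the merge, Hazelcast only reports the regenerated UUID.
        Set<String> memberIds = new HashSet<>(Arrays.asList(
                "455989de-a9bc-4964-83d1-ec463bdda952",
                "2446732d-70df-4201-bccb-7bec82f384fd"));
        removeStaleSubs(subs, memberIds);
        System.out.println(subs); // the stale entry is gone
    }
}
```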

I added a logging mechanism that compares the nodeID from Hazelcast and from the Vert.x HazelcastClusterManager every 30 s:

final ClusterManager clusterManager = ((VertxImpl) vertx).getClusterManager();
final String currentNodeID = ((VertxImpl) vertx).getNodeID();

if (clusterManager instanceof HazelcastClusterManager) {
    String currentHazelcastNodeID = ((HazelcastClusterManager) clusterManager)
            .getHazelcastInstance().getLocalEndpoint().getUuid();
    if (!currentNodeID.equals(currentHazelcastNodeID)) {
        getLogger().error("Hazelcast local endpoint {} UUID {} differs from Vertx NodeId {}",
                ((HazelcastClusterManager) clusterManager).getHazelcastInstance()
                        .getLocalEndpoint().getSocketAddress().toString(),
                currentHazelcastNodeID, currentNodeID);
    }
}

And after a Hazelcast cluster merge, this appears in the log:

TID: [2018-06-21 15:01:39,947] WARN [c.h.i.c.i.DiscoveryJoiner] (hz.MCI_SERVICE_CAMPAIGN.cached.thread-11) [] - [10.148.250.33]:5703 [hazelcast-consul-discovery-spi] [3.8.2] [10.148.250.33]:5703 is merging [tcp/ip] to [10.148.250.34]:5702
TID: [2018-06-21 15:01:39,973] WARN [c.h.i.c.i.o.MergeClustersOperation] (hz.MCI_SERVICE_CAMPAIGN.cached.thread-11) [] - [10.148.250.33]:5703 [hazelcast-consul-discovery-spi] [3.8.2] [10.148.250.33]:5703 is merging to [10.148.250.34]:5702, because: instructed by master [10.148.250.33]:5703
TID: [2018-06-21 15:01:39,977] INFO [c.c.m.l.c.m.h.l.NodeLifecycleListener] (hz.MCI_SERVICE_CAMPAIGN.cached.thread-17) [] - Hazelcast state changed: LifecycleEvent [state=MERGING]
TID: [2018-06-21 15:01:39,978] WARN [c.hazelcast.instance.Node] (hz.MCI_SERVICE_CAMPAIGN.cached.thread-17) [] - [10.148.250.33]:5703 [hazelcast-consul-discovery-spi] [3.8.2] Setting new local member. old uuid: 82ffa5f9-f059-48be-be16-7528c547fdd8 new uuid: 2446732d-70df-4201-bccb-7bec82f384fd
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5702 - 455989de-a9bc-4964-83d1-ec463bdda952,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.37]:5701 - efef4cfe-8463-4e3b-aa34-eca29b0b6157,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5701 - 7efdcfcb-5460-4e8d-ac61-1ac1a8eaba8b,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5706 - 26cf5948-8718-4230-a5bc-1b9ee0ed6015,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5707 - dcfdeb3d-6ebb-474d-80bc-9bfece2d771a,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5702 - aaf66b6b-4026-452c-80d4-cd6cd15fa3a9,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5708 - 72ee2e95-f237-4687-9b1e-973c9cd427b6,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5709 - 170ab501-1b1a-48b1-ad7f-e4cbe12fa5dc,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5703 - ee16f3e5-88f8-4fa4-9efb-a06749ee0996,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5710 - 9f9b3669-4d8b-42c8-b724-52286908f6e0,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5706 - e14bd98a-060a-460f-b352-2fb39399101a,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5711 - eaeeca1d-5c20-4847-8d34-388ed2167f4c,type=added}
TID: [2018-06-21 15:01:46,087] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5704 - 800be4ec-4921-46c8-b20e-067fe4ac3f84,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5705 - 80e49d13-6de8-409e-85ac-59f17deb8f9e,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5703 - fa85141d-b02c-4078-91c6-ed66cd176452,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5704 - b82add35-1d7e-43c8-9388-fdfd997f4121,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5701 - 39e7ae3e-bb62-47c1-8da7-190669c058ef,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5705 - 69b6f627-3840-4b6e-9705-dfd158e64dc3,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5707 - 92d52042-b967-4075-b4b9-f9023bba2d49,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5708 - 09771edf-56fd-496e-a958-725fd4120357,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5709 - 4cd1905b-2e71-4bbd-8733-5f7e5268f30d,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5710 - 366ebd54-714c-4ed4-8637-03cdadac87fe,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.33]:5701 - 764ba439-4812-418d-975c-0c3ad4a84b0f,type=added}
TID: [2018-06-21 15:01:46,089] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.33]:5709 - 23f95919-4682-4003-be2e-cac876b47f70,type=added}
TID: [2018-06-21 15:01:46,089] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.33]:5702 - 2991f9e2-72d8-49e2-9135-b3dd964fe53d,type=added}
TID: [2018-06-21 15:01:46,325] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-6) [] - Hazelcast migration started: MigrationEvent{partitionId=0, status=STARTED, oldOwner=Member [10.148.250.33]:5701 - 764ba439-4812-418d-975c-0c3ad4a84b0f, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,359] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-6) [] - Hazelcast migration completed: MigrationEvent{partitionId=0, status=COMPLETED, oldOwner=Member [10.148.250.33]:5701 - 764ba439-4812-418d-975c-0c3ad4a84b0f, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,646] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-8) [] - Hazelcast migration started: MigrationEvent{partitionId=52, status=STARTED, oldOwner=Member [10.148.238.196]:5710 - 366ebd54-714c-4ed4-8637-03cdadac87fe, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,656] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-8) [] - Hazelcast migration completed: MigrationEvent{partitionId=52, status=COMPLETED, oldOwner=Member [10.148.238.196]:5710 - 366ebd54-714c-4ed4-8637-03cdadac87fe, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,685] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration started: MigrationEvent{partitionId=64, status=STARTED, oldOwner=Member [10.148.250.34]:5701 - 39e7ae3e-bb62-47c1-8da7-190669c058ef, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,693] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration completed: MigrationEvent{partitionId=64, status=COMPLETED, oldOwner=Member [10.148.250.34]:5701 - 39e7ae3e-bb62-47c1-8da7-190669c058ef, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,755] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration started: MigrationEvent{partitionId=33, status=STARTED, oldOwner=Member [10.148.238.196]:5703 - ee16f3e5-88f8-4fa4-9efb-a06749ee0996, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,755] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration completed: MigrationEvent{partitionId=33, status=COMPLETED, oldOwner=Member [10.148.238.196]:5703 - ee16f3e5-88f8-4fa4-9efb-a06749ee0996, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,755] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration started: MigrationEvent{partitionId=58, status=STARTED, oldOwner=Member [10.148.250.34]:5706 - 26cf5948-8718-4230-a5bc-1b9ee0ed6015, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,755] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration completed: MigrationEvent{partitionId=58, status=COMPLETED, oldOwner=Member [10.148.250.34]:5706 - 26cf5948-8718-4230-a5bc-1b9ee0ed6015, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,779] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration started: MigrationEvent{partitionId=103, status=STARTED, oldOwner=Member [10.148.250.34]:5703 - fa85141d-b02c-4078-91c6-ed66cd176452, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,788] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration completed: MigrationEvent{partitionId=103, status=COMPLETED, oldOwner=Member [10.148.250.34]:5703 - fa85141d-b02c-4078-91c6-ed66cd176452, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,804] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration started: MigrationEvent{partitionId=124, status=STARTED, oldOwner=Member [10.148.250.37]:5701 - efef4cfe-8463-4e3b-aa34-eca29b0b6157, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,822] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration completed: MigrationEvent{partitionId=124, status=COMPLETED, oldOwner=Member [10.148.250.37]:5701 - efef4cfe-8463-4e3b-aa34-eca29b0b6157, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,875] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration started: MigrationEvent{partitionId=169, status=STARTED, oldOwner=Member [10.148.238.196]:5701 - 7efdcfcb-5460-4e8d-ac61-1ac1a8eaba8b, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,892] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration completed: MigrationEvent{partitionId=169, status=COMPLETED, oldOwner=Member [10.148.238.196]:5701 - 7efdcfcb-5460-4e8d-ac61-1ac1a8eaba8b, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,892] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-8) [] - Hazelcast migration started: MigrationEvent{partitionId=187, status=STARTED, oldOwner=Member [10.148.250.34]:5708 - 72ee2e95-f237-4687-9b1e-973c9cd427b6, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,907] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-8) [] - Hazelcast migration completed: MigrationEvent{partitionId=187, status=COMPLETED, oldOwner=Member [10.148.250.34]:5708 - 72ee2e95-f237-4687-9b1e-973c9cd427b6, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,928] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-6) [] - Hazelcast migration started: MigrationEvent{partitionId=200, status=STARTED, oldOwner=Member [10.148.238.196]:5705 - 80e49d13-6de8-409e-85ac-59f17deb8f9e, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,944] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-6) [] - Hazelcast migration completed: MigrationEvent{partitionId=200, status=COMPLETED, oldOwner=Member [10.148.238.196]:5705 - 80e49d13-6de8-409e-85ac-59f17deb8f9e, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:47,279] ERROR [c.c.m.s.c.Application] (vert.x-eventloop-thread-0) [] - Hazelcast local endpoint /10.148.250.33:5703 UUID 2446732d-70df-4201-bccb-7bec82f384fd differs from Vertx NodeId 82ffa5f9-f059-48be-be16-7528c547fdd8

But newly registered subscribers are still registered under 82ffa5f9-f059-48be-be16-7528c547fdd8. I registered some subscribers after this operation, and the MultiMap contains e.g. this:

{
  "key": "topic-getCampaignMaterials",
  "values": [
    {
      "serverId": "10.148.250.34:15702", -- subscriber from another node
      "nodeId": "455989de-a9bc-4964-83d1-ec463bdda952"
    },
    {
      "serverId": "10.148.250.33:15703", -- Hazelcast port + 10000
      "nodeId": "82ffa5f9-f059-48be-be16-7528c547fdd8"
    }
  ]
}

but 82ffa5f9-f059-48be-be16-7528c547fdd8 is not in the list of Hazelcast members; only UUID 2446732d-70df-4201-bccb-7bec82f384fd exists there for [10.148.250.33]:5703. And if some nodes are removed from or added to the cluster, the lambda in io.vertx.core.eventbus.impl.clustered.ClusteredEventBus#setClusterViewChangedHandler will remove these subscribers, I think.

And the subscribers registered earlier from this node are lost, because multimap recovery was only implemented in recent Hazelcast releases. I tried the latest Hazelcast release; multimap recovery may be solved there, but the problem with the stale nodeID UUID remains, so the subscribers are removed from the map by Vert.x anyway.

Shouldn't the nodeID be updated after the Hazelcast merge notification?
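For what it's worth, Hazelcast fires a lifecycle event when a split-brain merge completes, so a cluster manager could in principle react there. A rough sketch of that idea (this is not the actual Vert.x code, and refreshing the nodeID alone would not be enough — the existing "__vertx.subs" entries would also have to be re-registered):

```java
// Sketch only: detect the post-merge UUID change via Hazelcast's
// LifecycleListener (LifecycleState.MERGED exists in Hazelcast 3.x).
hazelcast.getLifecycleService().addLifecycleListener(event -> {
    if (event.getState() == LifecycleEvent.LifecycleState.MERGED) {
        String newUuid = hazelcast.getLocalEndpoint().getUuid();
        // The cluster manager's nodeID field would need to be updated to
        // newUuid here, and this node's subscribers re-registered under it.
    }
});
```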

Used versions: Vert.X 3.5.2, Hazelcast: 3.8.2 (and 3.10.2)

Link to original discussion topic.

@tsegismont
Contributor

@michalsida thank you, great report!

@tsegismont tsegismont added this to the 3.6.0 milestone Jun 26, 2018
@tsegismont tsegismont self-assigned this Jun 26, 2018
@rvega-arg

I can confirm I'm having the same issue with Vert.x 3.5.3, Hazelcast 3.8.2 (and 3.10.4).

@michalsida did you find any workaround?

@michalsida
Author

@rvega-arg My workaround is a timer that checks the internal state of the topic multimap once per minute; if it detects that any of the node's own registered topics is missing from the multimap, it unregisters all of the member's topics and registers them again.
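Sketched out, that watchdog looks something like the following. The names ownTopics, unregisterAllConsumers and reRegisterAllConsumers are placeholders for application-level pieces, and reading "__vertx.subs" directly relies on Vert.x internals (ClusterNodeInfo), so treat this as an outline rather than a drop-in solution:

```java
// Sketch of the workaround described above. ownTopics and the
// unregister/re-register helpers are application-level placeholders;
// "__vertx.subs" is a Vert.x-internal multimap.
vertx.setPeriodic(60_000, timerId -> {
    String nodeId = ((VertxImpl) vertx).getNodeID();
    MultiMap<String, ClusterNodeInfo> subs = hazelcast.getMultiMap("__vertx.subs");
    boolean missing = ownTopics.stream().anyMatch(topic ->
            subs.get(topic).stream().noneMatch(info -> nodeId.equals(info.nodeId)));
    if (missing) {
        unregisterAllConsumers();   // drop this node's event bus consumers
        reRegisterAllConsumers();   // register them again, repopulating the multimap
    }
});
```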

@rvega-arg

Another related issue about Hazelcast multimaps: hazelcast/hazelcast#13559

tsegismont added a commit to tsegismont/vertx-hazelcast that referenced this issue Sep 19, 2018
…ast partition

The original HZ member uuid is saved as a cluster member attribute.
Consequently, even if HZ changes it during partition merging, HAManager and ClusteredEventBus will keep using the same original uuid.

Note that when member added event is handled by existing cluster nodes, the member attribute may not be saved yet.
In this case, it is safe to use member uuid (at this stage, the member uuid and member attribute are equal).
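The approach in the commit message can be sketched roughly like this (illustrative only, not the actual vertx-hazelcast code; the attribute key is made up, but setStringAttribute/getStringAttribute do exist on Member in Hazelcast 3.x):

```java
// Sketch of the fix described above. The original UUID is stored as a
// member attribute on join; attributes survive the UUID regeneration
// that happens during a partition merge.
class NodeIdHolder {
    static final String ATTR = "__vertx.nodeId"; // illustrative attribute key

    static void saveOriginalUuid(HazelcastInstance hazelcast) {
        Member local = hazelcast.getCluster().getLocalMember();
        local.setStringAttribute(ATTR, local.getUuid());
    }

    // Prefer the attribute; fall back to the member uuid if the attribute
    // has not been replicated yet (at that point the two are still equal).
    static String nodeIdOf(Member member) {
        String attr = member.getStringAttribute(ATTR);
        return attr != null ? attr : member.getUuid();
    }
}
```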
@Birmania

@michalsida First, I want to thank you for this analysis. We are currently encountering the same problem as you on our Project.

Question: Your workaround would be really useful in our context. If I understand correctly, your watchdog iterates over the subs multimap (every minute) to check the existence of the topics currently owned and registered by the verticle and, if one is missing, you unregister/re-register the full verticle?

However, would it be sufficient to build the watchdog on the sole principle of comparing the Vert.x UUID with the Hazelcast UUID? Can we assume that distinct UUIDs are always the result of a split-brain merge? If so, it seems simpler than checking the subs multimap, but I could be wrong.
The only downside I see is that it could redeploy your verticle even if you do not consume any topic on the clustered event bus...

Thanks for your answer/help !

@tsegismont
Contributor

@Birmania see #95, this should be fixed in 3.6

tsegismont added a commit to tsegismont/vertx-hazelcast that referenced this issue Sep 20, 2018
@pmlopes pmlopes removed the to review label Sep 20, 2018
@tsegismont tsegismont modified the milestones: 3.6.0, 3.5.4 Sep 20, 2018
tsegismont added a commit that referenced this issue Sep 20, 2018
@michalsida
Author

@Birmania Yes, I did it exactly that way. I can send a code snippet that covers this.
Maybe checking the node UUID alone would be sufficient, but I was lucky to find a working solution, so I'm keeping it this way.

@tsegismont Great, I'm looking forward to this version

@tsegismont
Contributor

@michalsida it would be great if you could give the snapshot version a try. In any case, thanks again for the thorough analysis, it was a great contribution!

@Birmania

@tsegismont Excellent news about the fix! However, we need a workaround to deliver to a client in 4 weeks.

@michalsida Thanks for the answer, I am really interested in your snippet. How can we exchange it?

@tsegismont
Contributor

@Birmania the fix has been backported to the 3.5 branch. Vert.x 3.5.4 should be out in the next couple of weeks.

@michalsida
Author

@tsegismont I hope to try it soon and give feedback

@Birmania Look at this snippet. It's a little ugly and there are some references to our other code (and some references were removed before posting), but I hope it illustrates my approach; it works for our purposes.

@Birmania

Birmania commented Oct 12, 2018

@michalsida Thanks a lot for this snippet, it will be very useful for us !

@tsegismont Cool! Thanks for the tip about the incoming backport.
Edit: I checked the pom and it does not use the Hazelcast 3.10 version (which handles MultiMap merging). Will the subscribers map be OK after the split-brain merge?
