Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Investigation of jet.enterprise.BlueGreenClientTest.testOriginalClusterComebackBeforeSwitching revealed that we have a bug because of a race. The race happens as follows: we have two clusters with one member each. clusterA on port 5701 clusterB on port 5702 The client is on clusterA and expected to switch to clusterB when clusterA is down. After clusterA is restarted, client starts connection attempts to 5701 to connect back. Sends two authentication attempts to 5701 ``` 2021-03-04 06:37:32,333 [ INFO] [hz.flamboyant_buck.priority-generic-operation.thread-0] [c.h.c.i.p.t.AuthenticationMessageTask]: [127.0.0.1]:5701 [clusterA] [4.5-SNAPSHOT] Received auth from Connection[id=1, /127.0.0.1:5701->/127.0.0.1:38650, qualifier=null, endpoint=[127.0.0.1]:38650, alive=true, connectionType=JVM, planeIndex=-1], successfully authenticated, clientUuid: 040378d9-ce22-4efd-a25a-6c6dc11f9122, client version: 4.2-SNAPSHOT 2021-03-04 06:37:32,336 [ INFO] [hz.flamboyant_buck.priority-generic-operation.thread-0] [c.h.c.i.p.t.AuthenticationMessageTask]: [127.0.0.1]:5701 [clusterA] [4.5-SNAPSHOT] Received auth from Connection[id=2, /127.0.0.1:5701->/127.0.0.1:45775, qualifier=null, endpoint=[127.0.0.1]:45775, alive=true, connectionType=JVM, planeIndex=-1], successfully authenticated, clientUuid: 040378d9-ce22-4efd-a25a-6c6dc11f9122, client version: 4.2-SNAPSHOT ``` Gots a response back from one of them and finds out that cluster id has changed. At this point, we fallback to failover logic "Force to hard cluster switch" ```/////// 2021-03-04 06:37:32,338 [ INFO] [hz.client_2.internal-1] [c.h.c.i.c.t.TcpClientConnection]: hz.client_2 [clusterA] [4.5-SNAPSHOT] [4.2-SNAPSHOT] ClientConnection{alive=false, connectionId=11, channel=NioChannel{/127.0.0.1:38650->/127.0.0.1:5701}, remoteAddress=[127.0.0.1]:5701, lastReadTime=2021-03-04 06:37:32.337, lastWriteTime=2021-03-04 06:37:32.332, closedTime=2021-03-04 06:37:32.337, connected server version=4.2-SNAPSHOT} closed. Reason: Force to hard cluster switch 2021-03-04 06:37:32,339 [ WARN] [hz.client_2.internal-1] [c.h.c.i.c.ClientConnectionManager]: hz.client_2 [clusterA] [4.5-SNAPSHOT] [4.2-SNAPSHOT] Exception during initial connection to Member [127.0.0.1]:5701 - 047bf9a6-2604-4c69-895b-d5504e16ea65: com.hazelcast.client.ClientNotAllowedInClusterException: Force to hard cluster switch 2021-03-04 06:37:32,339 [ WARN] [hz.client_2.internal-1] [c.h.c.i.c.ClientConnectionManager]: hz.client_2 [clusterA] [4.5-SNAPSHOT] [4.2-SNAPSHOT] Stopped trying on the cluster: clusterA reason: Force to hard cluster switch ``` We see in the logs that client has changed context and will try to connect to clusterB ``` 2021-03-04 06:37:32,339 [ INFO] [hz.client_2.internal-1] [c.h.c.i.c.ClientConnectionManager]: hz.client_2 [clusterB] [4.5-SNAPSHOT] [4.2-SNAPSHOT] Trying to connect to next cluster: clusterB 2021-03-04 06:37:32,340 [ INFO] [hz.client_2.internal-1] [c.h.c.i.c.ClientConnectionManager]: hz.client_2 [clusterB] [4.5-SNAPSHOT] [4.2-SNAPSHOT] Trying to connect to [127.0.0.1]:5701 ``` Attempts 5701 because on the client config ports are not given but the cluster names are different. So we expect this one to fail the authentication. ``` 2021-03-04 06:37:32,346 [ WARN] [hz.flamboyant_buck.generic-operation.thread-32] [c.h.c.i.p.t.AuthenticationMessageTask]: [127.0.0.1]:5701 [clusterA] [4.5-SNAPSHOT] Received auth from Connection[id=3, /127.0.0.1:5701->/127.0.0.1:42692, qualifier=null, endpoint=null, alive=true, connectionType=NONE, planeIndex=-1] with clientUuid 040378d9-ce22-4efd-a25a-6c6dc11f9122, authentication failed 2021-03-04 06:37:32,528 [ WARN] [hz.client_2.internal-1] [c.h.c.i.c.t.TcpClientConnection]: hz.client_2 [clusterB] [4.5-SNAPSHOT] [4.2-SNAPSHOT] ClientConnection{alive=false, connectionId=13, channel=NioChannel{/127.0.0.1:42692->/127.0.0.1:5701}, remoteAddress=null, lastReadTime=2021-03-04 06:37:32.526, lastWriteTime=2021-03-04 06:37:32.343, closedTime=2021-03-04 06:37:32.528, connected server version=null} closed. Reason: Failed to authenticate connection com.hazelcast.client.AuthenticationException: Authentication failed. The configured cluster name on the client (see ClientConfig.setClusterName()) does not match the one configured in the cluster or the credentials set in the Client security config could not be authenticated ``` Just before that we see that client is switching cluster(which should not happen) ``` 2021-03-04 06:37:32,341 [ WARN] [hz.client_2.internal-4] [c.h.c.i.c.ClientConnectionManager]: hz.client_2 [clusterB] [4.5-SNAPSHOT] [4.2-SNAPSHOT] Switching from current cluster: f9c285d1-168d-460d-8b94-f1db7687a948 to new cluster: b3a9a9af-763e-4d36-b810-9d7e31de8640 2021-03-04 06:37:32,342 [ INFO] [hz.client_2.internal-4] [c.h.c.HazelcastInstance]: hz.client_2 [clusterB] [4.5-SNAPSHOT] [4.2-SNAPSHOT] Clearing local state of the client, because of a cluster restart. ``` And also client says that it is connected to 5701 again ``` 2021-03-04 06:37:32,476 [ INFO] [hz.client_2.internal-4] [c.h.c.i.c.ClientConnectionManager]: hz.client_2 [clusterB] [4.5-SNAPSHOT] [4.2-SNAPSHOT] Authenticated with server [127.0.0.1]:5701:35d11648-d679-4651-b1b7-333302e89e1b, server version: 4.2-SNAPSHOT, local address: /127.0.0.1:45775 2021-03-04 06:37:32,482 [ INFO] [hz.client_2.internal-4] [c.h.c.LifecycleService]: hz.client_2 [clusterB] [4.5-SNAPSHOT] [4.2-SNAPSHOT] HazelcastClient 4.2-SNAPSHOT (20210303 - 83fa58a, 9a3e1e7) is CLIENT_CONNECTED ``` This is happened because of the second connection attempt to 5701. First one is rejected and we made client to switch by force. They were setting a global switch `switchingToNextCluster` to denote that further responses are OK to change the switch to cluster. The false assumption was that we will not get any response other than the path that started with failover logic. But the response to the second connection was done by periodic task `ConnectToAllClusterMembersTask`, and its intend is not to switch cluster. The solution: Together with global state `switchingToNextCluster`, I have introduced a boolean on the connect methods to deliver the intend of the callee so that we can decide on the response by comparing these two. If the connection is opened without switching cluster intend, it could be connected to false cluster and we should close it back. Also note that the boolean `switchingToNextCluster` is moved to `ClientState.SWITCHING_CLUSTER`. fixes https://github.com/hazelcast/hazelcast-jet-enterprise/issues/155 backport of #18511 (cherry picked from commit fbe3386)
- Loading branch information