In a three-node Consul cluster with server nodes 0, 1, and 2, the following test leaves the cluster unable to elect a leader:
1. Stop one follower node. For example, with node 0 as the leader, stop node 1.
2. Allow enough time for the leader to tell the other follower (node 2) that node 1 has left the cluster.
3. Stop the leader, in this case node 0.
4. Bring node 1 back.
Now the cluster can never elect a leader, even though a quorum of nodes is running: node 2's configuration contains only itself and the old leader (node 0), so it will neither accept vote requests from node 1 nor send vote requests to it, and node 0 is down.
I think this happens because only the leader can update the other followers' configuration, and that cannot happen while there is no leader.
This looks like an important bug to me, but I need someone to confirm it.
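The rejection described above comes down to a membership check. The following is a hypothetical sketch of that check (illustrative names, not Consul's actual code): a Raft follower only grants votes to candidates that appear in its latest known configuration, so a restarted node that was removed from the configuration can never win an election.

```go
package main

import "fmt"

// grantVote models the membership check: a follower grants a vote
// only if the candidate is in its current configuration.
// This is an illustrative sketch, not Consul's implementation.
func grantVote(configuration map[string]bool, candidate string) bool {
	return configuration[candidate]
}

func main() {
	// Node 2's configuration after the leader removed node 1.
	config := map[string]bool{"node0": true, "node2": true}

	fmt.Println(grantVote(config, "node1")) // false: node 1 is rejected
	fmt.Println(grantVote(config, "node0")) // true: node 0 would be granted, but it is down
}
```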
I caused this behaviour using the latest Consul Docker image. Here are the commands that should reproduce the issue:
docker run \
-d \
-p 8500:8500 \
-p 8600:8600/udp \
--name=node0 \
consul agent -server -ui -node=server-0 -bootstrap-expect=3 -client=0.0.0.0 \
-retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4
docker run \
-d \
--name=node1 \
consul agent -server -ui -node=server-1 -bootstrap-expect=3 -client=0.0.0.0 \
-retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4
docker run \
-d \
--name=node2 \
consul agent -server -ui -node=server-2 -bootstrap-expect=3 -client=0.0.0.0 \
-retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4
docker stop node1 # given that node0 is the leader
docker stop node0
docker start node1
After running this you should see messages similar to the following on node 2:
failed to make requestVote RPC: target="{Voter 7505e313-c898-de46-944f-921948a36bb8 172.17.0.3:8300}" error="dial tcp <nil>->172.17.0.3:8300: connect: no route to host" term=17
rejecting vote request since node is not in configuration: from=172.17.0.2:8300
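The logs above reflect node 2's view of the world: it still tries to contact node 0 (unreachable) and rejects node 1 (not in its configuration). A small sketch of the election arithmetic, using illustrative names rather than Consul's internals, shows why no candidate can win:

```go
package main

import "fmt"

// quorum returns the number of votes needed for a Raft configuration
// with n voters: a strict majority, n/2 + 1.
func quorum(voters int) int { return voters/2 + 1 }

// reachableVotes counts voters that are both in the configuration
// and currently alive; only these can contribute votes.
func reachableVotes(configuration, alive map[string]bool) int {
	n := 0
	for id := range configuration {
		if alive[id] {
			n++
		}
	}
	return n
}

func main() {
	config := map[string]bool{"node0": true, "node2": true} // node 1 was removed
	alive := map[string]bool{"node1": true, "node2": true}  // node 0 is down

	need := quorum(len(config))
	have := reachableVotes(config, alive)
	fmt.Printf("need %d votes, can gather %d, election possible: %v\n", need, have, have >= need)
}
```

Two of the three processes are running, but only one of them (node 2) counts toward the two-vote quorum that node 2's configuration requires, so the election can never succeed.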
I also opened a bug in Consul as that's what I used to reproduce the problem: hashicorp/consul#15940
I re-ran these steps with Consul version 1.14.3 and was unable to reproduce the problem reliably. With the following changes to the steps, it should be reliably reproducible:
use docker kill instead of docker stop when stopping the leader node.
after killing the first node, give the leader enough time to update the cluster's Raft configuration before killing the leader. You will know enough time has elapsed when you see this log on the leader: 2023-02-06T11:30:42.806Z [INFO] agent.server.raft: updating configuration: command=RemoveServer server-id=25760f36-3d02-eda1-bccf-b7a05ee0d9c5 server-addr= servers="[{Suffrage:Voter ID:4ad7524b-7124-06cd-f40e-3d980ec4ff30 Address:172.17.0.3:8300} {Suffrage:Voter ID:8ace7581-ccd7-4e2d-ee9f-8c8018775b4f Address:172.17.0.4:8300}]"
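The RemoveServer log line above shows the configuration change that sets up the deadlock. A simplified model of that transition (not hashicorp/raft's implementation) is just filtering the removed server out of the voter list, which the surviving follower then adopts:

```go
package main

import "fmt"

// server is a simplified stand-in for a Raft configuration entry.
type server struct {
	id   string
	addr string
}

// removeServer returns a new configuration without the given server,
// mirroring the effect of the RemoveServer command in the log above.
func removeServer(servers []server, id string) []server {
	out := make([]server, 0, len(servers))
	for _, s := range servers {
		if s.id != id {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	cluster := []server{
		{"server-0", "172.17.0.2:8300"},
		{"server-1", "172.17.0.3:8300"},
		{"server-2", "172.17.0.4:8300"},
	}
	cluster = removeServer(cluster, "server-1")
	fmt.Println(cluster) // the two remaining voters
}
```

Once this two-voter configuration is committed, killing the leader leaves node 2 with a configuration in which node 1 does not exist, which is exactly the state the reproduction relies on.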