Cluster unable to elect a leader after restarting a follower and stopping the leader while the follower is down #535

Closed
thomacr opened this issue Jan 9, 2023 · 2 comments

Comments


thomacr commented Jan 9, 2023

In a three-node Consul cluster with server nodes 0, 1 and 2, the cluster cannot elect a leader after I run the following test:

  • Stop one follower node. For example, say that node 0 is the leader. I stop node 1.
  • Allow enough time for the leader to tell the other follower (node 2) that node 1 has left the cluster.
  • Stop the leader, in this case, node 0.
  • Bring back node 1.

Now the cluster can never elect a leader, even though it has a quorum. Node 2's configuration contains only itself and the old leader, node 0, so node 2 neither accepts vote requests from node 1 nor sends any to it, and node 0 is down.
I think this happens because only the leader can update the followers' configuration, and that cannot happen while there is no leader.
This looks like an important bug to me, but I need someone to confirm that.
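
The reasoning above can be modelled with a short shell sketch. It is only an illustration, not Consul or Raft code: the names `config_of`, `in_config`, `votes_for`, `needed_for`, and `ALIVE` are hypothetical, the per-node configurations are assumed from the description (node 1 still holds the original three-node configuration, node 2 holds the one with node 1 removed), and terms and log comparison are ignored.

```shell
#!/bin/sh
# Hypothetical model of the scenario; not Consul or hashicorp/raft code.

# Nodes that are running after "docker start node1" (node0 is stopped).
ALIVE="node1 node2"

# config_of NODE: the Raft configuration NODE is assumed to hold.
config_of() {
    case $1 in
        node1) echo "node0 node1 node2" ;;  # stale: stopped before it was removed
        node2) echo "node0 node2" ;;        # after the leader removed node1
    esac
}

# quorum N: votes needed for a majority of N voters.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# in_config NODE LIST: succeeds if NODE is in the space-separated LIST.
in_config() {
    for member in $2; do
        [ "$member" = "$1" ] && return 0
    done
    return 1
}

# votes_for CANDIDATE: count peers in the candidate's configuration that are
# running and whose own configuration lists the candidate (itself included).
votes_for() {
    candidate=$1
    votes=0
    for peer in $(config_of "$candidate"); do
        if in_config "$peer" "$ALIVE" &&
           in_config "$candidate" "$(config_of "$peer")"; then
            votes=$(( votes + 1 ))
        fi
    done
    echo "$votes"
}

# needed_for CANDIDATE: majority size of the candidate's own configuration.
needed_for() {
    set -- $(config_of "$1")
    quorum $#
}

for node in node1 node2; do
    echo "$node: $(votes_for "$node") vote(s), needs $(needed_for "$node")"
done
```

Under these assumptions each surviving node collects only its own vote while needing two, so neither can win an election.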

I caused this behaviour using the latest Consul Docker image. Here are the commands that should reproduce the issue:

docker run \
    -d \
    -p 8500:8500 \
    -p 8600:8600/udp \
    --name=node0 \
    consul agent -server -ui -node=server-0 -bootstrap-expect=3 -client=0.0.0.0 \
    -retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4

docker run \
    -d \
    --name=node1 \
    consul agent -server -ui -node=server-1 -bootstrap-expect=3 -client=0.0.0.0 \
    -retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4

docker run \
    -d \
    --name=node2 \
    consul agent -server -ui -node=server-2 -bootstrap-expect=3 -client=0.0.0.0 \
    -retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4

docker stop node1  # assumes node0 is the current leader

docker stop node0

docker start node1
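
The commands above assume node0 is the leader. One way to check that before stopping anything is to parse the output of `consul operator raft list-peers`. The helper below is a minimal sketch: the name `leader_from_peers` is hypothetical, and it assumes the usual table layout in which the first column is the node name and the fourth column is the Raft state.

```shell
#!/bin/sh
# leader_from_peers: read "consul operator raft list-peers" output on stdin
# and print the name of the node whose State column reads "leader".
# Assumes a header row followed by rows of: Node ID Address State ...
leader_from_peers() {
    awk 'NR > 1 && $4 == "leader" { print $1 }'
}

# Usage against the containers started above (requires a running cluster):
#   docker exec node0 consul operator raft list-peers | leader_from_peers
```

If the printed name is not server-0, swap the container names in the stop and start commands accordingly.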

After running the reproduction commands, node 2 should log messages similar to the following:

failed to make requestVote RPC: target="{Voter 7505e313-c898-de46-944f-921948a36bb8 172.17.0.3:8300}" error="dial tcp <nil>->172.17.0.3:8300: connect: no route to host" term=17
rejecting vote request since node is not in configuration: from=172.17.0.2:8300

I also opened a bug in Consul as that's what I used to reproduce the problem: hashicorp/consul#15940

thomacr (Author) commented Mar 7, 2023

I re-ran these steps with Consul version 1.14.3 and could not reproduce the problem reliably. The following changes to the steps should make it reliably reproducible:

  • Use docker kill instead of docker stop when stopping the leader node.
  • After killing the first node, give the leader enough time to update the cluster's Raft configuration before killing the leader. Enough time has elapsed once you see this log line on the leader:
    2023-02-06T11:30:42.806Z [INFO] agent.server.raft: updating configuration: command=RemoveServer server-id=25760f36-3d02-eda1-bccf-b7a05ee0d9c5 server-addr= servers="[{Suffrage:Voter ID:4ad7524b-7124-06cd-f40e-3d980ec4ff30 Address:172.17.0.3:8300} {Suffrage:Voter ID:8ace7581-ccd7-4e2d-ee9f-8c8018775b4f Address:172.17.0.4:8300}]"
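
Waiting for that log line can be scripted instead of watched by hand. The following is a minimal sketch, assuming the leader's container name is passed in and that the log text matches the line quoted above. The function names `saw_remove_server` and `wait_for_remove_server` and the 60-second default timeout are hypothetical.

```shell
#!/bin/sh
# saw_remove_server: succeeds if the log text on stdin contains a
# RemoveServer configuration change.
saw_remove_server() {
    grep -q 'updating configuration: command=RemoveServer'
}

# wait_for_remove_server CONTAINER [TIMEOUT_SECONDS]: poll the container's
# logs once per second until the RemoveServer line appears or time runs out.
wait_for_remove_server() {
    container=$1
    timeout=${2:-60}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if docker logs "$container" 2>&1 | saw_remove_server; then
            return 0
        fi
        sleep 1
        elapsed=$(( elapsed + 1 ))
    done
    return 1
}

# Usage, assuming node0 is the leader:
#   docker kill node1
#   wait_for_remove_server node0 && docker kill node0
#   docker start node1
```

Polling the full log each second is wasteful but keeps the sketch simple and avoids leaving a background `docker logs -f` process behind.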

banks (Member) commented Apr 24, 2023

Hi, I don't think this is a bug. The scenario you described loses quorum, which by Raft's design requires manual recovery.

The issue is complicated by autopilot features in some HashiCorp products, but I commented on ways to control those in the Consul issue #15940.

Closing as I think this is working as expected, but let us know if there is something we overlooked here!

Thanks for reporting!

banks closed this as completed Apr 24, 2023