
Cluster unable to elect a leader after restarting a follower and stopping the leader while the follower is down #15940

Closed
thomacr opened this issue Jan 9, 2023 · 5 comments

Comments


thomacr commented Jan 9, 2023

In a three-node Consul cluster, with server nodes 0, 1, and 2, if I run the following test, the cluster cannot elect a leader:

  • Stop one follower node. For example, if node 0 is the leader, stop node 1.
  • Allow enough time for the leader to tell the other follower (node 2) that node 1 has left the cluster.
  • Stop the leader, in this case, node 0.
  • Bring back node 1.

Now the cluster can never elect a leader, even though it has a quorum of live nodes: node 2's Raft configuration contains only itself and the old leader (node 0), so it will neither accept vote requests from node 1 nor send them to it, and node 0 is down.
I think this happens because only the leader can update the other followers' configuration, and that cannot happen while there is no leader.
This looks to me like an important bug, but I need someone to confirm that.

I caused this behaviour using the latest Consul Docker image. Here are the commands that should reproduce the issue:

docker run \
    -d \
    -p 8500:8500 \
    -p 8600:8600/udp \
    --name=node0 \
    consul agent -server -ui -node=server-0 -bootstrap-expect=3 -client=0.0.0.0 \
    -retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4

docker run \
    -d \
    --name=node1 \
    consul agent -server -ui -node=server-1 -bootstrap-expect=3 -client=0.0.0.0 \
    -retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4

docker run \
    -d \
    --name=node2 \
    consul agent -server -ui -node=server-2 -bootstrap-expect=3 -client=0.0.0.0 \
    -retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4

docker stop node1 #given that node0 is the leader

docker stop node0

docker start node1

After running this, you should see messages similar to the following on node 2:

failed to make requestVote RPC: target="{Voter 7505e313-c898-de46-944f-921948a36bb8 172.17.0.3:8300}" error="dial tcp <nil>->172.17.0.3:8300: connect: no route to host" term=17
rejecting vote request since node is not in configuration: from=172.17.0.2:8300
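
One way to confirm what node 2's configuration actually holds at this point is to list the Raft peers directly on node 2 (a sketch using the container names above; the -stale flag lets a non-leader server answer):

# Query node 2's view of the Raft configuration; the restarted
# follower (node 1) should be absent from the list.
docker exec node2 consul operator raft list-peers -stale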

I also opened a bug in hashicorp/raft, as I think the problem here is with the Raft implementation rather than Consul: hashicorp/raft#535


thomacr commented Mar 7, 2023

I re-ran these steps with Consul version 1.14.3 and was unable to reproduce the problem reliably. With the following changes to the steps, it should be reliably reproducible (the combined command sequence is sketched after the list):

  • use docker kill instead of docker stop when stopping the leader node.
  • after taking down the first node (the follower), give the leader enough time to update the cluster's Raft configuration before killing the leader. You will know enough time has elapsed when you see this log line on the leader:
    2023-02-06T11:30:42.806Z [INFO] agent.server.raft: updating configuration: command=RemoveServer server-id=25760f36-3d02-eda1-bccf-b7a05ee0d9c5 server-addr= servers="[{Suffrage:Voter ID:4ad7524b-7124-06cd-f40e-3d980ec4ff30 Address:172.17.0.3:8300} {Suffrage:Voter ID:8ace7581-ccd7-4e2d-ee9f-8c8018775b4f Address:172.17.0.4:8300}]"
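
Put together with the docker run commands from the original report, the adjusted sequence is roughly the following (assuming node0 is the leader):

docker stop node1    # stop a follower first (node0 is the leader)
# wait for node0 to log "updating configuration: command=RemoveServer ..."
docker kill node0    # kill the leader instead of stopping it
docker start node1   # bring the follower back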


erikschul commented Mar 24, 2023

I seem to be hitting the same issue. I agree that it "seems that this is an important bug", but I also want to confirm that this is not a configuration issue.

According to the simulator at https://observablehq.com/@stwind/raft-consensus-simulator
this sequence of events should be recoverable.

Steps to reproduce:

I have three VMs running.
I've started the cluster with bootstrap-expect 3 and the cluster forms correctly.

consul agent -server -bootstrap-expect 3 -retry-join 10.13.0.1 -retry-join 10.13.0.2 -retry-join 10.13.0.3 -ui -bind '{{ GetPrivateInterfaces | include "network" "10.13.0.0/24" | attr "address" }}' -data-dir /home/user/consul

I then stopped the cluster, and started it without bootstrapping. The cluster forms correctly.

consul agent -server -retry-join 10.13.0.1 -retry-join 10.13.0.2 -retry-join 10.13.0.3 -ui -bind '{{ GetPrivateInterfaces | include "network" "10.13.0.0/24" | attr "address" }}' -data-dir /home/user/consul

I now have members M1, M2, and M3.
M2 is the leader. I stop M2, and M3 is elected leader.
I stop M3. Only M1 remains. I start M2.

consul members shows M1 and M2 as alive... but they never elect a leader.
They repeatedly log the following sequence of events:

Election timeout reached, restarting election
entering candidate state: node="Node at 10.13.0.1:8300 [Candidate]" term=186
unable to get address for server, using fallback address: [...] fallback=10.13.0.3:8300 [...]
failed to make requestVote RPC: target="{Voter 14c01a6a-671e-6695-fcb4-f81b308b1d26 10.13.0.3:8300}" error="dial tcp 10.13.0.1:0->10.13.0.3:8300: connect: connection refused" term=186
error getting server health from server: server=m3 error="context deadline exceeded"
error getting server health from server: server=m3 error="rpc error getting client: failed to get conn: dial tcp 10.13.0.1:0->10.13.0.3:8300: connect: connection refused"
Coordinate update error: error="No cluster leader"
rejecting vote request since node is not in configuration: from=10.13.0.2:8300
Election timeout reached, restarting election

The most relevant error seems to be rejecting vote request since node is not in configuration, i.e. that M2 was unregistered from the cluster when it went down?
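
A quick way to see the mismatch (a sketch, run on M1): consul members reflects Serf/LAN gossip, while the Raft voter list is a separate configuration, so a server can show as alive in members yet be missing from the voters.

# Serf membership: M2 shows as alive again after it restarts
consul members

# Raft configuration as this server sees it; -stale lets a non-leader
# answer. If M2 is missing here, it was removed from the voter set
# while it was down.
consul operator raft list-peers -stale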

@erikschul

It seems that this bug is related to Autopilot being enabled by default.
The bug disappears when disabling auto pruning:

consul operator autopilot set-config -cleanup-dead-servers=false

The idea of pruning is great, but can you explain why it cannot support a server rejoining the cluster?
This problem could occur even after something as simple as a reboot that takes 2 minutes.
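
If it helps, the resulting Autopilot settings can be read back with the operator CLI:

# Verify the change; CleanupDeadServers should now read false
consul operator autopilot get-config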

@erikschul

Related issue: hashicorp/raft#524


banks commented Apr 24, 2023

Hi! The issue describes a scenario where the cluster is allowed to lose quorum: only 1 of the 3 nodes is both available and part of the current consensus configuration, so per Raft's guarantees it is expected that the cluster would be unable to recover without manual intervention.
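
For reference, the manual intervention in that state is the documented peers.json outage-recovery procedure. A rough sketch for the VM setup above (raft protocol 3 format; the node IDs are placeholders, and the data dir is the one from the commands earlier in this thread):

# Stop the remaining servers, write peers.json into each server's
# raft directory listing the intended voters, then restart them.
# Each server's real ID can be found in <data-dir>/node-id.
cat > /home/user/consul/raft/peers.json <<'EOF'
[
  { "id": "<node-id-of-M1>", "address": "10.13.0.1:8300", "non_voter": false },
  { "id": "<node-id-of-M2>", "address": "10.13.0.2:8300", "non_voter": false }
]
EOF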

One way this can be mitigated in Consul is by configuring the autopilot.min_quorum setting, which specifically prevents Autopilot from reducing the cluster below the minimum number of servers you expect to have. You still get the benefits of auto-pruning dead servers when you replace them or scale up and down, but without the risk of servers being removed below the 3 you need to keep the cluster healthy.
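
For a three-server cluster that would look something like this (assuming a Consul version that supports the -min-quorum flag):

# Never let Autopilot remove voters once only 3 servers remain,
# while still allowing cleanup of dead servers above that floor.
consul operator autopilot set-config -min-quorum=3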

I'm going to close this because I don't think there is a way Consul can behave differently, and the config mentioned above already lets you ensure it never removes more servers than you intend.

banks closed this as completed Apr 24, 2023