
Joining a Consul server node to a 5 node cluster causes periodic loss of leader #11355

Open
alainnonga opened this issue Oct 20, 2021 · 1 comment
Labels
theme/internals (Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics), type/question (Not an "enhancement" or "bug". Please post on discuss.hashicorp)

Comments

@alainnonga


Overview of the Issue

Adding a Consul server node to a 5-node cluster (4 Consul servers, 1 Consul client) causes periodic loss of the leader. Sometimes restarting the Consul agent resolves the issue; sometimes it does not, and you have to restart the Consul agents on all nodes.
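
For reference, the restart workaround as commands (a minimal sketch; the `consul` systemd unit name and the host names are from our setup, not Consul defaults):

```sh
# Restart the agent on the affected node first
sudo systemctl restart consul

# If the cluster still has no leader, restart the agent on every node
for host in hostA hostC hostD hostE hostF; do
  ssh "$host" sudo systemctl restart consul
done
```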

Reproduction Steps

We have seen this issue from time to time in our production environment, but have not been able to reproduce it on demand.

  1. Create a cluster with 5 server nodes and 1 client node, with autopilot CleanupDeadServers set to false
  2. Remove 1 server node from the cluster with `consul leave` for maintenance
  3. About 12 hours later, add the node back to the cluster with `consul join` (a CLI sketch of these steps follows the list)
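
A rough sketch of these steps as commands (the host address ipA below is a placeholder, matching the redacted logs further down):

```sh
# 1. With the cluster healthy, disable autopilot dead-server cleanup
consul operator autopilot set-config -cleanup-dead-servers=false
consul operator autopilot get-config   # confirm the setting took effect

# 2. Gracefully remove one server for maintenance (run on that node)
consul leave

# 3. About 12 hours later, rejoin the node via any live member
consul join ipA
```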

Consul info for both Client and Server

<details>
  <summary>Client info</summary>

agent:
	check_monitors = 16
	check_ttls = 0
	checks = 16
	services = 16
build:
	prerelease = 
	revision = a82e6a7f
	version = 1.5.2
consul:
	acl = enabled
	known_servers = 5
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 63
	max_procs = 2
	os = linux
	version = 1.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 6
	query_queue = 0
	query_time = 1


</details>

<details>
  <summary>Server info</summary>

agent:
	check_monitors = 16
	check_ttls = 0
	checks = 16
	services = 16
build:
	prerelease =
	revision = a82e6a7
	version = 1.5.2
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = ipA:8300
	server = true
raft:
	applied_index = 8401
	commit_index = 8401
	fsm_pending = 0
	last_contact = 47.060828ms
	last_log_index = 8401
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidF Address:ipF:8300}]
	latest_configuration_index = 83
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 2
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 104
	max_procs = 2
	os = linux
	version = 1.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 8
	members = 5
	query_queue = 0
	query_time = 1

</details>

Operating system and Environment details

CentOS 7, amd64

Log Fragments

From HostF, the node added back to the cluster:
2021/10/06 03:08:52 [INFO] agent: (LAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join LAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] agent: (WAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join WAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] consul: Existing Raft peers reported by HostD, disabling bootstrap mode
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostD (Addr: tcp/ipD:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostC (Addr: tcp/ipC:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostA (Addr: tcp/ipA:8300) (DC: dcAA)
...
2021/10/06 03:10:27 [WARN] raft: Election timeout reached, restarting election
2021/10/06 03:10:27 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 527
2021/10/06 03:10:33 [ERR] agent: Coordinate update error: No cluster leader
...
2021/10/06 03:11:01 [INFO] consul: New leader elected: HostA
...
2021/10/06 03:13:07 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 533
2021/10/06 03:13:14 [ERR] http: Request GET /v1/kv/cluster_public_addr, error: No cluster leader from=@
2021/10/06 03:13:15 [ERR] http: Request GET /v1/kv/cluster_health_data, error: No cluster leader from=@

From HostA, the cluster leader:
2021/10/06 03:08:52 [INFO] serf: EventMemberJoin: HostF.dnA.net ipF
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostF.dnA.net (Addr: tcp/ipF:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] raft: Updating configuration with AddNonvoter (uuidF, ipF:8300) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300} {Suffrage:Nonvoter ID:uuidF Address:ipF:8300}]
...
2021/10/06 03:08:55 [INFO] raft: Updating configuration with RemoveServer (uuidF, ) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300}]
2021/10/06 03:08:55 [INFO] raft: Removed peer uuidF, stopping replication after 105626800
...
2021/10/06 03:10:27 [WARN] raft: Rejecting vote request from ipF:8300 since we have a leader: ipA:8300
...
2021/10/06 03:10:55 [ERR] consul: failed to reconcile member: {HostF.dnA.net ipF 8301 map[acls:1 build:1.5.2:a82e6a7f dc:dcAA expect:3 id:uuidF port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log

2021/10/06 03:10:55 [INFO] raft: aborting pipeline replication to peer {Voter uuidC ipC:8300}
2021/10/06 03:10:55 [INFO] consul: removing server by ID: "uuidF"
...
2021/10/06 03:10:55 [INFO] consul: cluster leadership lost
2021/10/06 03:10:57 [WARN] raft: Rejecting vote request from ipF:8300 since our last index is greater (105627279, 105627019)

2021/10/06 03:11:01 [WARN] raft: Heartbeat timeout from "" reached, starting election
2021/10/06 03:11:01 [INFO] raft: Node at ipA:8300 [Candidate] entering Candidate state in term 532
2021/10/06 03:11:01 [INFO] raft: Election won. Tally: 3
...
vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log
2021/10/06 03:14:01 [INFO] raft: aborting pipeline replication to peer {Voter uuidE ipE:8300}
2021/10/06 03:14:01 [INFO] consul: cluster leadership lost
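
For completeness, a minimal sketch of the commands used to inspect raft state while the leader flaps (uuidF mirrors the redacted ID in the logs above):

```sh
# Show the raft peer set and each server's Voter/Nonvoter status
consul operator raft list-peers

# The local agent's view of raft: term, indexes, leader address
consul info | grep -E 'leader|last_log_index|state|term'

# If a stale peer lingers in the raft configuration, remove it by ID
consul operator raft remove-peer -id=uuidF
```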

@blake (Member) commented Oct 23, 2021

Hi @alainnonga, thanks for the detailed info. This issue sounds similar to the bug initially reported in #9755 (further detail in #10970).

Assuming it's the same issue, several fixes have already been merged and will be available in the next patch releases for the currently supported versions of Consul (1.8.x - 1.10.x).
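
Once a patched release ships, a quick way to confirm which build each node is running (sketch only):

```sh
consul version              # version of the local agent binary
consul members -detailed    # per-node build tags across the cluster
```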

@jkirschner-hashicorp added the theme/internals and type/question labels Oct 25, 2021