Joining a Consul server node to a 5 node cluster causes periodic loss of leader #11355
Labels
theme/internals
type/question
Overview of the Issue
Adding a Consul server node to a 5-node cluster (4 Consul servers, 1 Consul client) causes periodic loss of the leader. Sometimes restarting the new Consul agent resolves the issue; sometimes it does not, and we have to restart the Consul agents on all nodes.
Reproduction Steps
We have seen this issue from time to time in our production environment, but have not been able to reproduce it.
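For reference, a minimal sketch of the agent configuration we use when joining a new server. The hostnames, key placeholder, and exact values here are illustrative, not our actual config; note that the leader's log fragment further down shows HostF carrying expect:3 in its member tags even though the cluster already has five servers.

```json
{
  "server": true,
  "datacenter": "dcAA",
  "bootstrap_expect": 3,
  "retry_join": ["HostA.dnA.net", "HostC.dnA.net"],
  "encrypt": "<gossip encryption key>"
}
```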
Consul info for both Client and Server
Server info ('consul info' output from one of the server agents):
agent:
check_monitors = 16
check_ttls = 0
checks = 16
services = 16
build:
prerelease =
revision = a82e6a7
version = 1.5.2
consul:
acl = enabled
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = ipA:8300
server = true
raft:
applied_index = 8401
commit_index = 8401
fsm_pending = 0
last_contact = 47.060828ms
last_log_index = 8401
last_log_term = 2
last_snapshot_index = 0
last_snapshot_term = 0
latest_configuration = [{Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300} {Suffrage:Voter ID:uuidF Address:ipF:8300}]
latest_configuration_index = 83
num_peers = 4
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 2
runtime:
arch = amd64
cpu_count = 2
goroutines = 104
max_procs = 2
os = linux
version = 1.12
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 2
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 4
members = 6
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 8
members = 5
query_queue = 0
query_time = 1
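As a quick sanity check on the output above: num_peers counts the other voters, so it should equal the number of Voter entries in latest_configuration minus one. A small sketch that cross-checks the two fields from a saved copy of the output (here a trimmed sample; on a live node you would pipe 'consul info' instead):

```shell
# Save a trimmed sample of the 'consul info' raft section (anonymized as above).
cat > /tmp/consul_info.txt <<'EOF'
latest_configuration = [{Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300} {Suffrage:Voter ID:uuidF Address:ipF:8300}]
num_peers = 4
EOF

# Count Voter entries; each match is printed on its own line by grep -o.
voters=$(grep -o 'Suffrage:Voter' /tmp/consul_info.txt | wc -l | tr -d ' ')
# Pull the reported num_peers value.
peers=$(awk -F' = ' '/num_peers/ {print $2}' /tmp/consul_info.txt)

# num_peers should be voters - 1 (the node itself is excluded).
echo "voters=$voters num_peers=$peers"
```

In our case this checks out (5 voters, num_peers = 4), so the configuration itself looks consistent despite the elections.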
Operating system and Environment details
CentOS 7, amd64
Log Fragments
From HostF, the server added to the cluster:
2021/10/06 03:08:52 [INFO] agent: (LAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join LAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] agent: (WAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join WAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] consul: Existing Raft peers reported by HostD, disabling bootstrap mode
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostD (Addr: tcp/ipD:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostC (Addr: tcp/ipC:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostA (Addr: tcp/ipA:8300) (DC: dcAA)
...
2021/10/06 03:10:27 [WARN] raft: Election timeout reached, restarting election
2021/10/06 03:10:27 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 527
2021/10/06 03:10:33 [ERR] agent: Coordinate update error: No cluster leader
...
2021/10/06 03:11:01 [INFO] consul: New leader elected: HostA
...
2021/10/06 03:13:07 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 533
2021/10/06 03:13:14 [ERR] http: Request GET /v1/kv/cluster_public_addr, error: No cluster leader from=@
2021/10/06 03:13:15 [ERR] http: Request GET /v1/kv/cluster_health_data, error: No cluster leader from=@
From HostA, the cluster leader:
2021/10/06 03:08:52 [INFO] serf: EventMemberJoin: HostF.dnA.net ipF
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostF.dnA.net (Addr: tcp/ipF:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] raft: Updating configuration with AddNonvoter (uuidF, ipF:8300) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300} {Suffrage:Nonvoter ID:uuidF Address:ipF:8300}]
...
2021/10/06 03:08:55 [INFO] raft: Updating configuration with RemoveServer (uuidF, ) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300}]
2021/10/06 03:08:55 [INFO] raft: Removed peer uuidF, stopping replication after 105626800
...
2021/10/06 03:10:27 [WARN] raft: Rejecting vote request from ipF:8300 since we have a leader: ipA:8300
...
2021/10/06 03:10:55 [ERR] consul: failed to reconcile member: {HostF.dnA.net ipF 8301 map[acls:1 build:1.5.2:a82e6a7f dc:dcAA expect:3 id:uuidF port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log
2021/10/06 03:10:55 [INFO] raft: aborting pipeline replication to peer {Voter uuidC ipC:8300}
2021/10/06 03:10:55 [INFO] consul: removing server by ID: "uuidF"
...
2021/10/06 03:10:55 [INFO] consul: cluster leadership lost
2021/10/06 03:10:57 [WARN] raft: Rejecting vote request from ipF:8300 since our last index is greater (105627279, 105627019)
2021/10/06 03:11:01 [WARN] raft: Heartbeat timeout from "" reached, starting election
2021/10/06 03:11:01 [INFO] raft: Node at ipA:8300 [Candidate] entering Candidate state in term 532
2021/10/06 03:11:01 [INFO] raft: Election won. Tally: 3
...
vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log
2021/10/06 03:14:01 [INFO] raft: aborting pipeline replication to peer {Voter uuidE ipE:8300}
2021/10/06 03:14:01 [INFO] consul: cluster leadership lost
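The AddNonvoter at 03:08:52 followed three seconds later by RemoveServer suggests autopilot removed HostF before it could stabilize into a voter. If that is what is happening, the relevant knobs are in the agent's autopilot configuration; a sketch with the documented default values (illustrative, not a recommended change):

```json
{
  "autopilot": {
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "server_stabilization_time": "10s",
    "max_trailing_logs": 250
  }
}
```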