
Joining a Consul server node to a 5 node cluster causes periodic loss of leader #11355

Open
alainnonga opened this issue Oct 20, 2021 · 1 comment
Labels
theme/internals (Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics), type/question (Not an "enhancement" or "bug". Please post on discuss.hashicorp)

Comments

@alainnonga


Overview of the Issue

Adding a Consul server node to a 5-node cluster (4 Consul servers, 1 Consul client) causes periodic loss of the leader. Sometimes restarting the Consul agent resolves the issue; sometimes it does not, and you have to restart the Consul agents on all nodes.
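
For reference, the restart workaround as commands (a minimal sketch; the `consul` systemd unit name and the host names are from our setup, not Consul defaults):

```sh
# Restart the agent on the affected node first
sudo systemctl restart consul

# If the cluster still has no leader, restart the agent on every node
for host in hostA hostC hostD hostE hostF; do
  ssh "$host" sudo systemctl restart consul
done
```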

Reproduction Steps

We have seen this issue from time to time in our production environment, but have not been able to reproduce it on demand.

  1. Create a cluster with 5 server nodes and 1 client node, with autopilot CleanupDeadServers set to false
  2. Remove 1 server node from the cluster with `consul leave` for maintenance
  3. About 12 hours later, add the node back to the cluster with `consul join` (a CLI sketch of these steps follows the list)
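
A rough sketch of these steps as commands (the host address ipA below is a placeholder, matching the redacted logs further down):

```sh
# 1. With the cluster healthy, disable autopilot dead-server cleanup
consul operator autopilot set-config -cleanup-dead-servers=false
consul operator autopilot get-config   # confirm the setting took effect

# 2. Gracefully remove one server for maintenance (run on that node)
consul leave

# 3. About 12 hours later, rejoin the node via any live member
consul join ipA
```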

Consul info for both Client and Server

<details>
  <summary>Client info</summary>

agent:
	check_monitors = 16
	check_ttls = 0
	checks = 16
	services = 16
build:
	prerelease = 
	revision = a82e6a7f
	version = 1.5.2
consul:
	acl = enabled
	known_servers = 5
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 63
	max_procs = 2
	os = linux
	version = 1.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 6
	query_queue = 0
	query_time = 1


</details>

<details>
  <summary>Server info</summary>

agent:
	check_monitors = 16
	check_ttls = 0
	checks = 16
	services = 16
build:
	prerelease =
	revision = a82e6a7
	version = 1.5.2
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = ipA:8300
	server = true
raft:
	applied_index = 8401
	commit_index = 8401
	fsm_pending = 0
	last_contact = 47.060828ms
	last_log_index = 8401
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidF Address:ipF:8300}]
	latest_configuration_index = 83
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 2
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 104
	max_procs = 2
	os = linux
	version = 1.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 8
	members = 5
	query_queue = 0
	query_time = 1

</details>

Operating system and Environment details

CentOS 7, amd64

Log Fragments

From HostF, the node added back to the cluster:
2021/10/06 03:08:52 [INFO] agent: (LAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join LAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] agent: (WAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join WAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] consul: Existing Raft peers reported by HostD, disabling bootstrap mode
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostD (Addr: tcp/ipD:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostC (Addr: tcp/ipC:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostA (Addr: tcp/ipA:8300) (DC: dcAA)
...
2021/10/06 03:10:27 [WARN] raft: Election timeout reached, restarting election
2021/10/06 03:10:27 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 527
2021/10/06 03:10:33 [ERR] agent: Coordinate update error: No cluster leader
...
2021/10/06 03:11:01 [INFO] consul: New leader elected: HostA
...
2021/10/06 03:13:07 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 533
2021/10/06 03:13:14 [ERR] http: Request GET /v1/kv/cluster_public_addr, error: No cluster leader from=@
2021/10/06 03:13:15 [ERR] http: Request GET /v1/kv/cluster_health_data, error: No cluster leader from=@

From HostA, the cluster leader:
2021/10/06 03:08:52 [INFO] serf: EventMemberJoin: HostF.dnA.net ipF
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostF.dnA.net (Addr: tcp/ipF:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] raft: Updating configuration with AddNonvoter (uuidF, ipF:8300) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300} {Suffrage:Nonvoter ID:uuidF Address:ipF:8300}]
...
2021/10/06 03:08:55 [INFO] raft: Updating configuration with RemoveServer (uuidF, ) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300}]
2021/10/06 03:08:55 [INFO] raft: Removed peer uuidF, stopping replication after 105626800
...
2021/10/06 03:10:27 [WARN] raft: Rejecting vote request from ipF:8300 since we have a leader: ipA:8300
...
2021/10/06 03:10:55 [ERR] consul: failed to reconcile member: {HostF.dnA.net ipF 8301 map[acls:1 build:1.5.2:a82e6a7f dc:dcAA expect:3 id:uuidF port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log

2021/10/06 03:10:55 [INFO] raft: aborting pipeline replication to peer {Voter uuidC ipC:8300}
2021/10/06 03:10:55 [INFO] consul: removing server by ID: "uuidF"
...
2021/10/06 03:10:55 [INFO] consul: cluster leadership lost
2021/10/06 03:10:57 [WARN] raft: Rejecting vote request from ipF:8300 since our last index is greater (105627279, 105627019)

2021/10/06 03:11:01 [WARN] raft: Heartbeat timeout from "" reached, starting election
2021/10/06 03:11:01 [INFO] raft: Node at ipA:8300 [Candidate] entering Candidate state in term 532
2021/10/06 03:11:01 [INFO] raft: Election won. Tally: 3
...
vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log
2021/10/06 03:14:01 [INFO] raft: aborting pipeline replication to peer {Voter uuidE ipE:8300}
2021/10/06 03:14:01 [INFO] consul: cluster leadership lost
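
For completeness, a minimal sketch of the commands used to inspect raft state while the leader flaps (uuidF mirrors the redacted ID in the logs above):

```sh
# Show the raft peer set and each server's Voter/Nonvoter status
consul operator raft list-peers

# The local agent's view of raft: term, indexes, leader address
consul info | grep -E 'leader|last_log_index|state|term'

# If a stale peer lingers in the raft configuration, remove it by ID
consul operator raft remove-peer -id=uuidF
```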

@blake (Member) commented Oct 23, 2021

Hi @alainnonga, thanks for the detailed info. This issue sounds similar to the bug initially reported in #9755 (further detail in #10970).

Assuming it's the same issue, several fixes have already been merged and will be available in the next patch releases for the currently supported versions of Consul (1.8.x - 1.10.x).
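
Once a patched release ships, a quick way to confirm which build each node is running (sketch only):

```sh
consul version              # version of the local agent binary
consul members -detailed    # per-node build tags across the cluster
```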

@jkirschner-hashicorp added the theme/internals and type/question labels Oct 25, 2021