
rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection #15110

Open
OmegaRogue opened this issue Oct 22, 2022 · 4 comments


@OmegaRogue

Overview of the Issue

I am setting up a small Consul cluster on one Raspberry Pi 4 and two Pine64 RockPro64 boards, and I'm getting a lot of errors and warnings while trying to get the cluster running. The UI is also not accessible.

Reproduction Steps

The configs are

{
  "data_dir": "/var/consul",
  "server": true,
  "bootstrap_expect": 3,
  "disable_update_check": true,
  "disable_remote_exec": true,
  "enable_syslog": true
}

and

ui_config {
  enabled = true
}
retry_join = ["192.168.222.11","192.168.222.12","192.168.222.13"]
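
For reference, the two fragments above can be merged into a single HCL server config (a sketch combining only the settings shown; the file name is hypothetical and anything not listed falls back to Consul defaults):

```hcl
# /etc/consul/server.hcl (hypothetical path) — combines the JSON and HCL fragments above
data_dir             = "/var/consul"
server               = true
bootstrap_expect     = 3
disable_update_check = true
disable_remote_exec  = true
enable_syslog        = true

ui_config {
  enabled = true
}

retry_join = ["192.168.222.11", "192.168.222.12", "192.168.222.13"]
```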

The agents are started with:

doas -u consul consul agent -config-dir=/etc/consul -retry-join=192.168.222.11 -retry-join=192.168.222.12 -retry-join=192.168.222.13

Consul info

cloud1 info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = 
	version = 1.13.2
	version_metadata = 
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 192.168.222.11:8300
	server = true
raft:
	applied_index = 93
	commit_index = 93
	fsm_pending = 0
	last_contact = 0
	last_log_index = 93
	last_log_term = 81
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:8e7365e1-a17d-b941-5413-0708eee4146e Address:192.168.222.11:8300} {Suffrage:Voter ID:c7973f14-3389-ec13-65f9-1a1e6f981299 Address:192.168.222.13:8300} {Suffrage:Nonvoter ID:b614d0de-be83-0931-ba66-0eb10438350c Address:192.168.222.12:8300}]
	latest_configuration_index = 0
	num_peers = 1
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 81
runtime:
	arch = arm64
	cpu_count = 6
	goroutines = 116
	max_procs = 6
	os = linux
	version = go1.19.2
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 6
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 19
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 11
	members = 3
	query_queue = 0
	query_time = 1
cloud2 info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = 
	version = 1.13.2
	version_metadata = 
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 
	server = true
raft:
	applied_index = 0
	commit_index = 0
	fsm_pending = 0
	last_contact = never
	last_log_index = 1
	last_log_term = 1
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:b614d0de-be83-0931-ba66-0eb10438350c Address:192.168.222.12:8300} {Suffrage:Voter ID:8e7365e1-a17d-b941-5413-0708eee4146e Address:192.168.222.11:8300} {Suffrage:Voter ID:c7973f14-3389-ec13-65f9-1a1e6f981299 Address:192.168.222.13:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Candidate
	term = 213
runtime:
	arch = arm64
	cpu_count = 6
	goroutines = 101
	max_procs = 6
	os = linux
	version = go1.19.2
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 6
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 19
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 11
	members = 3
	query_queue = 0
	query_time = 1
cloud3 info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = 
	version = 1.13.2
	version_metadata = 
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 192.168.222.11:8300
	server = true
raft:
	applied_index = 116
	commit_index = 116
	fsm_pending = 0
	last_contact = 12.614866ms
	last_log_index = 116
	last_log_term = 81
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:8e7365e1-a17d-b941-5413-0708eee4146e Address:192.168.222.11:8300} {Suffrage:Voter ID:c7973f14-3389-ec13-65f9-1a1e6f981299 Address:192.168.222.13:8300} {Suffrage:Nonvoter ID:b614d0de-be83-0931-ba66-0eb10438350c Address:192.168.222.12:8300}]
	latest_configuration_index = 0
	num_peers = 1
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 81
runtime:
	arch = arm64
	cpu_count = 4
	goroutines = 115
	max_procs = 4
	os = linux
	version = go1.19.2
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 6
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 19
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 11
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

The servers are cloud1, cloud2, and cloud3 at 192.168.222.11, 192.168.222.12, and 192.168.222.13 respectively.
The two RockPro64 boards are running postmarketOS; the Raspberry Pi is running Alpine Linux edge with the same Consul package version.

uname -a
Linux cloud1 5.18.0 #5-postmarketos-rockchip SMP PREEMPT Thu Sep 22 00:44:45 UTC 202 aarch64 Linux
Linux cloud2 5.18.0 #5-postmarketos-rockchip SMP PREEMPT Thu Sep 22 00:44:45 UTC 202 aarch64 Linux
Linux cloud3 5.15.55-0-rpi4 #1-Alpine SMP PREEMPT Mon Jul 18 10:51:25 UTC 2022 aarch64 Linux
consul members
Node    ID                                    Address              State     Voter  RaftProtocol
cloud1  8e7365e1-a17d-b941-5413-0708eee4146e  192.168.222.11:8300  leader    true   3
cloud3  c7973f14-3389-ec13-65f9-1a1e6f981299  192.168.222.13:8300  follower  true   3
cloud2  b614d0de-be83-0931-ba66-0eb10438350c  192.168.222.12:8300  follower  false  3
consul operator raft list-peers
Node    ID                                    Address              State     Voter  RaftProtocol
cloud1  8e7365e1-a17d-b941-5413-0708eee4146e  192.168.222.11:8300  leader    true   3
cloud3  c7973f14-3389-ec13-65f9-1a1e6f981299  192.168.222.13:8300  follower  true   3
cloud2  b614d0de-be83-0931-ba66-0eb10438350c  192.168.222.12:8300  follower  false  3
consul version
Consul v1.13.2
Build Date 1970-01-01T00:00:01Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Log Fragments

2022-10-22T15:41:11.757+0200 [WARN]  agent: error getting server health from server: server=cloud2 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2022-10-22T15:41:11.759+0200 [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2022-10-22T15:41:12.044+0200 [ERROR] agent.server.memberlist.wan: memberlist: Push/Pull with cloud2.dc1 failed: dial tcp 192.168.222.12:8302: i/o timeout
2022-10-22T15:41:12.756+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:13.606+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:13.675+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:13.755+0200 [WARN]  agent: error getting server health from server: server=cloud1 error="context deadline exceeded"
2022-10-22T15:41:13.756+0200 [WARN]  agent: error getting server health from server: server=cloud2 error="context deadline exceeded"
2022-10-22T15:41:14.486+0200 [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.222.12:8300
2022-10-22T15:41:14.755+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:15.604+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:15.675+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:15.754+0200 [WARN]  agent: error getting server health from server: server=cloud1 error="context deadline exceeded"
2022-10-22T15:41:15.754+0200 [WARN]  agent: error getting server health from server: server=cloud2 error="context deadline exceeded"
2022-10-22T15:41:16.757+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:17.604+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:17.674+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:17.755+0200 [WARN]  agent: error getting server health from server: server=cloud2 error="context deadline exceeded"
2022-10-22T15:41:17.755+0200 [WARN]  agent: error getting server health from server: server=cloud1 error="context deadline exceeded"
2022-10-22T15:41:18.747+0200 [TRACE] agent.server.usage_metrics: Starting usage run
2022-10-22T15:41:18.755+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:19.604+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:19.674+0200 [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=false
2022-10-22T15:41:19.754+0200 [WARN]  agent: error getting server health from server: server=cloud1 error="context deadline exceeded"
2022-10-22T15:41:19.754+0200 [WARN]  agent: error getting server health from server: server=cloud2 error="context deadline exceeded"
2022-10-22T15:41:19.755+0200 [WARN]  agent: error getting server health from server: server=cloud1 error="rpc error getting client: failed to get conn: dial tcp <nil>->192.168.222.11:8300: i/o timeout"
2022-10-22T15:41:19.755+0200 [WARN]  agent: error getting server health from server: server=cloud1 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2022-10-22T15:41:19.756+0200 [WARN]  agent: error getting server health from server: server=cloud1 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2022-10-22T15:41:19.756+0200 [WARN]  agent: error getting server health from server: server=cloud1 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2022-10-22T15:41:19.756+0200 [WARN]  agent: error getting server health from server: server=cloud2 error="rpc error getting client: failed to get conn: dial tcp <nil>->192.168.222.12:8300: i/o timeout"
2022-10-22T15:41:19.756+0200 [WARN]  agent: error getting server health from server: server=cloud2 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2022-10-22T15:41:19.756+0200 [WARN]  agent: error getting server health from server: server=cloud2 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2022-10-22T15:41:19.756+0200 [WARN]  agent: error getting server health from server: server=cloud2 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
@henryxparker

I see a similar loop of those last four error lines with a completely different setup, which means there are multiple ways to get into this state, whatever it is.

@OmegaRogue
Author

Is there any news regarding this problem? I haven't had the time to try it again yet.

@ArtemViacheslavovich

agent: error getting server health from server: server=consul01.local error="rpc error getting client: failed to get conn: dial tcp ->172.24.1.255:8300: connect: network is unreachable"
I have the same error.

@lkysow
Member

lkysow commented Jun 15, 2023

Hi, "rejecting vote request since node is not a voter" is a known issue in Consul 1.13.2 that is fixed in 1.13.4 (this PR: #14897).
