
Candidate server failed to vote for itself (Can a server be a non-voter or is this a bug?) #13889

Open
benbuzbee opened this issue Jul 21, 2022 · 7 comments


@benbuzbee
Contributor

Looking for some guidance at least; I have logs from a server that indicate it never voted for itself. My question is: is this possible, or is it a bug?

Details:
We restarted a bunch of servers in a datacenter and lost quorum. When we brought the servers back, Nomad failed to elect a leader. I am trying to understand why that happened.

While digging I found this strange behavior and need advice. Take term=252218 as an example:

2022-07-14T03:54:59.965Z	entering candidate state
2022-07-14T03:54:59.965Z	votes needed=3
2022-07-14T03:54:59.966Z	failed to make requestVote RPC 	dial tcp 10.131.8.33:4647: connect: connection refused
2022-07-14T03:55:01.489Z	Election timeout reached, restarting election

This server entered the candidate state but never voted for itself; it never once printed this line:
r.logger.Debug("vote granted", "from", vote.voterID, "term", vote.Term, "tally", grantedVotes)
How is this possible? The only thing that makes sense to me is that perhaps this loop never reaches the server itself, which means either 1. there's a bug where the server doesn't see itself in the server list, or 2. the server isn't a voter:

for _, server := range r.configurations.latest.Servers {
		if server.Suffrage == Voter {
			if server.ID == r.localID {

For the avoidance of doubt, this setting is not set and we do not have Nomad Enterprise:
https://www.nomadproject.io/docs/configuration/server#non_voting_server

I'd really appreciate guidance on how this is supposed to work, to help me investigate. Digging through raft to find out where suffrage can change is a nightmare.

Nomad version

Nomad v1.2.6

@benbuzbee
Contributor Author

Related: hashicorp/raft#516

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 23, 2022
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jul 25, 2022
@tgross tgross self-assigned this Jul 25, 2022
@tgross
Member

tgross commented Jul 25, 2022

The "failed to make requestVote RPC" log entry suggests that this node did try to self-elect, as that's from raft.go#L1723 in the electSelf() function. We know it attempted to send at least one vote request to a peer. I would also expect to see it persist a vote for itself (raft.go#L1736-L1741), but the server slice is iterated in order. One plausible sequence of events I could see for the logs you've provided here is (a toy sketch of this follows the list):

  • List of servers is [peer1, peer2, self]
  • It sends a request to peer1, and gets a connection refused
  • It sends a request to peer2, and... (hangs without a response)
  • The election times out. The nomad.raft.candidate.electSelf metric might give some clues as to the timing of all the steps here.
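To make that hypothesis concrete, here's a toy, self-contained Go sketch (emphatically not the real raft code; the peer names, address, and timings are invented). It assumes, for the sake of argument, that the candidate asks servers one at a time in slice order and that each request blocks until it finishes, which is the only way I can see the self-vote getting starved:

// Toy illustration only: NOT the hashicorp/raft code. Peer names, the target
// address, and all timings are made up for the sake of the example.
package main

import (
	"errors"
	"fmt"
	"time"
)

func main() {
	servers := []string{"peer1", "peer2", "self"}
	votes := make(chan string, len(servers)) // one slot per server

	requestVote := func(peer string) (bool, error) {
		switch peer {
		case "peer1":
			return false, errors.New("dial tcp 10.131.8.33:4647: connect: connection refused")
		case "peer2":
			time.Sleep(10 * time.Second) // this peer never answers in time
			return false, errors.New("no response")
		default: // "self": persist and grant our own vote
			return true, nil
		}
	}

	go func() {
		for _, s := range servers {
			granted, err := requestVote(s) // blocking, per the assumption above
			if err != nil {
				fmt.Println("failed to make requestVote RPC:", err)
				continue
			}
			if granted {
				votes <- s
			}
		}
	}()

	select {
	case v := <-votes:
		fmt.Println("vote granted, from:", v) // never reached in this toy run
	case <-time.After(1500 * time.Millisecond):
		fmt.Println("Election timeout reached, restarting election")
	}
}

Run as written, this prints the connection-refused error and then the election-timeout line without ever recording a vote for self, which is the shape of the logs above.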

In any case, that's just one term and raft isn't guaranteed to make forward progress in the face of connectivity issues. But you said that the cluster never elected a leader. Were all the server nodes able to reach each other at some point? Usually when we see a cluster completely stuck on leader election it's because of either that or disk IO problems causing log applies to be too slow.

The Agent Members API gives a per-agent view of how many raft members are in play, but won't really help for post-mortem debugging here. The serf peer lists should also be in the logs though, and that's what feeds raft's configuration.
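For reference, that members view is just an HTTP GET against the agent; a minimal sketch, assuming the default agent address http://127.0.0.1:4646 and no ACL token (both are assumptions about your environment):

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Ask the local agent which serf members it currently sees.
	resp, err := http.Get("http://127.0.0.1:4646/v1/agent/members")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // JSON description of the members this agent knows about
}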

@benbuzbee
Contributor Author

benbuzbee commented Jul 25, 2022

One plausible sequence of events I could see for the logs you've provided here is:

  • List of servers is [peer1, peer2, self]
  • It sends a request to peer1, and gets a connection refused
  • It sends a request to peer2, and... (hangs without a response)
  • The election times out. The nomad.raft.candidate.electSelf metric might give some clues as to the timing of all the steps here.

I considered this, except

	askPeer := func(peer Server) {
		r.goFunc(func() {

r.goFunc means it runs asynchronously and should not block attempts to ask other peers. So I am still not entirely convinced :)

In any case, that's just one term and raft isn't guaranteed to make forward progress in the face of connectivity issues. But you said that the cluster never elected a leader. Were all the server nodes able to reach each other at some point? Usually when we see a cluster completely stuck on leader election it's because of either that or disk IO problems causing log applies to be too slow.

The "usual suspects" list is interesting, thank you for sharing. I have limited information (just what logs and long-term metric storage can tell us) since our ops team remediated the issue by restarting everything; I was not able to debug it live. But we had 5 servers online that should have been able to reach each other, and they never established a leader within the 15 minutes or so it took our ops team to restart everything.

Of interest, the servers listed as the targets of the "failed to make requestVote RPC" logs were not expected to be live. We had 5 servers live, but perhaps more than 5 servers were recognized as potential peers. So I am pretty sure this is a complication in how we dynamically change which machines in our datacenter act as servers (we use Consul to discover peers).

Anyway, I realize your eyes are probably glazing over with this description 😁 I think this is likely a problem in how we pick/move servers, but to understand how to do it better I feel like I need to be able to look at the logs and understand what happened, but I cannot with the current logs. If I could get hashicorp/raft#516 maybe I would be able to make sense of it.

I have been able to determine from the logs that for term 252218:

  • 3 of the five were candidates (printed "entering candidate state")
  • only two servers ever got a vote, i.e. ever printed "vote granted" (one got 1 vote, another got 2)
    ---> This struck me as weird and is why I opened this bug; I cannot explain why all 3 candidates wouldn't have at least printed that log line for themselves. It sounds like you agree this is strange?
  • some of the five servers printed the RPC failure, targeting 10.131.8.33 or 10.131.8.42, but those targets were not among the five online servers, so I don't immediately see this as a problem

The serf peer lists should also be in the logs though, and that's what feeds raft's configuration.

Is it this log? found expected number of peers, attempting to bootstrap cluster...?

Kibana is no longer loading my logs from the time of this incident so I will have to check next time :(

Hey, as a side note, let me know if I am abusing GitHub issues with questions like this and whether there is a better place. I would love to be able to debug these more intelligently so I can contribute better, but I felt that here either 1. I found a bug or 2. I didn't understand something, and I didn't want to assume #2. I still feel I don't understand how it could ever not print "vote granted" for itself, since it requests votes from peers asynchronously.

@tgross
Member

tgross commented Jul 26, 2022

r.goFunc means it runs asynchronously and should not block attempts to ask other peers. So I am still not entirely convinced :)

Oh you're right, I missed that.

I need to be able to look at the logs and understand what happened, but I cannot with the current logs. If I could get hashicorp/raft#516 maybe I would be able to make sense of it.

Aye, I'll try to nudge that along. There's probably a few other open issues we could work on to make raft more observable. Of course we'd then have to land that version of raft in Nomad.

only two servers ever got a vote, i.e. ever printed "vote granted" (one got 1 vote, another got 2)
---> This struck me as weird and is why I opened this bug; I cannot explain why all 3 candidates wouldn't have at least printed that log line for themselves. It sounds like you agree this is strange?

Yeah totally agreed. I'm noticing the channel for votes is unbuffered, so the log for "vote granted" would only come after we've polled on that channel at raft.go#L280. That suggests that we may be blocking somewhere else in that runCandidate loop. That would have to happen on multiple servers simultaneously though.

Is it this log? found expected number of peers, attempting to bootstrap cluster...?

That one is pretty good but only happens once I think? The messages like nomad: serf: messageJoinType: nomad-server0.global would give you an idea of servers coming in to the cluster after the initial bootstrap.

If you did see this sort of thing again, you could take a goroutine dump and that would let us know exactly where everything is getting stuck.
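A sketch of one way to grab that dump, assuming enable_debug = true (or a management ACL token) on the agent and the default address; the exact pprof path and debug parameter here are from memory, so double-check the agent pprof docs before relying on them:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Fetch full goroutine stacks from the agent's pprof endpoint (assumed path and params).
	resp, err := http.Get("http://127.0.0.1:4646/v1/agent/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	dump, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(dump)) // stack traces for every goroutine in the agent
}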

Hey, as a side note, let me know if I am abusing GitHub issues with questions like this and whether there is a better place. I would love to be able to debug these more intelligently so I can contribute better, but I felt that here either 1. I found a bug or 2. I didn't understand something, and I didn't want to assume #2. I still feel I don't understand how it could ever not print "vote granted" for itself, since it requests votes from peers asynchronously.

We encourage folks to use Discuss for "question-y" things, but for a good in-depth discussion where you need folks from the Nomad engineering team, GitHub issues is totally fine and probably more likely to get good results.

@benbuzbee
Contributor Author

Aye, I'll try to nudge that along. There's probably a few other open issues we could work on to make raft more observable. Of course we'd then have to land that version of raft in Nomad.

If any come to mind let me know; I'm happy to spend a day adding some visibility to raft, and I can take a swing at updating Nomad's dependency.

Yeah totally agreed. I'm noticing the channel for votes is unbuffered, so the log for "vote granted" would only come after we've polled on that channel at raft.go#L280. That suggests that we may be blocking somewhere else in that runCandidate loop. That would have to happen on multiple servers simultaneously though.

It is buffered https://github.com/hashicorp/raft/blob/v1.3.5/raft.go#L1700

I considered that the buffer size might be wrong somehow, but I think it looks right, since you only write to the chan from a range loop over r.configurations.latest.Servers.
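To illustrate the property I'm leaning on, here's a standalone toy (not the raft code): with the channel buffered to the number of voters and at most one send per voter from the loop, no send can block, even before anything reads from it:

// Toy illustration, not raft code: a channel buffered to the voter count
// absorbs one send per voter without blocking, so the self-vote is always
// queued and can be logged once the reader gets around to draining it.
package main

import "fmt"

func main() {
	voters := []string{"self", "peer1", "peer2"}
	respCh := make(chan string, len(voters)) // one slot per voter

	for _, v := range voters {
		respCh <- v // cannot block: at most len(voters) sends into len(voters) slots
		fmt.Println("queued vote response from", v)
	}

	// The reader (runCandidate in the real code) drains these later and
	// logs "vote granted" for each one, including the self-vote.
	close(respCh)
	for v := range respCh {
		fmt.Println("vote granted, from:", v)
	}
}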

The only explanation I can come up with is that the server is a candidate but not a voter, and I don't know how that could happen.

That one is pretty good but only happens once I think? The messages like nomad: serf: messageJoinType: nomad-server0.global would give you an idea of servers coming in to the cluster after the initial bootstrap.

Thanks!

@tgross
Member

tgross commented Jul 27, 2022

It is buffered https://github.com/hashicorp/raft/blob/v1.3.5/raft.go#L1700

Augh, you're right. Just FYI, our team is at a team conference this week and so I've been kind of dipping in and out, which isn't fair for giving this the right attention. So my apologies for getting a little sloppy here. I'll commit to circling back on this early next week with some renewed focus.

@benbuzbee
Contributor Author

No problem, Tim, you've been helpful. I think all I'd ask for here is confirmation that "non-voter candidate servers" should NOT be possible if that setting is false, plus help with the raft logging. Appreciate the help, and enjoy the conference!
