Non voter with higher term should not cause cluster instability #525

Closed
wants to merge 3 commits

Conversation

dhiaayachi
Contributor

As described in #524, when a non-voter with a higher term is added to the cluster, log replication to that node will fail because it has a higher term, which ultimately causes the leader to step down and start an election.

This should not happen: the non-voter cannot win the election, yet the cluster expects the node with the higher term to win it, which puts the cluster into an endless election loop.

This PR introduces a check so that when a non-voter receives an append entries request with a lower term, it accepts it and overwrites its own term with that lower term. That keeps the cluster stable and also breaks the loop, as the node's term is reset back to the cluster's term.
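To make the intent concrete, here is a minimal, self-contained sketch of the decision being added (illustrative names only; the actual change to raft.go's appendEntries handler is shown in the review thread below):

package main

import "fmt"

// Sketch only: models the term-handling decision a non-voter makes when it
// receives an appendEntries request carrying a lower term than its own.
// Names (leaderTerm, localTerm, isNonVoter) are illustrative, not the
// library's identifiers.
func shouldAdoptLeaderTerm(leaderTerm, localTerm uint64, isNonVoter bool) bool {
	// A voter must reject a lower term (standard Raft); a non-voter with a
	// stale, inflated term instead adopts the leader's term so replication
	// can continue and the leader is not pushed into a needless election.
	return leaderTerm < localTerm && isNonVoter
}

func main() {
	fmt.Println(shouldAdoptLeaderTerm(3, 7, true))  // true: non-voter resets itself to term 3
	fmt.Println(shouldAdoptLeaderTerm(3, 7, false)) // false: a voter rejects the request as usual
}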

raft.go Outdated
@@ -1410,7 +1422,7 @@ func (r *Raft) appendEntries(rpc RPC, a *AppendEntriesRequest) {
"previous-index", a.PrevLogEntry,
"last-index", lastIdx,
"error", err)
-	resp.NoRetryBackoff = true
+	resp.NoRetryBackoff = hasVote
Contributor Author

Not sure about this change. I observed in the test that a newer term (unseen by the node) is sent. This change ensures that we retry with an older log until we get a log that the node is able to apply.

Member

Hmm, yeah, changing the "catchup" behaviour for a non-voter doesn't seem like the right thing to do here at first glance.

@@ -233,7 +233,7 @@ START:
appendStats(string(peer.ID), start, float32(len(req.Entries)))

// Check for a newer term, stop running
-	if resp.Term > req.Term {
+	if resp.Term > req.Term && s.peer.Suffrage != Nonvoter {
Member

This seems like the real fix to prevent disruption to the leader, and I think it is safe and correct.

Contributor Author

Yes, actually that was the first change I made, but on its own it won't help the non-voter effectively join the cluster, as it will continue rejecting the replication forever.

Contributor

Why this instead of Suffrage == Voter?

raft.go Outdated
Comment on lines 1378 to 1388
// This case can happen when a node that used to be a voter is added back to the cluster as a non-voter:
// its term could be higher than the cluster's, but because it's a non-voter that term needs to be discarded
// in favor of the cluster's current term to keep the cluster stable, as an election must not be started
// by a node which doesn't have voting rights.
currentTerm := r.getCurrentTerm()
hasVote := hasVote(r.getLatestConfiguration(), r.localID)
if a.Term < currentTerm && !hasVote {
	r.setState(Follower)
	r.setCurrentTerm(a.Term)
	resp.Term = a.Term
}
Member

I'm having a hard time reasoning about the correctness of this change. It's a fundamental deviation from the published/model-checked/peer-reviewed Raft algorithm, which makes it significantly risky IMO.

I'm going to dream up an example where this gets subtle. It might not turn out to be incorrect as we think it through, but maybe it helps us reason about correctness here.

Imagine servers A, B, C, D where A is leader in term 1, A-C form a quorum of voters, D is a non-voter all in steady state.

Now imagine that A is partitioned from the others. For a while it doesn't know this.

B and C elect B the new leader in term 2, and D starts following B.

But now (still in the window before A notices its leader lease timeout) A reconnects and attempts to send append entries to the peers. B and C will correctly ignore it and tell it to step down, but before it gets their response, D will silently accept it and revert to term 1 (in which A was leader), even though it is actually in follower state following B in term 2.

Is this state valid?

Well, D may now try to append an entry from A in term 1 that isn't ever actually going to commit, because A won't get a quorum for it.

What happens next depends on the state in D's log and exactly what was inflight from A when it was partitioned.

Let's imagine just one (I think) possible state: A had everything up to index 99 committed before the partition and was just starting to replicate index 100 when it lost connection to all other servers, so it is now re-trying to append index 100 in term 1.

Meanwhile B has won the election and has already replicated index 100 to D in term 2.

Since D's last log will be index 100 but the append will have PrevLogEntry == 99, we hit the else case on line 1418 below.

D will now look up its own log entry 99 (which was the correct one back from term 1) and set prevLogTerm to 1 based on that.

So it will skip right over the term check on line 1431 - effectively D is treating the old leader as the new leader - and on line 1459 below it will find the terms don't match, so it will truncate the good entries from term 2 out of its log and replace them with the ones from the old leader in term 1.

Let's say A managed to replicate a few more logs to it in term 1 before noticing that it needs to stand down or getting a response from a voter telling it the term is stale.

But while that's happening, B is also sending appendEntries to D in term 2; each time it does, D updates its term, truncates all the "bad" stuff from term 1 out, and requests a replay from the fork point from B. Meanwhile A might still try to append more entries, and each time it does, D will truncate the good stuff out again, and so on.

I think this scenario works out OK in the end because A will eventually step down and stop battling the new leader to append, and when it does D's log should be reset.

If D were a voter, so that the fact it logged something was taken into account when working out commitment, this kind of flapping would break correctness in general, because an old leader could cause deletion of "committed" logs by the newer leader. Since it's not a voter it might just about be OK: the voters' logs will remain consistent, and this non-voter will eventually end up consistent (certainly by the time it has learned about what is committed and applied it to its FSM), so even stale reads shouldn't "see" any bad data from an old leader, since it won't be committed.

I don't think there is a scenario where D can end up with a log that's inconsistent with the quorum by the time the entries are committed, but I don't think we should trust my thinking here - that's why Raft and other algorithms are rigorously defined and model checked etc., so straying from that seems risky to me.
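For readers following along, a minimal self-contained sketch of the consistency check and conflict truncation that the scenario above walks through (heavily simplified; this is not the library's code, and the names are illustrative):

package main

import "fmt"

// Sketch only: terms[i] holds the term of the log entry at index i+1.
// appendFromLeader models the previous-entry check and the conflict
// truncation discussed above (the "line 1418/1431/1459" logic, simplified).
func appendFromLeader(terms []uint64, prevIndex, prevTerm uint64, entryTerms []uint64) ([]uint64, bool) {
	// Consistency check: our entry at prevIndex must exist with prevTerm.
	if prevIndex > 0 {
		if prevIndex > uint64(len(terms)) || terms[prevIndex-1] != prevTerm {
			return terms, false // reject; the leader retries with older entries
		}
	}
	// Conflict check: a term mismatch at an existing index truncates the rest
	// of the log before the leader's entries are appended.
	for i, t := range entryTerms {
		idx := prevIndex + uint64(i) + 1
		if idx <= uint64(len(terms)) && terms[idx-1] != t {
			terms = terms[:idx-1]
		}
		if idx > uint64(len(terms)) {
			terms = append(terms, t)
		}
	}
	return terms, true
}

func main() {
	// D's log: indexes 1..99 from term 1, index 100 already replicated by the
	// new leader B in term 2.
	d := make([]uint64, 100)
	for i := range d {
		d[i] = 1
	}
	d[99] = 2
	// Stale leader A (term 1) appends index 100 with PrevLogEntry=99, PrevLogTerm=1:
	// the check passes, so D truncates its good term-2 entry and takes A's instead.
	d, ok := appendFromLeader(d, 99, 1, []uint64{1})
	fmt.Println(ok, d[99]) // true 1
}

This is exactly the flapping described above: each competing leader's append passes the check and truncates the other's suffix until the stale leader finally steps down.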

Conclusion

I didn't prove this is "wrong" in the sense of user-visible consistency here, but I'm still not sure if it's a good move to alter this behaviour in this way.

I think the real fix for overall cluster stability is the other one below where we just don't let the leader stand down if the node with a higher term is a non-voter. That seems safer to me.

I guess without this though that change alone leaves the non-voter in a strange state?

I guess my question would be whether there is a more correct solution. For example I think the PreVote optimization described in Diego's full Raft thesis would also solve this problem (as well as when the same thing happens for voters) and would be totally correct in all cases?

If not, how likely is this to happen? Could we just crash the non-voter and have an operator delete the non-voters raft state and start again?

I'm trying to think of other options but it's really hard to reason about - you can't just reset the non-voter state. Even though it's a non-voter now, it might have voted in a term that was counted towards a quorum, so just resetting its term from its last successfully replicated log on restart is quite likely not correct either in some subtle edge case.

I think you could avoid the old-leader horribleness above correctly in other ways too. For example, if the non-voter had a record of the last term it saw a stable leader in (which is subtly different from its Term state, since that is incremented when it attempted to become a leader as a voter even though no leader existed at that term), then it could continue to fence off appendEntries from leaders older than the current stable leader and avoid the issue above, while still allowing a reset from a "spurious" term that it tried an election in before becoming a non-voter?

What do you think @dhiaayachi ? I'd love to figure out something less risky than deviating from the reviewed raft protocol here if possible.

Member

@banks banks Sep 9, 2022

if the non-voter had a record of the last term it saw a stable leader in

🤔 Actually that might be the same thing as the term of its last persisted log, so we do have that... but I'm not sure how we use it correctly. Above I was thinking we could reset the non-voter's current Term to that on startup to avoid it being disruptive, but I think that's dangerous as noted above.

Perhaps we could use this value in the new check here though, and only downgrade the term if we are a non-voter AND our currentTerm is higher than the last log term? (Or if the new lower term is exactly equal to our last log term - not sure if that's the same.)

Either way, we should think about what we need to do to convince ourselves any solution we come up with is actually rigorously correct!

Member

For the record by PreVote I was misremembering but referring to Section 4.2.3 Disruptive Servers in https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf.

Member

And I think it's not going to help either, because in this case the non-voter isn't disrupting by trying to get votes with a higher term (it's a non-voter after all) but rather by returning that higher term during replication.

Member

@banks banks Sep 9, 2022

So I think the problem can only occur if:

  1. The non-voter starts up with a currentTerm persisted that is higher than the actual stable leader term.
    • I assume this can only happen when a voter is restarted as a non-voter and happens to attempt an election just before it restarts, causing it to increment its term. It has to persist the new term to its state file, but not actually deliver the RPCs before shutting down (since delivering them would cause an election and a move to that term or higher in the rest of the cluster anyway).
  2. When it starts up, its currentTerm is stableLeaderTerm + 1 (assuming no leader change since it shut down, which seems to be required for this bug to manifest), but its last appended log term will be at most stableLeaderTerm (it could be lower in theory).
  3. When the stable leader attempts an append entries at stableLeaderTerm, the node rejects it, sending its higher term in response.
  4. (Before this patch) the leader steps down and a new election happens needlessly.

Does that match your understanding, Dhia? Specifically, is what I said in 1 the only possible way to get into this state, or did I miss something?

If so then I think a solution would be:

  • If we get an appendEntries for a term lower than currentTerm AND we are a non-voter AND the term from the apparent leader is greater than or equal to the term of the last log entry we have persisted, then it's safe to assume we are in this state.

I reason that this is safe because we might still revert to the term of an old leader and append its entries, but we'll never delete entries from a later term in doing so, which means that we are effectively in the same state as if this node had been partitioned along with the old leader for a while before it stepped down, and that state will be corrected as soon as the new leader makes contact.

We can't flip flop - if the new leader made contact first and appended in the new term then the additional check above will effectively fence out the old leader from modifying our log and we will proceed normally.

What do you think @dhiaayachi ?
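For reference, here is the condition proposed above as a standalone predicate (sketch only, with illustrative names; the comparison is written exactly as stated in this comment, using the request's term, whereas the later diff in this thread compares PrevLogTerm):

package main

import "fmt"

// Sketch only: a non-voter adopts a lower leader term only when that term is
// at least as new as the term of the last log entry it has persisted. Once a
// newer leader has appended anything, a stale leader is fenced out.
func shouldDowngradeTerm(reqTerm, currentTerm, lastLogTerm uint64, isNonVoter bool) bool {
	return reqTerm < currentTerm && isNonVoter && reqTerm >= lastLogTerm
}

func main() {
	// Term inflated to 103 by old elections, but logs only go up to term 2:
	fmt.Println(shouldDowngradeTerm(2, 103, 2, true)) // true: safe to reset to the leader's term
	// Already appended entries from a newer leader (last log term 3), so an
	// old leader still stuck at term 2 cannot modify our log:
	fmt.Println(shouldDowngradeTerm(2, 103, 3, true)) // false
}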

Contributor Author

Thank you for reasoning through this with me @banks! I was trying to go through those scenarios and I agree it's not easy without a formal proof. I ended up reasoning based on the following, using facts that should not vary based on timing or external events:

  • The node in question is a non-voter.
  • Any election triggered by a non-voter is a disruption, because that node cannot participate in it.
  • The term kept by a non-voter always comes from the current leader: since it can't participate in an election, it can't increase its term on its own.

Based on those assumptions, if the term of a non-voter node is higher than the cluster's current term, there are 2 possible scenarios:

  1. That higher term came from another node that stepped up as a leader but was partitioned from the majority of nodes (the scenario you just described).
  2. That higher term is residual from the node formerly being a voter (described in #524, "Unstable leadership when running server is demoted/removed without its participation").

I agree the first scenario is not ideal with the fix, as the non-voter will be in an unstable state until the cluster is restored. But if we run the same scenario without the fix, the non-voter will in most cases end up following the partitioned node, as it will very likely have a higher term, which is not practical in a real-life situation.

Member

@banks banks Sep 9, 2022

I can think of a scenario that's a bit different to above and breaks my assumption slightly.

If the server, while it was still a voter, had been partitioned from the other servers, it might have sat there arbitrarily long attempting elections and so may have a persisted term that is arbitrarily higher than the stable leader's. It didn't disrupt the leader because it couldn't send it any messages.

Some time after it restarts as a non-voter with this state, the partition heals and the bug manifests when the current leader attempts to replicate. Note that the rest of the cluster might have been through any number of leader elections in the time before the partition healed (either before or after the node restarted), but as long as the new leader's term is lower than the term the partitioned voter got itself to, the bug can occur.

I don't think that changes the correctness of the solution above (because the non-voter's last log term would be lower than the new leader's, which is OK), but I think it's worth understanding the most general way that this state is possible!

Contributor Author

Some time after it restarts as a non-voter with this state, the partition heals and the bug manifests when the current leader attempts to replicate. Note that the rest of the cluster might have been through any number of leader elections in the time before the partition healed (either before or after the node restarted), but as long as the new leader's term is lower than the term the partitioned voter got itself to, the bug can occur.

That is exactly the bug described in #524 and falls under 2 in my last comment.

@banks
Member

banks commented Sep 9, 2022 via email

@dhiaayachi
Contributor Author

I think we both understand this, but this assumption is only true in the
steady state for non-voters. It's violated in the problematic case here
because the non-voter used to be a voter and increased its own term
beyond the rest of the cluster back when it was a voter.

Yes, and that's the case we want to avoid: as soon as a node becomes a non-voter its former term is not relevant anymore, right? The naive approach to fix this would be to reset the node's term as soon as it changes from voter to non-voter and that config is committed, but that presents a chicken-and-egg problem, as this is done as part of appendEntries.

Hmmm, I'm trying to think this one through. If there is a node that is
partitioned and so is trying to get itself elected in a higher term than
the current leader, you're right that it could send a requestVote RPC and
force up the non-voter's current term... BUT I don't think it's possible
for it to ever call appendEntries with that higher term since it won't ever
get enough votes to be elected leader in that term.

Sorry, that's right - I was misunderstanding the scenario you described in your first comment, which is a transient state, and I don't think it's problematic, as you said.

But I was referring to a slightly different scenario, where a node is partitioned from all the nodes except the non-voter, while the non-voter is not partitioned. In that case the requestVote RPC will increase the non-voter's term, which will impact the append entries coming from the leader and will:

  • trigger a disruptive election in the current code, and that election will keep happening until the partition is resolved
  • with only the first fix, leave the non-voter unable to append logs coming from the leader
  • with both fixes, make the non-voter flip between the two terms for a while (until the partition is resolved) but still accept append entries from the leader.

That said, I think for that specific case pre-vote would help independently of the fix in here, because it prevents term inflation.

If so then I think a solution would be:

If we get an appendEntries for a term lower than currentTerm AND we are a non-voter AND the term from the apparent leader is greater than or equal to the term of the last log entry we have persisted, then it's safe to assume we are in this state.
I reason that this is safe because we might still revert to the term of an old leader and append its entries, but we'll never delete entries from a later term in doing so, which means that we are effectively in the same state as if this node had been partitioned along with the old leader for a while before it stepped down, and that state will be corrected as soon as the new leader makes contact.

This is interesting: that means we will favour a leader that has a newer log term, and I think it increases overall stability in the 2 scenarios (the one you described in your first comment and the one above in this one).

Let me try to change the fix to reflect this and see how the test behaves. I can also try to cover the other scenarios we discussed here in tests.

Member

@banks banks left a comment

Thanks. This is looking good I think.

This change updates the one case I could think of that was a bit dubious.

I'd love it if we were able to model check this or something to convince ourselves it's not subtly wrong in a new way, but I don't know if that is pragmatic. Even the official Raft TLA+ spec would need a lot of additions to model all the additional concerns here meaningfully (e.g. non-voters, the ability to reset from voter to non-voter, etc.), so that likely isn't pragmatic for now.

Perhaps we could have a couple of other folks review too. Ultimately though, given that this change can only affect non-voters, I think it can't cause the worst kinds of correctness bugs (e.g. loss of committed writes).

I will think for a while on whether there are any scenarios that could arise after a non-voter in this state transitions back to a voter... But I don't think there are, since the non-voter is always left in a state that is consistent with some "normal" history, even if it ends up replicating logs from an old leader...

// This case can happen when a node that used to be a voter is added back to the cluster as a non-voter:
// its term could be higher than the cluster's, but because it's a non-voter that term needs to be discarded
// in favor of the cluster's current term to keep the cluster stable, as an election must not be started
// by a node which doesn't have voting rights.
Member

I think a comment here is useful, but perhaps a link to this PR or the issue would be good, because it's super hard to express all the nuance of the issue and why this is correct in a few lines?

"lastLogTerm", r.lastLogTerm,
" Term", a.Term, "currentTerm", currentTerm)
r.setState(Follower)
r.setCurrentTerm(a.Term)
Member

Is there any other state we need to set here 🤔 I don't think so, because this is the same as the code below for normal happy-path replication, where we switch to a new leader's term after an election when we get the first appendEntries. But I feel like we also store state about which node is our current leader in raft, and it feels weird to me that we might be switching term and leader here without updating anything about who that leader is....

But I know we do persist the identity of the candidate we last voted for, so I wonder if the happy path doesn't update anything because it assumes that we set the identity during the election? Even then, that can't be right in the case of a follower coming up and learning about a new leader that it didn't vote for. So again, I guess you did the right thing here, but I'm slightly intrigued about how and where we update the state about who our current leader is 🤔. Do you know @dhiaayachi ?

Contributor Author

It's done right after, in here. But I'm a bit confused about it, as that means we trust any node that sends us an appendEntries request with a higher term as the leader - but maybe that's the right thing to do, I'm not sure.

@szechuen

szechuen commented Sep 9, 2022

Hi 👋 Catching up on the discussion here.

The PreVote optimization sounds like a reasonable way to prevent partitioned voters (turned non-voters) from incrementing their term in the first place, and solves the disruption issue more generally. Is that a viable approach here vs. reasoning about specific bad states?

@dhiaayachi
Contributor Author

Thinking about this more, I think you're right @szechuen. The PreVote optimization, in this case, would prevent the node's term from inflating before it is converted to a non-voter. So when it gets converted, its term will always be less than or equal to the rest of the cluster's term, preventing the issue from happening altogether.

That said, I think the fix in this PR would still be beneficial, because PreVote would very likely require a new protocol version, which would make its adoption slower. It would also protect against a node that simply has bad state (e.g. was part of a split brain) joining as a non-voter.
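As a rough illustration of why PreVote prevents term inflation (this is a sketch of the idea from the thesis, not hashicorp/raft's API; all names here are made up):

package main

import "fmt"

// Sketch only: before starting a real election, a would-be candidate asks its
// peers whether they would grant a vote for term currentTerm+1, without
// persisting or adopting that term. It only increments its term (and risks
// disrupting the cluster) if a quorum says yes.

type peerState struct {
	currentTerm uint64
	hasLeader   bool // heard from a live leader recently
}

// wouldGrantPreVote mirrors the pre-vote rule: only grant if the proposed term
// is newer and we haven't heard from a current leader.
func wouldGrantPreVote(p peerState, proposedTerm uint64) bool {
	return proposedTerm > p.currentTerm && !p.hasLeader
}

func runPreVote(localTerm uint64, peers []peerState) bool {
	proposed := localTerm + 1
	granted := 1 // our own pre-vote
	for _, p := range peers {
		if wouldGrantPreVote(p, proposed) {
			granted++
		}
	}
	return granted > (len(peers)+1)/2 // quorum of the full cluster
}

func main() {
	// A node that keeps failing pre-vote (peers unreachable or still following
	// a leader) never increments its persisted term, so it can later be
	// demoted to non-voter without carrying an inflated term around.
	peers := []peerState{{currentTerm: 2, hasLeader: true}, {currentTerm: 2, hasLeader: true}}
	fmt.Println(runPreVote(2, peers)) // false: no term increment, no disruption
}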

@szechuen

Gotcha. In that case, perhaps the more "spec-compliant" approach here would be to split the resolution into 2 parts: (1) ensure that the cluster can regain stability even if it's disrupted by a non-voter, (2) prevent the disruption from happening in the first place.

(2) will be solved by the PreVote optimization. (1) can be solved by allowing our term to be set to the highest seen term regardless of whether the node is a non-voter, i.e. the pre-#477 behavior where a non-voter's RequestVote will trigger a new election if its term is higher, but it still won't win, since its vote is rejected in that election. Specifically, the term check can be ordered before the non-voter check.

raft/raft.go

Lines 1583 to 1595 in 1075246

	// Ignore an older term
	if req.Term < r.getCurrentTerm() {
		return
	}

	// Increase the term if we see a newer one
	if req.Term > r.getCurrentTerm() {
		// Ensure transition to follower
		r.logger.Debug("lost leadership because received a requestVote with a newer term")
		r.setState(Follower)
		r.setCurrentTerm(req.Term)
		resp.Term = req.Term
	}

raft/raft.go

Lines 1563 to 1581 in 1075246

	// For older raft version ID is not part of the packed message
	// We assume that the peer is part of the configuration and skip this check
	if len(req.ID) > 0 {
		candidateID := ServerID(req.ID)
		// if the Servers list is empty that mean the cluster is very likely trying to bootstrap,
		// Grant the vote
		if len(r.configurations.latest.Servers) > 0 && !hasVote(r.configurations.latest, candidateID) {
			r.logger.Warn("rejecting vote request since node is not a voter",
				"from", candidate)
			return
		}
	}
	if leaderAddr, leaderID := r.LeaderWithID(); leaderAddr != "" && leaderAddr != candidate && !req.LeadershipTransfer {
		r.logger.Warn("rejecting vote request since we have a leader",
			"from", candidate,
			"leader", leaderAddr,
			"leader-id", string(leaderID))
		return
	}

That should ensure that the cluster will be able to make progress and regain stability after being disrupted, which is also in line with the behavior when a partitioned voter rejoins the cluster.
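A minimal sketch of the reordering described above (illustrative types and names, not the library's requestVote handler): the receiver applies the term check first, adopting any newer term it sees, and only then rejects the vote because the candidate is a non-voter, so the cluster still converges on the highest seen term even though the non-voter can never win.

package main

import "fmt"

type voteReq struct {
	term        uint64
	candidateID string
}

type node struct {
	currentTerm uint64
	voters      map[string]bool
}

func (n *node) handleRequestVote(req voteReq) bool {
	// 1. Term check first (pre-#477 ordering): adopt any newer term we see.
	if req.term < n.currentTerm {
		return false
	}
	if req.term > n.currentTerm {
		n.currentTerm = req.term // step down to follower at the newer term
	}
	// 2. Only then reject the vote itself if the candidate has no suffrage.
	if !n.voters[req.candidateID] {
		return false
	}
	// ... remaining checks (existing leader, log up-to-date, already voted) ...
	return true
}

func main() {
	n := &node{currentTerm: 2, voters: map[string]bool{"A": true, "B": true}}
	granted := n.handleRequestVote(voteReq{term: 103, candidateID: "D"}) // D is a non-voter
	fmt.Println(granted, n.currentTerm)                                  // false 103: vote rejected, but the term converges
}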

@szechuen

szechuen commented Sep 12, 2022

Also, it seems like there's a race condition either with the current fix or with the test. Adding an additional debug condition to the block:

if a.Term < currentTerm && !hasVote && a.PrevLogTerm >= r.lastLogTerm {
  r.logger.Warn("older term sent to non-voter", "PrevLogTerm", a.PrevLogTerm,
    "lastLogTerm", r.lastLogTerm,
    " Term", a.Term, "currentTerm", currentTerm)
  r.setState(Follower)
  r.setCurrentTerm(a.Term)
  resp.Term = a.Term
  noRetryBackoffIfErr = false
} else if a.Term < currentTerm {
  r.logger.Warn("higher term appendEntries received but not resetting term", "hasVote", hasVote,
    "PrevLogTerm", a.PrevLogTerm, "lastLogTerm", r.lastLogTerm)
}
% go test -run TestRaft_StabilityIsKept_WhenNonVoterWithHigherTermJoin -count 100 
--- FAIL: TestRaft_StabilityIsKept_WhenNonVoterWithHigherTermJoin (16.81s)
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.517-0700 [INFO]  server-daeab638-c616-2579-4afb-20969ce080a3: initial configuration: index=1 servers="[{Suffrage:Voter ID:server-daeab638-c616-2579-4afb-20969ce080a3 Address:daeab638-c616-2579-4afb-20969ce080a3} {Suffrage:Voter ID:server-f445d6ee-2106-7257-be07-290f948c152f Address:f445d6ee-2106-7257-be07-290f948c152f} {Suffrage:Voter ID:server-68cb85a4-318a-459c-137f-ebad9e46ed13 Address:68cb85a4-318a-459c-137f-ebad9e46ed13}]"
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.517-0700 [INFO]  server-daeab638-c616-2579-4afb-20969ce080a3: entering follower state: follower="Node at daeab638-c616-2579-4afb-20969ce080a3 [Follower]" leader-address= leader-id=
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.518-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: initial configuration: index=1 servers="[{Suffrage:Voter ID:server-daeab638-c616-2579-4afb-20969ce080a3 Address:daeab638-c616-2579-4afb-20969ce080a3} {Suffrage:Voter ID:server-f445d6ee-2106-7257-be07-290f948c152f Address:f445d6ee-2106-7257-be07-290f948c152f} {Suffrage:Voter ID:server-68cb85a4-318a-459c-137f-ebad9e46ed13 Address:68cb85a4-318a-459c-137f-ebad9e46ed13}]"
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.518-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: entering follower state: follower="Node at f445d6ee-2106-7257-be07-290f948c152f [Follower]" leader-address= leader-id=
    testing.go:147: server-68cb85a4-318a-459c-137f-ebad9e46ed13: 2022-09-12T10:49:10.518-0700 [INFO]  server-68cb85a4-318a-459c-137f-ebad9e46ed13: initial configuration: index=1 servers="[{Suffrage:Voter ID:server-daeab638-c616-2579-4afb-20969ce080a3 Address:daeab638-c616-2579-4afb-20969ce080a3} {Suffrage:Voter ID:server-f445d6ee-2106-7257-be07-290f948c152f Address:f445d6ee-2106-7257-be07-290f948c152f} {Suffrage:Voter ID:server-68cb85a4-318a-459c-137f-ebad9e46ed13 Address:68cb85a4-318a-459c-137f-ebad9e46ed13}]"
    testing.go:147: cluster: 2022-09-12T10:49:10.518-0700 [INFO]  cluster: starting stability test: raft-state=Leader
    testing.go:147: server-68cb85a4-318a-459c-137f-ebad9e46ed13: 2022-09-12T10:49:10.518-0700 [INFO]  server-68cb85a4-318a-459c-137f-ebad9e46ed13: entering follower state: follower="Node at 68cb85a4-318a-459c-137f-ebad9e46ed13 [Follower]" leader-address= leader-id=
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.578-0700 [WARN]  server-f445d6ee-2106-7257-be07-290f948c152f: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.578-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: entering candidate state: node="Node at f445d6ee-2106-7257-be07-290f948c152f [Candidate]" term=2
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.578-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: election won: term=2 tally=2
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.578-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: entering leader state: leader="Node at f445d6ee-2106-7257-be07-290f948c152f [Leader]"
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.578-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: added peer, starting replication: peer=server-daeab638-c616-2579-4afb-20969ce080a3
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.578-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: added peer, starting replication: peer=server-68cb85a4-318a-459c-137f-ebad9e46ed13
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.578-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: pipelining replication: peer="{Voter server-68cb85a4-318a-459c-137f-ebad9e46ed13 68cb85a4-318a-459c-137f-ebad9e46ed13}"
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.578-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: pipelining replication: peer="{Voter server-daeab638-c616-2579-4afb-20969ce080a3 daeab638-c616-2579-4afb-20969ce080a3}"
    testing.go:147: cluster: 2022-09-12T10:49:10.683-0700 [INFO]  cluster: stable state for Leader reached at 2022-09-12 10:49:10.578699 -0700 PDT m=+242.522723890 (1 nodes), 60.519553ms from start of poll, 61.032553ms from cluster start. Timeout at 2022-09-12 10:49:10.683732 -0700 PDT m=+242.627756163, 105.032273ms after stability
    testing.go:147: cluster: 2022-09-12T10:49:10.689-0700 [INFO]  cluster: starting stability test: raft-state=Follower
    testing.go:147: cluster: 2022-09-12T10:49:10.794-0700 [INFO]  cluster: stable state for Follower reached at 2022-09-12 10:49:10.689135 -0700 PDT m=+242.633159185 (2 nodes), 1.88µs from start of poll, 171.467848ms from cluster start. Timeout at 2022-09-12 10:49:10.794277 -0700 PDT m=+242.738300209, 105.141024ms after stability
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.794-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: updating configuration: command=RemoveServer server-id=server-daeab638-c616-2579-4afb-20969ce080a3 server-addr= servers="[{Suffrage:Voter ID:server-f445d6ee-2106-7257-be07-290f948c152f Address:f445d6ee-2106-7257-be07-290f948c152f} {Suffrage:Voter ID:server-68cb85a4-318a-459c-137f-ebad9e46ed13 Address:68cb85a4-318a-459c-137f-ebad9e46ed13}]"
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.794-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: removed peer, stopping replication: peer=server-daeab638-c616-2579-4afb-20969ce080a3 last-index=3
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.794-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: aborting pipeline replication: peer="{Voter server-daeab638-c616-2579-4afb-20969ce080a3 daeab638-c616-2579-4afb-20969ce080a3}"
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.794-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: updating configuration: command=AddNonvoter server-id=server-daeab638-c616-2579-4afb-20969ce080a3 server-addr=daeab638-c616-2579-4afb-20969ce080a3 servers="[{Suffrage:Voter ID:server-f445d6ee-2106-7257-be07-290f948c152f Address:f445d6ee-2106-7257-be07-290f948c152f} {Suffrage:Voter ID:server-68cb85a4-318a-459c-137f-ebad9e46ed13 Address:68cb85a4-318a-459c-137f-ebad9e46ed13} {Suffrage:Nonvoter ID:server-daeab638-c616-2579-4afb-20969ce080a3 Address:daeab638-c616-2579-4afb-20969ce080a3}]"
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.794-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=2 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.794-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=2 lastLogTerm=2
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.794-0700 [INFO]  server-f445d6ee-2106-7257-be07-290f948c152f: added peer, starting replication: peer=server-daeab638-c616-2579-4afb-20969ce080a3
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.794-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=2 lastLogTerm=2
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.794-0700 [WARN]  server-f445d6ee-2106-7257-be07-290f948c152f: appendEntries rejected, sending older logs: peer="{Nonvoter server-daeab638-c616-2579-4afb-20969ce080a3 daeab638-c616-2579-4afb-20969ce080a3}" next=3
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.801-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.805-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=2 lastLogTerm=2
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.805-0700 [WARN]  server-f445d6ee-2106-7257-be07-290f948c152f: appendEntries rejected, sending older logs: peer="{Nonvoter server-daeab638-c616-2579-4afb-20969ce080a3 daeab638-c616-2579-4afb-20969ce080a3}" next=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.809-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.815-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=1 lastLogTerm=2
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.815-0700 [WARN]  server-f445d6ee-2106-7257-be07-290f948c152f: appendEntries rejected, sending older logs: peer="{Nonvoter server-daeab638-c616-2579-4afb-20969ce080a3 daeab638-c616-2579-4afb-20969ce080a3}" next=1
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.816-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.825-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.835-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.835-0700 [WARN]  server-f445d6ee-2106-7257-be07-290f948c152f: appendEntries rejected, sending older logs: peer="{Nonvoter server-daeab638-c616-2579-4afb-20969ce080a3 daeab638-c616-2579-4afb-20969ce080a3}" next=1
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.835-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.844-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.847-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: heartbeat timeout reached, starting election: last-leader-addr=f445d6ee-2106-7257-be07-290f948c152f last-leader-id=server-f445d6ee-2106-7257-be07-290f948c152f
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.847-0700 [INFO]  server-daeab638-c616-2579-4afb-20969ce080a3: entering candidate state: node="Node at daeab638-c616-2579-4afb-20969ce080a3 [Candidate]" term=103
    testing.go:147: server-68cb85a4-318a-459c-137f-ebad9e46ed13: 2022-09-12T10:49:10.847-0700 [WARN]  server-68cb85a4-318a-459c-137f-ebad9e46ed13: rejecting vote request since node is not a voter: from=daeab638-c616-2579-4afb-20969ce080a3
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.847-0700 [WARN]  server-f445d6ee-2106-7257-be07-290f948c152f: rejecting vote request since node is not a voter: from=daeab638-c616-2579-4afb-20969ce080a3
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.852-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.861-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.871-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.876-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.876-0700 [WARN]  server-f445d6ee-2106-7257-be07-290f948c152f: appendEntries rejected, sending older logs: peer="{Nonvoter server-daeab638-c616-2579-4afb-20969ce080a3 daeab638-c616-2579-4afb-20969ce080a3}" next=1
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.878-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.883-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.889-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.898-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.906-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: higher term appendEntries received but not resetting term: hasVote=true PrevLogTerm=0 lastLogTerm=2
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.910-0700 [WARN]  server-daeab638-c616-2579-4afb-20969ce080a3: Election timeout reached, restarting election
    testing.go:147: server-daeab638-c616-2579-4afb-20969ce080a3: 2022-09-12T10:49:10.910-0700 [INFO]  server-daeab638-c616-2579-4afb-20969ce080a3: entering candidate state: node="Node at daeab638-c616-2579-4afb-20969ce080a3 [Candidate]" term=104
    testing.go:147: server-68cb85a4-318a-459c-137f-ebad9e46ed13: 2022-09-12T10:49:10.910-0700 [WARN]  server-68cb85a4-318a-459c-137f-ebad9e46ed13: rejecting vote request since node is not a voter: from=daeab638-c616-2579-4afb-20969ce080a3
    testing.go:147: server-f445d6ee-2106-7257-be07-290f948c152f: 2022-09-12T10:49:10.910-0700 [WARN]  server-f445d6ee-2106-7257-be07-290f948c152f: rejecting vote request since node is not a voter: from=daeab638-c616-2579-4afb-20969ce080a3

Update: there seems to be a chicken-and-egg problem here. We will not reset the term unless !hasVote, and that only becomes true once the new config is appended, but the new config will not be appended while a.Term < r.getCurrentTerm().

@dhiaayachi
Contributor Author

Closing this in favor of #526
