Non voter with higher term should not cause cluster instability #525

Closed
16 changes: 14 additions & 2 deletions raft.go
@@ -1375,7 +1375,19 @@ func (r *Raft) appendEntries(rpc RPC, a *AppendEntriesRequest) {
rpc.Respond(resp, rpcErr)
}()

// Ignore an older term
// This case can happen when a node that was a voter is added back to the cluster as a non-voter:
// its term could be higher than the cluster's, but because it is a non-voter that term should be
// discarded in favor of the cluster's current term to keep the cluster stable, since an election
// must not be started by a node that has no voting rights.
Member

I think a comment here is useful but perhaps a link to this PR or the issue would be good because it's super hard to express all the nuance of the issue and why this is correct in a few lines?

currentTerm := r.getCurrentTerm()
hasVote := hasVote(r.getLatestConfiguration(), r.localID)
if a.Term < currentTerm && !hasVote {
r.setState(Follower)
r.setCurrentTerm(a.Term)
Member

Is there any other state we need to set here? 🤔 I don't think so, because this is the same as the code below for normal happy-path replication, where we switch to a new leader's term after an election when we get the first appendEntries. But I feel like we also store state about which node is our current leader in raft, and it feels weird to me that we might be switching term and leader here without updating anything about who that leader is...

But I know we do persist the identity of the candidate we last voted for, so I wonder if the happy path doesn't update anything because it assumes that we set the identity during the election? Even then that can't be right in the case of a follower coming up and learning about a new leader that it didn't vote for, so again I guess you did the right thing here, but I'm slightly intrigued about how and where we update the state about who our current leader is 🤔. Do you know @dhiaayachi ?

Contributor Author

It's done right after, in here.
But I'm a bit confused about it, as that means we trust any node that sends us an appendEntries request with a higher term as the leader. Maybe that's the right thing to do; I'm not sure.

resp.Term = a.Term
}
Member

I'm having a hard time reasoning about the correctness of this change. It's a fundamental deviation from the published/model-checked/peer-reviewed Raft algorithm, which makes it significantly risky IMO.

I'm going to dream up an example where this gets subtle. It might not turn out to be incorrect as we think it through, but maybe it helps us reason about correctness here.

Imagine servers A, B, C, D where A is leader in term 1, A-C form a quorum of voters, and D is a non-voter, all in steady state.

Now imagine that A is partitioned from the others. For a while it doesn't know this.

B and C elect B the new leader in term 2, and D starts following B.

But now (still in the window before A notices its leader lease timeout) A reconnects and attempts to send append entries to the peers. B and C will correctly ignore it and tell it to step down, but before it gets their response, D will silently accept it and revert to term 1 (in which A was leader) while still actually being in follower state following B in term 2.

Is this state valid?

Well D may now try to append an entry from A in term 1 that isn't actually ever going to commit because A won't get a quorum for it.

What happens next depends on the state in D's log and exactly what was inflight from A when it was partitioned.

Let's imagine just one (I think) possible state: A had everything up to index 99 committed before the partition and was just starting to replicate index 100 when it lost connection to all other servers, so it is now retrying to append index 100 in term 1.

Meanwhile B has won election and already replicated index 100 to D in term 2.

Since D's last log will be index 100 but the append will have PrevLogEntry == 99, we hit the else case on line 1418 below.

D will now look up its own log entry 99 (which was the correct one back from term 1) and set prevLogTerm to 1 based on that.

So it will skip right over the term check on line 1431 (effectively D is treating the old leader as the new leader), and on line 1459 below it will find the terms don't match, so it will truncate the good entries from term 2 out of its log and replace them with the ones from the old leader in term 1.
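
To make the mechanics above concrete, here is a deliberately simplified sketch of the prev-log check and conflict truncation being described. The hand-rolled types and plain slice below are illustrative stand-ins, not the actual raft.go code at the quoted line numbers:

```go
package main

import "fmt"

// Hand-rolled stand-ins for illustration only; the real AppendEntriesRequest
// and LogStore carry more fields and state.
type logEntry struct {
	Index, Term uint64
}

type appendReq struct {
	Term         uint64     // leader's term
	PrevLogEntry uint64     // index of the entry preceding Entries
	PrevLogTerm  uint64     // term the leader has at PrevLogEntry
	Entries      []logEntry // entries to append
}

// handleAppend sketches the consistency check described above: look up our own
// entry at PrevLogEntry, compare terms, then truncate any conflicting suffix
// before appending the leader's entries.
func handleAppend(log []logEntry, req appendReq) ([]logEntry, bool) {
	if req.PrevLogEntry > 0 {
		// The "lookup of its own log 99": find our term at PrevLogEntry.
		var prevTerm uint64
		found := false
		for _, e := range log {
			if e.Index == req.PrevLogEntry {
				prevTerm, found = e.Term, true
				break
			}
		}
		if !found || prevTerm != req.PrevLogTerm {
			return log, false // reject; leader should back up and retry
		}
	}
	// Conflict handling: an existing entry whose term differs from the incoming
	// one at the same index causes truncation from that index onward. This is
	// where D deletes the good term-2 entry it already got from B.
	for _, ne := range req.Entries {
		keep := true
		for i, e := range log {
			if e.Index == ne.Index {
				if e.Term != ne.Term {
					log = log[:i] // truncate the conflicting suffix
				} else {
					keep = false // already have this exact entry
				}
				break
			}
		}
		if keep {
			log = append(log, ne)
		}
	}
	return log, true
}

func main() {
	// D's log after the new leader B replicated index 100 in term 2.
	dLog := []logEntry{{Index: 99, Term: 1}, {Index: 100, Term: 2}}
	// Old leader A retries index 100 in term 1, with PrevLogEntry 99 / term 1.
	dLog, ok := handleAppend(dLog, appendReq{
		Term: 1, PrevLogEntry: 99, PrevLogTerm: 1,
		Entries: []logEntry{{Index: 100, Term: 1}},
	})
	fmt.Println(ok, dLog) // true [{99 1} {100 1}]: the term-2 entry was truncated away
}
```

Running it shows D accepting the stale leader's entry and truncating the term-2 entry it had already received from B.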

Let's say A managed to replicate a few more logs to it in term 1 before noticing that it needs to stand down, or before getting a response from a voter telling it the term is stale.

But while that's happening, B is also sending appendEntries to D in term 2; each time it does, D updates its term, truncates all the "bad" entries from term 1, and requests a replay from the fork point from B. Meanwhile A might still try to append more entries, and each time it does, D will truncate the good entries out again, and so on.

I think this scenario works out OK in the end because A will eventually step down and stop battling the new leader to append, and when it does, D's log should be reset.

If D were a voter, so that the fact it logged something was taken into account when working out commitment, this kind of flapping would break correctness in general, because an old leader could cause deletion of logs "committed" by the newer leader. Since it's not a voter it might just about be OK, because the voters' logs will remain consistent and this non-voter will eventually end up consistent (certainly by the time it has learned about what is committed and applied it to its FSM, so even stale reads shouldn't "see" any bad data from an old leader, since it won't be committed).

I don't think there is a scenario where D can end up with a log that's inconsistent with the quorum by the time the entries are committed, but I don't think we should trust my thinking here; that's why Raft and other algorithms are rigorously defined and model checked etc., so straying from that seems risky to me.

Conclusion

I didn't prove this is "wrong" in the sense of user-visible consistency here, but I'm still not sure if it's a good move to alter this behaviour in this way.

I think the real fix for overall cluster stability is the other one below where we just don't let the leader stand down if the node with a higher term is a non-voter. That seems safer to me.

I guess without this though that change alone leaves the non-voter in a strange state?

I guess my question would be whether there is a more correct solution. For example I think the PreVote optimization described in Diego's full Raft thesis would also solve this problem (as well as when the same thing happens for voters) and would be totally correct in all cases?

If not, how likely is this to happen? Could we just crash the non-voter and have an operator delete the non-voters raft state and start again?

I'm trying to think of other options but it's really hard to reason about: you can't just reset the non-voter state. Even though it's a non-voter now, it might have voted in a term that was counted towards a quorum, so just resetting its term from its last successfully replicated log on restart is quite likely not correct either in some subtle edge case.

I think you could avoid the old-leader horribleness above correctly in other ways too. For example, if the non-voter had a record of the last term it saw a stable leader in (which is subtly different from its Term state, since that is incremented when it attempted to become a leader as a voter even though no leader existed at that term), then it could continue to fence off appendEntries from leaders older than the current stable leader and avoid the issue above, while still allowing a reset from a "spurious" term that it tried an election in before becoming a non-voter?

What do you think @dhiaayachi ? I'd love to figure out something less risky than deviating from the reviewed raft protocol here if possible.

Member (@banks, Sep 9, 2022)

> if the non-voter had a record of the last term it saw a stable leader in

🤔 Actually that might be the same thing as the term of its last persisted log, so we do have that... but I'm not sure how we use it correctly. Above I was thinking we could reset the non-voter's current Term to that on startup to avoid it being disruptive, but I think that's dangerous as noted above.

Perhaps we could use this value in the new check here though, and only downgrade the term if we are a non-voter AND our currentTerm is higher than the last log term? (Or the new lower term is exactly equal to our last log term; not sure if that's the same.)

Either way, we should think about what we need to do to convince ourselves any solution we come up with is actually rigorously correct!

Member

For the record, by PreVote I was misremembering; I was actually referring to Section 4.2.3, Disruptive Servers, in https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf.

Member

And I think it's not going to help either, because in this case the non-voter isn't disrupting by trying to get votes with a higher term (it's a non-voter after all) but rather by returning a higher term during replication.

Member (@banks, Sep 9, 2022)

So I think the problem can only occur if:

  1. The non-voter starts up with a currentTerm persisted that is higher than the actual stable leader term.
    • I assume this can only happen when a voter is restarted as a non-voter and happens to attempt an election just before it restarts, causing it to increment its term. It has to persist the new term to its state file, but not actually deliver the RPCs before shutting down (since that would cause an election and a move to that term or higher in the rest of the cluster anyway).
  2. When it starts up, its currentTerm is stableLeaderTerm + 1 (assuming no leader change since it shut down, which seems to be required for this bug to manifest) but its last appended log term will be at most stableLeaderTerm (could be lower in theory).
  3. When the stable leader attempts an appendEntries at stableLeaderTerm, the non-voter rejects it, sending its higher term in response.
  4. (Before this patch) the leader steps down and a new election happens needlessly.

Does that match your understanding, Dhia? Specifically, is what I said in 1 the only possible way to get into this state, or did I miss something?

If so then I think a solution would be:

  • If we get an appendEntries for a term lower than currentTerm AND we are a non-voter AND the term from the apparent leader is greater than or equal to the term of the last log entry we have persisted, then it's safe to assume we are in this state.

I reason that this is safe because we might still revert to a term of an old leader and append its entries, but we'll never delete entries from a later term in doing so, which means we are effectively in the same state as if this node were partitioned along with the old leader for a while before it stepped down, and that state will be corrected as soon as the new leader makes contact.

We can't flip-flop: if the new leader made contact first and appended in the new term, then the additional check above will effectively fence the old leader out of modifying our log and we will proceed normally.
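
A minimal sketch of that condition, just to pin the proposal down; the helper below and its parameters (lastLogTerm, isNonVoter) are placeholders for illustration, not the library's API, and the real change would live inside the appendEntries handler in raft.go:

```go
package raftsketch

// shouldAdoptLowerTerm captures the proposed guard: a non-voter that receives
// an AppendEntries with a term lower than its own currentTerm may adopt that
// lower term only when the apparent leader is at least as new as anything the
// node has persisted. reqTerm is the term in the incoming request, currentTerm
// the receiver's persisted term, lastLogTerm the term of its last persisted
// log entry, and isNonVoter whether it lacks suffrage in the latest config.
func shouldAdoptLowerTerm(reqTerm, currentTerm, lastLogTerm uint64, isNonVoter bool) bool {
	if reqTerm >= currentTerm {
		return false // not a stale term; normal Raft handling applies
	}
	if !isNonVoter {
		return false // voters keep rejecting older terms, per standard Raft
	}
	// Fencing: adopting reqTerm can never cause us to truncate entries from a
	// later term, because nothing we have persisted is newer than reqTerm.
	return reqTerm >= lastLogTerm
}
```

With a guard like this, the unconditional term downgrade in the diff above would only fire when the condition holds; otherwise the request is rejected as before.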

What do you think @dhiaayachi ?

Contributor Author

Thank you for reasoning with me about this @banks! I was trying to go through those scenarios and I agree it's not easy without a formal proof. I ended up reasoning based on the following, using facts that should not vary based on timing or external events:

  • The node in question is a non-voter
  • Any election triggered by a non-voter is a disruption, because that node cannot participate in it
  • The term kept by a non-voter always comes from the current leader; since it can't participate in an election, it can't increase its term on its own.

Based on those assumptions, if the term of a non-voter node is higher than the cluster's current term, this brings us to two possible scenarios:

  1. That higher term comes from another node that stepped up as a leader but was partitioned from the majority of nodes (the scenario you just described)
  2. That higher term is residual from the node formerly being a voter (described in "Unstable leadership when running server is demoted/removed without its participation" #524)

I agree the first scenario is not ideal with the fix, as the non-voter will be in an unstable state until the cluster is restored. But if we run the same scenario without the fix, the non-voter node will in most cases be following the partitioned node, as it will very likely have a higher term, which is not practical in a real-life situation.

Member (@banks, Sep 9, 2022)

I can think of a scenario that's a bit different to above and breaks my assumption slightly.

If the server, while it was still a voter, had been partitioned from the other servers, it might have sat there for arbitrarily long attempting elections, and so may have a persisted term that is arbitrarily higher than the stable leader's. It didn't disrupt the leader because it couldn't send it messages.

Some time after it restarts as a non-voter with this state, the partition heals and the bug manifests when the current leader attempts to replicate. Note that the rest of the cluster might have been through any number of leader elections in the time before the partition healed (either before or after the node restarted), but as long as the new leader's term is lower than the term the partitioned voter got itself to, the bug can occur.

I don't think that changes the correctness of the solution above (because the non-voter's last log term would be lower than the new leader's, which is OK), but I think it's worth understanding the most general way that this state is possible!

Contributor Author

> Some time after it restarts as a non-voter with this state, the partition heals and the bug manifests when the current leader attempts to replicate. Note that the rest of the cluster might have been through any number of leader elections in the time before the partition healed (either before or after the node restarted), but as long as the new leader's term is lower than the term the partitioned voter got itself to, the bug can occur.

That is exactly the bug described in #524, and it falls under 2 in my last comment.


// If a node is a voter, ignore an older term
if a.Term < r.getCurrentTerm() {
return
}
@@ -1410,7 +1422,7 @@ func (r *Raft) appendEntries(rpc RPC, a *AppendEntriesRequest) {
"previous-index", a.PrevLogEntry,
"last-index", lastIdx,
"error", err)
resp.NoRetryBackoff = true
resp.NoRetryBackoff = hasVote
Contributor Author

Not sure about this change; I observed in the test that a newer term (unseen by the node) is sent. This change ensures that we retry with an older log until we get a log that the node is able to apply.

Member

Hmm, yeah, changing the "catchup" behaviour for a non-voter doesn't seem like the right thing to do here at first glance.

return
}
prevLogTerm = prevLog.Term
79 changes: 79 additions & 0 deletions raft_test.go
@@ -2644,6 +2644,85 @@ func TestRaft_VoteNotGranted_WhenNodeNotInCluster(t *testing.T) {
}
}

func TestRaft_StabilityIsKept_WhenNonVoterWithHigherTermJoin(t *testing.T) {
// Make a cluster
c := MakeCluster(3, t, nil)

defer c.Close()

// Get the leader
leader := c.Leader()

// Wait until we have 2 followers
limit := time.Now().Add(c.longstopTimeout)
var followers []*Raft
for time.Now().Before(limit) && len(followers) != 2 {
c.WaitEvent(nil, c.conf.CommitTimeout)
followers = c.GetInState(Follower)
}
if len(followers) != 2 {
t.Fatalf("expected two followers: %v", followers)
}

// Remove a follower
followerRemoved := followers[0]
future := leader.RemoveServer(followerRemoved.localID, 0, 0)
if err := future.Error(); err != nil {
t.Fatalf("err: %v", err)
}

// Set that follower's term to a higher value to simulate a partition
followerRemoved.setCurrentTerm(leader.getCurrentTerm() + 100)

// Add the node back as a nonvoter
future = leader.AddNonvoter(followerRemoved.localID, followerRemoved.localAddr, 0, 0)
if err := future.Error(); err != nil {
t.Fatalf("err: %v", err)
}
// Wait a while
time.Sleep(c.propagateTimeout * 100)

// Write some logs to ensure they replicate
for i := 0; i < 100; i++ {
future := leader.Apply([]byte(fmt.Sprintf("test%d", i)), 0)
if err := future.Error(); err != nil {
t.Fatalf("[ERR] apply err: %v", err)
}
}
c.WaitForReplication(100)

// Check leader stable
newLeader := c.Leader()
if newLeader.leaderID != leader.leaderID {
t.Fatalf("leader changed")
}

// Remove the server and add it back as a voter
future = leader.RemoveServer(followerRemoved.localID, 0, 0)
if err := future.Error(); err != nil {
t.Fatalf("err: %v", err)
}
leader.AddVoter(followerRemoved.localID, followerRemoved.localAddr, 0, 0)

// Wait a while
time.Sleep(c.propagateTimeout * 10)

// Write some logs to ensure they replicate
for i := 100; i < 200; i++ {
future := leader.Apply([]byte(fmt.Sprintf("test%d", i)), 0)
if err := future.Error(); err != nil {
t.Fatalf("[ERR] apply err: %v", err)
}
}
c.WaitForReplication(200)

// Check leader stable
newLeader = c.Leader()
if newLeader.leaderID != leader.leaderID {
t.Fatalf("leader changed")
}
}

// TestRaft_FollowerRemovalNoElection ensures that a leader election is not
// started when a standby is shut down and restarted.
func TestRaft_FollowerRemovalNoElection(t *testing.T) {
2 changes: 1 addition & 1 deletion replication.go
@@ -233,7 +233,7 @@ START:
appendStats(string(peer.ID), start, float32(len(req.Entries)))

// Check for a newer term, stop running
if resp.Term > req.Term {
if resp.Term > req.Term && s.peer.Suffrage != Nonvoter {
Member

This seems like the real fix to prevent disruption to the leader and I think is safe and correct.

Contributor Author

Yes, actually that was the first change I made, but on its own it won't help the non-voter effectively join the cluster, as it will keep rejecting the replication forever.

Contributor

Why this instead of Suffrage == Voter?

r.handleStaleTerm(s)
return true
}