Non voter with higher term should not cause cluster instability #525
Changes from 2 commits
@@ -1375,7 +1375,19 @@ func (r *Raft) appendEntries(rpc RPC, a *AppendEntriesRequest) {
		rpc.Respond(resp, rpcErr)
	}()

	// Ignore an older term.
	// This case can happen when a node that was a voter is added back to the cluster as a
	// non-voter: its term can be higher than the cluster's, but because it is a non-voter
	// that term needs to be discarded in favor of the cluster's current term to keep the
	// cluster stable, since an election should not be started by a node that has no voting
	// rights.
	currentTerm := r.getCurrentTerm()
	hasVote := hasVote(r.getLatestConfiguration(), r.localID)
	if a.Term < currentTerm && !hasVote {
		r.setState(Follower)
		r.setCurrentTerm(a.Term)
Is there any other state we need to set here 🤔 I don't think so, because this is the same as the code below for normal happy-path replication, where we switch to a new leader's term after an election when we get the first. But I know we do persist the identity of the candidate we last voted for, so I wonder if the happy path doesn't update anything because it assumes that we set the identity during the election? Even then that can't be right in the case of a follower coming up and learning about a new leader that it didn't vote for, so again I guess you did the right thing here, but I'm slightly intrigued about how and where we update the state about who our current leader is 🤔. Do you know @dhiaayachi?

It's done right after, in here.
		resp.Term = a.Term
	}
I'm having a hard time reasoning about the correctness of this change. It's a fundamental deviation from the published, model-checked, peer-reviewed Raft algorithm, which makes it significantly risky IMO. I'm going to dream up an example where this gets subtle. It might not be incorrect as we think it through, but maybe it helps us reason about correctness here. Imagine servers A, B, C, D where A is leader in term 1, A-C form a quorum of voters, and D is a non-voter, all in steady state. Now imagine that A is partitioned from the others. For a while it doesn't know this. B and C elect B the new leader in term 2, and D starts following B. But now (still in the window before A notices its leader lease timeout) A reconnects and attempts to send append entries to the peers. B and C will correctly ignore it and tell it to step down, but before it gets their response, D will silently accept it and revert to term 1 (in which A was leader) while still actually being in follower state following B in term 2. Is this state valid? Well, D may now try to append an entry from A in term 1 that isn't actually ever going to commit, because A won't get a quorum for it. What happens next depends on the state of D's log and exactly what was in flight from A when it was partitioned. Let's imagine just one (I think) possible state: A had everything up to index 99 committed before the partition and was just starting to replicate index 100 when it lost connection to all other servers, so it is now retrying to append index 100 in term 1. Meanwhile B has won the election and already replicated index 100 to D in term 2.
Since D's last log will be index 100 but the append will refer back to entry 99, D will now look up its own log entry 99 (which was the correct one back from term 1) and the consistency check will pass. So it will skip right over the term check on line 1431 - effectively D is treating the old leader as the new leader - and in line 1459 below it will find the terms don't match, so it will truncate the good entries from term 2 out of its log and replace them with the ones from the old leader in term 1. Let's say A managed to replicate a few more logs to it in term 1 before noticing that it needs to stand down or getting a response from a voter telling it the term is stale. But while that's happening, B is also sending appendEntries to D in term 2; each time it does, D updates its term, truncates all the "bad" stuff from term 1 out, and requests a replay from the fork point from B. But while that's happening, A might still try to append more stuff, and each time it does, D will truncate the good stuff out, etc. I think this scenario works out OK in the end, because A will eventually step down and stop battling the new leader to append, and when it does, D's log should be reset. IF D were a voter, and so the fact it logged something were taken into account when working out commitment, this kind of flapping would break correctness in general, because an old leader could cause deletion of "committed" logs by the newer leader. Since it's not a voter, it might just about be OK, because the voters' logs will remain consistent and this non-voter will eventually end up consistent (certainly by the time it has learned about what is committed and applied it to its FSM), so even stale reads shouldn't "see" any bad data from an old leader, since it won't be committed.
I don't think there is a scenario where D can end up with a log that's inconsistent with the quorum by the time the entries are committed, but I don't think we should trust my thinking here - that's why Raft and other algorithms are rigorously defined and model checked etc., so straying from that seems risky to me.

Conclusion

I didn't prove this is "wrong" in the sense of user-visible consistency here, but I'm still not sure if it's a good move to alter this behaviour in this way. I think the real fix for overall cluster stability is the other one below, where we just don't let the leader stand down if the node with a higher term is a non-voter. That seems safer to me. I guess without this, though, that change alone leaves the non-voter in a strange state? I guess my question would be whether there is a more correct solution. For example I think the. If not, how likely is this to happen? Could we just crash the non-voter and have an operator delete the non-voter's raft state and start again? I'm trying to think of other options, but it's really hard to reason about - you can't just reset the non-voter state. Even though it's a non-voter now, it might have voted in a term that was counted towards a quorum, so just resetting its term from its last successfully replicated log on restart is quite likely not correct either in some subtle edge case. I think you could avoid the old-leader horribleness above correctly in other ways too. For example, if the non-voter had a record of the last term it saw a stable leader in (which is subtly different from its Term state, since that is incremented when it attempted to become a leader as a voter even though no leader had existed at that term), then it could continue to fence off appendEntries from leaders older than the current stable leader and avoid the issue above, while still allowing a reset from a "spurious" term that it tried an election in before becoming a non-voter? What do you think @dhiaayachi?
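The flip-flop scenario above can be sketched with a toy model. This is purely illustrative code, not the hashicorp/raft implementation: the `entry` and `nonVoter` types and the `scenario` function are invented, and `appendEntries` only models the pieces of follower state relevant to the argument.

```go
package main

import "fmt"

// entry is a toy log entry: an index and the term it was written in.
type entry struct{ index, term uint64 }

// nonVoter models only the follower state relevant to the scenario.
type nonVoter struct {
	currentTerm uint64
	log         []entry
}

// appendEntries sketches the patched follower behaviour: a non-voter
// adopts an older leader term instead of rejecting the RPC, then runs
// the usual consistency check and truncate-and-append.
func (d *nonVoter) appendEntries(leaderTerm, prevIndex, prevTerm uint64, entries []entry) bool {
	if leaderTerm < d.currentTerm {
		d.currentTerm = leaderTerm // the change under review
	}
	if prevIndex > 0 {
		if prevIndex > uint64(len(d.log)) || d.log[prevIndex-1].term != prevTerm {
			return false
		}
	}
	// Truncate any conflicting suffix, then append the leader's entries.
	d.log = append(d.log[:prevIndex], entries...)
	return true
}

// scenario replays the A/B/D example (with small indexes): new leader B
// (term 2) replicates index 2, then stale leader A (term 1) retries index 2.
func scenario() (term, lastLogTerm uint64) {
	d := &nonVoter{currentTerm: 2}
	d.appendEntries(2, 0, 0, []entry{{1, 1}, {2, 2}}) // from new leader B
	d.appendEntries(1, 1, 1, []entry{{2, 1}})         // from old leader A
	return d.currentTerm, d.log[len(d.log)-1].term
}

func main() {
	term, lastLogTerm := scenario()
	// D has reverted to term 1 and its term-2 entry has been truncated away.
	fmt.Println("term:", term, "last log term:", lastLogTerm)
}
```

Running this shows D ending up back in term 1 with the old leader's entry in place of the term-2 one, which is the "truncate the good entries" step of the scenario.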
I'd love to figure out something less risky than deviating from the reviewed Raft protocol here, if possible.
🤔 Actually, that might be the same thing as the term of its last persisted log, so we do have that... but I'm not sure how we'd use it correctly. Above I was thinking we could reset the non-voter's current Term to that on startup to avoid it being disruptive, but I think that's dangerous, as noted above. Perhaps we could use this value in the new check here though, and only downgrade the term if a non-voter AND our. Either way, we should think about what we need to do to convince ourselves any solution we come up with is actually rigorously correct!

For the record, by.

And I think it's not going to help either, because in this case the non-voter isn't disrupting by trying to get votes with a higher term (it's a non-voter, after all) but rather returning it during replication.

So I think the problem can only occur if:
Does that match your understanding, Dhia? Specifically, is what I said in 1 the only possible way to get into this state, or did I miss something? If so, then I think a solution would be:
I reason that this is safe because we might still revert to the term of an old leader and append its entries, but we'll never delete entries from a later term in doing so, which means we are effectively in the same state as if this node had been partitioned along with the old leader for a while before it stepped down, and that state will be corrected as soon as the new leader makes contact. We can't flip-flop: if the new leader made contact first and appended in the new term, then the additional check above will effectively fence out the old leader from modifying our log, and we will proceed normally. What do you think @dhiaayachi?

Thank you for reasoning with me about this @banks! I was trying to go through those scenarios and I agree it's not easy without a formal proof. I ended up reasoning based on the following, using facts that should not vary based on timing or external events:
Based on those assumptions, if the term of a non-voter node is higher than the cluster's current term, this brings us to 2 possible scenarios.
I agree the first scenario is not ideal with the fix, as the non-voter will be in an unstable state until the cluster is restored. But if we run the same scenario without the fix, the non-voter node will in most cases be following the partitioned node, as it will very likely have a higher term, which is not practical in a real-life situation.

I can think of a scenario that's a bit different to the above and breaks my assumption slightly. If the server, while it had been a voter, had been partitioned from the other servers, it might have sat there arbitrarily long attempting elections, and so may have a persisted term that is arbitrarily higher than the stable leader's. It didn't disrupt the leader because it couldn't send it messages. Some time after it restarts as a non-voter with this state, the partition heals and the bug manifests when the current leader attempts to replicate. Note that the rest of the cluster might have been through any number of leader elections in the time before the partition healed (either before or after the node restarted), but as long as the new leader's term is lower than the term the partitioned voter got itself to, the bug can occur. I don't think that changes the correctness of the solution above (because the non-voter's last log term would be lower than the new leader's, which is OK), but I think it's worth understanding the most general way that this state is possible!
That is exactly the bug described in #524 and falls under scenario 2 in my last comment.
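The refinement discussed in this thread (only let a non-voter downgrade its term when the incoming leader's term is at least as new as the term of its own last log entry) could be written as a standalone predicate. This is a hypothetical sketch, not code from this PR; the name `shouldAdoptOlderTerm` and its parameters are invented for illustration.

```go
package main

import "fmt"

// shouldAdoptOlderTerm sketches the extra fence discussed in the review:
// a non-voter only reverts to an older leader's term when that term is at
// least as new as the term of its own last log entry. This lets it shed a
// "spurious" high term from old failed elections, while fencing out a stale
// leader once a newer leader has already written to its log.
func shouldAdoptOlderTerm(reqTerm, currentTerm, lastLogTerm uint64, isVoter bool) bool {
	return !isVoter && reqTerm < currentTerm && reqTerm >= lastLogTerm
}

func main() {
	// Spurious high term (5) from old elections, last log from term 1:
	// safe to adopt the stable leader's term 1.
	fmt.Println(shouldAdoptOlderTerm(1, 5, 1, false))
	// A term-2 leader already wrote to our log: fence out the term-1 leader.
	fmt.Println(shouldAdoptOlderTerm(1, 2, 2, false))
	// Voters never adopt an older term.
	fmt.Println(shouldAdoptOlderTerm(1, 5, 1, true))
}
```

Under this predicate the flip-flop in the partition example cannot occur: as soon as the new leader appends an entry in its term, the non-voter's last log term rises and the old leader is fenced out.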
	// If a node is a voter, ignore an older term.
	if a.Term < r.getCurrentTerm() {
		return
	}
@@ -1410,7 +1422,7 @@ func (r *Raft) appendEntries(rpc RPC, a *AppendEntriesRequest) {
			"previous-index", a.PrevLogEntry,
			"last-index", lastIdx,
			"error", err)
-		resp.NoRetryBackoff = true
+		resp.NoRetryBackoff = hasVote
Not sure about this change; I observed in the test that a newer term (unseen by the node) is sent. This change ensures that we retry with an older log until we get a log that the node is able to apply.

Hmm, yeah, changing the "catchup" behaviour for a non-voter doesn't seem like the right thing to do here at first glance.
		return
	}
	prevLogTerm = prevLog.Term
@@ -233,7 +233,7 @@ START:
	appendStats(string(peer.ID), start, float32(len(req.Entries)))

	// Check for a newer term, stop running
-	if resp.Term > req.Term {
+	if resp.Term > req.Term && s.peer.Suffrage != Nonvoter {
This seems like the real fix to prevent disruption to the leader, and I think it is safe and correct.

Yes, actually that was the first change I made, but on its own it won't help the non-voter effectively join the cluster, as it will continue rejecting the replication forever.

Why this instead of Suffrage == Voter?
		r.handleStaleTerm(s)
		return true
	}
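The leader-side change in this hunk boils down to a small predicate; here is a minimal, self-contained sketch. The `ServerSuffrage` values mirror hashicorp/raft's Voter/Nonvoter/Staging enum, but `shouldStepDown` is an invented name, not the library's API.

```go
package main

import "fmt"

// ServerSuffrage mirrors the raft library's suffrage enum.
type ServerSuffrage int

const (
	Voter ServerSuffrage = iota
	Nonvoter
	Staging
)

// shouldStepDown sketches the patched replication check: a leader only
// treats a higher term in an AppendEntries response as "I am stale, step
// down" when the responding peer actually has suffrage. A lingering high
// term on a non-voter can then no longer destabilise the cluster.
func shouldStepDown(respTerm, reqTerm uint64, suffrage ServerSuffrage) bool {
	return respTerm > reqTerm && suffrage != Nonvoter
}

func main() {
	fmt.Println(shouldStepDown(3, 1, Voter))    // voter reports newer term: step down
	fmt.Println(shouldStepDown(3, 1, Nonvoter)) // non-voter: ignore, keep leading
}
```

Note that, as the last review comment asks, `suffrage != Nonvoter` also triggers step-down for Staging peers, whereas `suffrage == Voter` would not; which of the two is intended for staging servers is exactly the open question in the thread.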
I think a comment here is useful, but perhaps a link to this PR or the issue would be good, because it's super hard to express all the nuance of the issue and why this is correct in a few lines?