
Document some Apply error cases better #561

Merged
merged 3 commits into main from banks-patch-1 on Sep 25, 2023
Conversation

banks (Member) commented Jul 3, 2023

It isn't clear what ErrLeadershipLost means in terms of write success or failure. This aims to clarify that it's indeterminate and there is nothing stronger we can say about that in general.

It could be possible to avoid this case in some types of leadership loss or shutdown, but we can't make a stronger guarantee in all cases so there will always be at least some cases where the result is "we don't know if this committed or not".

For example, if the leader in a three node cluster has been partitioned from both followers, it will eventually lose leadership. When it does, any in-flight commands are in an indeterminate state. If they have been received and written to disk by at least one follower, then it is likely that that follower will become the new leader and "commit" them in its new term. If none of the followers has received the request before the partition then the write will be lost completely. Since the partitioned leader can't communicate with the followers, it can't find out which outcome is right and so must return ErrLeadershipLost, which implies "I don't have enough information to say whether this committed or not, but I can't make progress as I'm not the leader any more".
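
To make the caller's side of this concrete, here is a rough sketch of how a client might treat the different results of raft.Apply. It is purely illustrative - the helper name, timeout and error handling are assumptions, not something this change adds:

package example

import (
    "errors"
    "fmt"
    "time"

    "github.com/hashicorp/raft"
)

// proposeWrite is a hypothetical helper: r is the application's *raft.Raft
// node and cmd is an already-serialized application command.
func proposeWrite(r *raft.Raft, cmd []byte) (interface{}, error) {
    future := r.Apply(cmd, 5*time.Second)
    if err := future.Error(); err != nil {
        if errors.Is(err, raft.ErrLeadershipLost) {
            // Indeterminate: the entry may or may not have committed on a
            // majority. Blindly retrying risks applying the command twice.
            return nil, fmt.Errorf("write outcome unknown: %w", err)
        }
        // Other errors such as raft.ErrNotLeader mean this node never
        // accepted the command, so retrying against the current leader is safe.
        return nil, err
    }
    // A nil error means the entry was committed and applied to this leader's
    // FSM; Response() is whatever the FSM returned.
    return future.Response(), nil
}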

banks requested a review from a team as a code owner July 3, 2023 19:40
banks requested review from rboyer and removed request for a team July 3, 2023 19:40
banks (Member, Author) commented Jul 3, 2023

@otoolep does this address your question on hashicorp/consul#2893? Open to suggestions about improvements to the docs or code.

I think, though, that in general there is no way to make a stronger guarantee than this in all cases.

otoolep (Contributor) commented Jul 6, 2023

Before I comment on this, I want to be as clear as I can on what is going on. I re-read the Raft paper, to refresh my memory on all this.

If they have been received and written to disk by at least one follower

This implies that the leader that is about to "lose leadership" received a "true" response from at least one follower for the given log command. As a result 2 nodes of the 3 node cluster (the leader and the follower) consider the log committed. What would normally happen next is that each node would then apply the committed log to the state machine. So it sounds like what you're saying is that in the narrow window of time between "commit index" being updated and "applied index" being updated, the leader loses leadership, and falls back to being a follower. Is that what you're saying? If that's the case, the new leader comes up and starts interrogating nodes to determine which logs are in a majority of cluster nodes -- and finds the command we just spoke about, and then each node applies it to its own FSM.

This seems reasonable. I do want to confirm, however, that the Hashicorp implementation will only apply the command once to each node's FSM, even in this scenario. Is that the case here?

otoolep (Contributor) commented Jul 6, 2023

I know what I'm suspicious about. It's this statement:

// If the node discovers it is no longer the leader while applying the command,
// it will return ErrLeadershipLost. 

"Applying the command" implies that the leader already considers the log committed. If it did not consider the log committed, it should never be in the business of applying it. So which is it? Is leadership lost during replication, or is leadership lost during application of committed logs?

otoolep (Contributor) commented Jul 7, 2023

Thinking about this more, I think it would be helpful if ErrLeadershipLost was more precise about what it means. For example does it mean:

  • The Leader appended the command to the log
  • The Leader sent out AppendEntries RPCs to the followers
  • It lost Leadership before it received sufficient responses from those RPCs to be sure the log was committed.

(the comment for ErrLeadershipLost makes it sound like this is the case).

OR

  • It did receive sufficient responses to the AppendEntries RPCs that it considered the log committed, but lost leadership while it was applying the log to the FSM.

However this second option doesn't make sense, since every node applies every committed command to its own FSM, and Leadership state has nothing to do with applying to the FSM.

Of course, the Raft function Apply() actually wraps up these two operations in one go, and that's what makes it a bit confusing, but they are two separate operations.

I think the fundamental doc change for this should be something as follows:

// ErrLeadershipLost is returned when a leader cannot confirm that a log entry was
// committed because it's been deposed while waiting for confirmation from the Followers
// that the log was replicated. When this error is returned a client cannot be certain if the log
// was replicated successfully or not. See section 8 of the Raft paper to read more about this
// type of scenario.
ErrLeadershipLost = errors.New("leadership lost while committing log")

This whole situation seems similar to the following: a client is talking to a Raft system, the Leader commits and applies the request successfully, but the response from the Leader back to the client is dropped (perhaps the Leader crashes just after applying the log). The client is truly uncertain about whether the request was successful. This case is called out in the Raft paper (see section 8).

But the most important thing is that the system remains safe and correct, and each node will always agree on the committed log entries in the log.

otoolep (Contributor) commented Jul 7, 2023

Related issue on the rqlite GitHub repo: rqlite/rqlite#1123

banks (Member, Author) commented Jul 17, 2023

Thanks for the detailed thoughts @otoolep. I think you've called out a few other good points of confusion. I'll see if I can pick out the questions in what you posted first and then we can see what the best approach is here.

If they have been received and written to disk by at least one follower
This implies that the leader that is about to "lose leadership" received a "true" response from at least one follower for the given log command.

Hmm, not quite; it's actually a bit more subtle than that. For example a leader might write to its disk and then replicate a log to 2 followers, and they might both also write it to disk - at this point the log is "committed" in the cluster, but the leader doesn't yet know it. If the leader is partitioned from both followers right after it sent the AppendEntries message it will not know whether the log actually committed or not, but it can't just keep waiting forever to find out because then the cluster won't make progress when there are network partitions. So it has to return an error that says "I'm not the leader any more, but I don't know if this operation was committed or not."

If it does receive at least one ack for the AppendEntries (assuming a 3 node cluster) then it does know that it's committed, and so it could (but doesn't currently) return a different error. But the thing is, it still hasn't been a successful write in the sense that a subsequent leader-only read might not see the result of the write, as it's not yet been applied to the FSM, and we rely on the "leader lease" optimization to allow read-after-write consistency on the leader without reads always going to a quorum of nodes.

So it sounds like what you're saying is in the narrow window of time between "commit index" being updated and "applied index" being updated, the leader loses leadership, and falls back to being follower. Is that what you're saying?

Almost. You described one particular case, but the other is, like I mentioned above, where the leader might not even have observed enough acks to know it was committed yet at the point it has to step down. We could treat these all differently but so far haven't seen a case where the client could make use of that distinction in our implementations. We still can't say it was a "successful" write because we define that to be one which can immediately be observed by reading the result from the leader, which is only true after the FSM apply has happened as well.
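
As an aside, if reading your own write back from the leader is what a client ultimately cares about, one way to get that is to issue a Barrier before the local read, which blocks until everything committed so far has been applied to this node's FSM. A minimal sketch, assuming r is the leader's *raft.Raft and readLocal reads the application's own state (both assumptions, not part of this change):

package example

import (
    "time"

    "github.com/hashicorp/raft"
)

// readAfterWrite is illustrative only. Barrier blocks until all preceding
// committed entries have been applied to this leader's FSM, and returns an
// error if, for example, leadership is lost in the meantime.
func readAfterWrite(r *raft.Raft, readLocal func() (string, error)) (string, error) {
    if err := r.Barrier(5 * time.Second).Error(); err != nil {
        return "", err
    }
    return readLocal()
}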

I do want to confirm that the Hashicorp implementation will only apply the command once however, to each FSM under each node, even in this scenario. Is that the case here?

Yes, Raft guarantees that each op that is committed in the cluster will be applied exactly once to each FSM, and in the same order.

What we don't make any guarantees about, though, is what happens if your client gets an ErrLeadershipLost and retries the write on a different node: the write might end up "happening twice" if the original one actually did commit. If that is an issue for you then you'd need to implement some form of idempotency checking in your application (e.g. include a unique ID or "version" field in each write operation and have your FSM silently ignore duplicates that it has already seen).
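
For illustration, a bare-bones sketch of that idea. The Command type, its fields and dedupFSM are made up for this example, and Snapshot/Restore (which a real raft.FSM also needs, including persisting the seen set) are omitted:

package example

import (
    "encoding/json"

    "github.com/hashicorp/raft"
)

// Command is a hypothetical application write; ID is chosen by the client,
// is unique per logical operation, and is reused on retries.
type Command struct {
    ID    string `json:"id"`
    Key   string `json:"key"`
    Value string `json:"value"`
}

type dedupFSM struct {
    seen map[string]bool   // IDs already applied
    data map[string]string // the application state
}

// Apply makes duplicate deliveries of the same logical write (e.g. a client
// retry after ErrLeadershipLost) a no-op.
func (f *dedupFSM) Apply(l *raft.Log) interface{} {
    var cmd Command
    if err := json.Unmarshal(l.Data, &cmd); err != nil {
        return err
    }
    if f.seen[cmd.ID] {
        return nil // already applied once; ignore the duplicate
    }
    f.seen[cmd.ID] = true
    f.data[cmd.Key] = cmd.Value
    return nil
}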

"Applying the command" implies that the leader already considers the log committed.

Ah yes, I see why this is confusing. I was going to explain but you already figured it out:

Of course, the Raft function Apply() actually wraps up these two operations in one go, and that's what makes it a bit confusing, but they are two separate operations.

We use Apply in two different ways: the raft.Apply sense of the whole write being outstanding regardless of what stage of the protocol it reached, vs. the FSM.Apply sense of "it is already committed so now we apply it to the FSM".

// If the node discovers it is no longer the leader while applying the command,
// it will return ErrLeadershipLost. 

So this ^ is using "applying" in the more general sense of "has called raft.Apply", which could be confusing. I think this is something we can improve!

Thinking about this more, I think it would be helpful if ErrLeadershipLost was more precise about what it means. For example does it mean: ...

I think this is the crux of the issue. Thanks for calling it out explicitly.

The thing is, right now it means (almost) any one of the situations you described, and some others. Technically it means "this request was accepted from the chan by the leader loop but some time before it was fully committed and applied to the FSM we lost leadership. It may or may not have been committed. If committed it may or may not have been applied to the new leader's FSM yet; if not committed it may or may not commit in the future and later be applied to the new leader's FSM".

I think it would be possible to distinguish some of these cases but I'm not convinced it's a good idea for the reasons hinted at above - even if the leader has not seen it commit yet, that doesn't mean it's not committed, so the client can't safely take a different action in that case anyway. In fact, even if it's not technically committed (because no follower wrote it to disk yet) it might still commit in the future (because at least one follower has the packets in its recv buffer and will process them before noticing the connection has been closed).

Similarly, if it has committed but hasn't been applied to the leader's FSM yet (since that is asynchronous from updating the commit index), we could return an error that indicates that we know it's committed, but this is not the same as "OK" since no node has actually applied it to its FSM, so a subsequent read to this old leader or the new leader might not reflect the change.

About the only useful distinction might be that a client that doesn't care about reading the value back right away - as long as it's been persisted and will eventually be available on all correct replicas - could know not to retry... But since all clients might need to retry for other cases and will need to correctly handle both committed and uncommitted cases, I don't think it would ever be useful to actually have different behaviour in that one case.

Do you see cases where that is important and would be a significant advantage to clients?

However this second option doesn't make sense, since every node applies every committed command to the its own FSM, and Leadership state has nothing to do with applying to the FSM.

Right, applying to the FSM is not something that is exclusive to leaders, and I think it would not stop due to loss of leadership. Once the leader processes enough responses to know it's committed, it will push the log onto the FSM apply chan before it gets a chance to loop again and decide to step down. In that case, when it does step down it still doesn't know if the FSM actually applied it or not, but it certainly will, likely even before this node becomes a follower and starts replicating from the new leader. Again, it's hard to reason about how a client would make good use of that information with all these subtleties!

This whole situation seems similar to the following situation: a client is talking to a Raft system, the Leader commits and applies the request successfully, but the response from the Leader back to the client is dropped (perhaps the Leader crashes just after applying the log). The client is truly uncertain about whether the request was successful. This case is called out in the Raft paper (see section 8).

Yep exactly that. Great call out.

I really like your proposed wording. I'll update this PR to use that.

otoolep (Contributor) commented Jul 17, 2023

If that is an issue for you then you'd need to implement some form of idempotency checking in your application (e.g. include a unique ID or "version" field in each write operation and have your FSM silently ignore duplicates that it has already seen).

That's what I did. Since I only hit this issue in my testing, I changed my test data so that the inserted records include explicit primary keys. When this error occurs, my test resends the record. Thanks to the use of explicit primary keys, all inserts are now idempotent, and my test case (SELECT COUNT(*)) always gets the right value now (modulo any real regressions of course!). My test is solid now.

ncabatoff merged commit a3a1c10 into main Sep 25, 2023
6 checks passed
ncabatoff deleted the banks-patch-1 branch September 25, 2023 19:21