
Fix rare leadership transfer failures when writes happen during transfer #581

Merged
merged 7 commits into main from repro-leader-transfer-failure on Dec 4, 2023

Conversation

@ncabatoff (Contributor) commented Nov 23, 2023

The problem I'm trying to fix: after we send the TimeoutNow during a leader transfer, we remain the leader for a little while. During that time we still allow writes, which can cause our chosen target to lose the upcoming election if it doesn't have the highest index at the time it asks for votes.

The fix: wait for up to ElectionTimeout after the TimeoutNow before we allow writes to proceed.
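
For readers skimming the diff, here is a minimal standalone sketch of that idea, using illustrative names (awaitTransferOutcome, respond, leftLeaderLoop as a "no longer leader" signal) rather than the actual raft.go internals; the real change is the select quoted from the diff further down.

package sketch

import (
	"fmt"
	"time"
)

// awaitTransferOutcome sketches the fix: after TimeoutNow has been sent to
// the target, the old leader keeps applies blocked until it either loses
// leadership (the transfer worked) or one ElectionTimeout elapses (give up).
// leftLeaderLoop is assumed to be closed when this node stops being leader,
// and respond completes the LeadershipTransfer future.
func awaitTransferOutcome(electionTimeout time.Duration, leftLeaderLoop <-chan struct{}, respond func(error)) {
	select {
	case <-time.After(electionTimeout):
		// The chosen target never won an election in time; give up so
		// applies can resume rather than staying blocked indefinitely.
		respond(fmt.Errorf("leadership transfer timeout"))
	case <-leftLeaderLoop:
		// Losing leadership is the successful outcome of a transfer.
		respond(nil)
	}
}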

… happening during the transfer. Fix the problem partially: it can still occur if we dispatchLogs after the transfer message has been sent to the target, before it wins the election.
… changes: we weren't properly respecting stopCh.
…tionTimeout before resuming applies. Fix is in scare quotes because I'm not at all sure this is acceptable.
…fail before responding to the transfer request.
…, which is racy in that in between calling pollState and setting up event monitoring, the cluster could've elected a leader. When that happens, Leader() errors returning 0 leaders.
@ncabatoff ncabatoff changed the title Reproduce leadership transfer failures when writes happen during transfer Fix rare leadership transfer failures when writes happen during transfer Nov 24, 2023
@ncabatoff ncabatoff marked this pull request as ready for review November 24, 2023 20:18
@ncabatoff ncabatoff requested a review from a team as a code owner November 24, 2023 20:18
@ncabatoff ncabatoff requested review from DanStough and removed request for a team November 24, 2023 20:18

@banks (Member) left a comment

I think this looks great and don't see any issues.

I did have one question inline about whether an alternative would be easier to read/reason about, but I'm not really convinced either way to be honest, and I think both are effectively equivalent, so not blocking!

Comment on lines +697 to +706
// Wait for up to ElectionTimeout before flagging the
// leadership transfer as done and unblocking applies in
// the leaderLoop.
select {
case <-time.After(r.config().ElectionTimeout):
err := fmt.Errorf("leadership transfer timeout")
r.logger.Debug(err.Error())
future.respond(err)
case <-leftLeaderLoop:
r.logger.Debug("lost leadership during transfer (expected)")

Member

What do you think about just staying in the loop here instead of starting a new ElectionTimeout ticker and having to duplicate the leftLeaderLoop code path too?

I was originally expecting that we'd not wait for another ElectionTimeout, but rather keep waiting for up to the original one we started for the transfer to complete. I don't think it's terrible to do it this way, but it strikes me that it's less code and maybe simpler to reason about if the behaviour is just: block until the leadership transfer works, or one election timeout from handling the request...

The rationale for using ElectionTimeout here presumably was that LeadershipTransfer is only an optimization to avoid just ungracefully stopping and having to wait for an election timeout... so letting the leader transfer take longer than a whole election timeout somewhat defeats the point, and we should probably return fast and let the old leader just shut down if it wants to.

I think that rationale gets a bit more subtle in cases like Autopilot upgrade where we rely on Leadership Transfer working before we move on to the next phase rather than just shutting down anyway.

tl;dr: I don't think the behaviour here is especially bad or wrong, and it beats the bug we had without this wait. If it turns out easier to follow to duplicate the code and make the behaviour more explicit like you have here, I'm OK with it. Just curious whether you tried simply replacing this else branch with continue, which I think is also a correct fix?

Contributor Author

Yes, I tried that. It got messy due to the other cases wanting to read from doneCh. In the end I decided this was clearer.
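
To make the alternative being discussed concrete, a schematic of the "stay in the original loop and continue" shape might look roughly like the following; the channel names and the doneCh case are assumptions standing in for the real leadershipTransfer cases, not the actual code.

package sketch

import (
	"fmt"
	"time"
)

// awaitTransferSingleTimer sketches the suggested variant: one timer, started
// when the transfer request was handled, bounds the whole transfer, and the
// TimeoutNow-delivered case simply continues instead of starting a second
// select with a fresh ElectionTimeout. The doneCh case stands in for the
// "other cases wanting to read from doneCh" that made this shape messy in
// the real code.
func awaitTransferSingleTimer(timeoutCh <-chan time.Time, leftLeaderLoop <-chan struct{}, doneCh <-chan error) error {
	for {
		select {
		case <-timeoutCh:
			return fmt.Errorf("leadership transfer timeout")
		case <-leftLeaderLoop:
			return nil // lost leadership: the transfer worked
		case err := <-doneCh:
			if err != nil {
				return err // sending TimeoutNow to the target failed
			}
			continue // keep waiting on the same timer for the election result
		}
	}
}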

@@ -2337,17 +2337,71 @@ func TestRaft_LeadershipTransferWithOneNode(t *testing.T) {
}
}

func TestRaft_LeadershipTransferWithWrites(t *testing.T) {
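
For orientation, a test of this shape would roughly do the following. This is not the body of TestRaft_LeadershipTransferWithWrites; newTestCluster and its methods are made-up stand-ins for the package's internal test harness (assume the usual testing and time imports), shown only to illustrate the structure.

func TestLeadershipTransferWithWritesSketch(t *testing.T) {
	// newTestCluster is a hypothetical helper; assume it builds a 3-node
	// in-memory cluster and Leader() returns the node currently leading it.
	c := newTestCluster(t, 3)
	defer c.Close()

	oldLeader := c.Leader()

	// Keep writes flowing while the transfer is in flight; this is the
	// traffic that used to cost the chosen target the ensuing election.
	stop := make(chan struct{})
	go func() {
		for {
			select {
			case <-stop:
				return
			default:
				oldLeader.Apply([]byte("noop"), time.Second)
			}
		}
	}()

	// Ask the current leader to hand off leadership and wait for the result.
	if err := oldLeader.LeadershipTransfer().Error(); err != nil {
		t.Fatalf("leadership transfer failed: %v", err)
	}
	close(stop)

	if c.Leader() == oldLeader {
		t.Fatal("expected leadership to move to a different node")
	}
}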

Member

Do you have anecdata about roughly how frequently this reproduced a bug before your fix locally? I assume it's still non-deterministic?

Contributor Author

I couldn't give you numbers, but it didn't take many tries to provoke a failure. It can still fail if I up the concurrency enough, e.g. if I run 16 parallel -race instances. But I suspect the same is true for many other tests, as the failures are things like timeouts due to heartbeats not being processed quickly enough.

@ncabatoff ncabatoff merged commit 62eaa1c into main Dec 4, 2023
6 checks passed
@ncabatoff ncabatoff deleted the repro-leader-transfer-failure branch December 4, 2023 14:38
ncabatoff added a commit to hashicorp/vault that referenced this pull request Feb 13, 2024
ncabatoff added a commit to hashicorp/vault that referenced this pull request Feb 16, 2024