
Leaders with no log before snapshot get stuck in a loop when replicating the snapshot #9820

Closed
oleschoenburg opened this issue Jul 15, 2022 · 0 comments · Fixed by #9824
Labels
- area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
- kind/bug: Categorizes an issue or PR as a bug
- scope/broker: Marks an issue or PR to appear in the broker section of the changelog
- severity/high: Marks a bug as having a noticeable impact on the user with no known workaround
- support: Marks an issue as related to a customer support request
- version:1.3.13
- version:8.1.0-alpha4
- version:8.1.0: Marks an issue as being completely or in parts released in 8.1.0

Comments

oleschoenburg commented Jul 15, 2022

Describe the bug

When a leader's first log entry is the next entry after the snapshot, the leader will get stuck in a snapshot replication loop, sending the same snapshot over and over again. The follower won't receive regular appends and will not catch up to the leader.

To Reproduce

Given this scenario:
Leader has a snapshot with index X, and a log that starts at index X+1.

After finishing snapshot replication to a member, the member's position is reset to the snapshot position X + 1. This triggers a seek for the previous position, X, followed by a read of the next entry, X + 1. Position X does not exist (the log starts at X + 1), so the member context has no current entry and thus currentIndex = 0. As a result, shouldReplicateSnapshot decides to replicate the snapshot again, closing the loop.
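The loop can be sketched as follows. This is a minimal illustration of the decision logic described above, not Zeebe's actual implementation; the method and parameter names (seekCurrentIndex, logStartIndex, memberCurrentIndex) are hypothetical:

```java
// Sketch of the buggy snapshot-replication decision, assuming:
// snapshot at index X, log starting at X + 1, and a member context that
// loses its current entry when a seek misses.
final class SnapshotReplicationLoop {

  /**
   * Models the seek after snapshot replication: seek to {@code seekPosition},
   * then read the next entry. If the position is not in the log (the log
   * starts after the snapshot), the member context has no current entry,
   * modeled here as index 0.
   */
  static long seekCurrentIndex(long seekPosition, long logStartIndex) {
    return seekPosition >= logStartIndex ? seekPosition + 1 : 0;
  }

  /**
   * The buggy check: replicate whenever the member's current index lags
   * behind the snapshot index — which is always true when currentIndex == 0.
   */
  static boolean shouldReplicateSnapshot(long memberCurrentIndex, long snapshotIndex) {
    return memberCurrentIndex < snapshotIndex;
  }

  public static void main(String[] args) {
    long snapshotIndex = 100; // snapshot at index X
    long logStartIndex = 101; // log starts at X + 1

    // After replicating the snapshot, the leader seeks to the previous
    // position X, which does not exist, so the member's currentIndex is 0.
    long currentIndex = seekCurrentIndex(snapshotIndex, logStartIndex);

    // currentIndex (0) < snapshotIndex (100), so the snapshot is sent again.
    System.out.println(shouldReplicateSnapshot(currentIndex, snapshotIndex)); // prints "true"
  }
}
```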

Expected behavior

After successful snapshot replication, regular event replication should continue and the same snapshot shouldn't be sent over and over again.

This is a valid scenario that worked previously, but commit 6ffc0cb removed a condition from shouldReplicateSnapshot that prevented repeated snapshot replication:

```java
if (member.getSnapshotIndex() >= persistedSnapshot.getIndex()) {
  return false;
}
```

Support Case: SUPPORT-13931

@oleschoenburg added the kind/bug, scope/broker, severity/high, support, and area/reliability labels on Jul 15, 2022
@oleschoenburg self-assigned this on Jul 15, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Jul 18, 2022
9824: fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=oleschoenburg

## Description

This fixes an issue where a leader becomes stuck in a snapshot replication loop if the leader's log starts right after the snapshot index. After successfully installing a snapshot, the leader attempts to seek back to the log entry before the snapshot, which fails if the log starts after the snapshot. When that happens, the member context has no current entry until the next append request is sent, so we must not use the member's current entry to decide whether to send a snapshot. Instead, we can bail out and decide not to send a snapshot when the member already has the latest snapshot.
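The restored guard can be sketched like this. The names (memberSnapshotIndex, persistedSnapshotIndex) are illustrative stand-ins for the member state and persisted snapshot, not Zeebe's exact API:

```java
// Sketch of the decision with the guard restored: check the member's
// snapshot index *before* consulting the (possibly absent) current entry.
final class SnapshotReplicationFix {

  /**
   * Returns whether the snapshot should be replicated to the member.
   * If the member already has the latest snapshot, bail out immediately,
   * even if the member context currently has no current entry (index 0).
   */
  static boolean shouldReplicateSnapshot(
      long memberSnapshotIndex, long memberCurrentIndex, long persistedSnapshotIndex) {
    if (memberSnapshotIndex >= persistedSnapshotIndex) {
      return false; // member is up to date; never resend the same snapshot
    }
    return memberCurrentIndex < persistedSnapshotIndex;
  }

  public static void main(String[] args) {
    // Member just installed the snapshot at index 100; the seek failed, so
    // its currentIndex is 0 — but the guard still prevents re-replication.
    System.out.println(shouldReplicateSnapshot(100, 0, 100)); // prints "false"
  }
}
```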

## Related issues


closes #9820 



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 18, 2022
9831: [Backport stable/8.0] fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=backport-action

# Description
Backport of #9824 to `stable/8.0`.

relates to #9820

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 18, 2022
9830: [Backport stable/1.3] fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=backport-action

# Description
Backport of #9824 to `stable/1.3`.

relates to #9820

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@Zelldon added the version:8.1.0 label on Oct 4, 2022