
Leaders with no log before snapshot get stuck in a loop when replicating the snapshot #9820

Closed
oleschoenburg opened this issue Jul 15, 2022 · 0 comments · Fixed by #9824
Labels
- area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
- kind/bug: Categorizes an issue or PR as a bug
- scope/broker: Marks an issue or PR to appear in the broker section of the changelog
- severity/high: Marks a bug as having a noticeable impact on the user with no known workaround
- support: Marks an issue as related to a customer support request
- version:1.3.13
- version:8.1.0-alpha4
- version:8.1.0: Marks an issue as being completely or in parts released in 8.1.0

Comments

oleschoenburg commented Jul 15, 2022

Describe the bug

When a leader's first log entry is the next entry after the snapshot, the leader will get stuck in a snapshot replication loop, sending the same snapshot over and over again. The follower won't receive regular appends and will not catch up to the leader.

To Reproduce

Given this scenario:
Leader has a snapshot with index X, and a log that starts at index X+1.

After finishing snapshot replication to a member, the member's position is reset to the snapshot position X + 1. This triggers a seek for the previous position, X, followed by a read of the next entry, X + 1. Position X does not exist (the log starts at X + 1), so the member context has no current entry and thus currentIndex = 0. As a result, shouldReplicateSnapshot decides to replicate the snapshot again, closing the loop.
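The loop can be sketched as follows. This is a minimal illustration of the decision logic described above, not Zeebe's actual implementation; the method and parameter names (seekCurrentIndex, logStartIndex, memberCurrentIndex) are hypothetical:

```java
// Sketch of the buggy snapshot-replication decision, assuming:
// snapshot at index X, log starting at X + 1, and a member context that
// loses its current entry when a seek misses.
final class SnapshotReplicationLoop {

  /**
   * Models the seek after snapshot replication: seek to {@code seekPosition},
   * then read the next entry. If the position is not in the log (the log
   * starts after the snapshot), the member context has no current entry,
   * modeled here as index 0.
   */
  static long seekCurrentIndex(long seekPosition, long logStartIndex) {
    return seekPosition >= logStartIndex ? seekPosition + 1 : 0;
  }

  /**
   * The buggy check: replicate whenever the member's current index lags
   * behind the snapshot index — which is always true when currentIndex == 0.
   */
  static boolean shouldReplicateSnapshot(long memberCurrentIndex, long snapshotIndex) {
    return memberCurrentIndex < snapshotIndex;
  }

  public static void main(String[] args) {
    long snapshotIndex = 100; // snapshot at index X
    long logStartIndex = 101; // log starts at X + 1

    // After replicating the snapshot, the leader seeks to the previous
    // position X, which does not exist, so the member's currentIndex is 0.
    long currentIndex = seekCurrentIndex(snapshotIndex, logStartIndex);

    // currentIndex (0) < snapshotIndex (100), so the snapshot is sent again.
    System.out.println(shouldReplicateSnapshot(currentIndex, snapshotIndex)); // prints "true"
  }
}
```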

Expected behavior

After successful snapshot replication, regular event replication should continue and the same snapshot shouldn't be sent over and over again.

This is a valid scenario that worked previously, but commit 6ffc0cb removed a condition from shouldReplicateSnapshot that prevented repeated snapshot replication:

```java
if (member.getSnapshotIndex() >= persistedSnapshot.getIndex()) {
  return false;
}
```

Support Case: SUPPORT-13931

@oleschoenburg added the kind/bug, scope/broker, severity/high, support, and area/reliability labels on Jul 15, 2022
@oleschoenburg self-assigned this on Jul 15, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Jul 18, 2022
9824: fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=oleschoenburg

## Description

This fixes an issue where a leader becomes stuck in a snapshot replication loop if the leader's log starts right after the snapshot index. After successfully installing a snapshot, the leader attempts to seek back to the log entry before the snapshot, which fails if the log starts after the snapshot. When that happens, the member context has no current entry until the next append request is sent, so we must not use the member's current entry to decide whether to send a snapshot. Instead, we can bail out and decide not to send a snapshot when the member already has the latest snapshot.
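The restored guard can be sketched like this. The names (memberSnapshotIndex, persistedSnapshotIndex) are illustrative stand-ins for the member state and persisted snapshot, not Zeebe's exact API:

```java
// Sketch of the decision with the guard restored: check the member's
// snapshot index *before* consulting the (possibly absent) current entry.
final class SnapshotReplicationFix {

  /**
   * Returns whether the snapshot should be replicated to the member.
   * If the member already has the latest snapshot, bail out immediately,
   * even if the member context currently has no current entry (index 0).
   */
  static boolean shouldReplicateSnapshot(
      long memberSnapshotIndex, long memberCurrentIndex, long persistedSnapshotIndex) {
    if (memberSnapshotIndex >= persistedSnapshotIndex) {
      return false; // member is up to date; never resend the same snapshot
    }
    return memberCurrentIndex < persistedSnapshotIndex;
  }

  public static void main(String[] args) {
    // Member just installed the snapshot at index 100; the seek failed, so
    // its currentIndex is 0 — but the guard still prevents re-replication.
    System.out.println(shouldReplicateSnapshot(100, 0, 100)); // prints "false"
  }
}
```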

## Related issues


closes #9820 



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 18, 2022
9831: [Backport stable/8.0] fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=backport-action

# Description
Backport of #9824 to `stable/8.0`.

relates to #9820

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 18, 2022
9830: [Backport stable/1.3] fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=backport-action

# Description
Backport of #9824 to `stable/1.3`.

relates to #9820

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@Zelldon added the version:8.1.0 label on Oct 4, 2022