Leaders with no log before snapshot get stuck in a loop when replicating the snapshot #9820
Labels
area/reliability
Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
kind/bug
Categorizes an issue or PR as a bug
scope/broker
Marks an issue or PR to appear in the broker section of the changelog
severity/high
Marks a bug as having a noticeable impact on the user with no known workaround
support
Marks an issue as related to a customer support request
version:1.3.13
version:8.1.0-alpha4
version:8.1.0
Marks an issue as being completely or in parts released in 8.1.0
Comments
oleschoenburg added the kind/bug, scope/broker, severity/high, support, and area/reliability labels on Jul 15, 2022
zeebe-bors-camunda bot added a commit that referenced this issue on Jul 18, 2022:
9824: fix: don't replicate snapshot if member already has the latest snapshot
r=oleschoenburg a=oleschoenburg
## Description
This fixes an issue where a leader becomes stuck in a snapshot replication loop if the leader's log starts right after the snapshot index. After successfully installing a snapshot, the leader attempts to seek back to the log entry before the snapshot, which fails if the log starts after the snapshot. When that happens, the member context has no current entry until the next append request is sent, so we must not use the current entry of the member to decide whether to send a snapshot. Instead, we can bail out and decide not to send a snapshot when the member already has the latest snapshot.
## Related issues
closes #9820
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
This was referenced Jul 19, 2022
Zelldon added the version:8.1.0 label on Oct 4, 2022
Describe the bug
When a leader's first log entry is the next entry after the snapshot, the leader will get stuck in a snapshot replication loop, sending the same snapshot over and over again. The follower won't receive regular appends and will not catch up to the leader.
To Reproduce
Given this scenario:
Leader has a snapshot with index X, and a log that starts at index X+1.
After finishing snapshot replication to a member, the member's position is reset to X+1, the position right after the snapshot. That causes a seek to the previous position, X, followed by a read of the next entry, X+1. The position X does not exist (the log starts at X+1), which means that the member context has no current entry and thus currentIndex = 0. As a result, shouldReplicateSnapshot decides to replicate the snapshot again, closing the loop.
Expected behavior
After successful snapshot replication, regular event replication should continue and the same snapshot shouldn't be sent over and over again.
This is a valid scenario that has worked previously, but we removed one condition from shouldReplicateSnapshot that prevented repeated snapshot replication: 6ffc0cb
Support Case: SUPPORT-13931
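To make the loop from "To Reproduce" easier to follow, here is a minimal, self-contained sketch of the decision, not the actual broker code. The class, the MemberState type, the countSnapshotSends helper, and the exact comparison inside both methods are assumptions made for illustration. Only the names shouldReplicateSnapshot and currentIndex, and the idea of bailing out when the member already has the latest snapshot, come from this issue and the linked fix.

```java
/** Hypothetical model of the leader's view of one follower; not Zeebe's real classes. */
final class SnapshotReplicationSketch {

  /** Follower state as tracked by the leader (illustrative names). */
  static final class MemberState {
    final long currentIndex;   // 0 when the member context has no current entry
    final long snapshotIndex;  // index of the last snapshot the member installed

    MemberState(long currentIndex, long snapshotIndex) {
      this.currentIndex = currentIndex;
      this.snapshotIndex = snapshotIndex;
    }
  }

  /** Assumed shape of the decision without the removed condition: looks at currentIndex only. */
  static boolean shouldReplicateSnapshotBuggy(MemberState member, long leaderSnapshotIndex) {
    // With currentIndex == 0 the member always looks like it is behind the snapshot.
    return member.currentIndex < leaderSnapshotIndex;
  }

  /** Assumed shape of the fixed decision: bail out if the member has the latest snapshot. */
  static boolean shouldReplicateSnapshotFixed(MemberState member, long leaderSnapshotIndex) {
    if (member.snapshotIndex >= leaderSnapshotIndex) {
      return false;
    }
    return member.currentIndex < leaderSnapshotIndex;
  }

  /** Counts snapshot sends until regular appends could start, capped at maxRounds. */
  static int countSnapshotSends(boolean useFix, long leaderSnapshotIndex, int maxRounds) {
    MemberState member = new MemberState(0, 0); // fresh follower, nothing installed yet
    int sends = 0;
    for (int round = 0; round < maxRounds; round++) {
      boolean replicate =
          useFix
              ? shouldReplicateSnapshotFixed(member, leaderSnapshotIndex)
              : shouldReplicateSnapshotBuggy(member, leaderSnapshotIndex);
      if (!replicate) {
        break; // regular event replication would continue from here
      }
      sends++;
      // The member installed snapshot X. The leader's log starts at X+1, so the seek to X
      // finds nothing and the member context is left without a current entry.
      member = new MemberState(0, leaderSnapshotIndex);
    }
    return sends;
  }

  public static void main(String[] args) {
    System.out.println("sends without bail-out: " + countSnapshotSends(false, 100, 5));
    System.out.println("sends with bail-out: " + countSnapshotSends(true, 100, 5));
  }
}
```

In this model the variant without the bail-out keeps sending the snapshot until the round cap is reached, while the variant with the bail-out sends it once and then stops, which matches the expected behavior described above.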