fix: don't replicate snapshot if member already has the latest snapshot #9824

Merged (2 commits) on Jul 18, 2022

Conversation

lenaschoenburg
Member

@lenaschoenburg lenaschoenburg commented Jul 15, 2022

Description

This fixes an issue where a leader becomes stuck in a snapshot replication loop if the leader's log starts right after the snapshot index. After successfully installing a snapshot, the leader attempts to seek back to the log entry before the snapshot, which fails if the log starts after the snapshot. When that happens, the member context has no current entry until the next append request is sent, so we must not use the member's current entry to decide whether to send a snapshot. Instead, we can bail out early and decide not to send a snapshot when the member already has the latest snapshot.
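To illustrate the decision described above, here is a minimal sketch. It is a simplified, hypothetical model and not the actual Zeebe code: the class name, method names, parameters, and the convention that a missing current entry counts as position 0 are all assumptions made for this example.

```java
import java.util.OptionalLong;

// Hypothetical, simplified model of the decision; these are not the real Zeebe classes.
final class SnapshotReplicationSketch {

  // Decision based only on the member's position in the leader's log.
  // Assumption: a missing current entry is treated as position 0.
  static boolean positionOnlyDecision(
      OptionalLong memberCurrentEntryIndex, long leaderFirstLogIndex) {
    long memberPosition = memberCurrentEntryIndex.orElse(0L);
    // The next entry to send is memberPosition + 1. If the leader's log starts
    // after that index, the entry is gone and only a snapshot can help.
    return leaderFirstLogIndex > memberPosition + 1;
  }

  // Same decision, but with the guard from this PR's description in front:
  // bail out when the member already has the latest snapshot.
  static boolean shouldReplicateSnapshot(
      long memberSnapshotIndex,
      OptionalLong memberCurrentEntryIndex,
      OptionalLong latestSnapshotIndex,
      long leaderFirstLogIndex) {
    if (!latestSnapshotIndex.isPresent()) {
      return false; // the leader has no snapshot to send
    }
    if (memberSnapshotIndex >= latestSnapshotIndex.getAsLong()) {
      return false; // member already has the latest snapshot, never re-install it
    }
    return positionOnlyDecision(memberCurrentEntryIndex, leaderFirstLogIndex);
  }
}
```

With a snapshot at index 10, a leader log starting at index 11, and a member that has just installed snapshot 10 but has no current entry, `positionOnlyDecision` keeps answering "send a snapshot" on every attempt, whereas `shouldReplicateSnapshot` returns `false` and the loop is avoided.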

Related issues

closes #9820

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug, then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; if that fails, you need to create the backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

@github-actions
Contributor

github-actions bot commented Jul 15, 2022

Unit Test Results

   792 files  ±  0     792 suites  ±0   1h 41m 57s ⏱️ + 6m 5s
5 978 tests +72  5 969 ✔️ +72  9 💤 ±0  0 ±0 
6 147 runs  +72  6 138 ✔️ +72  9 💤 ±0  0 ±0 

Results for commit df0d417. ± Comparison against base commit 14e98f8.

♻️ This comment has been updated with latest results.

@lenaschoenburg lenaschoenburg force-pushed the 9820-fix-snapshot-replication-loop branch from 13a3a62 to df0d417 Compare July 18, 2022 07:40
@lenaschoenburg lenaschoenburg marked this pull request as ready for review July 18, 2022 07:47
Member

@npepinpe npepinpe left a comment


👍

If I understand correctly, the leader will keep track of each member's position to know which entry to send to them next. When we've replicated a snapshot at X, then we reset the member's context to X+1 (which means pointing at X such that the next entry is X+1). If there's no X, then we set the member's current index to 0? It seems a little strange to me, but I guess we really cannot set it to X 😄 Or would it be "OK" for the context to have X, as we treat the snapshot as that first entry? 💭 I wonder if it would make sense to have a different view in this context.

At any rate, just thinking out loud. Good catch!

@lenaschoenburg
Member Author

Yep, that's summarized well I think 👍
I'd rather fix it differently (similar to how you described it) too. We could either:

  • Track the current index separately (set it whenever we set currentEntry, but also when installing a snapshot).
  • Make currentEntry more flexible, so that it can be either an IndexRaftLogEntry or a snapshot.

Either option seems to require a lot of changes in RaftMemberContext, so I'd rather fix this by just re-adding this check. The check makes sense anyway: we never want to re-install a snapshot.
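For illustration only, the first option (tracking the index separately from the entry) could look roughly like the sketch below. This is not the actual RaftMemberContext API: the class, its fields, and its method names are hypothetical and exist only to show the idea.

```java
// Hypothetical stand-in for the relevant part of a member context; all names are assumptions.
final class MemberPositionSketch {
  private long currentIndex;        // tracked independently of any log entry
  private boolean hasCurrentEntry;  // whether a log entry backs the current index

  // Called whenever the current entry is set: the index follows the entry.
  void onEntrySet(long entryIndex) {
    hasCurrentEntry = true;
    currentIndex = entryIndex;
  }

  // Called after a snapshot was installed on the member. The position moves to the
  // snapshot index even if no log entry exists there. If seeking to that entry
  // succeeds afterwards, the caller would follow up with onEntrySet.
  void onSnapshotInstalled(long snapshotIndex) {
    hasCurrentEntry = false;
    currentIndex = snapshotIndex;
  }

  long currentIndex() {
    return currentIndex;
  }

  boolean hasCurrentEntry() {
    return hasCurrentEntry;
  }
}
```

In this shape, the member's position stays at the snapshot index after installation, so a position-based check would no longer see the member as being at the start of the log.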

@lenaschoenburg
Member Author

bors r+

zeebe-bors-camunda bot added a commit that referenced this pull request Jul 18, 2022
9824: fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=oleschoenburg

## Description

This fixes an issue where a leader becomes stuck in a snapshot replication loop if the leader's log starts right after the snapshot index. After successfully installing a snapshot, the leader attempts to seek back to the log entry before the snapshot, which fails if the log starts after the snapshot. When that happens, the member context has no current entry until the next append request is sent, so we must not use the member's current entry to decide whether to send a snapshot. Instead, we can bail out early and decide not to send a snapshot when the member already has the latest snapshot.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #9820 



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@zeebe-bors-camunda
Contributor

Build failed:

@lenaschoenburg
Member Author

bors retry

@zeebe-bors-camunda
Contributor

Build failed:

@lenaschoenburg
Member Author

bors retry

@zeebe-bors-camunda
Contributor

Build succeeded:

@backport-action
Collaborator

Successfully created backport PR #9830 for stable/1.3.

@backport-action
Collaborator

Successfully created backport PR #9831 for stable/8.0.

zeebe-bors-camunda bot added a commit that referenced this pull request Jul 18, 2022
9831: [Backport stable/8.0] fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=backport-action

# Description
Backport of #9824 to `stable/8.0`.

relates to #9820

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Jul 18, 2022
9830: [Backport stable/1.3] fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=backport-action

# Description
Backport of #9824 to `stable/1.3`.

relates to #9820

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@korthout
Member

korthout commented Aug 2, 2022

@oleschoenburg I think this should've been added to the release notes.

Development

Successfully merging this pull request may close these issues.

Leaders with no log before snapshot get stuck in a loop when replicating the snapshot
4 participants