fix: don't replicate snapshot if member already has the latest snapshot #9824

lenaschoenburg · 2022-07-15T17:30:21Z

Description

This fixes an issue where a leader becomes stuck in a snapshot replication loop if the leader's log starts right after the snapshot index. After successfully installing a snapshot, the leader attempts to seek back to the log entry before the snapshot, which fails if the log starts after the snapshot. When that happens, the member context has no current entry until the next append request is sent, so we must not use the member's current entry to decide whether to send a snapshot. Instead, we bail out and do not send a snapshot when the member already has the latest snapshot.
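The decision described above can be sketched roughly as follows. This is a simplified illustration and not the code changed in this PR: the class `SnapshotDecision`, its method names, and the convention that a member without a current entry has a current index of 0 are all assumptions made for the example.

```java
/** Hypothetical, simplified model of the leader's decision; not the actual Zeebe code. */
final class SnapshotDecision {

  private SnapshotDecision() {}

  /**
   * Assumed fallback heuristic: the member needs a snapshot when the leader's log no longer
   * contains the entry the member is currently at. A member without a current entry is
   * modelled here as memberCurrentIndex == 0.
   */
  static boolean entriesMissingFor(final long memberCurrentIndex, final long leaderFirstLogIndex) {
    return leaderFirstLogIndex > memberCurrentIndex;
  }

  /** Returns true if the leader should send its snapshot to the member. */
  static boolean shouldReplicateSnapshot(
      final long memberSnapshotIndex,
      final long memberCurrentIndex,
      final long leaderSnapshotIndex,
      final long leaderFirstLogIndex) {
    if (leaderSnapshotIndex <= 0) {
      // The leader has no snapshot, so there is nothing to replicate.
      return false;
    }
    if (memberSnapshotIndex >= leaderSnapshotIndex) {
      // The member already has the latest snapshot: bail out before looking at the
      // member's current entry, which may be missing right after an install.
      return false;
    }
    return entriesMissingFor(memberCurrentIndex, leaderFirstLogIndex);
  }
}
```

With a snapshot at index 10 and a log starting at 11, a member that just installed the snapshot has snapshot index 10 but no current entry. The fallback heuristic alone would ask for another snapshot, while the early return stops the loop.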

Related issues

closes #9820

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

The changes are backwards compatible with previous versions
If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; if that fails, you need to create backports manually.

Testing:

There are unit/integration tests that verify all acceptance criteria of the issue
New tests are written to ensure backwards compatibility with further versions
The behavior is tested manually
The change has been verified by a QA run
The impact of the changes is verified by a benchmark

Documentation:

The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
New content is added to the release announcement
If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

github-actions · 2022-07-15T17:45:27Z

Unit Test Results

792 files ±0, 792 suites ±0, duration 1h 41m 57s ⏱️ (+6m 5s)
5 978 tests (+72): 5 969 ✔️ (+72), 9 💤 (±0), 0 ❌ (±0)
6 147 runs (+72): 6 138 ✔️ (+72), 9 💤 (±0), 0 ❌ (±0)

Results for commit df0d417. ± Comparison against base commit 14e98f8.

♻️ This comment has been updated with latest results.

npepinpe

👍

If I understand correctly, the leader will keep track of each member's position to know which entry to send to them next. When we've replicated a snapshot at X, then we reset the member's context to X+1 (which means pointing at X such that the next entry is X+1). If there's no X, then we set the member's current index to 0? It seems a little strange to me, but I guess we really cannot set it to X 😄 Or would it be "OK" for the context to have X, as we treat the snapshot as that first entry? 💭 I wonder if it would make sense to have a different view in this context.

At any rate, just thinking out loud. Good catch!

lenaschoenburg · 2022-07-18T12:29:13Z

Yep, that's summarized well I think 👍
I'd rather fix it differently (similar to how you described it) too. We could either:

Track the current index separately (set it whenever we set currentEntry but also when installing a snapshot).
Make currentEntry more flexible, so that it can be either an IndexRaftLogEntry or a snapshot.

Either way seems to require a lot of changes in RaftMemberContext, so I'd rather fix this by just re-adding this check. The check makes sense anyway: we never want to re-install a snapshot.
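For illustration only, the first alternative (tracking the current index separately) could look roughly like the sketch below. `MemberPosition` and all of its methods are hypothetical stand-ins and do not reflect the real `RaftMemberContext` API.

```java
/** Hypothetical stand-in for the leader's per-member bookkeeping; not the real RaftMemberContext. */
final class MemberPosition {
  // Index of the last entry (or snapshot) the member is known to have; 0 means nothing yet.
  private long currentIndex;
  // Whether the leader's log reader actually points at an entry for this member.
  private boolean hasCurrentEntry;

  /** Called whenever the leader positions the member at a concrete log entry. */
  void onCurrentEntry(final long entryIndex) {
    currentIndex = entryIndex;
    hasCurrentEntry = true;
  }

  /** Called after the member has installed a snapshot at the given index. */
  void onSnapshotInstalled(final long snapshotIndex, final long leaderFirstLogIndex) {
    // The index advances even if the leader's log no longer contains that entry.
    currentIndex = snapshotIndex;
    hasCurrentEntry = leaderFirstLogIndex <= snapshotIndex;
  }

  long currentIndex() {
    return currentIndex;
  }

  long nextIndex() {
    return currentIndex + 1;
  }

  boolean hasCurrentEntry() {
    return hasCurrentEntry;
  }
}
```

In this sketch, installing a snapshot at index 10 while the log starts at 11 leaves the member without a current entry but with a current index of 10, so the next entry to send is still 11.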

lenaschoenburg · 2022-07-18T13:24:30Z

bors r+

9824: fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=oleschoenburg

## Description

This fixes an issue where a leader becomes stuck in a snapshot replication loop if the leader's log starts right after the snapshot index. After successfully installing a snapshot, the leader attempts to seek back to the log entry before the snapshot, which fails if the log starts after the snapshot. When that happens, the member context has no current entry until the next append request is sent, so we must not use the member's current entry to decide whether to send a snapshot. Instead, we bail out and do not send a snapshot when the member already has the latest snapshot.

## Related issues

closes #9820

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>

zeebe-bors-camunda · 2022-07-18T13:38:00Z

Build failed:

Test summary

lenaschoenburg · 2022-07-18T13:40:12Z

bors retry

zeebe-bors-camunda · 2022-07-18T13:53:26Z

Build failed:

Test summary

lenaschoenburg · 2022-07-18T13:59:33Z

bors retry

zeebe-bors-camunda · 2022-07-18T14:13:19Z

Build succeeded:

backport-action · 2022-07-18T14:16:25Z

Successfully created backport PR #9830 for stable/1.3.

backport-action · 2022-07-18T14:16:33Z

Successfully created backport PR #9831 for stable/8.0.

9831: [Backport stable/8.0] fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=backport-action

# Description

Backport of #9824 to `stable/8.0`.

relates to #9820

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>

9830: [Backport stable/1.3] fix: don't replicate snapshot if member already has the latest snapshot r=oleschoenburg a=backport-action

# Description

Backport of #9824 to `stable/1.3`.

relates to #9820

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>

korthout · 2022-08-02T10:08:31Z

@oleschoenburg I think this should've been added to the release notes.

lenaschoenburg added 2 commits July 18, 2022 09:34

fix: don't replicate snapshot if member already has the latest snapshot

7317e36

test: regression test for snapshot replication loop

df0d417

lenaschoenburg force-pushed the 9820-fix-snapshot-replication-loop branch from 13a3a62 to df0d417 July 18, 2022 07:40

lenaschoenburg added backport stable/1.3 labels Jul 18, 2022

lenaschoenburg requested a review from npepinpe July 18, 2022 07:47

lenaschoenburg marked this pull request as ready for review July 18, 2022 07:47

npepinpe approved these changes Jul 18, 2022

zeebe-bors-camunda bot merged commit 145c697 into main Jul 18, 2022

zeebe-bors-camunda bot deleted the 9820-fix-snapshot-replication-loop branch July 18, 2022 14:13

backport-action mentioned this pull request Jul 18, 2022

[Backport stable/1.3] fix: don't replicate snapshot if member already has the latest snapshot #9830

Merged

backport-action mentioned this pull request Jul 18, 2022

[Backport stable/8.0] fix: don't replicate snapshot if member already has the latest snapshot #9831

Merged

npepinpe added version:1.3.13 release/8.0.5 labels Aug 1, 2022
