
fix(raft): follower resets pending snapshot after rejecting install request #10183

Merged: 11 commits merged into main on Aug 29, 2022

Conversation

@deepthidevaki (Contributor) commented Aug 25, 2022

Description

  • Extended RandomizedRaftTest to include snapshotting and compaction. This test could reproduce bug #10180, "Follower cannot receive snapshot because 'chunk received out of order'".
  • Without this fix, when a follower rejects a snapshot install request because it received a duplicate chunk, the leader resets the snapshot replication and restarts sending the same snapshot. But the follower has not reset its state, so it is still expecting a different chunk and rejects the request again. The fix is to reset the pending snapshot whenever the follower rejects an install request due to any error (see the sketch below).

The extended RandomizedRaftTest partially covers #9837.
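
For illustration, here is a minimal, self-contained sketch of the idea (illustrative names only, not the actual Atomix `PassiveRole` implementation): the follower discards its pending-snapshot state whenever it rejects a chunk, so the leader's restarted transfer is accepted on the next attempt.

```java
// Hedged sketch of the fix: when the follower rejects an install request for any reason,
// it also drops its pending-snapshot state, so the leader's restarted transfer (from the
// initial chunk) can start cleanly instead of being rejected forever.
final class PendingSnapshotSketch {

  private Integer expectedChunkId; // null = no snapshot transfer in progress

  /** Returns true if the chunk is accepted, false if it is rejected. */
  boolean onSnapshotChunk(final boolean initial, final int chunkId) {
    if (expectedChunkId == null) {
      if (!initial) {
        return reject(); // a non-initial chunk without an ongoing transfer is out of order
      }
      expectedChunkId = chunkId; // start a new transfer
    } else if (chunkId != expectedChunkId) {
      // e.g. a duplicate chunk, or the initial chunk of a restarted transfer while stale
      // transfer state is still around -> reject, and (the fix) reset the local state.
      return reject();
    }
    expectedChunkId = chunkId + 1; // accept and expect the next chunk
    return true;
  }

  private boolean reject() {
    expectedChunkId = null; // the fix: without this reset, every retry is rejected again
    return false;
  }
}
```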

Related issues

closes #10180
closes #10202

Definition of Done

Depending on the issue and the pull request, not all items need to be done.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug, then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; in case that fails, you need to create the backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

When comparing compacted logs, not all entries exist in all logs. So compare only the entries that exist.
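
As an illustration of the first commit message above, a hypothetical comparison helper (not the actual RandomizedRaftTest code) could compare only the entries present in both, possibly compacted, logs:

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: only indices present in both logs are compared, since compaction
// may have removed a prefix of either log. Names are illustrative.
final class CompactedLogComparison {

  static boolean entriesMatch(final Map<Long, String> logA, final Map<Long, String> logB) {
    for (final Map.Entry<Long, String> entry : logA.entrySet()) {
      final String other = logB.get(entry.getKey());
      if (other == null) {
        continue; // the entry was compacted away in logB, so there is nothing to compare
      }
      if (!Objects.equals(entry.getValue(), other)) {
        return false; // an entry present in both logs differs -> the logs are inconsistent
      }
    }
    return true; // entries only present in one of the logs are ignored
  }
}
```
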
When the follower responds with any error, the leader restarts sending the snapshot from the initial chunk. If the follower has not reset its state, it is not expecting the initial chunk. As a result, the follower rejects the requests, and this continues endlessly.
@github-actions bot (Contributor) commented Aug 25, 2022

Test Results

853 files (+1)   853 suites (+1)   1h 37m 12s ⏱️ (+1m 51s)
6 588 tests (+150): 6 577 passed ✔️ (+151), 11 skipped 💤 (-1), 0 failed (±0)
6 772 runs (+150): 6 761 passed ✔️ (+151), 11 skipped 💤 (-1), 0 failed (±0)

Results for commit 69df0fa. ± Comparison against base commit fabb2d2.

♻️ This comment has been updated with latest results.

@oleschoenburg (Member) commented Aug 25, 2022

livenessTestWithSnapshot with seed "6" fails for me, even if I increase maxStepsToReplicateEntries to 10000.

org.opentest4j.AssertionFailedError: [All entries should be committed] 
expected: IndexedRaftLogEntryImpl[index=1, term=2, entry=InitialEntry[], record=PersistedJournalRecord[metadata=RecordMetadata[checksum=966814682, length=45], record=RecordData[index=1, asqn=-1, data=UnsafeBuffer{addressOffset=140332419498078, capacity=17, byteArray=null, byteBuffer=java.nio.DirectByteBufferR[pos=0 lim=10194 cap=10194]}]]]
 but was: null

@deepthidevaki (Contributor, Author) commented

> livenessTestWithSnapshot with seed "6" fails for me, even if I increase maxStepsToReplicateEntries to 10000.

Thanks. Will look into it.

@deepthidevaki (Contributor, Author) commented

🤔 livenessWithSnapshot passes for me with seed = 6.

 @Property(tries = 10, shrinking = ShrinkingMode.OFF, edgeCases = EdgeCasesMode.NONE, seed = "6")
  void livenessTestWithSnapshot(

Config reported by the test

tries = 10                    | # of calls to property
checks = 10                   | # of not rejected calls
generation = RANDOMIZED       | parameters are randomly generated
after-failure = PREVIOUS_SEED | use the previous seed
when-fixed-seed = ALLOW       | fixing the random seed is allowed
edge-cases#mode = NONE        | edge cases are not explicitly generated
seed = 6                      | random seed to reproduce generated values

@oleschoenburg (Member) commented Aug 25, 2022

Interesting. At first I thought I couldn't reproduce it either but after bumping "tries" to 1000, it started failing again. Just to make sure, I did a mvn clean compile and then ran the test with:

@Property(tries = 1000, shrinking = ShrinkingMode.OFF, edgeCases = EdgeCasesMode.NONE, seed = "6")

For me, this fails again:

timestamp = 2022-08-25T17:29:59.343969738, RandomizedRaftTest:livenessTestWithSnapshot = 
  org.opentest4j.AssertionFailedError:
    [All entries should be committed] 
    expected: IndexedRaftLogEntryImpl[index=1, term=3, entry=InitialEntry[], record=PersistedJournalRecord[metadata=RecordMetadata[checksum=3449732498, length=45], record=RecordData[index=1, asqn=-1, data=UnsafeBuffer{addressOffset=139960984518750, capacity=17, byteArray=null, byteBuffer=java.nio.DirectByteBufferR[pos=0 lim=10194 cap=10194]}]]]
     but was: null

                              |-------------------jqwik-------------------
tries = 72                    | # of calls to property
checks = 72                   | # of not rejected calls
generation = RANDOMIZED       | parameters are randomly generated
after-failure = PREVIOUS_SEED | use the previous seed
when-fixed-seed = ALLOW       | fixing the random seed is allowed
edge-cases#mode = NONE        | edge cases are not explicitly generated
seed = 6                      | random seed to reproduce generated values

EDIT: Ah, I think I get it: the number of tries is also required to reproduce; the seed alone is not enough. It always fails on try number 72, so running with just 10 tries will not reproduce it even with the correct seed.

@oleschoenburg (Member) left a comment

The fix itself looks good to me, but I think we should look into the test failures before merging. Anyway, nice find @deepthidevaki 🚀

@deepthidevaki (Contributor, Author) commented

I can reproduce it with seed = "6" and tries = 100. It failed at try 72.

On an initial look, this could have uncovered a new bug in Raft. Follower 0 is not able to replicate events from the leader. It is caught in an error loop:

10:10:43.887 [] TRACE io.atomix.raft.roles.LeaderAppender - RaftServer{1-partition-1} - Sending AppendRequest{term=26, leader=1, prevLogIndex=1, prevLogTerm=10, entries=2, commitIndex=2} to 0

10:10:43.887 [] DEBUG io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Rejected AppendRequest{term=26, leader=1, prevLogIndex=1, prevLogTerm=10, entries=2, commitIndex=2}: Previous entry term (10) does not equal the local log's last term (3)
10:10:43.887 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Sending AppendResponse{status=OK, term=26, succeeded=false, lastLogIndex=0, lastSnapshotIndex=1}

10:10:43.887 [] TRACE io.atomix.raft.roles.LeaderAppender - RaftServer{1-partition-1} - Received AppendResponse{status=OK, term=26, succeeded=false, lastLogIndex=0, lastSnapshotIndex=1} from 0
10:10:43.887 [] TRACE io.atomix.raft.roles.LeaderAppender - RaftServer{1-partition-1} - Reset next index for RaftMemberContext{member=0, term=26, configIndex=0, snapshotIndex=1, nextSnapshotIndex=0, nextSnapshotChunk=null, matchIndex=0, heartbeatTime=1661501443887, appending=0, appendSucceeded=false, appendTime=1661501443887, configuring=false, installing=false, failures=0} to 1
10:10:43.887 [] TRACE io.atomix.raft.roles.LeaderAppender - RaftServer{1-partition-1} - Sending AppendRequest{term=26, leader=1, prevLogIndex=1, prevLogTerm=10, entries=2, commitIndex=2} to 0

I will look into it further.

@deepthidevaki (Contributor, Author) commented

Was also able to reproduce with seed "-47087393671576789" in try 9.

Here is what happens:

15:22:09.891 [] INFO  io.atomix.raft.impl.RaftContext - RaftServer{0-partition-1} - Found leader 2
15:22:09.891 [] TRACE io.atomix.raft.impl.RaftContext - RaftServer{0-partition-1} - Set leader 2
15:22:09.891 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Sending ConfigureResponse{status=OK}
15:22:09.891 [] TRACE io.atomix.raft.impl.RandomizedElectionTimer - Election timeout scheduled for PT4.468S
15:22:09.891 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Received AppendRequest{term=74, leader=2, prevLogIndex=2, prevLogTerm=37, entries=0, commitIndex=2}
15:22:09.891 [] DEBUG io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Rejected AppendRequest{term=74, leader=2, prevLogIndex=2, prevLogTerm=37, entries=0, commitIndex=2}: Previous index (2) is greater than the local log's last index (1)
15:22:09.891 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Sending AppendResponse{status=OK, term=74, succeeded=false, lastLogIndex=1, lastSnapshotIndex=0}

Node 0 log has one entry which is at index 1.

15:22:09.892 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Received AppendRequest{term=74, leader=2, prevLogIndex=1, prevLogTerm=19, entries=2, commitIndex=2}
15:22:09.892 [] DEBUG io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Rejected AppendRequest{term=74, leader=2, prevLogIndex=1, prevLogTerm=19, entries=2, commitIndex=2}: Previous entry term (19) does not equal the local log's last term (5)
15:22:09.893 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Sending AppendResponse{status=OK, term=74, succeeded=false, lastLogIndex=0, lastSnapshotIndex=0}

Node 0 log's entry at index 1 has term 5. Leader's index 1 has term 19.

15:22:09.895 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Received InstallRequest{currentTerm=74, leader=2, index=1, term=19, version=1, chunkId=HeapByteBuffer{position=0, remaining=7, limit=7, capacity=7, mark=java.nio.HeapByteBuffer[pos=0 lim=7 cap=7], hash=263875441}, nextChunkId=null, data=HeapByteBuffer{position=0, remaining=57, limit=57, capacity=57, mark=java.nio.HeapByteBuffer[pos=0 lim=57 cap=57], hash=1020455180}, initial=false, complete=true}
15:22:09.895 [] DEBUG io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Received snapshot 1 chunk from 2
15:22:09.895 [] DEBUG io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Committing snapshot io.atomix.raft.snapshot.InMemorySnapshot@176fb1
15:22:09.895 [] INFO  io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Committed snapshot io.atomix.raft.snapshot.InMemorySnapshot@176fb1
15:22:09.895 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Sending InstallResponse{status=OK}

The leader sends the snapshot at index 1. Node 0 commits it, but does not reset the log. (Bug!) We expect the follower to reset the log whenever it receives a new snapshot from the leader.

Now Node 0 has a snapshot at index 1 with term 19, and a log with an entry at index 1 with term 5. (Bug!!)

15:22:09.896 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Received AppendRequest{term=74, leader=2, prevLogIndex=1, prevLogTerm=19, entries=2, commitIndex=3}
15:22:09.896 [] DEBUG io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Rejected AppendRequest{term=74, leader=2, prevLogIndex=1, prevLogTerm=19, entries=2, commitIndex=3}: Previous entry term (19) does not equal the local log's last term (5)
15:22:09.896 [] TRACE io.atomix.raft.roles.FollowerRole - RaftServer{0-partition-1}{role=FOLLOWER} - Sending AppendResponse{status=OK, term=74, succeeded=false, lastLogIndex=0, lastSnapshotIndex=1}

Node 0 receives the entry at index 2. Expected behavior: there is a snapshot at index 1, so it can append the entry at index 2. But since there is still an entry at index 1 in the log, the follower uses that entry to verify the previous entry, so the check fails. And this repeats.
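
For illustration, a simplified version of the follower-side consistency check that produces the rejections in the logs above (illustrative names, not the actual Atomix FollowerRole code); with the state described here, the term check always fails:

```java
// Simplified sketch of the check on AppendRequest's previous-entry fields.
final class AppendCheckSketch {

  static String checkPreviousEntry(
      final long prevLogIndex,
      final long prevLogTerm,
      final long localLastIndex,
      final long localTermAtPrevIndex) {
    if (prevLogIndex > localLastIndex) {
      return "reject: previous index " + prevLogIndex
          + " is greater than the local log's last index " + localLastIndex;
    }
    if (localTermAtPrevIndex != prevLogTerm) {
      // Node 0's stale entry at index 1 has term 5, while the leader expects term 19,
      // so this branch is taken on every retry and the loop never terminates.
      return "reject: previous entry term " + prevLogTerm
          + " does not equal the local log's last term " + localTermAtPrevIndex;
    }
    return "accept";
  }
}
```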

The fix for this is to ensure that the log is reset when a follower receives a snapshot from the leader.

Why is the log not reset?

Can we remove this check?

  • This check was needed because of the snapshot replication outside of Raft. We can remove it now, but simply removing it does not work: the same code is also used to reset the log in the init of a role. That is needed to guarantee that if the node crashes after the snapshot is committed but before the log is reset, the log is reset immediately after restart.

Proposed solution (sketched below):

  1. Remove this check and always reset the log when the follower receives a snapshot.
  2. Reset the log before the snapshot is committed.
  3. Remove the reset of the log during init; it is no longer needed because of step 2.
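
A hedged sketch of the proposed ordering (illustrative names, not the actual change):

```java
// The follower always resets its log when it receives a snapshot, and does so *before*
// committing the snapshot. A crash in between leaves an empty log plus the previous
// snapshot, which is a valid state, so no reset-on-init is needed anymore.
final class SnapshotCommitSketch {

  private long logFirstIndex = 1; // simplified stand-in for the raft log

  void onSnapshotReceived(final long snapshotIndex, final Runnable commitSnapshot) {
    resetLog(snapshotIndex + 1); // steps 1 + 2: unconditional reset, before the commit
    commitSnapshot.run();        // step 2: commit only after the log has been reset
    // step 3: with this ordering, no log reset during role init is required
  }

  private void resetLog(final long nextIndex) {
    logFirstIndex = nextIndex; // discard all existing entries; the next entry starts here
  }
}
```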

There were some cases where the log was not reset, leading to scenarios where a follower is not able to replicate new events. This case is explained in #10183 (comment)
@deepthidevaki (Contributor, Author) commented

After the fix, I ran the test more than 125 times with the default test configuration. All tests passed. Before, it was failing within 20 runs. Hope it will be fine now 🤞

@@ -265,11 +230,12 @@ public CompletableFuture<InstallResponse> onInstall(final InstallRequest request
final long elapsed = System.currentTimeMillis() - pendingSnapshotStartTimestamp;
log.debug("Committing snapshot {}", pendingSnapshot);
try {
// Reset before committing to prevent the edge case where the system crashes after
// committing the snapshot, and restart with a snapshot and invalid log.
resetLogOnReceivingSnapshot(pendingSnapshot.index());
A member commented on this diff:

❓ With these changes here, a failure while committing the snapshot means that the broker loses data because it already reset the log prior to committing the snapshot. Isn't this problematic?

Say there is a bug where a received snapshot can't be committed. Previously, a running system would have been able to continue (just without snapshot replication, and maybe the next received snapshot can be committed again). Now, with these changes, all followers (or at least those that receive such a snapshot) would lose data immediately and only the leader would be left with the full data.

@oleschoenburg (Member) left a comment

That's it, LGTM 🚀

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@deepthidevaki (Contributor, Author) commented

bors merge

@zeebe-bors-camunda (Contributor) commented

Build succeeded:

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit 49adfac into main Aug 29, 2022
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the dd-10180-snapshot-replication branch August 29, 2022 12:57
@backport-action (Collaborator) commented

Backport failed for stable/1.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

```
git fetch origin stable/1.3
git worktree add -d .worktree/backport-10183-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-10183-to-stable/1.3
git checkout -b backport-10183-to-stable/1.3
ancref=$(git merge-base fabb2d2f2e1f3affd1073a9069e21c27a7b8f0bd 69df0fac65c909a3e97a0c39174cd857ad2d9680)
git cherry-pick -x $ancref..69df0fac65c909a3e97a0c39174cd857ad2d9680
```

@backport-action (Collaborator) commented

Backport failed for stable/8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

```
git fetch origin stable/8.0
git worktree add -d .worktree/backport-10183-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-10183-to-stable/8.0
git checkout -b backport-10183-to-stable/8.0
ancref=$(git merge-base fabb2d2f2e1f3affd1073a9069e21c27a7b8f0bd 69df0fac65c909a3e97a0c39174cd857ad2d9680)
git cherry-pick -x $ancref..69df0fac65c909a3e97a0c39174cd857ad2d9680
```

deepthidevaki added a commit that referenced this pull request Aug 29, 2022
There were some cases where the log was not reset, leading to scenarios where a follower is not able to replicate new events. This case is explained in #10183 (comment)

(cherry picked from commit 4c82bd9)
zeebe-bors-camunda bot added a commit that referenced this pull request Aug 29, 2022
10210: [Backport stable/8.0] fix(raft): follower resets pending snapshot after rejecting install request r=oleschoenburg a=deepthidevaki

## Description

Backport #10183 

closes #10180 #10202 

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@users.noreply.github.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Aug 29, 2022
10211: [Backport stable/1.3] fix(raft): follower resets pending snapshot after rejecting install request r=oleschoenburg a=deepthidevaki

## Description

Backport #10183

closes #10180 #10202

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@users.noreply.github.com>
@saig0 added the release/8.0.8 and version:1.3.14 labels Sep 1, 2022
zeebe-bors-camunda bot added a commit that referenced this pull request Sep 26, 2022
10443: Do not take a backup if it already exists r=deepthidevaki a=deepthidevaki

## Description

After restore, the log is truncated to the checkpoint position. So the checkpoint record is processed again and will trigger a new backup with the same id as the backup it restored from. With this PR, `BackupService` handles this case gracefully. In addition, we also do not take a new backup if an existing backup has failed or is in progress. Alternatively, we could delete that backup and take a new one, but the chances of this happening (i.e. triggering a new backup when one is already in progress/failed) are very low, so we keep it simple.
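
A hedged sketch of the described behaviour (illustrative names, not the actual BackupService API):

```java
// When a checkpoint record would trigger a backup whose id already exists
// (completed, failed, or still in progress), the duplicate trigger is simply ignored.
final class BackupTriggerSketch {

  enum Status { DOES_NOT_EXIST, IN_PROGRESS, FAILED, COMPLETED }

  void onCheckpointRecord(final Status existingBackup, final Runnable takeBackup) {
    if (existingBackup != Status.DOES_NOT_EXIST) {
      return; // after a restore the same checkpoint is processed again; skip gracefully
    }
    takeBackup.run();
  }
}
```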

## Related issues

closes #10430 



10463: Do not fail consistency check if log is empty r=deepthidevaki a=deepthidevaki

## Description

When a follower receives a snapshot from the leader, it has to throw away the log and reset the log to `snapshotIndex + 1`. Previously we were doing it in the following order:
1. commit snapshot
2. reset 

In this case, if the system crashed after step 1, when the node restarts it is in an invalid state because the log is not reset after the snapshot. To prevent this case, we reset the log on startup based on the existing snapshot. This was buggy and caused issues, which were fixed by #10183. The fix was to reverse the order:

1. reset log
2. commit snapshot.

So on restart, there is no need to reset the log. If the system crashes after step 1, we have an empty log and no snapshot (or a previous snapshot). This is a valid state because this follower is not in the quorum, so no data is lost. After the restart, the follower will receive the snapshot and the following events.

But this caused the consistency check to fail because it detected gaps between the snapshot and the first log entry. The state is not actually inconsistent, because no data is lost. So we fix this by updating the consistency check to treat this as a valid state. To make the state valid, if the log is empty, we reset it based on the available snapshot.
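
A hedged sketch of the relaxed consistency check (illustrative names, not the actual implementation):

```java
// An empty log next to a snapshot is not treated as a gap; instead the log is
// (re)initialised so that the next entry follows the snapshot index.
final class ConsistencyCheckSketch {

  static void verifyOrRepair(final long snapshotIndex, final LogSketch log) {
    if (log.isEmpty()) {
      // Valid state: e.g. a crash happened after the log was reset but before the
      // snapshot commit completed. Repair by aligning the log with the snapshot.
      log.reset(snapshotIndex + 1);
      return;
    }
    if (log.firstIndex() > snapshotIndex + 1) {
      throw new IllegalStateException(
          "gap between snapshot " + snapshotIndex + " and first log entry " + log.firstIndex());
    }
  }

  interface LogSketch {
    boolean isEmpty();
    long firstIndex();
    void reset(long nextIndex);
  }
}
```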

## Related issues

closes #10451 



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Sep 26, 2022
10482: deps(maven): bump snakeyaml from 1.32 to 1.33 r=Zelldon a=dependabot[bot]

Bumps [snakeyaml](https://bitbucket.org/snakeyaml/snakeyaml) from 1.32 to 1.33.
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/eafb23ec31a0babe591c00e1b50e557a5e3f9a1d"><code>eafb23e</code></a> [maven-release-plugin] prepare for next development iteration</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/26624702fab8e0a1c301d7fad723c048528f75c3"><code>2662470</code></a> Improve JavaDoc</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/80827798f06aeb3d4f2632b94075ca7633418829"><code>8082779</code></a> Always emit numberish strings with quotes</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/42d6c79430431fe9033d3ba50f6a7dc6798ba7ad"><code>42d6c79</code></a> Reformat test</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/1962a437263348c3b90857cda4bbfa2bd97908f8"><code>1962a43</code></a> Refactor: rename variables in Emitter</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/bc594ad6e2b87c3fc26844e407276796fd866a40"><code>bc594ad</code></a> Issue 553: honor code point limit in loadAll</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/c3e98fd755a949f65cf11f2ff39e55a1c2afd1c2"><code>c3e98fd</code></a> Update changes.xml</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/a06f76859f2f07580b1d9fa6b66ea84aaad26cf8"><code>a06f768</code></a> Remove deprecated Tag manipulation</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/5a0027a3781b92f59bf92cdeb1b7590589993efd"><code>5a0027a</code></a> Remove unused WhitespaceToken</li>
<li><a href="https://bitbucket.org/snakeyaml/snakeyaml/commits/3f05838828b8df36ab961bf836f373b8c20cb8ff"><code>3f05838</code></a> Improve JavaDoc</li>
<li>Additional commits viewable in <a href="https://bitbucket.org/snakeyaml/snakeyaml/branches/compare/snakeyaml-1.33..snakeyaml-1.32">compare view</a></li>
</ul>
</details>

10486: test: `PartitionRestoreServiceTest` does not block on taking a backup r=oleschoenburg a=oleschoenburg

We saw some unit tests timing out in `PartitionRestoreServiceTest`:

```
"ForkJoinPool-1-worker-1" #19 daemon prio=5 os_prio=0 cpu=1567.91ms elapsed=914.45s tid=0x00007facfca78b60 nid=0x15ab5 waiting on condition  [0x00007facb83df000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.4.1/Native Method)
	- parking to wait for  <0x0000000511f04c68> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.4.1/LockSupport.java:211)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.4.1/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@17.0.4.1/ForkJoinPool.java:3449)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.4.1/ForkJoinPool.java:3432)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.4.1/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.join(java.base@17.0.4.1/CompletableFuture.java:2117)
	at io.camunda.zeebe.restore.PartitionRestoreServiceTest.takeBackup(PartitionRestoreServiceTest.java:212)
	at io.camunda.zeebe.restore.PartitionRestoreServiceTest.shouldFailToRestoreWhenSnapshotIsCorrupted(PartitionRestoreServiceTest.java:182)
```

With these changes, we ensure that the test does not wait forever on a backup, instead setting a maximum of 30 seconds. Additionally, `TestRestorableBackupStore` now fails the future when a backup is marked as failed.
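
A minimal sketch of bounding the wait (the actual test helper may differ):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Block on the backup future for at most 30 seconds instead of joining indefinitely.
final class BoundedBackupWait {

  static <T> T awaitBackup(final CompletableFuture<T> backupTaken) throws Exception {
    // join() would block forever if the backup never completes; get(timeout) fails fast.
    return backupTaken.get(30, TimeUnit.SECONDS);
  }
}
```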

10489: Do not use DefaultActorClock r=Zelldon a=Zelldon

## Description
The default ActorClock is not thread-safe and shouldn't be shared between multiple threads. This means we need to set the clock in the ActorClockConfiguration to null.

Creating the ActorScheduler with no clock causes each thread to get its own clock.


Note: This is a quick fix; at some point, we want to make DefaultActorClock thread-safe so we can always use the same clock. See #10400

## Related issues


related #10400 



10490: ci(macos): set code cache size of 64m r=megglos a=megglos

To counter occasional out-of-code-cache errors observed on macOS builds.

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>