Extend `RandomizedRaftTest` with snapshot and data loss operations #9837

lenaschoenburg · 2022-07-19T08:44:11Z

Recent issues with snapshots have highlighted a need to improve test coverage. We can extend `RandomizedRaftTest` with additional operations for taking snapshots and for losing data.

menski · 2022-07-22T09:28:24Z

Marked this with the hot backup epic, as it might be valuable for verifying the system.

10183: fix(raft): follower reset pendingsnapshot after rejecting install request r=deepthidevaki a=deepthidevaki

## Description

- Extended RandomizedRaftTest to include snapshotting and compaction. This test could reproduce the bug #10180.
- Without this fix, when a follower rejects a snapshot install request because it receives a duplicate chunk, the leader resets the snapshot replication and restarts sending the same snapshot. But the follower has not reset its state, so it is still expecting a different chunk and rejects the request again. The fix is to reset the pending snapshot when the follower rejects a request on any error.

The extended RandomizedRaftTest partially covers #9837.

## Related issues

closes #10180
closes #10202

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@users.noreply.github.com>
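To make the duplicate-chunk scenario concrete, here is a minimal sketch of the follower-side behavior described in the fix. The class `FollowerSnapshotReceiver` and its methods are hypothetical names invented for illustration; this is a simplified model, not Zeebe's actual implementation.

```java
/** Hypothetical, simplified follower-side chunk receiver; not Zeebe's actual class. */
final class FollowerSnapshotReceiver {
  private String pendingSnapshotId; // null while no snapshot is being received
  private int nextExpectedChunk;

  /** Returns true if the chunk is accepted, false if the install request is rejected. */
  boolean onInstallRequest(final String snapshotId, final int chunkIndex) {
    if (pendingSnapshotId == null) {
      // Start receiving a new snapshot; its first chunk must have index 0.
      pendingSnapshotId = snapshotId;
      nextExpectedChunk = 0;
    }
    if (!pendingSnapshotId.equals(snapshotId) || chunkIndex != nextExpectedChunk) {
      // The behavior the fix describes: on any rejection, drop the pending snapshot
      // so that the leader's restart from the first chunk is accepted afterwards.
      resetPendingSnapshot();
      return false;
    }
    nextExpectedChunk++;
    return true;
  }

  private void resetPendingSnapshot() {
    pendingSnapshotId = null;
    nextExpectedChunk = 0;
  }
}
```

In this model, a follower that has accepted chunks 0 and 1 rejects a duplicate chunk 1 and, because it resets, accepts chunk 0 when the leader restarts the replication. Without the reset it would keep expecting chunk 2 and reject the restarted transfer again.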

npepinpe · 2022-08-31T10:51:19Z

Is this done via #10183 ?

deepthidevaki · 2022-08-31T10:58:47Z

> Is this done via #10183 ?

No, it only added snapshots. We still have to add restarts and restarts with data loss.

10249: Add restarts to `RandomizedRaftTest` r=oleschoenburg a=oleschoenburg

Adds the ability to restart raft members without data loss. After the restart, the same data directory and snapshot store are used.

Adds additional randomized tests for default operations + restarts and default operations + snapshots + restarts.

relates to #9837

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
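As an illustration of how a randomized test can combine operation sets such as "default operations + snapshots + restarts", the sketch below generates a reproducible operation sequence from a seed. The class `RandomOperationPlan` and the operation names are assumptions made up for this example; they do not reflect the actual structure of `RandomizedRaftTest`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical generator of randomized operation sequences; names are illustrative only. */
final class RandomOperationPlan {
  /** Illustrative operation kinds, not the real test's operation set. */
  enum Operation { APPEND, HEARTBEAT, DELIVER_MESSAGE, TAKE_SNAPSHOT, RESTART }

  /** Picks {@code length} operations uniformly at random from {@code allowed}. */
  static List<Operation> generate(final long seed, final int length, final List<Operation> allowed) {
    if (allowed.isEmpty()) {
      throw new IllegalArgumentException("at least one operation must be allowed");
    }
    // A fixed seed makes a failing sequence reproducible.
    final Random random = new Random(seed);
    final List<Operation> plan = new ArrayList<>(length);
    for (int i = 0; i < length; i++) {
      plan.add(allowed.get(random.nextInt(allowed.size())));
    }
    return plan;
  }
}
```

Because the plan depends only on the seed and the allowed operations, a failure can be replayed by rerunning with the same seed, and new operation kinds can be added to a test variant without changing the others.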

11547: fix: append an empty expired message command r=remcowesterhoud a=romansmirnov

## Description

To discard an expired message, the processor and applier only need the message key. No other information is necessary to process an expire message command. That's why only the `messageKey` is submitted with the command to expire a message. This prevents the message checker from stopping too early while collecting expired messages, which would leave it unable to collect as many expired messages per batch as needed to keep state from building up.

## Related issues

closes #10643

11581: Create `decisionRequirementsKeyByIdAndVersion` ColumnFamily r=koevskinikola a=remcowesterhoud

## Description

When we delete a DRG it is possible that this is the "latest" DRG. When this is the case we must be able to find the previous version of the DRG to make this the new "latest" DRG.

This PR introduces a new ColumnFamily: `decisionRequirementsKeyByIdAndVersion`. The key of this ColumnFamily is a composite of the `drgId` and the `drgVersion`. This makes sure the key is unique for all DRGs, whilst allowing us to iterate over all entries using the `drgId`. The value of the ColumnFamily is the `drgKey`.

A new method has been introduced to iterate over this ColumnFamily. It takes a `drgId` and a `currentVersion`. Using this information it will find the previous version (relative to the given `currentVersion`) of this specific `drgId`. If it finds something, it will return the key of this DRG. As part of another issue this method will be used and the key will be stored in the `latestDecisionRequirementsKeysById` ColumnFamily.

When a new DRG is stored in the state it will be inserted into this ColumnFamily accordingly.

Last but not least, we must make sure we migrate all DRGs in the state to initially fill this new ColumnFamily. A new migration for this has been added. It will iterate over the `decisionRequirementsByKey` ColumnFamily and use these DRGs to insert new entries into the new ColumnFamily.

## Related issues

closes #11541

11632: Add randomized test for dataloss scenario r=deepthidevaki a=deepthidevaki

## Description

Add restart with data loss to the set of input operations in randomized raft tests. The operation ensures that we can only inject data loss on one node at a time. It waits until the restarted node has recovered and caught up before proceeding with the other operations.

This PR also adds additional verification to ensure that there is no data loss, by keeping track of committed entries and verifying that they are not overwritten.

## Related issues

closes #9837

Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
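The verification idea in 11632, tracking committed entries and checking that they are never overwritten, could look roughly like the sketch below. `CommittedEntryTracker` and its methods are hypothetical names for illustration, and identifying an entry by its index and term is an assumption of this sketch, not necessarily how the actual test does it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that remembers committed entries as index -> term. */
final class CommittedEntryTracker {
  private final Map<Long, Long> committedTermByIndex = new HashMap<>();

  /** Records a committed entry; fails if the same index was committed with another term. */
  void recordCommitted(final long index, final long term) {
    final Long previousTerm = committedTermByIndex.putIfAbsent(index, term);
    if (previousTerm != null && previousTerm.longValue() != term) {
      throw new IllegalStateException(
          "Committed entry at index " + index + " was overwritten: term "
              + previousTerm + " replaced by term " + term);
    }
  }

  /**
   * Checks one member's log (index -> term) against the committed entries.
   * Indexes missing from the log are skipped, since the member may have compacted
   * them or not yet caught up.
   */
  void verifyLog(final Map<Long, Long> logTermByIndex) {
    for (final Map.Entry<Long, Long> committed : committedTermByIndex.entrySet()) {
      final Long actualTerm = logTermByIndex.get(committed.getKey());
      if (actualTerm != null && !actualTerm.equals(committed.getValue())) {
        throw new IllegalStateException(
            "Log differs from committed entry at index " + committed.getKey());
      }
    }
  }
}
```

A test would call `recordCommitted` whenever a member reports a commit and `verifyLog` for each member after the operations have run, so that a member restarted with data loss cannot silently replace an entry that was already committed.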


lenaschoenburg added the kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. label Jul 19, 2022

deepthidevaki mentioned this issue Aug 25, 2022

fix(raft): follower reset pendingsnapshot after rejecting install request #10183

Merged

lenaschoenburg mentioned this issue Sep 1, 2022

Add restarts to RandomizedRaftTest #10249

Merged

Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022

deepthidevaki self-assigned this Dec 9, 2022

Zelldon added component/raft area/test Marks an issue as improving or extending the test coverage of the project labels Jan 2, 2023

deepthidevaki mentioned this issue Feb 13, 2023

Add randomized test for dataloss scenario #11632

Merged

zeebe-bors-camunda bot closed this as completed in e738db1 Feb 21, 2023

deepthidevaki added the version:8.2.0-alpha5 Marks an issue as being completely or in parts released in 8.2.0-alpha5 label Mar 7, 2023

npepinpe added the version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0 label Apr 5, 2023
