Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend RandomizedRaftTest with snapshot and data loss operations #9837

Closed
lenaschoenburg opened this issue Jul 19, 2022 · 3 comments · Fixed by #11632
Closed

Extend RandomizedRaftTest with snapshot and data loss operations #9837

lenaschoenburg opened this issue Jul 19, 2022 · 3 comments · Fixed by #11632
Assignees
Labels
area/test Marks an issue as improving or extending the test coverage of the project component/raft kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 version:8.2.0-alpha5 Marks an issue as being completely or in parts released in 8.2.0-alpha5 version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0

Comments

@lenaschoenburg
Copy link
Member

Recent issues with snapshots have highlighted a need to improve test coverage. We can extend the RandomizedRaftTest by adding additional operations for taking snapshots and for losing data.

@lenaschoenburg lenaschoenburg added the kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. label Jul 19, 2022
@menski
Copy link
Contributor

menski commented Jul 22, 2022

marked it with the hot backup epic as it might be valuable to verify the system

zeebe-bors-camunda bot added a commit that referenced this issue Aug 29, 2022
10183: fix(raft): follower reset pendingsnapshot after rejecting install request r=deepthidevaki a=deepthidevaki

## Description

- Extended RandomizedRaftTest to include snapshotting and compaction. This test could reproduce the bug #10180 
- Without this fix, when a follower rejects a snapshot install request because it receives a duplicate chunk, the leader resets the snapshot replication and restart sending the same snapshot. But the follower has not reset its state, so it is still expecting a different chunk and reject the request again. The fix is to reset the pending snapshot when follower rejects a request on any error.

Extended RandomizedRaftTest partially covers #9837 

## Related issues

closes #10180 
closes #10202 



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@users.noreply.github.com>
@npepinpe
Copy link
Member

Is this done via #10183 ?

@deepthidevaki
Copy link
Contributor

Is this done via #10183 ?

No. It only added snapshot. We still have to add restarts and restarts with dataloss.

zeebe-bors-camunda bot added a commit that referenced this issue Sep 2, 2022
10249: Add restarts to `RandomizedRaftTest` r=oleschoenburg a=oleschoenburg

Adds the ability to restart raft members without data loss. After the restart, the same data directory and snapshot store is used.

Adds additional randomized tests for default operations + restarts and default operations + snapshots + restarts.

relates to #9837

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
@deepthidevaki deepthidevaki self-assigned this Dec 9, 2022
@Zelldon Zelldon added component/raft area/test Marks an issue as improving or extending the test coverage of the project labels Jan 2, 2023
zeebe-bors-camunda bot added a commit that referenced this issue Feb 13, 2023
11547: fix: append an empty expired message command r=remcowesterhoud a=romansmirnov

## Description

To discard an expired message the processor and applier only need the message key. All other information is not necessary to process a expire message command. That's why only the `messageKey` is submitted to the command to expire a message.

This prevents the message checker to stop too early when collecting expired messages and not being able to collect as many as necessary expired messages in one batch to prevent building up state.

## Related issues
closes #10643 



11581: Create `decisionRequirementsKeyByIdAndVersion` ColumnFamily r=koevskinikola a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->
When we delete a DRG it is possible that this is the "latest" DRG. When this is the case we must be able to find the previous version of the DRG to make this the new "latest" DRG.
This PR introduces a new ColumnFamily: `decisionRequirementsKeyByIdAndVersion`. The key of this ColumnFamily is a composite of the `drgId` and the `drgVersion`. This makes sure the key is unique for all DRGs, whilst allowing us to iterate over all entries using the `drgId`. The value of the ColumnFamily is the `drgKey`.

A new method has been introduced to iterate over this ColumnFamily. It takes a `drgId` and a `currentVersion`. Using this information it will find the previous version (relative to the given `currentVersion`) of this specific `drgId`. If it found something it will return the key of this DRG. As part of another issue this method will be used and the key will be stored in the `latestDecisionRequirementsKeysById` ColumnFamily.

When a new DRG is stored in the state it will be inserted into this ColumnFamily accordingly.

Last but not least we must make sure we migrate all DRGs in the state to initially fill this new ColumnFamily. A new migration for this has been added. It will iterate over the `decisionRequirementsByKey` ColumnFamily and use these DRGs to insert new entries into the new ColumnFamily.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #11541 



11632: Add randomized test for dataloss scenario r=deepthidevaki a=deepthidevaki

## Description

Add restart with dataloss to the set of input operations in randomized raft tests. The operations ensures that we can only inject data loss of one node at a time. It waits until the restarted node have recovered and caught up before proceeding with the other operations. 

This PR also adds additional verification to ensure that there is no dataloss, by keeping track of committed entries and verifying that it is not overwritten.

## Related issues

closes #9837 


Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Feb 13, 2023
11581: Create `decisionRequirementsKeyByIdAndVersion` ColumnFamily r=koevskinikola a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->
When we delete a DRG it is possible that this is the "latest" DRG. When this is the case we must be able to find the previous version of the DRG to make this the new "latest" DRG.
This PR introduces a new ColumnFamily: `decisionRequirementsKeyByIdAndVersion`. The key of this ColumnFamily is a composite of the `drgId` and the `drgVersion`. This makes sure the key is unique for all DRGs, whilst allowing us to iterate over all entries using the `drgId`. The value of the ColumnFamily is the `drgKey`.

A new method has been introduced to iterate over this ColumnFamily. It takes a `drgId` and a `currentVersion`. Using this information it will find the previous version (relative to the given `currentVersion`) of this specific `drgId`. If it found something it will return the key of this DRG. As part of another issue this method will be used and the key will be stored in the `latestDecisionRequirementsKeysById` ColumnFamily.

When a new DRG is stored in the state it will be inserted into this ColumnFamily accordingly.

Last but not least we must make sure we migrate all DRGs in the state to initially fill this new ColumnFamily. A new migration for this has been added. It will iterate over the `decisionRequirementsByKey` ColumnFamily and use these DRGs to insert new entries into the new ColumnFamily.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #11541 



11632: Add randomized test for dataloss scenario r=deepthidevaki a=deepthidevaki

## Description

Add restart with dataloss to the set of input operations in randomized raft tests. The operations ensures that we can only inject data loss of one node at a time. It waits until the restarted node have recovered and caught up before proceeding with the other operations. 

This PR also adds additional verification to ensure that there is no dataloss, by keeping track of committed entries and verifying that it is not overwritten.

## Related issues

closes #9837 


Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@deepthidevaki deepthidevaki added the version:8.2.0-alpha5 Marks an issue as being completely or in parts released in 8.2.0-alpha5 label Mar 7, 2023
@npepinpe npepinpe added the version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0 label Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/test Marks an issue as improving or extending the test coverage of the project component/raft kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 version:8.2.0-alpha5 Marks an issue as being completely or in parts released in 8.2.0-alpha5 version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants