Reject duplicate DeploymentDistribution Complete command #10074

korthout · 2022-08-15T17:22:32Z

Description

There existed a special case that could lead to a ZeebeDbInconsistentException:

a pending deployment is distributed multiple times to another partition by the DeploymentRedistributor.
the other partition processes the distribution twice and both times sends a DeploymentDistribution:Complete command to the deployment partition (i.e. partitionId: 1).
the deployment partition processes the first complete command, and writes DeploymentDistribution:Completed event, which is applied to the state.
applying the completed event results in deleting the Pending Deployment for that partition.
when it processes the second complete command, there could still be a pending deployment for another partition open for the same deployment, if so, the error happens.
the second command is not rejected, because there is still a pending deployment for the deployment key, so another completed event is written and applied.
applying fails this time, because the pending deployment no longer exists.

This PR changes the behavior. It makes sure the second command is rejected because the specific pending deployment no longer exists. In that case, we don't write the completed event a second time.

Related issues

closes #10064

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

The changes are backwards compatibility with previous versions
If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

There are unit/integration tests that verify all acceptance criterias of the issue
New tests are written to ensure backwards compatibility with further versions
The behavior is tested manually
The change has been verified by a QA run
The impact of the changes is verified by a benchmark

Documentation:

The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
New content is added to the release announcement
If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

github-actions · 2022-08-15T17:34:38Z

Test Results

  844 files +    1   844 suites +1 1h 43m 13s ⏱️ + 8m 1s
6 226 tests - 231 6 214 ✔️ - 232 11 💤 ±0 1 ❌ +1
6 410 runs - 231 6 398 ✔️ - 232 11 💤 ±0 1 ❌ +1

For more details on these failures, see this check.

Results for commit 6db151e. ± Comparison against base commit 17630e8.

♻️ This comment has been updated with latest results.

The complete deployment distribution command must be processed idempotently. Like we do in other inter-partition communication cases, we reject the command when it was already processed.

The pending deployment is deleted when the DeploymentDistribution COMPLETED event is written. We should only write this event if the pending deployment still exists. If it doesn't exist it either was already completed or it was never pending. In both cases, the command to COMPLETE the deployment distribution should be rejected. We can achieve this by providing a way to check whether a specific deployment distribution exists in the state. The current method only allows to check that there is at least one pending distribution for a specific deployment, but it doesn't allow to check the specific partition id.

saig0

@korthout LGTM 👍

korthout · 2022-08-17T07:51:53Z

bors merge

10074: Reject duplicate DeploymentDistribution Complete command r=korthout a=korthout ## Description  There existed a special case that could lead to a ZeebeDbInconsistentException: - a pending deployment is distributed multiple times to another partition by the [DeploymentRedistributor](https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/deployment/distribute/DeploymentRedistributor.java). - the other partition processes the distribution twice and both times sends a `DeploymentDistribution:Complete` command to the deployment partition (i.e. `partitionId: 1`). - the deployment partition processes the first complete command, and writes `DeploymentDistribution:Completed` event, which is applied to the state. - applying the completed event results in deleting the Pending Deployment for that partition. - when it processes the second complete command, there could still be a pending deployment for another partition open for the same deployment, if so, the error happens. - the second command is not rejected, because there is still a pending deployment for the deployment key, so another completed event is written and applied. - applying fails this time, because the pending deployment no longer exists. This PR changes the behavior. It makes sure the second command is rejected because the specific pending deployment no longer exists. In that case, we don't write the completed event a second time. ## Related issues  closes #10064 Co-authored-by: Nico Korthout <nico.korthout@camunda.com>

zeebe-bors-camunda · 2022-08-17T08:21:00Z

Build failed:

Test summary

korthout · 2022-08-18T11:09:49Z

I missed that this failed to merge, but it seems to have happened due to Out of space in CodeCache 🤷

bors retry

zeebe-bors-camunda · 2022-08-18T11:34:43Z

Build succeeded:

backport-action · 2022-08-18T11:35:38Z

Backport failed for stable/1.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/1.3
git worktree add -d .worktree/backport-10074-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-10074-to-stable/1.3
git checkout -b backport-10074-to-stable/1.3
ancref=$(git merge-base 17630e8cc3f1d81c5dc9523208196147d17114ed 6db151ecf7578ff97c6aa7b590614fc81d666293)
git cherry-pick -x $ancref..6db151ecf7578ff97c6aa7b590614fc81d666293

backport-action · 2022-08-18T11:35:42Z

Successfully created backport PR #10117 for stable/8.0.

korthout · 2022-08-18T14:09:16Z

Deployment redistribution works completely differently on 8.0 and 1.3. For example, the complete command is already written when the distribute command is written to the other partition. So it won't ever write it twice on the other partition. This is totally different from 8.1 where the command can be written multiple times to the other partition.

@saig0 I think we shouldn't backport this, so I'll close the opened backport PR. Let me know if you feel otherwise.

korthout added 2 commits August 16, 2022 11:48

test(engine): complete distro command idempotently

e8bd2c5

The complete deployment distribution command must be processed idempotently. Like we do in other inter-partition communication cases, we reject the command when it was already processed.

korthout force-pushed the korthout-10064-has-specific-pending-deployment branch from 1d170e9 to 6db151e Compare August 16, 2022 09:48

korthout added backport stable/1.3 labels Aug 16, 2022

korthout marked this pull request as ready for review August 16, 2022 12:40

korthout requested a review from saig0 August 16, 2022 12:40

saig0 approved these changes Aug 17, 2022

View reviewed changes

zeebe-bors-camunda bot merged commit 8ff06e4 into main Aug 18, 2022

zeebe-bors-camunda bot deleted the korthout-10064-has-specific-pending-deployment branch August 18, 2022 11:34

backport-action mentioned this pull request Aug 18, 2022

[Backport stable/8.0] Reject duplicate DeploymentDistribution Complete command #10117

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reject duplicate DeploymentDistribution Complete command #10074

Reject duplicate DeploymentDistribution Complete command #10074

korthout commented Aug 15, 2022 •

edited

github-actions bot commented Aug 15, 2022 •

edited

saig0 left a comment

korthout commented Aug 17, 2022

zeebe-bors-camunda bot commented Aug 17, 2022

korthout commented Aug 18, 2022

zeebe-bors-camunda bot commented Aug 18, 2022

backport-action commented Aug 18, 2022

backport-action commented Aug 18, 2022

korthout commented Aug 18, 2022

Reject duplicate DeploymentDistribution Complete command #10074

Reject duplicate DeploymentDistribution Complete command #10074

Conversation

korthout commented Aug 15, 2022 • edited

Description

Related issues

Definition of Done

github-actions bot commented Aug 15, 2022 • edited

Test Results

saig0 left a comment

Choose a reason for hiding this comment

korthout commented Aug 17, 2022

zeebe-bors-camunda bot commented Aug 17, 2022

korthout commented Aug 18, 2022

zeebe-bors-camunda bot commented Aug 18, 2022

backport-action commented Aug 18, 2022

backport-action commented Aug 18, 2022

korthout commented Aug 18, 2022

korthout commented Aug 15, 2022 •

edited

github-actions bot commented Aug 15, 2022 •

edited