Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reject duplicate DeploymentDistribution Complete command #10074

Merged

Conversation

korthout
Copy link
Member

@korthout korthout commented Aug 15, 2022

Description

There existed a special case that could lead to a ZeebeDbInconsistentException:

  • a pending deployment is distributed multiple times to another partition by the DeploymentRedistributor.
  • the other partition processes the distribution twice and both times sends a DeploymentDistribution:Complete command to the deployment partition (i.e. partitionId: 1).
  • the deployment partition processes the first complete command, and writes DeploymentDistribution:Completed event, which is applied to the state.
  • applying the completed event results in deleting the Pending Deployment for that partition.
  • when it processes the second complete command, there could still be a pending deployment for another partition open for the same deployment, if so, the error happens.
  • the second command is not rejected, because there is still a pending deployment for the deployment key, so another completed event is written and applied.
  • applying fails this time, because the pending deployment no longer exists.

This PR changes the behavior. It makes sure the second command is rejected because the specific pending deployment no longer exists. In that case, we don't write the completed event a second time.

Related issues

closes #10064

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 15, 2022

Test Results

   844 files  +    1     844 suites  +1   1h 43m 13s ⏱️ + 8m 1s
6 226 tests  - 231  6 214 ✔️  - 232  11 💤 ±0  1 +1 
6 410 runs   - 231  6 398 ✔️  - 232  11 💤 ±0  1 +1 

For more details on these failures, see this check.

Results for commit 6db151e. ± Comparison against base commit 17630e8.

♻️ This comment has been updated with latest results.

The complete deployment distribution command must be processed
idempotently. Like we do in other inter-partition communication cases,
we reject the command when it was already processed.
The pending deployment is deleted when the DeploymentDistribution
COMPLETED event is written.

We should only write this event if the pending deployment still exists.
If it doesn't exist it either was already completed or it was never
pending. In both cases, the command to COMPLETE the deployment
distribution should be rejected.

We can achieve this by providing a way to check whether a specific
deployment distribution exists in the state. The current method only
allows to check that there is at least one pending distribution for a
specific deployment, but it doesn't allow to check the specific
partition id.
@korthout korthout force-pushed the korthout-10064-has-specific-pending-deployment branch from 1d170e9 to 6db151e Compare August 16, 2022 09:48
@korthout korthout marked this pull request as ready for review August 16, 2022 12:40
@korthout korthout requested a review from saig0 August 16, 2022 12:40
Copy link
Member

@saig0 saig0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@korthout LGTM 👍

@korthout
Copy link
Member Author

bors merge

zeebe-bors-camunda bot added a commit that referenced this pull request Aug 17, 2022
10074: Reject duplicate DeploymentDistribution Complete command r=korthout a=korthout

## Description

<!-- Please explain the changes you made here. -->
There existed a special case that could lead to a ZeebeDbInconsistentException:
- a pending deployment is distributed multiple times to another partition by the [DeploymentRedistributor](https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/deployment/distribute/DeploymentRedistributor.java).
- the other partition processes the distribution twice and both times sends a `DeploymentDistribution:Complete` command to the deployment partition (i.e. `partitionId: 1`).
- the deployment partition processes the first complete command, and writes `DeploymentDistribution:Completed` event, which is applied to the state.
- applying the completed event results in deleting the Pending Deployment for that partition.
- when it processes the second complete command, there could still be a pending deployment for another partition open for the same deployment, if so, the error happens.
- the second command is not rejected, because there is still a pending deployment for the deployment key, so another completed event is written and applied.
- applying fails this time, because the pending deployment no longer exists.

This PR changes the behavior. It makes sure the second command is rejected because the specific pending deployment no longer exists. In that case, we don't write the completed event a second time.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #10064



Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
@zeebe-bors-camunda
Copy link
Contributor

Build failed:

@korthout
Copy link
Member Author

I missed that this failed to merge, but it seems to have happened due to Out of space in CodeCache 🤷

bors retry

@zeebe-bors-camunda
Copy link
Contributor

Build succeeded:

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit 8ff06e4 into main Aug 18, 2022
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the korthout-10064-has-specific-pending-deployment branch August 18, 2022 11:34
@backport-action
Copy link
Collaborator

Backport failed for stable/1.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/1.3
git worktree add -d .worktree/backport-10074-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-10074-to-stable/1.3
git checkout -b backport-10074-to-stable/1.3
ancref=$(git merge-base 17630e8cc3f1d81c5dc9523208196147d17114ed 6db151ecf7578ff97c6aa7b590614fc81d666293)
git cherry-pick -x $ancref..6db151ecf7578ff97c6aa7b590614fc81d666293

@backport-action
Copy link
Collaborator

Successfully created backport PR #10117 for stable/8.0.

@korthout
Copy link
Member Author

Deployment redistribution works completely differently on 8.0 and 1.3. For example, the complete command is already written when the distribute command is written to the other partition. So it won't ever write it twice on the other partition. This is totally different from 8.1 where the command can be written multiple times to the other partition.

@saig0 I think we shouldn't backport this, so I'll close the opened backport PR. Let me know if you feel otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delete existing PENDING_DEPLOYMENT causes ZeebeDbInconsistentException
3 participants