Reject duplicate DeploymentDistribution Complete command #10074
Test Results 844 files + 1 844 suites +1 1h 43m 13s ⏱️ + 8m 1s For more details on these failures, see this check. Results for commit 6db151e. ± Comparison against base commit 17630e8. ♻️ This comment has been updated with latest results. |
The complete deployment distribution command must be processed idempotently. Like we do in other inter-partition communication cases, we reject the command when it was already processed.
The pending deployment is deleted when the DeploymentDistribution COMPLETED event is written. We should only write this event if the pending deployment still exists. If it doesn't exist, it was either already completed or never pending. In both cases, the command to COMPLETE the deployment distribution should be rejected. We can achieve this by providing a way to check whether a specific deployment distribution exists in the state. The current method only allows checking whether there is at least one pending distribution for a specific deployment; it does not allow checking for a specific partition id.
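The difference between the two checks can be sketched with a small in-memory stand-in for the state. This is an illustration only: the class `PendingDistributionState` and its method names are hypothetical and are not taken from the Zeebe code base, and a plain `Map` replaces the real database-backed state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the pending-distribution state (not Zeebe's actual classes).
final class PendingDistributionState {
  // deploymentKey -> partition ids that have not yet confirmed the distribution
  private final Map<Long, Set<Integer>> pendingByDeployment = new HashMap<>();

  void addPending(long deploymentKey, int partitionId) {
    pendingByDeployment.computeIfAbsent(deploymentKey, k -> new HashSet<>()).add(partitionId);
  }

  // Coarse check: is at least one partition still pending for this deployment?
  boolean hasAnyPending(long deploymentKey) {
    return pendingByDeployment.containsKey(deploymentKey);
  }

  // Fine-grained check: is this specific partition still pending for this deployment?
  boolean hasPending(long deploymentKey, int partitionId) {
    Set<Integer> partitions = pendingByDeployment.get(deploymentKey);
    return partitions != null && partitions.contains(partitionId);
  }

  // Mirrors applying the COMPLETED event: fails if the entry is already gone.
  void removePending(long deploymentKey, int partitionId) {
    Set<Integer> partitions = pendingByDeployment.get(deploymentKey);
    if (partitions == null || !partitions.remove(partitionId)) {
      throw new IllegalStateException(
          "No pending distribution for deployment " + deploymentKey
              + " on partition " + partitionId);
    }
    if (partitions.isEmpty()) {
      pendingByDeployment.remove(deploymentKey);
    }
  }
}
```

With two partitions pending for the same deployment, completing one of them leaves `hasAnyPending` true while `hasPending` for the completed partition is false, which is exactly the case the coarse check cannot distinguish.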
1d170e9
to
6db151e
Compare
@korthout LGTM 👍
bors merge |
10074: Reject duplicate DeploymentDistribution Complete command r=korthout a=korthout

## Description

There existed a special case that could lead to a ZeebeDbInconsistentException:
- a pending deployment is distributed multiple times to another partition by the [DeploymentRedistributor](https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/deployment/distribute/DeploymentRedistributor.java).
- the other partition processes the distribution twice and both times sends a `DeploymentDistribution:Complete` command to the deployment partition (i.e. `partitionId: 1`).
- the deployment partition processes the first complete command, and writes `DeploymentDistribution:Completed` event, which is applied to the state.
- applying the completed event results in deleting the Pending Deployment for that partition.
- when it processes the second complete command, there could still be a pending deployment for another partition open for the same deployment, if so, the error happens.
- the second command is not rejected, because there is still a pending deployment for the deployment key, so another completed event is written and applied.
- applying fails this time, because the pending deployment no longer exists.

This PR changes the behavior. It makes sure the second command is rejected because the specific pending deployment no longer exists. In that case, we don't write the completed event a second time.

## Related issues

closes #10064

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
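The changed behavior described above can be replayed in a minimal sketch. Everything here is an assumption made for illustration: `CompleteCommandSketch`, `distribute`, `processComplete` and the `Outcome` enum are hypothetical names, and a `Set` of string ids stands in for the real state and event log.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of processing DeploymentDistribution:Complete commands idempotently.
final class CompleteCommandSketch {
  enum Outcome { COMPLETED, REJECTED }

  // Pending (deploymentKey, partitionId) pairs, encoded as "key:partition".
  private final Set<String> pending = new HashSet<>();

  private static String id(long deploymentKey, int partitionId) {
    return deploymentKey + ":" + partitionId;
  }

  // Records that the deployment was sent to a partition and awaits confirmation.
  void distribute(long deploymentKey, int partitionId) {
    pending.add(id(deploymentKey, partitionId));
  }

  // Rejects the command unless this specific partition is still pending.
  Outcome processComplete(long deploymentKey, int partitionId) {
    String key = id(deploymentKey, partitionId);
    if (!pending.contains(key)) {
      // Already completed or never pending: do not write a second Completed event.
      return Outcome.REJECTED;
    }
    // Stands in for writing and applying the Completed event.
    pending.remove(key);
    return Outcome.COMPLETED;
  }
}
```

In this sketch, distributing deployment `1` to partitions `2` and `3` and then processing the complete command for partition `2` twice yields `COMPLETED` followed by `REJECTED`, even though partition `3` is still pending.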
Build failed: |
I missed that this failed to merge, but it seems to have happened due to `Out of space in CodeCache` 🤷
bors retry
Build succeeded: |
Backport failed for Please cherry-pick the changes locally.
git fetch origin stable/1.3
git worktree add -d .worktree/backport-10074-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-10074-to-stable/1.3
git checkout -b backport-10074-to-stable/1.3
ancref=$(git merge-base 17630e8cc3f1d81c5dc9523208196147d17114ed 6db151ecf7578ff97c6aa7b590614fc81d666293)
git cherry-pick -x $ancref..6db151ecf7578ff97c6aa7b590614fc81d666293 |
Successfully created backport PR #10117 for |
Deployment redistribution works completely differently on 8.0 and 1.3. For example, the complete command is already written when the distribute command is written to the other partition. So it won't ever write it twice on the other partition. This is totally different from 8.1 where the command can be written multiple times to the other partition. @saig0 I think we shouldn't backport this, so I'll close the opened backport PR. Let me know if you feel otherwise. |
Description
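There existed a special case that could lead to a ZeebeDbInconsistentException:
- a pending deployment is distributed multiple times to another partition by the DeploymentRedistributor.
- the other partition processes the distribution twice and both times sends a `DeploymentDistribution:Complete` command to the deployment partition (i.e. `partitionId: 1`).
- the deployment partition processes the first complete command, and writes `DeploymentDistribution:Completed` event, which is applied to the state.
- applying the completed event results in deleting the Pending Deployment for that partition.
- when it processes the second complete command, there could still be a pending deployment for another partition open for the same deployment, if so, the error happens.
- the second command is not rejected, because there is still a pending deployment for the deployment key, so another completed event is written and applied.
- applying fails this time, because the pending deployment no longer exists.

This PR changes the behavior. It makes sure the second command is rejected because the specific pending deployment no longer exists. In that case, we don't write the completed event a second time.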
Related issues
closes #10064
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/1.3
) to the PR, in case that fails you need to create backports manually.
Testing:
Documentation:
Please refer to our review guidelines.