Delete existing PENDING_DEPLOYMENT causes ZeebeDbInconsistentException #10064
Labels
area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
kind/bug: Categorizes an issue or PR as a bug
scope/broker: Marks an issue or PR to appear in the broker section of the changelog
severity/critical: Marks a stop-the-world bug, with a high impact and no existing workaround
version:8.1.0-alpha5: Marks an issue as being completely or in parts released in 8.1.0-alpha5
version:8.1.0: Marks an issue as being completely or in parts released in 8.1.0
Comments
korthout added the kind/bug, scope/broker, severity/critical, and area/reliability labels on Aug 15, 2022
I've marked this as critical because it marks the deployment partition as DEAD, and we should resolve it ASAP.
zeebe-bors-camunda bot added a commit that referenced this issue on Aug 15, 2022
10073: Abbreviate Deployment Distribution records in compact logger r=pihme a=korthout

## Description

Many record types are abbreviated in the compact logger, but some of the more common (and also longer) ones aren't. See, for example, this output after a failed test:

```
2 C DEPLOYMENT DISTRIBUTE - #1->-1 P1K2 - process.xml
3 C DEPLOYMENT DISTRIBUTE - #1->-1 P1K2 - process.xml
1 C DEPLOYMENT CREATE - #1->-1 -1 -
1 E PROC CREATED - #2->#1 P1K1 - process.xml -> "shouldR..ecovery" (version:1)
1 E DEPLOYMENT CREATED - #3->#1 P1K2 - process.xml
1 E DEPLOYMENT_DISTRIBUTION DISTRIBUTING - #4->#1 P1K2 - DeploymentDistributionRecord {"partitionId":2}
1 E DEPLOYMENT_DISTRIBUTION DISTRIBUTING - #5->#1 P1K2 - DeploymentDistributionRecord {"partitionId":3}
2 C DEPLOYMENT DISTRIBUTE - #2->-1 P1K2 - process.xml
3 C DEPLOYMENT DISTRIBUTE - #2->-1 P1K2 - process.xml
1 C DEPLOYMENT_DISTRIBUTION COMPLETE - #6->-1 P1K2 - DeploymentDistributionRecord {"partitionId":2}
1 E DEPLOYMENT_DISTRIBUTION COMPLETED - #7->#6 P1K2 - DeploymentDistributionRecord {"partitionId":2}
2 E DEPLOYMENT DISTR - #3->#1 P1K2 - process.xml
2 E DEPLOYMENT DISTR - #4->#2 P1K2 - process.xml
```

After this PR, the same log is output as:

```
2 C DPLY DISTRIBUTE - #1->-1 P1K2 - process.xml
3 C DPLY DISTRIBUTE - #1->-1 P1K2 - process.xml
1 C DPLY CREATE - #1->-1 -1 -
1 E PROC CREATED - #2->#1 P1K1 - process.xml -> "shouldR..ecovery" (version:1)
1 E DPLY CREATED - #3->#1 P1K2 - process.xml
1 E DSTR DISTRIBUTING - #4->#1 P1K2 - on partition 2
1 E DSTR DISTRIBUTING - #5->#1 P1K2 - on partition 3
2 C DPLY DISTRIBUTE - #2->-1 P1K2 - process.xml
3 C DPLY DISTRIBUTE - #2->-1 P1K2 - process.xml
1 C DSTR COMPLETE - #6->-1 P1K2 - on partition 2
1 E DSTR COMPLETED - #7->#6 P1K2 - on partition 2
2 E DPLY DISTRIBUTED - #3->#1 P1K2 - process.xml
2 E DPLY DISTRIBUTED - #4->#2 P1K2 - process.xml
```

## Related issues

relates to #10064

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
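The abbreviation step itself amounts to a lookup from the long value-type name to a short alias. A minimal sketch (the class name and lookup are illustrative, not Zeebe's compact-logger code; the two aliases shown match the before/after logs above):

```java
import java.util.Map;

// Illustrative sketch of the abbreviation lookup: long value-type names are
// replaced with short aliases, and unknown names pass through unchanged.
public class AbbreviationSketch {
  static final Map<String, String> ABBREVIATIONS = Map.of(
      "DEPLOYMENT", "DPLY",
      "DEPLOYMENT_DISTRIBUTION", "DSTR");

  static String abbreviate(String valueType) {
    // fall back to the full name when no alias is registered
    return ABBREVIATIONS.getOrDefault(valueType, valueType);
  }

  public static void main(String[] args) {
    System.out.println(abbreviate("DEPLOYMENT_DISTRIBUTION")); // prints DSTR
    System.out.println(abbreviate("JOB")); // no alias registered: prints JOB
  }
}
```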
zeebe-bors-camunda bot added a commit that referenced this issue on Aug 15, 2022
10072: Document ProcessingScheduleService.runAtFixedRate r=pihme a=korthout

## Description

The ProcessingScheduleService has a runAtFixedRate method that is implemented in a simple way: run the task after a delay and then schedule itself again. But there are some quirks around this that should be clear to the caller. Specifically, there is an initial delay before the task is first run, **and**, due to the way the task reschedules itself, time traveling only affects the currently scheduled delay.

## Related issues

related to #10064

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
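The behavior described above can be sketched with a virtual clock (this is a simplified, hypothetical model, not the actual ProcessingScheduleService): each run schedules the next one, so jumping the clock forward by several periods still fires only the single currently pending task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of a runAtFixedRate built on a one-shot runWithDelay:
// the task runs after an initial delay and then reschedules itself,
// so advancing time only affects the one currently scheduled delay.
public class FixedRateSketch {
  record Scheduled(long dueAt, Runnable task) {}

  static class VirtualScheduler {
    long now = 0;
    final PriorityQueue<Scheduled> queue =
        new PriorityQueue<>((a, b) -> Long.compare(a.dueAt(), b.dueAt()));

    void runWithDelay(long delay, Runnable task) {
      queue.add(new Scheduled(now + delay, task));
    }

    void runAtFixedRate(long delay, Runnable task) {
      runWithDelay(delay, () -> {
        task.run();
        runAtFixedRate(delay, task); // the next delay starts only now
      });
    }

    void advance(long duration) {
      now += duration;
      while (!queue.isEmpty() && queue.peek().dueAt() <= now) {
        queue.poll().task().run();
      }
    }
  }

  public static void main(String[] args) {
    VirtualScheduler scheduler = new VirtualScheduler();
    List<Long> runs = new ArrayList<>();
    scheduler.runAtFixedRate(10, () -> runs.add(scheduler.now));

    scheduler.advance(10); // first run only after the initial delay
    scheduler.advance(25); // jump of 2.5 periods: only ONE pending task fires
    System.out.println(runs); // prints [10, 35]
  }
}
```

Note that the 25-unit jump produces a single run, not two: rescheduling happens when the task runs, so the skipped period is simply lost.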
zeebe-bors-camunda bot added a commit that referenced this issue on Aug 17, 2022
10074: Reject duplicate DeploymentDistribution Complete command r=korthout a=korthout

## Description

There existed a special case that could lead to a ZeebeDbInconsistentException:

- A pending deployment is distributed multiple times to another partition by the [DeploymentRedistributor](https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/deployment/distribute/DeploymentRedistributor.java).
- The other partition processes the distribution twice and both times sends a `DeploymentDistribution:Complete` command to the deployment partition (i.e. `partitionId: 1`).
- The deployment partition processes the first complete command and writes a `DeploymentDistribution:Completed` event, which is applied to the state.
- Applying the completed event deletes the pending deployment for that partition.
- When it processes the second complete command, there could still be an open pending deployment for another partition of the same deployment. If so, the error happens:
  - The second command is not rejected, because there is still a pending deployment for the deployment key, so another completed event is written and applied.
  - Applying fails this time, because the pending deployment no longer exists.

This PR changes the behavior: the second command is now rejected because the specific pending deployment no longer exists, and in that case we don't write the completed event a second time.

## Related issues

closes #10064

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
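The steps above can be sketched with a simplified model of the state (hypothetical class and method names, not the actual engine code): the rejection check must look up the pending deployment by both the deployment key and the partition id, not the deployment key alone.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the fix: a Complete command is rejected when the specific
// pending deployment (deployment key + partition id) no longer exists,
// so the COMPLETED event is never written twice for the same partition.
public class CompleteCommandSketch {
  // pending deployments keyed by "deploymentKey/partitionId"
  final Set<String> pending = new HashSet<>();

  void addPending(long deploymentKey, int partitionId) {
    pending.add(deploymentKey + "/" + partitionId);
  }

  /** Returns true if a COMPLETED event was written, false if rejected. */
  boolean processComplete(long deploymentKey, int partitionId) {
    final String key = deploymentKey + "/" + partitionId;
    if (!pending.contains(key)) {
      return false; // reject: this specific pending deployment is gone
    }
    // write COMPLETED; applying the event deletes the pending entry
    pending.remove(key);
    return true;
  }

  public static void main(String[] args) {
    CompleteCommandSketch state = new CompleteCommandSketch();
    state.addPending(1L, 2); // same deployment pending on partition 2...
    state.addPending(1L, 3); // ...and on partition 3

    // partition 2 sends Complete twice (a retried distribution)
    System.out.println(state.processComplete(1L, 2)); // prints true
    System.out.println(state.processComplete(1L, 2)); // prints false: rejected
  }
}
```

Before the fix, the second `processComplete(1L, 2)` would have passed a key-only check (partition 3 is still pending for deployment 1) and the subsequent delete of the already-removed entry would have thrown.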
zeebe-bors-camunda bot added a commit that referenced this issue on Aug 18, 2022
10083: Document how to do inter-partition communication r=korthout a=korthout

## Description

During our team meeting, we discussed that we wanted to document how we do (and want to do) inter-partition communication. This is an attempt to document how we want to do inter-partition communication, and why it should be done this way.

## Related issues

relates to #10064

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Co-authored-by: Nico Korthout <korthout@users.noreply.github.com>
Zelldon added the version:8.1.0 label on Oct 4, 2022
Describe the bug

The deployment partition (i.e. partition 1) on benchmark cluster medic-cw-32-be18e23b78-benchmark is DEAD, due to a ZeebeDbInconsistentException. This cluster was already facing other problems, but these appear unrelated.

The ZeebeDbInconsistentException is thrown when a partition tries to make a state change that is unexpected and would otherwise lead to an inconsistent state. The partition is then marked as DEAD to preserve the consistent state. See ColumnFamily #8787.

Looking at the stacktrace (below), it's pretty clear that an attempt was made to delete a non-existing pending deployment from the state. This attempt was made as a direct result of writing a DeploymentDistribution:Completed event. We check whether a pending deployment exists before writing the COMPLETED event, but not for the specific partition id.
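The consistency check that trips here can be illustrated with a minimal sketch (this is not Zeebe's RocksDB-backed ColumnFamily; the class and method names are hypothetical): deleting a key that does not exist throws instead of silently succeeding, which is exactly what surfaces the bug.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a consistency-checked column family:
// a delete of a non-existing key is treated as corruption and throws,
// rather than being silently ignored.
public class ConsistencyCheckSketch {
  static class InconsistentStateException extends RuntimeException {
    InconsistentStateException(String message) { super(message); }
  }

  static class CheckedColumnFamily<K, V> {
    private final Map<K, V> store = new HashMap<>();

    void insert(K key, V value) { store.put(key, value); }

    void deleteExisting(K key) {
      if (!store.containsKey(key)) {
        throw new InconsistentStateException(
            "Expected to delete existing key '" + key + "', but it does not exist");
      }
      store.remove(key);
    }
  }

  public static void main(String[] args) {
    CheckedColumnFamily<String, String> pendingDeployments = new CheckedColumnFamily<>();
    pendingDeployments.insert("deployment-1/partition-2", "pending");
    pendingDeployments.deleteExisting("deployment-1/partition-2"); // ok
    try {
      // a second delete for the same pending deployment, as in this issue
      pendingDeployments.deleteExisting("deployment-1/partition-2");
    } catch (InconsistentStateException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```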
To Reproduce
This may be hard to reproduce, but it is potentially caused by a retried deployment distribution.
Expected behavior

We should reject the DeploymentDistribution:Complete command when the specific pending deployment (for the deployment key and the partition id) doesn't exist, instead of writing the DeploymentDistribution:Completed event.

Log/Stacktrace
Full Stacktrace
Environment: