ZeebeDbInconsistentException in ColumnFamily DMN_DECISION_REQUIREMENTS #9115

Closed
pihme opened this issue Apr 13, 2022 · 17 comments · Fixed by #9121 or #9432
Labels
  • kind/bug (Categorizes an issue or PR as a bug)
  • scope/broker (Marks an issue or PR to appear in the broker section of the changelog)
  • severity/high (Marks a bug as having a noticeable impact on the user with no known workaround)
  • version:8.1.0-alpha1 (Marks an issue as being completely or in parts released in 8.1.0-alpha1)
  • version:8.1.0-alpha2
  • version:8.1.0 (Marks an issue as being completely or in parts released in 8.1.0)

Comments

@pihme
Contributor

pihme commented Apr 13, 2022

Describe the bug
Found in error logs
https://console.cloud.google.com/errors/detail/CPDM9-CV9Nvk3wE;service=zeebe;time=P7D?project=camunda-cloud-240911
https://console.cloud.google.com/errors/detail/CLWTn7vY7pS04QE;service=zeebe;time=P7D?project=camunda-cloud-240911

Expected behavior

Log/Stacktrace

Full Stacktrace

 io.camunda.zeebe.db.ZeebeDbInconsistentException: Key DbLong{2251799813685350} in ColumnFamily DMN_DECISION_REQUIREMENTS already exists
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.assertKeyDoesNotExist(TransactionalColumnFamily.java:273) ~[zeebe-db-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.lambda$insert$0(TransactionalColumnFamily.java:81) ~[zeebe-db-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.lambda$ensureInOpenTransaction$17(TransactionalColumnFamily.java:301) ~[zeebe-db-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.DefaultTransactionContext.runInTransaction(DefaultTransactionContext.java:33) ~[zeebe-db-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.ensureInOpenTransaction(TransactionalColumnFamily.java:300) ~[zeebe-db-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.insert(TransactionalColumnFamily.java:76) ~[zeebe-db-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.state.deployment.DbDecisionState.storeDecisionRequirements(DbDecisionState.java:163) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.lambda$putDmnResourcesInState$0(DeploymentDistributedApplier.java:50) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at java.lang.Iterable.forEach(Unknown Source) ~[?:?]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.putDmnResourcesInState(DeploymentDistributedApplier.java:45) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.applyState(DeploymentDistributedApplier.java:39) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.applyState(DeploymentDistributedApplier.java:23) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.state.appliers.EventAppliers.applyState(EventAppliers.java:239) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.EventApplyingStateWriter.appendFollowUpEvent(EventApplyingStateWriter.java:36) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.processing.deployment.distribute.DeploymentDistributeProcessor.processRecord(DeploymentDistributeProcessor.java:58) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.lambda$processInTransaction$3(ProcessingStateMachine.java:300) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[zeebe-db-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processInTransaction(ProcessingStateMachine.java:290) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processCommand(ProcessingStateMachine.java:253) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.tryToReadNextRecord(ProcessingStateMachine.java:213) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.readNextRecord(ProcessingStateMachine.java:189) ~[zeebe-workflow-engine-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) ~[zeebe-util-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) ~[zeebe-util-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) ~[zeebe-util-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) ~[zeebe-util-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) ~[zeebe-util-8.0.0.jar:8.0.0]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) ~[zeebe-util-8.0.0.jar:8.0.0] 

Environment:

  • Zeebe Version: 8.0.0
@pihme
Contributor Author

pihme commented Apr 13, 2022

@oleschoenburg / @npepinpe Please have a look at this one, both in terms of the bug and whether the consistency checks should be enabled.

@pihme
Contributor Author

pihme commented Apr 13, 2022

@npepinpe Which area would you use for data integrity?

@pihme
Contributor Author

pihme commented Apr 13, 2022

BTW First occurrence was for a paying customer on what looks like a test system.

@npepinpe
Member

Hm, the checks should be enabled only on the trial plan 🤔

@pihme
Contributor Author

pihme commented Apr 13, 2022

It is a trial plan

@npepinpe
Member

Then yes, all trial users should get the checks as a means to test it via progressive rollout. The goal is to make sure there are no false positives and no big performance impact before rolling it out to everyone.

@pihme pihme added the kind/bug Categorizes an issue or PR as a bug label Apr 13, 2022
@npepinpe
Member

Regarding label, each of them has a description, so pick which one seems to fit best (e.g. area/reliability). However, like I mentioned last night, if you think it makes more sense to have a more specific label, go for it - while I want to avoid having tons of labels and nobody using the same ones or understanding what others are doing, it makes sense to let it grow organically, and I'll just keep an eye on them and do a pass to consolidate them every once in a while (so I might have some questions I guess).

@oleschoenburg
Member

Looks like someone deployed DMN resources but the decision requirements key already exists. If this were valid, that would mean that DbDecisionState#storeDecisionRequirements should upsert, not insert into DMN_DECISION_REQUIREMENTS.
However, looking at tests such as DeploymentDmnTest#shouldDeployDuplicate, redeploying the same DMN twice seems to not re-use the same decision requirements key. I think I'd need some help from @saig0 or @korthout to understand if upsert or insert is correct here.

@Override
public void storeDecisionRequirements(final DecisionRequirementsRecord record) {
  dbDecisionRequirementsKey.wrapLong(record.getDecisionRequirementsKey());
  dbPersistedDecisionRequirements.wrap(record);
  decisionRequirementsByKey.insert(dbDecisionRequirementsKey, dbPersistedDecisionRequirements);
  updateLatestDecisionRequirementsVersion(record);
}
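
For readers following the insert-versus-upsert question, here is a minimal sketch of the two semantics. It is not Zeebe code: `InMemoryColumnFamily` is a hypothetical stand-in backed by a `HashMap`, and the exception type is an assumption chosen for illustration. A strict `insert` fails on an existing key, much like the consistency check in the stack trace, while `upsert` silently overwrites.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a column family; NOT the real TransactionalColumnFamily.
final class InMemoryColumnFamily {
  private final Map<Long, String> entries = new HashMap<>();

  // Strict insert: refuses to overwrite an existing key.
  void insert(final long key, final String value) {
    if (entries.containsKey(key)) {
      throw new IllegalStateException("Key " + key + " already exists");
    }
    entries.put(key, value);
  }

  // Upsert: stores the value whether or not the key is already present.
  void upsert(final long key, final String value) {
    entries.put(key, value);
  }

  String get(final long key) {
    return entries.get(key);
  }
}
```

With strict `insert`, a second write to the same key surfaces as an error instead of hiding a duplicate; with `upsert`, the duplicate would go unnoticed. Which one is correct depends on whether the same key can legitimately arrive twice.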

@saig0
Member

saig0 commented Apr 13, 2022

It is interesting that the error only happens on partition 2 but not on partition 1. I assume that it is related to the distribution of deployments.

Having a quick look at DeploymentDistributedApplier, it seems that we don't check for duplicated DRG or decisions. As a result, we try to insert the DRG again.

@saig0 saig0 added the scope/broker Marks an issue or PR to appear in the broker section of the changelog label Apr 13, 2022
@oleschoenburg
Member

Based on the logs, it looks like the distribution of the deployment timed out, was retried, and then failed:

Distribute deployment 2251799813685349 to partition 2
Deployment DISTRIBUTE command for deployment 2251799813685349 was written on partition 2
Failed to receive deployment response for partition 2 (on topic 'deployment-response-2251799813685349-2'). Retrying
Deployment DISTRIBUTE command for deployment 2251799813685349 was written on partition 2
io.camunda.zeebe.db.ZeebeDbInconsistentException: Key DbLong{2251799813685347} in ColumnFamily DMN_DECISION_REQUIREMENTS already exists
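
The log sequence above suggests the same DISTRIBUTE command reached partition 2 twice. One common way to tolerate that kind of redelivery is to make the receiving side idempotent. The sketch below is only an illustration under that assumption: `DistributeCommandReceiver` and its methods are hypothetical names, not part of Zeebe, and the real engine may handle this differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical receiver on the target partition; names are illustrative only.
final class DistributeCommandReceiver {
  private final Set<Long> appliedDeploymentKeys = new HashSet<>();
  private int stateWrites = 0;

  // Returns true if the command changed state, false if it was a redelivery.
  boolean onDistribute(final long deploymentKey) {
    if (!appliedDeploymentKeys.add(deploymentKey)) {
      // Already applied: the caller can acknowledge again without touching state.
      return false;
    }
    stateWrites++;
    return true;
  }

  int getStateWrites() {
    return stateWrites;
  }
}
```

Delivering deployment key 2251799813685349 twice to this receiver would result in exactly one state write; the retry is acknowledged but otherwise ignored.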

@saig0 saig0 added the severity/high Marks a bug as having a noticeable impact on the user with no known workaround label Apr 13, 2022
@oleschoenburg
Member

I think we could temporarily disable the checks for this one cluster, restart the brokers, wait for the deployment distribution to finish, and then re-enable the checks. This would make this cluster healthy again (and not corrupt any data), but it would not solve the underlying issue, so we might see the same exception again soon.

@remcowesterhoud remcowesterhoud self-assigned this Apr 13, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Apr 13, 2022
9121: Prevent duplicate key insertion for DMN r=remcowesterhoud a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->
To keep our data consistent, we should make sure we don't store duplicate values in the state. The DMN resources were missing the required checks to prevent this. We would always try to insert the resources, regardless of whether they were duplicates. This change filters out the duplicate records and guarantees we only store the non-duplicates.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #9115 



Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
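
As a rough illustration of the "filter out duplicates before storing" idea described in the commit message above, the following sketch keeps only the keys that are not yet in the state. It is an assumption about the general technique, not the code from #9121: `DuplicateFilter`, `newKeys`, and the `Map`-based state are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; the real change in #9121 may look quite different.
final class DuplicateFilter {
  private DuplicateFilter() {}

  // Returns the incoming keys that are not yet present in the state,
  // in their original order and without repeats.
  static List<Long> newKeys(final Map<Long, String> state, final List<Long> incoming) {
    return incoming.stream()
        .distinct()
        .filter(key -> !state.containsKey(key))
        .collect(Collectors.toList());
  }
}
```

Only the keys returned by `newKeys` would then be passed to a strict insert, so a redelivered record never reaches the consistency check.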
@npepinpe
Member

Cool! Let's do a patch release tomorrow 👍

zeebe-bors-camunda bot added a commit that referenced this issue Apr 13, 2022
9125: [Backport stable/8.0] fix(broker): do not log transition failure due to term mismatch as error r=deepthidevaki a=github-actions[bot]

# Description
Backport of #9122 to `stable/8.0`.

relates to #9040

9133: [Backport stable/8.0] Prevent duplicate key insertion for DMN r=remcowesterhoud a=github-actions[bot]

# Description
Backport of #9121 to `stable/8.0`.

relates to #9115

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@deepthidevaki deepthidevaki added the version:8.1.0-alpha1 Marks an issue as being completely or in parts released in 8.1.0-alpha1 label May 3, 2022
@oleschoenburg
Member

oleschoenburg commented May 16, 2022

Apparently versions 8.0.1 and 8.0.2 show the same symptoms. At least https://console.cloud.google.com/errors/detail/CPDM9-CV9Nvk3wE;service=zeebe;time=P7D?project=camunda-cloud-240911 shows that the same exception is thrown on newer versions.
I'll re-open this issue. Maybe @remcowesterhoud could take another look?

@oleschoenburg oleschoenburg reopened this May 16, 2022
@oleschoenburg
Member

I saved the state from an 8.0.2 cluster where this has happened again in case it's useful for root-causing: https://drive.google.com/file/d/1EO6a_zBeTR5bJvc-GXYRYasY972_9z7B/view?usp=sharing

@remcowesterhoud
Contributor

This is most likely related to #9337

It’s interesting that this is happening in Zeebe now, as I couldn’t find a way to reproduce it. But I have found the most likely root cause and will fix it when I’m back from holiday.

zeebe-bors-camunda bot added a commit that referenced this issue May 30, 2022
9458: [Backport stable/8.0] Support deploying multiple DMN files at once r=remcowesterhoud a=backport-action

# Description
Backport of #9432 to `stable/8.0`.

relates to camunda/zeebe-process-test#357 #9337 #9115

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 30, 2022
9458: [Backport stable/8.0] Support deploying multiple DMN files at once r=saig0 a=backport-action

# Description
Backport of #9432 to `stable/8.0`.

relates to camunda/zeebe-process-test#357 #9337 #9115

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
@oleschoenburg
Member

fixed by #9887

@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
7 participants