
Deployment Distribution not idempotent #9877

Closed
korthout opened this issue Jul 25, 2022 · 7 comments
Assignees
Labels
  • area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
  • incident: Marks an issue or PR as related to an incident report
  • kind/bug: Categorizes an issue or PR as a bug
  • scope/broker: Marks an issue or PR to appear in the broker section of the changelog
  • severity/critical: Marks a stop-the-world bug, with a high impact and no existing workaround
  • version:1.3.13
  • version:8.1.0-alpha4
  • version:8.1.0: Marks an issue as being completely or in parts released in 8.1.0

Comments

@korthout
Member

Describe the bug

A DMN resource was deployed and distributed to another partition, and this distribution was then retried (for some reason). The retry appends the same command to the log again. The second command is processed in the same way as the first, so the same entry is inserted into the DMN_DECISION_REQUIREMENTS column family a second time. Because the engine treats this as an insert of a new key, the preconditions consistency check fails and stops the receiving partition from making progress. The partition is marked as unhealthy.

In the meantime, the deployment partition keeps retrying the distribution, each time appending a new command to the other partition's log. This increases the disk usage continuously, as an additional symptom.
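For illustration, a minimal, self-contained Java sketch of the failure mode (a simplified model, not the actual Zeebe ColumnFamily code; apart from the key taken from the stacktrace below, names and values are placeholders):

```java
import java.util.HashMap;
import java.util.Map;

public class ConsistencyCheckSketch {

  // Simplified stand-in for the DMN_DECISION_REQUIREMENTS column family.
  static final Map<Long, String> DMN_DECISION_REQUIREMENTS = new HashMap<>();

  // Models the preconditions check: insert requires that the key does not exist yet.
  static void insert(final long key, final String drg) {
    if (DMN_DECISION_REQUIREMENTS.containsKey(key)) {
      throw new IllegalStateException(
          "Key " + key + " in ColumnFamily DMN_DECISION_REQUIREMENTS already exists");
    }
    DMN_DECISION_REQUIREMENTS.put(key, drg);
  }

  public static void main(String[] args) {
    final long drgKey = 2251799813685249L; // the key from the stacktrace below
    insert(drgKey, "some-drg"); // first Distribute command: fine
    insert(drgKey, "some-drg"); // retried Distribute command: throws, partition becomes unhealthy
  }
}
```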

To Reproduce

06e517a contains a test, DeploymentClusteredTest.shouldRedeployDmn(), that reproduces the problem. Note that this test cannot be used directly in the fix, as it requires some additional hard-coded changes to test the scenario.

Expected behavior

Retried sending of the Distribute command should be idempotent and not trigger the consistency checks.

Log/Stacktrace

Full Stacktrace

ERROR io.camunda.zeebe.logstreams - Actor Broker-2-StreamProcessor-3 failed in phase STARTED.
io.camunda.zeebe.db.ZeebeDbInconsistentException: Key DbLong{2251799813685249} in ColumnFamily DMN_DECISION_REQUIREMENTS already exists
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.assertKeyDoesNotExist(TransactionalColumnFamily.java:273) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.lambda$insert$0(TransactionalColumnFamily.java:81) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.lambda$ensureInOpenTransaction$17(TransactionalColumnFamily.java:301) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.DefaultTransactionContext.runInTransaction(DefaultTransactionContext.java:33) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.ensureInOpenTransaction(TransactionalColumnFamily.java:300) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.insert(TransactionalColumnFamily.java:76) ~[classes/:?]
	at io.camunda.zeebe.engine.state.deployment.DbDecisionState.storeDecisionRequirements(DbDecisionState.java:163) ~[classes/:?]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.lambda$putDmnResourcesInState$0(DeploymentDistributedApplier.java:53) ~[classes/:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
	at java.util.Iterator.forEachRemaining(Iterator.java:133) ~[?:?]
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845) ~[?:?]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.putDmnResourcesInState(DeploymentDistributedApplier.java:48) ~[classes/:?]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.applyState(DeploymentDistributedApplier.java:42) ~[classes/:?]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.applyState(DeploymentDistributedApplier.java:26) ~[classes/:?]
	at io.camunda.zeebe.engine.state.appliers.EventAppliers.applyState(EventAppliers.java:239) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.EventApplyingStateWriter.appendFollowUpEvent(EventApplyingStateWriter.java:36) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.deployment.distribute.DeploymentDistributeProcessor.processRecord(DeploymentDistributeProcessor.java:58) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.lambda$processInTransaction$3(ProcessingStateMachine.java:300) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processInTransaction(ProcessingStateMachine.java:290) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processCommand(ProcessingStateMachine.java:253) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.tryToReadNextRecord(ProcessingStateMachine.java:213) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.readNextRecord(ProcessingStateMachine.java:189) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) ~[classes/:?]

Environment:

  • Zeebe Version: 8.0.4
@korthout korthout added kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround incident Marks an issue or PR as related to an incident report area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) labels Jul 25, 2022
@korthout
Member Author

korthout commented Jul 25, 2022

A workaround exists: disable the preconditions consistency check. Please use caution with this, and re-enable the check once your version is patched against this bug.

@korthout korthout self-assigned this Jul 25, 2022
@korthout
Member Author

I've also discovered why this doesn't cause problems for re-distributed process resources: that code path uses upsert, which avoids this consistency check.
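For illustration, a simplified model of the difference between the two write operations (not the real ColumnFamily API; the method bodies here only mimic the semantics):

```java
import java.util.HashMap;
import java.util.Map;

public class InsertVsUpsertSketch {

  static final Map<Long, String> STATE = new HashMap<>();

  // insert: only valid for keys that do not exist yet, guarded by the preconditions check.
  static void insert(final long key, final String value) {
    if (STATE.putIfAbsent(key, value) != null) {
      throw new IllegalStateException("Key " + key + " already exists");
    }
  }

  // upsert: insert-or-update, so a retried write of the same key simply overwrites it.
  static void upsert(final long key, final String value) {
    STATE.put(key, value);
  }

  public static void main(String[] args) {
    upsert(1L, "process"); // first distribution of a process resource
    upsert(1L, "process"); // retry: fine, value is overwritten
    insert(2L, "drg");     // first distribution of a DMN resource
    insert(2L, "drg");     // retry: throws, which is the bug reported here
  }
}
```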

@korthout
Member Author

korthout commented Jul 25, 2022

We can either:

  1. also upsert the decision and decision requirements
  2. reject the Deployment:DISTRIBUTE command if all of the resources already exist; if some don't exist yet, then we need to accept the command, insert those, and probably also update the existing ones.

Of these solutions:

  1. is an easy fix, but also not very nice.
  2. is more complex, and because of that also not so nice.

I don't think we can simply reject the command and trigger a retry through that, because it would become a never-ending retry loop. Nor can we simply accept the command and throw away information that we don't yet have on the other partition.

For the patch, I want to keep things as simple as possible, so I'll go for solution 1.

@npepinpe
Member

I like solution no. 2 to be honest, especially since we can explain explicitly why it's being rejected (ALREADY_EXISTS). I think it's a flaw that we don't deal with rejections internally in general. For example, this incident highlighted that it's dangerous for our internal inter-partition communication not to go through the same command rate limiter, since it caused an incredible backlog as there was no flow control from the other partition. And flow control would necessarily be implemented with rejections (at least at the moment anyway).

Can you quickly summarize why no.2 is more complex? I think also with this PR, it might not be so complicated anymore, as that already simplifies the communication in the deployment distribution. Of course that only applies to main, so we might still want to go with no. 1 for a patch, but no. 2 for the main branch.

I guess we'll talk about that in the incident as well :)

@oleschoenburg
Member

In the meantime, the deployment partition keeps retrying the distribution, each time appending a new command to the other partition's log. This increases the disk usage continuously, as an additional symptom.

The impact of this is somewhat mitigated by #9858 where we now use an exponential backoff strategy. The log would still grow indefinitely, but slower.
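For context, a minimal sketch of an exponential backoff delay calculation (illustrative only; the initial delay and cap below are assumptions, not the values used in #9858):

```java
import java.time.Duration;

public class BackoffSketch {

  // Doubles the retry delay up to a maximum, so repeated redistribution attempts
  // append commands to the other partition's log less and less frequently.
  static Duration nextDelay(final Duration current, final Duration max) {
    final Duration doubled = current.multipliedBy(2);
    return doubled.compareTo(max) > 0 ? max : doubled;
  }

  public static void main(String[] args) {
    Duration delay = Duration.ofMillis(100);     // assumed initial delay
    final Duration max = Duration.ofSeconds(30); // assumed cap
    for (int attempt = 1; attempt <= 6; attempt++) {
      System.out.println("retry attempt " + attempt + " after " + delay);
      delay = nextDelay(delay, max);
    }
  }
}
```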

@korthout
Member Author

Can you quickly summarize why no.2 is more complex?

It needs logic to check that:

  • none of the resources exist yet (the expected, acceptable case),
  • or all of them exist already (expected on a retry from a deployment leader that hasn't received an ack yet; the command is rejected, but the message is still acknowledged to avoid further retries),
  • or some of them exist, in which case we need to decide how to deal with it,
  • or (potentially, we could add this) all of them exist but with differences, in which case we also need to decide how to deal with it.

It's not clear to me what should happen in the real failure cases. But it's also additional complexity that we could simply avoid if we just allow idempotent deployment distribution (storing the last received version).

TBH, I'm not strongly leaning either way, except for using the simple solution (i.e. just use upsert) to patch the broken version 8.0.x
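As a rough illustration of the branching described in the list above, a sketch of how a processor could classify a Distribute command against local state (hypothetical helper, not the actual DeploymentDistributeProcessor code):

```java
public class DistributeClassificationSketch {

  enum Outcome {
    INSERT_ALL,            // first delivery: insert everything
    REJECT_ALREADY_EXISTS, // retry: reject, but still ack so the sender stops retrying
    PARTIAL_CONFLICT       // unexpected mixed state: needs a separate decision
  }

  // Hypothetical classification; the "all exist but with differences" case would need
  // an additional content comparison on top of this count-based check.
  static Outcome classify(final int resourcesInCommand, final int resourcesAlreadyStored) {
    if (resourcesAlreadyStored == 0) {
      return Outcome.INSERT_ALL;
    }
    if (resourcesAlreadyStored == resourcesInCommand) {
      return Outcome.REJECT_ALREADY_EXISTS;
    }
    return Outcome.PARTIAL_CONFLICT;
  }

  public static void main(String[] args) {
    System.out.println(classify(3, 0)); // INSERT_ALL
    System.out.println(classify(3, 3)); // REJECT_ALREADY_EXISTS
    System.out.println(classify(3, 1)); // PARTIAL_CONFLICT
  }
}
```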

@npepinpe
Member

100% on using it for the patch 👍 Just thinking we may do better for future versions, but that's something we can discuss in the review 🚀

korthout added a commit that referenced this issue Jul 25, 2022
This fixes a critical bug, where a decision and/or drg can be
inserted twice due to retry logic for inter-partition distribution of
deployments. To counteract this, overwriting existing values is now
allowed.

See #9877 and DeploymentClusteredTest.shouldDistributeDmnResourceOnRetry.
korthout added a commit that referenced this issue Jul 25, 2022
This fixes a critical bug, where a decision and/or drg can be
inserted twice due to retry logic for inter-partition distribution of
deployments. To counteract this, overwriting existing values is now
allowed.

See #9877 and DeploymentClusteredTest.shouldDistributeDmnResourceOnRetry.
zeebe-bors-camunda bot added a commit that referenced this issue Jul 26, 2022
9883: [stable/8.0] Allow retried DMN resource distribution r=korthout a=korthout

## Description

This patches a critical bug related to the retry mechanism of deployment distribution. If the retried deployment distribution contains a DMN resource, it triggers the consistency checks.

This patch is only available for `stable/8.0`, as we want to provide a different solution on `main`. See #9877 (comment).

## Related issues


closes #9877



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
github-actions bot pushed a commit that referenced this issue Jul 26, 2022
This fixes a critical bug, where a decision and/or drg can be
inserted twice due to retry logic for inter-partition distribution of
deployments. To counteract this, overwriting existing values is now
allowed.

See #9877 and DeploymentClusteredTest.shouldDistributeDmnResourceOnRetry.

(cherry picked from commit 0e60752)
zeebe-bors-camunda bot added a commit that referenced this issue Jul 26, 2022
9885: Support logging partition id in compact record logger r=remcowesterhoud a=korthout

## Description

The compact record logger could already log the encoded partition id of keys when multiple partitions were used, but sometimes we write records on other partitions without a key (-1), or we write records on other partitions with a key that encodes a different partition.

This PR adds support for printing the partition id at the start of each log line. The partition id is omitted if the records don't contain any reference to other partitions (neither `partitionId` nor encoded in the `key`). 
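For background on "a key that encodes a different partition", a simplified sketch of how a partition id can be packed into the upper bits of a record key (the 51-bit split mirrors what Zeebe's Protocol class appears to use, but treat the exact constant as an assumption):

```java
public class PartitionKeySketch {

  // Assumed split: upper bits carry the partition id, lower 51 bits a per-partition counter.
  static final int KEY_BITS = 51;

  static long encode(final int partitionId, final long localKey) {
    return ((long) partitionId << KEY_BITS) + localKey;
  }

  static int decodePartitionId(final long key) {
    return (int) (key >> KEY_BITS);
  }

  public static void main(String[] args) {
    final long key = encode(1, 1);
    System.out.println(key);                    // 2251799813685249, the DRG key from the stacktrace
    System.out.println(decodePartitionId(key)); // 1, under this assumed split
    System.out.println(decodePartitionId(-1L)); // -1: a record written without a key has no partition info
  }
}
```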

I ran into this while working on a test case for #9877.

## Related issues


relates #9877 



9887: [Backport main] [stable/8.0] Allow retried DMN resource distribution r=korthout a=github-actions[bot]

# Description
Backport of #9883 to `main`.

relates to #9877

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 26, 2022
9888: [Backport stable/1.3] Support logging partition id in compact record logger r=korthout a=backport-action

# Description
Backport of #9885 to `stable/1.3`.

relates to #9877

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 26, 2022
9889: [Backport stable/8.0] Support logging partition id in compact record logger r=korthout a=backport-action

# Description
Backport of #9885 to `stable/8.0`.

relates to #9877

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022