
Deployment Distribution not idempotent #9877

Closed
korthout opened this issue Jul 25, 2022 · 7 comments
Assignees
Labels
  • area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
  • incident: Marks an issue or PR as related to an incident report
  • kind/bug: Categorizes an issue or PR as a bug
  • scope/broker: Marks an issue or PR to appear in the broker section of the changelog
  • severity/critical: Marks a stop-the-world bug, with a high impact and no existing workaround
  • version:1.3.13
  • version:8.1.0-alpha4
  • version:8.1.0: Marks an issue as being completely or in parts released in 8.1.0

Comments

@korthout
Member

Describe the bug

A DMN resource was deployed and distributed to another partition, and this distribution was then retried (for some reason). The retry appends the same command to the log again. The second command is processed in the same way as the first, so the same entry is inserted into the DMN_DECISION_REQUIREMENTS column family a second time. Because the engine treats this as an insert of a new key, the preconditions consistency check fails and stops the receiving partition from making progress. The partition is marked as unhealthy.

In the meantime, the deployment partition keeps retrying the distribution, each time appending a new command to the other partition's log. This increases the disk usage continuously, as an additional symptom.
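For illustration, a minimal, self-contained Java sketch of the failure mode (a simplified model, not the actual Zeebe ColumnFamily code; apart from the key taken from the stacktrace below, names and values are placeholders):

```java
import java.util.HashMap;
import java.util.Map;

public class ConsistencyCheckSketch {

  // Simplified stand-in for the DMN_DECISION_REQUIREMENTS column family.
  static final Map<Long, String> DMN_DECISION_REQUIREMENTS = new HashMap<>();

  // Models the preconditions check: insert requires that the key does not exist yet.
  static void insert(final long key, final String drg) {
    if (DMN_DECISION_REQUIREMENTS.containsKey(key)) {
      throw new IllegalStateException(
          "Key " + key + " in ColumnFamily DMN_DECISION_REQUIREMENTS already exists");
    }
    DMN_DECISION_REQUIREMENTS.put(key, drg);
  }

  public static void main(String[] args) {
    final long drgKey = 2251799813685249L; // the key from the stacktrace below
    insert(drgKey, "some-drg"); // first Distribute command: fine
    insert(drgKey, "some-drg"); // retried Distribute command: throws, partition becomes unhealthy
  }
}
```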

To Reproduce

06e517a contains a test, DeploymentClusteredTest.shouldRedeployDmn(), that reproduces the problem. Note that this test cannot be used directly in the fix, as it requires some additional hard-coded changes to test the scenario.

Expected behavior

Retried sending of the Distribute command should be idempotent and not trigger the consistency checks.

Log/Stacktrace

Full Stacktrace

ERROR io.camunda.zeebe.logstreams - Actor Broker-2-StreamProcessor-3 failed in phase STARTED.
io.camunda.zeebe.db.ZeebeDbInconsistentException: Key DbLong{2251799813685249} in ColumnFamily DMN_DECISION_REQUIREMENTS already exists
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.assertKeyDoesNotExist(TransactionalColumnFamily.java:273) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.lambda$insert$0(TransactionalColumnFamily.java:81) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.lambda$ensureInOpenTransaction$17(TransactionalColumnFamily.java:301) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.DefaultTransactionContext.runInTransaction(DefaultTransactionContext.java:33) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.ensureInOpenTransaction(TransactionalColumnFamily.java:300) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.insert(TransactionalColumnFamily.java:76) ~[classes/:?]
	at io.camunda.zeebe.engine.state.deployment.DbDecisionState.storeDecisionRequirements(DbDecisionState.java:163) ~[classes/:?]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.lambda$putDmnResourcesInState$0(DeploymentDistributedApplier.java:53) ~[classes/:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
	at java.util.Iterator.forEachRemaining(Iterator.java:133) ~[?:?]
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845) ~[?:?]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.putDmnResourcesInState(DeploymentDistributedApplier.java:48) ~[classes/:?]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.applyState(DeploymentDistributedApplier.java:42) ~[classes/:?]
	at io.camunda.zeebe.engine.state.appliers.DeploymentDistributedApplier.applyState(DeploymentDistributedApplier.java:26) ~[classes/:?]
	at io.camunda.zeebe.engine.state.appliers.EventAppliers.applyState(EventAppliers.java:239) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.EventApplyingStateWriter.appendFollowUpEvent(EventApplyingStateWriter.java:36) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.deployment.distribute.DeploymentDistributeProcessor.processRecord(DeploymentDistributeProcessor.java:58) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.lambda$processInTransaction$3(ProcessingStateMachine.java:300) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processInTransaction(ProcessingStateMachine.java:290) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processCommand(ProcessingStateMachine.java:253) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.tryToReadNextRecord(ProcessingStateMachine.java:213) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.readNextRecord(ProcessingStateMachine.java:189) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) ~[classes/:?]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) ~[classes/:?]

Environment:

  • Zeebe Version: 8.0.4
@korthout korthout added kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround incident Marks an issue or PR as related to an incident report area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) labels Jul 25, 2022
@korthout
Member Author

korthout commented Jul 25, 2022

A workaround exists: disable the preconditions consistency check. Please use caution with this, and re-enable the check once your version is patched against this bug.

@korthout korthout self-assigned this Jul 25, 2022
@korthout
Member Author

I've also discovered why this doesn't cause problems for re-distributed process resources: that code path uses upsert, which avoids this consistency check.
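For illustration, a simplified model of the difference between the two write operations (not the real ColumnFamily API; the method bodies here only mimic the semantics):

```java
import java.util.HashMap;
import java.util.Map;

public class InsertVsUpsertSketch {

  static final Map<Long, String> STATE = new HashMap<>();

  // insert: only valid for keys that do not exist yet, guarded by the preconditions check.
  static void insert(final long key, final String value) {
    if (STATE.putIfAbsent(key, value) != null) {
      throw new IllegalStateException("Key " + key + " already exists");
    }
  }

  // upsert: insert-or-update, so a retried write of the same key simply overwrites it.
  static void upsert(final long key, final String value) {
    STATE.put(key, value);
  }

  public static void main(String[] args) {
    upsert(1L, "process"); // first distribution of a process resource
    upsert(1L, "process"); // retry: fine, value is overwritten
    insert(2L, "drg");     // first distribution of a DMN resource
    insert(2L, "drg");     // retry: throws, which is the bug reported here
  }
}
```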

@korthout
Member Author

korthout commented Jul 25, 2022

We can either:

  1. also upsert the decision and decision requirements
  2. reject the Deployment:DISTRIBUTE command if all of the resources already exist; if some don't exist yet, then we need to accept the command, insert those, and probably also update the existing ones.

Of these solutions:

  1. is an easy fix, but also not very nice.
  2. is more complex, and because of that also not so nice.

I don't think we can simply reject the command and trigger a retry through that, because it would become a never-ending retry loop. Nor can we simply accept the command and throw away information that we don't yet have on the other partition.

For the patch, I want to keep things as simple as possible, so I'll go for solution 1.

@npepinpe
Member

I like solution no. 2 to be honest, especially since we can explain explicitly why it's being rejected (ALREADY_EXISTS). I think it's a flaw that we don't deal with rejections internally in general. For example, this incident highlighted that it's dangerous for our internal inter-partition communication not to go through the same command rate limiter, since it caused an incredible backlog as there was no flow control from the other partition. And flow control would necessarily be implemented with rejections (at least at the moment anyway).

Can you quickly summarize why no.2 is more complex? I think also with this PR, it might not be so complicated anymore, as that already simplifies the communication in the deployment distribution. Of course that only applies to main, so we might still want to go with no. 1 for a patch, but no. 2 for the main branch.

I guess we'll talk about that in the incident as well :)

@oleschoenburg
Member

In the meantime, the deployment partition keeps retrying the distribution, each time appending a new command to the other partition's log. This increases the disk usage continuously, as an additional symptom.

The impact of this is somewhat mitigated by #9858 where we now use an exponential backoff strategy. The log would still grow indefinitely, but slower.
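For context, a minimal sketch of an exponential backoff delay calculation (illustrative only; the initial delay and cap below are assumptions, not the values used in #9858):

```java
import java.time.Duration;

public class BackoffSketch {

  // Doubles the retry delay up to a maximum, so repeated redistribution attempts
  // append commands to the other partition's log less and less frequently.
  static Duration nextDelay(final Duration current, final Duration max) {
    final Duration doubled = current.multipliedBy(2);
    return doubled.compareTo(max) > 0 ? max : doubled;
  }

  public static void main(String[] args) {
    Duration delay = Duration.ofMillis(100);     // assumed initial delay
    final Duration max = Duration.ofSeconds(30); // assumed cap
    for (int attempt = 1; attempt <= 6; attempt++) {
      System.out.println("retry attempt " + attempt + " after " + delay);
      delay = nextDelay(delay, max);
    }
  }
}
```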

@korthout
Member Author

Can you quickly summarize why no.2 is more complex?

It needs logic to check that:

  • none of the resources exist yet (the expected, acceptable case),
  • or all of them exist already (expected on a retry from a deployment leader that hasn't received an ack yet; the command is rejected, but the message is still acknowledged to avoid further retries),
  • or some of them exist, in which case we need to decide how to deal with it,
  • or (potentially, we could add this) all of them exist but with differences, in which case we also need to decide how to deal with it.

It's not clear to me what should happen in the real failure cases. But it's also additional complexity that we could simply avoid if we just allow idempotent deployment distribution (storing the last received version).

TBH, I'm not strongly leaning either way, except for using the simple solution (i.e. just use upsert) to patch the broken version 8.0.x
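As a rough illustration of the branching described in the list above, a sketch of how a processor could classify a Distribute command against local state (hypothetical helper, not the actual DeploymentDistributeProcessor code):

```java
public class DistributeClassificationSketch {

  enum Outcome {
    INSERT_ALL,            // first delivery: insert everything
    REJECT_ALREADY_EXISTS, // retry: reject, but still ack so the sender stops retrying
    PARTIAL_CONFLICT       // unexpected mixed state: needs a separate decision
  }

  // Hypothetical classification; the "all exist but with differences" case would need
  // an additional content comparison on top of this count-based check.
  static Outcome classify(final int resourcesInCommand, final int resourcesAlreadyStored) {
    if (resourcesAlreadyStored == 0) {
      return Outcome.INSERT_ALL;
    }
    if (resourcesAlreadyStored == resourcesInCommand) {
      return Outcome.REJECT_ALREADY_EXISTS;
    }
    return Outcome.PARTIAL_CONFLICT;
  }

  public static void main(String[] args) {
    System.out.println(classify(3, 0)); // INSERT_ALL
    System.out.println(classify(3, 3)); // REJECT_ALREADY_EXISTS
    System.out.println(classify(3, 1)); // PARTIAL_CONFLICT
  }
}
```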

@npepinpe
Member

100% on using it for the patch 👍 Just thinking we may do better for future versions, but that's something we can discuss in the review 🚀

korthout added a commit that referenced this issue Jul 25, 2022
This fixes a critical bug, where a decision and/or drg can be
inserted twice due to retry logic for inter-partition distribution of
deployments. To counteract this, overwriting existing values is now
allowed.

See #9877 and DeploymentClusteredTest.shouldDistributeDmnResourceOnRetry.
korthout added a commit that referenced this issue Jul 25, 2022
This fixes a critical bug, where a decision and/or drg can be
inserted twice due to retry logic for inter-partition distribution of
deployments. To counteract this, overwriting existing values is now
allowed.

See #9877 and DeploymentClusteredTest.shouldDistributeDmnResourceOnRetry.
zeebe-bors-camunda bot added a commit that referenced this issue Jul 26, 2022
9883: [stable/8.0] Allow retried DMN resource distribution r=korthout a=korthout

## Description

This patches a critical bug related to the retry mechanism of deployment distribution. If the retried deployment distribution contains a DMN resource, it triggers the consistency checks.

This patch is only available for `stable/8.0`, as we want to provide a different solution on `main`. See #9877 (comment).

## Related issues


closes #9877



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
github-actions bot pushed a commit that referenced this issue Jul 26, 2022
This fixes a critical bug, where a decision and/or drg can be
inserted twice due to retry logic for inter-partition distribution of
deployments. To counteract this, overwriting existing values is now
allowed.

See #9877 and DeploymentClusteredTest.shouldDistributeDmnResourceOnRetry.

(cherry picked from commit 0e60752)
zeebe-bors-camunda bot added a commit that referenced this issue Jul 26, 2022
9885: Support logging partition id in compact record logger r=remcowesterhoud a=korthout

## Description

The compact record logger could already log the encoded partition id of keys when multiple partitions were used, but sometimes we write records on other partitions without a key (-1), or we write records on other partitions with a key that encodes a different partition.

This PR adds support for printing the partition id at the start of each log line. The partition id is omitted if the records don't contain any reference to other partitions (neither `partitionId` nor encoded in the `key`). 
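For background on "a key that encodes a different partition", a simplified sketch of how a partition id can be packed into the upper bits of a record key (the 51-bit split mirrors what Zeebe's Protocol class appears to use, but treat the exact constant as an assumption):

```java
public class PartitionKeySketch {

  // Assumed split: upper bits carry the partition id, lower 51 bits a per-partition counter.
  static final int KEY_BITS = 51;

  static long encode(final int partitionId, final long localKey) {
    return ((long) partitionId << KEY_BITS) + localKey;
  }

  static int decodePartitionId(final long key) {
    return (int) (key >> KEY_BITS);
  }

  public static void main(String[] args) {
    final long key = encode(1, 1);
    System.out.println(key);                    // 2251799813685249, the DRG key from the stacktrace
    System.out.println(decodePartitionId(key)); // 1, under this assumed split
    System.out.println(decodePartitionId(-1L)); // -1: a record written without a key has no partition info
  }
}
```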

I ran into this while working on a test case for #9877.

## Related issues


relates #9877 



9887: [Backport main] [stable/8.0] Allow retried DMN resource distribution r=korthout a=github-actions[bot]

# Description
Backport of #9883 to `main`.

relates to #9877

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 26, 2022
9888: [Backport stable/1.3] Support logging partition id in compact record logger r=korthout a=backport-action

# Description
Backport of #9885 to `stable/1.3`.

relates to #9877

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 26, 2022
9889: [Backport stable/8.0] Support logging partition id in compact record logger r=korthout a=backport-action

# Description
Backport of #9885 to `stable/8.0`.

relates to #9877

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022