Deployment Distribution not idempotent #9877
Comments
A workaround exists by disabling the preconditions consistency check. Please use caution with this, and re-enable the check once your version is patched against this bug.
I've also discovered why this doesn't cause problems for re-distributed process resources: they use upsert, and upsert avoids this consistency check.
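For illustration, a minimal sketch of why upsert sidesteps the check that a strict insert performs (plain Java over an in-memory map; `DecisionState`, `insert`, and `upsert` here are invented stand-ins, not Zeebe's actual state API):

```java
import java.util.HashMap;
import java.util.Map;

final class DecisionState {
  private final Map<Long, String> requirementsByKey = new HashMap<>();

  /** Strict insert: fails if the key already exists (the preconditions consistency check). */
  void insert(long key, String drg) {
    if (requirementsByKey.containsKey(key)) {
      throw new IllegalStateException(
          "key " + key + " already exists; insert violates the preconditions check");
    }
    requirementsByKey.put(key, drg);
  }

  /** Upsert: writes regardless of prior state, so a retried distribution is a harmless overwrite. */
  void upsert(long key, String drg) {
    requirementsByKey.put(key, drg);
  }
}
```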
We can either:

1. accept the retried Distribute command and overwrite the existing state (i.e. make the distribution idempotent via upsert), or
2. reject the duplicate Distribute command with an explicit rejection reason.

Of these solutions:
I don't think we can just reject the command and let it trigger a retry, because that would become a never-ending retry loop. We also can't just accept the command and throw away information that we don't have on the other partition. For the patch, I want to keep things as simple as possible, so I'll go for solution 1.
I like solution no. 2 to be honest, especially since we can explain explicitly why it's being rejected. Can you quickly summarize why no. 2 is more complex? I think that with this PR it might not be so complicated anymore, as that already simplifies the communication in the deployment distribution. Of course that only applies to main, so we might still want to go with no. 1 for a patch, but no. 2 for the main branch. I guess we'll talk about that in the incident as well :)
The impact of this is somewhat mitigated by #9858, where we now use an exponential backoff strategy. The log would still grow indefinitely, but more slowly.
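As a rough sketch of such a strategy (the class name and delay values below are illustrative assumptions, not the actual parameters from #9858):

```java
import java.time.Duration;

final class DistributionRetryBackoff {
  private static final Duration INITIAL_DELAY = Duration.ofMillis(100);
  private static final Duration MAX_DELAY = Duration.ofSeconds(10);

  private Duration nextDelay = INITIAL_DELAY;

  /** Returns the delay to wait before the next retry, doubling each time up to a cap. */
  Duration next() {
    final Duration current = nextDelay;
    final Duration doubled = nextDelay.multipliedBy(2);
    nextDelay = doubled.compareTo(MAX_DELAY) > 0 ? MAX_DELAY : doubled;
    return current;
  }
}
```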
It needs logic to check that:

- the decision and/or DRG from the retried distribution already exists in the state, and
- the existing entry is identical to the one being distributed.

It's not clear to me what should happen in the REAL failure cases. But it's also additional complexity that we could simply avoid if we just allow idempotent deployment distribution (storing the last received one). TBH, I'm not strongly leaning either way, except for using the simple solution (i.e. just use upsert) to patch the broken 8.0.x versions.
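To make that concrete, a hypothetical sketch of the extra check solution 2 would need (all names invented for illustration): only reject when the stored resource is identical, and treat a mismatch as the real failure case whose handling is unclear.

```java
import java.util.Arrays;

enum DistributionOutcome { ACCEPT, REJECT_DUPLICATE, INCONSISTENT }

final class DuplicateDistributionCheck {
  DistributionOutcome check(byte[] storedResource, byte[] distributedResource) {
    if (storedResource == null) {
      return DistributionOutcome.ACCEPT; // first distribution: insert normally
    }
    if (Arrays.equals(storedResource, distributedResource)) {
      return DistributionOutcome.REJECT_DUPLICATE; // safe to reject with an explicit reason
    }
    return DistributionOutcome.INCONSISTENT; // real failure: same key, different content
  }
}
```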
100% on using it for the patch 👍 Just thinking we may do better for future versions, but that's something we can discuss in the review 🚀
This fixes a critical bug where a decision and/or DRG can be inserted twice due to the retry logic for inter-partition distribution of deployments. To counteract this, we now allow overwriting existing values. See #9877 and DeploymentClusteredTest.shouldDistributeDmnResourceOnRetry.
9883: [stable/8.0] Allow retried DMN resource distribution r=korthout a=korthout

## Description
This patches a critical bug related to the retry mechanism of deployment distribution. If the retried deployment distribution contains a DMN resource, then this would trigger the consistency checks. This patch is only available for `stable/8.0`, as we want to provide a different solution on `main`. See #9877 (comment).

## Related issues
closes #9877

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
This fixes a critical bug where a decision and/or DRG can be inserted twice due to the retry logic for inter-partition distribution of deployments. To counteract this, we now allow overwriting existing values. See #9877 and DeploymentClusteredTest.shouldDistributeDmnResourceOnRetry. (cherry picked from commit 0e60752)
9885: Support logging partition id in compact record logger r=remcowesterhoud a=korthout

## Description
The compact record logger could already log the encoded partition id of keys when multiple partitions were used, but sometimes we write records on other partitions without a key (-1), or we write records on other partitions with a key that encodes a different partition. This PR adds support for printing the partition id at the start of each log line. The partition id is omitted if the records don't contain any reference to other partitions (neither `partitionId` nor encoded in the `key`). I ran into this while working on a test case for #9877.

## Related issues
relates #9877

9887: [Backport main] [stable/8.0] Allow retried DMN resource distribution r=korthout a=github-actions[bot]

# Description
Backport of #9883 to `main`.

relates to #9877

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Describe the bug
A DMN resource was deployed and distributed to another partition, and then this distribution was retried (for some reason). The retry results in the same command on the log. The second command is processed in the same way as the first and results in the same DMN_DECISION_REQUIREMENTS record being inserted into the state. The engine expects this to be a fresh insert operation and thus triggers the consistency check. The consistency check then stops the second partition from making progress, and the partition is marked as unhealthy.

In the meantime, the deployment partition keeps retrying the distribution, each time appending a new command to the other partition's log. As an additional symptom, this continuously increases disk usage.
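A minimal, self-contained simulation of this failure mode (invented names, not Zeebe code): the retried Distribute command fails the strict insert's check, while each retry still appends to the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RetryLoopSimulation {
  public static void main(String[] args) {
    final List<String> log = new ArrayList<>(); // the partition's log keeps growing
    final Map<Long, String> state = new HashMap<>();

    for (int attempt = 1; attempt <= 3; attempt++) {
      log.add("DISTRIBUTE deployment-42 (attempt " + attempt + ")");
      try {
        if (state.containsKey(42L)) { // strict insert's consistency check
          throw new IllegalStateException("consistency check: key 42 already exists");
        }
        state.put(42L, "dmn-resource");
      } catch (IllegalStateException e) {
        System.out.println("partition marked unhealthy: " + e.getMessage());
      }
    }
    System.out.println("log size after retries: " + log.size()); // grows with every retry
  }
}
```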
To Reproduce
06e517a contains a test in DeploymentClusteredTest.shouldRedeployDmn() that reproduces the problem. Note that this test cannot be used directly in the fix, as it required some additional hard-coded changes to test the scenario.

Expected behavior
Retried sending of the Distribute command should be idempotent and not trigger the consistency checks.
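A minimal sketch of that expectation as a test (plain JUnit 5 over an in-memory map; `processDistribute` is an invented stand-in for the real command processing, not the actual DeploymentClusteredTest utilities):

```java
import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

import java.util.HashMap;
import java.util.Map;
import org.junit.jupiter.api.Test;

final class DistributionIdempotencyTest {
  private final Map<Long, String> state = new HashMap<>();

  /** Simulates processing a Distribute command with upsert semantics. */
  private void processDistribute(long key, String resource) {
    state.put(key, resource); // upsert: idempotent on retry
  }

  @Test
  void retriedDistributeShouldBeIdempotent() {
    // given: the first distribution of a DMN decision requirements graph
    processDistribute(42L, "drg-bytes");

    // when/then: the retried, identical command must not fail a consistency check
    assertDoesNotThrow(() -> processDistribute(42L, "drg-bytes"));
  }
}
```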
Log/Stacktrace
Environment: