Unable to cancel process instance at receive task with incident #7544
Comments
This may or may not be a regression introduced with the event sourcing model. Since it bans a process instance, I've given it |
We can improve the error handling of event applying by logging metadata about the applied event that caused the error. That way we wouldn't have to dig too deep into the data in cases like this one. |
In the meantime, I've continued to investigate the issue in more detail. The process is rather simple (no boundary events or event subprocesses). The receive task is a direct child of the process. I have also confirmed the following current state (with the banned instance):
I think what we're seeing here is that the event trigger (message correlation) still exists in the state for the receive task when it is terminated (by processing the TERMINATE_ELEMENT command that was written from the termination of the process instance). However, the receive task considers this event trigger at termination to mean that it was interrupted by an event that needs to be activated.
Solutions
I see 2 potential solutions:
|
Here's a failing test case for it. It creates the Error record and bans the instance as described:

package io.camunda.zeebe.engine.processing.bpmn.activity;

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.tuple;

import io.camunda.zeebe.engine.util.EngineRule;
import io.camunda.zeebe.model.bpmn.Bpmn;
import io.camunda.zeebe.model.bpmn.BpmnModelInstance;
import io.camunda.zeebe.model.bpmn.builder.ReceiveTaskBuilder;
import io.camunda.zeebe.protocol.record.Assertions;
import io.camunda.zeebe.protocol.record.Record;
import io.camunda.zeebe.protocol.record.RecordType;
import io.camunda.zeebe.protocol.record.intent.IncidentIntent;
import io.camunda.zeebe.protocol.record.intent.MessageSubscriptionIntent;
import io.camunda.zeebe.protocol.record.intent.ProcessInstanceIntent;
import io.camunda.zeebe.protocol.record.value.BpmnElementType;
import io.camunda.zeebe.test.util.record.RecordingExporter;
import io.camunda.zeebe.test.util.record.RecordingExporterTestWatcher;
import java.util.function.Consumer;
import org.junit.ClassRule;
import org.junit.Rule;
import org.junit.Test;

public final class ReceiveTaskTest {

  @ClassRule public static final EngineRule ENGINE = EngineRule.singlePartition();

  @Rule
  public final RecordingExporterTestWatcher recordingExporterTestWatcher =
      new RecordingExporterTestWatcher();

  @Test
  public void shouldTerminateReceiveTaskAfterProcessCancelled() {
    // given
    ENGINE
        .deployment()
        .withXmlResource(
            Bpmn.createExecutableProcess("process")
                .startEvent()
                .receiveTask(
                    "task", t -> t.message(m -> m.name("foo").zeebeCorrelationKeyExpression("bar")))
                .zeebeOutputExpression("output_item", "output")
                .endEvent()
                .done())
        .deploy();

    final long processInstanceKey =
        ENGINE.processInstance().ofBpmnProcessId("process").withVariable("bar", "baz").create();

    RecordingExporter.messageSubscriptionRecords(MessageSubscriptionIntent.CREATED)
        .withProcessInstanceKey(processInstanceKey)
        .await();

    ENGINE.message().withName("foo").withCorrelationKey("baz").publish();

    RecordingExporter.incidentRecords(IncidentIntent.CREATED)
        .withProcessInstanceKey(processInstanceKey)
        .await();

    // when
    ENGINE.processInstance().withInstanceKey(processInstanceKey).cancel();

    // then
    RecordingExporter.processInstanceRecords(ProcessInstanceIntent.ELEMENT_COMPLETED)
        .withProcessInstanceKey(processInstanceKey)
        .limitToProcessInstanceCompleted()
        .await();
  }
}
|
Is there currently a workaround? What is the exact impact right now, just that the instance is banned? From the user's/client's point of view, what is happening? |
@npepinpe For the user who wanted to cancel the process instance, it means that the process instance is still active, but banned. Both the process instance and the incident are still visible in Operate. If the cancel operation is sent from Operate, then Operate shows the operation-in-progress spinner indefinitely. There is no workaround once the situation has happened, because the instance is banned and no progress can be made on it. The only way to avoid the situation is to resolve the incident. However, some output mapping expressions lead to incidents that cannot be resolved. For example, |
I will attach this to the "Banning replacement" epic for now. But we should verify that this issue is still relevant, as the mentioned method has been changed to only decrement if there are more than 0 active sequence flows:

public void decrementActiveSequenceFlows() {
  if (getActiveSequenceFlows() > 0) {
    activeSequenceFlowsProp.decrement();
    // This should never happen, but we should fix this in a better way
    // https://github.com/camunda/zeebe/issues/9528
    // if (decrement < 0) {
    //   throw new IllegalStateException(
    //       "Not expected to have an active sequence flow count lower then zero!");
    // }
  }
}
|
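To make the trade-off in the guarded method quoted above concrete, here is a minimal, self-contained sketch. It is a toy model, not Zeebe code: `SequenceFlowCounter`, `decrementStrict`, and `decrementGuarded` are hypothetical names invented for illustration. It contrasts a strict decrement that fails at zero with a guarded decrement that silently does nothing.

```java
/** Toy stand-in for a flow scope's counter. Hypothetical, not the Zeebe class. */
final class SequenceFlowCounter {
  private int activeSequenceFlows;

  void incrementActiveSequenceFlows() {
    activeSequenceFlows++;
  }

  /** Strict variant: a decrement at zero is treated as a bug and fails loudly. */
  void decrementStrict() {
    if (activeSequenceFlows == 0) {
      throw new IllegalStateException("active sequence flow count would drop below zero");
    }
    activeSequenceFlows--;
  }

  /** Guarded variant: a decrement at zero is silently ignored. */
  void decrementGuarded() {
    if (activeSequenceFlows > 0) {
      activeSequenceFlows--;
    }
  }

  int getActiveSequenceFlows() {
    return activeSequenceFlows;
  }
}

final class GuardedDecrementDemo {
  public static void main(String[] args) {
    SequenceFlowCounter scope = new SequenceFlowCounter();

    // Spurious decrement with no sequence flow taken: the guard absorbs it.
    scope.decrementGuarded();
    System.out.println("count after guarded decrement: " + scope.getActiveSequenceFlows());

    // The same call on the strict variant surfaces the inconsistency.
    try {
      scope.decrementStrict();
    } catch (IllegalStateException e) {
      System.out.println("strict decrement rejected: " + e.getMessage());
    }
  }
}
```

In this model the guard keeps the counter consistent, but it also hides the fact that a decrement was requested when none was due, which is why verifying the original scenario still seems worthwhile.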
Describe the bug
The engine tried to decrement the active sequence flows of a flow scope instance, but failed because the count would drop below zero. This resulted in an Error event and the process instance being banned.
This happened on Camunda Cloud and was reported as an error.
Judging by the stack trace, a receive task appears to be interrupted by an event and is being terminated. As part of the TERMINATE_ELEMENT processing of the receive task, the triggered event is activated by writing the ELEMENT_ACTIVATING event for it. This event is then applied to the state, which leads to an attempt to decrement the number of active sequence flows. However, that shouldn't have happened. The number of active sequence flows of the flow scope should only be lowered when a sequence flow was actually taken before activating an event element. Normally the engine is able to determine whether or not the active sequence flows should be decremented, but this code could use some love anyway IMO.
Might be related to #6778
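The rule described above (only lower the count when a sequence flow was actually taken) can be sketched with a toy model. This is an illustration under assumptions, not the engine's implementation: `FlowScope`, `ActivationCause`, and `onElementActivating` are hypothetical names, and the real engine derives this decision from its own state.

```java
/** Hypothetical reason why an element is being activated. */
enum ActivationCause {
  SEQUENCE_FLOW_TAKEN, // a sequence flow into the element was taken first
  EVENT_TRIGGERED // an event trigger interrupted another element, no flow was taken
}

/** Toy flow scope that tracks active sequence flows. Not the Zeebe class. */
final class FlowScope {
  private int activeSequenceFlows;

  void takeSequenceFlow() {
    activeSequenceFlows++;
  }

  /** Applies an element activation, decrementing only if a flow was taken. */
  void onElementActivating(ActivationCause cause) {
    if (cause == ActivationCause.SEQUENCE_FLOW_TAKEN) {
      if (activeSequenceFlows == 0) {
        throw new IllegalStateException("no active sequence flow to consume");
      }
      activeSequenceFlows--;
    }
    // EVENT_TRIGGERED: nothing to consume, so the count stays untouched.
  }

  int getActiveSequenceFlows() {
    return activeSequenceFlows;
  }
}

final class ActivationDemo {
  public static void main(String[] args) {
    FlowScope scope = new FlowScope();

    // Regular path: the taken flow is consumed by the activation.
    scope.takeSequenceFlow();
    scope.onElementActivating(ActivationCause.SEQUENCE_FLOW_TAKEN);

    // Interrupting event: activation must not touch the count.
    scope.onElementActivating(ActivationCause.EVENT_TRIGGERED);

    System.out.println("active sequence flows: " + scope.getActiveSequenceFlows());
  }
}
```

In this model the failure from the report corresponds to handling an `EVENT_TRIGGERED` activation as if it were `SEQUENCE_FLOW_TAKEN` while the count is already zero.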
To Reproduce
Not yet sure, depends on what type of event element is triggered that interrupted the receive task. I've asked for a data snapshot to inspect the state of this process instance.
Expected behavior
Don't call decrementActiveSequenceFlows when activating an event element whose trigger interrupted another element.
Log/Stacktrace
Full Stacktrace
Environment: