Unable to cancel process instance at receive task with incident #7544

korthout · 2021-07-22T10:43:16Z

Describe the bug

Engine tried to decrement the active sequence flows of a flow scope instance, but failed because it would drop below zero. This resulted in an Error event and the process instance being banned.

This happened on Camunda Cloud and was reported as an error.

Looking at the stacktrace it looks like a receive task is interrupted by an event and is being terminated. As part of the Terminate_Element processing of the receive task, the triggered event is activated by writing the Element_Activating event for it. This event is then applied to the state, which leads to it trying to decrement the number of active sequence flows. However, that shouldn't have happened. The number of active sequence flows of the flow scope should only be lowered when a sequence flow was actually taken before activating an event element. Normally the engine is able to determine whether or not the active sequence flows should be decremented, but this code could use some love anyways IMO.

Might be related to #6778

To Reproduce

Not yet sure, depends on what type of event element is triggered that interrupted the receive task. I've asked for a data snapshot to inspect the state of this process instance.

Expected behavior

Don't decrementActiveSequenceFlows when activating an event element that by triggering interrupted another element.

Log/Stacktrace

Full Stacktrace

java.lang.IllegalStateException: Not expected to have an active sequence flow count lower then zero!
	at io.camunda.zeebe.engine.state.instance.ElementInstance.decrementActiveSequenceFlows(ElementInstance.java:170) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementActivatingApplier.decrementActiveSequenceFlow(ProcessInstanceElementActivatingApplier.java:159) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementActivatingApplier.applyState(ProcessInstanceElementActivatingApplier.java:86) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementActivatingApplier.applyState(ProcessInstanceElementActivatingApplier.java:27) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.state.appliers.EventAppliers.applyState(EventAppliers.java:248) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.EventApplyingStateWriter.appendFollowUpEvent(EventApplyingStateWriter.java:36) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.common.EventTriggerBehavior.activateTriggeredEvent(EventTriggerBehavior.java:319) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnEventSubscriptionBehavior.activateTriggeredEvent(BpmnEventSubscriptionBehavior.java:140) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.task.ReceiveTaskProcessor.lambda$onTerminate$6(ReceiveTaskProcessor.java:75) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at java.util.Optional.ifPresentOrElse(Unknown Source) ~[?:?]
	at io.camunda.zeebe.engine.processing.bpmn.task.ReceiveTaskProcessor.onTerminate(ReceiveTaskProcessor.java:72) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.task.ReceiveTaskProcessor.onTerminate(ReceiveTaskProcessor.java:19) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processEvent(BpmnStreamProcessor.java:211) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.lambda$processRecord$1(BpmnStreamProcessor.java:133) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.Either$Right.ifRightOrLeft(Either.java:238) ~[zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processRecord(BpmnStreamProcessor.java:130) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.TypedRecordProcessor.processRecord(TypedRecordProcessor.java:54) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.lambda$processInTransaction$3(ProcessingStateMachine.java:295) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[zeebe-db-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processInTransaction(ProcessingStateMachine.java:285) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processEvent(ProcessingStateMachine.java:253) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.tryToReadNextEvent(ProcessingStateMachine.java:219) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.readNextEvent(ProcessingStateMachine.java:210) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:191) [zeebe-util-1.1.0.jar:1.1.0]

Environment:

!what 83f60830-d2f8-4c44-84a1-6e8a6723c89b
found cluster <redacted>
Channel: Alpha
Generation: Zeebe 1.1 with Optimize Beta
ClusterPlan: Production - S v10 with Worker and Optimize
DataCenter: Europe West
Status: Healthy
- Zeebe: Healthy
- Elasticsearch: Healthy
- Operate: Healthy
- Tasklist: Healthy
- Optimize: Healthy
owned by <redacted>’s Organization
SalesPlan: Trial
Expired: false (Wed, 18 Aug 2021 12:44:38 GMT)

The text was updated successfully, but these errors were encountered:

korthout · 2021-07-22T10:44:21Z

This may or may not be a regression introduced with the event sourcing model. Since it bans a process instance, I've given it Severity: Mid for now

korthout · 2021-07-22T11:10:27Z

We can improve the error handling of event applying by logging metadata about the applied event that caused the error. Taht way we wouldn't have to dig too deep into the data in some cases like this one.

korthout · 2021-07-22T16:51:06Z

In the meantime, I've continued to investigate the issue in more detail. The process is rather simple (no boundary events, or event sub processes). The receive task is a direct child of the process.

I have also confirmed the following current state (with the banned instance):

the process instance has 0 active sequence flows, again confirms that we should not decrement here
the process instance is TERMINATING, so I believe the process instance was cancelled
the receive task is in state COMPLETING, so I believe a message was correlated to it, but the output mapping failed, leading to an incident
I don't see any incidents in the state, but the process instance termination would already have removed that incident

I think what we're seeing here is the event trigger (message correlation) still exists in the state for the receive task when it is terminated (by processing the TERMINATE_ELEMENT command that was written from the termination of the process instance). However, the receive task considers this event trigger at termination to mean that it was interrupted by an event that needs to be activated.

Solutions

I see 2 potential solutions:

termination of an element should also remove any event triggers from children before trying to terminate them
receive task can check whether it's flowscope is terminating before assuming that an event trigger means that it should activate the triggered event.

korthout · 2021-07-22T17:10:32Z

Here's a failing test case for it. It creates the Error record and bans the instance as described:

package io.camunda.zeebe.engine.processing.bpmn.activity;

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.tuple;

import io.camunda.zeebe.engine.util.EngineRule;
import io.camunda.zeebe.model.bpmn.Bpmn;
import io.camunda.zeebe.model.bpmn.BpmnModelInstance;
import io.camunda.zeebe.model.bpmn.builder.ReceiveTaskBuilder;
import io.camunda.zeebe.protocol.record.Assertions;
import io.camunda.zeebe.protocol.record.Record;
import io.camunda.zeebe.protocol.record.RecordType;
import io.camunda.zeebe.protocol.record.intent.IncidentIntent;
import io.camunda.zeebe.protocol.record.intent.MessageSubscriptionIntent;
import io.camunda.zeebe.protocol.record.intent.ProcessInstanceIntent;
import io.camunda.zeebe.protocol.record.value.BpmnElementType;
import io.camunda.zeebe.test.util.record.RecordingExporter;
import io.camunda.zeebe.test.util.record.RecordingExporterTestWatcher;
import java.util.function.Consumer;
import org.junit.ClassRule;
import org.junit.Rule;
import org.junit.Test;

public final class ReceiveTaskTest {

  @ClassRule public static final EngineRule ENGINE = EngineRule.singlePartition();

  @Rule
  public final RecordingExporterTestWatcher recordingExporterTestWatcher =
      new RecordingExporterTestWatcher();

  @Test
  public void shouldTerminateReceiveTaskAfterProcessCancelled() {
    // given
    ENGINE
        .deployment()
        .withXmlResource(
            Bpmn.createExecutableProcess("process")
                .startEvent()
                .receiveTask(
                    "task", t -> t.message(m -> m.name("foo").zeebeCorrelationKeyExpression("bar")))
                .zeebeOutputExpression("output_item", "output")
                .endEvent()
                .done())
        .deploy();

    final long processInstanceKey =
        ENGINE.processInstance().ofBpmnProcessId("process").withVariable("bar", "baz").create();
    RecordingExporter.messageSubscriptionRecords(MessageSubscriptionIntent.CREATED)
        .withProcessInstanceKey(processInstanceKey)
        .await();

    ENGINE.message().withName("foo").withCorrelationKey("baz").publish();
    RecordingExporter.incidentRecords(IncidentIntent.CREATED)
        .withProcessInstanceKey(processInstanceKey)
        .await();

    // when
    ENGINE.processInstance().withInstanceKey(processInstanceKey).cancel();
    
    // then
    RecordingExporter.processInstanceRecords(ProcessInstanceIntent.ELEMENT_COMPLETED)
        .withProcessInstanceKey(processInstanceKey)
        .limitToProcessInstanceCompleted()
        .await();
  }
}

npepinpe · 2021-07-26T07:31:33Z

Is there currently a workaround? What is the exact impact right now, just that the instance is banned? From the user's/client's point of view, what is happening?

korthout · 2021-07-26T08:45:01Z

@npepinpe For the user, that wanted to cancel the process instance it means that the process instance is still active, but banned. Both the process instance and the incident are still visible in Operate. If the cancel operation is send from Operate, then Operate shows the operation in progress spinner indefinitely.

There is no workaround when the situation has happened, because the instance is banned and no progress on it can be made.

There is no workaround to avoid the situation, except for resolving the incident. There are output mapping expressions possible, that lead to incidents that cannot be resolved. For example, = ? is unresolvable.

remcowesterhoud · 2023-04-05T09:08:28Z

I will attach this to the "Banning replacement" epic for now. But we should verify that this issue is still relevant as the mentioned method has been changed to only decrement if there are more than 0 active sequence flows:

  public void decrementActiveSequenceFlows() {
    if (getActiveSequenceFlows() > 0) {
      activeSequenceFlowsProp.decrement();
      // This should never happen, but we should fix this in a better way
      // https://github.com/camunda/zeebe/issues/9528
      //    if (decrement < 0) {
      //      throw new IllegalStateException(
      //          "Not expected to have an active sequence flow count lower then zero!");
      //    }
    }
  }

korthout added kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog Impact: Usability severity/mid Marks a bug as having a noticeable impact but with a known workaround labels Jul 22, 2021

korthout self-assigned this Jul 22, 2021

korthout removed their assignment Jul 22, 2021

korthout changed the title ~~Active sequence flow count lower then zero~~ Unable to cancel process instance at receive task with incident Jul 22, 2021

npepinpe added this to Planned in Zeebe Jul 26, 2021

npepinpe removed this from Planned in Zeebe Oct 14, 2021

npepinpe added area/ux Marks an issue as related to improving the user experience and removed Impact: Usability labels Apr 11, 2022

Zelldon added the team/process-automation label Jun 3, 2022

menski removed the team/process-automation label Jul 11, 2022

Zelldon added the component/engine label Dec 29, 2022

romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to cancel process instance at receive task with incident #7544

Unable to cancel process instance at receive task with incident #7544

korthout commented Jul 22, 2021 •

edited by Zelldon

korthout commented Jul 22, 2021 •

edited by Zelldon

korthout commented Jul 22, 2021

korthout commented Jul 22, 2021 •

edited by Zelldon

korthout commented Jul 22, 2021 •

edited by Zelldon

npepinpe commented Jul 26, 2021 •

edited by Zelldon

korthout commented Jul 26, 2021 •

edited by Zelldon

remcowesterhoud commented Apr 5, 2023 •

edited by Zelldon

Unable to cancel process instance at receive task with incident #7544

Unable to cancel process instance at receive task with incident #7544

Comments

korthout commented Jul 22, 2021 • edited by Zelldon

korthout commented Jul 22, 2021 • edited by Zelldon

korthout commented Jul 22, 2021

korthout commented Jul 22, 2021 • edited by Zelldon

Solutions

korthout commented Jul 22, 2021 • edited by Zelldon

npepinpe commented Jul 26, 2021 • edited by Zelldon

korthout commented Jul 26, 2021 • edited by Zelldon

remcowesterhoud commented Apr 5, 2023 • edited by Zelldon

korthout commented Jul 22, 2021 •

edited by Zelldon

korthout commented Jul 22, 2021 •

edited by Zelldon

korthout commented Jul 22, 2021 •

edited by Zelldon

korthout commented Jul 22, 2021 •

edited by Zelldon

npepinpe commented Jul 26, 2021 •

edited by Zelldon

korthout commented Jul 26, 2021 •

edited by Zelldon

remcowesterhoud commented Apr 5, 2023 •

edited by Zelldon