Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to cancel process instance at receive task with incident #7544

Open
korthout opened this issue Jul 22, 2021 · 7 comments
Open

Unable to cancel process instance at receive task with incident #7544

korthout opened this issue Jul 22, 2021 · 7 comments
Labels
area/ux Marks an issue as related to improving the user experience component/engine component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/mid Marks a bug as having a noticeable impact but with a known workaround

Comments

@korthout
Copy link
Member

korthout commented Jul 22, 2021

Describe the bug

Engine tried to decrement the active sequence flows of a flow scope instance, but failed because it would drop below zero. This resulted in an Error event and the process instance being banned.

This happened on Camunda Cloud and was reported as an error.

Looking at the stacktrace it looks like a receive task is interrupted by an event and is being terminated. As part of the Terminate_Element processing of the receive task, the triggered event is activated by writing the Element_Activating event for it. This event is then applied to the state, which leads to it trying to decrement the number of active sequence flows. However, that shouldn't have happened. The number of active sequence flows of the flow scope should only be lowered when a sequence flow was actually taken before activating an event element. Normally the engine is able to determine whether or not the active sequence flows should be decremented, but this code could use some love anyways IMO.

Might be related to #6778

To Reproduce

Not yet sure, depends on what type of event element is triggered that interrupted the receive task. I've asked for a data snapshot to inspect the state of this process instance.

Expected behavior

Don't decrementActiveSequenceFlows when activating an event element that by triggering interrupted another element.

Log/Stacktrace

Full Stacktrace

java.lang.IllegalStateException: Not expected to have an active sequence flow count lower then zero!
	at io.camunda.zeebe.engine.state.instance.ElementInstance.decrementActiveSequenceFlows(ElementInstance.java:170) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementActivatingApplier.decrementActiveSequenceFlow(ProcessInstanceElementActivatingApplier.java:159) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementActivatingApplier.applyState(ProcessInstanceElementActivatingApplier.java:86) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementActivatingApplier.applyState(ProcessInstanceElementActivatingApplier.java:27) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.state.appliers.EventAppliers.applyState(EventAppliers.java:248) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.EventApplyingStateWriter.appendFollowUpEvent(EventApplyingStateWriter.java:36) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.common.EventTriggerBehavior.activateTriggeredEvent(EventTriggerBehavior.java:319) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnEventSubscriptionBehavior.activateTriggeredEvent(BpmnEventSubscriptionBehavior.java:140) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.task.ReceiveTaskProcessor.lambda$onTerminate$6(ReceiveTaskProcessor.java:75) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at java.util.Optional.ifPresentOrElse(Unknown Source) ~[?:?]
	at io.camunda.zeebe.engine.processing.bpmn.task.ReceiveTaskProcessor.onTerminate(ReceiveTaskProcessor.java:72) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.task.ReceiveTaskProcessor.onTerminate(ReceiveTaskProcessor.java:19) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processEvent(BpmnStreamProcessor.java:211) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.lambda$processRecord$1(BpmnStreamProcessor.java:133) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.Either$Right.ifRightOrLeft(Either.java:238) ~[zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processRecord(BpmnStreamProcessor.java:130) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.TypedRecordProcessor.processRecord(TypedRecordProcessor.java:54) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.lambda$processInTransaction$3(ProcessingStateMachine.java:295) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[zeebe-db-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processInTransaction(ProcessingStateMachine.java:285) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processEvent(ProcessingStateMachine.java:253) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.tryToReadNextEvent(ProcessingStateMachine.java:219) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.readNextEvent(ProcessingStateMachine.java:210) ~[zeebe-workflow-engine-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-1.1.0.jar:1.1.0]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:191) [zeebe-util-1.1.0.jar:1.1.0]

Environment:

!what 83f60830-d2f8-4c44-84a1-6e8a6723c89b
found cluster <redacted>
Channel: Alpha
Generation: Zeebe 1.1 with Optimize Beta
ClusterPlan: Production - S v10 with Worker and Optimize
DataCenter: Europe West
Status: Healthy
- Zeebe: Healthy
- Elasticsearch: Healthy
- Operate: Healthy
- Tasklist: Healthy
- Optimize: Healthy
owned by <redacted>’s Organization
SalesPlan: Trial
Expired: false (Wed, 18 Aug 2021 12:44:38 GMT)
@korthout korthout added kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog Impact: Usability severity/mid Marks a bug as having a noticeable impact but with a known workaround labels Jul 22, 2021
@korthout
Copy link
Member Author

korthout commented Jul 22, 2021

This may or may not be a regression introduced with the event sourcing model. Since it bans a process instance, I've given it Severity: Mid for now

@korthout
Copy link
Member Author

We can improve the error handling of event applying by logging metadata about the applied event that caused the error. Taht way we wouldn't have to dig too deep into the data in some cases like this one.

@korthout
Copy link
Member Author

korthout commented Jul 22, 2021

In the meantime, I've continued to investigate the issue in more detail. The process is rather simple (no boundary events, or event sub processes). The receive task is a direct child of the process.

I have also confirmed the following current state (with the banned instance):

  • the process instance has 0 active sequence flows, again confirms that we should not decrement here
  • the process instance is TERMINATING, so I believe the process instance was cancelled
  • the receive task is in state COMPLETING, so I believe a message was correlated to it, but the output mapping failed, leading to an incident
  • I don't see any incidents in the state, but the process instance termination would already have removed that incident

I think what we're seeing here is the event trigger (message correlation) still exists in the state for the receive task when it is terminated (by processing the TERMINATE_ELEMENT command that was written from the termination of the process instance). However, the receive task considers this event trigger at termination to mean that it was interrupted by an event that needs to be activated.

Solutions

I see 2 potential solutions:

  • termination of an element should also remove any event triggers from children before trying to terminate them
  • receive task can check whether it's flowscope is terminating before assuming that an event trigger means that it should activate the triggered event.

@korthout korthout self-assigned this Jul 22, 2021
@korthout
Copy link
Member Author

korthout commented Jul 22, 2021

Here's a failing test case for it. It creates the Error record and bans the instance as described:

package io.camunda.zeebe.engine.processing.bpmn.activity;

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.tuple;

import io.camunda.zeebe.engine.util.EngineRule;
import io.camunda.zeebe.model.bpmn.Bpmn;
import io.camunda.zeebe.model.bpmn.BpmnModelInstance;
import io.camunda.zeebe.model.bpmn.builder.ReceiveTaskBuilder;
import io.camunda.zeebe.protocol.record.Assertions;
import io.camunda.zeebe.protocol.record.Record;
import io.camunda.zeebe.protocol.record.RecordType;
import io.camunda.zeebe.protocol.record.intent.IncidentIntent;
import io.camunda.zeebe.protocol.record.intent.MessageSubscriptionIntent;
import io.camunda.zeebe.protocol.record.intent.ProcessInstanceIntent;
import io.camunda.zeebe.protocol.record.value.BpmnElementType;
import io.camunda.zeebe.test.util.record.RecordingExporter;
import io.camunda.zeebe.test.util.record.RecordingExporterTestWatcher;
import java.util.function.Consumer;
import org.junit.ClassRule;
import org.junit.Rule;
import org.junit.Test;

public final class ReceiveTaskTest {

  @ClassRule public static final EngineRule ENGINE = EngineRule.singlePartition();

  @Rule
  public final RecordingExporterTestWatcher recordingExporterTestWatcher =
      new RecordingExporterTestWatcher();

  @Test
  public void shouldTerminateReceiveTaskAfterProcessCancelled() {
    // given
    ENGINE
        .deployment()
        .withXmlResource(
            Bpmn.createExecutableProcess("process")
                .startEvent()
                .receiveTask(
                    "task", t -> t.message(m -> m.name("foo").zeebeCorrelationKeyExpression("bar")))
                .zeebeOutputExpression("output_item", "output")
                .endEvent()
                .done())
        .deploy();

    final long processInstanceKey =
        ENGINE.processInstance().ofBpmnProcessId("process").withVariable("bar", "baz").create();
    RecordingExporter.messageSubscriptionRecords(MessageSubscriptionIntent.CREATED)
        .withProcessInstanceKey(processInstanceKey)
        .await();

    ENGINE.message().withName("foo").withCorrelationKey("baz").publish();
    RecordingExporter.incidentRecords(IncidentIntent.CREATED)
        .withProcessInstanceKey(processInstanceKey)
        .await();

    // when
    ENGINE.processInstance().withInstanceKey(processInstanceKey).cancel();
    
    // then
    RecordingExporter.processInstanceRecords(ProcessInstanceIntent.ELEMENT_COMPLETED)
        .withProcessInstanceKey(processInstanceKey)
        .limitToProcessInstanceCompleted()
        .await();
  }
}

@korthout korthout removed their assignment Jul 22, 2021
@korthout korthout changed the title Active sequence flow count lower then zero Unable to cancel process instance at receive task with incident Jul 22, 2021
@npepinpe
Copy link
Member

npepinpe commented Jul 26, 2021

Is there currently a workaround? What is the exact impact right now, just that the instance is banned? From the user's/client's point of view, what is happening?

@npepinpe npepinpe added this to Planned in Zeebe Jul 26, 2021
@korthout
Copy link
Member Author

korthout commented Jul 26, 2021

@npepinpe For the user, that wanted to cancel the process instance it means that the process instance is still active, but banned. Both the process instance and the incident are still visible in Operate. If the cancel operation is send from Operate, then Operate shows the operation in progress spinner indefinitely.

There is no workaround when the situation has happened, because the instance is banned and no progress on it can be made.

There is no workaround to avoid the situation, except for resolving the incident. There are output mapping expressions possible, that lead to incidents that cannot be resolved. For example, = ? is unresolvable.

@npepinpe npepinpe removed this from Planned in Zeebe Oct 14, 2021
@npepinpe npepinpe added area/ux Marks an issue as related to improving the user experience and removed Impact: Usability labels Apr 11, 2022
@remcowesterhoud
Copy link
Contributor

remcowesterhoud commented Apr 5, 2023

I will attach this to the "Banning replacement" epic for now. But we should verify that this issue is still relevant as the mentioned method has been changed to only decrement if there are more than 0 active sequence flows:

  public void decrementActiveSequenceFlows() {
    if (getActiveSequenceFlows() > 0) {
      activeSequenceFlowsProp.decrement();
      // This should never happen, but we should fix this in a better way
      // https://github.com/camunda/zeebe/issues/9528
      //    if (decrement < 0) {
      //      throw new IllegalStateException(
      //          "Not expected to have an active sequence flow count lower then zero!");
      //    }
    }
  }

@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ux Marks an issue as related to improving the user experience component/engine component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/mid Marks a bug as having a noticeable impact but with a known workaround
Projects
None yet
Development

No branches or pull requests

6 participants