Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IllegalStateException: Expected to find a flow element with id 'migrated-variable-2251799826929599' in process with key '2251799825939186' but not found. #8626

Closed
pihme opened this issue Jan 21, 2022 · 20 comments
Labels
area/ux Marks an issue as related to improving the user experience kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround

Comments

@pihme
Copy link
Contributor

pihme commented Jan 21, 2022

Describe the bug
Observed in production log.
A) https://console.cloud.google.com/errors/detail/COfpgtnP4c-E0QE;service=zeebe;time=P7D?project=camunda-cloud-240911
Logs: https://drive.google.com/file/d/1DF-Kl0TXV5LKT6e2Nnk7Gk8YsdE926Cn/view?usp=sharing
and
B) https://console.cloud.google.com/errors/detail/CJmzuouvuYOk5wE;service=zeebe;time=P7D?project=camunda-cloud-240911
Logs: https://drive.google.com/file/d/1bgGiULD6bNzvuY0LapW5BQmbuDZYW2Lr/view?usp=sharing

Log/Stacktrace

Full Stacktrace

 java.lang.IllegalStateException: Expected to find a flow element with id 'migrated-variable-2251799826929599' in process with key '2251799825939186' but not found.
	at io.camunda.zeebe.engine.state.deployment.DbProcessState.getFlowElement(DbProcessState.java:294) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnEventSubscriptionBehavior.activateTriggeredEvent(BpmnEventSubscriptionBehavior.java:110) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.lambda$transitionToTerminated$7(CallActivityProcessor.java:144) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at java.util.Optional.ifPresentOrElse(Unknown Source) ~[?:?]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.transitionToTerminated(CallActivityProcessor.java:141) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.onChildTerminated(CallActivityProcessor.java:133) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.onChildTerminated(CallActivityProcessor.java:29) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnStateTransitionBehavior.lambda$terminateChildProcessInstance$12(BpmnStateTransitionBehavior.java:494) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at java.util.Optional.ifPresentOrElse(Unknown Source) ~[?:?]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnStateTransitionBehavior.terminateChildProcessInstance(BpmnStateTransitionBehavior.java:491) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.onTerminate(CallActivityProcessor.java:98) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.onTerminate(CallActivityProcessor.java:29) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processEvent(BpmnStreamProcessor.java:137) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.lambda$processRecord$0(BpmnStreamProcessor.java:110) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.Either$Right.ifRightOrLeft(Either.java:344) ~[zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processRecord(BpmnStreamProcessor.java:107) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.TypedRecordProcessor.processRecord(TypedRecordProcessor.java:54) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.lambda$processInTransaction$2(ProcessingStateMachine.java:263) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[zeebe-db-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processInTransaction(ProcessingStateMachine.java:253) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.processEvent(ProcessingStateMachine.java:221) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.tryToReadNextEvent(ProcessingStateMachine.java:192) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.readNextEvent(ProcessingStateMachine.java:184) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:95) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:192) [zeebe-util-1.3.0.jar:1.3.0] 

Environment:

  • OS:
  • Zeebe Version: 1.3.0
  • Configuration:
@pihme pihme added the kind/bug Categorizes an issue or PR as a bug label Jan 21, 2022
@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

Summary of affected entities so far:

Affected variables:

  • migrated-variable-2251799827268084
  • migrated-variable-2251799826931975
  • migrated-variable-2251799826931023
  • migrated-variable-2251799826929599
  • migrated-variable-2251799826930372
  • migrated-variable-2251799826928045

Affected processes:

  • 2251799824489065
  • 2251799826929998
  • 2251799825939186

Affected process Instances:

  • 2251799826928418
  • 2251799826927723
  • 2251799826929160
  • 2251799826930041
  • 2251799826930719
  • 2251799826931632
  • 2251799827268073

Stack traces:
A)

	at io.camunda.zeebe.engine.state.deployment.DbProcessState.getFlowElement(DbProcessState.java:294) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnEventSubscriptionBehavior.activateTriggeredEvent(BpmnEventSubscriptionBehavior.java:110) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.task.JobWorkerTaskProcessor.lambda$onTerminate$7(JobWorkerTaskProcessor.java:82) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.task.JobWorkerTaskProcessor.onTerminate(JobWorkerTaskProcessor.java:79) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.task.JobWorkerTaskProcessor.onTerminate(JobWorkerTaskProcessor.java:24) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processEvent(BpmnStreamProcessor.java:137) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.lambda$processRecord$0(BpmnStreamProcessor.java:110) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]

B)

	at io.camunda.zeebe.engine.state.deployment.DbProcessState.getFlowElement(DbProcessState.java:294) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnEventSubscriptionBehavior.activateTriggeredEvent(BpmnEventSubscriptionBehavior.java:110) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.lambda$transitionToTerminated$7(CallActivityProcessor.java:144) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.transitionToTerminated(CallActivityProcessor.java:141) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.onChildTerminated(CallActivityProcessor.java:133) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.onChildTerminated(CallActivityProcessor.java:29) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnStateTransitionBehavior.lambda$terminateChildProcessInstance$12(BpmnStateTransitionBehavior.java:494) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.BpmnStateTransitionBehavior.terminateChildProcessInstance(BpmnStateTransitionBehavior.java:491) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.onTerminate(CallActivityProcessor.java:98) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.container.CallActivityProcessor.onTerminate(CallActivityProcessor.java:29) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processEvent(BpmnStreamProcessor.java:137) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]
	at io.camunda.zeebe.engine.processing.bpmn.BpmnStreamProcessor.lambda$processRecord$0(BpmnStreamProcessor.java:110) ~[zeebe-workflow-engine-1.3.0.jar:1.3.0]

@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

Grafana looks normal. It's a cluster with a single node.

@pihme pihme added Impact: Data severity/high Marks a bug as having a noticeable impact on the user with no known workaround labels Jan 21, 2022
@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

Affected instances are blacklisted.

Other than that, Zeebe keeps running normally.

@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

One thing that is peculiar about the affected process instances:

  • all of them have raised an incident in the past. The incident is not resolved
  • All of them were started mid December 2021

What I don't understand is why processes with an active incident are showing any activity at all.

@korthout
Copy link
Member

korthout commented Jan 21, 2022

What I don't understand is why processes with an active incident are showing any activity at all.

Perhaps the process instance is being cancelled?

@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

Summary of findings:

Impact for currently known cases: minimal. It is a dev cluster, the processes have been started last year in December

Impact in general: unknown at this point

Timeline of events:

  • Mid December last year: process instances get started and raise incidents due to wrong variable binding
  • Recently: Update to 1.3.0
  • Yesterday: Bug occurred and process instances were subsequently blacklisted

@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

Perhaps the process instance is being cancelled?

How can I see that in the logs? Or anywhere else?

@korthout
Copy link
Member

korthout commented Jan 21, 2022

How can I see that in the logs? Or anywhere else?

Not something we log, but should be available on the log stream. 🤔 I'm not sure if operate would also already show that the instance is terminating entirely, but perhaps you could see it there as well.

EDIT: You can also see it in using zdb as the root process should be in the state ELEMENT_TERMINATING

@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

Confirmed that there were cancel requests (we were able to see the requests in Grafana)

@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

Timeline of events:

  • Mid December last year: process instances get started and raise incidents due to wrong variable binding
  • Recently: Update to 1.3.0
  • Yesterday: Customer tried to cancel the process instances, this led to this error and subsequently the instances were blacklisted

@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

After discussing with @Zelldon we decided to lower the severity.

  • So far this was only observed for one customer in one dev cluster
  • There is no indication that it is a general problem of data corruption
  • All observed instances have in common, that they occurred when trying to cancel a process instance that has an incident
  • It is more likely that there is a bug in this specific scenario
  • However, in this specific scenario - cancelling a process instance which has an incident - the outcome of a blacklisted process instance is tolerable from a customers point of view.

@pihme pihme added severity/mid Marks a bug as having a noticeable impact but with a known workaround and removed severity/high Marks a bug as having a noticeable impact on the user with no known workaround labels Jan 21, 2022
@korthout
Copy link
Member

Curious what the incident type is. We have an existing issue with cancelling instances with incidents of UNHANDLED_ERROR #8588

@pihme
Copy link
Contributor Author

pihme commented Jan 21, 2022

The problem were variables that couldn't be resolved. Looked a bit like a typo in the FEEL expression. But tbh, I didn't pay too much attention to this detail.

@npepinpe
Copy link
Member

npepinpe commented Jan 25, 2022

Leaving in the backlog for now - from my understanding, the impact is mostly on visibility as the instance is blacklisted, which is not really visible for users. However, the net effect is what they wanted - users cannot interact with this instance anymore. It's still confusing, but I think this stems mostly from the blacklisting concept and its lack of visibility.

Two things which might cause us to reprioritize:

  1. Blacklisting still leaves some data behind, and if this bug occurs often, then this can be an issue as the state size increases.
  2. Can jobs of blacklisted processes be activated anyway? They cannot.

@rusterholz
Copy link

rusterholz commented Jan 28, 2022

the impact is mostly on visibility as the instance is blacklisted, which is not really visible for users

As someone who uses Zeebe on its own (without Camunda Cloud), this was confusing to me. I searched on docs.camunda.io but couldn't find any references to this. Could you explain more about blacklisting an instance? Is it a feature of Zeebe, or of Camunda Cloud / Operate?

@Zelldon
Copy link
Member

Zelldon commented Jan 28, 2022

There is a open issue to document this camunda/camunda-docs#145 some details can be found here #1988

@rusterholz
Copy link

Thanks @Zelldon -- the details in #1988 are sparse but I think I get the idea. Hopefully the more formal documentation will be written soon.

In the meantime, is there any risk that an instance running on one of our self-hosted zeebe clusters might get blacklisted? It seems like we would have no way to know about that if it happened.

@Zelldon
Copy link
Member

Zelldon commented Jan 28, 2022

Normally you should see an error in the log. Furthermore there are metrics regarding blacklisted instances.

I agree it would be good if we have soon some documentation, but also be aware that there are plans to replace the blacklisting with a different concept.

@rusterholz
Copy link

Thanks!

@npepinpe npepinpe added area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) area/observability Marks an issue as observability related and removed Impact: Data area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) labels Apr 11, 2022
@npepinpe npepinpe added area/ux Marks an issue as related to improving the user experience and removed area/observability Marks an issue as observability related labels Apr 11, 2022
@Zelldon
Copy link
Member

Zelldon commented Dec 30, 2022

It doesn't to occur again, I will close it for now.

@Zelldon Zelldon closed this as completed Dec 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ux Marks an issue as related to improving the user experience kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround
Projects
None yet
Development

No branches or pull requests

6 participants