Exception loop in Engine during handling of unexpected error #10199

remcowesterhoud · 2022-08-26T12:13:35Z

Describe the bug

In #9946 we encountered the problem of an exception loop when we deployed a process that was too large. The cause for this that the processor did not handle this exception explicitly using the tryHandleError method. As a result the engine considered this an UNEXPECTED_ERROR. When this happens the engine will try to handle this unexpected error:

  private void handleUnexpectedError(
      final Throwable processingException, final TypedRecord record) {
    final String errorMessage =
        String.format(
            ERROR_MESSAGE_PROCESSING_EXCEPTION_OCCURRED, record, processingException.getMessage());
    LOG.error(errorMessage, processingException);

    writers.rejection().appendRejection(record, RejectionType.PROCESSING_ERROR, errorMessage);
    writers
        .response()
        .writeRejectionOnCommand(record, RejectionType.PROCESSING_ERROR, errorMessage);
    errorRecord.initErrorRecord(processingException, record.getPosition());

    if (DbBlackListState.shouldBeBlacklisted(record.getIntent())) {
      if (record.getValue() instanceof ProcessInstanceRelated) {
        final long processInstanceKey =
            ((ProcessInstanceRelated) record.getValue()).getProcessInstanceKey();
        errorRecord.setProcessInstanceKey(processInstanceKey);
      }

      writers.state().appendFollowUpEvent(record.getKey(), ErrorIntent.CREATED, errorRecord);
    }
  }

At the beginning of this method we will create an error message and write the rejection. During the appending of this rejection we check the size of this record (https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/api/records/RecordBatch.java#L58). If it's also too large we will throw an exception.
Since this exception is also considered to be an UNEXPECTED_ERROR the Engine tries to handle this unexpected error again. This will try to append the record again and in turn thrown another exception, starting the loop again.

This loop will cause the partition to reach an UNHEALTHY state.

To Reproduce

As long as #10193 is not merged:

Deploy a very large process
The exception loop should occur

If #10193 is merged:

Change the DeploymentCreateProcessor tryHandleError method to always return an UNEXPECTED_ERROR
Deploy a very large process
The exception loop should occur

Expected behavior

The exception loop should not occur when the Engine fails to append a rejection.

Log/Stacktrace

Full Stacktrace

...with size: 57419999 this would exceed the maximum batch size. [ currentBatchEntryCount: 0, currentBatchSize: 0]
	at io.camunda.zeebe.engine.api.records.RecordBatch.appendRecord(RecordBatch.java:67) ~[classes/:?]
	at io.camunda.zeebe.streamprocessor.DirectProcessingResultBuilder.appendRecordReturnEither(DirectProcessingResultBuilder.java:76) ~[classes/:?]
	at io.camunda.zeebe.engine.api.ProcessingResultBuilder.appendRecord(ProcessingResultBuilder.java:36) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.ResultBuilderBackedRejectionWriter.appendRejection(ResultBuilderBackedRejectionWriter.java:31) ~[classes/:?]
	at io.camunda.zeebe.engine.Engine.handleUnexpectedError(Engine.java:174) ~[classes/:?]
	at io.camunda.zeebe.engine.Engine.onProcessingError(Engine.java:161) ~[classes/:?]
	at io.camunda.zeebe.streamprocessor.ProcessingStateMachine.lambda$errorHandlingInTransaction$7(ProcessingStateMachine.java:336) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[classes/:?]
	at io.camunda.zeebe.streamprocessor.ProcessingStateMachine.errorHandlingInTransaction(ProcessingStateMachine.java:324) ~[classes/:?]
	at io.camunda.zeebe.streamprocessor.ProcessingStateMachine.lambda$onError$6(ProcessingStateMachine.java:313) ~[classes/:?]
	at io.camunda.zeebe.scheduler.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:28) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:94) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:45) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:119) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:106) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:87) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:198) ~[classes/:?]

Environment:

OS: Cloud
Zeebe Version: Starting from 8.1.0-alpha4

The text was updated successfully, but these errors were encountered:

remcowesterhoud · 2022-08-26T12:17:53Z

Possible solution
In the appendRejection method we could make use of the appendRecordReturnEither method. By doing this we can see whether or not the append of the record was successful or not. If it wasn't we can handle this here. For example by appending a different slimmed down record with some generic reason.

When we require a more detailed message we can do that by handling the exception in the processor like we did in #10193

remcowesterhoud · 2022-08-26T12:21:01Z

@Zelldon I'd be interested on your thoughts on how we could handle this.

Zelldon · 2022-08-26T12:36:55Z

Hey @remcowesterhoud yes so my recommendation would also be to verify whether the appending worked via the either and if not, handle that like for example add an empty record or a slimmed one. 👍

Maybe you could also check this already on the appending of the DeploymentRecord during processing and handle that case if it is too large 🤔 But I guess with the exception is right now a bit cleaner, since it will reset the transaction, result builder etc.

korthout · 2022-08-26T15:11:10Z

For completeness, this is the appendRecordReturnEither method.

10402: Fix error handling loop r=remcowesterhoud a=remcowesterhoud ## Description  When we exceed the record batch an exception is thrown. When this is done during the handling of an error this would result in the writing of a rejection to fail, in turn resulting in an exception-loop. By removing the rejection reason from the rejection in the event of an `ExceededBatchRecordSizeException` the rejection should always be able to be written. Unfortunately removing this reason makes it unclear what went wrong. To still be able to identify this a new rejection type has been added. ## Related issues  closes #10199 Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>

remcowesterhoud added the kind/bug Categorizes an issue or PR as a bug label Aug 26, 2022

remcowesterhoud mentioned this issue Aug 26, 2022

Reject too large deployment #10193

Merged

10 tasks

remcowesterhoud self-assigned this Sep 19, 2022

remcowesterhoud mentioned this issue Sep 19, 2022

Fix error handling loop #10402

Merged

10 tasks

zeebe-bors-camunda bot closed this as completed in 15b2a9d Sep 20, 2022

Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exception loop in Engine during handling of unexpected error #10199

Exception loop in Engine during handling of unexpected error #10199

remcowesterhoud commented Aug 26, 2022

remcowesterhoud commented Aug 26, 2022 •

edited

remcowesterhoud commented Aug 26, 2022 •

edited

Zelldon commented Aug 26, 2022 •

edited

korthout commented Aug 26, 2022

Exception loop in Engine during handling of unexpected error #10199

Exception loop in Engine during handling of unexpected error #10199

Comments

remcowesterhoud commented Aug 26, 2022

remcowesterhoud commented Aug 26, 2022 • edited

remcowesterhoud commented Aug 26, 2022 • edited

Zelldon commented Aug 26, 2022 • edited

korthout commented Aug 26, 2022

remcowesterhoud commented Aug 26, 2022 •

edited

remcowesterhoud commented Aug 26, 2022 •

edited

Zelldon commented Aug 26, 2022 •

edited