Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception loop in Engine during handling of unexpected error #10199

Closed
remcowesterhoud opened this issue Aug 26, 2022 · 4 comments · Fixed by #10402
Closed

Exception loop in Engine during handling of unexpected error #10199

remcowesterhoud opened this issue Aug 26, 2022 · 4 comments · Fixed by #10402
Assignees
Labels
kind/bug Categorizes an issue or PR as a bug version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0

Comments

@remcowesterhoud
Copy link
Contributor

Describe the bug

In #9946 we encountered the problem of an exception loop when we deployed a process that was too large. The cause for this that the processor did not handle this exception explicitly using the tryHandleError method. As a result the engine considered this an UNEXPECTED_ERROR. When this happens the engine will try to handle this unexpected error:

  private void handleUnexpectedError(
      final Throwable processingException, final TypedRecord record) {
    final String errorMessage =
        String.format(
            ERROR_MESSAGE_PROCESSING_EXCEPTION_OCCURRED, record, processingException.getMessage());
    LOG.error(errorMessage, processingException);

    writers.rejection().appendRejection(record, RejectionType.PROCESSING_ERROR, errorMessage);
    writers
        .response()
        .writeRejectionOnCommand(record, RejectionType.PROCESSING_ERROR, errorMessage);
    errorRecord.initErrorRecord(processingException, record.getPosition());

    if (DbBlackListState.shouldBeBlacklisted(record.getIntent())) {
      if (record.getValue() instanceof ProcessInstanceRelated) {
        final long processInstanceKey =
            ((ProcessInstanceRelated) record.getValue()).getProcessInstanceKey();
        errorRecord.setProcessInstanceKey(processInstanceKey);
      }

      writers.state().appendFollowUpEvent(record.getKey(), ErrorIntent.CREATED, errorRecord);
    }
  }

At the beginning of this method we will create an error message and write the rejection. During the appending of this rejection we check the size of this record (https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/api/records/RecordBatch.java#L58). If it's also too large we will throw an exception.
Since this exception is also considered to be an UNEXPECTED_ERROR the Engine tries to handle this unexpected error again. This will try to append the record again and in turn thrown another exception, starting the loop again.

This loop will cause the partition to reach an UNHEALTHY state.

To Reproduce

As long as #10193 is not merged:

  1. Deploy a very large process
  2. The exception loop should occur

If #10193 is merged:

  1. Change the DeploymentCreateProcessor tryHandleError method to always return an UNEXPECTED_ERROR
  2. Deploy a very large process
  3. The exception loop should occur

Expected behavior

The exception loop should not occur when the Engine fails to append a rejection.

Log/Stacktrace

Full Stacktrace

...with size: 57419999 this would exceed the maximum batch size. [ currentBatchEntryCount: 0, currentBatchSize: 0]
	at io.camunda.zeebe.engine.api.records.RecordBatch.appendRecord(RecordBatch.java:67) ~[classes/:?]
	at io.camunda.zeebe.streamprocessor.DirectProcessingResultBuilder.appendRecordReturnEither(DirectProcessingResultBuilder.java:76) ~[classes/:?]
	at io.camunda.zeebe.engine.api.ProcessingResultBuilder.appendRecord(ProcessingResultBuilder.java:36) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.ResultBuilderBackedRejectionWriter.appendRejection(ResultBuilderBackedRejectionWriter.java:31) ~[classes/:?]
	at io.camunda.zeebe.engine.Engine.handleUnexpectedError(Engine.java:174) ~[classes/:?]
	at io.camunda.zeebe.engine.Engine.onProcessingError(Engine.java:161) ~[classes/:?]
	at io.camunda.zeebe.streamprocessor.ProcessingStateMachine.lambda$errorHandlingInTransaction$7(ProcessingStateMachine.java:336) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[classes/:?]
	at io.camunda.zeebe.streamprocessor.ProcessingStateMachine.errorHandlingInTransaction(ProcessingStateMachine.java:324) ~[classes/:?]
	at io.camunda.zeebe.streamprocessor.ProcessingStateMachine.lambda$onError$6(ProcessingStateMachine.java:313) ~[classes/:?]
	at io.camunda.zeebe.scheduler.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:28) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:94) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:45) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:119) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:106) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:87) ~[classes/:?]
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:198) ~[classes/:?]

Environment:

  • OS: Cloud
  • Zeebe Version: Starting from 8.1.0-alpha4
@remcowesterhoud remcowesterhoud added the kind/bug Categorizes an issue or PR as a bug label Aug 26, 2022
@remcowesterhoud
Copy link
Contributor Author

remcowesterhoud commented Aug 26, 2022

Possible solution
In the appendRejection method we could make use of the appendRecordReturnEither method. By doing this we can see whether or not the append of the record was successful or not. If it wasn't we can handle this here. For example by appending a different slimmed down record with some generic reason.

When we require a more detailed message we can do that by handling the exception in the processor like we did in #10193

@remcowesterhoud
Copy link
Contributor Author

remcowesterhoud commented Aug 26, 2022

@Zelldon I'd be interested on your thoughts on how we could handle this.

@Zelldon
Copy link
Member

Zelldon commented Aug 26, 2022

Hey @remcowesterhoud yes so my recommendation would also be to verify whether the appending worked via the either and if not, handle that like for example add an empty record or a slimmed one. 👍

Maybe you could also check this already on the appending of the DeploymentRecord during processing and handle that case if it is too large 🤔 But I guess with the exception is right now a bit cleaner, since it will reset the transaction, result builder etc.

@korthout
Copy link
Member

For completeness, this is the appendRecordReturnEither method.

@remcowesterhoud remcowesterhoud self-assigned this Sep 19, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Sep 20, 2022
10402: Fix error handling loop r=remcowesterhoud a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->
When we exceed the record batch an exception is thrown. When this is done during the handling of an error this would result in the writing of a rejection to fail, in turn resulting in an exception-loop.

By removing the rejection reason from the rejection in the event of an `ExceededBatchRecordSizeException` the rejection should always be able to be written. Unfortunately removing this reason makes it unclear what went wrong. To still be able to identify this a new rejection type has been added.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #10199 



Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes an issue or PR as a bug version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants