
Reject too large deployment #10193

Merged: 5 commits merged into main on Aug 30, 2022

Conversation

@remcowesterhoud (Contributor) commented Aug 25, 2022

Description

When a deployment was too large an unexpected exception would be thrown. This would cause a rejection to be written with an error message containing the very large resource. The writing of this rejection failed because it was too large as well, causing another rejection to be written. This kept happening in a loop.

By manually writing the rejection in the processor we can give a custom message that isn't too large, resolving the loop.
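As a rough sketch of the approach (the helper isBatchSizeExceeded, the rejection type, and the writer calls below are illustrative placeholders rather than the exact code in this PR):

@Override
public ProcessingError tryHandleError(
    final TypedRecord<DeploymentRecord> command, final Throwable error) {
  // Placeholder check for the "record too large for the batch" error raised while
  // appending the deployment's follow-up events.
  if (isBatchSizeExceeded(error)) {
    // Write a short, fixed rejection reason instead of one containing the huge resource,
    // so the rejection itself stays well below the maximum batch size and no loop starts.
    final String reason =
        "Expected to deploy resources, but the deployment exceeds the maximum batch size";
    rejectionWriter.appendRejection(command, RejectionType.INVALID_ARGUMENT, reason);
    responseWriter.writeRejectionOnCommand(command, RejectionType.INVALID_ARGUMENT, reason);
    return ProcessingError.EXPECTED_ERROR;
  }
  return ProcessingError.UNEXPECTED_ERROR;
}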

The bug first occurred in 8.1.0-alpha4, so a backport is not necessary.

Related issues

closes #9946

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; in case that fails, you need to create the backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

@remcowesterhoud (Contributor, Author):

@korthout Don't be alarmed about the large number of lines added, it's actually a relatively small fix. That's just a large process used in the test 😄

github-actions bot (Contributor) commented Aug 25, 2022

Test Results

854 files (+1), 854 suites (+1), 1h 40m 49s ⏱️ (+55s)
6 609 tests (+245): 6 599 ✔️ (+245), 10 💤 (±0), 0 (±0)
6 793 runs (+245): 6 783 ✔️ (+245), 10 💤 (±0), 0 (±0)

Results for commit 6fe4a77. Comparison against base commit 1480f7e.

♻️ This comment has been updated with latest results.

@Test
public void shouldRejectDeploymentIfNotValidDesignTimeAspect() throws Exception {
// given
final Path path = Paths.get(getClass().getResource("/processes/invalid_process.bpmn").toURI());

Check warning (Code scanning / CodeQL): Unsafe use of getResource. The idiom getClass().getResource() is unsafe for classes that may be extended.
public void shouldRejectDeploymentIfNotValidRuntimeAspect() throws Exception {
// given
final Path path =
Paths.get(getClass().getResource("/processes/invalid_process_condition.bpmn").toURI());

Check warning (Code scanning / CodeQL): Unsafe use of getResource. The idiom getClass().getResource() is unsafe for classes that may be extended.
@Test
public void shouldRejectDeploymentIfOneResourceIsNotValid() throws Exception {
// given
final Path path1 = Paths.get(getClass().getResource("/processes/invalid_process.bpmn").toURI());

Check warning (Code scanning / CodeQL): Unsafe use of getResource. The idiom getClass().getResource() is unsafe for classes that may be extended.
public void shouldRejectDeploymentIfOneResourceIsNotValid() throws Exception {
// given
final Path path1 = Paths.get(getClass().getResource("/processes/invalid_process.bpmn").toURI());
final Path path2 = Paths.get(getClass().getResource("/processes/collaboration.bpmn").toURI());

Check warning (Code scanning / CodeQL): Unsafe use of getResource. The idiom getClass().getResource() is unsafe for classes that may be extended.
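For reference, this warning is usually resolved by resolving the resource against a class literal (or the class loader) instead of getClass(), so that a subclass cannot change which resource is looked up. A minimal sketch, using CreateDeploymentTest as an example name:

// The class literal pins the lookup to this class, so an extending subclass
// cannot change which resource path gets resolved.
final Path path =
    Paths.get(
        CreateDeploymentTest.class
            .getResource("/processes/invalid_process.bpmn")
            .toURI());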
@korthout (Member) left a comment

Well done @remcowesterhoud 👍

👍 Thanks for the integration test! That's a good idea!

👍 I also really like the split of the commits, it is easy to review

I'm not immediately approving this but I also don't necessarily request changes. Instead, I'd like to discuss the main solution approach first. (see my 🤔 comment below).

Perhaps this is a problem very specific to deployments, but I'd like to understand how this works for other records that exceed the max message size. If it's solved there, why does it not work for deployments? If it isn't solved there, then we should probably look for a more general solution. What do you think, @remcowesterhoud? Do you agree?

Comment on lines 137 to 141
@Override
public ProcessingError tryHandleError(
final TypedRecord<DeploymentRecord> command, final Throwable error) {
if (error instanceof IllegalStateException
&& error.getMessage().contains("this would exceed the maximum batch size")) {
@korthout (Member):

🤔 Should we handle this as an expected error specifically for deployments? It can happen to any record that becomes too large. How do we deal with this for job batch activate with variables, or for the set variables command? We may need to consider a more general solution.

@Zelldon (Member) commented Aug 26, 2022

Just want to mention that there is a new feature fresh out of the oven, #10191, where you can determine whether the append to the RecordBatch was successful or not. We can discuss in more detail when I'm back, but in general you could think about whether you would like to handle this either as Remco did, via an exception, or directly during processing using the Either.

These new methods are the important ones for you: https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/api/ProcessingResultBuilder.java#L22-L62
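For context, a rough sketch of what the Either-based variant could look like; the method name appendRecordReturnEither and its parameters are assumptions here, the actual API in the linked ProcessingResultBuilder may differ:

// Assumed method name and parameters; see the linked ProcessingResultBuilder for the real API.
final Either<RuntimeException, ProcessingResultBuilder> appended =
    resultBuilder.appendRecordReturnEither(
        deploymentKey, RecordType.EVENT, DeploymentIntent.CREATED, deploymentRecord);

if (appended.isLeft()) {
  // The record did not fit into the batch; the processor can react here directly
  // (e.g. write a short rejection) instead of waiting for an exception to reach tryHandleError.
}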

@remcowesterhoud (Contributor, Author):

We should be aware of records that could be too large. I think for now this would only be a risk for deployments and batch activations as you said.
The job batch activate does check this explicitly here: https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/job/JobBatchCollector.java#L95.

I think the important thing is that when we write a command rejection, we make sure we don't get into an error loop again. The loop happens because we receive the "too large" error when writing the rejection as well. Since this is an unexpected error, we handle it by writing yet another rejection, causing the loop here:
https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/Engine.java#L167-L174

I still have to create a follow-up issue for this, I'll do that later today.

@remcowesterhoud (Contributor, Author):

Follow-up issue: #10199

@korthout (Member):

So, if I understand correctly @remcowesterhoud you suggest picking this up later and first patch the bug. I think that makes sense.

Personally, I'm not so happy that the DeploymentCreateProcessor is handling an error as expected that it didn't throw. But that's all good if this is a temporary solution and we aim to resolve #10199 at a later point.

Did I understand your intention correctly, @remcowesterhoud?

@remcowesterhoud (Contributor, Author):

@korthout Almost right. The only difference is that I don't see this as a temporary solution. I intend to keep this code here, next to the changes that will be made in #10199, which I will probably make next week.

I think the misunderstanding here is the idea that the processor will be handling an error that it hasn't thrown. The very first error that triggers the loop is thrown in this processor. The first error happens when the processor tries to append the deployment CREATED event here: https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/deployment/DeploymentCreateProcessor.java#L116.
After this the loop begins and all the consecutive errors are not thrown in this processor, but in the Engine.

In my opinion this is definitely an expected error for a deployment. We know users can upload large processes and we should expect it to happen. Handling it here allows us to give more meaningful feedback to the user than the generic message of #10199 will.

@korthout (Member):

The problem that I have with this is that the IllegalStateException is thrown by the RecordBatch, which we don't own. The processor now depends on this specific exception being thrown, but it may be changed by the other team. To make sure they don't change it, we need to make sure that this exception is part of the contract when we append records.

After discussing F2F, @saig0 @remcowesterhoud and I decided we want to:

  • change the exception thrown by RecordBatch into a more meaningful one
  • document that throwing this exception is part of the contract with the engine (so DPT doesn't change this);
  • document that this exception can be thrown as part of the contract of the appendX methods of the record writers (for visibility of the exception in the engine processors);
  • handle this new exception in the tryHandleError method of the processor (we can later expand this to a separate abstract handleExceededBatchRecordSize (or similar) method for further improved visibility in the processors).

All this means that the processor still doesn't throw the exception directly, but it's now part of the contract of appending records. Therefore, the processor is aware that this can happen and it should deal with it accordingly.

Note that we should also update our internal developer handbook on how to write record processors.
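A minimal sketch of what the first two points could look like on the RecordBatch side (the canAppendRecordOfLength check and the field names are illustrative, not the actual implementation):

// RecordBatch throws a dedicated exception, instead of a generic IllegalStateException,
// when the record does not fit. Throwing it on append is documented as part of the
// contract with the engine, so processors may rely on it in tryHandleError.
if (!canAppendRecordOfLength(recordLength)) {
  throw new ExceededBatchRecordSizeException(
      String.format(
          "Can't append record of length %d, this would exceed the maximum batch size of %d",
          recordLength, maxBatchSize));
}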

@Zelldon (Member):

Please take a look at #10191

You can use either the ResultBuilder method which returns an Either, or the one which throws an exception. If I remember correctly I added it as Javadoc; I can't check it right now since I'm on the road.

We can also check together tomorrow if you need

@remcowesterhoud (Contributor, Author):

@Zelldon The problem with using the Either is that we would have to explicitly check the append result for every append we do, since all of the appends could result in this problem. It's easier to have it throw the exception and check it once.

@Zelldon (Member):

Yes, I get that, I just wanted to show the option. Btw, I think in some cases it is more likely to fail (multi-instance, deployments, etc.), so maybe it makes sense to handle the Either here. Just an idea 🤷‍♂️

Commit messages added to this PR:

  • When a deployment was too large an unexpected exception would be thrown. This would cause a rejection to be written with an error message containing the very large resource. The writing of this rejection failed because it was too large as well, causing another rejection to be written. This kept happening in a loop. By manually writing the rejection in the processor we can give a custom message that isn't too large, resolving the loop.
  • Verify that a deployment that is too large gets rejected properly. A rejection must be written and the client must receive a rejection response.
  • The CreateDeploymentTest became a bit large. I have moved the rejections to their own test class.
  • Instead of a generic IllegalStateException the RecordBatch will now throw a more explicit ExceededBatchRecordSizeException. This way it's easier to catch the exception in the event something goes wrong.
@remcowesterhoud (Contributor, Author):

@korthout your review is still suggested, so I'll ping you like this. Please have another look at the changes I made (last 2 commits) 🙂

@korthout (Member) left a comment

Thanks @remcowesterhoud 👍 I like it!

Please have a look at my suggestions before merging

Comment on lines +102 to +105
public static class ExceededBatchRecordSizeException extends RuntimeException {
public ExceededBatchRecordSizeException(final String message) {
super(message);
}
@korthout (Member):

🔧 The message is unused but it is filled with useful data. Let's move this data into the exception and use it to enhance the rejection message.
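For illustration, one way this suggestion could look, with the relevant sizes carried as fields (the field and accessor names are hypothetical):

public static class ExceededBatchRecordSizeException extends RuntimeException {
  // Hypothetical fields: keep the sizes so a processor can build a more informative
  // rejection reason from them instead of discarding the exception message.
  private final int recordLength;
  private final int maxBatchSize;

  public ExceededBatchRecordSizeException(final int recordLength, final int maxBatchSize) {
    super(
        String.format(
            "Can't append record of length %d, this would exceed the maximum batch size of %d",
            recordLength, maxBatchSize));
    this.recordLength = recordLength;
    this.maxBatchSize = maxBatchSize;
  }

  public int getRecordLength() {
    return recordLength;
  }

  public int getMaxBatchSize() {
    return maxBatchSize;
  }
}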

@remcowesterhoud (Contributor, Author):

I figured we could add this when we needed it. I don't see the point in adding it in case we might need it in the future.

Explicitly mentions that this exception can be thrown when appending a record.
Also document that this exception is part of the contract of the engine.
@remcowesterhoud (Contributor, Author):

bors merge

@zeebe-bors-camunda bot (Contributor):

Build succeeded:

zeebe-bors-camunda bot merged commit 58e6bf0 into main on Aug 30, 2022
zeebe-bors-camunda bot deleted the 9946_reject_too_large_deployment branch on August 30, 2022 at 13:31
Successfully merging this pull request may close these issues.

Too big Deployment is no longer rejected