
Reject too large deployment #10193

Merged: 5 commits merged into main on Aug 30, 2022

Conversation

@remcowesterhoud (Contributor) commented Aug 25, 2022

Description

When a deployment was too large an unexpected exception would be thrown. This would cause a rejection to be written with an error message containing the very large resource. The writing of this rejection failed because it was too large as well, causing another rejection to be written. This kept happening in a loop.

By manually writing the rejection in the processor we can give a custom message that isn't too large, resolving the loop.
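As a rough sketch of the approach (the helper isBatchSizeExceeded, the rejection type, and the writer calls below are illustrative placeholders rather than the exact code in this PR):

@Override
public ProcessingError tryHandleError(
    final TypedRecord<DeploymentRecord> command, final Throwable error) {
  // Placeholder check for the "record too large for the batch" error raised while
  // appending the deployment's follow-up events.
  if (isBatchSizeExceeded(error)) {
    // Write a short, fixed rejection reason instead of one containing the huge resource,
    // so the rejection itself stays well below the maximum batch size and no loop starts.
    final String reason =
        "Expected to deploy resources, but the deployment exceeds the maximum batch size";
    rejectionWriter.appendRejection(command, RejectionType.INVALID_ARGUMENT, reason);
    responseWriter.writeRejectionOnCommand(command, RejectionType.INVALID_ARGUMENT, reason);
    return ProcessingError.EXPECTED_ERROR;
  }
  return ProcessingError.UNEXPECTED_ERROR;
}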

The bug first occurred in 8.1.0-alpha4, so a backport is not necessary.

Related issues

closes #9946

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; in case that fails, you need to create the backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

@remcowesterhoud (Contributor, Author):

@korthout Don't be alarmed about the large number of lines added, it's actually a relatively small fix. That's just a large process used in the test 😄

github-actions bot (Contributor) commented Aug 25, 2022

Test Results

854 files (+1), 854 suites (+1), 1h 40m 49s ⏱️ (+55s)
6 609 tests (+245): 6 599 ✔️ (+245), 10 💤 (±0), 0 (±0)
6 793 runs (+245): 6 783 ✔️ (+245), 10 💤 (±0), 0 (±0)

Results for commit 6fe4a77. Comparison against base commit 1480f7e.

♻️ This comment has been updated with latest results.

@Test
public void shouldRejectDeploymentIfNotValidDesignTimeAspect() throws Exception {
// given
final Path path = Paths.get(getClass().getResource("/processes/invalid_process.bpmn").toURI());

Check warning (Code scanning / CodeQL): Unsafe use of getResource. The idiom getClass().getResource() is unsafe for classes that may be extended.
public void shouldRejectDeploymentIfNotValidRuntimeAspect() throws Exception {
// given
final Path path =
Paths.get(getClass().getResource("/processes/invalid_process_condition.bpmn").toURI());

Check warning (Code scanning / CodeQL): Unsafe use of getResource. The idiom getClass().getResource() is unsafe for classes that may be extended.
@Test
public void shouldRejectDeploymentIfOneResourceIsNotValid() throws Exception {
// given
final Path path1 = Paths.get(getClass().getResource("/processes/invalid_process.bpmn").toURI());

Check warning (Code scanning / CodeQL): Unsafe use of getResource. The idiom getClass().getResource() is unsafe for classes that may be extended.
public void shouldRejectDeploymentIfOneResourceIsNotValid() throws Exception {
// given
final Path path1 = Paths.get(getClass().getResource("/processes/invalid_process.bpmn").toURI());
final Path path2 = Paths.get(getClass().getResource("/processes/collaboration.bpmn").toURI());

Check warning (Code scanning / CodeQL): Unsafe use of getResource. The idiom getClass().getResource() is unsafe for classes that may be extended.
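For reference, this warning is usually resolved by resolving the resource against a class literal (or the class loader) instead of getClass(), so that a subclass cannot change which resource is looked up. A minimal sketch, using CreateDeploymentTest as an example name:

// The class literal pins the lookup to this class, so an extending subclass
// cannot change which resource path gets resolved.
final Path path =
    Paths.get(
        CreateDeploymentTest.class
            .getResource("/processes/invalid_process.bpmn")
            .toURI());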
@korthout (Member) left a comment

Well done @remcowesterhoud 👍

👍 Thanks for the integration test! That's a good idea!

👍 I also really like the split of the commits, it is easy to review

I'm not immediately approving this but I also don't necessarily request changes. Instead, I'd like to discuss the main solution approach first. (see my 🤔 comment below).

Perhaps this is a problem very specific to deployments, but I'd like to understand how this works for other records that exceed the max message size. If it's solved there, why does it not work for deployments? If it isn't solved there, then we should probably look for a more general solution. What do you think, @remcowesterhoud? Do you agree?

Comment on lines 137 to 141
@Override
public ProcessingError tryHandleError(
final TypedRecord<DeploymentRecord> command, final Throwable error) {
if (error instanceof IllegalStateException
&& error.getMessage().contains("this would exceed the maximum batch size")) {
@korthout (Member):

🤔 Should we handle this as an expected error specifically for deployments? It can happen to any record that becomes too large. How do we deal with this for job batch activate with variables, or for the set variables command? We may need to consider a more general solution.

@Zelldon (Member) commented Aug 26, 2022

Just want to mention that there is a new feature fresh out of the oven, #10191, where you can determine whether the append to the RecordBatch was successful or not. We can discuss in more detail when I'm back, but in general you could think about whether you would like to handle this either as Remco did, via an exception, or directly during processing using the Either.

These new methods are the important ones for you: https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/api/ProcessingResultBuilder.java#L22-L62
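For context, a rough sketch of what the Either-based variant could look like; the method name appendRecordReturnEither and its parameters are assumptions here, the actual API in the linked ProcessingResultBuilder may differ:

// Assumed method name and parameters; see the linked ProcessingResultBuilder for the real API.
final Either<RuntimeException, ProcessingResultBuilder> appended =
    resultBuilder.appendRecordReturnEither(
        deploymentKey, RecordType.EVENT, DeploymentIntent.CREATED, deploymentRecord);

if (appended.isLeft()) {
  // The record did not fit into the batch; the processor can react here directly
  // (e.g. write a short rejection) instead of waiting for an exception to reach tryHandleError.
}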

@remcowesterhoud (Contributor, Author):

We should be aware of records that could be too large. I think for now this would only be a risk for deployments and batch activations as you said.
The job batch activate does check this explicitly here: https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/job/JobBatchCollector.java#L95.

I think the important thing is that when we write a command rejection, we make sure we don't get into an error loop again. The loop happens because we receive the "too large" error when writing the rejection as well. Since this is an unexpected error, we handle it by writing yet another rejection, causing the loop here:
https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/Engine.java#L167-L174

I still have to create a follow-up issue for this, I'll do that later today.

@remcowesterhoud (Contributor, Author):

Follow-up issue: #10199

@korthout (Member):

So, if I understand correctly @remcowesterhoud you suggest picking this up later and first patch the bug. I think that makes sense.

Personally, I'm not so happy that the DeploymentCreateProcessor is handling an error as expected that it didn't throw. But that's all good if this is a temporary solution and we aim to resolve #10199 at a later point.

Did I understand your intention correctly, @remcowesterhoud?

@remcowesterhoud (Contributor, Author):

@korthout Almost right. The only difference is that I don't see this as a temporary solution. I intend to keep this code here, next to the changes that will be made in #10199, which I will probably make next week.

I think the misunderstanding here is the idea that the processor will be handling an error that it hasn't thrown. The very first error that triggers the loop is thrown in this processor. The first error happens when the processor tries to append the deployment CREATED event here: https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/deployment/DeploymentCreateProcessor.java#L116.
After this the loop begins and all the consecutive errors are not thrown in this processor, but in the Engine.

In my opinion this is definitely an expected error for a deployment. We know users can upload large processes and we should expect it to happen. Handling it here allows us to give more meaningful feedback to the user than the generic message of #10199 will.

@korthout (Member):

The problem that I have with this is that the IllegalStateException is thrown by the RecordBatch, which we don't own. The processor now depends on this specific exception being thrown, but it may be changed by the other team. To make sure they don't change it, we need to make sure that this exception is part of the contract when we append records.

After discussing F2F, @saig0 @remcowesterhoud and I decided we want to:

  • change the exception thrown by RecordBatch into a more meaningful one
  • document that throwing this exception is part of the contract with the engine (so DPT doesn't change this);
  • document that this exception can be thrown as part of the contract of the appendX methods of the record writers (for visibility of the exception in the engine processors);
  • handle this new exception in the tryHandleError method of the processor (we can later expand this to a separate abstract handleExceededBatchRecordSize (or similar) method for further improved visibility in the processors).

All this means that the processor still doesn't throw the exception directly, but it's now part of the contract of appending records. Therefore, the processor is aware that this can happen and it should deal with it accordingly.

Note that we should also update our internal developer handbook on how to write record processors.
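A minimal sketch of what the first two points could look like on the RecordBatch side (the canAppendRecordOfLength check and the field names are illustrative, not the actual implementation):

// RecordBatch throws a dedicated exception, instead of a generic IllegalStateException,
// when the record does not fit. Throwing it on append is documented as part of the
// contract with the engine, so processors may rely on it in tryHandleError.
if (!canAppendRecordOfLength(recordLength)) {
  throw new ExceededBatchRecordSizeException(
      String.format(
          "Can't append record of length %d, this would exceed the maximum batch size of %d",
          recordLength, maxBatchSize));
}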

@Zelldon (Member):

Please take a look at #10191

You can use either the ResultBuilder method which returns an Either, or the one which throws an exception. If I remember correctly I added it as Javadoc; I can't check it right now since I'm on the road.

We can also check together tomorrow if you need

@remcowesterhoud (Contributor, Author):

@Zelldon The problem with using the Either is that we would have to explicitly check the append result for every append we do, since all of the appends could result in this problem. It's easier to have it throw the exception and check it once.

@Zelldon (Member):

Yes, I get that, I just wanted to show the option. Btw, I think in some cases it is more likely to fail (multi-instance, deployments, etc.), so maybe it makes sense to handle the Either here. Just an idea 🤷‍♂️

Commit messages added to this PR:

  • When a deployment was too large an unexpected exception would be thrown. This would cause a rejection to be written with an error message containing the very large resource. The writing of this rejection failed because it was too large as well, causing another rejection to be written. This kept happening in a loop. By manually writing the rejection in the processor we can give a custom message that isn't too large, resolving the loop.
  • Verify that a deployment that is too large gets rejected properly. A rejection must be written and the client must receive a rejection response.
  • The CreateDeploymentTest became a bit large. I have moved the rejections to their own test class.
  • Instead of a generic IllegalStateException the RecordBatch will now throw a more explicit ExceededBatchRecordSizeException. This way it's easier to catch the exception in the event something goes wrong.
@remcowesterhoud (Contributor, Author):

@korthout your review is still suggested, so I'll ping you like this. Please have another look at the changes I made (last 2 commits) 🙂

@korthout (Member) left a comment

Thanks @remcowesterhoud 👍 I like it!

Please have a look at my suggestions before merging

Comment on lines +102 to +105
public static class ExceededBatchRecordSizeException extends RuntimeException {
public ExceededBatchRecordSizeException(final String message) {
super(message);
}
@korthout (Member):

🔧 The message is unused but it is filled with useful data. Let's move this data into the exception and use it to enhance the rejection message.
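For illustration, one way this suggestion could look, with the relevant sizes carried as fields (the field and accessor names are hypothetical):

public static class ExceededBatchRecordSizeException extends RuntimeException {
  // Hypothetical fields: keep the sizes so a processor can build a more informative
  // rejection reason from them instead of discarding the exception message.
  private final int recordLength;
  private final int maxBatchSize;

  public ExceededBatchRecordSizeException(final int recordLength, final int maxBatchSize) {
    super(
        String.format(
            "Can't append record of length %d, this would exceed the maximum batch size of %d",
            recordLength, maxBatchSize));
    this.recordLength = recordLength;
    this.maxBatchSize = maxBatchSize;
  }

  public int getRecordLength() {
    return recordLength;
  }

  public int getMaxBatchSize() {
    return maxBatchSize;
  }
}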

@remcowesterhoud (Contributor, Author):

I figured we could add this when we needed it. I don't see the point in adding it in case we might need it in the future.

Explicitly mentions that this exception can be thrown when appending a record.
Also document that this exception is part of the contract of the engine.
@remcowesterhoud (Contributor, Author):

bors merge

@zeebe-bors-camunda bot (Contributor):

Build succeeded:

zeebe-bors-camunda bot merged commit 58e6bf0 into main on Aug 30, 2022
zeebe-bors-camunda bot deleted the 9946_reject_too_large_deployment branch on August 30, 2022 at 13:31
Successfully merging this pull request may close these issues.

Too big Deployment is no longer rejected