Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distribute deployment in post commit tasks #10689

Merged
merged 1 commit into from
Oct 13, 2022

Conversation

remcowesterhoud
Copy link
Contributor

@remcowesterhoud remcowesterhoud commented Oct 12, 2022

Description

This change makes it so that we do not distribute a deployment during the processing of the CREATE command, but as post commit task afters this command has been processed.

The reasoning behind this change is that the writing of the deployment distribution events/commands could happen in an unexpected order. This is because of the event buffering. Once we write an event this gets buffered. When we notify a different partition about the deployment it will write (and possibly process) the command immediately. This results in a situation where the different partition could write it's commands/events before we commands/events of the CREATE command have been written, making the ordering seem backwards.

E.g.:

  1. We receive a Deployment.CREATE command
  2. During processing we will write the DeploymentDistribution.DISTRIBUTING event. During processing we also send a message to the other partitions in order to distribute the deployment.
  3. The DeploymentDistribution.DISTRIBUTING event is written to the buffer. Nothing is written to the log stream yet. The other partition receives the message and writes a Deployment.DISTRIBUTE command. At this point this partition is idle so it will immediately start processing this command.
  4. Here we run into a race condition. If the second partition sends the response back to the first partition before the first partition has finished processing the Deployment.CREATE command it will write the DeploymentDistribution.COMPLETE before it writes the buffered events.

With this change the order of commands/events should always be the same. As a result this fixes the flaky test that is reference as a related issue. The flow will be as follows:

On partition 1

Command     Deployment.CREATE                      
Event       Process.CREATED                        
Event       Deployment.CREATED                     
Event       DeplyomentDistribution.DISTRIBUTING     (store distribution to partition N in state)
<Post commit tasks trigger at this point causing the other partitions to starts processing the deployment>
Command     DeploymentDistribution.COMPLETE         
Event       DeploymentDistribution.COMPLETED        
Event       Deployment.FULLY_DISTRIBUTEED           (when all partitions have triggered a complete command)

On other partitions

Command     Deployment.DISTRIBUTE       (causes the DeploymentDistribution.COMPLETE command to be written on partition 1)
Event       Deployment.DISTRIBUTED      

Related issues

closes #9964

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

@remcowesterhoud
Copy link
Contributor Author

I ran the test a couple thousand times without failure 🤞

This change makes it so that we do not distribute a deployment during the processing of the CREATE command, but as post commit task afters this command has been processed.

The reasoning behind this change is that the writing of the deployment distribution events/commands could happen in an unexpected order. This is because of the event buffering. Once we write an event this gets buffered. When we notify a different partition about the deployment it will write (and possibly process) the command immediately. This results in a situation where the different partition could write it's commands/events before we commands/events of the CREATE command have been written, making the ordering seem backwards.

E.g.:
1. We receive a Deployment.CREATE command

2. During processing we will write the DeploymentDistribution.DISTRIBUTING event. During processing we also send a message to the other partitions in order to distribute the deployment.

3. The DeploymentDistribution.DISTRIBUTING event is written to the buffer. Nothing is written to the log stream yet. The other partition receives the message and writes a Deployment.DISTRIBUTE command. At this point this partition is idle so it will immediately start processing this command.

4. Here we run into a race condition. If the second partition sends the response back to the first partition before the first partition has finished processing the Deployment.CREATE command it will write the DeploymentDistribution.COMPLETE before it writes the buffered events.
@github-actions
Copy link
Contributor

github-actions bot commented Oct 12, 2022

Test Results

   939 files  ±    0     939 suites  ±0   1h 49m 7s ⏱️ -31s
7 323 tests  - 465  7 317 ✔️  - 465  6 💤 ±0  0 ±0 
7 513 runs   - 465  7 505 ✔️  - 465  8 💤 ±0  0 ±0 

Results for commit 33070a4. ± Comparison against base commit fb62d65.

♻️ This comment has been updated with latest results.

@korthout
Copy link
Member

With this change the order of commands/events should always be the same. As a result this fixes the flaky test that is reference as a related issue.

@remcowesterhoud Please document this new order in the PR description.

@remcowesterhoud
Copy link
Contributor Author

@korthout I have updated the description

Copy link
Member

@korthout korthout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @remcowesterhoud 👏 Well researched!

❓ Would it be possible to write a test for this?

🤔 I was thinking about a test where we assert the exact ordering using the recording log exporter

@remcowesterhoud
Copy link
Contributor Author

@korthout You mean something like the flaky test that caused me to fix this? 😄

@korthout
Copy link
Member

@remcowesterhoud Good point 😅

Copy link
Member

@korthout korthout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🥇

bors merge

@zeebe-bors-camunda
Copy link
Contributor

Build succeeded:

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit aff630e into main Oct 13, 2022
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the 9964_flaky_deployment_test branch October 13, 2022 13:52
@backport-action
Copy link
Collaborator

Backport failed for stable/8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-10689-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-10689-to-stable/8.0
git checkout -b backport-10689-to-stable/8.0
ancref=$(git merge-base fb62d658b963d4f948d858e77bc9cbc0513d8d83 33070a437e5beb2d69b02db1e3a1c50e072739ba)
git cherry-pick -x $ancref..33070a437e5beb2d69b02db1e3a1c50e072739ba

@backport-action
Copy link
Collaborator

Successfully created backport PR #10711 for stable/8.1.

zeebe-bors-camunda bot added a commit that referenced this pull request Oct 13, 2022
10711: [Backport stable/8.1] Distribute deployment in post commit tasks r=remcowesterhoud a=backport-action

# Description
Backport of #10689 to `stable/8.1`.

relates to #9964

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
zeebe-bors-camunda bot added a commit that referenced this pull request Oct 13, 2022
10715: Backport 10689 to 8.0 r=korthout a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->
Backport #10689 to stable 8.0

## Related issues

<!-- Which issues are closed by this PR or are related -->




Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@korthout korthout added the version:8.1.2 Marks an issue as being completely or in parts released in 8.1.2 label Oct 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
version:8.1.2 Marks an issue as being completely or in parts released in 8.1.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky MultiPartitionDeploymentLifecycleTest.shouldTestLifecycle
4 participants