Distribute deployment in post commit tasks #10689
I ran the test a couple thousand times without failure 🤞
...io/camunda/zeebe/engine/processing/deployment/distribute/DeploymentDistributionBehavior.java
This change makes it so that we do not distribute a deployment during the processing of the CREATE command, but as a post-commit task after this command has been processed.

The reasoning behind this change is that the deployment distribution events/commands could be written in an unexpected order because of event buffering. Once we write an event, it gets buffered. When we notify a different partition about the deployment, that partition will write (and possibly process) the command immediately. As a result, the other partition could write its commands/events before the commands/events of the CREATE command have been written, making the ordering seem backwards.

E.g.:
1. We receive a Deployment.CREATE command.
2. During processing we write the DeploymentDistribution.DISTRIBUTING event. During processing we also send a message to the other partitions in order to distribute the deployment.
3. The DeploymentDistribution.DISTRIBUTING event is written to the buffer. Nothing is written to the log stream yet. The other partition receives the message and writes a Deployment.DISTRIBUTE command. At this point this partition is idle, so it immediately starts processing this command.
4. Here we run into a race condition. If the second partition sends the response back to the first partition before the first partition has finished processing the Deployment.CREATE command, the first partition will write the DeploymentDistribution.COMPLETE before it writes the buffered events.
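To make the race in steps 1 to 4 concrete, here is a minimal, single-threaded Java sketch. It is not Zeebe's implementation: the `Partition` class, its `log`, `buffer`, and `postCommitTasks` fields, and the `processCreate` method are hypothetical names chosen for illustration. The sketch also assumes the worst case, in which the other partition responds before partition 1 has flushed its buffered events.

```java
import java.util.ArrayList;
import java.util.List;

class PostCommitOrderingSketch {

  /** Hypothetical stand-in for a partition: a log plus a buffer of unflushed events. */
  static final class Partition {
    final List<String> log = new ArrayList<>();
    final List<String> buffer = new ArrayList<>();
    final List<Runnable> postCommitTasks = new ArrayList<>();

    /** Flush buffered events to the log first, then run the post-commit tasks. */
    void commit() {
      log.addAll(buffer);
      buffer.clear();
      postCommitTasks.forEach(Runnable::run);
      postCommitTasks.clear();
    }
  }

  /** Returns the record order on partition 1 after processing Deployment.CREATE. */
  static List<String> processCreate(boolean distributeAsPostCommitTask) {
    Partition partition1 = new Partition();
    Partition partition2 = new Partition();
    partition1.log.add("Deployment.CREATE"); // step 1: the command is on the log

    // Assumption: the other partition is idle and answers immediately,
    // so partition 1 writes COMPLETE as soon as it distributes.
    Runnable distribute = () -> {
      partition2.log.add("Deployment.DISTRIBUTE");
      partition1.log.add("DeploymentDistribution.COMPLETE");
    };

    // Steps 2 and 3: the event is only buffered, not yet on the log.
    partition1.buffer.add("DeploymentDistribution.DISTRIBUTING");

    if (distributeAsPostCommitTask) {
      partition1.postCommitTasks.add(distribute); // runs after the flush
    } else {
      distribute.run(); // step 4: runs before the flush, so the order flips
    }
    partition1.commit();
    return partition1.log;
  }

  public static void main(String[] args) {
    System.out.println("during processing: " + processCreate(false));
    System.out.println("post-commit task:  " + processCreate(true));
  }
}
```

In this sketch, distributing during processing puts DeploymentDistribution.COMPLETE ahead of DeploymentDistribution.DISTRIBUTING on partition 1, while deferring the distribution to a post-commit task keeps DISTRIBUTING first regardless of how quickly the other partition responds.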
715aafa to 33070a4
@remcowesterhoud Please document this new order in the PR description.
@korthout I have updated the description.
Thanks @remcowesterhoud 👏 Well researched!
❓ Would it be possible to write a test for this?
🤔 I was thinking about a test where we assert the exact ordering using the recording log exporter
@korthout You mean something like the flaky test that caused me to fix this? 😄
@remcowesterhoud Good point 😅
LGTM 🥇
bors merge
Build succeeded
Backport failed. Please cherry-pick the changes locally:
git fetch origin stable/8.0
git worktree add -d .worktree/backport-10689-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-10689-to-stable/8.0
git checkout -b backport-10689-to-stable/8.0
ancref=$(git merge-base fb62d658b963d4f948d858e77bc9cbc0513d8d83 33070a437e5beb2d69b02db1e3a1c50e072739ba)
git cherry-pick -x $ancref..33070a437e5beb2d69b02db1e3a1c50e072739ba
Successfully created backport PR #10711
10715: Backport 10689 to 8.0 r=korthout a=remcowesterhoud

## Description

Backport #10689 to stable 8.0

## Related issues

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Description
This change makes it so that we do not distribute a deployment during the processing of the CREATE command, but as a post-commit task after this command has been processed.

The reasoning behind this change is that the deployment distribution events/commands could be written in an unexpected order because of event buffering. Once we write an event, it gets buffered. When we notify a different partition about the deployment, that partition will write (and possibly process) the command immediately. As a result, the other partition could write its commands/events before the commands/events of the CREATE command have been written, making the ordering seem backwards.

E.g.:
1. We receive a Deployment.CREATE command.
2. During processing we write the DeploymentDistribution.DISTRIBUTING event. During processing we also send a message to the other partitions in order to distribute the deployment.
3. The DeploymentDistribution.DISTRIBUTING event is written to the buffer. Nothing is written to the log stream yet. The other partition receives the message and writes a Deployment.DISTRIBUTE command. At this point this partition is idle, so it immediately starts processing this command.
4. Here we run into a race condition. If the second partition sends the response back to the first partition before the first partition has finished processing the Deployment.CREATE command, the first partition will write the DeploymentDistribution.COMPLETE before it writes the buffered events.

With this change the order of commands/events should always be the same. As a result, this fixes the flaky test that is referenced as a related issue. The flow will be as follows:
On partition 1
On other partitions
Related issues
closes #9964