-
Notifications
You must be signed in to change notification settings - Fork 556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(engine): prevent instant rescheduling #9237
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pihme good idea 👍
The changes look good but I've one concern. Please have a look at my comment.
engine/src/main/java/io/camunda/zeebe/engine/processing/scheduled/DueDateChecker.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Before this change, the delay calculated to reschedule a task could be negative or close to 0. This lead to the checker being immediately rescheduled. This is bad, because it does not leave room for other tasks to run. With this change, a lower floor is applied when the task is rescheduled.
5f0896f
to
0f4a6ab
Compare
bors merge |
Successfully created backport PR #9255 for |
Successfully created backport PR #9256 for |
9249: Yield control if too many timers due r=pihme a=pihme ## Description Adds a mechanism for the `DueDateTimeChecker` to yield control after some time. This is to stop it from iterating over an unknown number of due timer events and blocking execution while doing so. Overall, this change should work well in cases where there is a huge backlog of timers. This backlog would then be reduced bit by bit. The change is potentially bad for cases in which there is a constant and high load with many timers being created all the time. In this case, the change of this PR can lead to due timers continuously growing and the timers triggered will fall more and more behind real time. Overall, this tradeoff was deemed advantageous. At least it removes that dangers that the iteration blocks the execution for so long that the node is marked as unhealthy. When this situation is reached there is currently no practical recovery possible. Even before this point is reached, execution will be blocked for long stretches of time, and no progress can be made on that partition. So one faulty process can block all others from executing. Both issues are addressed by this PR. With this PR it should be always possible to make some progress, albeit small. This would allow users to cancel or change any faulty process, or to reduce the load if needed. Further work will be needed to figure out a way how to trigger timers without potentially falling further and further behind real time. ## Review Hints This PR has duplicate commits from #9237 ## Related issues <!-- Which issues are closed by this PR or are related --> closes #9238 Co-authored-by: pihme <pihme@users.noreply.github.com>
Description
Before this change, the delay calculated to reschedule a task could be negative or close to 0.
This lead to the checker being immediately rescheduled. This is bad, because it does not leave room
for other tasks to run.
With this change, a lower floor is applied when the task is rescheduled.
Related issues
closes #9236
preparation for #9238
Definition of Done
Code changes:
backport stable/1.3
) to the PR, in case that fails you need to create backports manually.Testing:
Documentation:
Please refer to our review guidelines.