refactor(engine): prevent instant rescheduling #9237

Merged (1 commit) on Apr 29, 2022

@pihme (Contributor) commented Apr 27, 2022

Description

Before this change, the delay calculated for rescheduling a task could be negative or close to 0.
This led to the checker being rescheduled immediately, which is bad because it leaves no room
for other tasks to run.

With this change, a lower bound is applied to the delay when the task is rescheduled.
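
As a rough illustration of the idea (not the actual diff of this PR), a lower bound on the delay might look like the sketch below. The class name `RescheduleDelay`, the method `delayUntil`, and the 10 ms value of `MIN_DELAY` are assumptions made for this example; the PR text does not state the real names or the bound that was chosen.

```java
import java.time.Duration;

final class RescheduleDelay {

  // Hypothetical minimum delay; the value actually used in the PR is not stated.
  static final Duration MIN_DELAY = Duration.ofMillis(10);

  // Returns the delay until the due date, but never less than MIN_DELAY.
  // An overdue or almost-due task therefore still leaves a gap in which
  // other tasks can run before the checker is executed again.
  static Duration delayUntil(final long dueDateMillis, final long nowMillis) {
    final Duration raw = Duration.ofMillis(dueDateMillis - nowMillis);
    return raw.compareTo(MIN_DELAY) < 0 ? MIN_DELAY : raw;
  }

  public static void main(final String[] args) {
    // Overdue by 500 ms: the raw delay is negative, so the bound applies.
    System.out.println(delayUntil(1_000L, 1_500L));
    // Due in 4 s: the raw delay is above the bound and is returned unchanged.
    System.out.println(delayUntil(5_000L, 1_000L));
  }
}
```

Clamping at the point where the delay is computed keeps the scheduling call itself unchanged; whether the PR does it there or elsewhere is not visible from this description.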

Related issues

closes #9236
preparation for #9238

Definition of Done

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug, then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR; if that fails, you need to create the backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support for a new BPMN element), then the Camunda modeling team should be informed so they can adjust the BPMN linting.

Please refer to our review guidelines.

@saig0 (Member) left a comment

@pihme good idea 👍

The changes look good, but I have one concern. Please have a look at my comment.

@pihme pihme requested a review from saig0 April 29, 2022 10:55
@saig0 (Member) left a comment

👍

@pihme pihme force-pushed the 9236-time-checker-positive-delay branch from 5f0896f to 0f4a6ab Compare April 29, 2022 11:53
@pihme (Contributor, Author) commented Apr 29, 2022

bors merge


@zeebe-bors-camunda zeebe-bors-camunda bot merged commit 890a33e into main Apr 29, 2022
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the 9236-time-checker-positive-delay branch April 29, 2022 12:24
@github-actions (Contributor) commented

Successfully created backport PR #9255 for stable/1.3.

@github-actions (Contributor) commented

Successfully created backport PR #9256 for stable/8.0.

zeebe-bors-camunda bot added a commit that referenced this pull request Apr 29, 2022
9255: [Backport stable/1.3] refactor(engine): prevent instant rescheduling r=pihme a=github-actions[bot]

# Description
Backport of #9237 to `stable/1.3`.

relates to #9236 #9238

Co-authored-by: pihme <pihme@users.noreply.github.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Apr 29, 2022
9256: [Backport stable/8.0] refactor(engine): prevent instant rescheduling r=pihme a=github-actions[bot]

# Description
Backport of #9237 to `stable/8.0`.

relates to #9236 #9238

Co-authored-by: pihme <pihme@users.noreply.github.com>
zeebe-bors-camunda bot added a commit that referenced this pull request May 2, 2022
9249: Yield control if too many timers due r=pihme a=pihme

## Description

Adds a mechanism for the `DueDateTimeChecker` to yield control after some time. This is to stop it from iterating over an unknown number of due timer events and blocking execution while doing so.

Overall, this change should work well in cases where there is a huge backlog of timers. This backlog would then be reduced bit by bit.

The change is potentially bad for cases with a constant, high load in which many timers are created all the time. In such cases, this change can lead to the number of due timers growing continuously, with triggered timers falling further and further behind real time.

Overall, this tradeoff was deemed advantageous. At the least, it removes the danger that the iteration blocks execution for so long that the node is marked as unhealthy. Once that situation is reached, there is currently no practical way to recover.

Even before this point is reached, execution will be blocked for long stretches of time, and no progress can be made on that partition. So one faulty process can block all others from executing.

Both issues are addressed by this PR. With this PR, it should always be possible to make some progress, albeit small. This would allow users to cancel or change any faulty process, or to reduce the load if needed.

Further work will be needed to figure out how to trigger timers without potentially falling further and further behind real time.
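
To make the yielding behaviour concrete, here is a minimal sketch of a time-budgeted loop over due timers. It is not the code of `DueDateTimeChecker`; the class `YieldingTimerSketch`, the method `triggerDueTimers`, the representation of timers as due dates in a queue, and the injected clock are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.LongSupplier;

final class YieldingTimerSketch {

  // Triggers due timers from a queue ordered by due date and returns the ones
  // triggered in this run. The loop stops early once the time budget is used
  // up, leaving the remaining due timers in the queue for a later run.
  static List<Long> triggerDueTimers(
      final Queue<Long> timersByDueDate,
      final LongSupplier clockMillis,
      final long budgetMillis) {
    final long start = clockMillis.getAsLong();
    final List<Long> triggered = new ArrayList<>();

    while (!timersByDueDate.isEmpty()) {
      final long now = clockMillis.getAsLong();
      if (timersByDueDate.peek() > now) {
        break; // the next timer is not due yet
      }
      if (now - start >= budgetMillis) {
        break; // budget exhausted: yield control so other work can run
      }
      triggered.add(timersByDueDate.poll());
    }
    return triggered;
  }
}
```

A caller would reschedule the check whenever the queue still holds due timers after a run. With a large backlog, each run then removes a bounded slice; under sustained high load, the queue can still grow faster than it is drained, which is the tradeoff described above.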

## Review Hints
This PR contains commits duplicated from #9237.

## Related issues

closes #9238



Co-authored-by: pihme <pihme@users.noreply.github.com>

Successfully merging this pull request may close these issues.

DueDateTimeChecker may be scheduled with a negative delay