DueDateTimeChecker will block progress if many timers are due #9238

Closed
pihme opened this issue Apr 27, 2022 · 0 comments · Fixed by #9249
Assignees
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/high Marks a bug as having a noticeable impact on the user with no known workaround version:1.3.8 version:8.1.0-alpha2 version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0

Comments

@pihme
Contributor

pihme commented Apr 27, 2022

Describe the bug

If many timers are due to be triggered, `DueDateTimeChecker` will iterate over all of them. During this time, all progress is blocked for this partition.

@pihme pihme added the kind/bug Categorizes an issue or PR as a bug label Apr 27, 2022
@pihme pihme assigned pihme and saig0 Apr 27, 2022
@pihme pihme changed the title DueDateTimeChecker will block progress if many due timers DueDateTimeChecker will block progress if many timers are due Apr 27, 2022
@npepinpe npepinpe added area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) severity/high Marks a bug as having a noticeable impact on the user with no known workaround scope/broker Marks an issue or PR to appear in the broker section of the changelog team/process-automation labels Apr 27, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Apr 29, 2022
9237: refactor(engine): prevent instant rescheduling r=pihme a=pihme

## Description

Before this change, the delay calculated to reschedule a task could be negative or close to 0.
This led to the checker being rescheduled immediately, which is bad because it leaves no room
for other tasks to run.

With this change, a minimum delay (a floor) is applied when the task is rescheduled.
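
As a rough illustration of the idea, the delay until the next due date can be clamped to a minimum before rescheduling. This is only a sketch: `RescheduleDelay`, `untilDue`, and the 100 ms value of `MIN_DELAY` are hypothetical names and numbers chosen for this example, not taken from the Zeebe code base.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not the actual Zeebe implementation. */
final class RescheduleDelay {
  // Assumed floor; the value actually used by the change is not stated in this thread.
  static final Duration MIN_DELAY = Duration.ofMillis(100);

  /** Returns the time until the due date, but never less than MIN_DELAY. */
  static Duration untilDue(final long dueDateMillis, final long nowMillis) {
    final Duration raw = Duration.ofMillis(dueDateMillis - nowMillis);
    // A negative or tiny raw delay would reschedule the checker immediately,
    // so the floor is returned instead.
    return raw.compareTo(MIN_DELAY) < 0 ? MIN_DELAY : raw;
  }

  public static void main(final String[] args) {
    System.out.println(untilDue(1_000, 5_000)); // overdue timer: the floor applies
    System.out.println(untilDue(5_000, 1_000)); // future timer: the raw delay is kept
  }
}
```

With a floor in place, even a checker whose next timer is already overdue waits at least `MIN_DELAY`, which gives other tasks a chance to run in between.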

## Related issues

closes #9236 
preparation for #9238  

<!---
## Definition of Ready

* [X] I've reviewed my own code
* [X] I've written a clear changelist description
* [X] I've narrowly scoped my changes
* [X] I've separated structural from behavioural changes
-->

## Definition of Done
Code changes:
* [X] The changes are backwards compatible with previous versions
* [ ] If it fixes a bug then PRs are created to [backport](https://github.com/camunda/zeebe/compare/stable/0.24...main?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/1.3`) to the PR, in case that fails you need to create backports manually.

Testing:
* [ ] There are unit/integration tests that verify all acceptance criteria of the issue
* [ ] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The change has been verified by a QA run
* [ ] The impact of the changes is verified by a benchmark

Documentation:
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)
* [ ] If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our [review guidelines](https://github.com/camunda/zeebe/wiki/Pull-Requests-and-Code-Reviews#code-review-guidelines).


Co-authored-by: pihme <pihme@users.noreply.github.com>
zeebe-bors-camunda bot added a commit that referenced this issue Apr 29, 2022
9255: [Backport stable/1.3] refactor(engine): prevent instant rescheduling r=pihme a=github-actions[bot]

# Description
Backport of #9237 to `stable/1.3`.

relates to #9236 #9238

Co-authored-by: pihme <pihme@users.noreply.github.com>
zeebe-bors-camunda bot added a commit that referenced this issue Apr 29, 2022
9256: [Backport stable/8.0] refactor(engine): prevent instant rescheduling r=pihme a=github-actions[bot]

# Description
Backport of #9237 to `stable/8.0`.

relates to #9236 #9238

Co-authored-by: pihme <pihme@users.noreply.github.com>
zeebe-bors-camunda bot added a commit that referenced this issue May 2, 2022
9249: Yield control if too many timers due r=pihme a=pihme

## Description

Adds a mechanism for the `DueDateTimeChecker` to yield control after some time. This is to stop it from iterating over an unknown number of due timer events and blocking execution while doing so.

Overall, this change should work well in cases where there is a huge backlog of timers. This backlog would then be reduced bit by bit.
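
The yielding behavior described above could look roughly like the following sketch. It assumes a time budget per run and a queue of due dates sorted in ascending order; `YieldingTimerScan`, `triggerDueTimers`, and all parameter names are hypothetical and do not reflect the actual `DueDateTimeChecker` code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a time-boxed scan over due timers. */
final class YieldingTimerScan {

  /**
   * Triggers due timers, oldest first, until none are due or the time budget is used up.
   * Returns true if it stopped early, so the caller should reschedule another run.
   * With a budget greater than 0, at least one due timer is triggered per run.
   */
  static boolean triggerDueTimers(
      final Deque<Long> dueDatesAscending,
      final LongSupplier clockMillis,
      final long budgetMillis,
      final LongConsumer trigger) {
    final long start = clockMillis.getAsLong();
    while (!dueDatesAscending.isEmpty()
        && dueDatesAscending.peekFirst() <= clockMillis.getAsLong()) {
      if (clockMillis.getAsLong() - start >= budgetMillis) {
        return true; // budget exhausted: yield and leave the rest for the next run
      }
      trigger.accept(dueDatesAscending.pollFirst());
    }
    return false; // no due timers left
  }

  public static void main(final String[] args) {
    final Deque<Long> timers = new ArrayDeque<>();
    for (int i = 0; i < 1_000_000; i++) {
      timers.add(0L); // all timers are long overdue
    }
    final boolean yielded =
        triggerDueTimers(timers, System::currentTimeMillis, 1, dueDate -> {});
    System.out.println("yielded=" + yielded + ", remaining=" + timers.size());
  }
}
```

Each run handles only as many timers as fit into the budget, so a large backlog is worked off over several runs instead of in one long blocking iteration.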

The change is potentially bad for cases with a constant, high load in which many timers are created all the time. In this case, the change can lead to the number of due timers growing continuously, so that triggered timers fall further and further behind real time.
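
A toy model makes this tradeoff concrete. It assumes, purely for illustration, that a fixed number of timers become due between checker runs and that each run can trigger a fixed number before yielding. `BacklogModel` and its parameters are hypothetical and the numbers are not measurements.

```java
/** Hypothetical toy model of the due-timer backlog; not based on measurements. */
final class BacklogModel {

  /** Backlog after the given number of checker runs, under fixed per-run rates. */
  static long backlogAfter(
      final long initialBacklog,
      final long arrivalsPerRun,
      final long triggeredPerRun,
      final int runs) {
    long backlog = initialBacklog;
    for (int i = 0; i < runs; i++) {
      // New due timers arrive, then the run triggers as many as its budget allows.
      backlog = Math.max(0, backlog + arrivalsPerRun - triggeredPerRun);
    }
    return backlog;
  }

  public static void main(final String[] args) {
    // One-off backlog with no new timers: it shrinks run by run.
    System.out.println(backlogAfter(10_000, 0, 1_000, 10));
    // Sustained load above the per-run capacity: the backlog keeps growing.
    System.out.println(backlogAfter(0, 1_500, 1_000, 10));
  }
}
```

In this model the backlog only shrinks when fewer timers become due per run than a run can trigger, which matches the two cases described above.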

Overall, this tradeoff was deemed advantageous. At least it removes the danger that the iteration blocks execution for so long that the node is marked as unhealthy. Once that situation is reached, there is currently no practical way to recover.

Even before this point is reached, execution will be blocked for long stretches of time, and no progress can be made on that partition. So one faulty process can block all others from executing.

Both issues are addressed by this PR. With this PR it should always be possible to make some progress, however small. This would allow users to cancel or change any faulty process, or to reduce the load if needed.

Further work will be needed to figure out how to trigger timers without potentially falling further and further behind real time.

## Review Hints
This PR has duplicate commits from #9237 

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #9238



Co-authored-by: pihme <pihme@users.noreply.github.com>
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022