Partitions gets unhealthy if many timers are scheduled #8991

Open
Zelldon opened this issue Mar 25, 2022 · 22 comments
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) chaos-day Marks an issue or PR as related to an experiment performed during a chaos day component/engine component/zeebe Related to the Zeebe component/team gameday Marks an issue or PR as related to an experiment performed during a gameday kind/bug Categorizes an issue or PR as a bug severity/high Marks a bug as having a noticeable impact on the user with no known workaround version:8.1.0-alpha1 Marks an issue as being completely or in parts released in 8.1.0-alpha1 version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0
Milestone

Comments

@Zelldon
Member

Zelldon commented Mar 25, 2022

Describe the bug

We deployed the following process

[image: gamedayprocess BPMN diagram]

There were two bugs in the process:

  • a parallel sequence flow to the call activity, causing endless recursion
  • the timer expression =date and time(today(), time("16:00:00@Europe/Berlin")) - now() was not optimal: it caused a negative duration once that time had passed (see the sketch below)
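For illustration, a minimal java.time sketch of the same arithmetic (not the engine's FEEL evaluation) shows why the expression yields a negative duration once the target time has passed:

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class NegativeTimerDuration {
  public static void main(String[] args) {
    ZoneId berlin = ZoneId.of("Europe/Berlin");
    // "today at 16:00 Europe/Berlin", analogous to date and time(today(), time("16:00:00@Europe/Berlin"))
    ZonedDateTime target =
        ZonedDateTime.now(berlin).withHour(16).withMinute(0).withSecond(0).withNano(0);
    // duration = target - now(); after 16:00 this is negative, i.e. the timer is immediately due
    Duration delay = Duration.between(ZonedDateTime.now(berlin), target);
    System.out.println("delay = " + delay + ", negative = " + delay.isNegative());
  }
}
```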

When the model is deployed, it endlessly creates new process instances and fills the timer column family. When the specific time is reached, a huge number of timers has to be triggered, possibly millions depending on how long the recursion has been running and how many instances were initially started. The recursion continues and keeps adding more due timers to the timer column family.

The problem here is that processing is completely blocked, since the timer checker runs in the same actor. Furthermore, it produces an endless backlog that the processor is not able to catch up with. The processor becomes unhealthy and backpressure goes up to 100% since requests are not answered.

[screenshot: general]
[screenshot: processing]

To Reproduce

Deploy this model
gamedayprocess.bpmn.txt and start new instances. Watch out: you can throw your cluster away afterwards.

Expected behavior
As a user I'm not able to completely f** up my cluster, or at least there is some way to recover from this.

@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug Impact: Availability severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround chaos-day Marks an issue or PR as related to an experiment performed during a chaos day gameday Marks an issue or PR as related to an experiment performed during a gameday labels Mar 25, 2022
@npepinpe npepinpe added area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) and removed Impact: Availability labels Apr 11, 2022
@npepinpe
Member

Seems like we did not triage this, and this was marked as critical. Let's talk about it tomorrow - I guess the board switch caused me to miss this one.

@npepinpe npepinpe added this to the 8.1 milestone Apr 12, 2022
@pihme pihme self-assigned this Apr 26, 2022
@pihme
Contributor

pihme commented Apr 26, 2022

This also occurred in a simpler process. For example, in Roman's benchmark he wants to put a certain load on the system. He starts a number of process instances, e.g. 10/s. Each process waits at two points for 20 minutes. Because of the continuous creation of process instances, at any point in time there are a couple of timers ready to be triggered.

Depending on the load, performance is heavily degraded.

@pihme pihme mentioned this issue Apr 26, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Apr 26, 2022
9217: deps(maven): bump protobuf-bom from 3.20.0 to 3.20.1 r=npepinpe a=dependabot[bot]

9218: deps(maven): bump maven-antrun-plugin from 3.0.0 to 3.1.0 r=npepinpe a=dependabot[bot]

9226: deps(maven): bump spotless-maven-plugin from 2.22.1 to 2.22.2 r=npepinpe a=dependabot[bot]

9230: Renaming and javadocs r=remcowesterhoud a=pihme

## Description

- rename dueDateKey -> dueDate - because it is a due date, not a key
- rename findTimersWithDueDateBefore(...) -> processTimersWithDueDateBefore(...) - it's a more precise description of the method
- add JavaDoc comments to TimerInstanceState interface

## Related issues
related to #8991



Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: pihme <pihme@users.noreply.github.com>
@pihme
Contributor

pihme commented Apr 26, 2022

Roman's workflow looked like this:
[image: Roman's workflow]

I guess it can be simplified to:
[image: simplified workflow]

But maybe the other elements are necessary to add latency between the time a command is written to the log and the time the record is processed.

This is just to say that it can also happen without a specially crafted process from game days.

Looking at the code I can see the following:

  • Timer Records are added to the database when the timer element is activated
  • Timer records are removed after the TimerIntent.TRIGGER has been processed.

This also means that the DueDateTimeChecker might visit the same timer multiple times and create multiple trigger commands for it, because the DueDateTimeChecker does not alter the database and only writes to the log.

So my working hypothesis is that there is an escalating feedback loop:

  • The DueDateTimeChecker visits a couple of timers once and writes commands for these timers.
  • The DueDateTimeChecker is triggered again before these commands have been processed, so it revisits the same timers plus any new timers that have come due since the last run. It then writes duplicate commands for the old timers and new commands for the newly found ones.
  • Thus, if we cannot process all commands that the DueDateTimeChecker wrote in the last run before it runs again, we write more and more duplicate records to the log and revisit certain timers again and again.
  • Since the DB column family is chronological, the commands are also chronological. So in the second round we encounter the duplicate commands first and waste time on those, which means we make even less progress in removing the new timers from the database.
  • This can then escalate until we are totally blocked, because thousands of timers have not been removed yet, and thousands of timers lead to thousands of follow-up records.

It could also be that this happens more prominently after a broker resumes, because much time has passed since the last run. This could also explain Roman's observation that on some days he gets stable performance and on others not.

/cc @Zelldon @saig0 wdyt?

@saig0
Member

saig0 commented Apr 27, 2022

@pihme yes. That sounds correct.

But there is one more aspect: if the DueDateTimeChecker visits many timers in one check, it blocks the actor thread and other actor jobs can't be executed (e.g. sending heartbeats).

A solution could be to limit the number of timers for one check (e.g. only 100 timers at once).
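A minimal sketch of that idea, with hypothetical names (this is not the actual DueDateTimeChecker API): stop after a fixed batch size and reschedule the checker for the remaining timers instead of blocking the actor thread.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical shape of a bounded check run; TimerInstance and the scheduling hook are assumptions.
final class BoundedTimerCheck {
  static final int MAX_TIMERS_PER_CHECK = 100;

  static void triggerDueTimers(
      Iterator<TimerInstance> dueTimers,
      Consumer<TimerInstance> writeTriggerCommand,
      Consumer<Duration> rescheduleCheck) {
    int triggered = 0;
    while (dueTimers.hasNext() && triggered < MAX_TIMERS_PER_CHECK) {
      writeTriggerCommand.accept(dueTimers.next());
      triggered++;
    }
    if (dueTimers.hasNext()) {
      // more due timers remain: yield and come back soon instead of blocking the actor thread
      rescheduleCheck.accept(Duration.ofMillis(100)); // 100 ms is an arbitrary example delay
    }
  }

  record TimerInstance(long elementInstanceKey, long dueDate) {}
}
```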

@pihme
Contributor

pihme commented Apr 27, 2022

Not sure this would be a valid solution. My fear is that it will just keep falling further and further behind.

Right now the highest possible frequency of DueDateTimeChecker runs seems to be 10 Hz. If we only allow triggering 100 timers per invocation, that is a capacity of at most 1,000 triggers per second; if we create 200 timers every 100 ms (2,000 per second), we will just fall further and further behind. And this build-up would not be reflected in the backpressure mechanism.

My hope is that if we always trigger all due timers, this will cause delay, which will influence backpressure. The way the system is set up now, I see no other way than to let backpressure dial the sustained throughput down to the speed of the slowest link in the processing chain.

@Zelldon
Member Author

Zelldon commented Apr 27, 2022

Sorry for the late response. Interesting observation you made, @pihme, and it sounds right. In my case it was more like @saig0 described: the actor didn't stop writing due timers; I saw many, many written records per second but no processing at all.

I agree with @pihme's argument. Not sure whether it would really help to just limit it; that sounds like a band-aid 🩹 I guess the optimal solution would be a separate actor that writes to the log. This would at least help with not blocking processing. Still, it can happen that we write many, many timers at once. Maybe we could solve that by writing over the command API or something similar, so that backpressure is taken into account, like @pihme said. 🤷

@pihme
Contributor

pihme commented Apr 27, 2022

@Zelldon Could you comment on how actor tasks are prioritized, i.e. time-based tasks vs. ordinary tasks?

More precisely: suppose the DueDateTimeChecker is scheduled every 100 ms but each run takes more than 100 ms. What will happen? Will the actor always and only schedule the DueDateTimeChecker, or will other tasks make some progress before the DueDateTimeChecker is scheduled again?

@pihme
Contributor

pihme commented Apr 27, 2022

And the second question: what happens when you schedule a task with a negative delay? Because I think this can happen.

@Zelldon
Member Author

Zelldon commented Apr 27, 2022

Hey @pihme

As always, it depends. In general, timers and "ordinary" tasks end up being scheduled the same way. We have the fast lane and the queue for lower-priority work, so we have a kind of implicit priority. What I understand from the code is that if there are already jobs in the fast lane, they are executed first, and due timers are added to the fast lane as well. I guess the issue is even more problematic if we use runAtFixedRate and the execution takes longer than the rate.

To get to your example: if the DueDateTimeChecker were scheduled via runAtFixedRate with a rate of 100 ms and a run took longer than 100 ms, a new job would already have been added. Normally we use runDelayed, which means we schedule the next job only after we are done. AFAIK this is how it's done in the DueDateTimeChecker.
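As an analogy only (using a plain ScheduledExecutorService rather than Zeebe's actor scheduler), the difference between the two scheduling styles described above looks roughly like this:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulingStyles {
  static final ScheduledExecutorService SCHEDULER = Executors.newSingleThreadScheduledExecutor();

  public static void main(String[] args) {
    // Fixed-rate style: if a run takes longer than the 100 ms period, the next run is already
    // queued and starts right after the current one finishes -- the checker can monopolize the thread.
    // (The scheduler is intentionally never shut down in this sketch.)
    SCHEDULER.scheduleAtFixedRate(SchedulingStyles::checkTimers, 0, 100, TimeUnit.MILLISECONDS);

    // To try the "runDelayed" style instead, call checkTimersThenReschedule() once.
  }

  static void checkTimers() {
    // iterate due timers and write trigger commands (omitted in this sketch)
  }

  // "runDelayed" style: the next run is scheduled only after the current one completes,
  // so an overrunning check pushes the next check out instead of piling up pending runs.
  static void checkTimersThenReschedule() {
    checkTimers();
    SCHEDULER.schedule(SchedulingStyles::checkTimersThenReschedule, 100, TimeUnit.MILLISECONDS);
  }
}
```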

In the scenario described above, which I used in the gameday, the DueDateTimeChecker had thousands or millions of timers that were due at the same time. It blocked the actor thread completely because it iterated over the whole column family to write a timer trigger command. I guess that due to the subscription logic and the "priority" of timers being scheduled in the fast lane, processing did not happen for a long time. This is my current assumption.

Regarding the negative duration: this happened to me accidentally in the gameday. It means the timer is immediately due.

Does this help?

@pihme
Contributor

pihme commented Apr 27, 2022

Summary of Preliminary Findings

There are two problems we need to solve:

  • Huge backlog of accumulated timers
  • Steady state with high throughput of process instances

The first problem was the focus of the original description of the issue. This could for example occur when a company wants to send their customers a thank you email on Christmas. Throughout the year timers are collected, which all fire on Christmas day simultaneously.

Here we see the following issues:

The second problem was observed by Roman in benchmarks. Here it is just a steady load which leads to nondeterministic blocking of the processing thread.

Here we see the following issues:

  • Due timers may be visited several times by the DueDateTimeChecker. Each time it writes a new TriggerTimer command, and only when that command is processed is the timer removed. So we run into a difference between the speed at which timers are created and the speed at which timers are removed, and writing duplicate commands makes this worse for the reasons outlined above.
  • DueDateTimeChecker may be scheduled with a negative delay (#9236): the next run may be scheduled with a negative delay, which means it runs pretty much immediately (a clamping sketch follows below).
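The negative delay can arise when the delay is computed as the difference between the next due date and now while the due date already lies in the past. A clamping guard, sketched here with hypothetical names (not the actual fix in #9236), avoids it:

```java
import java.time.Duration;
import java.time.Instant;

final class NextCheckDelay {
  /**
   * Computes the delay until the next check. If the next due date is already in the past,
   * the raw difference is negative; clamping it avoids scheduling with a negative delay.
   */
  static Duration delayUntil(Instant nextDueDate, Instant now) {
    Duration raw = Duration.between(now, nextDueDate); // negative if nextDueDate < now
    return raw.isNegative() ? Duration.ZERO : raw;
  }
}
```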

@romansmirnov
Member

Just want to share my thoughts 😄

@Zelldon,

Maybe a comment on your #8991 (comment) 😄

I guess that due to the subscription logic and the "priority" of timers being scheduled in the fast lane, processing did not happen for a long time. This is my current assumption.

In the described scenario, with millions of timers being due simultaneously, I think the problem is not with how the actor schedules the jobs. In my opinion, this whileTrue-loop is the problem:

https://github.com/camunda/zeebe/blob/dd3e9fb653e04223e4fe64b88b3273231efa6c6e/engine/src/main/java/io/camunda/zeebe/engine/state/instance/DbTimerInstanceState.java#L90-L105

This means it remains in this loop as long as there is a next timer event that is due. With millions of timers due simultaneously, it executes that loop until it has visited all of them, and this may take some time and block processing on that partition in the meantime.
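Paraphrased, the loop shape being described is roughly the following (a simplified sketch with stand-in types, not the actual DbTimerInstanceState code):

```java
import java.util.Iterator;

// Simplified paraphrase of the described loop; the types here are stand-ins.
final class DueTimerLoopSketch {
  interface TimerInstance { long getDueDate(); }
  interface TimerVisitor { void visit(TimerInstance timer); } // in Zeebe this writes the TRIGGER command

  /** Visits every timer whose due date is <= timestamp; returns the due date of the next pending timer. */
  static long processTimersWithDueDateBefore(
      Iterator<TimerInstance> orderedByDueDate, long timestamp, TimerVisitor visitor) {
    long nextDueDate = -1L;
    while (orderedByDueDate.hasNext()) {
      TimerInstance timer = orderedByDueDate.next();
      if (timer.getDueDate() > timestamp) {
        nextDueDate = timer.getDueDate(); // first timer that is not yet due
        break;
      }
      visitor.visit(timer); // the timer itself is only removed later, when TRIGGER is processed
    }
    // With millions of due timers, this loop only returns once all of them were visited,
    // which blocks the actor thread (and therefore processing) for the whole time.
    return nextDueDate;
  }
}
```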

@pihme,

Here it is just a steady load which leads to nondeterministic blocking of the processing thread.

I don't think my scenario is much different from Chris's scenario in the gameday. The only difference is that there are more than 20k timer events on one partition, whose due dates are distributed over the next 20 minutes. So not all of them are due at the same time.

What can happen (and does happen) is that it remains in the whileTrue-loop for the next 20 minutes, because there is always a next timer event that is due. It may also happen that in between it reaches a moment when the next timer event is not due yet (but would become due at the very next moment). Then it schedules the next run accordingly, but because of how the actor schedules the jobs (as described by @Zelldon), it immediately goes back into the whileTrue-loop and continues visiting timers.

@Zelldon
Member Author

Zelldon commented Apr 27, 2022

Hey @romansmirnov, yes, this is what I wrote in the sentence before 😅: "It blocked the actor thread completely because it iterated over the whole column family to write a timer trigger command." The subscription logic is the addition that makes the problem even worse.

@romansmirnov
Member

I guess the optimal solution would be a separate actor that writes to the log.

I agree with that point. I even tried that, but it failed with the exception in #9187 - I haven't investigated it further.

Besides that, the only concern I may have is if there is only one CPU thread configured 😉

@Zelldon
Member Author

Zelldon commented Apr 27, 2022

Besides that, the only concern I may have is if there is only one CPU thread configured 😉

@romansmirnov you mean because it would block the thread again and the other actor can't run? Maybe yield or backpressure would help here then? 🤔

@romansmirnov
Member

@Zelldon, yes, exactly.

Maybe yield or backpressure would help here then?

I think yield would help but haven't thought it through yet.

@pihme
Contributor

pihme commented May 4, 2022

Update before handover:

There were three problems found in DueDateChecker

With the two fixes, I was not able to reliably recover a cluster that had become unresponsive. I was able to recover it once by also implementing a quick fix for the third issue, but I was unable to reproduce that success.

There were a couple of problems found unrelated to DueDateChecker

Another issue that is not needed for the fix, but would make such problems easier to diagnose:
#9282

From what I could see in the logs of the server, it seems to be stuck in some other place as well. It makes good progress, interrupted by long breaks. I don't know what is causing the breaks. The data set is linked in one of the comments above.
I added timing watch guards around the call that processes a command:

        // requires java.time.Duration; logs any command whose processing took more than one second
        final long now = System.currentTimeMillis();
        processCommand(currentRecord);
        final long elapsedTime = System.currentTimeMillis() - now;
        if (elapsedTime > 1000) {
          System.out.println("Elapsed time: " + Duration.ofMillis(elapsedTime));
        }

However, I have not observed any commands that exceeded 1 s.

Another thing to consider is whether the same fixes made for the DueDateTimerChecker would also need to be implemented for job timeouts (in the logs I saw a lot of jobs timing out).

@saig0 saig0 changed the title Cluster is unrecoverable after faulty process is deployed Partitions goes unhealthy if many timers are scheduled May 16, 2022
@saig0 saig0 changed the title Partitions goes unhealthy if many timers are scheduled Partitions gets unhealthy if many timers are scheduled May 16, 2022
@saig0
Member

saig0 commented May 16, 2022

Summary

A partition gets unhealthy if many timers are scheduled for the same time (or in a short time frame).
The number of timers depends on the available resources.

@romansmirnov ran into this issue with a process that has two timers one after the other (with some service tasks in between). Each timer is set to 20 minutes.

With cluster plan S: if more than 17 instances are created per second
With cluster plan M: if more than 27 instances are created per second

The partition got unhealthy but it recovered after a while (20 minutes - 1 hour).


We will not work on this issue immediately because

  • it is unlikely to have so many timers scheduled for the same time
  • it is a general issue about processing large amounts of data from the database (this also applies to job timeouts and message TTL) and requires a bigger effort to fix

@saig0 saig0 unassigned saig0 and pihme May 16, 2022
@korthout korthout added severity/high Marks a bug as having a noticeable impact on the user with no known workaround and removed severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround labels Jul 22, 2022
@korthout
Member

Updated severity to reflect the current team opinion described here. The issue has been moved to the backlog.

@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024
github-merge-queue bot pushed a commit that referenced this issue Mar 27, 2024
## Description


This pull request avoids duplicate timer `TRIGGER` commands and thereby
the pathological case where these would deterministically result
in #17128.

An integration test reproducing that scenario is provided in
`TimerTriggerSchedulingTest`.

This pull request achieves this by introducing two main changes:
- It cancels existing future execution runs of the `DueDateChecker` when scheduling another, to avoid many scheduled executions of the checker at any time (a rough sketch of this idea follows after the lists below)
- It provides a threadsafe way to schedule, execute, and cancel the executions of the `DueDateChecker`
- this is to ensure that the bug is also fixed when enabling the async due date checker

> [!NOTE]
> Cancellation in concurrent (or async) scheduling/execution is not guaranteed to prevent execution.

This pull request also:
- Implements the `onClose` and `onFailure` lifecycle events for the
`DueDateChecker`
- It refactors the `DueDateChecker` for understandability and adds a lot
of documentation
- Introduces an `AtomicUtil` class to allow threadsafe lock-free
replacing of AtomicReference
- The initial execution of `DueDateChecker` is scheduled after 100ms
instead of 0ms

## Related issues


This PR:
resolves #17128
resolves #10541

Not sure exactly, but could:
resolves #16884
resolves #8991
korthout added a commit that referenced this issue Mar 28, 2024
# Description
Backport of #17136 to `release-8.5.0`.

relates to #17128
relates to #10541
relates to #16884
relates to #8991

original author: @korthout
github-merge-queue bot pushed a commit that referenced this issue Mar 28, 2024
github-merge-queue bot pushed a commit that referenced this issue Mar 28, 2024
# Description
Backport of #17136 to `stable/8.1`.

relates to #17128
relates to #10541
relates to #16884
relates to #8991

original author: @korthout

There were a few conflicts that I had to resolve by hand, mostly related
to imports. A description is added to the specific commits.
github-merge-queue bot pushed a commit that referenced this issue Mar 28, 2024
# Description
Backport of #17171 to `stable/8.4`.

relates to #17128 #10541 #16884
#8991 #17136
original author: @oleschoenburg

I've resolved a few small conflicts by hand, mostly due to the
introduction of the `zeebe/` directory. A description is added to the
specific commits.