Partitions gets unhealthy if many timers are scheduled #8991

Open
Zelldon opened this issue Mar 25, 2022 · 22 comments
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) chaos-day Marks an issue or PR as related to an experiment performed during a chaos day component/engine component/zeebe Related to the Zeebe component/team gameday Marks an issue or PR as related to an experiment performed during a gameday kind/bug Categorizes an issue or PR as a bug severity/high Marks a bug as having a noticeable impact on the user with no known workaround version:8.1.0-alpha1 Marks an issue as being completely or in parts released in 8.1.0-alpha1 version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0
Milestone

Comments

@Zelldon
Member

Zelldon commented Mar 25, 2022

Describe the bug

We deployed the following process

[image: gamedayprocess BPMN diagram]

There were two bugs in the process:

  • a parallel sequence flow to the call activity, causing endless recursion
  • the timer expression =date and time(today(), time("16:00:00@Europe/Berlin")) - now() was not optimal: it caused a negative duration once that time had passed (see the sketch below)
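For illustration, a minimal java.time sketch of the same arithmetic (not the engine's FEEL evaluation) shows why the expression yields a negative duration once the target time has passed:

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class NegativeTimerDuration {
  public static void main(String[] args) {
    ZoneId berlin = ZoneId.of("Europe/Berlin");
    // "today at 16:00 Europe/Berlin", analogous to date and time(today(), time("16:00:00@Europe/Berlin"))
    ZonedDateTime target =
        ZonedDateTime.now(berlin).withHour(16).withMinute(0).withSecond(0).withNano(0);
    // duration = target - now(); after 16:00 this is negative, i.e. the timer is immediately due
    Duration delay = Duration.between(ZonedDateTime.now(berlin), target);
    System.out.println("delay = " + delay + ", negative = " + delay.isNegative());
  }
}
```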

When the model is deployed, it endlessly creates new process instances and fills the timer column family. When the specific time is reached, a huge number of timers has to be triggered, possibly millions depending on how long the recursion has been running and how many instances were initially started. The recursion continues and keeps adding more due timers to the timer column family.

The problem here is that processing is completely blocked, since the timer checker runs in the same actor. Furthermore, it produces an endless backlog that the processor is not able to catch up with. The processor becomes unhealthy and backpressure goes up to 100% since requests are not answered.

[screenshot: general]
[screenshot: processing]

To Reproduce

Deploy this model
gamedayprocess.bpmn.txt and start new instances. Watch out: you can throw your cluster away afterwards.

Expected behavior
As a user I'm not able to completely f** up my cluster, or at least there is some way to recover from this.

@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug Impact: Availability severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround chaos-day Marks an issue or PR as related to an experiment performed during a chaos day gameday Marks an issue or PR as related to an experiment performed during a gameday labels Mar 25, 2022
@npepinpe npepinpe added area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) and removed Impact: Availability labels Apr 11, 2022
@npepinpe
Member

Seems like we did not triage this, and this was marked as critical. Let's talk about it tomorrow - I guess the board switch caused me to miss this one.

@npepinpe npepinpe added this to the 8.1 milestone Apr 12, 2022
@pihme pihme self-assigned this Apr 26, 2022
@pihme
Contributor

pihme commented Apr 26, 2022

This also occurred in a simpler process. For example, in Roman's benchmark he wants to put a certain load on the system. He starts a number of process instances, e.g. 10/s. Each process waits at two points for 20 minutes. Because of the continuous creation of process instances, at any point in time there are a couple of timers ready to be triggered.

Depending on the load, performance is heavily degraded.

@pihme pihme mentioned this issue Apr 26, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Apr 26, 2022
9217: deps(maven): bump protobuf-bom from 3.20.0 to 3.20.1 r=npepinpe a=dependabot[bot]

9218: deps(maven): bump maven-antrun-plugin from 3.0.0 to 3.1.0 r=npepinpe a=dependabot[bot]

9226: deps(maven): bump spotless-maven-plugin from 2.22.1 to 2.22.2 r=npepinpe a=dependabot[bot]

9230: Renaming and javadocs r=remcowesterhoud a=pihme

## Description

- rename dueDateKey -> dueDate - because it is a due date, not a key
- rename findTimersWithDueDateBefore(...) -> processTimersWithDueDateBefore(...) - it's a more precise description of the method
- add JavaDoc comments to TimerInstanceState interface

## Related issues
related to #8991



Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: pihme <pihme@users.noreply.github.com>
@pihme
Contributor

pihme commented Apr 26, 2022

Roman's workflow looked like this:
[image: Roman's workflow]

I guess it can be simplified to:
[image: simplified workflow]

But maybe the other elements are necessary to add latency between the time a command is written to the log and the time the record is processed.

This is just to say that it can also happen without a specially crafted process from game days.

Looking at the code I can see the following:

  • Timer Records are added to the database when the timer element is activated
  • Timer records are removed after the TimerIntent.TRIGGER has been processed.

This also means that the DueDateTimeChecker might visit the same timer multiple times and create multiple trigger commands for it, because the DueDateTimeChecker does not alter the database and only writes to the log.

So my working hypothesis is that there is an escalating feedback loop:

  • The DueDateTimeChecker visits a couple of timers once and writes commands for these timers.
  • The DueDateTimeChecker is triggered again before these commands have been processed, so it revisits the same timers plus any new timers that have come due since the last run. It then writes duplicate commands for the old timers and new commands for the newly found ones.
  • Thus, if we cannot process all commands that the DueDateTimeChecker wrote in the last run before it runs again, we write more and more duplicate records to the log and revisit certain timers again and again.
  • Since the DB column family is chronological, the commands are also chronological. So in the second round we encounter the duplicate commands first and waste time on those, which means we make even less progress in removing the new timers from the database.
  • This can then escalate until we are totally blocked, because thousands of timers have not been removed yet, and thousands of timers lead to thousands of follow-up records.

It could also be that this happens more prominently after a broker resumes, because much time has passed since the last run. This could also explain Roman's observation that on some days he gets stable performance and on others not.

/cc @Zelldon @saig0 wdyt?

@saig0
Member

saig0 commented Apr 27, 2022

@pihme yes. That sounds correct.

But there is one more aspect: if the DueDateTimeChecker visits many timers in one check, it blocks the actor thread and other actor jobs can't be executed (e.g. sending heartbeats).

A solution could be to limit the number of timers for one check (e.g. only 100 timers at once).
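A minimal sketch of that idea, with hypothetical names (this is not the actual DueDateTimeChecker API): stop after a fixed batch size and reschedule the checker for the remaining timers instead of blocking the actor thread.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical shape of a bounded check run; TimerInstance and the scheduling hook are assumptions.
final class BoundedTimerCheck {
  static final int MAX_TIMERS_PER_CHECK = 100;

  static void triggerDueTimers(
      Iterator<TimerInstance> dueTimers,
      Consumer<TimerInstance> writeTriggerCommand,
      Consumer<Duration> rescheduleCheck) {
    int triggered = 0;
    while (dueTimers.hasNext() && triggered < MAX_TIMERS_PER_CHECK) {
      writeTriggerCommand.accept(dueTimers.next());
      triggered++;
    }
    if (dueTimers.hasNext()) {
      // more due timers remain: yield and come back soon instead of blocking the actor thread
      rescheduleCheck.accept(Duration.ofMillis(100)); // 100 ms is an arbitrary example delay
    }
  }

  record TimerInstance(long elementInstanceKey, long dueDate) {}
}
```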

@pihme
Contributor

pihme commented Apr 27, 2022

Not sure this would be a valid solution. My fear is that it will just keep falling further and further behind.

Right now the highest possible frequency of DueDateTimeChecker runs seems to be 10 Hz. If we only allow triggering 100 timers per invocation, that is a capacity of at most 1,000 triggers per second; if we create 200 timers every 100 ms (2,000 per second), we will just fall further and further behind. And this build-up would not be reflected in the backpressure mechanism.

My hope is that if we always trigger all due timers, this will cause delay, which will influence backpressure. The way the system is set up now, I see no other way than to let backpressure dial the sustained throughput down to the speed of the slowest link in the processing chain.

@Zelldon
Member Author

Zelldon commented Apr 27, 2022

Sorry for the late response. Interesting observation you made, @pihme, and it sounds right. In my case it was more like @saig0 described: the actor didn't stop writing due timers; I saw many, many written records per second but no processing at all.

I agree with @pihme's argument. Not sure whether it would really help to just limit it; that sounds like a band-aid 🩹 I guess the optimal solution would be a separate actor that writes to the log. This would at least help with not blocking processing. Still, it can happen that we write many, many timers at once. Maybe we could solve that by writing over the command API or something similar, so that backpressure is taken into account, like @pihme said. 🤷

@pihme
Contributor

pihme commented Apr 27, 2022

@Zelldon Could you comment on how actor tasks are prioritized, i.e. time-based tasks vs. ordinary tasks?

More precisely: suppose the DueDateTimeChecker is scheduled every 100 ms but each run takes more than 100 ms. What will happen? Will the actor always and only schedule the DueDateTimeChecker, or will other tasks make some progress before the DueDateTimeChecker is scheduled again?

@pihme
Contributor

pihme commented Apr 27, 2022

And the second question: what happens when you schedule a task with a negative delay? Because I think this can happen.

@Zelldon
Member Author

Zelldon commented Apr 27, 2022

Hey @pihme

As always, it depends. In general, timers and "ordinary" tasks end up being scheduled the same way. We have the fast lane and the queue for lower-priority work, so we have a kind of implicit priority. What I understand from the code is that if there are already jobs in the fast lane, they are executed first, and due timers are added to the fast lane as well. I guess the issue is even more problematic if we use runAtFixedRate and the execution takes longer than the rate.

To get to your example: if the DueDateTimeChecker were scheduled via runAtFixedRate with a rate of 100 ms and a run took longer than 100 ms, a new job would already have been added. Normally we use runDelayed, which means we schedule the next job only after we are done. AFAIK this is how it's done in the DueDateTimeChecker.
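As an analogy only (using a plain ScheduledExecutorService rather than Zeebe's actor scheduler), the difference between the two scheduling styles described above looks roughly like this:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulingStyles {
  static final ScheduledExecutorService SCHEDULER = Executors.newSingleThreadScheduledExecutor();

  public static void main(String[] args) {
    // Fixed-rate style: if a run takes longer than the 100 ms period, the next run is already
    // queued and starts right after the current one finishes -- the checker can monopolize the thread.
    // (The scheduler is intentionally never shut down in this sketch.)
    SCHEDULER.scheduleAtFixedRate(SchedulingStyles::checkTimers, 0, 100, TimeUnit.MILLISECONDS);

    // To try the "runDelayed" style instead, call checkTimersThenReschedule() once.
  }

  static void checkTimers() {
    // iterate due timers and write trigger commands (omitted in this sketch)
  }

  // "runDelayed" style: the next run is scheduled only after the current one completes,
  // so an overrunning check pushes the next check out instead of piling up pending runs.
  static void checkTimersThenReschedule() {
    checkTimers();
    SCHEDULER.schedule(SchedulingStyles::checkTimersThenReschedule, 100, TimeUnit.MILLISECONDS);
  }
}
```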

In the scenario described above, which I used in the gameday, the DueDateTimeChecker had thousands or millions of timers that were due at the same time. It blocked the actor thread completely because it iterated over the whole column family to write a timer trigger command. I guess that due to the subscription logic and the "priority" of timers being scheduled in the fast lane, processing did not happen for a long time. This is my current assumption.

Regarding the negative duration: this happened to me accidentally in the gameday. It means the timer is immediately due.

Does this help?

@pihme
Contributor

pihme commented Apr 27, 2022

Summary of Preliminary Findings

There are two problems we need to solve:

  • Huge backlog of accumulated timers
  • Steady state with high throughput of process instances

The first problem was the focus of the original description of the issue. This could for example occur when a company wants to send their customers a thank you email on Christmas. Throughout the year timers are collected, which all fire on Christmas day simultaneously.

Here we see the following issues:

The second problem was observed by Roman in benchmarks. Here it is just a steady load which leads to nondeterministic blocking of the processing thread.

Here we see the following issues:

  • Due timers may be visited several times by the DueDateTimeChecker. Each time it writes a new TriggerTimer command, and only when that command is processed is the timer removed. So we run into a difference between the speed at which timers are created and the speed at which timers are removed, and writing duplicate commands makes this worse for the reasons outlined above.
  • DueDateTimeChecker may be scheduled with a negative delay (#9236): the next run may be scheduled with a negative delay, which means it runs pretty much immediately (a clamping sketch follows below).
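The negative delay can arise when the delay is computed as the difference between the next due date and now while the due date already lies in the past. A clamping guard, sketched here with hypothetical names (not the actual fix in #9236), avoids it:

```java
import java.time.Duration;
import java.time.Instant;

final class NextCheckDelay {
  /**
   * Computes the delay until the next check. If the next due date is already in the past,
   * the raw difference is negative; clamping it avoids scheduling with a negative delay.
   */
  static Duration delayUntil(Instant nextDueDate, Instant now) {
    Duration raw = Duration.between(now, nextDueDate); // negative if nextDueDate < now
    return raw.isNegative() ? Duration.ZERO : raw;
  }
}
```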

@romansmirnov
Member

Just want to share my thoughts 😄

@Zelldon,

Maybe a comment on your #8991 (comment) 😄

I guess that due to the subscription logic and the "priority" of timers being scheduled in the fast lane, processing did not happen for a long time. This is my current assumption.

In the described scenario, with millions of timers being due simultaneously, I think the problem is not with how the actor schedules the jobs. In my opinion, this whileTrue-loop is the problem:

https://github.com/camunda/zeebe/blob/dd3e9fb653e04223e4fe64b88b3273231efa6c6e/engine/src/main/java/io/camunda/zeebe/engine/state/instance/DbTimerInstanceState.java#L90-L105

This means it remains in this loop as long as there is a next timer event that is due. With millions of timers due simultaneously, it executes that loop until it has visited all of them, and this may take some time and block processing on that partition in the meantime.
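Paraphrased, the loop shape being described is roughly the following (a simplified sketch with stand-in types, not the actual DbTimerInstanceState code):

```java
import java.util.Iterator;

// Simplified paraphrase of the described loop; the types here are stand-ins.
final class DueTimerLoopSketch {
  interface TimerInstance { long getDueDate(); }
  interface TimerVisitor { void visit(TimerInstance timer); } // in Zeebe this writes the TRIGGER command

  /** Visits every timer whose due date is <= timestamp; returns the due date of the next pending timer. */
  static long processTimersWithDueDateBefore(
      Iterator<TimerInstance> orderedByDueDate, long timestamp, TimerVisitor visitor) {
    long nextDueDate = -1L;
    while (orderedByDueDate.hasNext()) {
      TimerInstance timer = orderedByDueDate.next();
      if (timer.getDueDate() > timestamp) {
        nextDueDate = timer.getDueDate(); // first timer that is not yet due
        break;
      }
      visitor.visit(timer); // the timer itself is only removed later, when TRIGGER is processed
    }
    // With millions of due timers, this loop only returns once all of them were visited,
    // which blocks the actor thread (and therefore processing) for the whole time.
    return nextDueDate;
  }
}
```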

@pihme,

Here it is just a steady load which leads to nondeterministic blocking of the processing thread.

I don't think my scenario is much different from Chris's scenario in the gameday. The only difference is that there are more than 20k timer events on one partition, whose due dates are distributed over the next 20 minutes. So not all of them are due at the same time.

What can happen (and does happen) is that it remains in the whileTrue-loop for the next 20 minutes, because there is always a next timer event that is due. It may also happen that in between it reaches a moment when the next timer event is not due yet (but would become due at the very next moment). Then it schedules the next run accordingly, but because of how the actor schedules the jobs (as described by @Zelldon), it immediately goes back into the whileTrue-loop and continues visiting timers.

@Zelldon
Member Author

Zelldon commented Apr 27, 2022

Hey @romansmirnov, yes, this is what I wrote in the sentence before 😅: "It blocked the actor thread completely because it iterated over the whole column family to write a timer trigger command." The subscription logic is the addition that makes the problem even worse.

@romansmirnov
Member

I guess the optimal solution would be a separate actor that writes to the log.

I agree with that point. I even tried that, but it failed with the exception in #9187 - I haven't investigated it further.

Besides that, the only concern I may have is if there is only one CPU thread configured 😉

@Zelldon
Member Author

Zelldon commented Apr 27, 2022

Besides that, the only concern I may have is if there is only one CPU thread configured 😉

@romansmirnov you mean because it would block the thread again and the other actor can't run? Maybe yield or backpressure would help here then? 🤔

@romansmirnov
Member

@Zelldon, yes, exactly.

Maybe yield or backpressure would help here then?

I think yield would help but haven't thought it through yet.

@pihme
Contributor

pihme commented May 4, 2022

Update before handover:

There were three problems found in DueDateChecker

With the two fixes, I was not able to reliably recover a cluster that had become unresponsive. I was able to recover it once by also implementing a quick fix for the third issue, but I was unable to reproduce that success.

There were a couple of problems found unrelated to DueDateChecker

Another issue that is not needed for the fix, but would make such problems easier to diagnose:
#9282

From what I could see in the logs of the server, it seems to be stuck in some other place as well. It makes good progress, interrupted by long breaks. I don't know what is causing the breaks. The data set is linked in one of the comments above.
I added timing watch guards around the call that processes a command:

        // requires java.time.Duration; logs any command whose processing took more than one second
        final long now = System.currentTimeMillis();
        processCommand(currentRecord);
        final long elapsedTime = System.currentTimeMillis() - now;
        if (elapsedTime > 1000) {
          System.out.println("Elapsed time: " + Duration.ofMillis(elapsedTime));
        }

However, I have not observed any commands that exceeded 1 s.

Another thing to consider is whether the same fixes made for the DueDateTimerChecker would also need to be implemented for job timeouts (in the logs I saw a lot of jobs timing out).

@saig0 saig0 changed the title Cluster is unrecoverable after faulty process is deployed Partitions goes unhealthy if many timers are scheduled May 16, 2022
@saig0 saig0 changed the title Partitions goes unhealthy if many timers are scheduled Partitions gets unhealthy if many timers are scheduled May 16, 2022
@saig0
Member

saig0 commented May 16, 2022

Summary

A partition gets unhealthy if many timers are scheduled for the same time (or in a short time frame).
The number of timers depends on the available resources.

@romansmirnov ran into this issue with a process that has two timers one after the other (with some service tasks in between). Each timer is set to 20 minutes.

With cluster plan S: if more than 17 instances are created per second
With cluster plan M: if more than 27 instances are created per second

The partition got unhealthy but it recovered after a while (20 minutes - 1 hour).


We will not work on this issue immediately because

  • it is unlikely to have so many timers scheduled for the same time
  • it is a general issue about processing large amounts of data from the database (this also applies to job timeouts and message TTL) and requires a bigger effort to fix

@saig0 saig0 unassigned saig0 and pihme May 16, 2022
@korthout korthout added severity/high Marks a bug as having a noticeable impact on the user with no known workaround and removed severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround labels Jul 22, 2022
@korthout
Member

Updated severity to reflect the current team opinion described here. The issue has been moved to the backlog.

@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024
github-merge-queue bot pushed a commit that referenced this issue Mar 27, 2024
## Description


This pull request avoids duplicate timer `TRIGGER` commands and thereby
the pathological case where these would deterministically result
in #17128.

An integration test reproducing that scenario is provided in
`TimerTriggerSchedulingTest`.

This pull request achieves this by introducing two main changes:
- It cancels existing future execution runs of the `DueDateChecker` when scheduling another, to avoid many scheduled executions of the checker at any time (a rough sketch of this idea follows after the lists below)
- It provides a threadsafe way to schedule, execute, and cancel the executions of the `DueDateChecker`
- this is to ensure that the bug is also fixed when enabling the async due date checker

> [!NOTE]
> Cancellation in concurrent (or async) scheduling/execution is not guaranteed to prevent execution.

This pull request also:
- Implements the `onClose` and `onFailure` lifecycle events for the
`DueDateChecker`
- It refactors the `DueDateChecker` for understandability and adds a lot
of documentation
- Introduces an `AtomicUtil` class to allow threadsafe lock-free
replacing of AtomicReference
- The initial execution of `DueDateChecker` is scheduled after 100ms
instead of 0ms

## Related issues


This PR:
resolves #17128
resolves #10541

Not sure exactly, but could:
resolves #16884
resolves #8991
korthout added a commit that referenced this issue Mar 28, 2024
# Description
Backport of #17136 to `release-8.5.0`.

relates to #17128
relates to #10541
relates to #16884
relates to #8991

original author: @korthout
github-merge-queue bot pushed a commit that referenced this issue Mar 28, 2024
github-merge-queue bot pushed a commit that referenced this issue Mar 28, 2024
# Description
Backport of #17136 to `stable/8.1`.

relates to #17128
relates to #10541
relates to #16884
relates to #8991

original author: @korthout

There were a few conflicts that I had to resolve by hand, mostly related
to imports. A description is added to the specific commits.
github-merge-queue bot pushed a commit that referenced this issue Mar 28, 2024
# Description
Backport of #17171 to `stable/8.4`.

relates to #17128 #10541 #16884
#8991 #17136
original author: @oleschoenburg

I've resolved a few small conflicts by hand, mostly due to the
introduction of the `zeebe/` directory. A description is added to the
specific commits.