
Run scheduled Task only in Processing Phase #10112

Merged: 7 commits into main on Aug 23, 2022

Conversation

@Zelldon (Member) commented Aug 18, 2022

Description

This PR makes sure that we no longer run Tasks in phases other than Processing.

We want to execute the scheduled tasks only if the StreamProcessor is in the PROCESSING
PHASE, but no processing is currently happening.

To make sure that:

 * we are not running during replay/init phase (the state might not be up-to-date yet)
 * we are not running during suspending
 * we are not interfering with the current ongoing processing,
   such that all transaction changes are available during our task execution

This means that when the StreamProcessor is suspended, scheduled Tasks are not executed; they are put back into the queue. This was the simplest solution for now, since we have no Actor suspend feature yet (#10006). Furthermore, we no longer run scheduled tasks on replay, since this might also lead to issues (because the state is not yet fully rebuilt, etc.).

Added tests for certain aspects of this feature.
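The guard described above can be sketched as follows. This is a minimal, self-contained model, not the actual Zeebe code; the class and method names here are illustrative only. Tasks are drained from a queue and only run when the phase is PROCESSING and no record is currently being processed; otherwise they are re-enqueued.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch of the scheduling guard: a task only runs when the processor
// is in the PROCESSING phase and no record is currently being processed;
// otherwise it is put back on the queue (the "abort condition").
public class SchedulingGuardSketch {
  enum Phase { INITIAL, REPLAY, PROCESSING, SUSPENDED }

  Phase phase = Phase.INITIAL;
  boolean inProcessing = false;
  final Queue<Runnable> taskQueue = new ArrayDeque<>();

  void schedule(final Runnable task) {
    taskQueue.add(task);
  }

  /** Drains the queue once; tasks that may not run yet are re-enqueued. */
  int runDueTasks() {
    int executed = 0;
    final int size = taskQueue.size();
    for (int i = 0; i < size; i++) {
      final Runnable task = taskQueue.poll();
      if (phase == Phase.PROCESSING && !inProcessing) {
        task.run();
        executed++;
      } else {
        taskQueue.add(task); // defer: replay/suspend/mid-processing
      }
    }
    return executed;
  }

  public static void main(String[] args) {
    final SchedulingGuardSketch guard = new SchedulingGuardSketch();
    guard.schedule(() -> System.out.println("task ran"));

    guard.phase = Phase.REPLAY;
    System.out.println("during replay: " + guard.runDueTasks());   // 0, deferred

    guard.phase = Phase.PROCESSING;
    guard.inProcessing = true;
    System.out.println("mid-processing: " + guard.runDueTasks());  // 0, deferred

    guard.inProcessing = false;
    System.out.println("idle processing: " + guard.runDueTasks()); // 1, executed
  }
}
```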

Related issues

closes #9962
closes #9999
closes #8642 (this will no longer happen since the abort condition is also verified in the SchedulingService)

Definition of Done

Depending on the issue and the pull request, not all items need to be done.

Code changes:

  • The changes are backward compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backward compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

We need to move the StreamProcessor.Phase into the context, so it is accessible by other classes like the ProcessingScheduleService
@Zelldon Zelldon requested a review from npepinpe August 18, 2022 09:02
@Zelldon Zelldon marked this pull request as ready for review August 18, 2022 09:13
@@ -242,11 +248,22 @@
    return streamProcessor;
  }

  public void pauseProcessing() {
    pauseProcessing(getLogName(DEFAULT_PARTITION));
  }

  public void pauseProcessing(final String streamName) {
    streamContextMap.get(streamName).streamProcessor.pauseProcessing().join();
    LOG.info("Paused processing for stream {}", streamName);
  }

  public void resumeProcessing() {
    resumeProcessing(getLogName(DEFAULT_PARTITION));
  }

Code scanning / CodeQL notice: Invoking StreamPlatform.pauseProcessing should be avoided because it has been deprecated. The same notice applies to StreamPlatform.resumeProcessing.
@github-actions bot (Contributor) commented Aug 18, 2022

Test Results

856 files (+9), 856 suites (+9), duration 1h 54m 48s (+14m 17s)
6,519 tests (+67): 6,497 passed (+56), 11 skipped (±0), 11 failed (+11)
6,638 runs (+2): 6,613 passed (−12), 11 skipped (±0), 14 failed (+14)

For more details on these failures, see this check.

Results for commit 18e7f58. ± Comparison against base commit c108b34.

♻️ This comment has been updated with latest results.

@npepinpe (Member) left a comment

Looks good in general.

I have one question about the requirement:

> we are not interfering with the current ongoing processing, such that all transaction changes are available during our task execution

I thought if everything worked within its own transaction then it shouldn't interfere with each other anyway, no? Maybe I misunderstood how transactions work in RocksDB. Can you explain that to me?

Anyway, so right now it means if I want to schedule something, I will schedule it anyway but might not do anything, and we rely on the runAtFixedRate implementation to reschedule it, correct? So in a paused service, this means it will keep enqueuing forever until it's unpaused, correct? Could that cause starvation to the things that run at fixed rate?

e.g. I want to do something at fixed rate, but every time I try to schedule, we're currently in processing, so I never do anything.

How likely is this? Unlikely? Cause in a processing system, I expect we're very often "in processing", right?

@@ -61,6 +62,9 @@ public void onSkipped(final LoggedEvent skippedRecord) {}
// safe structure here
private boolean inProcessing;

// this is accessed outside, which is why we need to make sure that it is thread-safe
private volatile StreamProcessor.Phase phase = Phase.INITIAL;
Member:
❓ What accesses it from the outside?

Member Author:

Right now the health monitor accesses it. I don't want to fix this in this PR, so I kept it as is.

@Zelldon (Member Author) commented Aug 19, 2022

The stream processor doesn't use any runAtFixedRate anymore. This is just a wrapper around runDelayed and rescheduling.

Regarding tasks never being executed: this should also not be an issue, since we no longer use #run in the stream processor, which means everything is appended to the same queue and no skipping the line happens.

Hope that helps.

P.S. Regarding the transaction: the requirement came up because the engine wants to schedule jobs (a replacement for side effects), but they might not be committed yet.

pauseProcessing(getLogName(DEFAULT_PARTITION));
}

// todo remove multi partition support - is not necessary for the StreamProcessor tests
Member Author:
I added a bullet point for this here: #9727

@@ -26,6 +27,7 @@ public class ProcessingScheduleServiceImpl implements ProcessingScheduleService
private final StreamProcessorContext streamProcessorContext;
private final AbortableRetryStrategy writeRetryStrategy;

// todo remove context from CTOR; will be cleaned up after engine abstraction is done
Member Author:
I added a bullet point for it here #9727 (comment)


@npepinpe (Member) commented:

> Regarding tasks never being executed: this should also not be an issue, since we no longer use #run in the stream processor, which means everything is appended to the same queue and no skipping the line happens.

Maybe I misunderstood, but even if runAtFixedRate is just a wrapper around runDelayed, there's still the possible case of:

  1. Process is inProcessing
  2. Schedule runnable X
  3. Processor updates state (still inProcessing)
  4. Scheduled runnable X does nothing because inProcessing; gets rescheduled for the next delay
  5. Processor does more stuff, etc.
  6. Processor gets back to inProcessing
  7. Delay passes, so now the runnable X executes; does nothing because inProcessing, gets rescheduled for next delay

Couldn't this happen in a loop? My question was, do we consider this so unlikely (that it would loop like this)? Cause I would expect we're most of the time "in processing", right? We're only not in processing between executeSideEffects and processCommand (so only when reading the next record). Since scheduled jobs aren't sequenced with normal jobs (they get sequenced only when the timer expires), there's no guarantee of ordering? Am I missing something?

@npepinpe (Member) commented:

> P.S. Regarding the transaction: the requirement came up because the engine wants to schedule jobs (a replacement for side effects), but they might not be committed yet.

I really thought RocksDB transactions were isolated from each other. Is that not true? Or is it because side effects (or the replacement) use the same transaction context as the processing?

@Zelldon (Member Author) commented Aug 19, 2022

> I really thought RocksDB transactions were isolated from each other. Is that not true? Or is it because side effects (or the replacement) use the same transaction context as the processing?

I think we're talking about two things here. It is not about whether transactions are isolated. It is about data being written to the state before the state is accessed again by a scheduled task. To be specific, see for example https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/common/CatchEventBehavior.java#L302-L308: we want this to be executed AFTER the transaction is committed. Right now this is done via side effects, which we wanted to replace with the SchedulingService, as you can see here: #9723 (comment)

But since we're talking about it: while I was on vacation, @pihme created the post-commit tasks (https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/api/PostCommitTask.java), which can be added to the ProcessingResult. After this PR we could replace them with the ProcessingScheduleService, so we would only need to maintain one approach. Or we decide to keep them, use them for that use case, and drop the ProcessingScheduleService's guarantee that tasks are not executed during inProcessing. Worth discussing, I guess.
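For context, the post-commit-task idea can be sketched like this. This is illustrative only; the real PostCommitTask/ProcessingResult API in Zeebe differs, and all names here are assumptions. Side effects are collected while a record is processed and only run once the state transaction has been committed, so they never observe uncommitted state.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the post-commit-task idea: side effects are collected
// during processing and executed only after the transaction commit succeeded.
public class PostCommitSketch {
  interface PostCommitTask {
    void run() throws Exception;
  }

  private final List<PostCommitTask> tasks = new ArrayList<>();

  // called during processing, inside the open transaction
  public void appendPostCommitTask(final PostCommitTask task) {
    tasks.add(task);
  }

  // called after the transaction commit succeeded; returns how many tasks ran
  public int onCommitted() {
    int executed = 0;
    for (final PostCommitTask task : tasks) {
      try {
        task.run();
        executed++;
      } catch (final Exception e) {
        // a failing side effect must not fail the already-committed processing
        System.err.println("post-commit task failed: " + e.getMessage());
      }
    }
    tasks.clear();
    return executed;
  }
}
```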

@Zelldon (Member Author) commented Aug 19, 2022

> > Regarding tasks never being executed: this should also not be an issue, since we no longer use #run in the stream processor, which means everything is appended to the same queue and no skipping the line happens.
>
> Maybe I misunderstood, but even if runAtFixedRate is just a wrapper around runDelayed, there's still the possible case of:
>
>   1. Process is inProcessing
>   2. Schedule runnable X
>   3. Processor updates state (still inProcessing)
>   4. Scheduled runnable X does nothing because inProcessing; gets rescheduled for the next delay
>   5. Processor does more stuff, etc.
>   6. Processor gets back to inProcessing
>   7. Delay passes, so now the runnable X executes; does nothing because inProcessing, gets rescheduled for next delay
>
> Couldn't this happen in a loop? My question was, do we consider this so unlikely (that it would loop like this)? Cause I would expect we're most of the time "in processing", right? We're only not in processing between executeSideEffects and processCommand (so only when reading the next record). Since scheduled jobs aren't sequenced with normal jobs (they get sequenced only when the timer expires), there's no guarantee of ordering? Am I missing something?

Not really. The runnable is scheduled with the delay, and after the delay has passed, inProcessing is checked. If inProcessing is true, we add the runnable to the queue, and eventually we will execute it, since all jobs are added to the same queue (no delay applies again here). To me it looks similar to our log stream and the processing.


Right now I don't see an issue, but please tell me if I'm wrong.

Update: Sorry, I checked the code again and missed that runAtFixedRate is defined here https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/api/ProcessingScheduleService.java#L35-L42 and reschedules outside the check.

TBH, most of the scheduled tasks right now don't need it, so maybe we go with the post-commit tasks in the ProcessingResult, or we support other methods in the ProcessingScheduleService with different guarantees for different use cases 🤷

@Zelldon (Member Author) commented Aug 19, 2022

> Or we decide to keep them, use them for that use case, and drop the ProcessingScheduleService's guarantee that tasks are not executed during inProcessing.

I guess this would also be interesting, since then we wouldn't have the issue mentioned here (#10112 (comment)) where runDelayed and runAtFixedRate are postponed for too long because the StateMachine is in processing.

@npepinpe (Member) commented:

My bad, I must've been tired when I did the review, because it seems pretty obvious now. So we schedule, and if we're "busy", then we enqueue the task for later consumption. It would be nice if we don't re-schedule immediately but instead only when the task has actually run, because right now two tasks that should have a delay between them might end up running immediately one after the other 🤷

Regarding the transaction, if I understand correctly, it's just about making sure that the scheduled task isn't triggered before the current record processing is finished (since it needs access to whatever will be committed). Sorry, I'm a little slow right now 😅

I'm not familiar with the decisions to remove the side effects/post commit tasks, to be honest, so I'll trust that you all took the best decision and that it's the right thing to do.

I think it's not hard to modify the runAtFixedRate implementation to make sure we only re-schedule when the task is actually executed, e.g.

  @Override
  public void runAtFixedRate(final Duration delay, final Task task) {
    /* TODO preliminary implementation; with the direct access
     * this only works because this class is scheduled on the same actor as the
     * stream processor.
     */
    runDelayed(
        delay,
        toRunnable(
            builder -> {
              try {
                return task.execute(builder);
              } finally {
                runAtFixedRate(delay, task);
              }
            }));
  }

This can probably be "optimized" to avoid rebuilding the runnable/rescheduling task I guess, but you get the idea.
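That optimization could look roughly like the sketch below. It is self-contained and built on `ScheduledExecutorService` rather than Zeebe's actor scheduler, so everything except the idea itself (reschedule only after the task body ran, and reuse one Runnable instance instead of rebuilding it) is an assumption.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: fixed-rate scheduling built on a delayed scheduler. The wrapping
// Runnable is built once and rescheduled from its own finally block, i.e.
// only after the task body has actually executed.
public class FixedRateViaDelaySketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void runAtFixedRate(final long delayMs, final Runnable task) {
    final Runnable[] wrapped = new Runnable[1]; // holder for self-reference
    wrapped[0] =
        () -> {
          try {
            task.run();
          } finally {
            // reschedule the same Runnable instance, no rebuilding
            if (!scheduler.isShutdown()) {
              scheduler.schedule(wrapped[0], delayMs, TimeUnit.MILLISECONDS);
            }
          }
        };
    scheduler.schedule(wrapped[0], delayMs, TimeUnit.MILLISECONDS);
  }

  public void shutdown() {
    scheduler.shutdownNow();
  }

  public static void main(String[] args) throws InterruptedException {
    final FixedRateViaDelaySketch s = new FixedRateViaDelaySketch();
    final AtomicInteger runs = new AtomicInteger();
    s.runAtFixedRate(10, runs::incrementAndGet);
    Thread.sleep(120);
    s.shutdown();
    System.out.println("ran " + runs.get() + " times");
  }
}
```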

@npepinpe (Member) left a comment
👍

I'm a little tired and slower these days, sorry for the back and forth, it makes sense to me now.

I think we can improve by only rescheduling the task after it's been actually executed (and not immediately after the converted Runnable runs). That's simple enough to do I think, so no need for another review.

Previously, the ProcessingSchedulingService rescheduled the task even when the task was not executed (because the processor was not in the processing phase). This can cause a lot of scheduled tasks to pile up on the queue.
@Zelldon (Member Author) commented Aug 23, 2022

Thanks @npepinpe, I did that.

bors r+

@zeebe-bors-camunda (Contributor):

Build succeeded:

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit cf7cd1f into main Aug 23, 2022
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the zell-suspend-in-ProcessingScheduleService branch August 23, 2022 14:03
zeebe-bors-camunda bot added a commit that referenced this pull request Aug 30, 2022
10214: Unflake TimerStartEventTest.shouldTriggerOnlyTwice r=korthout a=korthout

## Description

This test became flaky after a [change in the task scheduler](#10112). Since then, scheduled tasks are only executed when the engine is in the PROCESSING phase and not currently processing a command; otherwise the task is rescheduled. That means that sometimes things take longer, especially when the processor is busy.

In this specific test case, new process instances are created using 2 timer start events on different processes. Each time they trigger, they produce a bunch of records because they start new process instances. The processing of the relevant commands for these instances can interfere with the DueDateTimerChecker which is a scheduled task. Because of the new task scheduler change, it can happen that the DueDateTimerChecker is postponed.

The problem is solved by periodically checking that the assertion holds, and increasing the clock's time every time we check this. This makes sure that the DueDateTimerChecker's task can be executed, even when it's rescheduled.
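The fix described above can be sketched like this. The names are illustrative, not the actual test code (which presumably uses the test clock and an awaitility-style helper): instead of advancing the clock once and asserting, the assertion is re-checked repeatedly with a small clock nudge between checks, so a postponed scheduled task eventually gets another chance to trigger.

```java
import java.util.function.BooleanSupplier;

// Sketch of "periodically check the assertion and advance the clock each time".
public class AwaitWithClockSketch {
  static boolean awaitCondition(
      final BooleanSupplier assertion, final Runnable advanceClock, final int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (assertion.getAsBoolean()) {
        return true;
      }
      advanceClock.run(); // give a rescheduled task another trigger
    }
    return assertion.getAsBoolean();
  }

  public static void main(String[] args) {
    // fake timer: only "fires" once the clock has passed its due time
    final long[] clock = {0};
    final long dueTime = 5;
    final boolean triggered =
        awaitCondition(() -> clock[0] >= dueTime, () -> clock[0]++, 10);
    System.out.println("triggered: " + triggered); // prints "triggered: true"
  }
}
```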

## Related issues


closes #10169 



Co-authored-by: Nico Korthout <nico.korthout@camunda.com>