Flaky CompleteProcessInstanceAfterLeaderChangeTest.shouldCompleteProcessInstanceAfterSeveralLeaderChanges #9813
Comments
9845: chore(ci): adjust direct Vault usage in Jenkins to new path r=npepinpe a=Langleu — Infra is restructuring Vault paths to ease self-service; see the announcement on [slack](https://camunda.slack.com/archives/C03AEFWGJ9K/p1657524606321389). GitHub paths will also be adjusted later this week in a different PR. The goal is to combine everything under a product space, which also makes reusing secrets easier, as you don't have to duplicate them between GitHub Actions and Jenkins. Related issues: [INFRA-3326](https://jira.camunda.com/browse/INFRA-3326) on the Infra side.

9884: Simplify ControlledActorClockEndpointIT r=npepinpe a=npepinpe — This PR simplifies the `ControlledActorClockEndpointIT`, dropping the dependency on Elasticsearch and using `BpmnAssert` to verify the records produced. I also opted not to use `ZeebeClock` directly, or even `ZeebeTestEngine#increaseTime`, as we want to explicitly test the endpoint and not the interfaces provided by other libraries.

9886: test(qa): wait until message is published before restarting the broker r=deepthidevaki a=deepthidevaki — The test was flaky: it sometimes failed in 'correlate message after restart' while waiting for the instance to be completed. It is not clear why exactly the test fails, but a probable reason is that the message was not published before the broker was restarted. Closes #9813.

Co-authored-by: Langleu <lars.lange@camunda.com>
Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
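A minimal sketch of the kind of guard PR #9886 describes: publish the message, then block until it is confirmed as published before restarting the broker, so the restart cannot race the publish. The `TestBroker`/`TestClient` fixtures and their methods are hypothetical stand-ins, not the actual QA utilities used in the test; only Awaitility is a real dependency here.

```java
import java.time.Duration;
import org.awaitility.Awaitility;

final class RestartAfterPublishSketch {

  void correlateMessageAfterRestart(final TestBroker broker, final TestClient client) {
    // Publish the message the test later expects to be correlated.
    client.publishMessage("order-received", "order-123"); // hypothetical helper

    // Wait until the publish is visible before touching the broker, so the
    // restart cannot happen while the publish is still in flight.
    Awaitility.await("message is published")
        .atMost(Duration.ofSeconds(10))
        .until(() -> client.hasPublishedMessage("order-received")); // hypothetical helper

    broker.restart(); // only now is it safe to restart
    // ... continue: complete the instance and assert the message was correlated
  }

  // Hypothetical test fixtures for illustration only.
  interface TestBroker {
    void restart();
  }

  interface TestClient {
    void publishMessage(String name, String correlationKey);
    boolean hasPublishedMessage(String name);
  }
}
```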
Still fails #9886 (comment) 😭
9894: [Backport stable/1.3] test(qa): wait until message is published before restarting the broker r=deepthidevaki a=backport-action — Backport of #9886 to `stable/1.3`. Relates to #9813.

9896: [Backport stable/1.3] fix(engine): add grace period to detect end of processing r=remcowesterhoud a=backport-action — Backport of #9082 to `stable/1.3`. Relates to #8738.

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: pihme <pihme@users.noreply.github.com>
9895: [Backport stable/8.0] test(qa): wait until message is published before restarting the broker r=deepthidevaki a=backport-action — Backport of #9886 to `stable/8.0`. Relates to #9813.

9897: [Backport stable/8.0] fix(engine): add grace period to detect end of processing r=remcowesterhoud a=backport-action — Backport of #9082 to `stable/8.0`. Relates to #8738. Closes #9641.

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: pihme <pihme@users.noreply.github.com>
Found the following logs from the failed test
Not sure if this explains the failure. Related to #9860
Found this log
where the command is sent to member 1. But at this point the leader is 0, and member 1 (the previous leader) is already shut down. It seems the topology is not updated when this command is sent. I think the topology will eventually be updated, but the retry delay is set to 10 seconds, and the test fails before the retry is attempted.
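To make the timing problem concrete, here is a small sketch with hypothetical names (not the actual sender code): the first attempt goes to the stale leader from the outdated topology, and the retry is only scheduled after a fixed 10-second delay, so any test that awaits instance completion for less than that times out before the retry can succeed.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class StaleTopologyRetrySketch {
  private static final long RETRY_DELAY_SECONDS = 10; // delay mentioned in the discussion above

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void sendToPartitionLeader(final int partitionId, final byte[] command) {
    // The topology view may still point at the previous leader.
    final int leader = lookupLeaderFromTopology(partitionId);
    final boolean delivered = trySend(leader, command);
    if (!delivered) {
      // Retried only after 10s; a test asserting completion before then fails first.
      scheduler.schedule(
          () -> sendToPartitionLeader(partitionId, command),
          RETRY_DELAY_SECONDS,
          TimeUnit.SECONDS);
    }
  }

  private int lookupLeaderFromTopology(final int partitionId) {
    return 1; // stale view: member 1 was the previous leader
  }

  private boolean trySend(final int leader, final byte[] command) {
    return false; // the old leader is already shut down
  }
}
```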
Also, the NPE doesn't seem to affect this test, because when this error happens there are no subscriptions to correlate.
Happened again, see #9935 (comment)
😭
Thanks @deepthidevaki. Sorry to have to open this one again 🙇 But thanks for looking into it.
Here is another example pipeline: https://github.com/camunda/zeebe/runs/7667370072?check_suite_focus=true
It was quite difficult to reproduce, but I reproduced it after several runs: https://github.com/camunda/zeebe/actions/runs/2783045847. From the logs it seems the topology is not updated correctly.
Broker 0 became the leader. The topology listener updates the leader to 0, but then updates it again to broker 1, which was the leader in the previous term. So the current leader is overwritten. No idea how that can happen.
When the partition leader is updated, we now schedule a timer with 0 delay, whereas before we submitted the task directly. The documentation of the scheduling service notes that when several timers expire at the same time, there is no guarantee about the order in which they are executed.
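A hedged sketch of the two scheduling styles being contrasted, using a hypothetical `SchedulingService` interface rather than the actual Zeebe API: direct submission runs the listener tasks in the order they were handed over, while scheduling each invocation as a 0-delay timer gives no ordering guarantee when several timers expire in the same tick, so a stale update can overwrite the newer leader.

```java
import java.time.Duration;

final class ListenerSchedulingSketch {

  // Hypothetical interface standing in for the real scheduling service.
  interface SchedulingService {
    void submit(Runnable task);                      // executed in submission order
    void runDelayed(Duration delay, Runnable task);  // no order guarantee for equal deadlines
  }

  private volatile int knownLeader;

  void onLeaderChangeDirect(final SchedulingService scheduling, final int newLeader) {
    // Old behaviour: direct submission, so updates apply in the order the
    // topology events arrived.
    scheduling.submit(() -> knownLeader = newLeader);
  }

  void onLeaderChangeViaTimer(final SchedulingService scheduling, final int newLeader) {
    // New behaviour: a 0-delay timer. If two leader changes arrive close
    // together, both timers can expire in the same tick and run in either
    // order, letting an older update overwrite the newer leader.
    scheduling.runDelayed(Duration.ZERO, () -> knownLeader = newLeader);
  }
}
```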
Great point @oleschoenburg. I suspected something like that, but the following test passed.
But as the documentation says, if the ordering is not guaranteed, that would explain why it fails non-deterministically.
IMO
So
Would resolve this #9756 (comment) as well.
Then I will do that. Not yet sure whether to use its own actor or, as @oleschoenburg said, to get rid of the actor and use a
imo having the transport be its own actor kind of makes sense. you can pass the job of sending a command to another actor and keep processing 👍 edit: i know the sender is not the transport, but you get the idea 😄
10020: Merge topology listener with InterPartitionCommandSender r=deepthidevaki a=deepthidevaki — Flaky test #9813 was caused by the changes in `ProcessSchedulingService`, which scheduled a task with 0 delay instead of submitting a new task. There is no guarantee in which order tasks are executed when multiple tasks time out at the same time. This broke the assumption in the topology listener and led to listeners executing in a different order than they were invoked. As a result, the new leader was overwritten by a previous invocation of the listener with the old leader. To fix it, this PR:
* Merges the topology listener with InterPartitionCommandSender.
* Gives InterPartitionCommandSender its own actor instead of sharing the StreamProcessorActor. This avoids the need to use `ProcessSchedulingService` and thus the non-deterministic execution of the listener.
* Refactors to allow installing InterPartitionCommandSender during PartitionTransition instead of via the engine factory.

Closes #9813.

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
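A hedged sketch of the shape of that fix, as a hypothetical class skeleton rather than the actual Zeebe implementation: the command sender is itself an actor, so topology updates are applied by submitting jobs to its own queue in arrival order, without going through `ProcessSchedulingService` timers.

```java
import java.util.HashMap;
import java.util.Map;

final class InterPartitionCommandSenderSketch {

  // Hypothetical actor abstraction: jobs run sequentially, in submission order.
  interface Actor {
    void run(Runnable job);
  }

  private final Actor actor;
  private final Map<Integer, Integer> partitionLeaders = new HashMap<>();

  InterPartitionCommandSenderSketch(final Actor actor) {
    this.actor = actor;
  }

  // Topology listener callback: leader state is only mutated from the sender's
  // own actor, so a later leader change can no longer be overwritten by an
  // earlier one executing out of order.
  void onLeaderChange(final int partitionId, final int newLeader) {
    actor.run(() -> partitionLeaders.put(partitionId, newLeader));
  }

  void sendCommand(final int partitionId, final byte[] command) {
    actor.run(() -> {
      final Integer leader = partitionLeaders.get(partitionId);
      if (leader != null) {
        // transport.send(leader, command); // hypothetical transport call
      }
    });
  }
}
```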
Summary
Failures
Example assertion failure
Logs