Description
Describe the bug
We see an error on prod, on Broker 6, which says: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
Based on the metrics, we can see the following sequence: we have a leader (Broker 5), then Broker 6 becomes leader and Broker 5 steps down, then Broker 0 becomes leader and Broker 6 steps down. Based on that, I would say everything worked as expected?
Error group: https://console.cloud.google.com/errors/detail/CPf-xtb-3czwbw;service=zeebe;time=P7D?project=camunda-cloud-240911
Interestingly, this error occurred multiple times in the cluster at the same time:
Occurrences
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
To Reproduce
I don't know.
Expected behavior
I think it is expected that this can happen; in that case, I would expect us to log a warning instead of an error.
Since we worked on this in #8717, I also expected a different exception.
Log/Stacktrace
https://drive.google.com/drive/folders/18XzGehQ0z2ut4inT-wXiBbBGM-DD3Zpf
Full Stacktrace
io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.checkAndCreateAtomixLogStorage(LogStoragePartitionTransitionStep.java:123) ~[zeebe-broker-1.3.6.jar:1.3.6]
at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.lambda$createWritableLogStorage$0(LogStoragePartitionTransitionStep.java:106) ~[zeebe-broker-1.3.6.jar:1.3.6]
at java.util.Optional.map(Unknown Source) ~[?:?]
at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.createWritableLogStorage(LogStoragePartitionTransitionStep.java:105) ~[zeebe-broker-1.3.6.jar:1.3.6]
at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.buildAtomixLogStorage(LogStoragePartitionTransitionStep.java:83) ~[zeebe-broker-1.3.6.jar:1.3.6]
at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.transitionTo(LogStoragePartitionTransitionStep.java:50) ~[zeebe-broker-1.3.6.jar:1.3.6]
at io.camunda.zeebe.broker.system.partitions.impl.PartitionTransitionProcess.lambda$proceedWithTransition$1(PartitionTransitionProcess.java:80) ~[zeebe-broker-1.3.6.jar:1.3.6]
at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) [zeebe-util-1.3.6.jar:1.3.6]
at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) [zeebe-util-1.3.6.jar:1.3.6]
at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.3.6.jar:1.3.6]
at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) [zeebe-util-1.3.6.jar:1.3.6]
at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) [zeebe-util-1.3.6.jar:1.3.6]
at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) [zeebe-util-1.3.6.jar:1.3.6]
Environment:
Camunda SaaS
- Zeebe Version: 1.3.6
- Configuration: https://console.cloud.camunda.io/admin/cluster-plan/plans/648cda52-8a31-4595-bc93-24abaa294608
Activity
npepinpe commented on Apr 7, 2022
@deepthidevaki can you double check that the system did not react and that it's just a logging issue? 👍
deepthidevaki commented on Apr 7, 2022
It behaved as expected. ZeebePartition ignored the install failure and continued with the next transition. However, it was logged as an error in PartitionTransitionProcess (https://github.com/camunda/zeebe/blob/acd6aff1a3d29959235871a1b9c0e4a9216b2b9c/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/PartitionTransitionProcess.java#L87) and also in ZeebePartition (https://github.com/camunda/zeebe/blob/660f790e932870cf2c325a8622fea5ca5a4e3a5b/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L253).
We can fix this by conditionally logging the errors. Of course, a better solution would be to revisit the transitions and, as part of that, remove the term check.
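For illustration, conditional logging could look roughly like the sketch below. It is not the actual Zeebe code: the class `TransitionFailureLogging`, the stand-in `TermMismatchException`, and the helper `levelFor` are hypothetical names, and `java.util.logging` is used only to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class TransitionFailureLogging {

  // Hypothetical stand-in for the recoverable term-mismatch failure seen in the stack trace.
  static class TermMismatchException extends RuntimeException {
    TermMismatchException(String message) {
      super(message);
    }
  }

  private static final Logger LOG = Logger.getLogger("partition-transition");

  // Recoverable failures are downgraded to a warning; anything else stays an error.
  static Level levelFor(Throwable error) {
    return error instanceof TermMismatchException ? Level.WARNING : Level.SEVERE;
  }

  // Logs the failure at the level chosen for its type, including the stack trace.
  static void logTransitionFailure(int partitionId, Throwable error) {
    LOG.log(levelFor(error), "Transition failed on partition " + partitionId, error);
  }

  public static void main(String[] args) {
    logTransitionFailure(
        20,
        new TermMismatchException(
            "Expected that current term '150' is same as raft term '151', but was not."));
    logTransitionFailure(20, new IllegalStateException("unexpected failure"));
  }
}
```

Keeping the level decision in one small function means both places that currently log the failure could share the same rule.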
npepinpe commented on Apr 7, 2022
Let's reduce the log level for these recoverable errors to at least WARN. Warnings are meant for issues which may recover by themselves, but which give hints if the operator notices something is wrong or if the warning consistently repeats.
merge: #9122
merge: #9124
merge: #9125 #9133