Error with: Failing installation of 'LogStoragePartitionStep' #9040

Closed
@ChrisKujawa

Description

Describe the bug

We see an error on prod, on broker 6, which says: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.

Based on metrics we can see:

[Screenshot: roles]

that we had a leader (Broker 5), then Broker 6 became leader and Broker 5 stepped down, then Broker 0 became leader and Broker 6 stepped down. Based on that, I would say everything worked as expected?

Error group: https://console.cloud.google.com/errors/detail/CPf-xtb-3czwbw;service=zeebe;time=P7D?project=camunda-cloud-240911

Interestingly, this error happened in the cluster multiple times at the same time:

Occurrences
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.

To Reproduce
I don't know.

Expected behavior

I think it is expected that this can happen, so I would expect us to log a warning instead of an error.

Since we worked on this in #8717, I also expected a different exception?

Log/Stacktrace

https://drive.google.com/drive/folders/18XzGehQ0z2ut4inT-wXiBbBGM-DD3Zpf

Full Stacktrace

io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.checkAndCreateAtomixLogStorage(LogStoragePartitionTransitionStep.java:123) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.lambda$createWritableLogStorage$0(LogStoragePartitionTransitionStep.java:106) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at java.util.Optional.map(Unknown Source) ~[?:?]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.createWritableLogStorage(LogStoragePartitionTransitionStep.java:105) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.buildAtomixLogStorage(LogStoragePartitionTransitionStep.java:83) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.transitionTo(LogStoragePartitionTransitionStep.java:50) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.broker.system.partitions.impl.PartitionTransitionProcess.lambda$proceedWithTransition$1(PartitionTransitionProcess.java:80) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) [zeebe-util-1.3.6.jar:1.3.6]
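
For context, the check behind this exception can be pictured roughly as follows. This is a minimal sketch, not the actual Zeebe source: the names `TermCheck`, `ensureSameTerm`, and `MismatchException` are hypothetical, and reading "current term" as the term the transition was started with is an assumption. Only the message format is taken from the stack trace above.

```java
// Hypothetical, simplified model of the term check; not the real
// LogStoragePartitionTransitionStep implementation.
final class TermCheck {
  private TermCheck() {}

  // Stand-in for LogStorageTermMissmatchException.
  static final class MismatchException extends RuntimeException {
    MismatchException(final String message) {
      super(message);
    }
  }

  // Fails the step when Raft has already moved on to another term
  // than the one this transition was started with (assumption).
  static void ensureSameTerm(final long currentTerm, final long raftTerm, final int partitionId) {
    if (currentTerm != raftTerm) {
      throw new MismatchException(
          String.format(
              "Expected that current term '%d' is same as raft term '%d', but was not. "
                  + "Failing installation of 'LogStoragePartitionStep' on partition %d.",
              currentTerm, raftTerm, partitionId));
    }
  }
}
```

Under this reading, a quick succession of leader changes (as in the metrics above) would make the mismatch an expected, transient outcome rather than a fault.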

Environment:
Camunda SaaS

Activity

Added labels kind/bug (Categorizes an issue or PR as a bug) and severity/low (Marks a bug as having little to no noticeable impact for the user) on Apr 1, 2022
npepinpe (Member) commented on Apr 7, 2022

@deepthidevaki can you double check that the system did not react and that it's just a logging issue? 👍

deepthidevaki (Contributor) commented on Apr 7, 2022

It behaved as expected. ZeebePartition ignored the install failure and continued with the next transition.

2022-04-06 11:49:53.482 CEST [Error] zeebe: "Expected that current term '159' is same as raft term '160', but was not. Failing installation of 'LogStoragePartitionStep' on partition 30."
2022-04-06 11:49:53.483 CEST [Info] zeebe: "Failed to install leader partition 30"
2022-04-06 11:49:53.483 CEST [Info] zeebe: "Aborted installation of partition 30, cause: Expected that current term '159' is same as raft term '160', but was not. Failing installation of 'LogStoragePartitionStep' on partition 30."

However, it was logged as an error in PartitionTransitionProcess: https://github.com/camunda/zeebe/blob/acd6aff1a3d29959235871a1b9c0e4a9216b2b9c/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/PartitionTransitionProcess.java#L87

and also here: https://github.com/camunda/zeebe/blob/660f790e932870cf2c325a8622fea5ca5a4e3a5b/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L253

We can fix this by conditionally logging the errors. Of course, a better solution would be to revisit the transitions and, as part of that, remove the term check.
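
Conditional logging could look roughly like the sketch below. It is only an illustration under stated assumptions: `TermMismatchException` is a hypothetical stand-in for `LogStorageTermMissmatchException`, `TransitionFailureLogging` and its methods are invented names, and `java.util.logging` is used only to keep the example self-contained (the broker itself may use a different logging API).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the recoverable term mismatch exception.
final class TermMismatchException extends RuntimeException {
  TermMismatchException(final String message) {
    super(message);
  }
}

final class TransitionFailureLogging {
  private TransitionFailureLogging() {}

  // A term mismatch is expected to resolve with the next transition,
  // so it is logged as a warning; any other failure stays an error.
  static Level levelFor(final Throwable failure) {
    return failure instanceof TermMismatchException ? Level.WARNING : Level.SEVERE;
  }

  // Logs the failed installation at the level chosen above.
  static void logFailure(final Logger log, final int partitionId, final Throwable failure) {
    log.log(levelFor(failure), "Failed to install partition " + partitionId, failure);
  }
}
```

Both call sites linked above could route through a helper like `logFailure`, so the decision about what counts as recoverable lives in one place.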

npepinpe (Member) commented on Apr 7, 2022

Let's reduce the log level for these recoverable errors to at least WARN. Warnings are meant for issues which may recover by themselves, but which give hints in case the operator notices something is wrong or the warning consistently repeats.

ghost added a commit that references this issue on Apr 13, 2022
d786104
ghost closed this as completed in #9122 on Apr 13, 2022
ghost added a commit that references this issue on Apr 13, 2022
ghost added a commit that references this issue on Apr 13, 2022


Metadata

Assignees

No one assigned

    Labels

    kind/bug: Categorizes an issue or PR as a bug
    severity/low: Marks a bug as having little to no noticeable impact for the user
    version:1.3.8
    version:8.1.0: Marks an issue as being completely or in parts released in 8.1.0
    version:8.1.0-alpha1: Marks an issue as being completely or in parts released in 8.1.0-alpha1

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      Participants

      @npepinpe, @deepthidevaki, @ChrisKujawa

        Error with: Failing installation of 'LogStoragePartitionStep' · Issue #9040 · camunda/camunda