
NPE during replay #8830

Closed
oleschoenburg opened this issue Feb 23, 2022 · 3 comments · Fixed by #8834
Assignees
Labels
kind/bug: Categorizes an issue or PR as a bug
scope/broker: Marks an issue or PR to appear in the broker section of the changelog
severity/critical: Marks a stop-the-world bug, with a high impact and no existing workaround
version:1.3.5: Marks an issue as being completely or in parts released in 1.3.5
version:8.1.0-alpha1: Marks an issue as being completely or in parts released in 8.1.0-alpha1
version:8.1.0: Marks an issue as being completely or in parts released in 8.1.0

Comments

@oleschoenburg
Member

oleschoenburg commented Feb 23, 2022

Describe the bug

java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "io.camunda.zeebe.engine.state.instance.ElementInstance.setState(io.camunda.zeebe.protocol.record.intent.ProcessInstanceIntent)" because "instance" is null
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.lambda$replayNextEvent$4(ReplayStateMachine.java:164) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:33) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:82) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
Caused by: java.lang.NullPointerException: Cannot invoke "io.camunda.zeebe.engine.state.instance.ElementInstance.setState(io.camunda.zeebe.protocol.record.intent.ProcessInstanceIntent)" because "instance" is null
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementCompletingApplier.lambda$applyState$0(ProcessInstanceElementCompletingApplier.java:30) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.state.instance.DbElementInstanceState.updateInstance(DbElementInstanceState.java:166) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementCompletingApplier.applyState(ProcessInstanceElementCompletingApplier.java:28) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementCompletingApplier.applyState(ProcessInstanceElementCompletingApplier.java:16) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.state.appliers.EventAppliers.applyState(EventAppliers.java:239) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.applyCurrentEvent(ReplayStateMachine.java:304) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.replayEvent(ReplayStateMachine.java:224) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at java.util.Iterator.forEachRemaining(Iterator.java:133) ~[?:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.lambda$tryToReplayBatch$5(ReplayStateMachine.java:205) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[zeebe-db-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.tryToReplayBatch(ReplayStateMachine.java:203) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.lambda$replayNextEvent$3(ReplayStateMachine.java:160) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:36) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.retry.RecoverableRetryStrategy.run(RecoverableRetryStrategy.java:47) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	... 5 more

I collected the data folders of all three brokers here: https://drive.google.com/drive/folders/1KZcyeFzxY3wy12ODMSRb-ovqptRuGzvQ

Unfortunately, I was a bit too late and was only able to collect the data after broker 0 partially recovered when it received a new snapshot.

@oleschoenburg
Member Author

oleschoenburg commented Feb 23, 2022

A rough timeline of broker 0, partition 2 where the exception occurred:

At 05:18:42
broker-0 starts, probably following a pod eviction.
It receives metadata changes from brokers 1 and 2; broker 1 is leader for all three partitions.
broker-0 finds log segments 6-13 for partition 2 and starts the transition to follower.

At 05:18:54
broker-0 receives a snapshot for partition 2: 1448060811-68-4352806561-4352805746
The snapshot is committed and the existing log is being deleted (lastIndex 1448043632)
"Setting firstCommitIndex to 1448176559. RaftServer is ready only after it has committed events upto this index"

At 05:21:00
Successfully started server for partition PartitionId{id=2, group=raft-partition} in 131135ms

At 05:21:18
broker-0 receives a snapshot for partition 2: 1448288435-68-4353483090-4353482796-2.
Transition to follower requested.
Partition role transitioning from null to FOLLOWER in term 68
Recovering state from available snapshot 1448060811-68-4352806561-4352805746
Committed new snapshot 1448288435-68-4353483090-4353482796
Deleting previous snapshot 1448060811-68-4352806561-4352805746
Scheduling log compaction up to index 1448288435
Transition to FOLLOWER on term 68 requested.
Received cancel signal for transition to FOLLOWER on term 68
Recovering state of partition 2 from snapshot
Recovered state of partition 2 from snapshot at position 4352806640
Processor starts replay of events. snapshot-position: 4352806640, replay-mode: REPLAY
Detected 'UNHEALTHY' components. The current health status of components: ZeebePartition-2{status=UNHEALTHY, issue='Initial state'}, raft-partition-partition-2{status=HEALTHY}, Broker-0-LogStream-2{status=HEALTHY}, Broker-0-StreamProcessor-2{status=HEALTHY}
Paused replay for partition 2
Cancelling transition to FOLLOWER on term 68
Resumed replay for partition 2
Detected 'UNHEALTHY' components. The current health status of components: [ZeebePartition-1{status=UNHEALTHY, issue='Initial state'}, raft-partition-partition-1{status=HEALTHY}, Broker-0-StreamProcessor-1{status=HEALTHY}, Broker-0-LogStream-1{status=HEALTHY}]
ZeebePartition-2 recovered, marking it as healthy
Prepare transition from FOLLOWER on term 68 to FOLLOWER
Discard job io.camunda.zeebe.engine.processing.streamprocessor.StreamProcessor$$Lambda$1829/0x000000080147fc40 QUEUED from fastLane of Actor Broker-0-StreamProcessor-2.
Closed stream processor controller Broker-0-StreamProcessor-2.
On closing logstream logstream-raft-partition-partition-2 close 1 readers
Transition to FOLLOWER on term 68 starting
Transition to FOLLOWER on term 68 - transitioning StreamProcessor
Recovering state of partition 2 from snapshot
Recovered state of partition 2 from snapshot at position 4352806640
Recovered exporter 'Broker-0-Exporter-2' from snapshot at lastExportedPosition 4352805746
Actor Broker-0-StreamProcessor-2 failed in phase STARTED.

Detected 'UNHEALTHY' components. The current health status of components: [Partition-2{status=UNHEALTHY, issue=Broker-0-StreamProcessor-2{status=UNHEALTHY, issue=java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "io.camunda.zeebe.engine.state.instance.ElementInstance.setState(io.camunda.zeebe.protocol.record.intent.ProcessInstanceIntent)" because "instance" is null}}, Broker-0-ZeebePartition-1{status=HEALTHY}, Partition-3{status=UNHEALTHY, issue='Components are not yet initialized'}]

At 09:56:48
"Started receiving new snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/pending/1461329989-68-4392047538-4392046939-3, snapshotStore=io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore@388c3821, metadata=FileBasedSnapshotMetadata{index=1461329989, term=68, processedPosition=4392047538, exporterPosition=4392046939}} from 1"

@oleschoenburg
Member Author

@romansmirnov and I investigated this further and we think that we have found the cause.

  1. Why did we see this exception?
    The StreamProcessor tried to complete a process instance that didn't exist in the current state, because the ZeebeDb was based on an old, now deleted, snapshot.
  2. Why was the ZeebeDb based on an old snapshot?
    ZeebeDbPartitionTransitionStep recovers the ZeebeDb only when transitioning between certain roles:
    https://github.com/camunda-cloud/zeebe/blob/e5787dfa03e5b6a7dc7483c7c10844936d872027/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/steps/ZeebeDbPartitionTransitionStep.java#L40-L58
    Usually, when a snapshot is committed, we should transition from FOLLOWER to INACTIVE and then from INACTIVE back to FOLLOWER. Based on the logs, this is not what happened: after receiving the snapshot, a transition from FOLLOWER to FOLLOWER was started, which does not match the conditions for recovering the ZeebeDb from a snapshot. Instead, the broker kept using the old ZeebeDb without recovering from the new snapshot (see the sketch after this list).
  3. Why did we transition from FOLLOWER to FOLLOWER?
    There was a time window in which the ZeebePartition actor was already started but its SnapshotReplicationListener was not yet registered:
    https://github.com/camunda-cloud/zeebe/blob/5c3090f9eafd328cef16c899cdf7cdfa4debda8f/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L107-L128
    During that window a snapshot was received, so the SnapshotReplicationListener that is responsible for transitioning to INACTIVE was never triggered. The timing issue is also visible in the logs, where the snapshot is received 2 ms before the startup process is completely finished.
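
To make point 2 above concrete, here is a minimal, hypothetical sketch (not the actual Zeebe implementation; the `Role` enum and `shouldRecoverFromSnapshot` method are illustrative names) of a role-based recovery condition: the state database is only rebuilt from the latest snapshot when the partition enters an active role without an already open database, so a FOLLOWER to FOLLOWER transition keeps the stale database.

```java
// Hypothetical, simplified sketch of a role-based recovery condition as
// described in point 2 above. Names are illustrative, not the Zeebe API.
public final class DbRecoveryConditionSketch {

  enum Role {
    LEADER,
    FOLLOWER,
    CANDIDATE,
    INACTIVE
  }

  /**
   * Decides whether the state database should be rebuilt from the latest
   * snapshot for a transition from {@code current} to {@code target}.
   */
  static boolean shouldRecoverFromSnapshot(final Role current, final Role target) {
    // The database is only (re-)opened when the partition enters an active
    // role without an already open database, i.e. coming from INACTIVE or
    // from no role at all.
    final boolean noOpenDb = current == null || current == Role.INACTIVE;
    final boolean targetIsActive = target == Role.FOLLOWER || target == Role.LEADER;
    return noOpenDb && targetIsActive;
  }

  public static void main(final String[] args) {
    // Usual path after a committed snapshot: FOLLOWER -> INACTIVE -> FOLLOWER,
    // so the second leg rebuilds the database from the new snapshot.
    System.out.println(shouldRecoverFromSnapshot(Role.INACTIVE, Role.FOLLOWER)); // true

    // The case in this issue: FOLLOWER -> FOLLOWER, so the old database,
    // based on the already deleted snapshot, stays in use.
    System.out.println(shouldRecoverFromSnapshot(Role.FOLLOWER, Role.FOLLOWER)); // false
  }
}
```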

@oleschoenburg added the Impact: Data, scope/broker, severity/high, and kind/bug labels Feb 23, 2022
@Zelldon
Member

Zelldon commented Feb 23, 2022

Thanks for the awesome report @oleschoenburg 🙇

@npepinpe added the severity/critical label and removed the severity/high label Feb 24, 2022
@npepinpe added this to In progress in Zeebe Feb 24, 2022
@npepinpe self-assigned this Feb 24, 2022
@ghost closed this as completed in 7c20b26 Feb 28, 2022
Zeebe automation moved this from In progress to Done Feb 28, 2022
ghost pushed a commit that referenced this issue Feb 28, 2022
8855: [Backport release-1.4.0-alpha2] fix: notify new `SnapshotReplicationListener`s about missed replications r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8834 to `release-1.4.0-alpha2`.

relates to #8830

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Feb 28, 2022
8854: [Backport stable/1.3] fix: notify new `SnapshotReplicationListener`s about missed replications r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8834 to `stable/1.3`.

relates to #8830

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Feb 28, 2022
8853: [Backport stable/1.2] fix: notify new `SnapshotReplicationListener`s about missed replications r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8834 to `stable/1.2`.

relates to #8830

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@Zelldon added the Release: 1.2.11 and version:1.3.5 labels Mar 1, 2022
@KerstinHebel removed this from Done in Zeebe Mar 23, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Mar 28, 2022
8994: Fix notification of snapshot replication listeners about missed events r=oleschoenburg a=oleschoenburg

## Description
Whenever a raft partition transitions to a new role, we must reset `missedSnapshotReplicationEvents` so that registering new snapshot replication listeners does not trigger the listener for snapshot replication events that occurred for a different role.

This guards against a known case where raft received a snapshot and transitioned to leader before the role change happened and before snapshot replication listeners were registered. Once the registration of the snapshot replication listener went through, the listener was informed about the missed replication and caused the Zeebe partition to transition to follower while the raft partition was leader.

Additionally, because we keep track of missed snapshot replication events to notify listeners on registration, the partition is not necessarily in the follower role at that point. This breaks the listener's assumption, so we add a condition to only trigger listeners on registration if the raft partition is still a follower (see the sketch after this commit message).

## Related issues

relates to #8830
closes #8978


Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
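
As a rough illustration of the fix described above, here is a minimal sketch with simplified names (not the actual Zeebe implementation): missed snapshot replications are remembered for later listeners, that memory is reset on every role transition, and it is only replayed to newly registered listeners while the raft partition is still a follower.

```java
// Hypothetical, simplified sketch of the fix described in the PR above.
// Class, field, and method names are illustrative, not the Zeebe API.
import java.util.HashSet;
import java.util.Set;

final class SnapshotReplicationNotifierSketch {

  enum Role { LEADER, FOLLOWER, INACTIVE }

  interface SnapshotReplicationListener {
    void onSnapshotReplicated();
  }

  private final Set<SnapshotReplicationListener> listeners = new HashSet<>();
  private Role role = Role.FOLLOWER;
  private boolean missedSnapshotReplication = false;

  /** Called when a snapshot was received and committed. */
  void onSnapshotReplicated() {
    if (listeners.isEmpty()) {
      // No listener registered yet (e.g. during startup): remember the event.
      missedSnapshotReplication = true;
    } else {
      listeners.forEach(SnapshotReplicationListener::onSnapshotReplicated);
    }
  }

  /** Called on every raft role transition. */
  void transitionTo(final Role newRole) {
    role = newRole;
    // Part 1 of the fix: events missed under a previous role must not leak
    // into the new role.
    missedSnapshotReplication = false;
  }

  /** Registers a listener and replays a missed replication if appropriate. */
  void register(final SnapshotReplicationListener listener) {
    listeners.add(listener);
    // Part 2 of the fix: only notify about missed replications while the
    // partition is still a follower.
    if (missedSnapshotReplication && role == Role.FOLLOWER) {
      missedSnapshotReplication = false;
      listener.onSnapshotReplicated();
    }
  }
}
```
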
@deepthidevaki added the version:8.1.0-alpha1 label May 3, 2022
@Zelldon added the version:8.1.0 label Oct 4, 2022
This issue was closed.