
NPE during replay #8830

Closed
oleschoenburg opened this issue Feb 23, 2022 · 3 comments · Fixed by #8834
Assignees
Labels
kind/bug: Categorizes an issue or PR as a bug
scope/broker: Marks an issue or PR to appear in the broker section of the changelog
severity/critical: Marks a stop-the-world bug, with a high impact and no existing workaround
version:1.3.5: Marks an issue as being completely or in parts released in 1.3.5
version:8.1.0-alpha1: Marks an issue as being completely or in parts released in 8.1.0-alpha1
version:8.1.0: Marks an issue as being completely or in parts released in 8.1.0

Comments

@oleschoenburg
Member

oleschoenburg commented Feb 23, 2022

Describe the bug

java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "io.camunda.zeebe.engine.state.instance.ElementInstance.setState(io.camunda.zeebe.protocol.record.intent.ProcessInstanceIntent)" because "instance" is null
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.lambda$replayNextEvent$4(ReplayStateMachine.java:164) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:33) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:82) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) [zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
Caused by: java.lang.NullPointerException: Cannot invoke "io.camunda.zeebe.engine.state.instance.ElementInstance.setState(io.camunda.zeebe.protocol.record.intent.ProcessInstanceIntent)" because "instance" is null
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementCompletingApplier.lambda$applyState$0(ProcessInstanceElementCompletingApplier.java:30) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.state.instance.DbElementInstanceState.updateInstance(DbElementInstanceState.java:166) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementCompletingApplier.applyState(ProcessInstanceElementCompletingApplier.java:28) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.state.appliers.ProcessInstanceElementCompletingApplier.applyState(ProcessInstanceElementCompletingApplier.java:16) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.state.appliers.EventAppliers.applyState(EventAppliers.java:239) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.applyCurrentEvent(ReplayStateMachine.java:304) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.replayEvent(ReplayStateMachine.java:224) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at java.util.Iterator.forEachRemaining(Iterator.java:133) ~[?:?]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.lambda$tryToReplayBatch$5(ReplayStateMachine.java:205) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:84) ~[zeebe-db-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.tryToReplayBatch(ReplayStateMachine.java:203) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ReplayStateMachine.lambda$replayNextEvent$3(ReplayStateMachine.java:160) ~[zeebe-workflow-engine-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:36) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.retry.RecoverableRetryStrategy.run(RecoverableRetryStrategy.java:47) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) ~[zeebe-util-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
	... 5 more

I collected the data folders of all three brokers here: https://drive.google.com/drive/folders/1KZcyeFzxY3wy12ODMSRb-ovqptRuGzvQ

Unfortunately, I was a bit too late and was only able to collect the data after broker 0 partially recovered when it received a new snapshot.

@oleschoenburg
Member Author

oleschoenburg commented Feb 23, 2022

A rough timeline of broker 0, partition 2 where the exception occurred:

At 05:18:42
broker-0 starts, probably following a pod eviction.
It receives metadata changes from brokers 1 and 2; broker 1 is leader for all three partitions.
broker-0 finds log segments 6-13 for partition 2 and starts the transition to follower.

At 05:18:54
broker-0 receives a snapshot for partition 2: 1448060811-68-4352806561-4352805746
The snapshot is committed and the existing log is being deleted (lastIndex 1448043632)
"Setting firstCommitIndex to 1448176559. RaftServer is ready only after it has committed events upto this index"

At 05:21:00
Successfully started server for partition PartitionId{id=2, group=raft-partition} in 131135ms

At 05:21:18
broker-0 receives a snapshot for partition 2: 1448288435-68-4353483090-4353482796-2.
Transition to follower requested.
Partition role transitioning from null to FOLLOWER in term 68
Recovering state from available snapshot 1448060811-68-4352806561-4352805746
Committed new snapshot 1448288435-68-4353483090-4353482796
Deleting previous snapshot 1448060811-68-4352806561-4352805746
Scheduling log compaction up to index 1448288435
Transition to FOLLOWER on term 68 requested.
Received cancel signal for transition to FOLLOWER on term 68
Recovering state of partition 2 from snapshot
Recovered state of partition 2 from snapshot at position 4352806640
Processor starts replay of events. snapshot-position: 4352806640, replay-mode: REPLAY
Detected 'UNHEALTHY' components. The current health status of components: ZeebePartition-2{status=UNHEALTHY, issue='Initial state'}, raft-partition-partition-2{status=HEALTHY}, Broker-0-LogStream-2{status=HEALTHY}, Broker-0-StreamProcessor-2{status=HEALTHY}
Paused replay for partition 2
Cancelling transition to FOLLOWER on term 68
Resumed replay for partition 2
Detected 'UNHEALTHY' components. The current health status of components: [ZeebePartition-1{status=UNHEALTHY, issue='Initial state'}, raft-partition-partition-1{status=HEALTHY}, Broker-0-StreamProcessor-1{status=HEALTHY}, Broker-0-LogStream-1{status=HEALTHY}]
ZeebePartition-2 recovered, marking it as healthy
Prepare transition from FOLLOWER on term 68 to FOLLOWER
Discard job io.camunda.zeebe.engine.processing.streamprocessor.StreamProcessor$$Lambda$1829/0x000000080147fc40 QUEUED from fastLane of Actor Broker-0-StreamProcessor-2.
Closed stream processor controller Broker-0-StreamProcessor-2.
On closing logstream logstream-raft-partition-partition-2 close 1 readers
Transition to FOLLOWER on term 68 starting
Transition to FOLLOWER on term 68 - transitioning StreamProcessor
Recovering state of partition 2 from snapshot
Recovered state of partition 2 from snapshot at position 4352806640
Recovered exporter 'Broker-0-Exporter-2' from snapshot at lastExportedPosition 4352805746
Actor Broker-0-StreamProcessor-2 failed in phase STARTED.

Detected 'UNHEALTHY' components. The current health status of components: [Partition-2{status=UNHEALTHY, issue=Broker-0-StreamProcessor-2{status=UNHEALTHY, issue=java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "io.camunda.zeebe.engine.state.instance.ElementInstance.setState(io.camunda.zeebe.protocol.record.intent.ProcessInstanceIntent)" because "instance" is null}}, Broker-0-ZeebePartition-1{status=HEALTHY}, Partition-3{status=UNHEALTHY, issue='Components are not yet initialized'}]

At 09:56:48
"Started receiving new snapshot FileBasedReceivedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/pending/1461329989-68-4392047538-4392046939-3, snapshotStore=io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore@388c3821, metadata=FileBasedSnapshotMetadata{index=1461329989, term=68, processedPosition=4392047538, exporterPosition=4392046939}} from 1"

@oleschoenburg
Member Author

@romansmirnov and I investigated this further and we think that we have found the cause.

  1. Why did we see this exception?
    The StreamProcessor tried to complete a process instance that didn't exist in the current state, because the ZeebeDb was based on an old, now deleted, snapshot.
  2. Why was the ZeebeDb based on an old snapshot?
    ZeebeDbPartitionTransitionStep recovers the ZeebeDb only when transitioning between certain roles:
    https://github.com/camunda-cloud/zeebe/blob/e5787dfa03e5b6a7dc7483c7c10844936d872027/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/steps/ZeebeDbPartitionTransitionStep.java#L40-L58
    Usually, when a snapshot is committed, we should transition from FOLLOWER to INACTIVE and then from INACTIVE back to FOLLOWER. Based on the logs, this is not what happened: after receiving the snapshot, a transition from FOLLOWER to FOLLOWER was started, which does not match the conditions for recovering the ZeebeDb from a snapshot. Instead, the broker kept using the old ZeebeDb without recovering from the new snapshot (see the sketch after this list).
  3. Why did we transition from FOLLOWER to FOLLOWER?
    There was a time window in which the ZeebePartition actor was already started but its SnapshotReplicationListener was not yet registered:
    https://github.com/camunda-cloud/zeebe/blob/5c3090f9eafd328cef16c899cdf7cdfa4debda8f/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L107-L128
    During that window a snapshot was received, so the SnapshotReplicationListener that is responsible for transitioning to INACTIVE was never triggered. The timing issue is also visible in the logs, where the snapshot is received 2 ms before the startup process is completely finished.
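
To make point 2 above concrete, here is a minimal, hypothetical sketch (not the actual Zeebe implementation; the `Role` enum and `shouldRecoverFromSnapshot` method are illustrative names) of a role-based recovery condition: the state database is only rebuilt from the latest snapshot when the partition enters an active role without an already open database, so a FOLLOWER to FOLLOWER transition keeps the stale database.

```java
// Hypothetical, simplified sketch of a role-based recovery condition as
// described in point 2 above. Names are illustrative, not the Zeebe API.
public final class DbRecoveryConditionSketch {

  enum Role {
    LEADER,
    FOLLOWER,
    CANDIDATE,
    INACTIVE
  }

  /**
   * Decides whether the state database should be rebuilt from the latest
   * snapshot for a transition from {@code current} to {@code target}.
   */
  static boolean shouldRecoverFromSnapshot(final Role current, final Role target) {
    // The database is only (re-)opened when the partition enters an active
    // role without an already open database, i.e. coming from INACTIVE or
    // from no role at all.
    final boolean noOpenDb = current == null || current == Role.INACTIVE;
    final boolean targetIsActive = target == Role.FOLLOWER || target == Role.LEADER;
    return noOpenDb && targetIsActive;
  }

  public static void main(final String[] args) {
    // Usual path after a committed snapshot: FOLLOWER -> INACTIVE -> FOLLOWER,
    // so the second leg rebuilds the database from the new snapshot.
    System.out.println(shouldRecoverFromSnapshot(Role.INACTIVE, Role.FOLLOWER)); // true

    // The case in this issue: FOLLOWER -> FOLLOWER, so the old database,
    // based on the already deleted snapshot, stays in use.
    System.out.println(shouldRecoverFromSnapshot(Role.FOLLOWER, Role.FOLLOWER)); // false
  }
}
```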

@oleschoenburg added the Impact: Data, scope/broker, severity/high, and kind/bug labels Feb 23, 2022
@Zelldon
Member

Zelldon commented Feb 23, 2022

Thanks for the awesome report @oleschoenburg 🙇

@npepinpe added the severity/critical label and removed the severity/high label Feb 24, 2022
@npepinpe added this to In progress in Zeebe Feb 24, 2022
@npepinpe self-assigned this Feb 24, 2022
@ghost closed this as completed in 7c20b26 Feb 28, 2022
Zeebe automation moved this from In progress to Done Feb 28, 2022
ghost pushed a commit that referenced this issue Feb 28, 2022
8855: [Backport release-1.4.0-alpha2] fix: notify new `SnapshotReplicationListener`s about missed replications r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8834 to `release-1.4.0-alpha2`.

relates to #8830

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Feb 28, 2022
8854: [Backport stable/1.3] fix: notify new `SnapshotReplicationListener`s about missed replications r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8834 to `stable/1.3`.

relates to #8830

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Feb 28, 2022
8853: [Backport stable/1.2] fix: notify new `SnapshotReplicationListener`s about missed replications r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8834 to `stable/1.2`.

relates to #8830

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@Zelldon added the Release: 1.2.11 and version:1.3.5 labels Mar 1, 2022
@KerstinHebel removed this from Done in Zeebe Mar 23, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Mar 28, 2022
8994: Fix notification of snapshot replication listeners about missed events r=oleschoenburg a=oleschoenburg

## Description
Whenever a raft partition transitions to a new role, we must reset `missedSnapshotReplicationEvents` so that registering new snapshot replication listeners does not trigger the listener for snapshot replication events that occurred for a different role.

This guards against a known case where raft received a snapshot and transitioned to leader before the role change happened and before snapshot replication listeners were registered. Once the registration of the snapshot replication listener went through, the listener was informed about the missed replication and caused the Zeebe partition to transition to follower while the raft partition was leader.

Additionally, because we keep track of missed snapshot replication events to notify listeners on registration, the partition is not necessarily in the follower role at that point. This breaks the listener's assumption, so we add a condition to only trigger listeners on registration if the raft partition is still a follower (see the sketch after this commit message).

## Related issues

relates to #8830
closes #8978


Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
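
As a rough illustration of the fix described above, here is a minimal sketch with simplified names (not the actual Zeebe implementation): missed snapshot replications are remembered for later listeners, that memory is reset on every role transition, and it is only replayed to newly registered listeners while the raft partition is still a follower.

```java
// Hypothetical, simplified sketch of the fix described in the PR above.
// Class, field, and method names are illustrative, not the Zeebe API.
import java.util.HashSet;
import java.util.Set;

final class SnapshotReplicationNotifierSketch {

  enum Role { LEADER, FOLLOWER, INACTIVE }

  interface SnapshotReplicationListener {
    void onSnapshotReplicated();
  }

  private final Set<SnapshotReplicationListener> listeners = new HashSet<>();
  private Role role = Role.FOLLOWER;
  private boolean missedSnapshotReplication = false;

  /** Called when a snapshot was received and committed. */
  void onSnapshotReplicated() {
    if (listeners.isEmpty()) {
      // No listener registered yet (e.g. during startup): remember the event.
      missedSnapshotReplication = true;
    } else {
      listeners.forEach(SnapshotReplicationListener::onSnapshotReplicated);
    }
  }

  /** Called on every raft role transition. */
  void transitionTo(final Role newRole) {
    role = newRole;
    // Part 1 of the fix: events missed under a previous role must not leak
    // into the new role.
    missedSnapshotReplication = false;
  }

  /** Registers a listener and replays a missed replication if appropriate. */
  void register(final SnapshotReplicationListener listener) {
    listeners.add(listener);
    // Part 2 of the fix: only notify about missed replications while the
    // partition is still a follower.
    if (missedSnapshotReplication && role == Role.FOLLOWER) {
      missedSnapshotReplication = false;
      listener.onSnapshotReplicated();
    }
  }
}
```
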
@deepthidevaki added the version:8.1.0-alpha1 label May 3, 2022
@Zelldon added the version:8.1.0 label Oct 4, 2022
This issue was closed.