
Failed to take snapshot in leader because index entry is not found #9761

Closed
deepthidevaki opened this issue Jul 12, 2022 · 1 comment · Fixed by #10611
Assignees
Labels
kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround version:8.1.1 Marks an issue as being completely or in parts released in 8.1.1 version:8.2.0-alpha1 Marks an issue as being completely or in parts released in 8.2.0-alpha1 version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0

Comments

deepthidevaki (Contributor) commented Jul 12, 2022

Describe the bug

2022-07-12 01:28:37.329 CEST
Failed to take a snapshot for Broker-1-StreamProcessor-2
"io.camunda.zeebe.broker.system.partitions.NoEntryAtSnapshotPosition: Failed to take snapshot. Expected to find an indexed entry for determined snapshot position 1494387075 (processedPosition = 1494736551, exportedPosition=1494387075), but found no matching indexed entry which contains this position.

This occurred in the leader.

Related to #7911. In that issue, however, it was followers that could not take a snapshot, which is a valid scenario. Leaders should always be able to take a snapshot.

The impact of this bug is low (see the following section). There is no data inconsistency or loss; the snapshot is taken later, once new events are exported. However, if exporting does not resume, no new snapshot is taken, so reprocessing time after a restart will be high.

To Reproduce

The following events happened in the broker, leading to the above error.

1. Broker receives a new snapshot and replaces the existing log.
 2022-07-12 01:26:19.338 CEST
RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/494998838-19-1494391974-1494387075, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/494998838-19-1494391974-1494387075.checksum, checksum=1722253646, metadata=FileBasedSnapshotMetadata{index=494998838, term=19, processedPosition=1494391974, exporterPosition=1494387075}}
2022-07-12 01:26:19.339 CEST
RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Delete existing log (lastIndex '494887084') and replace with received snapshot (index '494998838')
2022-07-12 01:26:19.773 CEST
Transition to FOLLOWER on term 19 requested.
...
2022-07-12 01:26:19.922 CEST
Transition to FOLLOWER on term 19 completed
2. Broker later becomes the leader.
2022-07-12 01:26:37.184 CEST
Transition to LEADER on term 20 starting

2022-07-12 01:26:37.217 CEST
Recovered exporter 'Broker-1-Exporter-2' from snapshot at lastExportedPosition 1494387075
3. No new records are exported.
2022-07-12 01:27:38.251 CEST
Unexpected exception occurred on periodically flushing bulk, will retry later.
2022-07-12 01:28:08.589 CEST
Error on exporting record with key 4503600188691487
4. The broker tries to take a snapshot at the same exportedPosition but with a different processedPosition, and fails.
2022-07-12 01:28:37.329 CEST
Failed to take a snapshot for Broker-1-StreamProcessor-2
"io.camunda.zeebe.broker.system.partitions.NoEntryAtSnapshotPosition: Failed to take snapshot. Expected to find an indexed entry for determined snapshot position 1494387075 (processedPosition = 1494736551, exportedPosition=1494387075), but found no matching indexed entry which contains this position.

Hypothesis

It fails to find the entry because it tries to read the "previous" entry, i.e. the entry at index 494998838. Since there is already a snapshot at that index, the Raft leader only replicated events from index 494998839 onwards. The record with position 1494387075 is at index 494998839. So there is no inconsistency in the log, but the snapshot cannot be taken because the previous index has already been compacted.
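The failing lookup can be modeled with a minimal sketch (hypothetical names; the real logic lives in StateControllerImpl and the Raft log reader): after compaction, the first readable index is snapshotIndex + 1, so a position that falls in the very first entry after the snapshot has no readable "previous" entry even though the log itself is intact.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of the failing lookup; names do not match Zeebe's internals.
public class SnapshotLookup {
  // Map from Raft index -> highest record position contained in that entry.
  private final NavigableMap<Long, Long> log = new TreeMap<>();
  private final long firstIndex; // log was compacted up to the snapshot index

  SnapshotLookup(long snapshotIndex) {
    this.firstIndex = snapshotIndex + 1;
  }

  void append(long index, long position) {
    if (index >= firstIndex) {
      log.put(index, position);
    }
  }

  // Returns the index of the entry *preceding* the one that contains
  // the given position, or -1 if that entry was already compacted.
  long findPreviousEntry(long position) {
    for (var e : log.entrySet()) {
      if (e.getValue() >= position) {
        final long previous = e.getKey() - 1;
        return log.containsKey(previous) ? previous : -1; // -1: compacted away
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    // Snapshot at index 494998838; log starts at 494998839.
    final var lookup = new SnapshotLookup(494998838L);
    lookup.append(494998839L, 1494387075L); // the exported position lives here
    lookup.append(494998840L, 1494736551L);

    // Position 1494387075 is in the *first* entry after the snapshot,
    // so the "previous entry" lookup fails even though the log is intact.
    System.out.println(lookup.findPreviousEntry(1494387075L)); // -1
    System.out.println(lookup.findPreviousEntry(1494736551L)); // 494998839
  }
}
```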

Expected behavior

The leader should be able to take a snapshot when the exporter position has not changed but new events have been processed.

Log/Stacktrace

Logs

Full Stacktrace

"io.camunda.zeebe.broker.system.partitions.NoEntryAtSnapshotPosition: Failed to take snapshot. Expected to find an indexed entry for determined snapshot position 1494387075 (processedPosition = 1494736551, exportedPosition=1494387075), but found no matching indexed entry which contains this position.
	at io.camunda.zeebe.broker.system.partitions.impl.StateControllerImpl.takeTransientSnapshotInternal(StateControllerImpl.java:160) ~[zeebe-broker-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.broker.system.partitions.impl.StateControllerImpl.lambda$takeTransientSnapshot$0(StateControllerImpl.java:69) ~[zeebe-broker-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:79) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:44) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:116) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:97) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:81) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:189) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]

@deepthidevaki deepthidevaki added kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround labels Jul 12, 2022
@oleschoenburg oleschoenburg self-assigned this Oct 5, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Oct 6, 2022
10611: fix: take snapshot if nothing was exported since last snapshot r=oleschoenburg a=oleschoenburg

When figuring out where to take the next snapshot, we determine the snapshot position as the minimum of processing and exporter positions.

There was an edge case where a leader could process but not export. In that case it would use the exporter position as the snapshot position and try to find a log entry at that position.
If the log starts with the exporter position, for example because the same broker previously received a snapshot and compacted the log, no entry could be found which led to a failed snapshot.

We now try and use the latest snapshot's term and index if the log entry could not be found. This ensures that new snapshots can be taken even if nothing was exported since the last snapshot.

closes #9761 

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
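The fix described in the commit message above can be sketched as follows. This is a simplified illustration with hypothetical method names, not Zeebe's actual API: the snapshot position stays the minimum of the processed and exported positions, and when no indexed entry is found at that position, the resolver falls back to the last snapshot's index instead of failing.

```java
import java.util.OptionalLong;

// Hypothetical sketch of the approach in #10611; not Zeebe's actual API.
public class SnapshotPositionResolver {
  // The snapshot position is the minimum of processed and exported positions.
  static long determineSnapshotPosition(long processedPosition, long exportedPosition) {
    return Math.min(processedPosition, exportedPosition);
  }

  // Before the fix: a missing indexed entry aborted the snapshot.
  // After the fix: fall back to the last snapshot's index (and term), so a
  // snapshot can still be taken when nothing was exported since the last one.
  static long resolveSnapshotIndex(OptionalLong indexedEntryAtPosition, long lastSnapshotIndex) {
    return indexedEntryAtPosition.orElse(lastSnapshotIndex);
  }

  public static void main(String[] args) {
    final long position = determineSnapshotPosition(1494736551L, 1494387075L);
    // The exporter position did not move, so no entry precedes it in the log:
    final long index = resolveSnapshotIndex(OptionalLong.empty(), 494998838L);
    System.out.println(position + " @ index " + index);
  }
}
```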
zeebe-bors-camunda bot added a commit that referenced this issue Oct 6, 2022
10624: [Backport stable/8.0] fix: take snapshot if nothing was exported since last snapshot r=oleschoenburg a=backport-action

# Description
Backport of #10611 to `stable/8.0`.

relates to #9761

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Oct 7, 2022
10625: [Backport stable/8.1] fix: take snapshot if nothing was exported since last snapshot r=oleschoenburg a=backport-action

# Description
Backport of #10611 to `stable/8.1`.

relates to #9761

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Oct 7, 2022
10493: [Backport stable/8.0] fix(raft): handle exceptions on partition server init r=oleschoenburg a=megglos

# Description
Backport of #10450 to `stable/8.0`.

10566: [Backport stable/8.0] fix(helm): rename podSecurityContext to containerSecurityContext r=oleschoenburg a=backport-action

# Description
Backport of #10556 to `stable/8.0`.

relates to camunda/camunda-platform-helm#374

10624: [Backport stable/8.0] fix: take snapshot if nothing was exported since last snapshot r=oleschoenburg a=backport-action

# Description
Backport of #10611 to `stable/8.0`.

relates to #9761

10638: [Backport stable/8.0] test: fix unfinished stubbing of command response writer r=oleschoenburg a=backport-action

# Description
Backport of #10605 to `stable/8.0`.

relates to #10604

Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@korthout korthout added the version:8.1.1 Marks an issue as being completely or in parts released in 8.1.1 label Oct 13, 2022
@korthout korthout added version:8.2.0-alpha1 Marks an issue as being completely or in parts released in 8.2.0-alpha1 release/8.0.8 labels Nov 1, 2022
@npepinpe npepinpe added the version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0 label Apr 5, 2023