
Failed to take snapshot in leader because index entry is not found #9761

Closed
deepthidevaki opened this issue Jul 12, 2022 · 1 comment · Fixed by #10611
Assignees
Labels
kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround version:8.1.1 Marks an issue as being completely or in parts released in 8.1.1 version:8.2.0-alpha1 Marks an issue as being completely or in parts released in 8.2.0-alpha1 version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0

Comments

deepthidevaki (Contributor) commented Jul 12, 2022

Describe the bug

2022-07-12 01:28:37.329 CEST
Failed to take a snapshot for Broker-1-StreamProcessor-2
"io.camunda.zeebe.broker.system.partitions.NoEntryAtSnapshotPosition: Failed to take snapshot. Expected to find an indexed entry for determined snapshot position 1494387075 (processedPosition = 1494736551, exportedPosition=1494387075), but found no matching indexed entry which contains this position.

This occurred in the leader.

Related to #7911. In that issue, however, it was followers that could not take a snapshot, which is a valid scenario. Leaders should always be able to take a snapshot.

The impact of this bug is low (see the following section). There is no data inconsistency or loss; the snapshot is taken later, once new events are exported. However, if exporting does not resume, no new snapshot is taken, so reprocessing time after a restart will be high.

To Reproduce

The following events happened in the broker, leading to the above error.

1. Broker receives a new snapshot and replaces the existing log.
 2022-07-12 01:26:19.338 CEST
RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/494998838-19-1494391974-1494387075, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/494998838-19-1494391974-1494387075.checksum, checksum=1722253646, metadata=FileBasedSnapshotMetadata{index=494998838, term=19, processedPosition=1494391974, exporterPosition=1494387075}}
2022-07-12 01:26:19.339 CEST
RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Delete existing log (lastIndex '494887084') and replace with received snapshot (index '494998838')
2022-07-12 01:26:19.773 CEST
Transition to FOLLOWER on term 19 requested.
...
2022-07-12 01:26:19.922 CEST
Transition to FOLLOWER on term 19 completed
2. Broker later becomes the leader.
2022-07-12 01:26:37.184 CEST
Transition to LEADER on term 20 starting

2022-07-12 01:26:37.217 CEST
Recovered exporter 'Broker-1-Exporter-2' from snapshot at lastExportedPosition 1494387075
3. No new records are exported.
2022-07-12 01:27:38.251 CEST
Unexpected exception occurred on periodically flushing bulk, will retry later.
2022-07-12 01:28:08.589 CEST
Error on exporting record with key 4503600188691487
4. The broker tries to take a snapshot at the same exportedPosition but with a different processedPosition, and fails.
2022-07-12 01:28:37.329 CEST
Failed to take a snapshot for Broker-1-StreamProcessor-2
"io.camunda.zeebe.broker.system.partitions.NoEntryAtSnapshotPosition: Failed to take snapshot. Expected to find an indexed entry for determined snapshot position 1494387075 (processedPosition = 1494736551, exportedPosition=1494387075), but found no matching indexed entry which contains this position.

Hypothesis

It fails to find the entry because it tries to read the "previous" entry, i.e. the entry at index 494998838. Since there is already a snapshot at that index, the Raft leader only replicated events from index 494998839 onwards. The record with position 1494387075 is at index 494998839. So there is no inconsistency in the log, but the snapshot cannot be taken because the previous index has already been compacted.
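The failing lookup can be modeled with a minimal sketch (hypothetical names; the real logic lives in StateControllerImpl and the Raft log reader): after compaction, the first readable index is snapshotIndex + 1, so a position that falls in the very first entry after the snapshot has no readable "previous" entry even though the log itself is intact.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of the failing lookup; names do not match Zeebe's internals.
public class SnapshotLookup {
  // Map from Raft index -> highest record position contained in that entry.
  private final NavigableMap<Long, Long> log = new TreeMap<>();
  private final long firstIndex; // log was compacted up to the snapshot index

  SnapshotLookup(long snapshotIndex) {
    this.firstIndex = snapshotIndex + 1;
  }

  void append(long index, long position) {
    if (index >= firstIndex) {
      log.put(index, position);
    }
  }

  // Returns the index of the entry *preceding* the one that contains
  // the given position, or -1 if that entry was already compacted.
  long findPreviousEntry(long position) {
    for (var e : log.entrySet()) {
      if (e.getValue() >= position) {
        final long previous = e.getKey() - 1;
        return log.containsKey(previous) ? previous : -1; // -1: compacted away
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    // Snapshot at index 494998838; log starts at 494998839.
    final var lookup = new SnapshotLookup(494998838L);
    lookup.append(494998839L, 1494387075L); // the exported position lives here
    lookup.append(494998840L, 1494736551L);

    // Position 1494387075 is in the *first* entry after the snapshot,
    // so the "previous entry" lookup fails even though the log is intact.
    System.out.println(lookup.findPreviousEntry(1494387075L)); // -1
    System.out.println(lookup.findPreviousEntry(1494736551L)); // 494998839
  }
}
```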

Expected behavior

The leader should be able to take a snapshot when the exporter position has not changed but new events have been processed.

Log/Stacktrace

Logs

Full Stacktrace

"io.camunda.zeebe.broker.system.partitions.NoEntryAtSnapshotPosition: Failed to take snapshot. Expected to find an indexed entry for determined snapshot position 1494387075 (processedPosition = 1494736551, exportedPosition=1494387075), but found no matching indexed entry which contains this position.
	at io.camunda.zeebe.broker.system.partitions.impl.StateControllerImpl.takeTransientSnapshotInternal(StateControllerImpl.java:160) ~[zeebe-broker-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.broker.system.partitions.impl.StateControllerImpl.lambda$takeTransientSnapshot$0(StateControllerImpl.java:69) ~[zeebe-broker-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:79) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:44) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:116) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:97) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:81) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:189) ~[zeebe-scheduler-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]

@deepthidevaki deepthidevaki added kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround labels Jul 12, 2022
@oleschoenburg oleschoenburg self-assigned this Oct 5, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Oct 6, 2022
10611: fix: take snapshot if nothing was exported since last snapshot r=oleschoenburg a=oleschoenburg

When figuring out where to take the next snapshot, we determine the snapshot position as the minimum of processing and exporter positions.

There was an edge case where a leader could process but not export. In that case it would use the exporter position as the snapshot position and try to find a log entry at that position.
If the log starts with the exporter position, for example because the same broker previously received a snapshot and compacted the log, no entry could be found which led to a failed snapshot.

We now try and use the latest snapshot's term and index if the log entry could not be found. This ensures that new snapshots can be taken even if nothing was exported since the last snapshot.

closes #9761 

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
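The fix described in the commit message above can be sketched as follows. This is a simplified illustration with hypothetical method names, not Zeebe's actual API: the snapshot position stays the minimum of the processed and exported positions, and when no indexed entry is found at that position, the resolver falls back to the last snapshot's index instead of failing.

```java
import java.util.OptionalLong;

// Hypothetical sketch of the approach in #10611; not Zeebe's actual API.
public class SnapshotPositionResolver {
  // The snapshot position is the minimum of processed and exported positions.
  static long determineSnapshotPosition(long processedPosition, long exportedPosition) {
    return Math.min(processedPosition, exportedPosition);
  }

  // Before the fix: a missing indexed entry aborted the snapshot.
  // After the fix: fall back to the last snapshot's index (and term), so a
  // snapshot can still be taken when nothing was exported since the last one.
  static long resolveSnapshotIndex(OptionalLong indexedEntryAtPosition, long lastSnapshotIndex) {
    return indexedEntryAtPosition.orElse(lastSnapshotIndex);
  }

  public static void main(String[] args) {
    final long position = determineSnapshotPosition(1494736551L, 1494387075L);
    // The exporter position did not move, so no entry precedes it in the log:
    final long index = resolveSnapshotIndex(OptionalLong.empty(), 494998838L);
    System.out.println(position + " @ index " + index);
  }
}
```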
zeebe-bors-camunda bot added a commit that referenced this issue Oct 6, 2022
10624: [Backport stable/8.0] fix: take snapshot if nothing was exported since last snapshot r=oleschoenburg a=backport-action

# Description
Backport of #10611 to `stable/8.0`.

relates to #9761

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Oct 7, 2022
10625: [Backport stable/8.1] fix: take snapshot if nothing was exported since last snapshot r=oleschoenburg a=backport-action

# Description
Backport of #10611 to `stable/8.1`.

relates to #9761

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Oct 7, 2022
10493: [Backport stable/8.0] fix(raft): handle exceptions on partition server init r=oleschoenburg a=megglos

# Description
Backport of #10450 to `stable/8.0`.

10566: [Backport stable/8.0] fix(helm): rename podSecurityContext to containerSecurityContext r=oleschoenburg a=backport-action

# Description
Backport of #10556 to `stable/8.0`.

relates to camunda/camunda-platform-helm#374

10624: [Backport stable/8.0] fix: take snapshot if nothing was exported since last snapshot r=oleschoenburg a=backport-action

# Description
Backport of #10611 to `stable/8.0`.

relates to #9761

10638: [Backport stable/8.0] test: fix unfinished stubbing of command response writer r=oleschoenburg a=backport-action

# Description
Backport of #10605 to `stable/8.0`.

relates to #10604

Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@korthout korthout added the version:8.1.1 Marks an issue as being completely or in parts released in 8.1.1 label Oct 13, 2022
@korthout korthout added version:8.2.0-alpha1 Marks an issue as being completely or in parts released in 8.2.0-alpha1 release/8.0.8 labels Nov 1, 2022
@npepinpe npepinpe added the version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0 label Apr 5, 2023