Skip to content

Do not fail snapshotting when exporter position is -1 #7978

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
deepthidevaki opened this issue Oct 13, 2021 · 5 comments · Fixed by #8176
Closed

Do not fail snapshotting when exporter position is -1 #7978

deepthidevaki opened this issue Oct 13, 2021 · 5 comments · Fixed by #8176
Assignees
Labels
kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/low Marks a bug as having little to no noticeable impact for the user version:1.3.0 Marks an issue as being completely or in parts released in 1.3.0

Comments

@deepthidevaki
Copy link
Contributor

Describe the bug

When exporter position is -1 takeTransientSansphot fails with an error.

java.lang.IllegalStateException: Failed to take snapshot. Expected to find an indexed entry for determined snapshot position -1 (processedPosition = 111, exportedPosition=-1), but found no matching indexed entry which contains this position.

        at io.camunda.zeebe.broker.system.partitions.impl.StateControllerImpl.takeTransientSnapshotInternal (StateControllerImpl.java:151)
        at io.camunda.zeebe.broker.system.partitions.impl.StateControllerImpl.lambda$takeTransientSnapshot$0 (StateControllerImpl.java:66)
        at io.camunda.zeebe.util.sched.ActorJob.invoke (ActorJob.java:73)
        at io.camunda.zeebe.util.sched.ActorJob.execute (ActorJob.java:39)
        at io.camunda.zeebe.util.sched.ActorTask.execute (ActorTask.java:122)
        at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask (ActorThread.java:95)
        at io.camunda.zeebe.util.sched.ActorThread.doWork (ActorThread.java:78)
        at io.camunda.zeebe.util.sched.ActorThread.run (ActorThread.java:192)


Expected behavior

Either

  1. Do not throw IllegalStateException, instead use a meaningful exception that can be logged at debug level by AsyncSnapshotDirector. This is not an error, but an expected state.
    OR
  2. Take snapshot even if exporter position is -1. This will be useful if exporter has any issues, but processing is progressing. No snapshots means it will take for ever to reprocess. In this case, the question is what would be the snapshot index.

Environment:

  • Camunda Cloud
  • Zeebe Version: 1.2.0
@deepthidevaki deepthidevaki added the kind/bug Categorizes an issue or PR as a bug label Oct 13, 2021
@npepinpe npepinpe added 1.2.1 scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/low Marks a bug as having little to no noticeable impact for the user labels Oct 13, 2021
@romansmirnov romansmirnov self-assigned this Nov 3, 2021
@romansmirnov
Copy link
Member

@deepthidevaki,

Just to get a better understanding, can you please confirm my assumption/understanding :)

Current Behavior:
Whenever the exporter position is -1, then Zeebe cannot determine a snapshot position, and the snapshotting fails with a IllegalStateException. As a consequence, processing on that partition(?) does not progress anymore, so that nothing happens anymore on this partition.

Expected Behavior:
The exporter position with a value -1 does not affect the partition's processing. Meaning, whenever the exporter position is -1, the processing should continue progressing.

Possible Solutions:

  1. Interrupt snapshotting:
  • Do not create a snapshot
  • Instead, throw a specific exception that is handled properly (for example, by logging the message to the console).
  • Cons:
    • As long as the exporter position remains at -1, the log grows which results in long-running reprocessing (on fail over or whatever).
    • When the exporter position changes, eventually a snapshot will be created. However, it might result in a big snapshot.
  1. Do snapshotting:
  • Create a snapshot
  • When the exporter position is set to -1, the exporter position is not considered when determining the snapshot position.
  • Cons: Historical records might not be exported because of log compaction after a successful snapshotting so that Operate will never import those records.

Can you confirm my understanding?

@deepthidevaki
Copy link
Contributor Author

Current Behavior:
Whenever the exporter position is -1, then Zeebe cannot determine a snapshot position, and the snapshotting fails with a IllegalStateException.

This is correct.

As a consequence, processing on that partition(?) does not progress anymore, so that nothing happens anymore on this partition.

No. It does not affect processing. The only impact is that - since it is logged as an error, a user may think that there is something wrong. But actually everything is fine. We will retry taking snapshot after sometime, and when the exporter position is updated, we will eventually be able to take a snapshot.

@deepthidevaki
Copy link
Contributor Author

Do not create a snapshot
Instead, throw a specific exception that is handled properly (for example, by logging the message to the console).

This is a possible solution. We can just a log a message at appropriate level that we cannot take a snapshot because we have not exported anything. This is the minimum we should do.

Do snapshotting:

This is a better solution, but also complicated.

When the exporter position is set to -1, the exporter position is not considered when determining the snapshot position.

This is not ideal. As you have already described, the log get compacted before they are exported.
I would propose to take snapshot, while considering the exporter position is -1. So we should calculate the index of the snapshot in such a way that no events can be compacted. I was thinking we can use index 0, but I have to think a little bit more if that is a correct value. If you would like we can discuss this next week and see if this solution works or not.

@romansmirnov
Copy link
Member

@deepthidevaki, thanks for your clarification! It would be great, if we could discuss this issue next week, additionally, a walkthrough of the snapshotting procedure would be great. I will schedule a call for next week.

@romansmirnov
Copy link
Member

romansmirnov commented Nov 8, 2021

Outcome of discussion with @deepthidevaki:

  • We will consider the second option, i.e., creating a snapshot when the exporter position is -1
  • Implementation approach (given exporter version is -1):
    • If it is the first time a snapshot is done, then the snapshotting index is set to 0
    • If at least a second snapshot has been created, then the snapshotting index is taken from the last snapshot (this typically would happen, if the broker is started for the first time without exporters, then restarted but with exporter configuration)
    • In terms of snapshot id, 0 is used as the exporter position
  • Things to consider:
    • The ordering of snapshots should not break with that change, i.e., an old snapshot should not be considered as the latest snapshot
    • Compaction should not delete any (journal) segments when the snapshot index is set to 0.
    • According to the current implementation of snapshot replication, in that given scenario, the snapshot won't be replicated but the Raft events should be replicated as always.

@ghost ghost closed this as completed in 76bf970 Nov 18, 2021
ghost pushed a commit that referenced this issue Nov 26, 2021

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
8239: [Backport stable/1.2] fix(snapshot): create snapshot exporter position is -1 r=romansmirnov a=romansmirnov

## Description

backports #8176 

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates to #7978 



Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Nov 26, 2021

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
8239: [Backport stable/1.2] fix(snapshot): create snapshot exporter position is -1 r=romansmirnov a=romansmirnov

## Description

backports #8176 

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates to #7978 



Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Nov 29, 2021

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
8240: [Backport stable/1.1] fix(snapshot): create snapshot exporter position is -1 r=romansmirnov a=romansmirnov

## Description

backports #8176

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates to #7978



Co-authored-by: Roman <roman.smirnov@camunda.com>
@korthout korthout added the version:1.3.0 Marks an issue as being completely or in parts released in 1.3.0 label Jan 4, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/low Marks a bug as having little to no noticeable impact for the user version:1.3.0 Marks an issue as being completely or in parts released in 1.3.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants