
SnapshotWriteException: Expected snapshot chunk with equal snapshot total count 51, but got chunk with total count 23 #8381

Closed
lenaschoenburg opened this issue Dec 14, 2021 · 6 comments
Labels
kind/bug Categorizes an issue or PR as a bug severity/low Marks a bug as having little to no noticeable impact for the user

Comments

@lenaschoenburg
Member

lenaschoenburg commented Dec 14, 2021

Describe the bug
A follower rejects an install request with SnapshotWriteException: Expected snapshot chunk with equal snapshot total count 51, but got chunk with total count 23. Before and after this exception, the leader complains that it is unable to send install requests because chunks are received out of order:

Failed to send InstallRequest to member 1, with RaftError{type=ILLEGAL_MEMBER_STATE, message=Request chunk is was received out of order}. Restart sending snapshot.

Impact
The follower immediately recovers: the leader retries sending the snapshot, the follower receives it successfully and is able to restore its state from it.

Details

io.camunda.zeebe.snapshots.impl.SnapshotWriteException: Expected snapshot chunk with equal snapshot total count 51, but got chunk with total count 23.
	at io.camunda.zeebe.snapshots.impl.FileBasedReceivedSnapshot.checkTotalCountIsValid(FileBasedReceivedSnapshot.java:153) ~[zeebe-snapshots-1.2.4.jar:1.2.4]
	at io.camunda.zeebe.snapshots.impl.FileBasedReceivedSnapshot.applyInternal(FileBasedReceivedSnapshot.java:84) ~[zeebe-snapshots-1.2.4.jar:1.2.4]
	at io.camunda.zeebe.snapshots.impl.FileBasedReceivedSnapshot.lambda$apply$0(FileBasedReceivedSnapshot.java:64) ~[zeebe-snapshots-1.2.4.jar:1.2.4]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:62) ~[zeebe-util-1.2.4.jar:1.2.4]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-1.2.4.jar:1.2.4]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.2.4.jar:1.2.4]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:95) [zeebe-util-1.2.4.jar:1.2.4]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-1.2.4.jar:1.2.4]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:192) [zeebe-util-1.2.4.jar:1.2.4]
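For context, here is a minimal sketch of the kind of total-count validation that raises this exception. The real check lives in FileBasedReceivedSnapshot.checkTotalCountIsValid (see the stack trace above); the class, field, and parameter names below are assumptions for illustration only.

// Hypothetical sketch of the receiver-side validation that produces this error.
// Names are illustrative; only the exception message mirrors the real one.
final class ReceivedSnapshotSketch {
  private long expectedTotalCount = -1; // set when the first chunk of a snapshot arrives

  void applyChunk(final long chunkTotalCount, final byte[] content) throws SnapshotWriteException {
    if (expectedTotalCount == -1) {
      expectedTotalCount = chunkTotalCount;
    } else if (expectedTotalCount != chunkTotalCount) {
      // A chunk announcing a different total count cannot belong to the snapshot
      // currently being received, so the transfer is rejected.
      throw new SnapshotWriteException(
          String.format(
              "Expected snapshot chunk with equal snapshot total count %d, but got chunk with total count %d.",
              expectedTotalCount, chunkTotalCount));
    }
    // ... write the chunk content to the pending snapshot directory ...
  }
}

final class SnapshotWriteException extends Exception {
  SnapshotWriteException(final String message) {
    super(message);
  }
}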

@lenaschoenburg lenaschoenburg added the kind/bug Categorizes an issue or PR as a bug label Dec 14, 2021
@Zelldon Zelldon added the severity/low Marks a bug as having little to no noticeable impact for the user label Dec 14, 2021
@npepinpe
Member

It's possible that this is an expected case - we sometimes take consecutive snapshots with the same index. For example, if processing but not exporting, then the index will not change (even if the snapshot ID changes). The current replication logic expects snapshot indexes to be unique, which is no longer the case, and there is no additional check on the other snapshot metadata properties until we apply the chunk. So in this case, the leader may have taken a new snapshot for the same index, but that snapshot now has a different chunk count, checksum, etc.

If this is the case, then we should handle it better by explicitly checking whether a new snapshot is being sent to the follower, but the impact is very low. If that's not the case, we should try to figure out what went wrong, since it's a bigger problem if we're sometimes sending the wrong snapshot metadata.
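A minimal sketch of the stricter check being suggested, assuming the receiver tracks the full snapshot identity (ID, index, chunk count, checksum) rather than the index alone. All type and field names here are hypothetical, not the actual replication code.

// Hypothetical sketch: decide whether an incoming chunk belongs to the snapshot
// currently being received, or starts a new snapshot that happens to share the same index.
record SnapshotMetadata(String snapshotId, long index, long totalChunkCount, long checksum) {}

final class InstallRequestHandlerSketch {
  private SnapshotMetadata pending; // metadata of the snapshot currently being received

  boolean isNewSnapshot(final SnapshotMetadata incoming) {
    // Comparing only the index is not enough: consecutive snapshots can share an index
    // (e.g. processing without exporting) while differing in chunk count and checksum.
    return pending == null || !pending.equals(incoming);
  }

  void onChunk(final SnapshotMetadata incoming) {
    if (isNewSnapshot(incoming)) {
      // Abort the partially received snapshot and start over with the new metadata
      // instead of failing later with a SnapshotWriteException.
      pending = incoming;
    }
    // ... apply the chunk to the pending snapshot ...
  }
}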

@lenaschoenburg
Member Author

lenaschoenburg commented Dec 15, 2021

we sometimes take consecutive snapshots with the same index

AFAICT this is not the case here. I think the problem starts before we see the SnapshotWriteException.

Here is a very condensed timeline:

  • 2021-12-13 17:04:12.441 CET broker-2 receives a snapshot for partition-2, 418877-4-1301730-1289811-1
  • 2021-12-13 17:04:19.844 CET broker-2 commits snapshot 418877-4-1301730-1289811

Roughly 5 minutes later:

  • 2021-12-13 17:09:20.675 CET broker-2 receives a snapshot for partition-2, 636220-4-1904255-1870401-2
  • broker-2 rejects all install requests: RaftError{type=ILLEGAL_MEMBER_STATE, message=Request chunk is was received out of order}
  • 2021-12-13 17:14:14.168 CET broker-2 aborts the receive process for snapshot 636220-4-1904255-1870401-2

Another 5 minutes later:

  • 2021-12-13 17:19:19.586 CET broker-2 receives a snapshot for partition-2, 955401-4-2759594-2723483-4
  • broker-2 rejects all install requests: RaftError{type=ILLEGAL_MEMBER_STATE, message=Request chunk is was received out of order}
  • 2021-12-13 17:24:12.544 CET broker-2 throws SnapshotWriteException and aborts the receive process for snapshot 955401-4-2759594-2723483-4

@lenaschoenburg
Member Author

During the ~5 minutes where the follower can't receive the snapshot, the leader continuously retries install requests. Maybe we can find a solution where the follower always rolls back when it receives chunks out of order, or we could even introduce a backoff when retrying install requests?
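A minimal sketch of the backoff idea, assuming the leader restarts the transfer from the first chunk after a rejected install request. The class name and delay bounds are illustrative assumptions, not the actual Raft implementation.

import java.time.Duration;

// Hypothetical sketch: back off between snapshot replication attempts instead of
// retrying install requests in a tight loop for ~5 minutes.
final class SnapshotReplicationRetrySketch {
  private static final Duration MIN_BACKOFF = Duration.ofMillis(100);
  private static final Duration MAX_BACKOFF = Duration.ofSeconds(10);

  private Duration backoff = MIN_BACKOFF;

  Duration onInstallRejected() {
    // Restart the transfer from the first chunk and wait before retrying,
    // doubling the delay up to a maximum.
    final Duration delay = backoff;
    final Duration doubled = backoff.multipliedBy(2);
    backoff = MAX_BACKOFF.compareTo(doubled) < 0 ? MAX_BACKOFF : doubled;
    return delay;
  }

  void onInstallAccepted() {
    backoff = MIN_BACKOFF; // reset once the follower accepts chunks again
  }
}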

@npepinpe npepinpe added this to Ready in Zeebe Jan 10, 2022
@npepinpe npepinpe moved this from Ready to Planned in Zeebe Jan 10, 2022
@KerstinHebel KerstinHebel removed this from Planned in Zeebe Mar 23, 2022
@korthout
Member

korthout commented Sep 6, 2022

Also occurred on benchmark CW-34-mixed

@deepthidevaki
Contributor

Is this the same as #10180?

@Zelldon
Member

Zelldon commented Dec 29, 2022

Potentially fixed by #10183

@Zelldon Zelldon closed this as completed Dec 29, 2022