SnapshotWriteException: Expected snapshot chunk with equal snapshot total count 51, but got chunk with total count 23 #8381
Comments
It's possible that this is an expected case: we sometimes take consecutive snapshots with the same index. For example, if we are processing but not exporting, the index will not change (even if the snapshot ID changes). The current replication logic expects snapshot indexes to be unique, which is no longer the case, and there is no further check on the other snapshot metadata properties until we apply the chunk. So in this case, it could be that the leader took a new snapshot for the same index, but that snapshot now has a different chunk count, checksum, etc.
If this is the case, then we should handle it better by actually checking whether a new snapshot is being sent to the follower, but the impact is very low. If that's not the case, we should try to figure out what went wrong, since it's a bigger problem if we're sometimes sending the wrong snapshot metadata.
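To make the "check whether a new snapshot is being sent" idea concrete, here is a minimal sketch of what such a follower-side check could look like. It is not the project's actual code: `ChunkMetadata`, `PendingSnapshotCheck`, and all field names are hypothetical, and the snapshot IDs in the usage note are made up. The only point is that the comparison covers the ID, chunk count, and checksum rather than the index alone.

```java
import java.util.Objects;

/** Hypothetical chunk metadata; names are illustrative, not the project's real types. */
final class ChunkMetadata {
  final String snapshotId;
  final long index;
  final int totalCount;
  final long snapshotChecksum;

  ChunkMetadata(String snapshotId, long index, int totalCount, long snapshotChecksum) {
    this.snapshotId = snapshotId;
    this.index = index;
    this.totalCount = totalCount;
    this.snapshotChecksum = snapshotChecksum;
  }
}

/** Hypothetical follower-side check performed before a chunk is applied. */
final class PendingSnapshotCheck {
  enum Action {
    APPEND_TO_PENDING,
    RESTART_WITH_NEW_SNAPSHOT
  }

  /** True only if every metadata property matches, not just the index. */
  static boolean belongsToPending(ChunkMetadata pending, ChunkMetadata received) {
    return pending.index == received.index
        && pending.totalCount == received.totalCount
        && pending.snapshotChecksum == received.snapshotChecksum
        && Objects.equals(pending.snapshotId, received.snapshotId);
  }

  /** Decides whether to keep writing the pending snapshot or to discard it and start over. */
  static Action decide(ChunkMetadata pending, ChunkMetadata received) {
    if (pending == null || !belongsToPending(pending, received)) {
      return Action.RESTART_WITH_NEW_SNAPSHOT;
    }
    return Action.APPEND_TO_PENDING;
  }
}
```

With a pending snapshot `new ChunkMetadata("100-a", 100L, 51, 111L)` and an incoming chunk `new ChunkMetadata("100-b", 100L, 23, 222L)`, `decide` returns `RESTART_WITH_NEW_SNAPSHOT` even though both share index 100, instead of failing later when the chunk is applied.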
AFAICT this is not the case here. I think the problem starts before we see the exception. Here is a very condensed timeline:
Roughly 5 minutes later:
Another 5 minutes later:
During the ~5 minutes in which the follower can't receive the snapshot, the leader continuously retries install requests. Maybe we can find a solution where the follower always rolls back when it receives chunks out of order, or we could introduce a backoff when retrying install requests?
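For the backoff option, a rough sketch of a capped exponential backoff that the leader could consult between install retries. `InstallRetryBackoff` and its methods are hypothetical names for illustration only; how it would be wired into the actual replication code is left open.

```java
import java.time.Duration;

/** Hypothetical helper: capped exponential backoff between install request retries. */
final class InstallRetryBackoff {
  private final Duration initial;
  private final Duration max;
  private Duration current;

  InstallRetryBackoff(Duration initial, Duration max) {
    // Clamp the starting delay so it never exceeds the cap.
    this.initial = initial.compareTo(max) > 0 ? max : initial;
    this.max = max;
    this.current = this.initial;
  }

  /** Returns the delay to wait before the next retry, then doubles it up to the cap. */
  Duration nextDelay() {
    final Duration delay = current;
    final Duration doubled = current.multipliedBy(2);
    current = doubled.compareTo(max) > 0 ? max : doubled;
    return delay;
  }

  /** Call after a successful install request so the next failure starts from the initial delay. */
  void reset() {
    current = initial;
  }
}
```

The leader would call `nextDelay()` after each rejected install request and `reset()` once the follower accepts a chunk, so a follower stuck in the out-of-order state is not flooded with immediate retries.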
Also occurred on benchmark CW-34-mixed.
Is this the same as #10180?
Potentially fixed by #10183.
Describe the bug
A follower rejects an install request with the following exception:
SnapshotWriteException: Expected snapshot chunk with equal snapshot total count 51, but got chunk with total count 23
Before and after this exception, the leader complains that it is not able to send install requests because chunks are out of order:
Impact
The follower recovers immediately: the leader retries sending a snapshot, and the follower successfully receives it and is able to restore its state from it.
Details