
Curator Snapshot/delete action is failing which points to snapshot_missing_exception. #1697

Open
pratik-devops opened this issue Jan 19, 2024 · 10 comments


pratik-devops commented Jan 19, 2024

We have a single-node ES cluster hosting around 15 aliases, which has been working perfectly.
We have developed a backup and restore solution using Curator actions, which was also working perfectly fine.

The solution works as follows:

  1. Take a snapshot of the Elasticsearch indices, then use the delete action to keep only the latest 2 snapshots (action file below).

actions:
  1:
    action: snapshot
    description: >-
      Snapshot selected aliases to 'aaf_backup' with the snapshot name
      pattern in 'name'.
    options:
      repository: aaf_backup
      name: curator_snapshot-%Y%m%d%H%M%S
      wait_for_completion: True
      ignore_unavailable: True
      include_global_state: True
      max_wait: 3600
      wait_interval: 10
    filters:
    - filtertype: alias
      aliases:
      - test1
      - test2
      - testn
      exclude: False
  2:
    action: delete_snapshots
    description: "Delete snapshots from 'aaf_backup' repository (exclude latest 2)."
    options:
      repository: aaf_backup
      ignore_empty_list: True
      retry_interval: 120
      retry_count: 3
    filters:
    - filtertype: pattern
      kind: prefix
      value: curator_
    - filtertype: count
      count: 2

  2. In case of disaster, we will use a restore action to restore from the latest snapshot (a sketch of the restore call follows).
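
For reference, the restore step essentially comes down to a snapshot restore API call like the one below (a sketch only; the snapshot name is a placeholder for whichever snapshot is most recent):

curl -k -X POST "https://localhost:30920/_snapshot/aaf_backup/curator_snapshot-20240119020000/_restore?wait_for_completion=true" \
  -H 'Content-Type: application/json' -d '
{
  "indices": "*",
  "ignore_unavailable": true,
  "include_global_state": true
}'

Note that indices still open under the same name must be closed or deleted first, or restored under a new name via rename_pattern/rename_replacement.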

Suddenly, we started seeing failures in the snapshot operation, as below:

2024-01-19 08:35:05,547 INFO Snapshot curator_snapshot-20240119083505 still in progress.
2024-01-19 08:35:15,562 WARNING Snapshot curator_snapshot-20240119083505 completed with state PARTIAL.
2024-01-19 08:35:15,565 ERROR Snapshot PARTIAL completed with state: PARTIAL
2024-01-19 08:35:15,565 ERROR Failed to complete action: snapshot. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Snapshot PARTIAL completed with state: PARTIAL

On checking further, we found the below errors in the ES logs:

"stacktrace": ["java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/mount/backups/indices/87gNXnEESjCTaTFJqekEuw/0/index-2Q_3S0aYSCubJ39gsp7L9w",
"at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]",
"at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]",
"at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]",
"at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219) ~[?:?]",
"at java.nio.file.Files.newByteChannel(Files.java:380) ~[?:?]",
"at java.nio.file.Files.newByteChannel(Files.java:432) ~[?:?]",
"at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422) ~[?:?]",
"at java.nio.file.Files.newInputStream(Files.java:160) ~[?:?]",
"at org.elasticsearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:195) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:88) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:3416) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2666) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:370) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:267) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.3.jar:7.16.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]"] }
{"type": "server", "timestamp": "2024-01-19T08:35:06,443Z", "level": "WARN", "component": "o.e.s.SnapshotShardsService", "cluster.name": "test_elasticsearch", "node.name": "test_node", "message": "[[test_index-2023.04.24-000002][0]][aaf_backup:curator_snapshot-20240119083505/Lj_q_BZTQqy9pxiEnbu2Mg] failed to snapshot shard", "cluster.uuid": "lp-MLx_wRN-UeC7oEFTxtw", "node.id": "luXNaXPhSJS2cuNANwi_HQ" ,

On checking the snapshot repo with the below curl:

curl -k https://localhost:30920/_snapshot/aaf_backup/_all?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "snapshot_missing_exception",
        "reason" : "[aaf_backup:curator_snapshot-20230821110005/F4kYVIy9Re6ea90ikPVNrw] is missing"
      }
    ],
    "type" : "snapshot_missing_exception",
    "reason" : "[aaf_backup:curator_snapshot-20230821110005/F4kYVIy9Re6ea90ikPVNrw] is missing",
    "caused_by" : {
      "type" : "no_such_file_exception",
      "reason" : "/usr/share/elasticsearch/data/mount/backups/snap-F4kYVIy9Re6ea90ikPVNrw.dat"
    }
  },
  "status" : 404
}

$ curl -k https://localhost:30920/_cluster/health?pretty
{
  "cluster_name" : "test_elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 119,
  "active_shards" : 119,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
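
As a diagnostic (not part of the original workflow), the repository registration and writability can be checked as sketched below; note that _verify only confirms the node can read and write the repository location, it does not validate existing snapshot files:

curl -k "https://localhost:30920/_snapshot/aaf_backup?pretty"                    # confirm the repository is still registered and see its settings
curl -k -X POST "https://localhost:30920/_snapshot/aaf_backup/_verify?pretty"    # confirm the node can access the repository location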

There is a workaround to get the snapshot operation working again: deregister the repo, delete the data from the backup mount, then recreate the backup repo and register it. After this, everything returns to business as usual. But after a certain interval it ends up in the same state, where the snapshot operation starts failing.
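
For completeness, that workaround corresponds roughly to the following API calls (a sketch; the fs location is inferred from the paths in the logs above and must also be listed in the node's path.repo setting):

# Deregister the broken repository (this does not delete any files on disk)
curl -k -X DELETE "https://localhost:30920/_snapshot/aaf_backup"

# ...wipe the contents of the backup mount on the host...

# Re-register the repository
curl -k -X PUT "https://localhost:30920/_snapshot/aaf_backup" -H 'Content-Type: application/json' -d '
{
  "type": "fs",
  "settings": {
    "location": "/usr/share/elasticsearch/data/mount/backups"
  }
}'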

Expected Behavior

The above-mentioned process uses the snapshot and delete actions.
We run the snapshot creation and old-snapshot deletion every 6 hours.
The expectation is that it keeps taking new snapshots while maintaining only the latest two snapshots in the repository.
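
A dry run of the action file can be used to confirm which snapshots Curator would delete on the next run without actually deleting anything; a sketch, where the config and action file paths are assumptions:

# Paths are illustrative; substitute your own curator.yml and action file
curator --config /etc/curator/curator.yml --dry-run /etc/curator/snapshot_and_delete.yml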

Actual Behavior

The snapshot process mentioned above works for a while, then one day it starts failing with a partial snapshot error:
Failed to complete action: snapshot. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Snapshot PARTIAL completed with state: PARTIAL
Error in ES logs:

"stacktrace": ["java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/mount/backups/indices/87gNXnEESjCTaTFJqekEuw/0/index-2Q_3S0aYSCubJ39gsp7L9w",
"at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]",
"at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]",
"at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]",
"at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219) ~[?:?]",
"at java.nio.file.Files.newByteChannel(Files.java:380) ~[?:?]",
"at java.nio.file.Files.newByteChannel(Files.java:432) ~[?:?]",
"at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422) ~[?:?]",
"at java.nio.file.Files.newInputStream(Files.java:160) ~[?:?]",
"at org.elasticsearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:195) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:88) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:3416) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2666) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:370) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:267) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.3.jar:7.16.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]"] }
{"type": "server", "timestamp": "2024-01-19T08:35:06,443Z", "level": "WARN", "component": "o.e.s.SnapshotShardsService", "cluster.name": "test_elasticsearch", "node.name": "test_node", "message": "[[test_index-2023.04.24-000002][0]][aaf_backup:curator_snapshot-20240119083505/Lj_q_BZTQqy9pxiEnbu2Mg] failed to snapshot shard", "cluster.uuid": "lp-MLx_wRN-UeC7oEFTxtw", "node.id": "luXNaXPhSJS2cuNANwi_HQ" ,

As per the above logs, it seems Elasticsearch was not able to find an index-related file in the repository. But this is happening in production and the files were never removed manually, so I suspect something else is going wrong here.

Specifications

ES version: 7.16.3
Curator version: 5.8.4

Context (Environment)

This is causing issues in production, where we end up with a corrupted backup repository and the only solution we have is to recreate it, losing all the backups. In the worst case the backups will be of no use at all, because there will be no way to restore the data from a corrupted repo.

@untergeek
Member

I know it's easiest to blame Curator, but Curator only makes Elasticsearch API calls. What that means is that you would have the exact same result if you made the same API calls from a terminal or the Dev Tools Console in Kibana. The error you're seeing is at the Elasticsearch level. Curator is only logging that the Elasticsearch API call it made generated an error on the Elasticsearch side. The error suggests that a file copy did not pass a checksum test (ChecksumBlobStoreFormat.read), which suggests the problem exists between Elasticsearch and your blob storage provider somehow. Curator does not perform these checksum tests. This is an operation Elasticsearch performs to ensure that what has been written to the blob store matches the local checksum. If there is a checksum mismatch, Elasticsearch will generate the above error.
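
For illustration, the snapshot action above reduces to roughly the following API call (a sketch; the index list is abbreviated to the shard named in the log), which could just as well be issued from a terminal and would hit the same PARTIAL/missing-file behaviour:

curl -k -X PUT "https://localhost:30920/_snapshot/aaf_backup/curator_snapshot-20240119083505?wait_for_completion=true" \
  -H 'Content-Type: application/json' -d '
{
  "indices": "test_index-2023.04.24-000002",
  "ignore_unavailable": true,
  "include_global_state": true
}'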

Additionally, you might want to verify your node's Elasticsearch versions. You report that you're running on 7.17.12, but the stack trace you shared indicates that the node is running elasticsearch-7.16.3.jar:7.16.3, rather than 7.17.12.

@pratik-devops
Author

Thanks for the clarification @untergeek.
And you are right, the Elasticsearch version is 7.16.3; I have corrected it above.
The storage I am using for the backups is a Linux filesystem.
What worries me more is that this issue is not happening on just a single machine: I am seeing it frequently on almost all the machines where this solution is implemented. I will try to raise the issue with Elasticsearch as well.

But do you have any other suggestions to tackle this problem?

@untergeek
Member

Is it an actual, local filesystem? Or is it mounted from something over the network?

@pratik-devops
Author

We have an Elasticsearch pod running on Kubernetes, and the backup repository is mounted directly from the local FS.

@untergeek
Member

Persistent Volume? If so, what is backing it?

@pratik-devops
Author

Yes, at the moment it's a persistent volume with infra-level backups taken at regular intervals (which I know is not a good practice). We are also planning to move to NFS, but right now we are really concerned about the problem happening here.
What I am thinking is: since the ES logs complain about a NoSuchFileException for a particular index's data, and the actions provided delete older snapshots, could that deletion be corrupting the repo by chance?

@untergeek
Member

I'm concerned that the local file mount is partially to blame, but I'd need to know more about what kind of storage class your PV is. Elasticsearch works fine with NFS mounts, S3, Google Storage, Azure Blob, etc. But a PV can be made of so many different storage classes that it could have something to do with how Kubernetes handles it.

@pratik-devops
Author

Adding further details (I missed this earlier, sorry about that):
It's not really a PV but a hostPath volume, where I have mounted the backup filesystem as well as the data FS.

        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-data
        - mountPath: /usr/share/elasticsearch/data/mount/backups
          name: elasticsearch-backup
      volumes:
      - name: elasticsearch-certs
        secret:
          secretName: elasticsearch-certs
          defaultMode: 0777
      - hostPath:
          path: {{ .Values.mountPath.data }}
        name: elasticsearch-data
      - name: tz-config
        hostPath:
          path: /etc/localtime
      - name: elasticsearch-backup
        hostPath:
          path: {{ .Values.mountPath.backup }}

@untergeek
Member

What kind of drive is the hostPath? Local drive? NVMe? SATA? Network mount?

@pratik-devops
Author

Local Drive
