[Bug]: indexnode is unavailable and always restarts, how to fix it #32283
Comments
I think the problem is with MinIO, not Milvus: "more drives are expected to heal than parity".
I suspect your local disk has failed, which is causing MinIO to stop working.
Hello xiaofan, after checking, the MinIO service is ready; MinIO was deployed together with Milvus via Helm.
[root@hf-10.103.240.71.iflysearch.cn ~]$ kubectl get svc |grep -i minio
[root@hf-10.103.240.71.iflysearch.cn ~]$ mc ping myminio
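For reference, a minimal sketch of this kind of health check, assuming an in-cluster MinIO deployed by the Milvus Helm chart; the alias name myminio, the service endpoint, and the credentials are placeholders rather than values from this deployment:
kubectl get svc | grep -i minio
mc alias set myminio http://my-release-minio.default.svc.cluster.local:9000 ACCESS_KEY SECRET_KEY   # endpoint and credentials are placeholders
mc ping myminio   # succeeds only if the MinIO endpoint is reachable and responding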
No, I mean ping the Milvus port.
I think the current problem is that the indexnode wants to read the MinIO data of the dropped collection, but I manually deleted that collection's data from MinIO, so the indexnode throws [error="[UnexpectedError] Error:GetObjectSize[errcode: 404, exception:, errmessage:No response body.]"].
The Milvus port test is also OK.
[root@hf-10.103.240.71.iflysearch.cn ~]$ kubectl get svc |grep -i milvus
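As a sketch of the port check being suggested, assuming the default Milvus gRPC port 19530 and a Helm-style service name (both are assumptions, not values confirmed in this thread):
kubectl get svc | grep -i milvus
kubectl port-forward svc/my-release-milvus 19530:19530 &   # service name is a placeholder; 19530 is the default Milvus gRPC port
nc -vz 127.0.0.1 19530                                     # assumes netcat is installed locally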
We should not remove data manually in most cases. Could you please share what happened before that and why you needed to manually delete the data in MinIO? /assign @congqixia
Hi yanliang567, I think there is no need to fix the index data, because the corresponding collection data has already been deleted. Now we only need to restore the indexnode service so it is available again.
Hi yanliang567, any updates?
@TonyAnn did you try to drop the related index? Have there been any modifications to the collection yet?
@TonyAnn if the collection has already been dropped in this case, we might not be able to modify the legacy task with normal methods.
We suspect there is still some meta left in Milvus that keeps triggering index tasks for the dropped collection, and those index tasks can neither succeed nor fail because the data in MinIO was deleted. We believe there is a fix in the latest 2.3.13 release, but since you are running 2.2.11 this cannot recover by itself.
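One rough way to look for such leftover meta is to list the Milvus keys in etcd directly; the endpoint below is the one quoted later in this thread, while the key prefix is only an assumption based on the default by-dev root path:
etcdctl --endpoints=10.96.3.170:2379 get --prefix "by-dev" --keys-only | grep -i index | head -n 50   # "by-dev" is an assumption; adjust to your configured rootPath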
The backup is slow and may be interrupted.
Milvus(by-dev) > backup
Hi yanliang567, there is a problem. Can upgrading to 2.2.16 solve it?
While backing up, an error was encountered.
Offline > connect --etcd 10.96.3.170:2379
The backup is incomplete; please refer to the attachment.
So you want to wipe out all data?
No, I only want to clean up the dropped collection's data, not all data.
@TonyAnn it looks like the upload did not complete; the link above is just the URL back to this issue.
Yes, I am using the backup --ignoreRevision command to back up again. The amount of data is large and the backup is very slow; it is only 15% complete.
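For anyone following along, the backup flow described in this thread amounts to roughly the following birdwatcher session (these are the commands quoted above; no other flags are implied):
Offline > connect --etcd 10.96.3.170:2379
Milvus(by-dev) > backup --ignoreRevision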
Please help confirm this issue.
@TonyAnn you will need to clean up everything on etcd; that is a lot of work to do.
The complete etcd backup set is too large to upload; it has been sent to yanliang privately.
@congqixia after the problem occurred, I tried to execute remove segment-orphan in birdwatcher, but it had no effect.
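For reference, that attempt corresponds roughly to the following birdwatcher session, using only the commands already mentioned in this thread:
Offline > connect --etcd 10.96.3.170:2379
Milvus(by-dev) > remove segment-orphan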
@congqixia after the problem occurred, I only saw the following error reported by the indexnode, and at the same time the MinIO-related pods reported that the file could not be found.
[2024/04/15 11:51:40.367 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
These logs do not indicate why the indexnode crashed, so we can't be sure what the problem was.
@TonyAnn from the buildID
@TonyAnn from the backup, the index task still exists. A quick guess is that there are too many segments and the garbage collector was too busy to keep up and recycle this index task. You could check:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Resolved offline.
Is there an existing issue for this?
Environment
Current Behavior
The indexnode keeps restarting and the following error is reported. How to fix it?
[2024/04/15 11:51:40.367 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:40.367 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619668703841] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]
[2024/04/15 11:51:50.423 +00:00] [WARN] [indexcgowrapper/helper.go:76] ["failed to create index, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]\n"]
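For context, the restart loop and the error above can typically be observed with standard kubectl commands; the label selector and pod name below are placeholders, not values from this cluster:
kubectl get pods -l app.kubernetes.io/instance=my-release   # check the RESTARTS column for the indexnode pods; label is a placeholder
kubectl logs <indexnode-pod-name> --previous --tail=200     # log of the previous (crashed) container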
Checking MinIO, I find that MinIO reports the object does not exist:
DeploymentID: 75e401f4-5e1a-49ec-a55d-565de0876aa4
Error: Reading erasure shards at (http://my-release-minio-10.my-release-minio-svc.default.svc.cluster.local:9000/export: milvus-bucket/file/index_files/446155862386992438/1/446155862382381128/446155862386992191/HNSW_8/8687f322-f0ff-442e-8da3-7718e05d2e1d/part.1) returned 'file not found', will attempt to reconstruct if we have quorum (*fmt.wrapError)
API: SYSTEM(bucket=milvus-bucket, object=file/index_files/446155862387198095/1/446155862382381128/446155862386992176/HNSW_6)
Time: 06:56:42 UTC 04/10/2024
DeploymentID: 75e401f4-5e1a-49ec-a55d-565de0876aa4
Error: more drives are expected to heal than parity, returned errors: [file version not found file version not found ] (dataErrs [file version not found file version not found file not found file not found]) -> milvus-bucket/file/index_files/446155862387198095/1/446155862382381128/446155862386992176/HNSW_6(null) (*errors.errorString)
5: internal/logger/logger.go:258:logger.LogIf()
4: cmd/erasure-healing.go:487:cmd.(*erasureObjects).healObject()
3: cmd/erasure-healing.go:1067:cmd.erasureObjects.HealObject()
2: cmd/erasure-sets.go:1209:cmd.(*erasureSets).HealObject()
1: cmd/erasure-server-pool.go:2030:cmd.(*erasureServerPools).HealObject.func1()
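One way to confirm that the object named in the MinIO error is really missing is to stat it directly with mc; the object path is taken from the error above, while the alias myminio is an assumption:
mc stat myminio/milvus-bucket/file/index_files/446155862387198095/1/446155862382381128/446155862386992176/HNSW_6   # expected to fail with "Object does not exist" if the file was removed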
Checking from birdwatcher, the collection has been deleted:
Milvus(by-dev) > show collections --id 446155862382381128
collection 446155862382381128 not found in etcd collection not found
Milvus(by-dev) >
Problem summary: the collection had previously been dropped, but MinIO did not automatically delete the corresponding data, so I then used the mc rm command to manually delete the insert_log and index_log data in MinIO (a rough sketch of what was run is shown below).
How can this situation be fixed? The indexnode keeps restarting, which is causing data-writing exceptions.
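For the record, the manual deletion was roughly of the following shape. The prefixes are assumptions reconstructed from the object paths in the MinIO errors, and, as noted earlier in the thread, this kind of manual cleanup is exactly what leaves dangling index tasks behind, so it is shown only to document what happened, not as a recommendation:
mc rm --recursive --force myminio/milvus-bucket/file/insert_log/446155862382381128/     # collection-scoped insert log prefix is an assumption
mc rm --recursive --force myminio/milvus-bucket/file/index_files/446155862386992438/    # index files are keyed by build ID; this one appears in the errors above, the full set of prefixes removed is unknown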
Expected Behavior
milvus-log.tar (2).gz
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response