
[Bug]: indexnode is unavailable and always restarts, how to fix it #32283

Closed
TonyAnn opened this issue Apr 15, 2024 · 31 comments

@TonyAnn

TonyAnn commented Apr 15, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.2.11
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

The indexnode keeps restarting and reports the error below. How can it be fixed?

[2024/04/15 11:51:40.367 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:40.367 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619668703841] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]
[2024/04/15 11:51:50.423 +00:00] [WARN] [indexcgowrapper/helper.go:76] ["failed to create index, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]\n"]

Checking MinIO shows it is complaining that an object does not exist:
DeploymentID: 75e401f4-5e1a-49ec-a55d-565de0876aa4
Error: Reading erasure shards at (http://my-release-minio-10.my-release-minio-svc.default.svc.cluster.local:9000/export: milvus-bucket/file/index_files/446155862386992438/1/446155862382381128/446155862386992191/HNSW_8/8687f322-f0ff-442e-8da3-7718e05d2e1d/part.1) returned 'file not found', will attempt to reconstruct if we have quorum (*fmt.wrapError)

API: SYSTEM(bucket=milvus-bucket, object=file/index_files/446155862387198095/1/446155862382381128/446155862386992176/HNSW_6)
Time: 06:56:42 UTC 04/10/2024
DeploymentID: 75e401f4-5e1a-49ec-a55d-565de0876aa4
Error: more drives are expected to heal than parity, returned errors: [file version not found file version not found ] (dataErrs [file version not found file version not found file not found file not found]) -> milvus-bucket/file/index_files/446155862387198095/1/446155862382381128/446155862386992176/HNSW_6(null) (*errors.errorString)
5: internal/logger/logger.go:258:logger.LogIf()
4: cmd/erasure-healing.go:487:cmd.(*erasureObjects).healObject()
3: cmd/erasure-healing.go:1067:cmd.erasureObjects.HealObject()
2: cmd/erasure-sets.go:1209:cmd.(*erasureSets).HealObject()
1: cmd/erasure-server-pool.go:2030:cmd.(*erasureServerPools).HealObject.func1()

Checking with birdwatcher, the collection has already been dropped:
Milvus(by-dev) > show collections --id 446155862382381128
collection 446155862382381128 not found in etcd collection not found
Milvus(by-dev) >

Problem summary: the collection was dropped earlier, but the corresponding data in MinIO was not automatically cleaned up.

I then used the mc rm command to manually delete the insert_log and index_log data for that collection in MinIO.
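
For reference, the manual deletion was done with mc commands roughly like the ones below (a sketch only: "myminio" is a hypothetical client alias, the "milvus-bucket/file/..." prefix and the collection ID 446155862382381128 are taken from the MinIO error paths and birdwatcher output above, and the exact prefixes to remove depend on your deployment's rootPath and layout):

# list what sits under the dropped collection's insert_log prefix before deleting anything
mc ls --recursive myminio/milvus-bucket/file/insert_log/446155862382381128/
# remove the prefix recursively (destructive; double-check the prefix first)
mc rm --recursive --force myminio/milvus-bucket/file/insert_log/446155862382381128/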

How can this situation be fixed? The indexnode keeps restarting, which is causing data write exceptions.

Expected Behavior

milvus-log.tar (2).gz

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@TonyAnn TonyAnn added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 15, 2024
@xiaofan-luan
Contributor

: more drives are expected to heal than parity

I think the problem is with MinIO rather than Milvus: "more drives are expected to heal than parity".
Did you try anything special to fix it? You should first check whether MinIO is working as expected.
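
For example, a quick MinIO health check from the mc client could look like this (a sketch; "myminio" is whatever alias you have configured for this deployment's MinIO endpoint):

# basic reachability check against the MinIO endpoint
mc ping myminio
# overall server status, including drive/online state for the deployment
mc admin info myminio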

@xiaofan-luan
Contributor

I suspect your local disk has failed, which is causing MinIO to misbehave.
Did you deploy a standalone MinIO?

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

I suspect your local disk has failed, which is causing MinIO to misbehave. Did you deploy a standalone MinIO?

Hello xiaofan, from my checks the MinIO service is ready; MinIO was deployed as part of the Milvus Helm chart.

^C[root@hf-10.103.240.71.iflysearch.cn ~]$ kubectl get svc |grep -i minio
my-release-minio NodePort 10.96.1.80 9000:31090/TCP 263d
my-release-minio-svc ClusterIP None 9000/TCP 263d

[root@hf-10.103.240.71.iflysearch.cn ~]$ mc ping myminio
1: http://10.103.240.71:31090:31090 min=1.04ms max=1.04ms average=1.04ms errors=0 roundtrip=1.04ms
2: http://10.103.240.71:31090:31090 min=0.71ms max=1.04ms average=0.87ms errors=0 roundtrip=0.71ms
3: http://10.103.240.71:31090:31090 min=0.35ms max=1.04ms average=0.70ms errors=0 roundtrip=0.35ms
4: http://10.103.240.71:31090:31090 min=0.31ms max=1.04ms average=0.60ms errors=0 roundtrip=0.31ms
5: http://10.103.240.71:31090:31090 min=0.31ms max=1.04ms average=0.55ms errors=0 roundtrip=0.32ms
6: http://10.103.240.71:31090:31090 min=0.31ms max=1.04ms average=0.51ms errors=0 roundtrip=0.35ms
7: http://10.103.240.71:31090:31090 min=0.31ms max=1.04ms average=0.50ms errors=0 roundtrip=0.45ms
8: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.48ms errors=0 roundtrip=0.29ms
9: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.46ms errors=0 roundtrip=0.33ms
10: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.44ms errors=0 roundtrip=0.30ms
11: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.43ms errors=0 roundtrip=0.30ms
12: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.42ms errors=0 roundtrip=0.33ms
13: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.42ms errors=0 roundtrip=0.32ms

@xiaofan-luan
Contributor

No, I mean ping the Milvus port.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

I think the current problem is that the indexnode is trying to fetch the MinIO data of the dropped collection, but I manually deleted that collection's data from MinIO,

so the indexnode throws [error="[UnexpectedError] Error:GetObjectSize[errcode: 404, exception:, errmessage:No response body.]"]

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

No, I mean ping the Milvus port.

The milvus port test is also ok.

[root@hf-10.103.240.71.iflysearch.cn ~]$ kubectl get svc |grep -i milvus
my-release-milvus NodePort 10.96.2.89 19530:32300/TCP,9091:32582/TCP 263d
my-release-milvus-attu NodePort 10.96.2.114 3000:32401/TCP 14d
my-release-milvus-datacoord ClusterIP 10.96.0.155 13333/TCP,9091/TCP 263d
my-release-milvus-datanode ClusterIP None 9091/TCP 263d
my-release-milvus-indexcoord ClusterIP 10.96.0.123 31000/TCP,9091/TCP 263d
my-release-milvus-indexnode ClusterIP None 9091/TCP 263d
my-release-milvus-querycoord ClusterIP 10.96.3.201 19531/TCP,9091/TCP 263d
my-release-milvus-querynode ClusterIP None 9091/TCP 263d
my-release-milvus-rootcoord ClusterIP 10.96.2.199 53100/TCP,9091/TCP 263d
[root@hf-10.103.240.71.iflysearch.cn ~]$ ping 10.96.2.89
PING 10.96.2.89 (10.96.2.89) 56(84) bytes of data.
64 bytes from 10.96.2.89: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 10.96.2.89: icmp_seq=2 ttl=64 time=0.033 ms

@yanliang567
Contributor

I think the current problem is that the indexnode is trying to fetch the MinIO data of the dropped collection, but I manually deleted that collection's data from MinIO,

so the indexnode throws [error="[UnexpectedError] Error:GetObjectSize[errcode: 404, exception:, errmessage:No response body.]"]

We should not remove data manually in most cases. Could you please share what happened before that and why you needed to manually delete the data in MinIO?

/assign @congqixia
@congqixia I know it is difficult, but in this case, do we have any ideas to fix the index nodes?

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 16, 2024
@TonyAnn
Author

TonyAnn commented Apr 16, 2024

I think the current problem is that the indexnode is trying to fetch the MinIO data of the dropped collection, but I manually deleted that collection's data from MinIO,
so the indexnode throws [error="[UnexpectedError] Error:GetObjectSize[errcode: 404, exception:, errmessage:No response body.]"]

We should not remove data manually in most cases. Could you please share what happened before that and why you needed to manually delete the data in MinIO?

/assign @congqixia @congqixia I know it is difficult, but in this case, do we have any ideas to fix the index nodes?
Hi yanliang567,
Because the disk space was almost full and Milvus 2.2.11 cannot automatically clean up the dropped collection's data, I had to manually clean up the stale data in MinIO.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

I think the current problem is that the indexnode is trying to fetch the MinIO data of the dropped collection, but I manually deleted that collection's data from MinIO,
so the indexnode throws [error="[UnexpectedError] Error:GetObjectSize[errcode: 404, exception:, errmessage:No response body.]"]

We should not remove data manually in most cases. Could you please share what happened before that and why you needed to manually delete the data in MinIO?

/assign @congqixia @congqixia I know it is difficult, but in this case, do we have any ideas to fix the index nodes?

Hi yanliang567, I think there is no need to fix the index data because the corresponding collection has already been dropped. We only need to restore the indexnode service to a healthy state.

@yanliang567 yanliang567 removed their assignment Apr 16, 2024
@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, any updates?

@congqixia
Contributor

@TonyAnn did you try to drop the related index? Any modifications to the collection yet?

@congqixia
Contributor

@TonyAnn If the collection has already been dropped in this case, we might not be able to modify the legacy task through normal methods.
Since this is an abnormal situation, we might need to manually drop the legacy tasks for the indexnodes. Could you please provide a backup of etcd made with this tool?
https://github.com/milvus-io/birdwatcher/releases/tag/v1.0.3
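
A backup session with birdwatcher looks roughly like this (a sketch; replace the placeholder etcd endpoint with your own, as done later in this thread):

Offline > connect --etcd <your-etcd-endpoint>:2379
Milvus(by-dev) > backup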

@yanliang567
Contributor

Hi yanliang567, any updates?

We suspect some metadata remains in Milvus that keeps triggering index tasks for the dropped collection, and those index tasks can neither succeed nor fail because the data in MinIO was deleted. We believe there is a fix in the latest 2.3.13 release, but since you are running 2.2.11, this cannot recover by itself.
We can try to recover the index nodes by manually cleaning up the dirty metadata, but it is risky and there is no 100% guarantee. If you agree, please help collect a backup of etcd; we will analyze the existing metadata first and then suggest some actions. See https://github.com/milvus-io/birdwatcher for details on how to back up etcd with birdwatcher.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, any updates?

We suspect some metadata remains in Milvus that keeps triggering index tasks for the dropped collection, and those index tasks can neither succeed nor fail because the data in MinIO was deleted. We believe there is a fix in the latest 2.3.13 release, but since you are running 2.2.11, this cannot recover by itself. We can try to recover the index nodes by manually cleaning up the dirty metadata, but it is risky and there is no 100% guarantee. If you agree, please help collect a backup of etcd; we will analyze the existing metadata first and then suggest some actions. See https://github.com/milvus-io/birdwatcher for details on how to back up etcd with birdwatcher.

Backups are slow and may be interrupted

Milvus(by-dev) > backup
Backing up... 0%(10601/1503047)

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

While backing up, an error was encountered.

Offline > connect --etcd 10.96.3.170:2379
Using meta path: by-dev/meta/
Milvus(by-dev) > backup
Backing up ... 7%(119001/1503047)
backup etcd failed, error: etcdserver: mvcc: required revision has been compacted
http://100.93.184.91:9091/metrics
http://100.116.193.30:9091/metrics
http://100.85.158.103:9091/metrics
http://100.93.184.127:9091/metrics
http://100.72.181.136:9091/metrics
http://100.93.184.96:9091/metrics
http://100.79.32.102:9091/metrics
failed to fetch metrics for indexnode(9963), Get "http://100.79.32.102:9091/metrics": read tcp 100.103.75.0:2157->100.79.32.102:9091: read: connection reset by peer
http://100.103.75.3:9091/metrics
http://100.85.158.112:9091/metrics
failed to fetch metrics for indexnode(9965), Get "http://100.85.158.112:9091/metrics": dial tcp 100.85.158.112:9091: connect: connection refused

Incomplete backup, please refer to the attachment
Uploading bw_etcd_ALL.240416-102441.bak.gz…

@xiaofan-luan
Contributor

So you want to wipe out all the data?
Maybe you could simply remove the etcd disk and delete everything on MinIO?

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

So you want to wipe out all the data? Maybe you could simply remove the etcd disk and delete everything on MinIO?

No, I only want to clean up the dropped collection's data, not all data.

@congqixia
Contributor

congqixia commented Apr 16, 2024

Incomplete backup, please refer to the attachment
Uploading bw_etcd_ALL.240416-102441.bak.gz…

@TonyAnn it looks like the upload did not complete; the link above is just the URL back to this issue.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Incomplete backup, please refer to the attachment
Uploading bw_etcd_ALL.240416-102441.bak.gz…

@TonyAnn it looks like the upload did not complete; the link above is just the URL back to this issue.

Yes, I am backing up again with the backup --ignoreRevision command. The amount of data is large and the backup is very slow; it is only 15% complete.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

Please help confirm this issue.

@congqixia
Contributor

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

Please help confirm this issue.

@TonyAnn
From the attached logs, the reason for the indexnode restarts could not be found. It seems the tool did not capture the panic or stderr log in time. Have you ever seen the panic stack trace, by any chance?

@xiaofan-luan
Contributor

You will need to clean up everything related on etcd, which is a lot of work to do.
Birdwatcher can help you inspect and clean up incorrect collection metadata, but that may also be a lot of work.
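
For illustration, the kind of inspection involved uses the same birdwatcher commands already shown in this thread (a sketch; the IDs are the ones from the logs above, and any remove-style command should be treated as risky):

Milvus(by-dev) > show collections --id 446155862382381128
Milvus(by-dev) > show segment-index --segment 449033619667725811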

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Incomplete backup, please refer to the attachment
Uploading bw_etcd_ALL.240416-102441.bak.gz…

@TonyAnn it looks like the upload did not complete; the link above is just the URL back to this issue.

The complete etcd backup set is too large and cannot be uploaded. It has been sent to yanliang privately.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

@TonyAnn did you try to drop the related index? Any modifications to the collection yet?

@congqixia After the problem occurred, I tried executing remove segment-orphan in birdwatcher, but it had no effect.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

Please help confirm this issue.

@TonyAnn From the attached logs, the reason for the indexnode restarts could not be found. It seems the tool did not capture the panic or stderr log in time. Have you ever seen the panic stack trace, by any chance?

@congqixia After the problem occurred, I only saw the following errors reported by the indexnode, and at the same time the MinIO pods reported errors that files could not be found.

[2024/04/15 11:51:40.367 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:40.367 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619668703841] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]
[2024/04/15 11:51:50.423 +00:00] [WARN] [indexcgowrapper/helper.go:76] ["failed to create index, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]\n"]
[2024/04/15 11:51:50.423 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:50.423 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619668159918] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]
[2024/04/15 11:51:51.301 +00:00] [WARN] [indexcgowrapper/helper.go:76] ["failed to create index, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]\n"]
[2024/04/15 11:51:51.301 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:51.301 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619667725864] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]
[2024/04/15 11:51:52.238 +00:00] [WARN] [indexcgowrapper/helper.go:76] ["failed to create index, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]\n"]
[2024/04/15 11:51:52.238 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:52.238 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619668703750] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]

@congqixia
Contributor

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

Please help confirm this issue.

@TonyAnn From the attached logs, the reason for the indexnode restarts could not be found. It seems the tool did not capture the panic or stderr log in time. Have you ever seen the panic stack trace, by any chance?

@congqixia After the problem occurred, I only saw the following errors reported by the indexnode, and at the same time the MinIO pods reported errors that files could not be found.

(indexnode error logs quoted above)

These logs do not indicate why the indexnode crashed, so I can't be sure what the problem was.

@congqixia
Contributor

@TonyAnn from the buildID 449033619668703750, I found that this index build task belongs to segment 449033619667725811, which is already in a dropped state and whose files might have been GCed already:

SegmentID: 449033619667725811 State: Flushed, Level: Legacy, Row Count:376181
--- Growing: 0, Sealed: 0, Flushed: 1, Dropped: 0

@congqixia
Contributor

@TonyAnn from the backup, the index task still exists. A quick guess is that there are too many segments and the garbage collector was too busy to keep up and recycle this index task. You could check:

Backup(by-dev) > show segment-index --segment 449033619667725811
SegmentID: 449033619667725811    State: Flushed
        IndexV2 build ID: 449033619668703750, states InProgress  Index Type:HNSW on Field ID: 101       Serialized Size: 0
        Current Index Version: 0
[InProgress]: 1 


stale bot commented May 18, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no updates for 30 days label May 18, 2024
@yanliang567
Contributor

resolved offline
