
[Bug]: indexnode is unavailable and always restarts, how to fix it #32283

Closed
TonyAnn opened this issue Apr 15, 2024 · 31 comments

@TonyAnn

TonyAnn commented Apr 15, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.2.11
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

The indexnode keeps restarting and reports the error below. How can it be fixed?

[2024/04/15 11:51:40.367 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:40.367 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619668703841] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]
[2024/04/15 11:51:50.423 +00:00] [WARN] [indexcgowrapper/helper.go:76] ["failed to create index, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]\n"]

Checking MinIO shows it is complaining that an object does not exist:
DeploymentID: 75e401f4-5e1a-49ec-a55d-565de0876aa4
Error: Reading erasure shards at (http://my-release-minio-10.my-release-minio-svc.default.svc.cluster.local:9000/export: milvus-bucket/file/index_files/446155862386992438/1/446155862382381128/446155862386992191/HNSW_8/8687f322-f0ff-442e-8da3-7718e05d2e1d/part.1) returned 'file not found', will attempt to reconstruct if we have quorum (*fmt.wrapError)

API: SYSTEM(bucket=milvus-bucket, object=file/index_files/446155862387198095/1/446155862382381128/446155862386992176/HNSW_6)
Time: 06:56:42 UTC 04/10/2024
DeploymentID: 75e401f4-5e1a-49ec-a55d-565de0876aa4
Error: more drives are expected to heal than parity, returned errors: [file version not found file version not found ] (dataErrs [file version not found file version not found file not found file not found]) -> milvus-bucket/file/index_files/446155862387198095/1/446155862382381128/446155862386992176/HNSW_6(null) (*errors.errorString)
5: internal/logger/logger.go:258:logger.LogIf()
4: cmd/erasure-healing.go:487:cmd.(*erasureObjects).healObject()
3: cmd/erasure-healing.go:1067:cmd.erasureObjects.HealObject()
2: cmd/erasure-sets.go:1209:cmd.(*erasureSets).HealObject()
1: cmd/erasure-server-pool.go:2030:cmd.(*erasureServerPools).HealObject.func1()

Checking with birdwatcher, the collection has already been dropped:
Milvus(by-dev) > show collections --id 446155862382381128
collection 446155862382381128 not found in etcd collection not found
Milvus(by-dev) >

Problem summary: the collection was dropped earlier, but the corresponding data in MinIO was not automatically cleaned up.

I then used the mc rm command to manually delete the insert_log and index_log data for that collection in MinIO.
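
For reference, the manual deletion was done with mc commands roughly like the ones below (a sketch only: "myminio" is a hypothetical client alias, the "milvus-bucket/file/..." prefix and the collection ID 446155862382381128 are taken from the MinIO error paths and birdwatcher output above, and the exact prefixes to remove depend on your deployment's rootPath and layout):

# list what sits under the dropped collection's insert_log prefix before deleting anything
mc ls --recursive myminio/milvus-bucket/file/insert_log/446155862382381128/
# remove the prefix recursively (destructive; double-check the prefix first)
mc rm --recursive --force myminio/milvus-bucket/file/insert_log/446155862382381128/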

How can this situation be fixed? The indexnode keeps restarting, which is causing data write exceptions.

Expected Behavior

milvus-log.tar (2).gz

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@TonyAnn TonyAnn added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 15, 2024
@xiaofan-luan
Contributor

: more drives are expected to heal than parity

I think the problem is with MinIO rather than Milvus: "more drives are expected to heal than parity".
Did you try anything special to fix it? You should first check whether MinIO is working as expected.
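
For example, a quick MinIO health check from the mc client could look like this (a sketch; "myminio" is whatever alias you have configured for this deployment's MinIO endpoint):

# basic reachability check against the MinIO endpoint
mc ping myminio
# overall server status, including drive/online state for the deployment
mc admin info myminio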

@xiaofan-luan
Contributor

I suspect your local disk has failed, which is causing MinIO to misbehave.
Did you deploy a standalone MinIO?

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

I suspect your local disk has failed, which is causing MinIO to misbehave. Did you deploy a standalone MinIO?

Hello xiaofan, from my checks the MinIO service is ready; MinIO was deployed as part of the Milvus Helm chart.

^C[root@hf-10.103.240.71.iflysearch.cn ~]$ kubectl get svc |grep -i minio
my-release-minio NodePort 10.96.1.80 9000:31090/TCP 263d
my-release-minio-svc ClusterIP None 9000/TCP 263d

[root@hf-10.103.240.71.iflysearch.cn ~]$ mc ping myminio
1: http://10.103.240.71:31090:31090 min=1.04ms max=1.04ms average=1.04ms errors=0 roundtrip=1.04ms
2: http://10.103.240.71:31090:31090 min=0.71ms max=1.04ms average=0.87ms errors=0 roundtrip=0.71ms
3: http://10.103.240.71:31090:31090 min=0.35ms max=1.04ms average=0.70ms errors=0 roundtrip=0.35ms
4: http://10.103.240.71:31090:31090 min=0.31ms max=1.04ms average=0.60ms errors=0 roundtrip=0.31ms
5: http://10.103.240.71:31090:31090 min=0.31ms max=1.04ms average=0.55ms errors=0 roundtrip=0.32ms
6: http://10.103.240.71:31090:31090 min=0.31ms max=1.04ms average=0.51ms errors=0 roundtrip=0.35ms
7: http://10.103.240.71:31090:31090 min=0.31ms max=1.04ms average=0.50ms errors=0 roundtrip=0.45ms
8: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.48ms errors=0 roundtrip=0.29ms
9: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.46ms errors=0 roundtrip=0.33ms
10: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.44ms errors=0 roundtrip=0.30ms
11: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.43ms errors=0 roundtrip=0.30ms
12: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.42ms errors=0 roundtrip=0.33ms
13: http://10.103.240.71:31090:31090 min=0.29ms max=1.04ms average=0.42ms errors=0 roundtrip=0.32ms

@xiaofan-luan
Contributor

No, I mean ping the Milvus port.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

I think the current problem is that the indexnode is trying to fetch the MinIO data of the dropped collection, but I manually deleted that collection's data from MinIO,

so the indexnode throws [error="[UnexpectedError] Error:GetObjectSize[errcode: 404, exception:, errmessage:No response body.]"]

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

No, I mean ping the Milvus port.

The milvus port test is also ok.

[root@hf-10.103.240.71.iflysearch.cn ~]$ kubectl get svc |grep -i milvus
my-release-milvus NodePort 10.96.2.89 19530:32300/TCP,9091:32582/TCP 263d
my-release-milvus-attu NodePort 10.96.2.114 3000:32401/TCP 14d
my-release-milvus-datacoord ClusterIP 10.96.0.155 13333/TCP,9091/TCP 263d
my-release-milvus-datanode ClusterIP None 9091/TCP 263d
my-release-milvus-indexcoord ClusterIP 10.96.0.123 31000/TCP,9091/TCP 263d
my-release-milvus-indexnode ClusterIP None 9091/TCP 263d
my-release-milvus-querycoord ClusterIP 10.96.3.201 19531/TCP,9091/TCP 263d
my-release-milvus-querynode ClusterIP None 9091/TCP 263d
my-release-milvus-rootcoord ClusterIP 10.96.2.199 53100/TCP,9091/TCP 263d
[root@hf-10.103.240.71.iflysearch.cn ~]$ ping 10.96.2.89
PING 10.96.2.89 (10.96.2.89) 56(84) bytes of data.
64 bytes from 10.96.2.89: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 10.96.2.89: icmp_seq=2 ttl=64 time=0.033 ms

@yanliang567
Contributor

I think the current problem is that the indexnode is trying to fetch the MinIO data of the dropped collection, but I manually deleted that collection's data from MinIO,

so the indexnode throws [error="[UnexpectedError] Error:GetObjectSize[errcode: 404, exception:, errmessage:No response body.]"]

We should not remove data manually in most cases. Could you please share what happened before that and why you needed to manually delete the data in MinIO?

/assign @congqixia
@congqixia I know it is difficult, but in this case, do we have any ideas to fix the index nodes?

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 16, 2024
@TonyAnn
Author

TonyAnn commented Apr 16, 2024

I think the current problem is that the indexnode is trying to fetch the MinIO data of the dropped collection, but I manually deleted that collection's data from MinIO,
so the indexnode throws [error="[UnexpectedError] Error:GetObjectSize[errcode: 404, exception:, errmessage:No response body.]"]

We should not remove data manually in most cases. Could you please share what happened before that and why you needed to manually delete the data in MinIO?

/assign @congqixia @congqixia I know it is difficult, but in this case, do we have any ideas to fix the index nodes?
Hi yanliang567,
Because the disk space was almost full and Milvus 2.2.11 cannot automatically clean up the dropped collection's data, I had to manually clean up the stale data in MinIO.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

I think the current problem is that the indexnode is trying to fetch the MinIO data of the dropped collection, but I manually deleted that collection's data from MinIO,
so the indexnode throws [error="[UnexpectedError] Error:GetObjectSize[errcode: 404, exception:, errmessage:No response body.]"]

We should not remove data manually in most cases. Could you please share what happened before that and why you needed to manually delete the data in MinIO?

/assign @congqixia @congqixia I know it is difficult, but in this case, do we have any ideas to fix the index nodes?

Hi yanliang567, I think there is no need to fix the index data because the corresponding collection has already been dropped. We only need to restore the indexnode service to a healthy state.

@yanliang567 yanliang567 removed their assignment Apr 16, 2024
@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, any updates?

@congqixia
Contributor

@TonyAnn did you try to drop the related index? Any modifications to the collection yet?

@congqixia
Contributor

@TonyAnn If the collection has already been dropped in this case, we might not be able to modify the legacy task through normal methods.
Since this is an abnormal situation, we might need to manually drop the legacy tasks for the indexnodes. Could you please provide a backup of etcd made with this tool?
https://github.com/milvus-io/birdwatcher/releases/tag/v1.0.3
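
A backup session with birdwatcher looks roughly like this (a sketch; replace the placeholder etcd endpoint with your own, as done later in this thread):

Offline > connect --etcd <your-etcd-endpoint>:2379
Milvus(by-dev) > backup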

@yanliang567
Contributor

Hi yanliang567, any updates?

We suspect some metadata remains in Milvus that keeps triggering index tasks for the dropped collection, and those index tasks can neither succeed nor fail because the data in MinIO was deleted. We believe there is a fix in the latest 2.3.13 release, but since you are running 2.2.11, this cannot recover by itself.
We can try to recover the index nodes by manually cleaning up the dirty metadata, but it is risky and there is no 100% guarantee. If you agree, please help collect a backup of etcd; we will analyze the existing metadata first and then suggest some actions. See https://github.com/milvus-io/birdwatcher for details on how to back up etcd with birdwatcher.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, any updates?

We suspect some metadata remains in Milvus that keeps triggering index tasks for the dropped collection, and those index tasks can neither succeed nor fail because the data in MinIO was deleted. We believe there is a fix in the latest 2.3.13 release, but since you are running 2.2.11, this cannot recover by itself. We can try to recover the index nodes by manually cleaning up the dirty metadata, but it is risky and there is no 100% guarantee. If you agree, please help collect a backup of etcd; we will analyze the existing metadata first and then suggest some actions. See https://github.com/milvus-io/birdwatcher for details on how to back up etcd with birdwatcher.

Backups are slow and may be interrupted

Milvus(by-dev) > backup
Backing up... 0%(10601/1503047)

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

While backing up, an error was encountered.

Offline > connect --etcd 10.96.3.170:2379
Using meta path: by-dev/meta/
Milvus(by-dev) > backup
Backing up ... 7%(119001/1503047)
backup etcd failed, error: etcdserver: mvcc: required revision has been compacted
http://100.93.184.91:9091/metrics
http://100.116.193.30:9091/metrics
http://100.85.158.103:9091/metrics
http://100.93.184.127:9091/metrics
http://100.72.181.136:9091/metrics
http://100.93.184.96:9091/metrics
http://100.79.32.102:9091/metrics
failed to fetch metrics for indexnode(9963), Get "http://100.79.32.102:9091/metrics": read tcp 100.103.75.0:2157->100.79.32.102:9091: read: connection reset by peer
http://100.103.75.3:9091/metrics
http://100.85.158.112:9091/metrics
failed to fetch metrics for indexnode(9965), Get "http://100.85.158.112:9091/metrics": dial tcp 100.85.158.112:9091: connect: connection refused

Incomplete backup, please refer to the attachment
Uploading bw_etcd_ALL.240416-102441.bak.gz…

@xiaofan-luan
Contributor

So you want to wipe out all the data?
Maybe you could simply remove the etcd disk and delete everything on MinIO?

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

So you want to wipe out all the data? Maybe you could simply remove the etcd disk and delete everything on MinIO?

No, I only want to clean up the dropped collection's data, not all data.

@congqixia
Contributor

congqixia commented Apr 16, 2024

Incomplete backup, please refer to the attachment
Uploading bw_etcd_ALL.240416-102441.bak.gz…

@TonyAnn it looks like the upload did not complete; the link above is just the URL back to this issue.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Incomplete backup, please refer to the attachment
Uploading bw_etcd_ALL.240416-102441.bak.gz…

@TonyAnn it looks like the upload did not complete; the link above is just the URL back to this issue.

Yes, I am backing up again with the backup --ignoreRevision command. The amount of data is large and the backup is very slow; it is only 15% complete.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

Please help confirm this issue.

@congqixia
Contributor

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

Please help confirm this issue.

@TonyAnn
From the attached logs, the reason for the indexnode restarts could not be found. It seems the tool did not capture the panic or stderr log in time. Have you ever seen the panic stack trace, by any chance?

@xiaofan-luan
Contributor

You will need to clean up everything related on etcd, which is a lot of work to do.
Birdwatcher can help you inspect and clean up incorrect collection metadata, but that may also be a lot of work.
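
For illustration, the kind of inspection involved uses the same birdwatcher commands already shown in this thread (a sketch; the IDs are the ones from the logs above, and any remove-style command should be treated as risky):

Milvus(by-dev) > show collections --id 446155862382381128
Milvus(by-dev) > show segment-index --segment 449033619667725811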

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Incomplete backup, please refer to the attachment
Uploading bw_etcd_ALL.240416-102441.bak.gz…

@TonyAnn it looks like the upload did not complete; the link above is just the URL back to this issue.

The complete etcd backup set is too large and cannot be uploaded. It has been sent to yanliang privately.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

@TonyAnn did you try to drop the related index? Any modifications to the collection yet?

@congqixia After the problem occurred, I tried executing remove segment-orphan in birdwatcher, but it had no effect.

@TonyAnn
Author

TonyAnn commented Apr 16, 2024

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

Please help confirm this issue.

@TonyAnn From the attached logs, the reason for the indexnode restarts could not be found. It seems the tool did not capture the panic or stderr log in time. Have you ever seen the panic stack trace, by any chance?

@congqixia After the problem occurred, I only saw the following errors reported by the indexnode, and at the same time the MinIO pods reported errors that files could not be found.

[2024/04/15 11:51:40.367 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:40.367 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619668703841] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]
[2024/04/15 11:51:50.423 +00:00] [WARN] [indexcgowrapper/helper.go:76] ["failed to create index, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]\n"]
[2024/04/15 11:51:50.423 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:50.423 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619668159918] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]
[2024/04/15 11:51:51.301 +00:00] [WARN] [indexcgowrapper/helper.go:76] ["failed to create index, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]\n"]
[2024/04/15 11:51:51.301 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:51.301 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619667725864] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]
[2024/04/15 11:51:52.238 +00:00] [WARN] [indexcgowrapper/helper.go:76] ["failed to create index, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]\n"]
[2024/04/15 11:51:52.238 +00:00] [ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).BuildIndex\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task.go:340\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:207\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:220\ngithub.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/task_scheduler.go:253"]
[2024/04/15 11:51:52.238 +00:00] [INFO] [indexnode/taskinfo_ops.go:42] ["IndexNode store task state"] [clusterID=by-dev] [buildID=449033619668703750] [state=Retry] ["fail reason"="[UnexpectedError] Error:GetObjectSize[errcode:404, exception:, errmessage:No response body.]"]

@congqixia
Contributor

Hi yanliang567, a question: can upgrading to 2.2.16 solve this problem?

Please help confirm this issue.

@TonyAnn From the attached logs, the reason for the indexnode restarts could not be found. It seems the tool did not capture the panic or stderr log in time. Have you ever seen the panic stack trace, by any chance?

@congqixia After the problem occurred, I only saw the following errors reported by the indexnode, and at the same time the MinIO pods reported errors that files could not be found.

(indexnode error logs quoted above)

These logs do not indicate why the indexnode crashed, so I can't be sure what the problem was.

@congqixia
Contributor

@TonyAnn from the buildID 449033619668703750, I found that this index build task belongs to segment 449033619667725811, which is already in a dropped state and whose files might have been GCed already:

SegmentID: 449033619667725811 State: Flushed, Level: Legacy, Row Count:376181
--- Growing: 0, Sealed: 0, Flushed: 1, Dropped: 0

@congqixia
Contributor

@TonyAnn from the backup, the index task still exists. A quick guess is that there are too many segments and the garbage collector was too busy to keep up and recycle this index task. You could check:

Backup(by-dev) > show segment-index --segment 449033619667725811
SegmentID: 449033619667725811    State: Flushed
        IndexV2 build ID: 449033619668703750, states InProgress  Index Type:HNSW on Field ID: 101       Serialized Size: 0
        Current Index Version: 0
[InProgress]: 1 


stale bot commented May 18, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no updates for 30 days label May 18, 2024
@yanliang567
Contributor

resolved offline
