
[Bug]: [benchmark][cluster] Milvus sets up two replicas, and the concurrent test raises an error: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded #17289

Closed
wangting0128 opened this issue May 31, 2022 · 10 comments
Labels: kind/bug (Issues or changes related to a bug), stale (indicates no updates for 30 days), test/benchmark (benchmark test), triage/accepted (Indicates an issue or PR is ready to be actively worked on)
Milestone: 2.1-RC1

Comments

@wangting0128
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20220527-55f1e03c
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.1.0.dev63
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: benchmark-backup-ds5f4

test yaml:
client-configmap: client-random-locust-search-filter-100m-ddl-r8-w2-replica2
server-configmap: server-cluster-8c64m-querynode2

server:

NAME                                                          READY   STATUS      RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
benchmark-backup-ds5f4-1-etcd-0                               1/1     Running     0          3d15h   10.97.17.217   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-etcd-1                               1/1     Running     0          3d15h   10.97.16.130   qa-node013.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-etcd-2                               1/1     Running     0          3d15h   10.97.17.219   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-milvus-datacoord-599d49f46f-hlzdr    1/1     Running     0          3d15h   10.97.4.200    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-milvus-datanode-684764f9d-p7xnh      1/1     Running     0          3d15h   10.97.19.137   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-milvus-indexcoord-69b8cd78f6-9ff49   1/1     Running     0          3d15h   10.97.4.199    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-milvus-indexnode-64c95b5bb7-q4z22    1/1     Running     0          3d15h   10.97.9.157    qa-node007.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-milvus-proxy-54f4bd5b9b-wbr2t        1/1     Running     0          3d15h   10.97.4.198    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-milvus-querycoord-77d9ff7559-ftp6r   1/1     Running     0          3d15h   10.97.4.196    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-milvus-querynode-7c59d95c45-k4xvg    1/1     Running     0          3d15h   10.97.16.120   qa-node013.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-milvus-querynode-7c59d95c45-wz8hc    1/1     Running     0          3d15h   10.97.12.104   qa-node015.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-milvus-rootcoord-7c9bd7bb5c-9cjb8    1/1     Running     0          3d15h   10.97.4.197    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-minio-0                              1/1     Running     0          3d15h   10.97.19.139   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-minio-1                              1/1     Running     0          3d15h   10.97.12.106   qa-node015.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-minio-2                              1/1     Running     0          3d15h   10.97.20.2     qa-node018.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-minio-3                              1/1     Running     0          3d15h   10.97.20.254   qa-node018.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-bookie-0                      1/1     Running     0          32h     10.97.5.66     qa-node003.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-bookie-1                      1/1     Running     0          3d15h   10.97.18.123   qa-node017.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-bookie-2                      1/1     Running     0          32h     10.97.5.65     qa-node003.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-bookie-init-pvwq4             0/1     Completed   0          3d15h   10.97.9.156    qa-node007.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-broker-0                      1/1     Running     0          3d15h   10.97.18.118   qa-node017.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-proxy-0                       1/1     Running     0          39h     10.97.17.137   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-pulsar-init-4thrk             0/1     Completed   0          3d15h   10.97.5.17     qa-node003.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-recovery-0                    1/1     Running     1          3d15h   10.97.3.42     qa-node001.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-zookeeper-0                   1/1     Running     0          32h     10.97.5.67     qa-node003.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-zookeeper-1                   1/1     Running     0          3d15h   10.97.18.125   qa-node017.zilliz.local   <none>           <none>
benchmark-backup-ds5f4-1-pulsar-zookeeper-2                   1/1     Running     0          3d15h   10.97.18.127   qa-node017.zilliz.local   <none>           <none>

client pod: benchmark-backup-ds5f4-3449886322

client log:

[2022-05-29 11:14:48,947] [   ERROR] - grpc RpcError: [has_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 10:58:35.214896', 'gRPC error': '2022-05-29 11:14:48.946252'}> (pymilvus.decorators:84)
[2022-05-29 12:35:47,247] [   ERROR] - grpc RpcError: [describe_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 12:30:10.922998', 'gRPC error': '2022-05-29 12:35:47.247697'}> (pymilvus.decorators:84)
[2022-05-29 12:51:08,127] [   ERROR] - grpc RpcError: [has_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 12:45:43.863194', 'gRPC error': '2022-05-29 12:51:08.127439'}> (pymilvus.decorators:84)
[2022-05-29 12:51:11,117] [   ERROR] - grpc RpcError: [has_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 12:45:43.863463', 'gRPC error': '2022-05-29 12:51:11.116962'}> (pymilvus.decorators:84)
[2022-05-29 12:51:11,118] [   ERROR] - grpc RpcError: [describe_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 12:45:43.863712', 'gRPC error': '2022-05-29 12:51:11.118219'}> (pymilvus.decorators:84)
[2022-05-29 13:25:04,006] [   ERROR] - grpc RpcError: [has_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:19:46.678401', 'gRPC error': '2022-05-29 13:25:04.006016'}> (pymilvus.decorators:84)
[2022-05-29 13:25:04,007] [   ERROR] - grpc RpcError: [describe_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:19:46.826210', 'gRPC error': '2022-05-29 13:25:04.007726'}> (pymilvus.decorators:84)
[2022-05-29 13:25:04,008] [   ERROR] - grpc RpcError: [describe_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:19:46.826520', 'gRPC error': '2022-05-29 13:25:04.008879'}> (pymilvus.decorators:84)
[2022-05-29 13:25:04,010] [   ERROR] - grpc RpcError: [describe_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:19:46.826820', 'gRPC error': '2022-05-29 13:25:04.010741'}> (pymilvus.decorators:84)
[2022-05-29 13:25:04,013] [   ERROR] - grpc RpcError: [describe_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:19:46.827082', 'gRPC error': '2022-05-29 13:25:04.013182'}> (pymilvus.decorators:84)
[2022-05-29 13:38:29,170] [   ERROR] - grpc RpcError: [describe_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:31:33.334079', 'gRPC error': '2022-05-29 13:38:29.170154'}> (pymilvus.decorators:84)
[2022-05-29 13:38:30,290] [   ERROR] - grpc RpcError: [describe_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:31:33.334282', 'gRPC error': '2022-05-29 13:38:30.290268'}> (pymilvus.decorators:84)
[2022-05-29 13:38:30,291] [   ERROR] - grpc RpcError: [has_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:31:33.334628', 'gRPC error': '2022-05-29 13:38:30.291776'}> (pymilvus.decorators:84)
[2022-05-29 13:38:30,293] [   ERROR] - grpc RpcError: [has_collection], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:31:33.334922', 'gRPC error': '2022-05-29 13:38:30.293021'}> (pymilvus.decorators:84)
[2022-05-29 13:39:30,305] [   ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:38:30.297317', 'gRPC error': '2022-05-29 13:39:30.305875'}> (pymilvus.decorators:84)
[2022-05-29 13:39:30,306] [   ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:31:32.792426', 'gRPC error': '2022-05-29 13:39:30.306790'}> (pymilvus.decorators:84)
[2022-05-29 13:39:30,309] [   ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:38:30.298344', 'gRPC error': '2022-05-29 13:39:30.309877'}> (pymilvus.decorators:84)
[2022-05-29 13:39:30,310] [   ERROR] - grpc RpcError: [search], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:31:32.791999', 'gRPC error': '2022-05-29 13:39:30.310549'}> (pymilvus.decorators:84)
[2022-05-29 13:44:30,371] [   ERROR] - grpc RpcError: [_execute_search_requests], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-05-29 13:39:30.304071', 'gRPC error': '2022-05-29 13:44:30.371412'}> (pymilvus.decorators:84)

Expected Behavior

No response

Steps To Reproduce

1. create collection
2. create index of ivf_sq8
3. insert 100 million vectors
4. flush collection
5. build index with the same params
6. load collection with 2 replicas
7. run the locust concurrent tasks: query <- search, load, get <- query, scene_test <- raises the error (a pymilvus sketch of steps 1-6 follows)
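
For reference, a minimal pymilvus sketch of steps 1-6 (the host, field schema, and batch size are illustrative assumptions; the actual run drives these steps through the milvus_benchmark locust tasks shown under "Anything else?"):

import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

# 1. create collection (schema assumed from the test config: id, float1, 128-dim vector)
connections.connect(host="127.0.0.1", port="19530")
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float1", DataType.FLOAT),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("sift_100m_128_l2", CollectionSchema(fields))

# 3./4. insert in batches (ni_per: 50000 in the config; only one small batch shown) and flush
batch = 50000
collection.insert([
    list(range(batch)),
    np.random.random(batch).tolist(),
    np.random.random((batch, 128)).tolist(),
])
collection.flush()

# 2./5. build the IVF_SQ8 index with the params from the config
collection.create_index(
    "float_vector",
    {"index_type": "IVF_SQ8", "metric_type": "L2", "params": {"nlist": 2048}},
)

# 6. load with 2 replicas (needs at least 2 query nodes, as in this deployment)
collection.load(replica_number=2)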

Milvus Log

No response

Anything else?

client-random-locust-search-filter-100m-ddl-r8-w2-replica2:

locust_random_performance:
      collections:
        -
          collection_name: sift_100m_128_l2
          other_fields: float1
          ni_per: 50000
          build_index: true
          index_type: ivf_sq8
          index_param:
            nlist: 2048
          load_param:
            replica_number: 2
          task:
            types:
              -
                type: query
                weight: 20
                params:
                  top_k: 10
                  nq: 10
                  search_param:
                    nprobe: 16
                  filters:
                    -
                      range: "{'range': {'float1': {'GT': -1.0, 'LT': collection_size * 0.5}}}"
              -
                type: load
                weight: 1
                params:
                  replica_number: 2
              -
                type: get
                weight: 10
                params:
                  ids_length: 10
              -
                type: scene_test
                weight: 2
            connection_num: 1
            clients_num: 20
            spawn_rate: 2
            during_time: 84h
    @time_wrapper
    def scene_test(self, collection_name=None, vectors=None, ids=None):
        logger.debug("[scene_test] Start scene test : %s" % collection_name)
        self.create_collection(dimension=128, collection_name=collection_name)
        time.sleep(1)

        collection_info = self.get_info(collection_name)

        entities = utils.generate_entities(collection_info, vectors, ids)
        logger.debug("[scene_test] Start insert : %s" % collection_name)
        self.insert(entities, collection_name=collection_name)
        logger.debug("[scene_test] Start flush : %s" % collection_name)
        self.flush(collection_name=collection_name)

        self.count(collection_name=collection_name)

        logger.debug("[scene_test] Start create index : %s" % collection_name)
        self.create_index(field_name='float_vector', index_type="ivf_sq8", metric_type='l2',
                          collection_name=collection_name, index_param={'nlist': 2048})
        time.sleep(59)

        logger.debug("[scene_test] Start drop : %s" % collection_name)
        self.drop(collection_name=collection_name)
        logger.debug("[scene_test]Scene test close : %s" % collection_name)
@wangting0128 added the kind/bug, priority/critical-urgent, needs-triage, and test/benchmark labels on May 31, 2022
@xiaofan-luan added this to the 2.1-RC1 milestone on May 31, 2022
@yanliang567 removed their assignment on May 31, 2022
@yanliang567 added the triage/accepted label and removed the needs-triage label on May 31, 2022
@XuanYang-cn
Contributor

time="2022-05-29T10:59:50Z" level=error msg="[Failed to create producer]" error="server error: ServiceNotReady: 
org.apache.pulsar.broker.service.BrokerServiceException$TopicFencedException: Topic is temporarily unavailable" 
producerID=163466 producer_name=benchmark-backup-ds5f4-1-pulsar-0-327825 topic="persistent://public/default/by-
dev-rootcoord-dml_0"

@XuanYang-cn
Contributor

time="2022-05-29T10:59:50Z" level=error msg="[Failed to create producer]" error="server error: ServiceNotReady:
org.apache.pulsar.broker.service.BrokerServiceException$TopicFencedException: Topic is temporarily unavailable"
producerID=163466 producer_name=benchmark-backup-ds5f4-1-pulsar-0-327825 topic="persistent://public/default/by-
dev-rootcoord-dml_0"

This seems to be an old issue of pulsar: apache/pulsar#5284

@XuanYang-cn
Contributor

/unassign
/assign @letian-jiang

@yanliang567
Contributor

If it is not 100% reproducible and is recoverable, degrade the priority.

@letian-jiang
Contributor

letian-jiang commented Jun 8, 2022

time="2022-05-29T10:59:50Z" level=error msg="[Failed to create producer]" error="server error: ServiceNotReady:
org.apache.pulsar.broker.service.BrokerServiceException$TopicFencedException: Topic is temporarily unavailable"
producerID=163466 producer_name=benchmark-backup-ds5f4-1-pulsar-0-327825 topic="persistent://public/default/by-
dev-rootcoord-dml_0"

This seems to be an old issue of pulsar: apache/pulsar#5284

This fenced-topic issue is more likely a Pulsar problem, and it should rarely occur unless producers and consumers are created and closed frequently within a short period of time. Possible workarounds would add substantial complexity to the existing channel management logic of Milvus.

An FAQ for this rare case will be added to the documentation. Users are advised to restart the Pulsar cluster when they hit this fencing issue.

One more thing to mention: we could increase the number of topics in the configuration. A larger pool of physical channels is expected to lower the probability of hitting this issue.

dmlChannelNum: 256 # The number of dml channels created at system startup
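
If I read the config right, this key sits under rootCoord in milvus.yaml (the key path is my assumption for this version; please double-check against the deployed config reference):

# milvus.yaml (sketch)
rootCoord:
  dmlChannelNum: 256   # number of DML channels (Pulsar topics) created at system startup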

@xiaofan-luan
Contributor

We shouldn't increase the dmlChannel number to a very large value, because topics/partitions are a limited resource for Kafka/Pulsar. The usual topic upper bound for a Kafka cluster is around 3000, so 256 sounds like a reasonable value.

Currently we don't create any physical channels once the cluster is up. The key problem is what happens during failure recovery: we first unregister a consumer and then register it on another querynode. I guess the fence is just there to ensure that two consumers don't consume under the same subscription name at the same time.

If we know that's not the case, changing the consumer to a shared subscription might be a solution, but it's a little tricky and dangerous (see the illustrative sketch below).
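
For illustration only (Milvus itself uses the Go Pulsar client; the topic and subscription names below are just examples), the exclusive vs. shared distinction looks like this with the Pulsar Python client:

import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Exclusive (the default): a second consumer on the same subscription name is rejected,
# which is the single-owner guarantee that the broker's fencing protects.
exclusive = client.subscribe(
    "persistent://public/default/by-dev-rootcoord-dml_0",
    subscription_name="querynode-sub",
    consumer_type=pulsar.ConsumerType.Exclusive,
)

# Shared: multiple consumers may attach to the same subscription at the same time,
# which is the "change the consumer to shared" option discussed above.
shared = client.subscribe(
    "persistent://public/default/by-dev-rootcoord-dml_0",
    subscription_name="querynode-sub-shared",
    consumer_type=pulsar.ConsumerType.Shared,
)

client.close()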

@xiaofan-luan
Contributor

@wangting0128 what pulsar version are we using? 2.8.2 should fix exactly the same issue.
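
One quick way to check the running broker image (the pod name is taken from the pod listing above; add the namespace flag for your cluster if needed):

kubectl get pod benchmark-backup-ds5f4-1-pulsar-broker-0 -o jsonpath='{.spec.containers[*].image}'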

@wangting0128
Contributor Author

@xiaofan-luan

Screenshot 2022-06-16 14 05 35 (image attachment)

@yanliang567
Contributor

Not reproduced recently; degrading the priority.

@yanliang567 yanliang567 removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 17, 2022
@stale

stale bot commented Jul 17, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale bot added the stale label on Jul 17, 2022
stale bot closed this as completed on Jul 24, 2022