Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

when etcd node have many http long polling connections, it may cause etcd to respond slowly to http requests. #14185

Closed
hansedong opened this issue Jun 30, 2022 · 4 comments
Labels

Comments

@hansedong
Copy link

hansedong commented Jun 30, 2022

What happened?

I have 3 etcd nodes, and clients operate etcd through http.

The operations of the client are only http requests, and the request types include GET, PUT, and long polling for Watch scenarios. Actually, the client is APISIX, and APISIX internally uses lua to operate etcd through HTTP protocol.

However, when there are many connections on the etcd node, operating the etcd node through http will be particularly slow.

Show the number of connections on 3 nodes respectively:

[root@knode10-132-15-158 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
263
[root@knode10-132-15-159 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
7
[root@knode10-132-15-160 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
98

Then, on each etcd node, query the same key by curl:

1、Request the etcd server of 10-132-15-158 node, which has the most 263 connections, the result is that the http request takes the longest (7 seconds).

[root@knode10-132-15-158 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.158:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM -k

{"header":{"cluster_id":"5357725075044320405","member_id":"16420066870069275712","revision":"2695209","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}

real	0m7.132s
user	0m0.061s
sys	0m0.063s

2、Request the etcd server of 10-132-15-159 node, which has the 9 connections, the result is that the http request takes the longest (0.097 seconds).

[root@knode10-132-15-159 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.159:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"1021713008895288175","revision":"2695484","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}

real	0m0.097s
user	0m0.053s
sys	0m0.037s

3、Request the etcd server of 10-132-15-160 node, which has the 90 connections, the result is that the http request takes the longest (0.092 seconds).

[root@knode10-132-15-160 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.160:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"2577531103840814189","revision":"2695532","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}
real	0m0.092s
user	0m0.053s
sys	0m0.034s

It can be explained above that the etcd node with the largest number of connections also takes longer to respond to http requests.

Now I'll move the connections from knode10-132-15-158 to knode10-132-15-160 and try again(actually, just let APISIX connect to node knode10-132-15-160 ).

[root@knode10-132-15-158 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
97
[root@knode10-132-15-160 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
260

Request the etcd server of node 10-132-15-158, the response cost 0.159s

[root@knode10-132-15-158 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.158:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"16420066870069275712","revision":"2697372","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}

real	0m0.159s
user	0m0.067s
sys	0m0.085s

Request the etcd server of node 10-132-15-160, the response cost 10.335s

[root@knode10-132-15-158 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.160:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"2577531103840814189","revision":"2697372","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}

real	0m10.335s
user	0m0.067s
sys	0m0.076s

I would like to ask, what is the source of this problem? what should be done to fix this problem?

What did you expect to happen?

none

How can we reproduce it (as minimally and precisely as possible)?

I'm not sure about simulating hundreds of http long polling connections with some tool.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version

etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version

etcdctl version: 3.5.4
API version: 3.5

Etcd configuration (command line flags or environment variables)

ETCD_ARGS="--cert-file=/etc/etcd/etcdSSL/servers/10.132.15.158/etcdserver-10.132.15.158.pem
--key-file=/etc/etcd/etcdSSL/servers/10.132.15.158/etcdserver-10.132.15.158-key.pem
--peer-cert-file=/etc/etcd/etcdSSL/peers/10.132.15.158/etcdpeer-10.132.15.158.pem
--peer-key-file=/etc/etcd/etcdSSL/peers/10.132.15.158/etcdpeer-10.132.15.158-key.pem
--trusted-ca-file=/etc/etcd/etcdSSL/ca.pem
--peer-trusted-ca-file=/etc/etcd/etcdSSL/ca.pem
--name etcd1
--enable-grpc-gateway
--data-dir /DATA1/etcd
--client-cert-auth
--peer-client-cert-auth
--experimental-initial-corrupt-check
--experimental-corrupt-check-time 6h
--initial-advertise-peer-urls https://10.132.15.158:2380
--listen-peer-urls https://10.132.15.158:2380
--listen-client-urls https://10.132.15.158:2379,https://127.0.0.1:2379,https://10.132.15.158:4001
--advertise-client-urls https://10.132.15.158:2379,https://10.132.15.158:4001
--initial-cluster-token etcd-apisix-product
--initial-cluster etcd1=https://10.132.15.158:2380,etcd2=https://10.132.15.159:2380,etcd3=https://10.132.15.160:2380
--initial-cluster-state new
--heartbeat-interval 2000
--election-timeout 10000
--snapshot-count 10000
--auto-compaction-mode periodic
--auto-compaction-retention 30m
--max-request-bytes 33554432
--quota-backend-bytes 34359738368
--logger zap
--log-level debug
--backend-bbolt-freelist-type=map"

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table

+------------------+---------+-------+----------------------------+-------------------------------------------------------+------------+
|        ID        | STATUS  | NAME  |         PEER ADDRS         |                     CLIENT ADDRS                      | IS LEARNER |
+------------------+---------+-------+----------------------------+-------------------------------------------------------+------------+
|  e2dda917d99636f | started | etcd2 | https://10.132.15.159:2380 | https://10.132.15.159:2379,https://10.132.15.159:4001 |      false |
| 23c53ae1f1f3c46d | started | etcd3 | https://10.132.15.160:2380 | https://10.132.15.160:2379,https://10.132.15.160:4001 |      false |
| e3dfcbd232a3ac40 | started | etcd1 | https://10.132.15.158:2380 | https://10.132.15.158:2379,https://10.132.15.158:4001 |      false |
+------------------+---------+-------+----------------------------+-------------------------------------------------------+------------+

$ etcdctl --endpoints=<member list> endpoint status -w table

+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.132.15.158:2379 | e3dfcbd232a3ac40 |   3.5.4 |   18 MB |     false |      false |         7 |    2698464 |            2698464 |        |
| https://10.132.15.159:2379 |  e2dda917d99636f |   3.5.4 |   18 MB |     false |      false |         7 |    2698464 |            2698464 |        |
| https://10.132.15.160:2379 | 23c53ae1f1f3c46d |   3.5.4 |   18 MB |      true |      false |         7 |    2698464 |            2698464 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

No response

@ahrtr
Copy link
Member

ahrtr commented Jun 30, 2022

This issue should have already been resolved by 14169, which isn't merged yet.

Can you try to reproduce this issue using above PR?

@hansedong
Copy link
Author

@ahrtr

In my usage scenario, I recompile and deploy your repo, and after about 1 hour of testing, this problem is confirmed to be resolved. thank you very much for your help.

I would like to ask, will the 3.5.5 version include this change? When will version 3.5.5 be released?

@ahrtr
Copy link
Member

ahrtr commented Jul 1, 2022

Thanks for the feedback.

Yes, the PR will be cherry picked to 3.5.

We are still working on 3.5.5, but the exact date is not clear yet so far. FYI #14138

@ahrtr
Copy link
Member

ahrtr commented Jul 13, 2022

Resolved by #14169 and #14219

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

2 participants