stale read QPS is jittery #974

Open

crazycs520 (Contributor) opened this issue Sep 14, 2023 · 0 comments

I found this issue in #970

Topology

Version: nightly

+------+--------------------+--------------------+--------------------------+------------------------------------------+---------------------------+---------------------+-----------+
| TYPE | INSTANCE           | STATUS_ADDRESS     | VERSION                  | GIT_HASH                                 | START_TIME                | UPTIME              | SERVER_ID |
+------+--------------------+--------------------+--------------------------+------------------------------------------+---------------------------+---------------------+-----------+
| tidb | 192.168.78.3:23400 | 192.168.78.3:23080 | 7.4.0-alpha-386-gf1067a1 | f1067a1b761d5bae74510fc715f28814a633352d | 2023-09-13T11:23:19+08:00 | 1h8m5.142420463s    | 634       |
| tidb | 192.168.78.4:23400 | 192.168.78.4:23080 | 7.4.0-alpha-386-gf1067a1 | f1067a1b761d5bae74510fc715f28814a633352d | 2023-09-13T11:23:42+08:00 | 1h7m42.142422497s   | 970       |
| tidb | 192.168.78.2:23400 | 192.168.78.2:23080 | 7.4.0-alpha-386-gf1067a1 | f1067a1b761d5bae74510fc715f28814a633352d | 2023-09-13T11:22:58+08:00 | 1h8m26.1424236s     | 559       |
| pd   | 192.168.78.3:23379 | 192.168.78.3:23379 | 7.4.0-alpha              | 58113f8d17ccec4690445dad2e47cadbacd42da3 | 2023-09-12T15:45:50+08:00 | 20h45m34.142424456s | 0         |
| pd   | 192.168.78.2:23379 | 192.168.78.2:23379 | 7.4.0-alpha              | 58113f8d17ccec4690445dad2e47cadbacd42da3 | 2023-09-12T15:45:50+08:00 | 20h45m34.142425424s | 0         |
| pd   | 192.168.78.4:23379 | 192.168.78.4:23379 | 7.4.0-alpha              | 58113f8d17ccec4690445dad2e47cadbacd42da3 | 2023-09-12T15:45:50+08:00 | 20h45m34.142426465s | 0         |
| tikv | 192.168.80.4:23360 | 192.168.80.4:23380 | 7.4.0-alpha              | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142427254s | 0         |
| tikv | 192.168.80.3:23360 | 192.168.80.3:23380 | 7.4.0-alpha              | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142427995s | 0         |
| tikv | 192.168.80.2:23361 | 192.168.80.2:23381 | 7.4.0-alpha              | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142428715s | 0         |
| tikv | 192.168.80.2:23360 | 192.168.80.2:23380 | 7.4.0-alpha              | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142429364s | 0         |
| tikv | 192.168.80.3:23361 | 192.168.80.3:23381 | 7.4.0-alpha              | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142431041s | 0         |
| tikv | 192.168.80.4:23361 | 192.168.80.4:23381 | 7.4.0-alpha              | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142431848s | 0         |
+------+--------------------+--------------------+--------------------------+------------------------------------------+---------------------------+---------------------+-----------+
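For reference, this topology looks like the output of the cluster_info system table and can be reproduced from any TiDB node (SQL port 23400 taken from the INSTANCE column above; the root user is an assumption):

mysql -h 192.168.78.2 -P 23400 -u root -e "SELECT * FROM information_schema.cluster_info;"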

Inject network latency

Inject a 1000ms network delay from all TiDB nodes to 192.168.80.2, which hosts the TiKV instances 192.168.80.2:23360 and 192.168.80.2:23361.
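Below is a minimal sketch of how such a delay can be injected with tc/netem on each TiDB node; the interface name eth0 is an assumption, and the actual injection method used in this test may differ:

# delay only traffic destined to 192.168.80.2 by ~1000ms (run on each TiDB node)
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1000ms
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.80.2/32 flowid 1:3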

[root@192.168.78.2]# ping 192.168.80.2
PING 192.168.80.2 (192.168.80.2) 56(84) bytes of data.
64 bytes from 192.168.80.2: icmp_seq=1 ttl=62 time=546 ms
64 bytes from 192.168.80.2: icmp_seq=2 ttl=62 time=217 ms
64 bytes from 192.168.80.2: icmp_seq=3 ttl=62 time=145 ms
64 bytes from 192.168.80.2: icmp_seq=4 ttl=62 time=500 ms
64 bytes from 192.168.80.2: icmp_seq=5 ttl=62 time=1091 ms

[root@192.168.78.2]# ping 192.168.80.3
PING 192.168.80.3 (192.168.80.3) 56(84) bytes of data.
64 bytes from 192.168.80.3: icmp_seq=1 ttl=62 time=0.086 ms
64 bytes from 192.168.80.3: icmp_seq=2 ttl=62 time=0.079 ms
64 bytes from 192.168.80.3: icmp_seq=3 ttl=62 time=0.082 ms


[root@192.168.78.2]# ping 192.168.80.4
PING 192.168.80.4 (192.168.80.4) 56(84) bytes of data.
64 bytes from 192.168.80.4: icmp_seq=1 ttl=62 time=0.088 ms
64 bytes from 192.168.80.4: icmp_seq=2 ttl=62 time=0.072 ms
64 bytes from 192.168.80.4: icmp_seq=3 ttl=62 time=0.058 ms

The latency from 192.168.78.3 and 192.168.78.4 is the same as above.

Test

Load: stale-read request
tidb_kv_read_timeout: 100
max_execution_time: 400

sysbench --config-file=sysbench.conf mussel_minimal.lua --threads=40 --read_staleness=-10 --kv_read_timeout=100 --max_execution_time=400 --mysql-ignore-errors=all run

Additionally, run extra load on TiDB 192.168.78.2:

sysbench --config-file=sysbench.conf2 mussel_minimal.lua --threads=100 --read_staleness=-10 --kv_read_timeout=100 --max_execution_time=400 --mysql-ignore-errors=all run
$ cat sysbench.conf
mysql-host=192.168.78.2,192.168.78.3,192.168.78.4
...
...

$ cat sysbench.conf2
mysql-host=192.168.78.2
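For clarity, the per-session settings the workload above relies on can also be set directly on a TiDB connection; the mapping from the lua options above to these variables, the sbtest1 table, and the root user are assumptions:

mysql -h 192.168.78.2 -P 23400 -u root -e "
  SET SESSION tidb_read_staleness = -10;   -- read data that is up to ~10s stale
  SET SESSION tidb_kv_read_timeout = 100;  -- per KV read request timeout, in ms
  SET SESSION max_execution_time = 400;    -- overall statement timeout, in ms
  SELECT c FROM sbtest1 WHERE id = 1;      -- stale-read point query"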

The following are the related metrics; some of them are not as expected:

  • QPS jitters frequently. This is caused by DataNotReady backoff.
    [QPS and backoff metrics screenshots]
  • The DataNotReady error is caused by a lock. The following TiKV log shows it:
    [TiKV log screenshot]
  • Picking the ts 444257405948395529 from the log above and searching the TiDB slow query log for it (see the query sketch after this list) shows it belongs to a TiDB internal query from the statistics module.
  • This Insert statement executes far too slowly, and the transaction prewrite (27.7s) and commit (1.2min) phases are very slow as well.
    [slow query and transaction metrics screenshots]
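A minimal sketch of that lookup, assuming the slow query is fetched from the cluster_slow_query system table on any TiDB node (SQL port taken from the topology above):

mysql -h 192.168.78.2 -P 23400 -u root -e "
  SELECT query, query_time, prewrite_time, commit_time, is_internal
  FROM information_schema.cluster_slow_query
  WHERE txn_start_ts = 444257405948395529\G"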

The related slow log is as follows:

# Time: 2023-09-15T00:08:07.852305922+08:00
# Txn_start_ts: 444257405948395529
# Query_time: 176.126226988
# Parse_time: 0
# Compile_time: 0.000114942
# Rewrite_time: 0.000026448
# Optimize_time: 0.000279455
# Wait_TS: 0.000249822
# Prewrite_time: 27.688009498 Commit_time: 74.951987366 Get_commit_ts_time: 0.000345107 Commit_backoff_time: 0.002 Commit_Backoff_types: [regionMiss] Slowest_prewrite_rpc_detail: {total:27.688s, region_id: 10837, store: 192.168.80.2:23360, tikv_wall_time: 862.5µs, scan_detail: {get_snapshot_time: 14.5µs, rocksdb: {block: {cache_hit_count: 4}}}, write_detail: {store_batch_wait: 33.5µs, propose_send_wait: 0s, persist_log: {total: 106.3µs, write_leader_wait: 221ns, sync_log: 53.5µs, write_memtable: 2.766µs}, commit_log: 400.7µs, apply_batch_wait: 30.8µs, apply: {total:129.7µs, mutex_lock: 0s, write_leader_wait: 0s, write_wal: 31.3µs, write_memtable: 19.2µs}}} Commit_primary_rpc_detail: {total:27.688s, region_id: 10837, store: 192.168.80.2:23360, tikv_wall_time: 862.5µs, scan_detail: {get_snapshot_time: 14.5µs, rocksdb: {block: {cache_hit_count: 4}}}, write_detail: {store_batch_wait: 33.5µs, propose_send_wait: 0s, persist_log: {total: 106.3µs, write_leader_wait: 221ns, sync_log: 53.5µs, write_memtable: 2.76µs}, commit_log: 400.7µs, apply_batch_wait: 30.8µs, apply: {total:129.7µs, mutex_lock: 0s, write_leader_wait: 0s, write_wal: 31.3µs, write_memtable: 19.2µs}}} Write_keys: 1 Write_size: 66 Prewrite_region: 1
# Is_internal: true
# Digest: 372e2010ba3b46891b9c8ec8b2264af8cd2d88aa657b800cece6decc96e1d917
# Num_cop_tasks: 0
# Mem_max: 8447
# Prepared: false
# Plan_from_cache: false
# Plan_from_binding: false
# Has_more_results: false
# KV_total: 176.122251768
# PD_total: 0.000593377
# Backoff_total: 0.002
# Write_sql_response_total: 0
# Result_rows: 0
# Succ: true
# IsExplicitTxn: false
# IsSyncStatsFailed: false
# Plan: tidb_decode_plan('4AnwTDAJMjdfMQkwCTAJTi9BCTAJdGltZToxbTEzLjVzLCBsb29wczoxLCBwcmVwYXJlOiAxN8K1cywgY2hlY2tfaW5zZXJ0OiB7dG90YWxfBTwAIBU9CG1lbQ0iAF8NGgwwMS43BT8ccHJlZmV0Y2gdLkhycGM6e0JhdGNoR2V0OntudW1fARMIMiwgHV0NXCx9LCB0aWt2X3dhbGwNXEQyLjltcywgc2Nhbl9kZXRhaWwVlzRwcm9jZXNzX2tleXM6IBVRLhcAJF9zaXplOiAxNjcRbxUtKGdldF9zbmFwc2hvFcgIOS4yBcfwRnJvY2tzZGI6IHtibG9jazoge2NhY2hlX2hpdF9jb3VudDogN319fX19LCBjb21taXRfdHhuOiB7cHJld3JpdGU6MjcuN3MsBWQRIRhzOjM0NS4xBWAJFAH2IY0YYmFja29mZgHYFGltZTogMgHxCSN4IHR5cGU6IFtyZWdpb25NaXNzXX0sIHNsb3dlc3RfcA1vJVEEIHslUCg6IDI3LjY4OHMsIAk1OF9pZDogMTA4MzcsIHN0byXsQDkyLjE2OC44MC4yOjIzMzYwRngBEDg2Mi41Ba02ewFONwEANA0qijcBBDR9ITUlJTnPBZcEX2JBJCBfd2FpdDogMzMNUyxwcm9wb3NlX3NlbmQNHDgwcywgcGVyc2lzdF9sb2cZ+RAxMDYuMwWzCWAUbGVhZGVyDTYMMjIxbkFICHluYwk2ADURYwktGG1lbXRhYmwhfxAuNzbCtUGMKYQJLwg0MDBN5BBhcHBseTauAAgwLjgFeAUbFZEIMTI5DTMUbXV0ZXhfSUsBvwlzMqAAGRcYd2FsOiAzMTLLABmeTaQlRw2gGHByaW1hcnk2BgIQMTQuOTRBhPIGAhAxOTguNAXbhgYCADYts6IGAgB9VaMUbnVtOjEsLRdlohUOWGJ5dGU6NjZ9CTI1NSBCeXRlcwlOL0EK')
# Binary_plan: tidb_decode_binary_plan('6gnwRgrlCQoISW5zZXJ0XzE4AUABUgNOL0FaFXRpbWU6MW0xMy41cywgbG9vcHM6MWK/AnByZXBhcmU6IDE3wrVzLCBjaGVja19pBUEgOiB7dG90YWxfBT0AIBU+EG1lbV9pBSIRGgwwMS43BT8ccHJlZmV0Y2gdLkhycGM6e0JhdGNoR2V0OntudW1fARMIMiwgHV0NXCh9LCB0aWt2X3dhbBF2RDIuOW1zLCBzY2FuX2RldGFpbBWXNHByb2Nlc3Nfa2V5czogFVEuFwAkX3NpemU6IDE2NxFvFS0oZ2V0X3NuYXBzaG8VyAg5LjIFx/BPcm9ja3NkYjoge2Jsb2NrOiB7Y2FjaGVfaGl0X2NvdW50OiA3fX19fX1i6AZjb21taXRfdHhuOiB7cHJld3JpdGU6MjcuN3MsIGdldF9jb20FIRhzOjM0NS4xBWEJFAH3KDVzLCBiYWNrb2ZmAdkUaW1lOiAyAfIJI3QgdHlwZTogW3JlZ2lvbk1pc3NdfSwgc2xvd2VzdF8RbyVSLa4oOiAyNy42ODhzLCAJNThfaWQ6IDEwODM3LCBzdG8l7UA5Mi4xNjguODAuMjoyMzM2MEZ5ARA4NjIuNQWtNnwBTjgBADQNKoo4ARg0fX19LCB3ISU50AWXBF9iQSUgX3dhaXQ6IDMzDVMscHJvcG9zZV9zZW5kDRw4MHMsIHBlcnNpc3RfbG9nGfkQMTA2LjMFswlgFGxlYWRlcg02DDIyMW5BSQh5bmMJNgA1EWMJLRhtZW10YWJsIX8QLjc2wrVBjSmECS8INDAwTeUQYXBwbHk2rgAIMC44BXgFGxWRCDEyOQ0zFG11dGV4X0lMAb8JczKgABkXGHdhbDogMzEyywAZnk2lJUcNoBhwcmltYXJ5NgYCGDE0Ljk0N3P6BgIQMTk4LjQF24YGAgA2LbOiBgIAfRWdEG51bToxURdloxUOXGJ5dGU6NjZ9cP8BeP///////////wEYAQ==')
insert into mysql.tidb (variable_name, variable_value) values ('tidb_stats_gc_last_ts', 444257236350140416) on duplicate key update variable_value = 444257236350140416;

The slow query execution plan is as follows:

	id      	task	estRows	operator info	actRows	execution info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  	memory   	disk
	Insert_1	root	0      	N/A          	0      	time:1m13.5s, loops:1, prepare: 17µs, check_insert: {total_time: 1m13.5s, mem_insert_time: 101.7µs, prefetch: 1m13.5s, rpc:{BatchGet:{num_rpc:2, total_time:1m13.5s}, tikv_wall_time: 2.9ms, scan_detail: {total_process_keys: 2, total_process_keys_size: 167, total_keys: 2, get_snapshot_time: 19.2µs, rocksdb: {block: {cache_hit_count: 7}}}}}, commit_txn: {prewrite:27.7s, get_commit_ts:345.1µs, commit:1m15s, backoff: {time: 2ms, commit type: [regionMiss]}, slowest_prewrite_rpc: {total: 27.688s, region_id: 10837, store: 192.168.80.2:23360, tikv_wall_time: 862.5µs, scan_detail: {get_snapshot_time: 14.5µs, rocksdb: {block: {cache_hit_count: 4}}}, write_detail: {store_batch_wait: 33.5µs, propose_send_wait: 0s, persist_log: {total: 106.3µs, write_leader_wait: 221ns, sync_log: 53.5µs, write_memtable: 2.76µs}, commit_log: 400.7µs, apply_batch_wait: 30.8µs, apply: {total:129.7µs, mutex_lock: 0s, write_leader_wait: 0s, write_wal: 31.3µs, write_memtable: 19.2µs}}}, commit_primary_rpc: {total: 14.947s, region_id: 10837, store: 192.168.80.2:23360, tikv_wall_time: 198.4µs, scan_detail: {get_snapshot_time: 16.5µs, rocksdb: {block: {cache_hit_count: 4}}}, }, region_num:1, write_keys:1, write_byte:66}	255 Bytes	N/A

Then I added some metrics (#973) to the batch client; it turns out the batch client is the root cause of the Insert statement executing so slowly.

In the following metric, since I injected around 1 second of network latency between 192.168.78.2 and 192.168.80.2, the max batch conn send duration reaches 15s.

[batch conn send duration metric screenshot]

In the following metric, the max got-resp duration is 1.14min and the max wait-to-send duration is 16s, which is unexpected.

[batch client got-resp and wait-to-send metrics screenshot]