Try to improve tikv_client_read_timeout retry logic when 1 tikv server down #1318

Open
crazycs520 opened this issue Apr 22, 2024 · 0 comments

@crazycs520 (Contributor):

The session settings are:

  • tikv_client_read_timeout = 100
  • max_execution_time = 400
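For concreteness, here is a minimal sketch of applying these two per-session variables from Go, assuming a go-sql-driver/mysql connection to TiDB and a hypothetical DSN (TiDB speaks the MySQL wire protocol; adjust user/host/port for your cluster):

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL wire protocol
)

func main() {
	// Hypothetical DSN; adjust for your cluster.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()
	// Pin one connection so the session-scoped SETs apply to the same
	// connection that later runs the query.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Every TiKV read RPC for this session times out after 100ms...
	if _, err := conn.ExecContext(ctx, "SET tikv_client_read_timeout = 100"); err != nil {
		log.Fatal(err)
	}
	// ...and the statement as a whole is killed after 400ms.
	if _, err := conn.ExecContext(ctx, "SET max_execution_time = 400"); err != nil {
		log.Fatal(err)
	}
}
```

With these settings in place, the related logs are the following: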
[2024/04/22 17:32:40.752 +08:00] [INFO] [region_request.go:1804] ["throwing pseudo region error due to no replica available"] [conn=3787456712] [session_alias=] [req-ts=449256713093120000] [req-type=Cop] [region="{ region id: 290313, ver: 498, confVer: 209 }"] [replica-read-type=mixed] [stale-read=false] [request-sender="{rpcError:[192.168.80.3:23360](1) wait recvLoop timeout,timeout:100ms, wait_duration:4.499µs:: context deadline exceeded, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 5, cacheRegionIsValid: true, replicaStatus: [peer: 298850, store: 4, isEpochStale: false, attempts: 2, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 298992, store: 10, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 333879, store: 2, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [retry-times=4] [total-backoff-ms=2] [total-backoff-times=1] [max-exec-timeout-ms=100] [stats="Cop:{num_rpc:4, total_time:302.3ms}, rpc_errors:{data_is_not_ready:1, context deadline exceeded:3}"] [access-path="{stale_read, peer:298850, store:4, err:data_is_not_ready}, {peer:333879, store:2, err:context deadline exceeded}, {replica_read, peer:298850, store:4, err:context deadline exceeded}, {replica_read, peer:298992, store:10, err:context deadline exceeded}"] [cost=304.411625ms]
[2024/04/22 17:32:40.752 +08:00] [INFO] [coprocessor.go:1384] ["[TIME_COP_PROCESS] resp_time:304.488311ms txnStartTS:449256713093120000 region_id:290313 store_addr:192.168.80.3:23360 stats:Cop:{num_rpc:4, total_time:302.3ms}, rpc_errors:{data_is_not_ready:1, context deadline exceeded:3} region_err:epoch_not_match:<> "] [conn=3787456712] [session_alias=]
[2024/04/22 17:32:40.946 +08:00] [INFO] [region_request.go:1804] ["send request failed, err: context canceled"] [conn=3787456712] [session_alias=] [req-ts=449256713093120000] [req-type=Cop] [region="{ region id: 290313, ver: 498, confVer: 209 }"] [replica-read-type=mixed] [stale-read=false] [request-sender="{rpcError:[192.168.80.4:23360](1) context canceled, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 2, cacheRegionIsValid: true, replicaStatus: [peer: 298850, store: 4, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 298992, store: 10, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 333879, store: 2, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [retry-times=1] [total-backoff-ms=8] [total-backoff-times=3] [max-exec-timeout-ms=60000] [stats="Cop:{num_rpc:6, total_time:490.2ms}, rpc_errors:{data_is_not_ready:2, context deadline exceeded:3, context canceled:1}"] [access-path="{stale_read, peer:298850, store:4, err:data_is_not_ready}, {peer:333879, store:2, err:context canceled}"] [cost=192.074631ms]
[2024/04/22 17:32:40.947 +08:00] [INFO] [conn.go:1151] ["command dispatched failed"] [conn=3787456712] [session_alias=] [connInfo="id:3787456712, addr:192.168.80.1:48834 status:10, collation:latin1_swedish_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select pk, sk, ts, v, n from (             select pk, sk, ts, v, row_number() over (partition by pk, sk order by ts desc) as n             from t1 as of timestamp now() - interval 5 second where pk = '000000000000000000000000399911' and sk like '0000-0001-0002-0004-0008%'         ) as tt where n <= 4 order by sk asc"] [txn_mode=PESSIMISTIC] [timestamp=449256713093120000] [err="[executor:3024]Query execution was interrupted, maximum statement execution time exceeded\ngithub.com/pingcap/errors.AddStack\n\t/data/cs/code/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/errors.go:178\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/data/cs/code/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/normalize.go:175\ngithub.com/pingcap/tidb/pkg/util/sqlkiller.(*SQLKiller).HandleSignal\n\t/data/cs/tidb/pkg/util/sqlkiller/sqlkiller.go:68\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.(*executorKillerHandler).HandleSQLKillerSignal\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:251\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:409\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).fetchChunksUnparallel\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:615\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).fetchChunks\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:528\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).Next\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:277\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor.(*PipelinedWindowExec).fetchChild\n\t/data/cs/tidb/pkg/executor/pipelined_window.go:210\ngithub.com/pingcap/tidb/pkg/executor.(*PipelinedWindowExec).getRowsInPartition\n\t/data/cs/tidb/pkg/executor/pipelined_window.go:181\ngithub.com/pingcap/tidb/pkg/executor.(*PipelinedWindowExec).Next\n\t/data/cs/tidb/pkg/executor/pipelined_window.go:126\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor.(*SelectionExec).Next\n\t/data/cs/tidb/pkg/executor/executor.go:1609\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).fetchChunksUnparallel\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:615\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).fetchChunks\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:528\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).Next\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:277\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).next\n\t/data/cs/tidb/pkg/executor/adapter.go:1213\ngithub.com/pingcap/tidb/pkg/executor.(*recordSet).Next\n\t/data/cs/tidb/pkg/executor/adapter.go:159\ngithub.com/pingcap/tidb/pkg/server/internal/resultset.(*tidbResultSet).Next\n\t/data/cs/tidb/pkg/server/internal/resultset/resultset.go:64\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).writeChunks\n\t/data/cs/t
idb/pkg/server/conn.go:2316\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).writeResultSet\n\t/data/cs/tidb/pkg/server/conn.go:2259\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleStmt\n\t/data/cs/tidb/pkg/server/conn.go:2052\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleQuery\n\t/data/cs/tidb/pkg/server/conn.go:1774\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).dispatch\n\t/data/cs/tidb/pkg/server/conn.go:1348\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).Run\n\t/data/cs/tidb/pkg/server/conn.go:1114\ngithub.com/pingcap/tidb/pkg/server.(*Server).onConn\n\t/data/cs/tidb/pkg/server/server.go:739\nruntime.goexit\n\t/data/cs/go/src/runtime/asm_amd64.s:1650"]
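To make the timing against max_execution_time concrete, a quick back-of-the-envelope check using the cost values from the two log entries above:

```go
package main

import "fmt"

func main() {
	const maxExecutionTime = 400.0 // ms, session setting above

	firstRound := 304.411625  // ms, cost of round 1 ("throwing pseudo region error" log)
	secondRound := 192.074631 // ms, cost of round 2 ("context canceled" log)

	remaining := maxExecutionTime - firstRound
	fmt.Printf("budget left after round 1: %.1fms\n", remaining)   // ≈ 95.6ms
	fmt.Printf("round 2 actually ran for:  %.1fms\n", secondRound) // well past the deadline
	// The statement deadline fires during round 2, so TiDB kills the query
	// with "maximum statement execution time exceeded".
}
```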

From the logs above, we can see that the query failed because the maximum statement execution time was exceeded, and that most of the time was spent on RPCs:

  • In the first RPC request round, which cost 304ms, the log shows stats="Cop:{num_rpc:4, total_time:302.3ms}, rpc_errors:{data_is_not_ready:1, context deadline exceeded:3}", and the access path is:
    • {stale_read, peer:298850, store:4, err:data_is_not_ready},
    • {peer:333879, store:2, err:context deadline exceeded},
    • {replica_read, peer:298850, store:4, err:context deadline exceeded},
    • {replica_read, peer:298992, store:10, err:context deadline exceeded}
  • In the second RPC request round, which cost 192ms, the access path is:
    • {stale_read, peer:298850, store:4, err:data_is_not_ready}, // why retry with a stale read again? (see the sketch after this list)
    • {peer:333879, store:2, err:context canceled}
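The first line of the second round's access path is the crux: peer 298850 already returned data_is_not_ready in round 1, so a fresh stale read against it is almost certain to fail again and only burns the remaining execution budget. Below is a minimal sketch of the selection rule this suggests; the types and function are hypothetical illustrations, not client-go's actual replicaSelectorV2 code:

```go
package main

import "fmt"

// replica is a simplified stand-in for a replica's retry state. All names
// here are hypothetical, not the library's real types.
type replica struct {
	peerID         uint64
	storeID        uint64
	attempts       int
	dataIsNotReady bool // a stale read on this replica already failed with DataIsNotReady
}

// nextReplica sketches the rule: once a stale read has failed with
// DataIsNotReady on a replica, a later retry round should not attempt a
// stale read there again, but go straight to an untried replica or to a
// non-stale leader read.
func nextReplica(replicas []*replica, leaderIdx int, preferStale bool) (*replica, bool) {
	if preferStale {
		for _, r := range replicas {
			if r.attempts == 0 && !r.dataIsNotReady {
				return r, true // a stale read is still worth trying here
			}
		}
	}
	// Otherwise fall back to the leader with a normal (non-stale) read.
	return replicas[leaderIdx], false
}

func main() {
	// State after the first round in the logs above: peer 298850 returned
	// data_is_not_ready, and the other two replicas were already attempted.
	replicas := []*replica{
		{peerID: 298850, storeID: 4, attempts: 2, dataIsNotReady: true},
		{peerID: 298992, storeID: 10, attempts: 1},
		{peerID: 333879, storeID: 2, attempts: 1},
	}
	r, stale := nextReplica(replicas, 2, true) // assume peer 333879 is the leader
	fmt.Printf("round 2 first attempt: peer=%d store=%d staleRead=%v\n", r.peerID, r.storeID, stale)
	// Prints peer=333879 staleRead=false: the doomed stale-read retry on
	// peer 298850 is skipped entirely.
}
```

With a rule like this, round 2 would have gone straight to a non-stale read instead of repeating the stale read that was already known to return data_is_not_ready.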