High P99 Latency with tsoRequestDispatcher #1257

Open
sptuan opened this issue Apr 1, 2024 · 0 comments
sptuan commented Apr 1, 2024

Hi everyone. Thank you for your contributions to tikv-client. We have encountered a performance problem that seems peculiar. Here are the details.

If our understanding is incorrect, we would appreciate your guidance and correction.

Background

Our project uses tikv as a distributed KV engine to build our metadata service. We have observed a significant rise in P99 latency during peak usage (tikv region count ~5,000,000).

We tried many optimizations on the tikv server and OS, including adjustments to grpc-related settings, raft thread control, and rocksdb configurations. However, the improvements were not satisfactory. We sought advice from the community, as mentioned in https://asktug.com/t/topic/1011036.

Then we discovered, somewhat by accident, that scaling out the number of instances of our own service (i.e., of tikv-client) significantly improved system throughput and latency.

However, we are puzzled: why does scaling horizontally prove effective despite seemingly low resource utilization? Is it possible that individual tikv-client instances have some sort of bottleneck (such as a lock) limiting their capacity?

We were already running 10 instances on 10 64-core bare-metal servers. Our tikv-client version is a bit older (2.0.0-rc), but we have not noticed any changes in this regard in newer versions.

Source Code

[screenshot: tsoRequestDispatcher source code with the handling steps annotated]

Each batch of TSO (Timestamp Oracle) get requests contains at most 10,000 requests, and the size of the tsoRequestCh channel is set to 20,000. A single goroutine in handlerDispatcher sequentially handles all requests (steps 2, 3, 4, and 5 in the screenshot above).

When there is a large number of TSO get requests, this can become a performance bottleneck (a simplified sketch follows the list below) due to:

  • Merging thousands of requests for sequential processing.
  • Synchronously waiting for stream send and recv operations.
  • Sequentially invoking callback functions for req.done.
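
To make the pattern concrete, below is a minimal, self-contained Go sketch of a single-goroutine batching dispatcher of the kind described above. This is not the actual pd client code; the names (tsoRequest, dispatcher, callTSO) and the simulated 1ms RPC are illustrative assumptions only.

```go
package main

import (
	"fmt"
	"time"
)

// tsoRequest is a hypothetical stand-in for the client's TSO request object.
type tsoRequest struct {
	start time.Time
	done  chan struct{} // closed when the timestamp has been delivered
}

const (
	maxBatchSize = 10000 // max requests merged into one RPC (per the description above)
	chanSize     = 20000 // size of the request channel (tsoRequestCh in the client)
)

// dispatcher drains the channel with a single goroutine: it merges pending
// requests into one batch, performs one synchronous "RPC", then invokes
// every callback sequentially (the three steps listed above).
func dispatcher(reqCh <-chan *tsoRequest) {
	for first := range reqCh {
		batch := []*tsoRequest{first}
		// 1. Merge whatever else is already queued, up to maxBatchSize.
	collect:
		for len(batch) < maxBatchSize {
			select {
			case r := <-reqCh:
				batch = append(batch, r)
			default:
				break collect
			}
		}
		// 2. One synchronous send/recv round trip for the whole batch (simulated).
		callTSO(len(batch))
		// 3. Sequentially finish every request; slow callbacks delay the next
		//    batch, which is where tail latency can accumulate.
		for _, r := range batch {
			close(r.done)
		}
	}
}

// callTSO stands in for the stream.Send/stream.Recv round trip to PD (~1ms).
func callTSO(n int) {
	_ = n
	time.Sleep(time.Millisecond)
}

func main() {
	reqCh := make(chan *tsoRequest, chanSize)
	go dispatcher(reqCh)

	req := &tsoRequest{start: time.Now(), done: make(chan struct{})}
	reqCh <- req
	<-req.done
	fmt.Println("waited:", time.Since(req.start))
}
```

In this shape, every phase of every batch runs on the same goroutine, so a slow round trip or slow callbacks directly delay all requests queued behind the current batch.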

Discovery

[screenshot: TSO request latency breakdown by stage]

We observed the following metrics in tikv-client:

[screenshot: tikv-client TSO metrics]

handle_requests_duration

pd_client_request_handle_requests_duration_seconds_bucket{type="tso"}

This is the duration of the pure TSO (Timestamp Oracle) stream.send and stream.recv operations, i.e., the latency of a single RPC request to PD for TSO.
This latency stays consistently around 1ms, regardless of scaling. It can be used to assess fluctuations in the network between the client and PD, or high load on PD.
It corresponds to the yellow section in the graph.

handle_cmds_duration

pd_client_cmd_handle_cmds_duration_seconds_bucket{type="wait"}

This is the time from when a request is handed to the dispatcher until the blocked caller receives its response, i.e., the full client-side wait. This latency fluctuates significantly and decreases after scaling out.

It corresponds to the green section in the graph.
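
To make the relationship between the two metrics concrete, here is a small dependency-free Go sketch of how we understand the two spans to nest (the sleep durations are placeholder numbers, not measurements, and the real pd client instrumentation may differ): the "tso" duration covers only the send/recv round trip, while the "wait" duration covers the whole time the caller is blocked, including queueing in tsoRequestCh, batching, and the done callback.

```go
package main

import (
	"fmt"
	"time"
)

// Rough model of one TSO request's lifecycle, mapping each phase to the
// metric it would feed. All sleeps are illustrative numbers, not measurements.
func main() {
	waitStart := time.Now() // pd_client_cmd_handle_cmds_duration_seconds{type="wait"} starts here

	// The request sits in tsoRequestCh until the single dispatcher goroutine
	// picks it up and merges it into a batch.
	time.Sleep(3 * time.Millisecond) // queueing + batching (grows under load)

	rpcStart := time.Now() // pd_client_request_handle_requests_duration_seconds{type="tso"} starts here
	time.Sleep(1 * time.Millisecond) // stream.send + stream.recv to PD (~1ms, stable)
	rpcDuration := time.Since(rpcStart)

	// The dispatcher walks the batch and runs req.done callbacks sequentially;
	// only then is this caller unblocked.
	time.Sleep(2 * time.Millisecond) // callback fan-out
	waitDuration := time.Since(waitStart)

	fmt.Printf("tso  (RPC only):   %v\n", rpcDuration)
	fmt.Printf("wait (end to end): %v\n", waitDuration)
}
```

Under load, the queueing and callback portions grow while the RPC portion stays near 1ms, which matches the gap we see between the two histograms.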

As the graph shows, we scaled our service (with tikv-client) to 20 instances. Scaling brought a significant improvement in the red (waiting for the TSO request) and purple (callback req.done) sections. We did not scale tikv-server/pd.

Here are some other metrics:
[screenshot: additional tikv-client metrics]

Questions

We would like to discuss:

  • Why is only one goroutine used to process all TSO requests in tikv-client? Does the order in which TSO requests are collected from the Go channel need to be maintained, and if so, why is it necessary to preserve that order?
  • Collecting TSO requests and invoking the done callbacks do not look like heavy work. Is there any idea why they show such high P99 latency? The maximum TSO QPS appears to be about 5,000 per tikv-client instance.
  • Are there any best practices for the deployment scale of tikv/tidb clients? For example, one instance per 64-core bare-metal server does not seem good enough, while 4 instances on a 64-core server seems better.

Please let me know if you need any further information. Thank you for your kind help.
