Add pageserver component perf tests #1380
Comments
Initial experiment: I wrote 10k updates to the same row, which contains only an integer. Then I forced a checkpoint and ran get_page for 0.1% of the relevant LSNs, i.e. the LSNs where an update to that page was made.
On the plot above, we have:
This test was run on my laptop with a release build, on top of the
takeaways:
non-takeaways:
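For anyone reproducing the sampling step of the initial experiment (get_page on 0.1% of the LSNs at which the page was updated), here is a minimal sketch. The helper name `sample_lsns`, the fixed seed, and the synthetic LSN values are assumptions made for illustration; they are not taken from the actual test code.

```python
import random


def sample_lsns(update_lsns, fraction=0.001, seed=0):
    """Pick a reproducible random subset of LSNs to query (hypothetical helper)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not update_lsns:
        return []
    # Always query at least one LSN, even for very short histories.
    count = max(1, round(len(update_lsns) * fraction))
    # A seeded generator keeps the sample identical across runs.
    rng = random.Random(seed)
    return sorted(rng.sample(update_lsns, count))


# Synthetic history: 10k updates, each advancing the LSN by 64 bytes.
# These values are made up; real LSNs would come from the workload itself.
update_lsns = [0x1000000 + 64 * i for i in range(10_000)]
sampled = sample_lsns(update_lsns)
```

Fixing the seed keeps the sampled LSNs the same from run to run, so differences between runs are more likely to reflect the pageserver than the choice of sample.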
|
Not needed for 0.3 milestone. After some thinking I realized that it's not clear how probable it is that we'll ever have two compute nodes share a pageserver. A read node in a different region will require a local pageserver, since the safekeeper-pageserver connection is the only one that can afford latency. We might only need random LSN reads in a few niche cases:
|
Resuming work on this issue. Random historical queries may not be important, so I will focus only on tests with more immediate impact:
|
I tested:

workload 1: pgbench small
command: pgbench -s5 -i
results: Total pages: 9742

workload 2: pgbench big
command: pgbench -s100 -i
results: Total pages: 193366

workload 3: pgbench big and long
command: pgbench -s100 -i
results: Total pages: 200609

workload 4: 100k updates to tiny table
results: Total pages: 27

Not sure how the cache is configured, and how that might obscure results I'd get on a bigger database.

Now working on getting this test merged into main so others can play with it.

Clarification: I'm not measuring the get_page latency that the compute node experienced. I'm measuring the get_page latency of sending direct pagestream api requests to the pageserver after pgbench is done. I'm sending one request at a time.
|
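The measurement described in the clarification (one request in flight at a time, timed from the client side) can be sketched roughly as below. This is an illustration only: `send_request` stands in for whatever issues a pagestream get_page request, and `fake_get_page` is a hypothetical stub with no connection to the real pageserver API.

```python
import statistics
import time


def measure_latencies(send_request, requests):
    """Send requests strictly one at a time and record wall-clock latency in seconds."""
    latencies = []
    for request in requests:
        start = time.perf_counter()
        send_request(request)  # assumed to block until the response arrives
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Summarize a non-empty latency list; p99 uses the nearest-rank method."""
    ordered = sorted(latencies)
    n = len(ordered)
    p99_rank = (99 * n + 99) // 100  # integer ceil(0.99 * n)
    return {
        "count": n,
        "median": statistics.median(ordered),
        "p99": ordered[p99_rank - 1],
        "max": ordered[-1],
    }


def fake_get_page(request):
    # Hypothetical stub: returns a zeroed 8 KiB page instead of calling a server.
    return b"\x00" * 8192


# Requests are modelled here as (page number, LSN) pairs, which is an assumption.
requests = [(page, 0x1000000) for page in range(100)]
summary = summarize(measure_latencies(fake_get_page, requests))
```

Because requests are issued sequentially, the numbers describe single-request latency rather than throughput under concurrency.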
This is stale, the perf testing story has moved on. |
Motivation:
We're currently only testing the pageserver as a part of a zenith cluster with 1 writer node and 0 reader nodes. But we want our work on the storage format to be able to support our future use cases without another major redesign, so we should test how it performs on workloads that will be required of it in the near future, like get_page_at_lsn for non-latest pages.
The scope of this issue is to create a minimal test suite for measuring get_page performance in isolation. That would help with perf work on the following issues:
I have some preliminary results, will share below.