Add pageserver component perf tests #1380

Closed
Tracked by #4754 ...
bojanserafimov opened this issue Mar 18, 2022 · 5 comments
Labels: a/benchmark (Area: related to benchmarking), a/performance (Area: relates to performance of the system), a/test (Area: related to testing), c/storage/pageserver (Component: storage: pageserver)

Comments

@bojanserafimov
Contributor

Motivation:
We're currently only testing the pageserver as a part of a zenith cluster with 1 writer node and 0 reader nodes. But we want our work on the storage format to be able to support our future use cases without another major redesign, so we should test how it performs on workloads that will be required of it in the near future, like get_page_at_lsn for non-latest pages.

The scope of this issue is to create a minimal test suite for measuring get_page performance in isolation. That would help with perf work on the following issues:

I have some preliminary results, will share below.

@bojanserafimov bojanserafimov added c/storage/pageserver Component: storage: pageserver a/test Area: related to testing a/performance Area: relates to performance of the system a/benchmark Area: related to benchmarking labels Mar 18, 2022
@bojanserafimov bojanserafimov self-assigned this Mar 18, 2022
@bojanserafimov
Contributor Author

[Plot: get_page runtime versus LSN]

Initial experiment: I wrote 10k updates to the same row, which contains only an integer. Then I forced a checkpoint and ran get_page for 0.1% of the relevant LSNs (those at which an update to that page was made). On the plot above, we have:
x: LSN
y: get_page runtime in microseconds
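
A minimal sketch of the sample-and-time loop described above, written in Python rather than taken from the actual test branch. `fake_get_page` is a hypothetical stand-in for a real pageserver request, and LSNs are modelled as plain integers; both are assumptions made only for illustration.

```python
import random
import time


def sample_lsns(lsns, fraction, seed=0):
    """Pick a reproducible random subset of LSNs (at least one)."""
    rng = random.Random(seed)
    count = max(1, round(len(lsns) * fraction))
    return sorted(rng.sample(lsns, count))


def time_get_page(get_page, lsns):
    """Issue one request at a time; return (lsn, elapsed_microseconds) pairs."""
    results = []
    for lsn in lsns:
        start = time.perf_counter()
        get_page(lsn)
        elapsed_us = (time.perf_counter() - start) * 1e6
        results.append((lsn, elapsed_us))
    return results


def fake_get_page(lsn):
    # Hypothetical stand-in: a real test would query the pageserver here.
    return b"\x00" * 8192


# 10k update LSNs sampled at 0.1% gives 10 timed requests.
all_lsns = list(range(1, 10_001))
sampled = sample_lsns(all_lsns, 0.001)
timings = time_get_page(fake_get_page, sampled)
```

Plotting the first element of each pair on x and the second on y would give the same axes as the plot.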

This test was run on my laptop with a release build, on top of the heikki-kvstore branch. To reproduce, here's my wip branch.

takeaways:

  • We can easily get 140ms latency for a get_page request. Not good, but fixable.
  • Latency grows linearly with the number of updates, and there's room for more updates before the pageserver takes an image.
  • NOTE: This test doesn't stress disk lookup time, since all the relevant WAL is in the latest layer. But the test suite is capable of testing this, and that's what I'm planning to try next.

non-takeaways:

  • Average redo time was 1.4 microseconds, but I'm only measuring one kind of WAL entry, so don't take this number seriously at all.
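
One way to picture the linear growth noted in the takeaways is a toy model in which a page is stored as a base image plus a chain of deltas, and a read replays every delta up to the requested LSN. The sketch below is an assumption made for illustration, not the pageserver's actual layer code; `ToyPageHistory` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPageHistory:
    """Toy model: one base image plus WAL-like deltas, ordered by LSN."""
    base_value: int = 0
    deltas: list = field(default_factory=list)  # (lsn, increment) pairs

    def append(self, lsn, increment):
        self.deltas.append((lsn, increment))

    def get_page_at_lsn(self, lsn):
        """Replay every delta up to lsn; return (value, records_replayed)."""
        value = self.base_value
        replayed = 0
        for delta_lsn, increment in self.deltas:
            if delta_lsn > lsn:
                break
            value += increment
            replayed += 1
        return value, replayed


history = ToyPageHistory()
for lsn in range(1, 10_001):  # 10k updates to the same row
    history.append(lsn, 1)

for lsn in (100, 1_000, 10_000):
    value, replayed = history.get_page_at_lsn(lsn)
    print(f"lsn={lsn}: replayed {replayed} records")
```

In this model the work per read is proportional to the number of deltas since the base image, so materializing a fresh image would reset the cost.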

@stepashka stepashka added this to the 0.3 Towards Tech Prev milestone Mar 21, 2022
@bojanserafimov
Contributor Author

Not needed for 0.3 milestone. After some thinking I realized that it's not clear how probable it is that we'll ever have two compute nodes share a pageserver. A read node in a different region will require a local pageserver, since the safekeeper-pageserver connection is the only one that can afford latency. We might only need random LSN reads in a few niche cases:

  1. Read-only replica for stricter isolation of readers for security. Useful for sharing data across organizations
  2. Time travel queries
  3. ???

@bojanserafimov bojanserafimov removed this from the 0.3 Towards Tech Prev milestone Mar 23, 2022
@bojanserafimov bojanserafimov removed their assignment Mar 23, 2022
@bojanserafimov
Contributor Author

Resuming work on this issue. Random historical queries are maybe not important, so I will only focus on more immediately impactful tests:

  1. get_page performance for latest pages
  2. get_page performance from recent WAL

@bojanserafimov bojanserafimov self-assigned this Apr 5, 2022
@bojanserafimov
Contributor Author

bojanserafimov commented Apr 6, 2022

I tested get_page performance for the latest page for each key. To initialize some pageserver state, I used four different workloads:

workload 1: pgbench small

pgbench -s5 -i
pgbench -c1 -t5000

results

Total pages: 9742
Fastest: 19.925µs
Median: 44.113µs
99th percentile: 169.875µs
Slowest: 6.017257ms

workload 2: pgbench big

pgbench -s100 -i
pgbench -c1 -t100000

results

Total pages: 193366
Fastest: 17.465µs
Median: 33.979µs
99th percentile: 125.403µs
Slowest: 19.226091ms

workload 3: pgbench big and long

pgbench -s100 -i
pgbench -c1 -t1000000

results

Total pages: 200609
Fastest: 20.031µs
Median: 45.195µs
99th percentile: 103.583µs
Slowest: 38.300479ms

workload 4: 100k updates to tiny table

Total pages: 27
Fastest: 29.431µs
Median: 45.56µs
99th percentile: 865.437539ms
Slowest: 865.437539ms

I'm not sure how the cache is configured, or how that might obscure the results I'd get on a bigger database. I'm now working on getting this test merged into main so others can play with it.

Clarification: I'm not measuring the get_page latency that the compute node experienced. I'm measuring the latency of direct pagestream API requests sent to the pageserver after pgbench is done, one request at a time.
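
For readers who want to recompute summary lines like the ones in the results above from raw per-request latencies, here is a minimal Python sketch. The thread does not say how the percentiles were computed; the nearest-rank definition used here is an assumption, and the sample data is made up. Under that definition, with only 27 samples the 99th percentile lands on the largest sample, which would be consistent with the workload 4 figures, where the 99th percentile and the slowest request coincide.

```python
import math


def nearest_rank(sorted_values, percent):
    """Nearest-rank percentile: smallest value with at least percent% at or below it."""
    rank = max(1, math.ceil(percent / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_us):
    """Summarize per-request latencies (microseconds) into the reported fields."""
    ordered = sorted(latencies_us)
    return {
        "total": len(ordered),
        "fastest": ordered[0],
        "median": nearest_rank(ordered, 50),
        "p99": nearest_rank(ordered, 99),
        "slowest": ordered[-1],
    }


# Made-up latencies in microseconds: 26 fast requests and one slow outlier.
sample = [30.0 + i for i in range(26)] + [865_000.0]
summary = summarize(sample)
print(summary)
```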

@jcsp
Contributor

jcsp commented Feb 5, 2024

This is stale, the perf testing story has moved on.

@jcsp jcsp closed this as completed Feb 5, 2024