Add pageserver component perf tests #1380

bojanserafimov · 2022-03-18T20:51:55Z

Motivation:
We're currently only testing the pageserver as a part of a zenith cluster with 1 writer node and 0 reader nodes. But we want our work on the storage format to be able to support our future use cases without another major redesign, so we should test how it performs on workloads that will be required of it in the near future, like get_page_at_lsn for non-latest pages.

The scope of this issue is to create a minimal test suite for measuring get_page performance in isolation. That would help with perf work on the following issues:

Improve heuristics for compaction #1357
Take page images in pageserver #1142
maybe Increase WAL redo speed #1339 (maybe not but it's a relevant issue)

I have some preliminary results, will share below.

bojanserafimov · 2022-03-18T21:10:08Z

Initial experiment: I applied 10k updates to a single row that contains only an integer. Then I forced a checkpoint and ran get_page at 0.1% of the relevant LSNs, i.e. the LSNs where an update to that page was made. On the plot, we have:
x: LSN
y: get_page runtime in microseconds

This test was run on my laptop with a release build, on top of the heikki-kvstore branch. To reproduce, here's my WIP branch.

takeaways:

We can easily get 140ms latency for a get_page request. Not good, but fixable.
Latency grows linearly with the number of updates, and there's room for more updates before the pageserver takes an image.
NOTE: This test doesn't stress disk lookup time, since all the relevant WAL is in the latest layer. But the test suite is capable of testing this, and that's what I'm planning to try next.

non-takeaways:

Average redo time was 1.4 microseconds, but I'm only measuring one kind of wal entry, so don't take this number seriously at all.
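A minimal sketch of the two analysis steps in this experiment: picking 0.1% of the update LSNs, and checking that latency grows linearly with LSN. This is not the code from the WIP branch. The function names are hypothetical, random sampling is an assumption (the comment does not say how the LSNs were chosen), and the line fit is ordinary least squares.

```python
import random
from typing import List, Sequence, Tuple


def sample_lsns(update_lsns: Sequence[int], fraction: float = 0.001,
                seed: int = 0) -> List[int]:
    """Pick a random `fraction` of the LSNs at which the page was updated.

    Assumption: uniform random sampling without replacement.
    """
    count = max(1, round(len(update_lsns) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(list(update_lsns), count))


def fit_line(points: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept.

    `points` holds (lsn, get_page runtime in microseconds) pairs.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

With 10k update LSNs, `sample_lsns` returns 10 of them. A roughly constant positive slope from `fit_line` over the measured points would match the linear growth described above.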

bojanserafimov · 2022-03-23T14:19:07Z

Not needed for 0.3 milestone. After some thinking I realized that it's not clear how probable it is that we'll ever have two compute nodes share a pageserver. A read node in a different region will require a local pageserver, since the safekeeper-pageserver connection is the only one that can afford latency. We might only need random LSN reads in a few niche cases:

Read-only replica for stricter isolation of readers for security. Useful for sharing data across organizations
Time travel queries
???

bojanserafimov · 2022-04-05T16:44:25Z


Resuming work on this issue. Random historical queries are maybe not important, so I will only focus on more immediately impactful tests:

get_page performance for latest pages
get_page performance from recent wal

bojanserafimov · 2022-04-06T00:32:34Z

I tested get_page performance for the latest page for each key. To initialize some pageserver state, I used four different workloads:

workload 1: pgbench small

pgbench -s5 -i
pgbench -c1 -t5000

results

Total pages: 9742
Fastest: 19.925µs
Median: 44.113µs
99th percentile: 169.875µs
Slowest: 6.017257ms

workload 2: pgbench big

pgbench -s100 -i
pgbench -c1 -t100000

results

Total pages: 193366
Fastest: 17.465µs
Median: 33.979µs
99th percentile: 125.403µs
Slowest: 19.226091ms

workload 3: pgbench big and long

pgbench -s100 -i
pgbench -c1 -t1000000

results

Total pages: 200609
Fastest: 20.031µs
Median: 45.195µs
99th percentile: 103.583µs
Slowest: 38.300479ms

workload 4: 100k updates to tiny table

Total pages: 27
Fastest: 29.431µs
Median: 45.56µs
99th percentile: 865.437539ms
Slowest: 865.437539ms
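For readers who want to produce the same five summary lines from raw per-request latencies, here is a minimal sketch. It assumes the nearest-rank percentile definition; the comment does not say how the percentiles were actually computed, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile_nearest_rank(sorted_values: Sequence[float], pct: int) -> float:
    """Return the nearest-rank percentile of an already sorted sequence."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies: Sequence[float]) -> Dict[str, float]:
    """Summarize latencies (any single unit) in the format used above."""
    ordered = sorted(latencies)
    return {
        "total": len(ordered),
        "fastest": ordered[0],
        "median": percentile_nearest_rank(ordered, 50),
        "p99": percentile_nearest_rank(ordered, 99),
        "slowest": ordered[-1],
    }
```

Under this definition, a sample of only 27 pages has rank ceil(0.99 * 27) = 27 for the 99th percentile, so the 99th percentile is the slowest request. That would be consistent with the identical last two numbers in workload 4.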

I'm not sure how the cache is configured, or how it might obscure the results I'd get on a bigger database. I'm now working on getting this test merged into main so others can play with it.

Clarification: I'm not measuring the get_page latency that the compute node experienced. I'm measuring the get_page latency of sending direct pagestream api requests to the pageserver after pgbench is done. I'm sending one request at a time.
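The measurement loop described in this clarification can be sketched as follows. This is an illustration only: `send_request` is a hypothetical stand-in for whatever issues one blocking pagestream get_page request, and the real client and its key type are not shown in this thread.

```python
import time
from typing import Callable, Hashable, Iterable, List, Tuple


def time_sequential(
    send_request: Callable[[Hashable], object],
    keys: Iterable[Hashable],
) -> List[Tuple[Hashable, float]]:
    """Send one request at a time and record each latency in microseconds."""
    results = []
    for key in keys:
        start = time.perf_counter_ns()
        send_request(key)  # assumed to block until the response arrives
        elapsed_us = (time.perf_counter_ns() - start) / 1_000
        results.append((key, elapsed_us))
    return results
```

Because the next request is only sent after the previous one returns, the numbers reflect single-request latency with no concurrency, not throughput under load.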

jcsp · 2024-02-05T10:28:22Z

This is stale, the perf testing story has moved on.

bojanserafimov added the c/storage/pageserver, a/test, a/performance, and a/benchmark labels Mar 18, 2022

bojanserafimov self-assigned this Mar 18, 2022

bojanserafimov mentioned this issue Mar 18, 2022

Take page images in pageserver #1142

Closed

2 tasks

stepashka added this to the 0.3 Towards Tech Prev milestone Mar 21, 2022

bojanserafimov removed this from the 0.3 Towards Tech Prev milestone Mar 23, 2022

bojanserafimov removed their assignment Mar 23, 2022

bojanserafimov self-assigned this Apr 5, 2022

stepashka mentioned this issue Apr 7, 2022

Epic: Read Latency — define metrics, define and achieve launch objectives #1466

Closed

16 tasks

neondatabase-bot bot added this to the 0.7 Towards Tech Prev milestone Apr 7, 2022

This was referenced Apr 10, 2022

Epic: page reconstruction (materialization) criteria #1481

Open

Add dedicated get_page perf test #1495

Closed

stepashka modified the milestones: 0.7 Towards Tech Prev, 1.0 Technical preview May 6, 2022

neondatabase-bot bot removed this from the 1.0 Technical preview milestone May 17, 2022

shanyp mentioned this issue Jul 19, 2023

Epic: reduce space amplification #4754

Closed

jcsp closed this as completed Feb 5, 2024
