
Increase WAL redo speed #1339

Closed
Tracked by #1466
knizhnik opened this issue Mar 4, 2022 · 13 comments
Assignees
Labels
a/benchmark Area: related to benchmarking c/storage/pageserver Component: storage: pageserver c/storage/wal Component: storage: relates to WAL processing

Comments

@knizhnik
Contributor

knizhnik commented Mar 4, 2022

We are currently applying WAL records in a separate process (wal-redo postgres), sending WAL records through a pipe and receiving reconstructed pages in response. This adds quite significant overhead.

Also, received WAL records can be malformed, so we have to place the wal-redo process in a sandbox to prevent the system from being compromised. Right now we use seccomp to prohibit all "dangerous" system calls, but I am not sure that this is enough.

Also, there is just one instance of the wal-redo process per tenant, so it can be a bottleneck. There have been attempts to spawn a pool of wal-redo postgres processes; in some cases this had a positive effect on performance, in other cases a negative one. And if we are going to serve a larger number of tenants with one pageserver, then a large number of wal-redo processes can be a problem.

So I want to investigate how difficult it would be to reimplement the postgres redo handlers in Rust and whether this can improve performance.
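
For illustration, here is a minimal sketch of the general shape of this interaction. It is not the actual pageserver protocol: the `cat` child command, the lack of record framing, and the 8 KiB page size are assumptions made only for the example.

    // Minimal sketch of driving a redo helper over stdin/stdout pipes.
    // `cat` stands in for `postgres --wal-redo`; the real protocol frames
    // the base image and WAL records and keeps the process alive.
    use std::io::{Read, Write};
    use std::process::{Command, Stdio};

    const PAGE_SIZE: usize = 8192; // assumed size of a reconstructed page

    fn main() -> std::io::Result<()> {
        let mut child = Command::new("cat") // stand-in for the wal-redo process
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;

        let mut stdin = child.stdin.take().unwrap();
        let mut stdout = child.stdout.take().unwrap();

        // "Send" a base page plus WAL records, then read back the reconstructed page.
        let request = vec![0u8; PAGE_SIZE]; // placeholder for base image + WAL records
        stdin.write_all(&request)?;
        drop(stdin); // close the pipe so the child sees EOF

        let mut page = vec![0u8; PAGE_SIZE];
        stdout.read_exact(&mut page)?;
        child.wait()?;
        Ok(())
    }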

@knizhnik
Contributor Author

knizhnik commented Mar 4, 2022

Right now I have implemented a redo handler for only one WAL record type: HEAP INSERT. But it already allows us to compare WAL apply speed.

Configuration:

autovacuum = off
shared_buffers = 1Mb
wal_log_hints = off
---
checkpoint_interval = 10000 sec

So we disable checkpointing and use small shared buffers to force page reconstruction. Vacuum is disabled to exclude background activity.

Query:

create table t2 as select generate_series(1,10000000);
select count(*) from t2;

Results:

Zenith branch        Insert time   count(*) sequential   count(*) parallel   repo size   wal size
main                 18658         29666                 12886               873M        625M
rust_redo_handlers   20826         20254                 5722                568M        625M

So as you can see, redo in Rust gives more than a 2x improvement in speed and about a 1/3 reduction in storage size (because of the more compact WAL record format: we do not need to store information about the target block).

@knizhnik
Contributor Author

knizhnik commented Mar 4, 2022

With prefetch, the effect of Zenith WAL redo is expected to be even larger.

@knizhnik
Contributor Author

knizhnik commented Mar 4, 2022

Related information concerning measuring WAL redo speed:
We have said many times that we should measure the speed of our WAL redo process. It is possible to monitor it now using Prometheus metrics, but I was not sure whether the precision is sufficient, and I also wanted to measure the speed of individual operations.
So my results with the release Rust build on our pgbench tests with -N are the following:

  1. Prometheus
    pageserver_wal_redo_time_sum 2.556758943000012
    pageserver_wal_redo_time_count 15964

  2. Time of the update operation (1000 iterations in a loop): 400 usec for a vector with 61 updates

So in the first case the average WAL redo time is 160 usec, and in the second it is 400 usec.
This corresponds to at most 2500 get_page_at_lsn requests per second (not including the network roundtrip). Actual TPS for one client is about 1000. Taking into account that the pgbench script with the -N option accesses just one random tuple (which requires reading ~2 pages: index and heap), these numbers are consistent, and wal-redo postgres can really be a bottleneck.

Applying one insert record in wal-redo postgres (eliminating all communication and xlog decoding overhead) takes about 2 usec. If we multiply that by 61, we get roughly 120 usec for applying all the WAL records needed to reconstruct one page. The remaining ~280 usec appear to be spent on communication.
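
For reference, the arithmetic behind these estimates, using the numbers above (rounded):

    average redo time  = pageserver_wal_redo_time_sum / pageserver_wal_redo_time_count
                       = 2.556758943 s / 15964          ≈ 160 usec per request
    max request rate   = 1 s / 400 usec per request     ≈ 2500 get_page_at_lsn requests/sec
    pure apply time    = 2 usec/record × 61 records     ≈ 120 usec
    communication etc. = 400 usec − ~120 usec           ≈ 280 usec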

@knizhnik
Contributor Author

knizhnik commented Mar 4, 2022

By the way, the time of select count(*) when selecting from image layers is 1383.
So that is still ~4 times faster than page reconstruction (even when embedded in Rust).

@knizhnik knizhnik self-assigned this Mar 4, 2022
@knizhnik knizhnik added c/storage/pageserver Component: storage: pageserver c/storage/wal Component: storage: relates to WAL processing a/benchmark Area: related to benchmarking labels Mar 4, 2022
@stepashka stepashka added this to the 0.6 Towards Tech Prev milestone Mar 15, 2022
@stepashka stepashka removed this from the 0.6 Towards Tech Prev milestone Mar 23, 2022
@stepashka stepashka changed the title Epic: increase WAL redo speed Increase WAL redo speed Apr 5, 2022
@neondatabase-bot neondatabase-bot bot added this to the 0.6 Towards Tech Prev milestone Apr 5, 2022
@stepashka stepashka linked a pull request Apr 26, 2022 that will close this issue
@neondatabase-bot neondatabase-bot bot removed this from the 1.0 Technical preview milestone May 17, 2022
@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

I have performed a series of experiments trying to determine the walredo bottlenecks:

First of all, I dumped the requests sent to the wal-redo process into a file and then replayed them: time postgres --wal-redo < walredo.log > /dev/null
Replaying 3.2 GB of WAL takes only 1.3 seconds, so there is no problem with redo speed itself.
Then I created a Rust program simulating the interaction with the postgres wal-redo process through the pipe.
It also reads the log data from the saved walredo.log file. Replaying WAL in async mode (without waiting for each page reconstruction) takes about 2 seconds, but if we wait for the result before sending a new request to the walredo process, the time increases to 14 seconds. So buffering wal-redo requests seems to be very important.
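
To make the difference concrete, here is a minimal sketch of pipelined interaction with a child process over a pipe: a writer thread keeps the pipe full while the main thread reads responses. This is not the attached program or pageserver code; the `cat` command and the fixed 8 KiB request/response size are assumptions for the example.

    // Pipelined request/response over a pipe: the writer never waits for
    // a response before sending the next request, so the child stays busy.
    use std::io::{Read, Write};
    use std::process::{Command, Stdio};
    use std::thread;

    const RESPONSE_SIZE: usize = 8192; // assumed reconstructed-page size
    const NUM_REQUESTS: usize = 100;

    fn main() -> std::io::Result<()> {
        let mut child = Command::new("cat") // stand-in for the wal-redo process
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let mut stdin = child.stdin.take().unwrap();
        let mut stdout = child.stdout.take().unwrap();

        // Writer thread: push all requests without waiting for responses.
        let writer = thread::spawn(move || {
            let request = vec![0u8; RESPONSE_SIZE]; // placeholder request
            for _ in 0..NUM_REQUESTS {
                stdin.write_all(&request).unwrap();
            }
            // stdin is dropped here, closing the pipe
        });

        // Reader (this thread): collect responses as they arrive.
        let mut page = vec![0u8; RESPONSE_SIZE];
        for _ in 0..NUM_REQUESTS {
            stdout.read_exact(&mut page)?;
        }
        writer.join().unwrap();
        child.wait()?;
        Ok(())
    }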

I have implemented this kind of multiplexing/buffering of requests to the walredo process through channels:
https://github.com/neondatabase/neon/tree/walredo_channel
It reduces the Ketteq Q1 execution time from 30 to 24 seconds, but that is not enough. Most likely we need more parallel requests (in the Ketteq case, the parallel seqscan spawns 6 parallel workers).
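
The branch linked above is the real implementation; purely as an illustration of the idea, a channel-based multiplexer could look roughly like this (assuming a single walredo pipe and in-order responses; the `Request` struct and the echo placeholder are made up for the sketch):

    // Many backends send requests into one channel; a single thread owning the
    // walredo pipe drains whatever is queued, so requests are naturally batched.
    use std::sync::mpsc;
    use std::thread;

    struct Request {
        payload: Vec<u8>,
        reply: mpsc::Sender<Vec<u8>>, // per-request reply channel
    }

    fn main() {
        let (tx, rx) = mpsc::channel::<Request>();

        let pipe_thread = thread::spawn(move || {
            while let Ok(first) = rx.recv() {
                let mut batch = vec![first];
                // Drain everything already queued -> bigger writes, deeper pipeline.
                while let Ok(next) = rx.try_recv() {
                    batch.push(next);
                }
                for req in batch {
                    // Placeholder for: write req.payload to the pipe and read the
                    // reconstructed page back; here we just echo the payload.
                    let page = req.payload;
                    let _ = req.reply.send(page);
                }
            }
        });

        // One "backend" issuing a request and waiting for the reconstructed page.
        let (reply_tx, reply_rx) = mpsc::channel();
        tx.send(Request { payload: vec![0u8; 8192], reply: reply_tx }).unwrap();
        let _page = reply_rx.recv().unwrap();

        drop(tx); // all senders gone -> pipe_thread exits
        pipe_thread.join().unwrap();
    }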

Having a pool of walredo workers (up to 4 processes) reduces the Q1 time from 30 to 18 seconds:
https://github.com/neondatabase/neon/tree/walredo-optimizations
But spawning multiple walredo processes for each tenant may not be such a good idea.

@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

This is the small Rust program I am using for replaying WAL from the file. It needs to be executed in the directory containing walredo.log and with PGDATA pointing to the wal-redo directory. To produce walredo.log, I patched walredo.rs to write the data sent to the pipe into a file.

walredo.tar.gz

buffer size   time (sec)
1             13.911
2             7.226
4             3.005
8             1.785

@koivunej
Contributor

koivunej commented Nov 7, 2022

Interested in looking at this.

@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

I just realized that my redo_channel branch was not doing buffering in the right way.
I have fixed it, and now I get the same 18 seconds for Q1.

@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

So right now the situation with the Ketteq Q1 query is the following:

state                  time (sec)
cold (main)            28
warm shared buffers    4.6
pageserver cache       6
latest image layers    10
buffered walredo       18

There is still a large gap between the case where the pageserver has to do page reconstruction (18 seconds) and the case where it is not needed (10 seconds).
This is puzzling, because the buffering experiments show that with buffer size = 8 the walredo time is less than 2 seconds.
Maybe I am still doing something wrong in my buffering implementation using channels, or the parallel workers are not able to produce a large enough number of concurrent requests.

@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

The average number of buffered requests for Q1 with 6 parallel workers is just 2.5.
That explains the 8-second difference: in the table above, the time of walredo replay from file with buffer size = 2 was 7 seconds.
What is not clear is why 7 backends (6 parallel workers + the master backend) can concurrently produce only ~2.5 requests.

@koivunej
Contributor

I think what we would most benefit from right now is a reproducible benchmark around, for example, a Ketteq-Q1-like situation. Trying to work towards that, while also trying to understand #2778.

koivunej added a commit that referenced this issue Nov 16, 2022
adds a simple walredo bench to allow some comparison of the walredo
throughput.

Cc: #1339, #2778
koivunej added a commit that referenced this issue Nov 17, 2022
adds a simple walredo bench to allow some comparison of the walredo
throughput.

Cc: #1339, #2778
@koivunej
Contributor

Tried quite a few permutations of implementing pipelined walredo using tokio primitives in #2875, but it doesn't look viable, at least until the root cause has been understood. With tokio patched to support vectored writes, I feel like it should be faster and reuse memory, but for some reason it is slower overall.
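
For reference, the vectored-write idea mentioned here, shown with std rather than tokio since the tokio patch is out of tree: several buffers go out in one write call instead of being copied into a single contiguous buffer first. The header/body sizes and the `std::io::sink()` target are placeholders.

    use std::io::{IoSlice, Write};

    fn main() -> std::io::Result<()> {
        let header = [0u8; 24];     // hypothetical record header
        let body = vec![0u8; 8192]; // hypothetical page image

        let mut out = std::io::sink(); // stand-in for the walredo pipe
        let bufs = [IoSlice::new(&header), IoSlice::new(&body)];
        // write_vectored may write fewer bytes than requested; a real caller loops.
        let written = out.write_vectored(&bufs)?;
        println!("wrote {written} bytes in one call");
        Ok(())
    }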

@koivunej
Contributor

@knizhnik's #3368 was merged instead. I think we are good to close this for now. Please reopen if that is not the case.
