
Increase WAL redo speed #1339

Closed
Tracked by #1466
knizhnik opened this issue Mar 4, 2022 · 13 comments
Assignees
Labels
a/benchmark Area: related to benchmarking c/storage/pageserver Component: storage: pageserver c/storage/wal Component: storage: relates to WAL processing

Comments

@knizhnik
Contributor

knizhnik commented Mar 4, 2022

We are currently applying WAL records in a separate process (wal-redo postgres), sending WAL records through a pipe and receiving reconstructed pages in response. This adds quite significant overhead.

Also, received WAL records can be malformed, so we have to place the wal-redo process in a sandbox to prevent the system from being compromised. Right now we use seccomp to prohibit all "dangerous" system calls, but I am not sure that this is enough.

Also, there is just one instance of the wal-redo process per tenant, so it can be a bottleneck. There have been attempts to spawn a pool of wal-redo postgres processes; in some cases this had a positive effect on performance, in other cases a negative one. And if we are going to serve a larger number of tenants with one pageserver, then a large number of wal-redo processes can be a problem.

So I want to investigate how difficult it would be to reimplement the postgres redo handlers in Rust and whether this can improve performance.
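
For illustration, here is a minimal sketch of the general shape of this interaction. It is not the actual pageserver protocol: the `cat` child command, the lack of record framing, and the 8 KiB page size are assumptions made only for the example.

    // Minimal sketch of driving a redo helper over stdin/stdout pipes.
    // `cat` stands in for `postgres --wal-redo`; the real protocol frames
    // the base image and WAL records and keeps the process alive.
    use std::io::{Read, Write};
    use std::process::{Command, Stdio};

    const PAGE_SIZE: usize = 8192; // assumed size of a reconstructed page

    fn main() -> std::io::Result<()> {
        let mut child = Command::new("cat") // stand-in for the wal-redo process
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;

        let mut stdin = child.stdin.take().unwrap();
        let mut stdout = child.stdout.take().unwrap();

        // "Send" a base page plus WAL records, then read back the reconstructed page.
        let request = vec![0u8; PAGE_SIZE]; // placeholder for base image + WAL records
        stdin.write_all(&request)?;
        drop(stdin); // close the pipe so the child sees EOF

        let mut page = vec![0u8; PAGE_SIZE];
        stdout.read_exact(&mut page)?;
        child.wait()?;
        Ok(())
    }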

@knizhnik
Contributor Author

knizhnik commented Mar 4, 2022

Right now I have implemented a redo handler for only one WAL record type: HEAP INSERT. But it already allows us to compare WAL apply speed.

Configuration:

autovacuum = off
shared_buffers = 1Mb
wal_log_hints = off
---
checkpoint_interval = 10000 sec

So we disable checkpointing and use small shared buffers to force page reconstruction. Vacuum is disabled to exclude background activity.

Query:

create table t2 as select generate_series(1,10000000);
select count(*) from t2;

Results:

Zenith branch        Insert time   count(*) sequential   count(*) parallel   repo size   wal size
main                 18658         29666                 12886               873M        625M
rust_redo_handlers   20826         20254                 5722                568M        625M

So as you can see, redo in Rust gives more than a 2x improvement in speed and about a 1/3 reduction in storage size (because of the more compact WAL record format: we do not need to store information about the target block).

@knizhnik
Contributor Author

knizhnik commented Mar 4, 2022

With prefetch, the effect of Zenith WAL redo is expected to be even larger.

@knizhnik
Contributor Author

knizhnik commented Mar 4, 2022

Related information concerning measuring WAL redo speed:
We have said many times that we should measure the speed of our WAL redo process. It is possible to monitor it now using Prometheus metrics, but I was not sure whether the precision is sufficient, and I also wanted to measure the speed of individual operations.
So my results with the release Rust build on our pgbench tests with -N are the following:

  1. Prometheus
    pageserver_wal_redo_time_sum 2.556758943000012
    pageserver_wal_redo_time_count 15964

  2. Time of the update operation (1000 iterations in a loop): 400 usec for a vector with 61 updates

So in the first case the average WAL redo time is 160 usec, and in the second it is 400 usec.
This corresponds to at most 2500 get_page_at_lsn requests per second (not including the network roundtrip). Actual TPS for one client is about 1000. Taking into account that the pgbench script with the -N option accesses just one random tuple (which requires reading ~2 pages: index and heap), these numbers are consistent, and wal-redo postgres can really be a bottleneck.

Applying one insert record in wal-redo postgres (eliminating all communication and xlog decoding overhead) takes about 2 usec. If we multiply that by 61, we get roughly 120 usec for applying all the WAL records needed to reconstruct one page. The remaining ~280 usec appear to be spent on communication.
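
For reference, the arithmetic behind these estimates, using the numbers above (rounded):

    average redo time  = pageserver_wal_redo_time_sum / pageserver_wal_redo_time_count
                       = 2.556758943 s / 15964          ≈ 160 usec per request
    max request rate   = 1 s / 400 usec per request     ≈ 2500 get_page_at_lsn requests/sec
    pure apply time    = 2 usec/record × 61 records     ≈ 120 usec
    communication etc. = 400 usec − ~120 usec           ≈ 280 usec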

@knizhnik
Contributor Author

knizhnik commented Mar 4, 2022

By the way, the time of select count(*) when selecting from image layers is 1383.
So that is still ~4 times faster than page reconstruction (even when embedded in Rust).

@knizhnik knizhnik self-assigned this Mar 4, 2022
@knizhnik knizhnik added c/storage/pageserver Component: storage: pageserver c/storage/wal Component: storage: relates to WAL processing a/benchmark Area: related to benchmarking labels Mar 4, 2022
@stepashka stepashka added this to the 0.6 Towards Tech Prev milestone Mar 15, 2022
@stepashka stepashka removed this from the 0.6 Towards Tech Prev milestone Mar 23, 2022
@stepashka stepashka changed the title Epic: increase WAL redo speed Increase WAL redo speed Apr 5, 2022
@neondatabase-bot neondatabase-bot bot added this to the 0.6 Towards Tech Prev milestone Apr 5, 2022
@stepashka stepashka linked a pull request Apr 26, 2022 that will close this issue
@neondatabase-bot neondatabase-bot bot removed this from the 1.0 Technical preview milestone May 17, 2022
@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

I have performed a series of experiments trying to determine the walredo bottlenecks:

First of all, I dumped the requests sent to the wal-redo process into a file and then replayed them: time postgres --wal-redo < walredo.log > /dev/null
Replaying 3.2 GB of WAL takes only 1.3 seconds, so there is no problem with redo speed itself.
Then I created a Rust program simulating the interaction with the postgres wal-redo process through the pipe.
It also reads the log data from the saved walredo.log file. Replaying WAL in async mode (without waiting for each page reconstruction) takes about 2 seconds, but if we wait for the result before sending a new request to the walredo process, the time increases to 14 seconds. So buffering wal-redo requests seems to be very important.
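
To make the difference concrete, here is a minimal sketch of pipelined interaction with a child process over a pipe: a writer thread keeps the pipe full while the main thread reads responses. This is not the attached program or pageserver code; the `cat` command and the fixed 8 KiB request/response size are assumptions for the example.

    // Pipelined request/response over a pipe: the writer never waits for
    // a response before sending the next request, so the child stays busy.
    use std::io::{Read, Write};
    use std::process::{Command, Stdio};
    use std::thread;

    const RESPONSE_SIZE: usize = 8192; // assumed reconstructed-page size
    const NUM_REQUESTS: usize = 100;

    fn main() -> std::io::Result<()> {
        let mut child = Command::new("cat") // stand-in for the wal-redo process
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let mut stdin = child.stdin.take().unwrap();
        let mut stdout = child.stdout.take().unwrap();

        // Writer thread: push all requests without waiting for responses.
        let writer = thread::spawn(move || {
            let request = vec![0u8; RESPONSE_SIZE]; // placeholder request
            for _ in 0..NUM_REQUESTS {
                stdin.write_all(&request).unwrap();
            }
            // stdin is dropped here, closing the pipe
        });

        // Reader (this thread): collect responses as they arrive.
        let mut page = vec![0u8; RESPONSE_SIZE];
        for _ in 0..NUM_REQUESTS {
            stdout.read_exact(&mut page)?;
        }
        writer.join().unwrap();
        child.wait()?;
        Ok(())
    }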

I have implemented this kind of multiplexing/buffering of requests to the walredo process through channels:
https://github.com/neondatabase/neon/tree/walredo_channel
It reduces the Ketteq Q1 execution time from 30 to 24 seconds, but that is not enough. Most likely we need more parallel requests (in the Ketteq case, the parallel seqscan spawns 6 parallel workers).
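
The branch linked above is the real implementation; purely as an illustration of the idea, a channel-based multiplexer could look roughly like this (assuming a single walredo pipe and in-order responses; the `Request` struct and the echo placeholder are made up for the sketch):

    // Many backends send requests into one channel; a single thread owning the
    // walredo pipe drains whatever is queued, so requests are naturally batched.
    use std::sync::mpsc;
    use std::thread;

    struct Request {
        payload: Vec<u8>,
        reply: mpsc::Sender<Vec<u8>>, // per-request reply channel
    }

    fn main() {
        let (tx, rx) = mpsc::channel::<Request>();

        let pipe_thread = thread::spawn(move || {
            while let Ok(first) = rx.recv() {
                let mut batch = vec![first];
                // Drain everything already queued -> bigger writes, deeper pipeline.
                while let Ok(next) = rx.try_recv() {
                    batch.push(next);
                }
                for req in batch {
                    // Placeholder for: write req.payload to the pipe and read the
                    // reconstructed page back; here we just echo the payload.
                    let page = req.payload;
                    let _ = req.reply.send(page);
                }
            }
        });

        // One "backend" issuing a request and waiting for the reconstructed page.
        let (reply_tx, reply_rx) = mpsc::channel();
        tx.send(Request { payload: vec![0u8; 8192], reply: reply_tx }).unwrap();
        let _page = reply_rx.recv().unwrap();

        drop(tx); // all senders gone -> pipe_thread exits
        pipe_thread.join().unwrap();
    }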

Having a pool of walredo workers (up to 4 processes) reduces the Q1 time from 30 to 18 seconds:
https://github.com/neondatabase/neon/tree/walredo-optimizations
But spawning multiple walredo processes for each tenant may not be such a good idea.

@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

This is the small Rust program I am using for replaying WAL from the file. It needs to be executed in the directory containing walredo.log and with PGDATA pointing to the wal-redo directory. To produce walredo.log, I patched walredo.rs to write the data sent to the pipe into a file.

walredo.tar.gz

buffer size   time (sec)
1             13.911
2             7.226
4             3.005
8             1.785

@koivunej
Contributor

koivunej commented Nov 7, 2022

Interested in looking at this.

@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

I just realized that my redo_channel branch was not doing buffering in the right way.
I have fixed it, and now I get the same 18 seconds for Q1.

@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

So right now the situation with the Ketteq Q1 query is the following:

state                  time (sec)
cold (main)            28
warm shared buffers    4.6
pageserver cache       6
latest image layers    10
buffered walredo       18

There is still a large gap between the case where the pageserver has to do page reconstruction (18 seconds) and the case where it is not needed (10 seconds).
This is puzzling, because the buffering experiments show that with buffer size = 8 the walredo time is less than 2 seconds.
Maybe I am still doing something wrong in my buffering implementation using channels, or the parallel workers are not able to produce a large enough number of concurrent requests.

@knizhnik
Contributor Author

knizhnik commented Nov 7, 2022

The average number of buffered requests for Q1 with 6 parallel workers is just 2.5.
That explains the 8-second difference: in the table above, the time of walredo replay from file with buffer size = 2 was 7 seconds.
What is not clear is why 7 backends (6 parallel workers + the master backend) can concurrently produce only ~2.5 requests.

@koivunej
Contributor

I think what we would most benefit from right now is a reproducible benchmark around, for example, a Ketteq-Q1-like situation. Trying to work towards that, while also trying to understand #2778.

koivunej added a commit that referenced this issue Nov 16, 2022
adds a simple walredo bench to allow some comparison of the walredo
throughput.

Cc: #1339, #2778
koivunej added a commit that referenced this issue Nov 17, 2022
adds a simple walredo bench to allow some comparison of the walredo
throughput.

Cc: #1339, #2778
@koivunej
Contributor

Tried quite a few permutations of implementing pipelined walredo using tokio primitives in #2875, but it doesn't look viable, at least until the root cause has been understood. With tokio patched to support vectored writes, I feel like it should be faster and reuse memory, but for some reason it is slower overall.
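
For reference, the vectored-write idea mentioned here, shown with std rather than tokio since the tokio patch is out of tree: several buffers go out in one write call instead of being copied into a single contiguous buffer first. The header/body sizes and the `std::io::sink()` target are placeholders.

    use std::io::{IoSlice, Write};

    fn main() -> std::io::Result<()> {
        let header = [0u8; 24];     // hypothetical record header
        let body = vec![0u8; 8192]; // hypothetical page image

        let mut out = std::io::sink(); // stand-in for the walredo pipe
        let bufs = [IoSlice::new(&header), IoSlice::new(&body)];
        // write_vectored may write fewer bytes than requested; a real caller loops.
        let written = out.write_vectored(&bufs)?;
        println!("wrote {written} bytes in one call");
        Ok(())
    }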

@koivunej
Contributor

@knizhnik's #3368 was merged instead. I think we are good to close this for now. Please reopen if that is not the case.
