
Walredo channel #2778

Closed · wants to merge 13 commits

Conversation

@knizhnik (Contributor) commented Nov 8, 2022

See #1339
Buffering walredo requests has a critical impact on performance.
This PR tries to do it using Rust channels.
It reduces Ketteq Q1 execution time from 30 to 17 seconds.
Unfortunately, the average number of buffered requests with 6 parallel workers is only 2.5.
10 seconds is the Q1 execution time without page reconstruction (with up-to-date image layers).
So the remaining 7 seconds is walredo time with 2.5 buffered requests.
I am not sure we can achieve a higher level of buffering. The maximal number of buffered requests is 7 (6 parallel workers + master), so the channels are in principle able to buffer requests, but the average number is just 2.5.

    pg_version: u32,
    ) -> Result<PostgresRedoProcess, Error> {
    #[instrument(skip_all, fields(tenant_id=%tenant_id))]
    fn launch(conf: &PageServerConf, tenant_id: TenantId) -> Result<PostgresRedoProcess, Error> {
Contributor

Why is the pg version removed and forced to PG14?

Contributor Author

This is one of the unresolved issues related to this PR.
There is also a similar problem in main - see #2770.
pg_version should somehow be specified at WalRedoManager creation time.

@problame (Contributor) left a comment

This will conflict with #2776 and I'm planning to merge that in a few minutes.

@knizhnik (Contributor, Author) commented Nov 9, 2022

This will conflict with #2776 and I'm planning to merge that in a few minutes.

Yes, I see. The conflict can be resolved.
But I still do not quite understand why killing the child process from the drop() method doesn't work.
I read the comment on #2776, but the motivation is still unclear to me: I do not understand how the launch() method can fail AFTER spawning the child process. It looks like the only reason for a launch() failure is some problem with spawning the process, doesn't it?

@problame (Contributor) commented Nov 9, 2022

Looks like the only reason for a launch() failure is some problem with spawning the process, doesn't it?

If we hit any of the unwrap() or set_nonblocking(...)? early returns after spawning the std::process::Child but before wrapping it into the PostgresRedoProcess, then the Child gets dropped and leaked. It's documented as a pitfall in the standard library: https://doc.rust-lang.org/std/process/struct.Child.html#warning
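One way to avoid that leak, sketched below, is to hold the spawned Child in a guard that kills and reaps it on drop, and only disarm the guard once the fallible setup has succeeded. This is a minimal illustration of the pitfall linked above, assuming nothing about the actual code in this PR or in #2776:

    use std::io::Error;
    use std::process::{Child, Command, Stdio};

    /// Kills and reaps the child if an early `?`/`unwrap()` bails out of a
    /// launch-like function before the Child is handed over to its final owner.
    struct KillOnDrop(Option<Child>);

    impl KillOnDrop {
        fn disarm(mut self) -> Child {
            self.0.take().unwrap()
        }
    }

    impl Drop for KillOnDrop {
        fn drop(&mut self) {
            if let Some(child) = self.0.as_mut() {
                let _ = child.kill();
                let _ = child.wait();
            }
        }
    }

    fn launch_guarded() -> Result<Child, Error> {
        let child = Command::new("true")
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let guard = KillOnDrop(Some(child));
        // ...fallible setup such as set_nonblocking() would go here; any early
        // return now kills the child via the guard instead of leaking it...
        Ok(guard.disarm())
    }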

@knizhnik (Contributor, Author) commented Nov 9, 2022

Yes, that is true.
But I do not see any transient reason why the child's stdin/stdout could not be taken and switched to non-blocking mode.
If that fails, it means some permanent OS/process problem that cannot simply be ignored.
If we try to continue normal work and spawn the walredo process once again, most likely we will face the same error.
So I do not think that we should somehow try to catch and handle these errors.

@problame (Contributor) commented Nov 9, 2022

Let's continue this conversation in the PR: #2776 (comment)

Comment on lines 110 to 116

    // multiplexer pipe: use sync_channel to allow sharing the sender by multiple threads
    // and to limit the size of the buffer
    sender: SyncSender<(ChannelId, Vec<u8>)>,
    // set of receiver channels
    receivers: Vec<Mutex<Receiver<Bytes>>>,
    // atomically incremented counter for choosing a receiver
    round_robin: AtomicUsize,
@hlinnaka (Contributor) commented Nov 10, 2022

The usual way to have a request/response queue like this is to create a "oneshot" channel for each request, and send the sending half of it along with the request. So something like:

    /// To request WAL redo, create a oneshot channel for the response, and send the
    /// serialized request and the send-half of the oneshot channel here. Then wait for
    /// the response on the receive-half of the oneshot channel
    request_queue: tokio::sync::mpsc::Sender<(tokio::sync::oneshot::Sender<Bytes>, Vec<u8>)>,

You can use those tokio channels from blocking code too; the Senders and Receivers also have blocking send and receive functions.
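A hedged sketch of that pattern (type and function names are illustrative, not the actual pageserver code):

    use tokio::sync::{mpsc, oneshot};

    /// Serialized WAL-redo request plus the send-half for its response.
    type Request = (oneshot::Sender<Vec<u8>>, Vec<u8>);

    /// The task that owns the walredo process: pull requests off the queue,
    /// feed them to the process, and send each reconstructed page back on
    /// that request's own oneshot channel.
    async fn redo_task(mut rx: mpsc::Receiver<Request>) {
        while let Some((resp_tx, req)) = rx.recv().await {
            // ...write `req` to the walredo process and read back a page...
            let page = req; // placeholder: echo the request
            let _ = resp_tx.send(page);
        }
    }

    /// Caller side: create a oneshot per request, send it along with the
    /// serialized request, then wait for the response.
    async fn request_redo(tx: &mpsc::Sender<Request>, req: Vec<u8>) -> Option<Vec<u8>> {
        let (resp_tx, resp_rx) = oneshot::channel();
        tx.send((resp_tx, req)).await.ok()?;
        resp_rx.await.ok()
    }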

Contributor Author

With blocking_recv I get the following error:

Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.

I wonder why it is necessary to use tokio::sync::oneshot::Sender?
I tried the standard std::sync::mpsc::Sender and the speed is the same.
I have tested the speed with this simple test:

    use std::sync::mpsc::{self, Receiver, Sender, SyncSender};

    const N_ITER: u32 = 1_000_000; // arbitrary iteration count

    fn main() {
        let (tx, rx): (
            SyncSender<(Sender<u32>, u32)>,
            Receiver<(Sender<u32>, u32)>,
        ) = mpsc::sync_channel(1024 * 1024);

        let worker = std::thread::spawn(move || {
            for i in 0..N_ITER {
                let (sender, data) = rx.recv().unwrap();
                assert!(data == i);
                sender.send(data).unwrap();
            }
        });

        for i in 0..N_ITER {
            // one "oneshot-style" response channel per request
            let (tx1, rx1) = mpsc::channel();
            tx.send((tx1, i)).unwrap();
            let resp = rx1.recv().unwrap();
            assert!(resp == i);
        }

        worker.join().unwrap();
    }

Contributor Author

I also tried https://docs.rs/crossbeam-channel/0.5.6/crossbeam_channel/index.html
because I saw a recommendation to use it as a much faster alternative to std::sync::mpsc.
But in my test I didn't see a significant difference.

Contributor

I wonder why it is necessary to use tokio::sync::oneshot::Sender?
I tried the standard std::sync::mpsc::Sender and the speed is the same.

This is just a design pattern. With oneshot it's clear at the type level that you are only going to get back a single response. The std's rather complex mpsc optimizes for this "oneshot" usage first, then switches to a different kind of implementation when Senders are cloned or more messages are sent. Later implementations (crossbeam, flume) do not necessarily do this.

Cannot start a runtime from within a runtime. This happens because a function (like block_on) attempted to block the current thread while the thread is being used to drive asynchronous tasks.

This is quite unfortunate. I wonder if a block_on or block_in_place is already being used somewhere.

Contributor Author

This is quite unfortunate.

Why do we need to use tokio::sync::oneshot if it provides no performance advantage compared with std::sync::mpsc, which has no such problems?

@koivunej (Contributor) commented Nov 14, 2022

If I understand correctly, this changes the walredo to:

  • run in its own thread
  • send work to the external process while the external process is busy doing other things (sending a response, stderr logging, or the actual walredo)
    • essentially pipelining the walredo requests (still FIFO order)
    • unsure, but requests are now serialized by the caller completely ahead of time instead of as we go

I think the implementation is becoming quite complicated with the added ring buffer and custom polling. There might be benefits from a tokio-based implementation. I don't really understand how the walredo process can work without any changes with this arrangement; are the default buffer sizes so large and the requests usually so small that this becomes useful?

@knizhnik (Contributor, Author)

If I understand correctly, this changes the walredo to:

* run in its own thread

* send work to the external process while the external process is busy doing other things (sending a response, stderr logging, or the actual walredo)

  * essentially pipelining the walredo requests, but still FIFO order
  * unsure, but requests are now serialized by the caller completely ahead of time instead of as we go

I think the implementation is becoming quite complicated with the added ring buffer and custom polling. There might be benefits from a tokio-based implementation. I don't really understand how the walredo process can work without any changes with this arrangement; are the default buffer sizes so large and the requests usually so small that this becomes useful?

Right now each thread in the pageserver works with the walredo process directly, using a mutex for synchronization.
Metrics show that in the case of a large number of parallel requests, a lot of time is spent just waiting for this mutex.
It can be addressed in two ways:

  1. Buffering
  2. Using a pool of walredo processes.

Both solutions are currently implemented (as PRs) and show similar performance (at least on the Ketteq queries).
This performance is mostly limited by the number of parallel requests to walredo these threads can produce (the average number is 2.5).

Yes, this PR spawns a separate thread for communication with the walredo process and uses Rust channels for communication between threads. There is a bounded channel for sending requests, and this is where the buffering actually takes place.
Responses are delivered using individual (one-shot) channels.

Yes, buffering has a critical impact on walredo performance - see #1339 and my table with the dependency of redo speed on the number of buffered requests (almost linear).

@hlinnaka (Contributor)

  • run in its own thread

That might actually become a problem. We have had problems with large numbers of threads in the past. If you have thousands of tenants, and you launch a separate thread for each tenant, it's too much. Can we replace the thread with a tokio task easily?

I think the implementation is becoming quite complicated with the added ring buffer and custom polling.

I don't think there's any need for the ring buffer anymore, now that this uses a separate oneshot channel for each response. (I mean oneshot channel in how it's used, even if it's actually implemented using std::mpsc). I'm surprised to see that it's still there. Let's remove it.

We could try to rewrite the polling using tokio::select. However, I actually wrote it like that at first, but switched to nix::poll for performance reasons in commit b38e841, so if we do that it needs to be performance tested carefully. I don't think there's any fundamental reason why tokio async would be so much slower, though, so maybe the old tokio implementation was just doing something silly.

An even more drastic change would be to replace the pipe with shared memory.

But the thing I like about this PR is that it's relatively small. I'd like to get this into a committable shape, keeping it as simple as possible, and experiment with more drastic rewrites later.
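For reference, a hedged sketch of what the tokio::select-based loop mentioned above might look like (illustrative names only; this is not the earlier implementation, and this PR uses nix::poll instead):

    use tokio::io::{AsyncReadExt, AsyncWriteExt};
    use tokio::process::{ChildStdin, ChildStdout};
    use tokio::sync::mpsc;

    async fn redo_loop(
        mut requests: mpsc::Receiver<Vec<u8>>,
        mut stdin: ChildStdin,
        mut stdout: ChildStdout,
    ) {
        let mut page = vec![0u8; 8192];
        loop {
            tokio::select! {
                // New serialized request from a pageserver task.
                maybe_req = requests.recv() => {
                    match maybe_req {
                        Some(req) => { let _ = stdin.write_all(&req).await; }
                        None => break, // all senders dropped, shut down
                    }
                }
                // Bytes of a reconstructed page coming back from walredo.
                res = stdout.read(&mut page) => {
                    match res {
                        Ok(0) | Err(_) => break, // walredo process exited
                        Ok(_n) => { /* assemble the full page, hand it to the waiting oneshot */ }
                    }
                }
            }
        }
    }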

Comment on lines 586 to 591

    // Ring buffer with channel identifiers of buffered redo requests
    ring_buf: [Option<Sender<Bytes>>; N_CHANNELS],
    // Head position in ring buffer
    ring_buf_pos: usize,
    // Number of buffered requests
    n_buffered: usize,
Contributor

Hmm, so the ring buffer is now used just locally in the thread, to keep track of which response received from the walredo process belongs to which request. Let's replace it with std::collections::VecDeque.
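A minimal sketch of that suggestion (names are illustrative): the thread keeps a plain FIFO of response senders, pushing one per request written to walredo and popping the oldest whenever a full page has been read back.

    use std::collections::VecDeque;
    use std::sync::mpsc::Sender;

    struct PendingResponses {
        queue: VecDeque<Sender<Vec<u8>>>,
    }

    impl PendingResponses {
        // Called when a request is written to the walredo process.
        fn push_request(&mut self, resp_tx: Sender<Vec<u8>>) {
            self.queue.push_back(resp_tx);
        }

        // Called when a full 8192-byte page has been read from walredo stdout.
        // Returns false if there was no outstanding request, i.e. the process
        // produced unsolicited output (discussed further below).
        fn complete_oldest(&mut self, page: Vec<u8>) -> bool {
            match self.queue.pop_front() {
                Some(resp_tx) => {
                    let _ = resp_tx.send(page);
                    true
                }
                None => false,
            }
        }
    }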

Contributor Author

Done

    self.page_pos += self.stdout.read(&mut self.page[self.page_pos..])?;
    if self.page_pos == BLCKSZ as usize {
        assert!(self.n_buffered != 0);
        let sender = self.ring_buf[self.ring_buf_pos].take().unwrap();
Contributor

This will panic if the walredo process spontaneously writes to its stdout when no requests have been sent to it. Let's handle that more gracefully, without panicking.

Contributor Author

But what will happen if walredo outputs something to stdout while a request is being processed?
The pageserver will get some unexpected input and most likely will panic in any case.
So I can just remove this assert and replace it with some checking and error reporting, but IMHO it will not make our system more reliable or fool-proof. As far as I understand, all logging performed by Postgres is currently redirected to stderr for the walredo process. So the problem would only arise if there is some printf in the Postgres code. But then we are in trouble, and the faster we detect it, the less harm it can do. In this sense an assertion is preferable to just reporting an error and maybe restarting the walredo process. As long as we are not permanently monitoring the logs for errors, such a failure can be ignored until it causes some critical problem like data corruption.

Contributor

But what will happen if walredo outputs something to stdout while a request is being processed?
The pageserver will get some unexpected input and most likely will panic in any case.

We should handle that without panicking, too.

So I can just remove this assert and replace it with some checking and error reporting, but IMHO it will not make our system more reliable or fool-proof. As far as I understand, all logging performed by Postgres is currently redirected to stderr for the walredo process. So the problem would only arise if there is some printf in the Postgres code. But then we are in trouble, and the faster we detect it, the less harm it can do. In this sense an assertion is preferable to just reporting an error and maybe restarting the walredo process. As long as we are not permanently monitoring the logs for errors, such a failure can be ignored until it causes some critical problem like data corruption.

We should start paying more attention to ERRORs in the logs. We shouldn't need to panic just to get the attention of developers earlier. It's not cool to crash the whole pageserver, causing an outage for all tenants, when there is a problem with one tenant.

Also keep in mind that the WAL redo process is untrusted. Even if we are 100% sure that we never write to stdout in the postgres code we have, we still need to be prepared for the case that someone writes a malicious WAL record that leads to arbitrary code execution in the WAL redo process, and that code writes to stdout.

Contributor Author

Right now the response from the walredo process is just 8192 opaque bytes. The pageserver is not able to check their content in any way. So I was wrong - there will be no panic in the pageserver. It will just send wrong page images to compute, which in some cases may be even worse, because it may cause a crash of compute or incorrect behavior of the client application.

There are two reasons which can cause it:

  1. Presence of an "illegal" printf in the Postgres code or even in extensions.
  2. A malicious WAL record which causes walredo to do wrong things, including writing to stdout.

The probability of the first case seems to be much higher. We cannot protect the user from shooting themselves in the foot. We should not allow it to affect other clients. And it would be nice to protect the user from incorrect behavior caused by our or any other third-party bugs (i.e. we allow the postgis extension to be used, and somewhere inside its thousands of lines of code there is some printf). So maybe we should add some response header for walredo (with a length and maybe a CRC) so that the pageserver can check it?

In any case it seems to be a separate PR.
I can remove the assert or replace it with error reporting, but IMHO that is not enough to address this problem.

@hlinnaka (Contributor) commented Nov 14, 2022

There are two reasons which can cause it:

  1. Presence of an "illegal" printf in the Postgres code or even in extensions.

  2. A malicious WAL record which causes walredo to do wrong things, including writing to stdout.

The probability of the first case seems to be much higher. We cannot protect the user from shooting themselves in the foot. We should not allow it to affect other clients. And it would be nice to protect the user from incorrect behavior caused by our or any other third-party bugs (i.e. we allow the postgis extension to be used, and somewhere inside its thousands of lines of code there is some printf). So maybe we should add some response header for walredo (with a length and maybe a CRC) so that the pageserver can check it?

+1. I think a simple "request ID" would do. In the request, send a unique request ID, e.g. just using a u64 counter. In the wal redo process, include the request ID in the response, and in the pageserver, check that the ID in the response matches what you expected.
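A hedged sketch of that check on the pageserver side (the 8-byte little-endian ID prefix is a made-up wire format, only for illustration):

    /// Verify that a walredo response starts with the request ID we expect,
    /// and return the payload that follows it. On mismatch, report an error
    /// instead of panicking so the caller can restart the walredo process.
    fn check_response(expected_id: u64, resp: &[u8]) -> Result<&[u8], String> {
        if resp.len() < 8 {
            return Err("short walredo response".to_string());
        }
        let mut id_bytes = [0u8; 8];
        id_bytes.copy_from_slice(&resp[..8]);
        let got_id = u64::from_le_bytes(id_bytes);
        if got_id != expected_id {
            return Err(format!(
                "walredo response id mismatch: expected {}, got {}",
                expected_id, got_id
            ));
        }
        Ok(&resp[8..])
    }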

In any case it seems to be a separate PR.

Agreed.

I can remove the assert or replace it with error reporting, but IMHO that is not enough to address this problem.

Yeah, please replace the assert with something here. It solves the "one tenant can cause an outage for others" problem, but not the other problems. That's good enough for this PR.

Contributor Author

Done

Contributor Author

Just want to take back my comment concerning printfs in extensions: extensions are not involved in WAL replay, so they cannot output anything to stdout.

Contributor

With PG15, extensions are involved in WAL redo - see https://www.postgresql.org/docs/current/custom-rmgr.html

@koivunej (Contributor)

Can we replace the thread with a tokio task easily?

I don't think so. We would essentially be running another event loop within the event loop, and it would be blocking. Moving this to any of the existing blocking pools would only move the issue (now instead of unnamed extra threads we have larger blocking pools, which might become full since these tasks are long-running). I'd suggest tolerating the new thread here until we rewrite this away from the custom poll implementation.

However, I actually wrote it like that at first, but switched to nix::poll for performance reasons in commit b38e841, so if we do that it needs to be performance tested carefully.

I am trying to come up with a criterion-based benchmark for this subsystem.

@hlinnaka (Contributor)

I don't think so. We would essentially be running another event loop within the event loop, and it would be blocking. Moving this to any of the existing blocking pools would only move the issue (now instead of unnamed extra threads we have larger blocking pools, which might become full since these tasks are long-running).

Hmm. I guess the right way to do it would be to call spawn_blocking() before you do the poll(). Then each iteration would call spawn_blocking(), though, and I'm not sure how big the overhead of that is. This is pretty performance sensitive. Another approach would be to have a separate tokio runtime for these tasks. Then the blocking calls wouldn't disrupt other parts of the system.
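A hedged sketch of the "separate runtime" option (thread count and names are illustrative): a small dedicated runtime whose worker threads are allowed to block in poll() without starving the main pageserver runtime.

    use tokio::runtime::{Builder, Runtime};

    fn walredo_runtime() -> Runtime {
        Builder::new_multi_thread()
            .worker_threads(2) // arbitrary; sized for the walredo workload
            .thread_name("walredo")
            .enable_all()
            .build()
            .expect("failed to build walredo runtime")
    }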

I'd suggest tolerating the new thread here until we rewrite this away from the custom poll implementation.

I am quite worried about having thousands of threads. We've been burned by that before: #1719.

@knizhnik (Contributor, Author)

I do not see a big problem with spawning a tokio task rather than an OS thread for walredo.
We are doing it now for all pageserver background tasks (GC, compaction, upload, ...),
so why can we not just do the same for walredo?

But I am concerned about two things:

  1. Development loops (not "design->implementation->testing->refactoring" :) There are some submodules in Neon whose implementation keeps changing in cycles (async->sync->async->...). It looks like walredo has a chance to become yet another such component if we rewrite this nix::poll code with a standard tokio select. I am not saying that it is bad, but something seems wrong with such a "strategy".
  2. There is still a separate walredo process for each tenant. A process is a much more heavyweight structure than a thread. If we are concerned about a large number of threads, then shouldn't we also care about a large number of processes?

@knizhnik (Contributor, Author)

Frankly speaking, this sync/async mash is horrible :( 30 years ago in Solaris there were normal threads with flexible mapping between user and system threads. Such lightweight threads are also provided by Go and Java. And here we still have to deal with this awful async/await and use prints for debugging.

Comment on lines +688 to +693

    panic!(
        "wal-redo-postgres: {}",
        String::from_utf8_lossy(&errbuf[0..n])
    );
@koivunej (Contributor) commented Nov 15, 2022

Noted while trying to come up with a benchmark for this (accidental wrong buffertag) that this used to be only an error!. Similar note on the other instance.

Contributor Author

Thank you, fixed.

@knizhnik (Contributor, Author)

Hmm. I guess the right way to do it would be to call spawn_blocking() before you do the poll(). Then each iteration would call spawn_blocking(), though, and I'm not sure how big the overhead of that is. This is pretty performance sensitive. Another approach would be to have a separate tokio runtime for these tasks. Then the blocking calls wouldn't disrupt other parts of the system.

After more thinking, I have also come to the conclusion that changing the wal-redo thread to a tokio task will be non-trivial.
I agree that it can be quite useful (reducing the number of threads) and may even result in better performance.
Right now we are trying to extract from the channel as many requests as we can (using non-blocking try_recv), but ideally we should multiplex reading from the channel and writing to the walredo process pipe. In that case we need to use tokio channels rather than std::sync::mpsc. But at the other end of the channel we have sync code (apply_wal_records), which is in turn called from a tokio task. Can we use tokio channels in this case?

But in any case, I think that we should create a separate PR for it.
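For what it's worth, tokio channels do offer blocking_* methods for exactly this kind of sync/async bridge; a hedged sketch with illustrative names (note these methods panic with the "runtime within a runtime" error quoted earlier if called from inside an async context):

    use tokio::sync::{mpsc, oneshot};

    /// Called from synchronous code such as apply_wal_records: hand a request
    /// to the async redo task and block until its response arrives.
    fn request_redo_blocking(
        tx: &mpsc::Sender<(oneshot::Sender<Vec<u8>>, Vec<u8>)>,
        req: Vec<u8>,
    ) -> Option<Vec<u8>> {
        let (resp_tx, resp_rx) = oneshot::channel();
        tx.blocking_send((resp_tx, req)).ok()?;
        resp_rx.blocking_recv().ok()
    }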

koivunej added a commit that referenced this pull request Nov 16, 2022
adds a simple walredo bench to allow some comparison of the walredo
throughput.

Cc: #1339, #2778
koivunej added a commit that referenced this pull request Nov 17, 2022
adds a simple walredo bench to allow some comparison of the walredo
throughput.

Cc: #1339, #2778
@hlinnaka (Contributor)

Quick status sync on this:

Let's finalize this PR. The only big blocker is the addition of one thread per timeline. That's not acceptable, because we have a hard limit of 4096 threads (#1609), after which you get a panic. With one thread per timeline, we get uncomfortably close to that limit.

@koivunej, you're working on replacing the thread with a tokio task in this PR, right?

@koivunej (Contributor)

Apologies for the late reply but yes, I am working on that. I'll open PRs today.

@knizhnik (Contributor, Author)

Another solution (maybe more complex) is to have a global pool of walredo processes with dedicated threads for interaction with them. The number of threads and processes may differ, so a 1-to-1 relation is not required.

We already have a PR with a pool of walredo processes, which can improve performance in the case of seqscans or background pageserver activities. But the main problem here is that if each tenant spawns multiple walredo processes, the total number of processes can become very large and cause exhaustion of some system resources. Having a pool of walredo processes, we can guarantee that this number is limited.

Certainly it would be great if we could reuse a walredo process to serve multiple tenants, because it is actually stateless. But it is insecure. So we should either find some way to efficiently reset all process memory and local storage (which seems to be impossible), or start and stop processes on demand. Maybe there is some way to efficiently "clone" a process to speed up its start.

@koivunej koivunej mentioned this pull request Nov 21, 2022
@hlinnaka hlinnaka self-requested a review November 28, 2022 14:09
@LizardWizzard (Contributor)

Given the other improvements in the walredo area, is this still relevant?

@knizhnik knizhnik closed this Mar 21, 2023
@bayandin bayandin deleted the walredo_channel branch May 19, 2023 13:05