
Walredo channel #2778

Closed · wants to merge 13 commits

Conversation

@knizhnik (Contributor) commented Nov 8, 2022

See #1339
Buffering walredo requests has a critical impact on performance.
This PR tries to do it using Rust channels.
It reduces Ketteq Q1 execution time from 30 to 17 seconds.
Unfortunately, the average number of buffered requests with 6 parallel workers is only 2.5.
10 seconds is the Q1 execution time without page reconstruction (with up-to-date image layers).
So the remaining 7 seconds is walredo time with 2.5 buffered requests.
I am not sure we can achieve a higher level of buffering. The maximal number of buffered requests is 7 (6 parallel workers + master), so the channels are in principle able to buffer requests, but the average number is just 2.5.

    pg_version: u32,
    ) -> Result<PostgresRedoProcess, Error> {
    #[instrument(skip_all, fields(tenant_id=%tenant_id))]
    fn launch(conf: &PageServerConf, tenant_id: TenantId) -> Result<PostgresRedoProcess, Error> {
Contributor

Why is the pg version removed and forced to PG14?

Contributor Author

This is one of the unresolved issues related to this PR.
There is also a similar problem in main - see #2770.
pg_version should somehow be specified at WalRedoManager creation time.

@problame (Contributor) left a comment

This will conflict with #2776 and I'm planning to merge that in a few minutes.

@knizhnik (Contributor, Author) commented Nov 9, 2022

This will conflict with #2776 and I'm planning to merge that in a few minutes.

Yes, I see. The conflict can be resolved.
But I still do not quite understand why killing the child process from the drop() method doesn't work.
I read the comment on #2776, but the motivation is still unclear to me: I do not understand how the launch() method can fail AFTER spawning the child process. It looks like the only reason for a launch() failure is some problem with spawning the process, doesn't it?

@problame (Contributor) commented Nov 9, 2022

Looks like the only reason for a launch() failure is some problem with spawning the process, doesn't it?

If we hit any of the unwrap() or set_nonblocking(...)? early returns after spawning the std::process::Child but before wrapping it into the PostgresRedoProcess, then the Child gets dropped and leaked. It's documented as a pitfall in the standard library: https://doc.rust-lang.org/std/process/struct.Child.html#warning
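One way to avoid that leak, sketched below, is to hold the spawned Child in a guard that kills and reaps it on drop, and only disarm the guard once the fallible setup has succeeded. This is a minimal illustration of the pitfall linked above, assuming nothing about the actual code in this PR or in #2776:

    use std::io::Error;
    use std::process::{Child, Command, Stdio};

    /// Kills and reaps the child if an early `?`/`unwrap()` bails out of a
    /// launch-like function before the Child is handed over to its final owner.
    struct KillOnDrop(Option<Child>);

    impl KillOnDrop {
        fn disarm(mut self) -> Child {
            self.0.take().unwrap()
        }
    }

    impl Drop for KillOnDrop {
        fn drop(&mut self) {
            if let Some(child) = self.0.as_mut() {
                let _ = child.kill();
                let _ = child.wait();
            }
        }
    }

    fn launch_guarded() -> Result<Child, Error> {
        let child = Command::new("true")
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let guard = KillOnDrop(Some(child));
        // ...fallible setup such as set_nonblocking() would go here; any early
        // return now kills the child via the guard instead of leaking it...
        Ok(guard.disarm())
    }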

@knizhnik (Contributor, Author) commented Nov 9, 2022

Yes, that is true.
But I do not see any transient reason why the child's stdin/stdout could not be taken and switched to non-blocking mode.
If that fails, it means some permanent OS/process problem that cannot simply be ignored.
If we try to continue normal work and spawn the walredo process once again, most likely we will face the same error.
So I do not think that we should somehow try to catch and handle these errors.

@problame (Contributor) commented Nov 9, 2022

Let's continue this conversation in the PR: #2776 (comment)

Comment on lines 110 to 116

    // multiplexer pipe: use sync_channel to allow sharing the sender by multiple threads
    // and to limit the size of the buffer
    sender: SyncSender<(ChannelId, Vec<u8>)>,
    // set of receiver channels
    receivers: Vec<Mutex<Receiver<Bytes>>>,
    // atomically incremented counter for choosing a receiver
    round_robin: AtomicUsize,
@hlinnaka (Contributor) commented Nov 10, 2022

The usual way to have a request/response queue like this is to create a "oneshot" channel for each request, and send the sending half of it along with the request. So something like:

    /// To request WAL redo, create a oneshot channel for the response, and send the
    /// serialized request and the send-half of the oneshot channel here. Then wait for
    /// the response on the receive-half of the oneshot channel
    request_queue: tokio::sync::mpsc::Sender<(tokio::sync::oneshot::Sender<Bytes>, Vec<u8>)>,

You can use those tokio channels from blocking code too; the Senders and Receivers also have blocking send and receive functions.
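A hedged sketch of that pattern (type and function names are illustrative, not the actual pageserver code):

    use tokio::sync::{mpsc, oneshot};

    /// Serialized WAL-redo request plus the send-half for its response.
    type Request = (oneshot::Sender<Vec<u8>>, Vec<u8>);

    /// The task that owns the walredo process: pull requests off the queue,
    /// feed them to the process, and send each reconstructed page back on
    /// that request's own oneshot channel.
    async fn redo_task(mut rx: mpsc::Receiver<Request>) {
        while let Some((resp_tx, req)) = rx.recv().await {
            // ...write `req` to the walredo process and read back a page...
            let page = req; // placeholder: echo the request
            let _ = resp_tx.send(page);
        }
    }

    /// Caller side: create a oneshot per request, send it along with the
    /// serialized request, then wait for the response.
    async fn request_redo(tx: &mpsc::Sender<Request>, req: Vec<u8>) -> Option<Vec<u8>> {
        let (resp_tx, resp_rx) = oneshot::channel();
        tx.send((resp_tx, req)).await.ok()?;
        resp_rx.await.ok()
    }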

Contributor Author

With blocking_recv I get the following error:

Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.

I wonder why it is necessary to use tokio::sync::oneshot::Sender?
I tried the standard std::sync::mpsc::Sender and the speed is the same.
I have tested the speed with this simple test:

    use std::sync::mpsc::{self, Receiver, Sender, SyncSender};

    const N_ITER: u32 = 1_000_000; // arbitrary iteration count

    fn main() {
        let (tx, rx): (
            SyncSender<(Sender<u32>, u32)>,
            Receiver<(Sender<u32>, u32)>,
        ) = mpsc::sync_channel(1024 * 1024);

        let worker = std::thread::spawn(move || {
            for i in 0..N_ITER {
                let (sender, data) = rx.recv().unwrap();
                assert!(data == i);
                sender.send(data).unwrap();
            }
        });

        for i in 0..N_ITER {
            // one "oneshot-style" response channel per request
            let (tx1, rx1) = mpsc::channel();
            tx.send((tx1, i)).unwrap();
            let resp = rx1.recv().unwrap();
            assert!(resp == i);
        }

        worker.join().unwrap();
    }

Contributor Author

I also tried https://docs.rs/crossbeam-channel/0.5.6/crossbeam_channel/index.html
because I saw a recommendation to use it as a much faster alternative to std::sync::mpsc.
But in my test I didn't see a significant difference.

Contributor

I wonder why it is necessary to use tokio::sync::oneshot::Sender?
I tried the standard std::sync::mpsc::Sender and the speed is the same.

This is just a design pattern. With oneshot it's clear at the type level that you are only going to get back a single response. The std's rather complex mpsc optimizes for this "oneshot" usage first, then switches to a different kind of implementation when Senders are cloned or more messages are sent. Later implementations (crossbeam, flume) do not necessarily do this.

Cannot start a runtime from within a runtime. This happens because a function (like block_on) attempted to block the current thread while the thread is being used to drive asynchronous tasks.

This is quite unfortunate. I wonder if a block_on or block_in_place is already being used somewhere.

Contributor Author

This is quite unfortunate.

Why do we need to use tokio::sync::oneshot if it provides no performance advantage compared with std::sync::mpsc, which has no such problems?

@koivunej (Contributor) commented Nov 14, 2022

If I understand correctly, this changes the walredo to:

  • run in its own thread
  • send work to the external process while the external process is busy doing other things (sending a response, stderr logging, or the actual walredo)
    • essentially pipelining the walredo requests (still FIFO order)
    • unsure, but requests are now serialized by the caller completely ahead of time instead of as we go

I think the implementation is becoming quite complicated with the added ring buffer and custom polling. There might be benefits from a tokio-based implementation. I don't really understand how the walredo process can work without any changes with this arrangement; are the default buffer sizes so large and the requests usually so small that this becomes useful?

@knizhnik (Contributor, Author)

If I understand correctly, this changes the walredo to:

* run in its own thread

* send work to the external process while the external process is busy doing other things (sending a response, stderr logging, or the actual walredo)

  * essentially pipelining the walredo requests, but still FIFO order
  * unsure, but requests are now serialized by the caller completely ahead of time instead of as we go

I think the implementation is becoming quite complicated with the added ring buffer and custom polling. There might be benefits from a tokio-based implementation. I don't really understand how the walredo process can work without any changes with this arrangement; are the default buffer sizes so large and the requests usually so small that this becomes useful?

Right now each thread in the pageserver works with the walredo process directly, using a mutex for synchronization.
Metrics show that in the case of a large number of parallel requests, a lot of time is spent just waiting for this mutex.
It can be addressed in two ways:

  1. Buffering
  2. Using a pool of walredo processes.

Both solutions are currently implemented (as PRs) and show similar performance (at least on the Ketteq queries).
This performance is mostly limited by the number of parallel requests to walredo these threads can produce (the average number is 2.5).

Yes, this PR spawns a separate thread for communication with the walredo process and uses Rust channels for communication between threads. There is a bounded channel for sending requests, and this is where the buffering actually takes place.
Responses are delivered using individual (one-shot) channels.

Yes, buffering has a critical impact on walredo performance - see #1339 and my table with the dependency of redo speed on the number of buffered requests (almost linear).

@hlinnaka (Contributor)

  • run in its own thread

That might actually become a problem. We have had problems with large numbers of threads in the past. If you have thousands of tenants, and you launch a separate thread for each tenant, it's too much. Can we replace the thread with a tokio task easily?

I think the implementation is becoming quite complicated with the added ring buffer and custom polling.

I don't think there's any need for the ring buffer anymore, now that this uses a separate oneshot channel for each response. (I mean oneshot channel in how it's used, even if it's actually implemented using std::mpsc). I'm surprised to see that it's still there. Let's remove it.

We could try to rewrite the polling using tokio::select. However, I actually wrote it like that at first, but switched to nix::poll for performance reasons in commit b38e841, so if we do that it needs to be performance tested carefully. I don't think there's any fundamental reason why tokio async would be so much slower, though, so maybe the old tokio implementation was just doing something silly.

An even more drastic change would be to replace the pipe with shared memory.

But the thing I like about this PR is that it's relatively small. I'd like to get this into a committable shape, keeping it as simple as possible, and experiment with more drastic rewrites later.
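For reference, a hedged sketch of what the tokio::select-based loop mentioned above might look like (illustrative names only; this is not the earlier implementation, and this PR uses nix::poll instead):

    use tokio::io::{AsyncReadExt, AsyncWriteExt};
    use tokio::process::{ChildStdin, ChildStdout};
    use tokio::sync::mpsc;

    async fn redo_loop(
        mut requests: mpsc::Receiver<Vec<u8>>,
        mut stdin: ChildStdin,
        mut stdout: ChildStdout,
    ) {
        let mut page = vec![0u8; 8192];
        loop {
            tokio::select! {
                // New serialized request from a pageserver task.
                maybe_req = requests.recv() => {
                    match maybe_req {
                        Some(req) => { let _ = stdin.write_all(&req).await; }
                        None => break, // all senders dropped, shut down
                    }
                }
                // Bytes of a reconstructed page coming back from walredo.
                res = stdout.read(&mut page) => {
                    match res {
                        Ok(0) | Err(_) => break, // walredo process exited
                        Ok(_n) => { /* assemble the full page, hand it to the waiting oneshot */ }
                    }
                }
            }
        }
    }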

Comment on lines 586 to 591

    // Ring buffer with channel identifiers of buffered redo requests
    ring_buf: [Option<Sender<Bytes>>; N_CHANNELS],
    // Head position in ring buffer
    ring_buf_pos: usize,
    // Number of buffered requests
    n_buffered: usize,
Contributor

Hmm, so the ring buffer is now used just locally in the thread, to keep track of which response received from the walredo process belongs to which request. Let's replace it with std::collections::VecDeque.
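A minimal sketch of that suggestion (names are illustrative): the thread keeps a plain FIFO of response senders, pushing one per request written to walredo and popping the oldest whenever a full page has been read back.

    use std::collections::VecDeque;
    use std::sync::mpsc::Sender;

    struct PendingResponses {
        queue: VecDeque<Sender<Vec<u8>>>,
    }

    impl PendingResponses {
        // Called when a request is written to the walredo process.
        fn push_request(&mut self, resp_tx: Sender<Vec<u8>>) {
            self.queue.push_back(resp_tx);
        }

        // Called when a full 8192-byte page has been read from walredo stdout.
        // Returns false if there was no outstanding request, i.e. the process
        // produced unsolicited output (discussed further below).
        fn complete_oldest(&mut self, page: Vec<u8>) -> bool {
            match self.queue.pop_front() {
                Some(resp_tx) => {
                    let _ = resp_tx.send(page);
                    true
                }
                None => false,
            }
        }
    }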

Contributor Author

Done

    self.page_pos += self.stdout.read(&mut self.page[self.page_pos..])?;
    if self.page_pos == BLCKSZ as usize {
        assert!(self.n_buffered != 0);
        let sender = self.ring_buf[self.ring_buf_pos].take().unwrap();
Contributor

This will panic if the walredo process spontaneously writes to its stdout when no requests have been sent to it. Let's handle that more gracefully, without panicking.

Contributor Author

But what will happen if walredo outputs something to stdout while a request is being processed?
The pageserver will get some unexpected input and most likely will panic in any case.
So I can just remove this assert and replace it with some checking and error reporting, but IMHO it will not make our system more reliable or fool-proof. As far as I understand, all logging performed by Postgres is currently redirected to stderr for the walredo process. So the problem would only arise if there is some printf in the Postgres code. But then we are in trouble, and the faster we detect it, the less harm it can do. In this sense an assertion is preferable to just reporting an error and maybe restarting the walredo process. As long as we are not permanently monitoring the logs for errors, such a failure can be ignored until it causes some critical problem like data corruption.

Contributor

But what will happen if walredo outputs something to stdout while a request is being processed?
The pageserver will get some unexpected input and most likely will panic in any case.

We should handle that without panicking, too.

So I can just remove this assert and replace it with some checking and error reporting, but IMHO it will not make our system more reliable or fool-proof. As far as I understand, all logging performed by Postgres is currently redirected to stderr for the walredo process. So the problem would only arise if there is some printf in the Postgres code. But then we are in trouble, and the faster we detect it, the less harm it can do. In this sense an assertion is preferable to just reporting an error and maybe restarting the walredo process. As long as we are not permanently monitoring the logs for errors, such a failure can be ignored until it causes some critical problem like data corruption.

We should start paying more attention to ERRORs in the logs. We shouldn't need to panic just to get the attention of developers earlier. It's not cool to crash the whole pageserver, causing an outage for all tenants, when there is a problem with one tenant.

Also keep in mind that the WAL redo process is untrusted. Even if we are 100% sure that we never write to stdout in the postgres code we have, we still need to be prepared for the case that someone writes a malicious WAL record that leads to arbitrary code execution in the WAL redo process, and that code writes to stdout.

Contributor Author

Right now the response from the walredo process is just 8192 opaque bytes. The pageserver is not able to check their content in any way. So I was wrong - there will be no panic in the pageserver. It will just send wrong page images to compute, which in some cases may be even worse, because it may cause a crash of compute or incorrect behavior of the client application.

There are two reasons which can cause it:

  1. Presence of an "illegal" printf in the Postgres code or even in extensions.
  2. A malicious WAL record which causes walredo to do wrong things, including writing to stdout.

The probability of the first case seems to be much higher. We cannot protect the user from shooting themselves in the foot. We should not allow it to affect other clients. And it would be nice to protect the user from incorrect behavior caused by our or any other third-party bugs (i.e. we allow the postgis extension to be used, and somewhere inside its thousands of lines of code there is some printf). So maybe we should add some response header for walredo (with a length and maybe a CRC) so that the pageserver can check it?

In any case it seems to be a separate PR.
I can remove the assert or replace it with error reporting, but IMHO that is not enough to address this problem.

@hlinnaka (Contributor) commented Nov 14, 2022

There are two reasons which can cause it:

  1. Presence of an "illegal" printf in the Postgres code or even in extensions.

  2. A malicious WAL record which causes walredo to do wrong things, including writing to stdout.

The probability of the first case seems to be much higher. We cannot protect the user from shooting themselves in the foot. We should not allow it to affect other clients. And it would be nice to protect the user from incorrect behavior caused by our or any other third-party bugs (i.e. we allow the postgis extension to be used, and somewhere inside its thousands of lines of code there is some printf). So maybe we should add some response header for walredo (with a length and maybe a CRC) so that the pageserver can check it?

+1. I think a simple "request ID" would do. In the request, send a unique request ID, e.g. just using a u64 counter. In the wal redo process, include the request ID in the response, and in the pageserver, check that the ID in the response matches what you expected.
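A hedged sketch of that check on the pageserver side (the 8-byte little-endian ID prefix is a made-up wire format, only for illustration):

    /// Verify that a walredo response starts with the request ID we expect,
    /// and return the payload that follows it. On mismatch, report an error
    /// instead of panicking so the caller can restart the walredo process.
    fn check_response(expected_id: u64, resp: &[u8]) -> Result<&[u8], String> {
        if resp.len() < 8 {
            return Err("short walredo response".to_string());
        }
        let mut id_bytes = [0u8; 8];
        id_bytes.copy_from_slice(&resp[..8]);
        let got_id = u64::from_le_bytes(id_bytes);
        if got_id != expected_id {
            return Err(format!(
                "walredo response id mismatch: expected {}, got {}",
                expected_id, got_id
            ));
        }
        Ok(&resp[8..])
    }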

In any case it seems to be a separate PR.

Agreed.

I can remove the assert or replace it with error reporting, but IMHO that is not enough to address this problem.

Yeah, please replace the assert with something here. It solves the "one tenant can cause an outage for others" problem, but not the other problems. That's good enough for this PR.

Contributor Author

Done

Contributor Author

Just want to take back my comment concerning printfs in extensions: extensions are not involved in WAL replay, so they cannot output anything to stdout.

Contributor

With PG15, extensions are involved in WAL redo - see https://www.postgresql.org/docs/current/custom-rmgr.html

@koivunej (Contributor)

Can we replace the thread with a tokio task easily?

I don't think so. We would essentially be running another event loop within the event loop, and it would be blocking. Moving this to any of the existing blocking pools would only move the issue (now instead of unnamed extra threads we have larger blocking pools, which might become full since these tasks are long-running). I'd suggest tolerating the new thread here until we rewrite this away from the custom poll implementation.

However, I actually wrote it like that at first, but switched to nix::poll for performance reasons in commit b38e841, so if we do that it needs to be performance tested carefully.

I am trying to come up with a criterion-based benchmark for this subsystem.

@hlinnaka (Contributor)

I don't think so. We would essentially be running another event loop within the event loop, and it would be blocking. Moving this to any of the existing blocking pools would only move the issue (now instead of unnamed extra threads we have larger blocking pools, which might become full since these tasks are long-running).

Hmm. I guess the right way to do it would be to call spawn_blocking() before you do the poll(). Then each iteration would call spawn_blocking(), though, and I'm not sure how big the overhead of that is. This is pretty performance sensitive. Another approach would be to have a separate tokio runtime for these tasks. Then the blocking calls wouldn't disrupt other parts of the system.
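A hedged sketch of the "separate runtime" option (thread count and names are illustrative): a small dedicated runtime whose worker threads are allowed to block in poll() without starving the main pageserver runtime.

    use tokio::runtime::{Builder, Runtime};

    fn walredo_runtime() -> Runtime {
        Builder::new_multi_thread()
            .worker_threads(2) // arbitrary; sized for the walredo workload
            .thread_name("walredo")
            .enable_all()
            .build()
            .expect("failed to build walredo runtime")
    }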

I'd suggest tolerating the new thread here until we rewrite this away from the custom poll implementation.

I am quite worried about having thousands of threads. We've been burned by that before: #1719.

@knizhnik (Contributor, Author)

I do not see a big problem with spawning a tokio task rather than an OS thread for walredo.
We are doing it now for all pageserver background tasks (GC, compaction, upload, ...),
so why can we not just do the same for walredo?

But I am concerned about two things:

  1. Development loops (not "design->implementation->testing->refactoring" :) There are some submodules in Neon whose implementation keeps changing in cycles (async->sync->async->...). It looks like walredo has a chance to become yet another such component if we rewrite this nix::poll code with a standard tokio select. I am not saying that it is bad, but something seems wrong with such a "strategy".
  2. There is still a separate walredo process for each tenant. A process is a much more heavyweight structure than a thread. If we are concerned about a large number of threads, then shouldn't we also care about a large number of processes?

@knizhnik (Contributor, Author)

Frankly speaking, this sync/async mash is horrible :( 30 years ago in Solaris there were normal threads with flexible mapping between user and system threads. Such lightweight threads are also provided by Go and Java. And here we still have to deal with this awful async/await and use prints for debugging.

Comment on lines +688 to +693

    panic!(
        "wal-redo-postgres: {}",
        String::from_utf8_lossy(&errbuf[0..n])
    );
@koivunej (Contributor) commented Nov 15, 2022

Noted while trying to come up with a benchmark for this (accidental wrong buffertag) that this used to be only an error!. Similar note on the other instance.

Contributor Author

Thank you, fixed.

@knizhnik (Contributor, Author)

Hmm. I guess the right way to do it would be to call spawn_blocking() before you do the poll(). Then each iteration would call spawn_blocking(), though, and I'm not sure how big the overhead of that is. This is pretty performance sensitive. Another approach would be to have a separate tokio runtime for these tasks. Then the blocking calls wouldn't disrupt other parts of the system.

After more thinking, I have also come to the conclusion that changing the wal-redo thread to a tokio task will be non-trivial.
I agree that it can be quite useful (reducing the number of threads) and may even result in better performance.
Right now we are trying to extract from the channel as many requests as we can (using non-blocking try_recv), but ideally we should multiplex reading from the channel and writing to the walredo process pipe. In that case we need to use tokio channels rather than std::sync::mpsc. But at the other end of the channel we have sync code (apply_wal_records), which is in turn called from a tokio task. Can we use tokio channels in this case?

But in any case, I think that we should create a separate PR for it.
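For what it's worth, tokio channels do offer blocking_* methods for exactly this kind of sync/async bridge; a hedged sketch with illustrative names (note these methods panic with the "runtime within a runtime" error quoted earlier if called from inside an async context):

    use tokio::sync::{mpsc, oneshot};

    /// Called from synchronous code such as apply_wal_records: hand a request
    /// to the async redo task and block until its response arrives.
    fn request_redo_blocking(
        tx: &mpsc::Sender<(oneshot::Sender<Vec<u8>>, Vec<u8>)>,
        req: Vec<u8>,
    ) -> Option<Vec<u8>> {
        let (resp_tx, resp_rx) = oneshot::channel();
        tx.blocking_send((resp_tx, req)).ok()?;
        resp_rx.blocking_recv().ok()
    }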

koivunej added a commit that referenced this pull request Nov 16, 2022
adds a simple walredo bench to allow some comparison of the walredo
throughput.

Cc: #1339, #2778
koivunej added a commit that referenced this pull request Nov 17, 2022
adds a simple walredo bench to allow some comparison of the walredo
throughput.

Cc: #1339, #2778
@hlinnaka (Contributor)

Quick status sync on this:

Let's finalize this PR. The only big blocker is the addition of one thread per timeline. That's not acceptable, because we have a hard limit of 4096 threads (#1609), after which you get a panic. With one thread per timeline, we get uncomfortably close to that limit.

@koivunej, you're working on replacing the thread with a tokio task in this PR, right?

@koivunej (Contributor)

Apologies for the late reply but yes, I am working on that. I'll open PRs today.

@knizhnik (Contributor, Author)

Another solution (maybe more complex) is to have a global pool of walredo processes with dedicated threads for interaction with them. The number of threads and processes may differ, so a 1-to-1 relation is not required.

We already have a PR with a pool of walredo processes, which can improve performance in the case of seqscans or background pageserver activities. But the main problem here is that if each tenant spawns multiple walredo processes, the total number of processes can become very large and cause exhaustion of some system resources. Having a pool of walredo processes, we can guarantee that this number is limited.

Certainly it would be great if we could reuse a walredo process to serve multiple tenants, because it is actually stateless. But it is insecure. So we should either find some way to efficiently reset all process memory and local storage (which seems to be impossible), or start and stop processes on demand. Maybe there is some way to efficiently "clone" a process to speed up its start.

@koivunej koivunej mentioned this pull request Nov 21, 2022
@hlinnaka hlinnaka self-requested a review November 28, 2022 14:09
@LizardWizzard (Contributor)

Given the other improvements in the walredo area, is this still relevant?

@knizhnik knizhnik closed this Mar 21, 2023
@bayandin bayandin deleted the walredo_channel branch May 19, 2023 13:05