Walredo channel #2778
Conversation
(force-pushed from aff0a4e to dc8f5e4)
pageserver/src/walredo.rs
Outdated
    pg_version: u32,
) -> Result<PostgresRedoProcess, Error> {
#[instrument(skip_all, fields(tenant_id=%tenant_id))]
fn launch(conf: &PageServerConf, tenant_id: TenantId) -> Result<PostgresRedoProcess, Error> {
Why is the pg version removed and forced to PG14?
This is one of the unresolved issues related to this PR.
There is also a similar problem in main - see #2770
pg_version should somehow be specified at WalRedoManager creation time.
This will conflict with #2776 and I'm planning to merge that in a few minutes.
Yes. But I hope
Yes, I see.
If hit any of the
Yes, it is true.
Let's continue this conversation in the PR: #2776 (comment)
pageserver/src/walredo.rs
Outdated
// multiplexer pipe: use sync_channel to allow sharing the sender by multiple threads
// and to limit the size of the buffer
sender: SyncSender<(ChannelId, Vec<u8>)>,
// set of receiver channels
receivers: Vec<Mutex<Receiver<Bytes>>>,
// atomically incremented counter for choosing a receiver
round_robin: AtomicUsize,
The usual way to have a request/response queue like this is to create a "oneshot" channel for each request, and send the sending half of it along with the request. So something like:
/// To request WAL redo, create a oneshot channel for the response, and send the
/// serialized request and the send-half of the oneshot channel here. Then wait for
/// the response on the receive-half of the oneshot channel
request_queue: tokio::sync::mpsc::Sender<(tokio::sync::oneshot::Sender<Bytes>, Vec<u8>)>,
You can use those tokio channels from blocking code too, the Senders and Receivers also have blocking send and receive functions.
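A minimal sketch of this request/response pattern using only std channels (the names `Request`, `request_tx` etc. are hypothetical, and the worker thread just echoes the payload back to stand in for the walredo process):

```rust
use std::sync::mpsc;
use std::thread;

// Each request carries its payload plus the send half of a per-request
// ("oneshot") channel; the worker replies on that channel.
type Request = (mpsc::Sender<Vec<u8>>, Vec<u8>);

fn main() {
    // Bounded request queue, analogous to the suggested request_queue.
    let (request_tx, request_rx) = mpsc::sync_channel::<Request>(128);

    // Worker thread standing in for the walredo I/O thread.
    let worker = thread::spawn(move || {
        while let Ok((resp_tx, payload)) = request_rx.recv() {
            // Echo the payload back, standing in for the reconstructed page.
            resp_tx.send(payload).unwrap();
        }
    });

    // Caller side: create a fresh response channel per request, then wait on it.
    let (resp_tx, resp_rx) = mpsc::channel();
    request_tx.send((resp_tx, vec![1, 2, 3])).unwrap();
    assert_eq!(resp_rx.recv().unwrap(), vec![1, 2, 3]);

    drop(request_tx); // close the queue so the worker loop exits
    worker.join().unwrap();
}
```

The same shape works with `tokio::sync::mpsc` plus `tokio::sync::oneshot` as suggested above; only the channel types change.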
With blocking_recv I get the following error:
Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.
I wonder why it is necessary to use tokio::sync::oneshot::Sender?
I tried the standard std::sync::mpsc::Sender and the speed is the same.
I have tested the speed with this simple test:
use std::sync::mpsc::{self, Receiver, Sender, SyncSender};

const N_ITER: u32 = 1_000_000;

fn main() {
    let (tx, rx): (
        SyncSender<(Sender<u32>, u32)>,
        Receiver<(Sender<u32>, u32)>,
    ) = mpsc::sync_channel(1024 * 1024);
    std::thread::spawn(move || {
        for i in 0..N_ITER {
            let (sender, data) = rx.recv().unwrap();
            assert!(data == i);
            sender.send(data).unwrap();
        }
    });
    for i in 0..N_ITER {
        let (tx1, rx1) = mpsc::channel();
        tx.send((tx1, i)).unwrap();
        let resp = rx1.recv().unwrap();
        assert!(resp == i);
    }
}
I also tried https://docs.rs/crossbeam-channel/0.5.6/crossbeam_channel/index.html
because I saw a recommendation to use it as a much faster alternative to std::sync::mpsc.
But in my test I didn't see a significant difference.
I wonder why it is necessary to use tokio::sync::oneshot::Sender?
I tried the standard std::sync::mpsc::Sender and the speed is the same.
This is just a design pattern. With oneshot it's clear at the type level that you are only going to get back a single response. The std's rather complex mpsc optimizes for this "oneshot" usage first, then switches to a different kind of implementation when Senders are cloned or more messages are sent. Later implementations (crossbeam, flume) do not necessarily do this.
Cannot start a runtime from within a runtime. This happens because a function (like block_on) attempted to block the current thread while the thread is being used to drive asynchronous tasks.
This is quite unfortunate. I wonder if somewhere either a block_on or a block_in_place is already being used.
This is quite unfortunate.
Why do we need to use tokio::sync::oneshot if it provides no performance advantage compared with std::sync::mpsc, where there are no such problems?
If I understand correctly, this changes the walredo to:
I think the implementation is becoming quite complicated with the added ring buffer and custom polling. There might be benefits from a tokio-based implementation. I don't really understand how the walredo process can work without any changes with this arrangement; are the default buffer sizes so large and the requests usually so small that this becomes useful?
Right now each thread in the pageserver is working with the walredo process directly, using a mutex for synchronization.
Yes, this PR spawns a separate thread for communication with the walredo process and uses a Rust channel for communication between threads. There is a bounded channel for sending requests, and this is where the buffering actually takes place. Yes, buffering has a critical impact on walredo performance - see #1339 and my table with the dependence of redo speed on the number of buffered requests (almost linear).
That might actually become a problem. We have had problems with large numbers of threads in the past. If you have thousands of tenants, and you launch a separate thread for each tenant, it's too much. Can we replace the thread with a tokio task easily?
I don't think there's any need for the ring buffer anymore, now that this uses a separate oneshot channel for each response. (I mean oneshot channel in how it's used, even if it's actually implemented using std::mpsc.) I'm surprised to see that it's still there. Let's remove it. We could try to rewrite the polling using tokio::select; I actually wrote it like that at first, but switched to the current approach. An even more drastic change would be to replace the pipe with shared memory. But the thing I like about this PR is that it's relatively small. I'd like to get this into a committable shape, keeping it as simple as possible, and experiment with more drastic rewrites later.
pageserver/src/walredo.rs
Outdated
// Ring buffer with channel identifiers of buffered redo requests
ring_buf: [Option<Sender<Bytes>>; N_CHANNELS],
// Head position in ring buffer
ring_buf_pos: usize,
// Number of buffered requests
n_buffered: usize,
Hmm, so the ring buffer is now used just locally in the thread, to keep track of which response received from the walredo process belongs to which request. Let's replace it with std::collections::VecDeque
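A sketch of what the VecDeque-based bookkeeping could look like (the `InFlight` type and its method names are hypothetical, not the PR's actual code); it relies on the walredo process answering requests strictly in order:

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Sender;

// Hypothetical sketch: responses arrive from the walredo process in the same
// order the requests were written, so a FIFO of response channels is enough.
struct InFlight {
    queue: VecDeque<Sender<Vec<u8>>>,
}

impl InFlight {
    fn new() -> Self {
        InFlight { queue: VecDeque::new() }
    }

    // Called after a request has been written to the process's stdin.
    fn push(&mut self, resp_tx: Sender<Vec<u8>>) {
        self.queue.push_back(resp_tx);
    }

    // Called when a full page has been read from the process's stdout.
    fn complete(&mut self, page: Vec<u8>) -> Result<(), &'static str> {
        match self.queue.pop_front() {
            Some(tx) => {
                // The caller may have given up; ignore a closed channel.
                let _ = tx.send(page);
                Ok(())
            }
            None => Err("unexpected output from walredo process"),
        }
    }
}

fn main() {
    let mut in_flight = InFlight::new();
    let (tx, rx) = std::sync::mpsc::channel();
    in_flight.push(tx);
    in_flight.complete(vec![0u8; 8192]).unwrap();
    assert_eq!(rx.recv().unwrap().len(), 8192);
    // Output with no pending request is reported as an error, not a panic.
    assert!(in_flight.complete(vec![0u8; 8192]).is_err());
}
```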
Done
pageserver/src/walredo.rs
Outdated
self.page_pos += self.stdout.read(&mut self.page[self.page_pos..])?;
if self.page_pos == BLCKSZ as usize {
    assert!(self.n_buffered != 0);
    let sender = self.ring_buf[self.ring_buf_pos].take().unwrap();
This will panic, if the walredo process spontaneously writes to its stdout, when no requests have been sent to it. Let's handle that more gracefully, without panicking.
But what will happen if walredo outputs something to stdout while processing a request?
Pageserver will get some unexpected input and most likely will panic in any case.
So I can just remove this assert and replace it with some checking and error reporting, but IMHO it will not make our system more reliable or foolproof. As far as I understand, all logging performed by Postgres is currently redirected to stderr for the walredo process. So the problem will arise only if there is some printf
in the Postgres code. But then we are in trouble, and the faster we detect it, the less harm it can do. In this sense an assertion is preferable to just reporting an error and maybe restarting the walredo process. As far as we do not permanently monitor logs for the presence of errors, such a failure can be ignored until it causes some critical problem like data corruption.
But what will happen if walredo outputs something to stdout while processing a request?
Pageserver will get some unexpected input and most likely will panic in any case.
We should handle that without panicking, too.
So I can just remove this assert and replace it with some checking and error reporting, but IMHO it will not make our system more reliable or foolproof. As far as I understand, all logging performed by Postgres is currently redirected to stderr for the walredo process. So the problem will arise only if there is some printf in the Postgres code. But then we are in trouble, and the faster we detect it, the less harm it can do. In this sense an assertion is preferable to just reporting an error and maybe restarting the walredo process. As far as we do not permanently monitor logs for the presence of errors, such a failure can be ignored until it causes some critical problem like data corruption.
We should start paying more attention to ERRORs in the logs. We shouldn't need to panic, just to get the attention of developers earlier. It's not cool to crash the whole pageserver, causing an outage for all tenants, when there is a problem with one tenant.
Also keep in mind that the WAL redo process is untrusted. Even if we are 100% sure that we never write to stdout in the postgres code we have, we still need to be prepared for the case that someone writes a malicious WAL record that leads to arbitrary code execution in the WAL redo process, and that code writes to stdout.
Right now the response from the walredo process is just 8192 opaque bytes. Pageserver is not able to check their content in any way. So I was wrong - there will be no panic in the pageserver. It will just send wrong page images to compute, which in some cases may be even worse, because it may cause a crash of compute or incorrect behavior of the client application.
There are two reasons which can cause it:
- Presence of an "illegal" printf in the Postgres code or even in extensions.
- A malicious WAL record which causes walredo to do wrong things, including writing to stdout.

The probability of the first case seems to be much higher. We cannot protect the user from shooting himself in the foot, but we should not allow it to affect other clients. And it would be nice to protect the user from incorrect behavior caused by our or any third party's bugs (e.g. we allow using the postgis extension, and somewhere inside its thousands of lines of code there is some printf). So maybe we should add some response header for walredo (with length and maybe a CRC) so that the pageserver can check them?
In any case it seems to be a separate PR.
I can remove the assert or replace it with error reporting, but IMHO it is not enough to address this problem.
There are two reasons which can cause it:
- Presence of an "illegal" printf in the Postgres code or even in extensions.
- A malicious WAL record which causes walredo to do wrong things, including writing to stdout.

The probability of the first case seems to be much higher. We cannot protect the user from shooting himself in the foot, but we should not allow it to affect other clients. And it would be nice to protect the user from incorrect behavior caused by our or any third party's bugs (e.g. we allow using the postgis extension, and somewhere inside its thousands of lines of code there is some printf). So maybe we should add some response header for walredo (with length and maybe a CRC) so that the pageserver can check them?
+1. I think a simple "request ID" would do. In the request, send a unique request ID, e.g. just using a u64 counter. In the wal redo process, include the request ID in the response, and in the pageserver, check that the ID in the response matches what you expected.
In any case it seems to be separate PR.
Agreed.
I can remove the assert or replace it with error reporting, but IMHO it is not enough to address this problem.
Yeah, please replace the assert with something here. It solves the "one tenant can cause an outage for others" problem, but not the other problems. That's good enough for this PR.
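A hypothetical sketch of that request-ID check on the pageserver side, assuming an 8-byte big-endian ID prefixed to each response (the framing is invented here for illustration, not the actual protocol):

```rust
// Hypothetical framing sketch: each request is tagged with a u64 taken from a
// monotonically increasing counter, and the walredo process is expected to
// echo that ID back in the first 8 bytes of the response.
fn check_response(expected_id: u64, response: &[u8]) -> Result<&[u8], String> {
    if response.len() < 8 {
        return Err(String::from("short response from walredo process"));
    }
    let mut id_bytes = [0u8; 8];
    id_bytes.copy_from_slice(&response[..8]);
    let got = u64::from_be_bytes(id_bytes);
    if got != expected_id {
        return Err(format!(
            "response ID mismatch: expected {}, got {}",
            expected_id, got
        ));
    }
    // The rest of the response is the page image.
    Ok(&response[8..])
}

fn main() {
    let mut response = 42u64.to_be_bytes().to_vec();
    response.extend_from_slice(&[1, 2, 3]);
    assert_eq!(check_response(42, &response).unwrap().to_vec(), vec![1u8, 2, 3]);
    // A mismatched or short response is an error, not a panic.
    assert!(check_response(7, &response).is_err());
    assert!(check_response(42, &[0u8; 4]).is_err());
}
```

This keeps a confused or malicious walredo process from silently pairing the wrong page with a request, while only reporting an error instead of crashing the pageserver.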
Done
Just want to take back my comment concerning printfs in extensions: extensions are not involved in WAL replaying, so they cannot output anything to stdout.
With PG15 extensions are involved with WAL redo, see https://www.postgresql.org/docs/current/custom-rmgr.html
I don't think so. We would essentially be running another event loop within the event loop, and it would be blocking. Moving this to any of the existing blocking pools would only move the issue (now instead of unnamed extra threads we have larger blocking pools, which might become full since these are long-running). I'd suggest tolerating the new thread here until we rewrite it away from the custom poll implementation.
I am trying to come up with a criterion-based benchmark for this subsystem.
Hmm. I guess the right way to do it would be to call spawn_blocking() before you do the poll(). Then each iteration would call spawn_blocking() though, and I'm not sure how big the overhead of that is. This is pretty performance sensitive. Another approach would be to have a separate tokio runtime for these tasks. Then the blocking calls wouldn't disrupt other parts of the system.
I am quite worried about having thousands of threads. We've been burned by that before: #1719.
I do not see a big problem with spawning a tokio task rather than an OS thread for walredo. But I am concerned about two things:
Frankly speaking, this sync-async mash is horrible :( 30 years ago in Solaris there were normal threads with flexible mapping between user and system threads. Such lightweight threads are also provided by Go and Java. And here we still have to deal with this awful async/await and use prints for debugging.
panic!(
    "wal-redo-postgres: {}",
    String::from_utf8_lossy(&errbuf[0..n])
);
Noted while trying to come up with a benchmark for this (accidental wrong buffer tag) that this used to be only an error!. Similar note on the other instance.
Thank you, fixed.
After more thinking, I also came to the conclusion that changing the wal-redo thread to a tokio task will be non-trivial. But in any case, I think that we should create a separate PR for it.
(force-pushed from 20dd649 to 74e77d8)
Quick status sync on this: let's finalize this PR. The only big blocker is the addition of one thread per timeline. That's not acceptable, because we have a hard limit of 4096 threads (#1609), after which you get a panic. With one thread per timeline, we get uncomfortably close to that limit. @koivunej, you're working on replacing the thread with a tokio task in this PR, right?
Apologies for the late reply, but yes, I am working on that. I'll open PRs today.
Another solution (maybe more complex) is to have a global pool of walredo processes with dedicated threads for interaction with them. The number of threads and processes may be different, so a 1-to-1 relation is not required. We already have a PR with a pool of walredo processes which can improve performance in the case of seqscan or background pageserver activities. But the main problem here is that if each tenant spawns multiple walredo processes, then the total number of processes can become very large and cause exhaustion of some system resources. Having a pool of walredo processes, we can guarantee that this number is limited. Certainly it would be great if we could reuse a walredo process to serve multiple tenants, because it is actually stateless. But it is insecure. So we should either find some way to efficiently reset all process memory and local storage (which seems to be impossible), or start and stop processes on demand. Maybe there is some way to efficiently "clone" a process to speed up its start.
Given other improvements in the walredo area, is it still relevant?
See #1339.
Buffering walredo requests has a critical impact on performance.
This PR tried to do it using Rust channels.
It reduces Ketteq Q1 execution time from 30 to 17 seconds.
Unfortunately the average number of buffered requests with 6 parallel workers is only 2.5.
10 seconds is the time of Q1 execution without page reconstruction (with the latest image layers).
So the remaining 7 seconds is walredo time with 2.5 buffered requests.
Not sure we can provide a higher level of buffering. The maximal number of buffered requests is 7 (6 parallel workers + master),
so channels really are able to provide request buffering. But the average number is just 2.5.