Tokio based walredo #2875
Conversation
Force-pushed from 5d8b4dd to 6615cd7
I tested it on my EC2 host with Ketteq data. Instead of 17 seconds for Q1 I get 22. IMHO it is quite a noticeable slowdown.
That is in line with the microbenchmark... Thanks for running that! I don't think this path is viable anymore.
From what I can tell, the biggest slowdown comes from the read returning EAGAIN, then epoll, then retrying the read (which still takes microseconds), and then from cross-thread communication with the channels. Switching channel implementations doesn't seem to make a difference; flume is as fast as tokio's for these purposes.
Ok, I've finally understood what affects this, thanks to a bit of commuting and working on the laptop, which has fewer cores. The perf difference to
Periodic updates:
This turned out to be difficult, since cargo never appends
The best workaround for the blocking-in-async so far is to just minimize the locks needed altogether with more single-threaded approaches. For example, if two things can't run concurrently, we could just only ever launch one of them. We could also handle this with very few locks, or none at all, by having an async task per tenant.
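As an illustration of that last idea, here is a minimal, hypothetical sketch of a per-tenant task owning its state behind a channel. The names TenantState, TenantCommand and spawn_tenant_task are made up for this example and are not from the branch:

```rust
use tokio::sync::{mpsc, oneshot};

// Hypothetical tenant state; stands in for whatever the real struct owns.
struct TenantState;

impl TenantState {
    async fn compact(&mut self) -> anyhow::Result<()> { Ok(()) }
    async fn gc(&mut self) -> anyhow::Result<()> { Ok(()) }
}

// Hypothetical command set; the real operations would differ.
enum TenantCommand {
    Compact { done: oneshot::Sender<anyhow::Result<()>> },
    Gc { done: oneshot::Sender<anyhow::Result<()>> },
}

// One task per tenant owns the state exclusively, so mutations need no
// locks; callers send a command and await the reply instead. Because
// commands are handled one at a time, e.g. compaction and GC can never
// run concurrently for the same tenant.
fn spawn_tenant_task(mut state: TenantState) -> mpsc::Sender<TenantCommand> {
    let (tx, mut rx) = mpsc::channel(16);
    tokio::spawn(async move {
        while let Some(cmd) = rx.recv().await {
            match cmd {
                TenantCommand::Compact { done } => {
                    let _ = done.send(state.compact().await);
                }
                TenantCommand::Gc { done } => {
                    let _ = done.send(state.gc().await);
                }
            }
        }
    });
    tx
}
```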
Force-pushed from 73b9d73 to 44d5b4f
Rebased on top of main, cleaned up some of the commits. Let's see how the benchmarks look now.
Now performance is much better: 16.5 sec for Q1 with main and 14.5 with Heikki's parallel requests patch.
Thanks @knizhnik for running the bench. I am actually a bit surprised that it is now that much faster. I'll perhaps hook up the multiprocess pooling next, though I am annoyed by the test failures I didn't get before the rebase. Possible reason for it being faster: the version you ran (cb17a43) had 1 MB pipes. I'll ping you once I get the tests passing.
I missed responding to the point about it being faster this last week; I would never expect the async version to ever be faster than the owning-thread version, at least not the epoll-based one. As noted, tokio, epoll, and the large thread count do bring their own challenges, which I haven't been able to understand. This however affects only the microbenchmarked single-request case, none of the pipelined ones. Update on the failures: I still have test failures locally (2 real + 2 backward/forward compat).
Force-pushed from cb17a43 to 63ed162
Need to wait for the allure report (probably after benches) for:
The first one seems like a "port not free" transient issue. Luckily the debug regress tests passed. The aggressive gc one doesn't reproduce locally but looks very interesting: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2875/release/3563889915/index.html#suites/1352491f56c5af21f175ec2ac55cea0f/7eb18ea4e7757a9a/?attachment=1627478f048e3371
Force-pushed from 4447d96 to c3a83aa
Trying the benchmarks with prefetch on by default on the next rebase.
Force-pushed from c3a83aa to e2a0214
Took an initial look at walredo.rs only. Didn't look at anything past stderr_task.
Generally, this looks pretty great.
Apart from the various comments, I think what's deeply needed here is a block comment that explains how the std{in,out} tasks work together.
With a separate section on cancellation.
And another section on what happens if the walredo process crashes.
TEMP_FILE_SUFFIX,
);

match tokio::fs::remove_dir_all(&datadir).await {
I think we should have a cleanup pass during tenant load/attach that removes all wal-redo-datadirs.
Then change the code here to simply do mkdir without -p.
(All banking on the assumption that nth_tenant_proces will never overflow.)
No directly obvious advantage but it reduces the chance of messing up in the future.
"Messing up" meaning deleting a live walredo data directory.
.arg("-N") | ||
.env_clear() | ||
.env("LD_LIBRARY_PATH", &pg_lib_dir_path) | ||
.env("DYLD_LIBRARY_PATH", &pg_lib_dir_path) |
Is dropping the // macOS comment intentional?
.env("LD_LIBRARY_PATH", &pg_lib_dir_path) | ||
.env("DYLD_LIBRARY_PATH", &pg_lib_dir_path) | ||
.env("PGDATA", &datadir) | ||
// The redo process is not trusted, and runs in seccomp mode that |
Is that comment not true / relevant anymore?
Good catch, I should go over these lost comments.
// stdout is used to communicate the resulting page
let stdout = child.stdout.take().expect("not taken yet");

#[cfg(not_needed)]
drop it, then
{
    loop {
        let read = stdout
            .read_buf(buffer)
How does this know to stop reading after we have read 8192 bytes into the buffer? Especially if there's more than one iteration of the loop?
The docs are a bit vague for read_buf, but the implementation is quite clear: it's one read. My understanding is that it will go to sleep until there's a readiness event, then try to read into the remaining vec. Because of how the response protocol goes, that is the one 8192 (or more, if unlucky/stalled) byte read.
To be honest I haven't tested any interesting configurations for this, for example a 7 kB pipe size, but I can't see how it could go wrong.
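For reference, a self-contained sketch of the kind of read loop being discussed (read_page and PAGE_SZ are made-up names; this is not the PR's code):

```rust
use bytes::BytesMut;
use tokio::io::{AsyncRead, AsyncReadExt};

// Illustrative only: read one 8192-byte page from the child's stdout.
// Each read_buf call performs a single read once the fd is ready and
// appends to the buffer; the loop condition is what stops us once the
// page is complete, even if the data arrives in several short reads.
async fn read_page<R: AsyncRead + Unpin>(stdout: &mut R) -> anyhow::Result<BytesMut> {
    const PAGE_SZ: usize = 8192;
    let mut buffer = BytesMut::with_capacity(PAGE_SZ);
    while buffer.len() < PAGE_SZ {
        let read = stdout.read_buf(&mut buffer).await?;
        if read == 0 {
            anyhow::bail!("walredo stdout closed before a full page was read");
        }
    }
    Ok(buffer)
}
```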
.map_err(anyhow::Error::new)?;
}
// in general flush is not needed, does nothing on pipes, but since we now accept
// AsyncWrite
... now what?
Now I included the flush here, so you could test this with anything, like a tcpstream. But with pipes, that flush does nothing.
// the request after it has been moved. this could be worked around by making the oneshot
// into a normal mpsc queue of size 2 and making the read side race against the channel
// closing and timeouted read.
match write_res.await.and_then(|x| x) {
What is the deal with this `.and_then(|x| x)`?
I think it wouldn't hurt to have a preceding line like
let write_res = write_res.await.and_then(|x| x); // explain why
I use .and_then(|x| x) for Result::flatten (still unstable, unfortunately). So the timeout returns Result<Inner, tokio::time::Elapsed>, which I map_err into anyhow::Result<Inner>, but the Inner is anyhow::Result<()>, so flattening gives me one anyhow::Result<()>.
Ok. To me this is more of a readability question. Without inlay hints from rust analyzer, the typing wasn't straightforward to me.
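For readers following along, a small standalone sketch of the flattening described above (write_with_timeout is a made-up name; only the combinator chain mirrors the discussion):

```rust
use std::time::Duration;
use tokio::time::timeout;

// timeout() returns Result<T, Elapsed>; here T is itself an
// anyhow::Result<()>, so after mapping the Elapsed error we hold a
// nested Result, and .and_then(|x| x) collapses it (Result::flatten
// is still unstable).
async fn write_with_timeout(
    write: impl std::future::Future<Output = anyhow::Result<()>>,
) -> anyhow::Result<()> {
    timeout(Duration::from_secs(1), write)
        .await
        .map_err(anyhow::Error::new) // Elapsed -> anyhow::Error
        .and_then(|x| x)             // flatten the nested Result
}
```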
let work_timeout = Duration::from_secs(1);

loop {
    let ((request, response), reservation) = {
Let's call it response_tx?
Yes, this would make sense.
// wait the write to complete before sending the completion over, because we cannot fail
// the request after it has been moved. this could be worked around by making the oneshot
// into a normal mpsc queue of size 2 and making the read side race against the channel
// closing and timeouted read.
I don't get this comment, probably because I'm not as deep into it as you are. To me it's plain obvious that we need to .await the write. If we don't .await it here, the walredo process will simply never see it. This is the first time we poll that future.
Quickly reading through this, I'm probably confusing request and response[_tx] here.
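The underlying point is just that Rust futures are lazy; a tiny illustrative snippet, not taken from the PR:

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};

// Constructing the write does nothing by itself; the bytes reach the
// other end of the pipe only once the future is polled, i.e. awaited.
async fn send_request(stdin: &mut (impl AsyncWrite + Unpin)) -> std::io::Result<()> {
    let write = stdin.write_all(b"request bytes");
    // nothing has been written yet at this point
    write.await
}
```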
if inter_tx.is_closed() {
    drop(response.send(Err(anyhow::anyhow!(
        "Failed to read walredo response: stdout task closed already"
    ))));
Shouldn't we retry this request on a new process, then? Same question for the Err(io_or_timeout) case below.
That's a good point. I already wrote it more than once, but I guess I deemed it unlikely, given that the requests that would most need a retry are the already pipelined ones, for which the request has already been dropped.
But yeah, thinking about this again, I can't see why not save what we can.
I thought we straight-out must not fail legitimate GetPage requests with a permanent error, can we? To compute, that would look like an IO error, wouldn't it?
adds pageserver/fixtures for the page image.
also reuse threads between runs for less noise in profiling.
less warnings.
it seems that we get much more stable values and less warnings out of criterion this way.
reimplements the older poll and std::process::Command based implementation to work on tokio primitives and latch on to task_mgr lifecycle management. there is a minor gain in performance with the pipelined requests, which is partly negated by tokio just being slower than the blocking poll. the implementation will get faster with future tokio enhancements, such as vectored writes.
now it should be the hierarchy:
- walredo ctrl per tenant (tenant_id)
  - walredo per process (pid)
    - stdin_task
    - stdout_task
    - stderr_task
first step in multi process controller. this implementation scales down to zero in X times the outer queue read timeout in stdin_task, it is probably too fast.
integrate the process controller.
Force-pushed from e2a0214 to 56c7d15
This PR is not viable, at least for now: while it does look good on the OLAP benches, OLTP has the 7-12 usec to 24 usec penalty (in microbenchmark
Deleting branch on
Tested this after c1731bc landed on main. Even on the COMPUTE_REQUEST runtime, I'm getting similar or perhaps a bit better perf than main with the OLTP test, but not that much faster. Performance for OLAP/prefetched would be better. Performance for, for example, image layer creation would most likely be slower, because there's a cross-runtime penalty, whereas on COMPUTE_REQUEST the page requests can benefit from the LIFO slot optimization.
Separated from #2875. The microbenchmark has been validated to show a similar difference to the larger scale OLTP benchmark.
Re-implements the walredo process handling as a tokio-managed process and task. I think the implementation is at feature parity, but not all of these are easy to test:
Timeout is now handled as "from the write starting to the read completing", which might not be correct with the request pipelining.
#1700 is not handled, as in requests are not retried. That could be implemented.
An interesting deadlock was discovered because of blocking code elsewhere: #2975. This was solved by adding a new runtime.
While working on this, I noticed that tokio does not yet support vectored writes for child processes: tokio-rs/tokio#5216 -- the implementation will benefit from upgrading to a future tokio version that has this.
This branch currently has primitive implementations of a scale-up multiprocess walredo controller and a rebase of #2880 (which hopefully didn't go too wrong). These are only up for discussion, as they share much of the code with "scale down".
Scaling walredo "down to zero" seemed like an interesting one to implement; I don't know if it's needed or what a sensible timeout for it would be. That could be moved to a follow-up PR as well.
Cc: #2778.