
Tokio based walredo #2875

Closed
wants to merge 12 commits into from

Conversation

koivunej
Contributor

@koivunej koivunej commented Nov 21, 2022

Re-implements the walredo process handling as a tokio-managed process and task. I think the implementation is at feature parity, but not all of these are easy to test:

  • lazy start (without it, many tests will fail)
  • kill and relaunch the process if any request fails (untested)
    • this only happens on timeouts or I/O errors, as with the previous version
    • no checking for zero pages

Timeout is now handled as "from the write starting to the read completing", which might not be correct with the request pipelining.
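
For illustration, here is a minimal sketch of that timeout scope (not the PR's actual code; the function, the 1-second value, and the fixed 8192-byte page buffer are assumptions for the example): one tokio::time::timeout covers both the write and the read, so a pipelined request queued behind a slow one shares the same clock.

```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

/// Hypothetical single-request exchange: the timeout spans the whole
/// write-then-read, which is the scope described above.
async fn apply_one(
    stdin: &mut (impl AsyncWrite + Unpin),
    stdout: &mut (impl AsyncRead + Unpin),
    request: &[u8],
    page: &mut [u8; 8192],
) -> anyhow::Result<()> {
    tokio::time::timeout(Duration::from_secs(1), async {
        stdin.write_all(request).await?; // serialized walredo request
        stdout.read_exact(page).await?;  // resulting page image
        Ok::<_, std::io::Error>(())
    })
    .await??; // outer Err: timeout elapsed, inner Err: I/O error
    Ok(())
}
```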

#1700 is not handled, as in requests are not retried. That could be implemented.

An interesting deadlock was discovered because of blocking code elsewhere: #2975. This was solved by adding a new runtime.

While working on this, I noticed that tokio does not yet support vectored writes for child processes: tokio-rs/tokio#5216 -- the implementation will benefit from upgrading to this future tokio version.
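
Until vectored writes land, one possible interim shape (an assumption about the approach, not code taken from this branch) is to coalesce the request header and page body into one contiguous buffer before a single write_all:

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};

/// Illustrative fallback while the child's stdin does not forward vectored
/// writes: copy header and body into one buffer so the request goes out in
/// a single write instead of two.
async fn send_request(
    stdin: &mut (impl AsyncWrite + Unpin),
    header: &[u8],
    body: &[u8],
) -> std::io::Result<()> {
    let mut buf = Vec::with_capacity(header.len() + body.len());
    buf.extend_from_slice(header);
    buf.extend_from_slice(body);
    stdin.write_all(&buf).await
}
```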

This branch currently has primitive implementations of a scale-up multiprocess walredo controller and a rebase of #2880 (which hopefully didn't go too wrong). These are only up for discussion, as they share much of the code with "scale down".

"Scale walredo" down to zero seemed like an interesting one to implement, I don't know if it's needed or what would be a sensible timeout for it. That could be moved to a follow-up PR as well.

Cc: #2778.

@koivunej koivunej requested review from a team as code owners November 21, 2022 12:59
@koivunej koivunej added the run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label label Nov 21, 2022

@koivunej koivunej marked this pull request as draft November 21, 2022 13:43
Contributor

@knizhnik knizhnik left a comment

I tested it on my EC2 host with Ketteq data.
Instead of 17 seconds for Q1 I get 22.
IMHO that is quite a noticeable slowdown.

@koivunej
Contributor Author

That is in line with the microbenchmark... Thanks for running that! I don't think this path is viable anymore.

@koivunej
Contributor Author

koivunej commented Nov 22, 2022

From what I can tell, the biggest slowdown comes from the read returning EAGAIN, then epoll, then retrying the read (and it still takes microseconds), and from cross-thread communication over the channels. Switching channel implementations doesn't seem to make a difference; flume is as fast as tokio's for these purposes.

@koivunej
Contributor Author

Ok, I've finally understood what affects this, thanks to a bit of a commute and working on the laptop, which has fewer cores.

The perf difference to walredo-channel from fa73b8e is explained by cross-thread communication, as by default pageserver::task_mgr::BACKGROUND_RUNTIME is used, which is created with 24 workers on my workstation. Lowering that to 1 thread makes async-walredo (or a version I'll next clean up and push) as fast as or faster than walredo-channel. This of course does not cover the larger-scale performance difference, but I'll try ways around that next, maybe starting with flume channels again.
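
For context, a minimal sketch (an assumption for illustration, not the branch's actual configuration) of what pinning walredo onto a small dedicated runtime, instead of the 24-worker BACKGROUND_RUNTIME, could look like:

```rust
/// Hypothetical dedicated runtime for walredo; fewer workers means fewer
/// cross-thread hand-offs for the request/response channels.
fn build_walredo_runtime() -> std::io::Result<tokio::runtime::Runtime> {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)
        .thread_name("walredo")
        .enable_all()
        .build()
}
```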

@koivunej
Contributor Author

Periodic updates:

so I will need to add --cfg tokio_unstable to the build

This turned out to be difficult, since cargo never appends RUSTFLAGS, it just overrides them. So very likely the project setting would get overridden by the user setting (use mold), and that might finally be overridden by the RUSTFLAGS environment variable. Going ahead with sprinkling tokio::task::block_in_place around the many functions where we take blocking locks and/or do blocking IO. I don't think this is a good idea either, because we could starve the runtime(s) this way too. The tradeoff of disabling the LIFO slot would have been performance, while still leaving the door open to yet-unknown kinds of starvation following from "one should not block in async context".

The best workaround for the blocking-in-async so far is to just minimize the locks needed altogether through more single-threaded approaches. For example, if two things can't run concurrently, we could only ever launch one of them. We could also handle this with very few locks, or none at all, by having an async task per tenant.
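
For reference, a minimal sketch of the block_in_place sprinkling mentioned above (the shared state and function here are made up for illustration):

```rust
use std::sync::Mutex;

/// Illustrative blocking section wrapped in block_in_place so the runtime
/// knows this worker may block. Only valid on the multi-threaded runtime,
/// and the worker is still occupied for the duration, which is why this is
/// not a great fix either.
fn append_bytes(shared: &Mutex<Vec<u8>>, bytes: &[u8]) {
    tokio::task::block_in_place(|| {
        let mut guard = shared.lock().unwrap(); // blocking std mutex
        guard.extend_from_slice(bytes);
    });
}
```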

@koivunej koivunej force-pushed the async_walredo branch 2 times, most recently from 73b9d73 to 44d5b4f Compare November 25, 2022 11:56
@koivunej
Contributor Author

Rebased on top of main, cleaned up some of the commits. Let's see how the benchmarks look now.

@knizhnik
Contributor

knizhnik commented Nov 25, 2022

Now performance is much better: 16.5 sec for Q1 with main and 14.5 with Heikki's parallel requests patch.
So it provides the same performance as the sync version.
Nice, but it is a pity that it is not faster ;)

@koivunej
Contributor Author

koivunej commented Nov 25, 2022

Thanks @knizhnik for running the bench. I am actually a bit surprised that it is now that much faster. I'll perhaps hook up the multiprocess pooling next, though I am annoyed by the test failures I didn't get before the rebase.

Possible reason for it being faster: the version you ran (cb17a43) had 1MB pipes. I'll ping you once I get the tests passing.

@koivunej
Contributor Author

> Now performance is much better: 16.5 sec for Q1 with main and 14.5 with Heikki's parallel requests patch. So it provides the same performance as the sync version. Nice, but it is a pity that it is not faster ;)

I missed responding to the point about being faster this last week. I would never expect the async version to ever be faster than the owning-thread version, at least not the epoll-based one. As noted, tokio, epoll, and the large thread count do bring their own challenges, which I haven't been able to understand. This however only affects the microbenchmarked single-request case, not any of the pipelined ones.

Update on the failures: I still have test failures locally (2 real + 2 backward/forward compat).

@koivunej
Contributor Author

koivunej commented Nov 28, 2022

Need to wait for the allure report (probably after benches) for:

The first one seems like a "port not free" transient issue. Luckily the debug regress tests passed. The aggressive gc one doesn't reproduce locally.

Aggressive gc one looks very interesting: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2875/release/3563889915/index.html#suites/1352491f56c5af21f175ec2ac55cea0f/7eb18ea4e7757a9a/?attachment=1627478f048e3371

@koivunej koivunej added run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label and removed run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label labels Nov 28, 2022

@koivunej koivunej marked this pull request as ready for review December 2, 2022 16:59
@koivunej koivunej added the run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label label Dec 2, 2022
@koivunej
Contributor Author

koivunej commented Dec 2, 2022

Trying the benchmarks with prefetch on by default on the next rebase.

Contributor

@problame problame left a comment

Took an initial look at walredo.rs only. Didn't look at anything past stderr_task.
Generally, this looks pretty great.

Apart from the various comments, I think what's deeply needed here is a block comment that explains how the std{in,out} tasks work together.

With a separate section on cancellation.

And another section of what happens if the walredo process crashes.

TEMP_FILE_SUFFIX,
);

match tokio::fs::remove_dir_all(&datadir).await {
Contributor

I think we should have a cleanup pass during tenant load/attach that removes all wal-redo-datadirs.

Then change the code here to simply do mkdir without -p.
(All banking on the assumption that nth_tenant_proces will never overflow)

No directly obvious advantage but it reduces the chance of messing up in the future.
"Messing up" meaning deleting a live walredo data directory.

.arg("-N")
.env_clear()
.env("LD_LIBRARY_PATH", &pg_lib_dir_path)
.env("DYLD_LIBRARY_PATH", &pg_lib_dir_path)
Contributor

Is dropping the // macOS comment intentional?

.env("LD_LIBRARY_PATH", &pg_lib_dir_path)
.env("DYLD_LIBRARY_PATH", &pg_lib_dir_path)
.env("PGDATA", &datadir)
// The redo process is not trusted, and runs in seccomp mode that
Contributor

Is that comment not true / relevant anymore?

Contributor Author

Good catch, I should go over these lost comments.

// stdout is used to communicate the resulting page
let stdout = child.stdout.take().expect("not taken yet");

#[cfg(not_needed)]
Contributor

drop it, then

{
loop {
let read = stdout
.read_buf(buffer)
Contributor

How does this know to stop reading after we have read 8192 bytes into the buffer?
Especially if there's more than one iteration of the loop?

Contributor Author

The docs are a bit vague for read_buf, but the implementation is quite clear: "one read". My understanding is that it will go to sleep until there's a readiness event, then try to read into the remaining capacity of the vec. Because of how the response protocol goes, that is the one 8192-byte (or more, if unlucky/stalled) read.

To be honest I haven't tested any interesting configurations for this, for example 7kB pipe size, but I can't see how it could go wrong.
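
For readers following along, a hedged sketch of the loop shape under discussion (illustrative names, not the PR's exact code): read_buf does a single read per readiness event, and the surrounding loop keeps going until the full 8192-byte page has arrived.

```rust
use bytes::BytesMut;
use tokio::io::{AsyncRead, AsyncReadExt};

/// Illustrative page read: each read_buf call fills some of the remaining
/// capacity; the loop stops once a full page has been accumulated.
async fn read_page(stdout: &mut (impl AsyncRead + Unpin)) -> anyhow::Result<BytesMut> {
    let mut buffer = BytesMut::with_capacity(8192);
    while buffer.len() < 8192 {
        let read = stdout.read_buf(&mut buffer).await?;
        if read == 0 {
            anyhow::bail!("walredo stdout closed before a full page was read");
        }
    }
    Ok(buffer)
}
```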

.map_err(anyhow::Error::new)?;
}
// in general flush is not needed, does nothing on pipes, but since we now accept
// AsyncWrite
Contributor

... now what?

Contributor Author

Now I included the flush here, so you could test this with anything, like a TcpStream. But with pipes, that flush does nothing.

// the request after it has been moved. this could be worked around by making the oneshot
// into a normal mpsc queue of size 2 and making the read side race against the channel
// closing and timeouted read.
match write_res.await.and_then(|x| x) {
Contributor

What is the deal with this `.and_then(|x| x)`?

I think it wouldn't hurt to have a preceding line like

let write_res = write_res.await.and_then(|x| x); // explain why

Contributor Author

I use .and_then(|x| x) for Result::flatten (still unstable, unfortunately).

So the timeout returns Result<Inner, tokio::time::Elapsed> which I map_err into anyhow::Result<Inner>, but the Inner is anyhow::Result<()>, so flattening gives me one anyhow::Result<()>.
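
To make the typing explicit, a standalone illustration (not the PR's code; the trivial inner future is an assumption for the example):

```rust
use std::time::Duration;

/// timeout(..).await yields Result<anyhow::Result<()>, Elapsed>; map_err turns
/// Elapsed into anyhow::Error, and .and_then(|x| x) flattens the nested Result
/// in place of the still-unstable Result::flatten.
async fn flattened_write_result() -> anyhow::Result<()> {
    let write_fut = async { Ok::<(), anyhow::Error>(()) };
    tokio::time::timeout(Duration::from_secs(1), write_fut)
        .await
        .map_err(anyhow::Error::new)
        .and_then(|x| x)
}
```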

Contributor

Ok. To me this is more of a readability question. Without inlay hints from rust analyzer, the typing wasn't straightforward to me.

let work_timeout = Duration::from_secs(1);

loop {
let ((request, response), reservation) = {
Contributor

let's call it response_tx ?

Contributor Author

Yes, this would make sense.

// wait the write to complete before sending the completion over, because we cannot fail
// the request after it has been moved. this could be worked around by making the oneshot
// into a normal mpsc queue of size 2 and making the read side race against the channel
// closing and timeouted read.
Contributor

I don't get this comment, probably because I'm not as deep into it as you are.
To me it's plain obvious that we need to .await the write.
If we don't .await it here, the walredo process will simply never see it.
This is the first time we poll that future.

Contributor Author

Quickly reading through this, I'm probably confusing request and response[_tx] here.

if inter_tx.is_closed() {
drop(response.send(Err(anyhow::anyhow!(
"Failed to read walredo response: stdout task closed already"
))));
Contributor

Shouldn't we retry this request on a new process, then? Same question for the Err(io_or_timeout) case below.

Contributor Author

That's a good point. I already wrote it more than once, but I guess I deemed it unlikely, given that the requests that would most need it are the already pipelined ones, for which the request has already been dropped.

But yeah, thinking about this again, I can't see why we wouldn't save what we can.

Contributor

I thought we straight-out must not fail legitimate GetPage requests with a permanent error, or can we? To compute, that would look like an IO error, wouldn't it?

koivunej and others added 12 commits December 7, 2022 17:07
adds pageserver/fixtures for the page image.
also reuse threads between runs for less noise in profiling.
less warnings.
it seems that we get much more stable values and less warnings out of
criterion this way.
reimplements the older poll and std::process::Command based
implementation to work on tokio primitives and latch on to task_mgr
lifecycle management.

there is a minor gain in performance with the pipelined requests, which
is partly negated by tokio just being slower than the blocking poll. the
implementation will get faster with future tokio enhancements, such as
vectored writes.
now it should be the hierarchy:

- walredo ctrl per tenant (tenant_id)
  - walredo per process (pid)
    - stdin_task
    - stdout_task
    - stderr_task
first step in multi process controller. this implementation scales down
to zero in X times the outer queue read timeout in stdin_task,
it is probably too fast.
integrate the process controller.
@koivunej
Contributor Author

This PR is not viable, at least for now: while it does look good on the OLAP benches, OLTP takes a penalty from 7-12 usec to 24 usec (in the microbenchmark short/short/1), which is too much. I haven't been able to understand the root cause of this slowdown, but I assume it's "threads". So closing and focusing on #2947.

@koivunej koivunej closed this Dec 12, 2022
@koivunej
Contributor Author

Deleting branch on neondatabase/neon, archiving in koivunej/neon: https://github.com/koivunej/neon/tree/async_walredo

@koivunej koivunej deleted the async_walredo branch December 12, 2022 14:23
@koivunej
Contributor Author

Tested this after c1731bc landed on main. Even on the COMPUTE_REQUEST runtime, I'm getting similar or perhaps a bit better perf than main with the OLTP test, but not that much faster. Performance for OLAP/prefetched would be better. Performance for, for example, image layer creation would most likely be slower, because there's a cross-runtime penalty, whereas on COMPUTE_REQUEST the page requests can benefit from the LIFO slot optimization.

koivunej added a commit that referenced this pull request Jan 16, 2023
Separated from #2875.

The microbenchmark has been validated to show a similar difference to the larger-scale OLTP benchmark.