
Tokio based walredo #2875

Closed
wants to merge 12 commits into from

Conversation

koivunej
Contributor

@koivunej koivunej commented Nov 21, 2022

Re-implements the walredo process handling as a tokio-managed process and task. I think the implementation is at feature parity, but not all of these are easy to test:

  • lazy start (without it, many tests will fail)
  • kill and relaunch the process if any request fails (untested)
    • this only happens on timeouts or I/O errors, as with the previous version
    • no checking for zero pages

Timeout is now handled as "from the write starting to the read completing", which might not be correct with the request pipelining.
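
For illustration, here is a minimal sketch of that timeout scope (not the PR's actual code; the function, the 1-second value, and the fixed 8192-byte page buffer are assumptions for the example): one tokio::time::timeout covers both the write and the read, so a pipelined request queued behind a slow one shares the same clock.

```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

/// Hypothetical single-request exchange: the timeout spans the whole
/// write-then-read, which is the scope described above.
async fn apply_one(
    stdin: &mut (impl AsyncWrite + Unpin),
    stdout: &mut (impl AsyncRead + Unpin),
    request: &[u8],
    page: &mut [u8; 8192],
) -> anyhow::Result<()> {
    tokio::time::timeout(Duration::from_secs(1), async {
        stdin.write_all(request).await?; // serialized walredo request
        stdout.read_exact(page).await?;  // resulting page image
        Ok::<_, std::io::Error>(())
    })
    .await??; // outer Err: timeout elapsed, inner Err: I/O error
    Ok(())
}
```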

#1700 is not handled, as in requests are not retried. That could be implemented.

An interesting deadlock was discovered because of blocking code elsewhere: #2975. This was solved by adding a new runtime.

While working on this, I noticed that tokio does not yet support vectored writes for child processes: tokio-rs/tokio#5216 -- the implementation will benefit from upgrading to this future tokio version.
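
Until vectored writes land, one possible interim shape (an assumption about the approach, not code taken from this branch) is to coalesce the request header and page body into one contiguous buffer before a single write_all:

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};

/// Illustrative fallback while the child's stdin does not forward vectored
/// writes: copy header and body into one buffer so the request goes out in
/// a single write instead of two.
async fn send_request(
    stdin: &mut (impl AsyncWrite + Unpin),
    header: &[u8],
    body: &[u8],
) -> std::io::Result<()> {
    let mut buf = Vec::with_capacity(header.len() + body.len());
    buf.extend_from_slice(header);
    buf.extend_from_slice(body);
    stdin.write_all(&buf).await
}
```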

This branch currently has primitive implementations of a scale-up multiprocess walredo controller and a rebase of #2880 (which hopefully didn't go too wrong). These are only up for discussion, as they share much of the code with "scale down".

"Scale walredo" down to zero seemed like an interesting one to implement, I don't know if it's needed or what would be a sensible timeout for it. That could be moved to a follow-up PR as well.

Cc: #2778.

@koivunej koivunej requested review from a team as code owners November 21, 2022 12:59
@koivunej koivunej added the run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label label Nov 21, 2022

@koivunej koivunej marked this pull request as draft November 21, 2022 13:43
Contributor

@knizhnik knizhnik left a comment

I tested it on my EC2 host with Ketteq data.
Instead of 17 seconds for Q1 I get 22.
IMHO that is quite a noticeable slowdown.

@koivunej
Contributor Author

That is in line with the microbenchmark... Thanks for running that! I don't think this path is viable anymore.

@koivunej
Contributor Author

koivunej commented Nov 22, 2022

From what I can tell, the biggest slowdown comes from the read returning EAGAIN, then epoll, then retrying the read (and it still takes microseconds), and from cross-thread communication over the channels. Switching channel implementations doesn't seem to make a difference; flume is as fast as tokio's for these purposes.

@koivunej
Contributor Author

Ok, I've finally understood what affects this, thanks to a bit of a commute and working on the laptop, which has fewer cores.

The perf difference to walredo-channel from fa73b8e is explained by cross-thread communication, as by default pageserver::task_mgr::BACKGROUND_RUNTIME is used, which is created with 24 workers on my workstation. Lowering that to 1 thread makes async-walredo (or a version I'll next clean up and push) as fast as or faster than walredo-channel. This of course does not cover the larger-scale performance difference, but I'll try ways around that next, maybe starting with flume channels again.
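
For context, a minimal sketch (an assumption for illustration, not the branch's actual configuration) of what pinning walredo onto a small dedicated runtime, instead of the 24-worker BACKGROUND_RUNTIME, could look like:

```rust
/// Hypothetical dedicated runtime for walredo; fewer workers means fewer
/// cross-thread hand-offs for the request/response channels.
fn build_walredo_runtime() -> std::io::Result<tokio::runtime::Runtime> {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)
        .thread_name("walredo")
        .enable_all()
        .build()
}
```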

@koivunej
Contributor Author

Periodic updates:

so I will need to add --cfg tokio_unstable to the build

This turned out to be difficult, since cargo never appends RUSTFLAGS, it just overrides them. So very likely the project setting would get overridden by the user setting (use mold), and that might finally be overridden by the RUSTFLAGS environment variable. Going ahead with sprinkling tokio::task::block_in_place around the many functions where we take blocking locks and/or do blocking IO. I don't think this is a good idea either, because we could starve the runtime(s) this way too. The tradeoff of disabling the LIFO slot would have been performance, while still leaving the door open to yet-unknown kinds of starvation following from "one should not block in async context".

The best workaround for the blocking-in-async so far is to just minimize the locks needed altogether through more single-threaded approaches. For example, if two things can't run concurrently, we could only ever launch one of them. We could also handle this with very few locks, or none at all, by having an async task per tenant.
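
For reference, a minimal sketch of the block_in_place sprinkling mentioned above (the shared state and function here are made up for illustration):

```rust
use std::sync::Mutex;

/// Illustrative blocking section wrapped in block_in_place so the runtime
/// knows this worker may block. Only valid on the multi-threaded runtime,
/// and the worker is still occupied for the duration, which is why this is
/// not a great fix either.
fn append_bytes(shared: &Mutex<Vec<u8>>, bytes: &[u8]) {
    tokio::task::block_in_place(|| {
        let mut guard = shared.lock().unwrap(); // blocking std mutex
        guard.extend_from_slice(bytes);
    });
}
```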

@koivunej koivunej force-pushed the async_walredo branch 2 times, most recently from 73b9d73 to 44d5b4f Compare November 25, 2022 11:56
@koivunej
Contributor Author

Rebased on top of main, cleaned up some of the commits. Let's see how the benchmarks look now.

@knizhnik
Contributor

knizhnik commented Nov 25, 2022

Now performance is much better: 16.5 sec for Q1 with main and 14.5 with Heikki's parallel requests patch.
So it provides the same performance as the sync version.
Nice, but it is a pity that it is not faster ;)

@koivunej
Contributor Author

koivunej commented Nov 25, 2022

Thanks @knizhnik for running the bench. I am actually a bit surprised that it is now that much faster. I'll perhaps hook up the multiprocess pooling next, though I am annoyed by the test failures I didn't get before the rebase.

Possible reason for it being faster: the version you ran (cb17a43) had 1MB pipes. I'll ping you once I get the tests passing.

@koivunej
Contributor Author

> Now performance is much better: 16.5 sec for Q1 with main and 14.5 with Heikki's parallel requests patch. So it provides the same performance as the sync version. Nice, but it is a pity that it is not faster ;)

I missed responding to the point about being faster this last week. I would never expect the async version to ever be faster than the owning-thread version, at least not the epoll-based one. As noted, tokio, epoll, and the large thread count do bring their own challenges, which I haven't been able to understand. This however only affects the microbenchmarked single-request case, not any of the pipelined ones.

Update on the failures: I still have test failures locally (2 real + 2 backward/forward compat).

@koivunej
Contributor Author

koivunej commented Nov 28, 2022

Need to wait for the allure report (probably after benches) for:

The first one seems like a "port not free" transient issue. Luckily the debug regress tests passed. The aggressive gc one doesn't reproduce locally.

Aggressive gc one looks very interesting: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2875/release/3563889915/index.html#suites/1352491f56c5af21f175ec2ac55cea0f/7eb18ea4e7757a9a/?attachment=1627478f048e3371

@koivunej koivunej added run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label and removed run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label labels Nov 28, 2022

@koivunej koivunej marked this pull request as ready for review December 2, 2022 16:59
@koivunej koivunej added the run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label label Dec 2, 2022
@koivunej
Contributor Author

koivunej commented Dec 2, 2022

Trying the benchmarks with prefetch on by default on the next rebase.

Contributor

@problame problame left a comment

Took an initial look at walredo.rs only. Didn't look at anything past stderr_task.
Generally, this looks pretty great.

Apart from the various comments, I think what's deeply needed here is a block comment that explains how the std{in,out} tasks work together.

With a separate section on cancellation.

And another section of what happens if the walredo process crashes.

TEMP_FILE_SUFFIX,
);

match tokio::fs::remove_dir_all(&datadir).await {
Contributor

I think we should have a cleanup pass during tenant load/attach that removes all wal-redo-datadirs.

Then change the code here to simply do mkdir without -p.
(All banking on the assumption that nth_tenant_proces will never overflow)

No directly obvious advantage but it reduces the chance of messing up in the future.
"Messing up" meaning deleting a live walredo data directory.

.arg("-N")
.env_clear()
.env("LD_LIBRARY_PATH", &pg_lib_dir_path)
.env("DYLD_LIBRARY_PATH", &pg_lib_dir_path)
Contributor

Is dropping the // macOS comment intentional?

.env("LD_LIBRARY_PATH", &pg_lib_dir_path)
.env("DYLD_LIBRARY_PATH", &pg_lib_dir_path)
.env("PGDATA", &datadir)
// The redo process is not trusted, and runs in seccomp mode that
Contributor

Is that comment not true / relevant anymore?

Contributor Author

Good catch, I should go over these lost comments.

// stdout is used to communicate the resulting page
let stdout = child.stdout.take().expect("not taken yet");

#[cfg(not_needed)]
Contributor

drop it, then

{
loop {
let read = stdout
.read_buf(buffer)
Contributor

How does this know to stop reading after we have read 8192 bytes into the buffer?
Especially if there's more than one iteration of the loop?

Contributor Author

The docs are a bit vague for read_buf, but the implementation is quite clear: "one read". My understanding is that it will go to sleep until there's a readiness event, then try to read into the remaining capacity of the vec. Because of how the response protocol goes, that is the one 8192-byte (or more, if unlucky/stalled) read.

To be honest I haven't tested any interesting configurations for this, for example 7kB pipe size, but I can't see how it could go wrong.
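
For readers following along, a hedged sketch of the loop shape under discussion (illustrative names, not the PR's exact code): read_buf does a single read per readiness event, and the surrounding loop keeps going until the full 8192-byte page has arrived.

```rust
use bytes::BytesMut;
use tokio::io::{AsyncRead, AsyncReadExt};

/// Illustrative page read: each read_buf call fills some of the remaining
/// capacity; the loop stops once a full page has been accumulated.
async fn read_page(stdout: &mut (impl AsyncRead + Unpin)) -> anyhow::Result<BytesMut> {
    let mut buffer = BytesMut::with_capacity(8192);
    while buffer.len() < 8192 {
        let read = stdout.read_buf(&mut buffer).await?;
        if read == 0 {
            anyhow::bail!("walredo stdout closed before a full page was read");
        }
    }
    Ok(buffer)
}
```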

.map_err(anyhow::Error::new)?;
}
// in general flush is not needed, does nothing on pipes, but since we now accept
// AsyncWrite
Contributor

... now what?

Contributor Author

Now I included the flush here, so you could test this with anything, like a TcpStream. But with pipes, that flush does nothing.

// the request after it has been moved. this could be worked around by making the oneshot
// into a normal mpsc queue of size 2 and making the read side race against the channel
// closing and timeouted read.
match write_res.await.and_then(|x| x) {
Contributor

What is the deal with this `.and_then(|x| x)`?

I think it wouldn't hurt to have a preceding line like

let write_res = write_res.await.and_then(|x| x); // explain why

Contributor Author

I use .and_then(|x| x) for Result::flatten (still unstable, unfortunately).

So the timeout returns Result<Inner, tokio::time::Elapsed> which I map_err into anyhow::Result<Inner>, but the Inner is anyhow::Result<()>, so flattening gives me one anyhow::Result<()>.
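
To make the typing explicit, a standalone illustration (not the PR's code; the trivial inner future is an assumption for the example):

```rust
use std::time::Duration;

/// timeout(..).await yields Result<anyhow::Result<()>, Elapsed>; map_err turns
/// Elapsed into anyhow::Error, and .and_then(|x| x) flattens the nested Result
/// in place of the still-unstable Result::flatten.
async fn flattened_write_result() -> anyhow::Result<()> {
    let write_fut = async { Ok::<(), anyhow::Error>(()) };
    tokio::time::timeout(Duration::from_secs(1), write_fut)
        .await
        .map_err(anyhow::Error::new)
        .and_then(|x| x)
}
```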

Contributor

Ok. To me this is more of a readability question. Without inlay hints from rust analyzer, the typing wasn't straightforward to me.

let work_timeout = Duration::from_secs(1);

loop {
let ((request, response), reservation) = {
Contributor

let's call it response_tx ?

Contributor Author

Yes, this would make sense.

// wait the write to complete before sending the completion over, because we cannot fail
// the request after it has been moved. this could be worked around by making the oneshot
// into a normal mpsc queue of size 2 and making the read side race against the channel
// closing and timeouted read.
Contributor

I don't get this comment, probably because I'm not as deep into it as you are.
To me it's plain obvious that we need to .await the write.
If we don't .await it here, the walredo process will simply never see it.
This is the first time we poll that future.

Contributor Author

Quickly reading through this, I'm probably confusing request and response[_tx] here.

if inter_tx.is_closed() {
drop(response.send(Err(anyhow::anyhow!(
"Failed to read walredo response: stdout task closed already"
))));
Contributor

Shouldn't we retry this request on a new process, then? Same question for the Err(io_or_timeout) case below.

Contributor Author

That's a good point. I already wrote it more than once, but I guess I deemed it unlikely, given that the requests that would most need it are the already pipelined ones, for which the request has already been dropped.

But yeah, thinking about this again, I can't see why we wouldn't save what we can.

Contributor

I thought we straight-out must not fail legitimate GetPage requests with a permanent error, or can we? To compute, that would look like an IO error, wouldn't it?

koivunej and others added 12 commits December 7, 2022 17:07
adds pageserver/fixtures for the page image.
also reuse threads between runs for less noise in profiling.
less warnings.
it seems that we get much more stable values and less warnings out of
criterion this way.
reimplements the older poll and std::process::Command based
implementation to work on tokio primitives and latch on to task_mgr
lifecycle management.

there is a minor gain in performance with the pipelined requests, which
is partly negated by tokio just being slower than the blocking poll. the
implementation will get faster with future tokio enhancements, such as
vectored writes.
now it should be the hierarchy:

- walredo ctrl per tenant (tenant_id)
  - walredo per process (pid)
    - stdin_task
    - stdout_task
    - stderr_task
first step in multi process controller. this implementation scales down
to zero in X times the outer queue read timeout in stdin_task,
it is probably too fast.
integrate the process controller.
@koivunej
Contributor Author

This PR is not viable, at least for now: while it does look good on the OLAP benches, OLTP takes a penalty from 7-12 usec to 24 usec (in the microbenchmark short/short/1), which is too much. I haven't been able to understand the root cause of this slowdown, but I assume it's "threads". So closing and focusing on #2947.

@koivunej koivunej closed this Dec 12, 2022
@koivunej
Contributor Author

Deleting branch on neondatabase/neon, archiving in koivunej/neon: https://github.com/koivunej/neon/tree/async_walredo

@koivunej koivunej deleted the async_walredo branch December 12, 2022 14:23
@koivunej
Contributor Author

Tested this after c1731bc landed on main. Even on the COMPUTE_REQUEST runtime, I'm getting similar or perhaps a bit better perf than main with the OLTP test, but not that much faster. Performance for OLAP/prefetched would be better. Performance for, for example, image layer creation would most likely be slower, because there's a cross-runtime penalty, whereas on COMPUTE_REQUEST the page requests can benefit from the LIFO slot optimization.

koivunej added a commit that referenced this pull request Jan 16, 2023
Separated from #2875.

The microbenchmark has been validated to show a similar difference to the larger-scale OLTP benchmark.