
Epic: shmempipe for WAL redo process communication #3184

Closed · hlinnaka opened this issue Dec 22, 2022 · 14 comments

Labels: c/storage/compute (Component: storage: compute), c/storage (Component: storage), t/Epic (Issue type: Epic)

@hlinnaka (Contributor)

Motivation

Whenever the pageserver needs to replay WAL records, it sends the records to the WAL redo process over a pipe, and the WAL redo process responds over another pipe. The communication over those pipes adds quite a lot of latency. Let's replace the pipes with shared memory, which is faster.

DoD

WAL redo latency is reduced.

Implementation

The idea is to mmap() a piece of shared memory between the pageserver process and the WAL redo process. In that shared memory segment, establish queues for the requests and responses. A ring buffer is the usual way to implement such queues.
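
To make that concrete, here is a minimal sketch of the control words for one such queue, assuming a byte-oriented ring with monotonically increasing positions and a power-of-two capacity; the names and sizes are illustrative, not the layout of any existing PR:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const RING_SIZE: u32 = 64 * 1024; // payload bytes per queue; power of two so u32 wrap-around stays consistent

/// Control words living at the start of the mmap()ed segment, one pair per queue.
/// The data area itself would follow in the same mapping; the request queue and
/// the response queue each get one such header.
#[repr(C)]
struct RingHeader {
    /// Total bytes ever written by the producer; reduced mod RING_SIZE when indexing the data area.
    write_pos: AtomicU32,
    /// Total bytes ever consumed by the consumer.
    read_pos: AtomicU32,
}

impl RingHeader {
    /// Bytes currently sitting in the queue.
    fn used(&self) -> u32 {
        self.write_pos
            .load(Ordering::Acquire)
            .wrapping_sub(self.read_pos.load(Ordering::Acquire))
    }

    /// Free space left for the producer.
    fn free(&self) -> u32 {
        RING_SIZE - self.used()
    }

    /// Producer publishes `n` freshly written payload bytes to the consumer.
    fn advance_write(&self, n: u32) {
        self.write_pos.fetch_add(n, Ordering::Release);
    }
}
```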

In addition to the shared memory segment, there is a notification mechanism to notify the other process that a request or response is waiting in the queue. We can continue using a pipe for that: just send a single byte to wake up the other process. Or eventfd(2) or futex(2), but those are Linux-specific. Perhaps use eventfd(2) with a fallback implementation based on a regular pipe for other platforms. Inter-process notification over any of those mechanisms adds quite a lot of latency, so it is best to busy-wait for a while, and only go to sleep waiting for the notification after that.
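
A rough sketch of that wait strategy, assuming the eventfd variant and the libc crate; the 10 µs spin budget and the has_work callback are made-up placeholders, and a real implementation also has to keep the eventfd counter consistent with the queue state:

```rust
use std::time::{Duration, Instant};

fn wait_for_peer(event_fd: libc::c_int, has_work: impl Fn() -> bool) {
    let spin_budget = Duration::from_micros(10); // arbitrary illustration
    let start = Instant::now();

    // Busy-wait first: the peer usually responds within a few microseconds.
    while start.elapsed() < spin_budget {
        if has_work() {
            return;
        }
        std::hint::spin_loop();
    }

    // Nothing arrived quickly; block until the peer bumps the eventfd counter.
    let mut counter = [0u8; 8];
    unsafe {
        libc::read(event_fd, counter.as_mut_ptr() as *mut libc::c_void, 8);
    }
}
```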

Note that it is not safe to use pthread_mutexes or condition variables for the inter-process notification for security reasons. The pageserver cannot assume that the memory holding the mutex has valid contents.

Requirements

  • Safety: The WAL redo process is untrusted. It can corrupt the shared memory segment or write any malicious contents to it. The queue implementation in the pageserver cannot trust the contents of the shared memory (see the sketch after this list).
  • Testability & Reviewability: To convince ourselves that we don't introduce any undefined behaviour or vulnerabilities in the pageserver, the pageserver code that accesses the shared memory should be as small and isolated as possible. A separate crate would be good, with a clear, safe interface and clear invariants inside that can be inspected and tested.
  • Performance: Microbenchmark results with the Proof-of-Concept are here; the new implementation should achieve those latencies.
  • Handle errors gracefully: if the WAL redo process crashes, writes garbage, or gets stuck and never reads requests, the pageserver should notice, kill the process, and restart it.
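
To illustrate the safety requirement, a sketch (with hypothetical names) of how the pageserver side might treat positions and lengths read from the shared segment: any impossible value is reported as corruption, and the caller kills and restarts the child:

```rust
const RING_SIZE: u32 = 64 * 1024;

#[derive(Debug)]
enum ShmError {
    /// The worker wrote an impossible value; the caller should kill and restart it.
    Corrupted,
}

/// Validate a message length read from shared memory before using it as an index.
fn checked_msg_len(raw_len: u32, bytes_available: u32) -> Result<usize, ShmError> {
    // A length larger than the queue itself, or larger than what the producer has
    // actually published, can only come from a misbehaving worker.
    if raw_len > RING_SIZE || raw_len > bytes_available {
        return Err(ShmError::Corrupted);
    }
    Ok(raw_len as usize)
}
```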

Related work

Initial Proof-of-Concept by Konstantin: #2947
Current work-in-progress PR: #3163
Earlier discussion on WAL redo speed: #1339

@hlinnaka added the t/Epic (Issue type: Epic) label Dec 22, 2022
@vadim2404 added the c/storage (Component: storage) label Dec 22, 2022
@knizhnik (Contributor)

The portable alternative to eventfd/futex is just a pipe or a Unix socket with select. A pipe seems to be the better solution, because we need not worry about ports, and it is what the Postgres latch implementation uses. The main difference is that in Postgres the pipe descriptors are inherited through fork, while here we need a mechanism to send a file descriptor to another process. That is also terribly non-portable, but at least it is supported by Linux, FreeBSD, MacOS and maybe some other Unix dialects.

If you want, I can try to implement such a synchronization primitive.
But as long as we are not going to run Neon on anything other than Linux (are we?), we can use the shmem pipe implementation with eventfd on Linux and the current stdio pipe implementation on other platforms (like MacOS). But that requires significant code duplication (we would need to maintain two different implementations of the communication between pageserver and walredo), which is why I want to avoid it as much as possible.

@hlinnaka (Contributor, Author)

The portable alternative to eventfd/futex is just a pipe or a Unix socket with select. A pipe seems to be the better solution, because we need not worry about ports, and it is what the Postgres latch implementation uses. The main difference is that in Postgres the pipe descriptors are inherited through fork, while here we need a mechanism to send a file descriptor to another process. That is also terribly non-portable, but at least it is supported by Linux, FreeBSD, MacOS and maybe some other Unix dialects.

We can pass down the file descriptor through fork() here as well. In fact I think that would be a better way to pass down the file descriptor for the "shared memory file" too, instead of passing down the filename and re-opening it in the child process. It's a little tricky to get right when forking from a multi-threaded process, but still possible.

If you want, I can try to implement such a synchronization primitive. But as long as we are not going to run Neon on anything other than Linux (are we?), we can use the shmem pipe implementation with eventfd on Linux and the current stdio pipe implementation on other platforms (like MacOS). But that requires significant code duplication (we would need to maintain two different implementations of the communication between pageserver and walredo), which is why I want to avoid it as much as possible.

I don't think it's a lot of duplication. eventfd and a pipe are very similar: you write() to notify the other process, and read() to wait for the signal. I think you just need a small #ifdef or #[cfg(target_os = "linux")] where the file descriptor is created.
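
A sketch of what that platform-specific part could look like, assuming the libc crate; only the creation of the notification fd differs, and both variants are then signalled with write() and waited on with read() (error handling omitted):

```rust
/// Returns (wait_fd, signal_fd): write to signal_fd to notify, read from wait_fd to wait.
#[cfg(target_os = "linux")]
fn create_notify_fds() -> (libc::c_int, libc::c_int) {
    // An eventfd is both ends at once.
    let fd = unsafe { libc::eventfd(0, libc::EFD_CLOEXEC) };
    (fd, fd)
}

#[cfg(not(target_os = "linux"))]
fn create_notify_fds() -> (libc::c_int, libc::c_int) {
    // Fallback: a plain pipe; the read end is waited on, the write end is signalled.
    let mut fds = [0 as libc::c_int; 2];
    unsafe { libc::pipe(fds.as_mut_ptr()) };
    (fds[0], fds[1])
}
```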

@MMeent (Contributor) commented Dec 23, 2022

An idea on reducing memcpy calls in a shmempipe that is only used for redo communication:

The shmempipe implementation has dedicated areas for sending data and for receiving data. Both have a fixed buffer size, resulting in a limited amount of data we can fit in these queues. Currently everything, including full page images, has to pass through those two queues.

We could, however, split the shared memory into 3 main areas, not 2:

  • N page buffers
  • send queue
  • receive queue

Assuming we think it is OK to limit the queue to N in-flight redo requests, we can make Pageserver put the page image into an empty page buffer in shared memory, transfer ownership of that buffer through the send queue, and receive ownership back with the response from the receive queue.
This means that REDO only has to modify a page buffer; it is not required to fill it with a page image received from Pageserver, nor to copy the full 8kB into the response queue, which saves cycles.

For Pageserver, we know we send variable-sized data (but usually at least one 8kB page), and we know we'll (normally) receive a response containing an 8kB block.

#2947 (comment) mentions some interesting numbers:

KetteQ consumes 400k pages in 11.5s. Presumably, our storage uses DDR4-2933. According to this Wikipedia article, that has a peak data transfer rate of 23,466 MB/sec, or 23.466 GB/sec (note: the linked comment mentions 96 Gbps, which feels wrong). 400k pages * 8 kiB * 4 copy operations (image -> redo pipe -> buffer -> response pipe -> return value) = 13.1 GB. At DDR4-2933 transfer rates, these memcpy operations alone would take 0.558 seconds (not the 30 µs mentioned in the comment), or 4.8% of the total time. Utilizing a shared area for the full-page images that are transferred through the shmempipe, we reduce the number of memcopies by at least half - and with it, presumably, the time spent in WAL redo by a similar, if not larger, fraction.
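
For reference, the arithmetic behind that 0.558 s estimate, using the same numbers as above:

```rust
fn main() {
    let bytes = 400_000u64 * 8 * 1024 * 4; // pages × 8 KiB × 4 copy operations
    let seconds = bytes as f64 / 23.466e9; // DDR4-2933 peak: 23.466 GB/sec
    println!("{:.1} GB copied, ~{:.2} s at peak bandwidth", bytes as f64 / 1e9, seconds);
    // prints: 13.1 GB copied, ~0.56 s at peak bandwidth (the ~0.558 s above)
}
```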

@knizhnik (Contributor)

As far as I understand, there are 4 memory channels, and this is why 23 GB/sec needs to be multiplied by 4, which gives 96 GB/sec:
https://www.google.com/search?channel=fs&client=ubuntu&q=DDR4+2933+bandwidth
Still, it seems to be a small percentage of the total time.

I already wrote that my main argument against this proposal is that the walredo process is not using shared buffers: it uses local buffers to avoid synchronization overhead. The effect is also not so large - I didn't notice a big performance improvement after this optimization. But I am not sure that copying 8kB is significantly less expensive than the synchronization cost.

@MMeent (Contributor) commented Dec 23, 2022

I already wrote that my main argument against this proposal is that the walredo process is not using shared buffers: it uses local buffers to avoid synchronization overhead. The effect is also not so large - I didn't notice a big performance improvement after this optimization. But I am not sure that copying 8kB is significantly less expensive than the synchronization cost.

The point of the proposal is that you don't need extra sync cost, and that the "local buffers" are that shared memory section.

What we'd do is set LocalBufferBlockPointers to this shared memory segment, and let Pageserver specify which of the N buffers should be accessed.

```mermaid
sequenceDiagram
    PageServer->>"redo page buffers": find and reserve next empty page
    "redo page buffers"->>PageServer: "buffer X"
    PageServer->>"buffer X": fill redo buffer
    PageServer->>WalRedo: Redo WAL on buffer X
    WalRedo->>"buffer X": apply changes
    WalRedo->>PageServer: Redo complete in buffer X
    PageServer->>"buffer X": read changes
    PageServer->>"redo page buffers": buffer is now available for reuse
```

The only synchronization required here is in PageServer's code, to make sure it doesn't try to use more buffers than were allocated. WalRedo only touches the buffer that PageServer indicated for that redo request. PageServer dictates which buffer WalRedo can use for which redo request, so no further synchronization should be needed.
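
A sketch of that ownership scheme on the pageserver side, with hypothetical names: the pageserver alone hands out buffer indices, so the only coordination needed is making sure no more than N requests are in flight at once.

```rust
const N_BUFFERS: usize = 16; // illustrative; N in-flight redo requests, one 8 KiB page buffer each

/// What travels through the send queue instead of the full page image.
struct RedoRequest {
    buffer_idx: u32, // which shared page buffer holds the base image
    // ...the WAL records to apply would follow
}

/// Pageserver-private bookkeeping of which shared page buffers are free.
struct PageBufferPool {
    free: Vec<u32>,
}

impl PageBufferPool {
    fn new() -> Self {
        Self { free: (0..N_BUFFERS as u32).collect() }
    }

    /// Reserve a buffer; `None` means all N buffers already have in-flight requests.
    fn acquire(&mut self) -> Option<u32> {
        self.free.pop()
    }

    /// Return a buffer once the reconstructed page has been read back.
    fn release(&mut self, idx: u32) {
        self.free.push(idx);
    }
}
```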

@knizhnik (Contributor)

Such a protocol will eliminate (or at least complicate) batching of walredo request processing, which currently seems to be the most efficient way to improve performance. Please note that all this walredo activity started with an attempt to break the request-response cycle in pageserver-walredo communication, i.e. to make walredo processing more asynchronous.
I want to point to this Slack conversation: https://neondb.slack.com/archives/C04BLQ4LW7K/p1668784750319719
Please note that replaying walredo from a file (the maximal possible level of asynchrony) takes about 1 second for Q1, while with shmempipe the current record is 11.5 seconds. So memory bandwidth really does not seem to be the bottleneck.

@MMeent (Contributor) commented Dec 27, 2022

As far as I understand, there are 4 memory channels, and this is why 23 GB/sec needs to be multiplied by 4, which gives 96 GB/sec.

That's only the case if you can split the bandwidth across memory channels, which is very unlikely here: a single allocation is extremely likely to be located on only a single DIMM, and is thus limited to the bandwidth of a single memory channel. When processing across multiple pipes you can reach those higher bandwidths, but I don't think we can assume that our memory I/Os are always spread evenly across the memory channels.

Such a protocol will eliminate (or at least complicate) batching of walredo request processing

Why would that be the case? You can have up to N=number-of-buffers WAL redo requests in the pipeline. It won't be much more complicated than the current multiple-producer single-consumer pipeline.

Please note that replaying walredo from a file (the maximal possible level of asynchrony) takes about 1 second for Q1, while with shmempipe the current record is 11.5 seconds. So memory bandwidth really does not seem to be the bottleneck.

How does that "replaying WALRedo from file" work? Do you have a reference? Because my earlier calculation shows that 0.55s of those 11.5s could potentially be attributed to memory copies, which is fairly significant. That does indeed mean that there are still 10.5 seconds of other overhead, but reducing those 0.55s should not be ignored just because they aren't the biggest and most obvious contributor to the time spent.

@knizhnik (Contributor)

Why would that be the case? You can have up to N=number-of-buffers WAL redo requests in the pipeline. It won't be much more complicated than the current multiple-producer single-consumer pipeline.

walredoproc.c:174:

```c
	/*
	 * WAL redo does not need a large number of buffers. And speed of
	 * DropRelFileNodeAllLocalBuffers() is proportional to the number of
	 * buffers. So let's keep it small (default value is 1024)
	 */
	num_temp_buffers = 4;
```

Yes, it can be increased. But why?

How does that "replaying WALRedo from file" work? Do you have a reference? Because my earlier calculation shows that 0.55s of those 11.5s could potentially be attributed to memory copies, which is fairly significant. That does indeed mean that there are still 10.5 seconds of other overhead, but reducing those 0.55s should not be ignored just because they aren't the biggest and most obvious contributor to the time spent.

It was discussed in a Slack thread. I just dumped to a file all the walredo requests that the pageserver sent to the walredo process during Q1 execution, and then fed that file to the walredo process as stdin.

I do not think that we should optimize something which has no significant impact on performance.
It is really bad practice. Complicating the code leads to errors. We have a much more complex prefetch mechanism now, and we still have SIGSEGVs caused by prefetch. Fortunately, none of our users have faced this problem yet. Now you are rewriting the last-written-LSN cache, and it will also become several times more complex. And once again, there seem to be no performance advantages, maybe even some slowdown caused by a more expensive eviction algorithm. "Sharing" the page cache between the pageserver and the walredo process also significantly complicates pageserver-walredo communication. And take into account that the walredo process is not trusted, so it can perform any invalid operation and try to crash the pageserver... Even with shmempipe we faced a lot of issues, and here I expect even more pitfalls...

@vadim2404 added the c/storage/compute (Component: storage: compute) label Jan 9, 2023
@shanyp mentioned this issue Jan 9, 2023
@koivunej (Contributor) commented Jan 9, 2023

Proposing a path for #3163 to become ready, and listing the open questions, in this message. Apologies for the use of footnotes :)

Current state of #3163

Non-brief description of the current state in #3163.

The current implementation works similarly to #2947, but assuming not every reader is familiar with either, I'll recap it here. The two processes share a region of memory. I call the pageserver side the "owner" and the pgxn/walredo/walredoproc.c side the "worker". The shared memory replaces standard input and output (stdin for requests, stdout for responses).

A single thread requesting, with the worker responding, looks like:

  1. owner writes the full request to the first SPSC lockless queue (see docs.rs/ringbuf)
  2. owner wakes up the worker using the first eventfd semaphore (see man eventfd)
  3. worker reads the full request, begins processing it message by message
  4. owner sleeps on second eventfd semaphore
  5. worker writes the full response to the second SPSC lockless queue
  6. worker wakes up the owner using the second eventfd semaphore
  7. owner reads the response, and is complete

For OLAP[1] situations the implementation supports owner-side pipelining of the requests, with each requesting thread waiting its turn to read the response. Threads are parked and woken up in order, or get lucky and don't have to park at all.

Both lockless queues are guarded by std::sync::Mutex on the owner side. The ringbuf crate handles sufficient atomic operations to ensure visibility between the processes.
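
The "threads wait their turn to read the response" part could be pictured as a simple ticket scheme like the sketch below; this is purely illustrative, not the parking logic actually used in #3163:

```rust
use std::sync::{Condvar, Mutex};

/// Each thread takes a ticket when it enqueues its request and may only read a
/// response once its ticket number comes up, so responses are consumed in order.
struct TurnKeeper {
    next_to_read: Mutex<u64>,
    turn_changed: Condvar,
}

impl TurnKeeper {
    fn wait_for_turn(&self, my_ticket: u64) {
        let mut current = self.next_to_read.lock().unwrap();
        while *current != my_ticket {
            current = self.turn_changed.wait(current).unwrap();
        }
    }

    fn finish_turn(&self) {
        *self.next_to_read.lock().unwrap() += 1;
        self.turn_changed.notify_all();
    }
}
```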

Proposed path forward

  1. OLTP[1] performance [@koivunej + @knizhnik will work in pair]
    • this has been a major source of frustration for me, since all of the solutions seem slower than Shmem pipe #2947; however, this is balanced by the fact that there is less spinning, so most likely this is now well balanced
  2. clean up all pthread primitives and unused code [done by @koivunej]
    • atomics will most likely be enough unless I forget something
  3. get miri tests working again [@koivunej is working on it]
    • miri reports any known-illegal Rust operations
    • will require shimming the eventfd usage in a single process
  4. "vendor" the ringbuffer crate [@knizhnik]
    • with upstreamable try_ operations which only do checked operations
    • kill child on miscalculated head/tail
  5. cleanup pageserver/src/walredo.rs [@knizhnik ]
    • switch to tokio task waiting for the sub process to die while pumping the stderr messages
  6. implement the "restart on child death" for the OwnedRequester::request_response [@knizhnik]
    • process exit and any miscalculated head/tail will need to trigger this
    • all phases of writing request, parking, reading response need to be "restartable"
    • all threads calling OwnedRequester::request_response will need to agree on the restart before anyone can actually restart
    • process needs to be restarted as well

Later on, on following PRs:

  1. asyncify the OwnedRequester::request_response by making the eventfd, thread parking, locking async [@koivunej]
    • the "restart on child death" will be semi-tricky to implement with cancellable futures
    • supporting std::mem::forget'd futures will be very difficult[2]

Open questions:

  • how to do non-Linux eventfd? Alternatives I can see:
    • instead of one fd, have two pipes, reading and writing one byte at a time
    • keeping the old poll-based implementation around for macOS

Footnotes

  1. There is pageserver/bench/bench_walredo.rs, where short/short/1 and medium/medium/1 are assumed to show the OLTP case, and the OLAP case is shown by /N where N > 1. The number N signifies how many threads are concurrently redoing a page. The benchmarks go as high as N=16, which hasn't been observed in the wild.

  2. Supporting forgettable futures will be very difficult, and it might come down to whether we want to support them or not. The downside of "not supporting" is the lingering possibility of facing a very mysterious, rare debugging case once someone introduces a code path which might leak a rare walredo future.
    The problem with forgotten futures is that we will have no knowledge of the forgetting taking place, so when it's time to read the response, nothing will happen when a wakeup is sent to the task whose turn it is to read the response, and we will deadlock.
    The only design I can see supporting std::mem::forget of the futures is Tokio based walredo #2875, which might become more viable as we continue to asyncify the core (Introduce RequestContexts. #3228).

@hlinnaka (Contributor, Author) commented Jan 9, 2023

Proposed path forward

+1 on this plan

  1. OLTP[1] performance
  • this has been a major source of frustration for me, since all of the solutions seem slower than Shmem pipe #2947; however, this is balanced by the fact that there is less spinning, so most likely this is now well balanced

Right, it's always a tradeoff to decide how long to busy-wait. If you busy-wait for too long, you waste a lot of resources spinning, but without busy-waiting, it takes a few microseconds for notification to come through eventfd.

I spent some time digging into virtualization, qemu, kvm, and virtio a few weeks back. They have the same problem: the guest VM and the host communicate over a queue in shared memory, but after adding some work to the queue, you need a mechanism to notify the other side. You can use a "vmexit" from the guest VM, which uses eventfd to notify the process on the host that there is something in the queue. Or you can busy-poll. See https://vmsplice.net/~stefan/stefanha-kvm-forum-2017.pdf for a presentation on this. The core idea is that after processing some work, you busy-wait for a few microseconds in case more work arrives quickly enough, and only then go to sleep.

Later on, on following PRs:

  1. asyncify the OwnedRequester::request_response by making the eventfd, thread parking, locking async

    • the "restart on child death" will be semi-tricky to implement with cancellable futures
    • supporting std::mem::forget'd futures will be very difficult[2]

You could do this work on top of #3228. It's still under review and will surely change somewhat before merging, but the core change of TimelineGet::get to be async is there and shouldn't change much.

@knizhnik (Contributor) commented Jan 9, 2023

Right, it's always a tradeoff to decide how long to busy-wait. If you busy-wait for too long, you waste a lot of resources spinning, but without busy-waiting, it takes a few microseconds for notification to come through eventfd.

Atomics and busy-waiting are efficient if we have a number of requests which can be combined.
This is why shmpipe shows the best results for OLAP requests, where prefetch or parallel workers can issue multiple requests to walredo. It should also be efficient if there are multiple concurrent OLTP requests.
But if there is only one client issuing OLTP requests (which is expected to be the most common case for free-tier clients),
then all these atomics and busy loops cannot reduce the number of syscalls, because we have to send the request to the walredo process and wait for its result in any case. Page reconstruction is not such a trivial task (it copies at least 8kB in each direction), so waiting for its completion in a busy loop may not be such a good idea, and may have a negative impact on overall pageserver performance.

@koivunej (Contributor)

See https://vmsplice.net/~stefan/stefanha-kvm-forum-2017.pdf for a presentation on this

Looking at the relevant sysfs tunables on my machine, they match what is on these slides -- noteworthy that shrink is zero. I did some initial testing with a fixed "100 µs" wait (elapsed time is checked every 1024 rounds; if over 100 µs, go to wait on the eventfd). This did seem to perform quite well, but @knizhnik had some doubts whether setting a fixed time target was any better than just doing a fixed number of loop iterations.

Benchmarking using the OLTP case is a bit tricky, as safekeeper was broken on the commit I started on. The "this seemed to perform quite well" was determined by running on the same parent commit that the shmem_pipe branch started from; however, after rebasing onto a very recent main there is a different kind of baseline, and 5k TPS is no longer reached. (I didn't check by rebasing shmem_pipe because the Makefile rewrite happened in the meantime.)

When doing the busy loop tuning, the microbenchmark is worthless, because it seems you always "win" in it when you busy-loop.

@knizhnik (Contributor)

Some of my thoughts about shmpipe after a conversation with @koivunej:

  1. The current implementation seems to me to be too complex and "unsafe" (in the sense that a lot of the code is marked as unsafe, so this Rust implementation can hardly be treated as safer than a C implementation). Actually a ring buffer is a very simple data structure: just two pointers, modular arithmetic and a few checks, especially if it is accessed in a critical section (protected by some mutex), which both implementations do.
  2. I agree with @koivunej that the Rust implementation is better from an integration point of view (building, testing, ...) and it will be easier to adapt it to the async model once all pageserver code becomes async.
  3. The advantage of the shared memory pipe is mostly achieved by using a busy loop which tries to minimize syscalls. My first attempt to implement shmpipe using just a pthread mutex/condvar showed performance comparable with the stdio pipe. But a busy loop increases CPU usage, which may reduce overall system performance and increase consumed CPU time (and so costs). Also, it only helps if we have multiple concurrent requests to walredo, which will be the case when executing OLAP queries or multiple concurrent OLTP queries. But if there is just one backend executing OLTP queries, shmpipe will not give any advantage compared with the current stdio pipe.
  4. This Rust shmpipe implementation uses the ringbuf module, which IMHO is also overcomplicated. But the main problem is that the ringbuf module doesn't contain any protection against incorrect direct modification of the contents of shared memory (head/tail positions). I failed to find a way of integrating these changes without changing the module itself, which means that we would need to maintain our own fork of it.
  5. Detecting a walredo process crash is more difficult than with the stdio pipe (where you just receive SIGCHLD). It can be done in different ways: use a pipe for waiting in the busy loop, or start a separate task which waits for the walredo process status and then wakes up all processes blocked on the eventfd...
  6. Using some kind of adaptive algorithm for the busy wait may not be such a good idea in our case, because our goal is not only to increase performance (decrease latency) but also to reduce CPU consumption. Also, the workload can change frequently (number of active tenants). I am afraid that if we try to use this shmpipe version in production, we will observe a significant increase in CPU usage. And if we tune the parameters to do fewer iterations of the busy loop before waiting, then performance will become very similar to what we have now on main.

@koivunej (Contributor)

Closing the shmempipe_rs (#3163) efforts in favor of a simpler #3368.
