Epic: shmempipe for WAL redo process communication #3184
Comments
The portable alternative to eventfd/futex is just a pipe or a Unix socket. If you want, I can try to implement such a synchronization primitive.
We can pass down the file descriptor through fork() here as well. In fact I think that would be a better way to pass down the file descriptor for the "shared memory file" too, instead of passing down the filename and re-opening it in the child process. It's a little tricky to get right when forking from a multi-threaded process, but still possible.
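For illustration, a minimal sketch of that fork-based approach, assuming the `libc` crate; the use of `memfd_create` and the function name `spawn_with_shmem` are illustrative, not taken from the actual shmempipe code:

```rust
use std::io;

// Hypothetical sketch: create the "shared memory file" as an anonymous
// memfd and let the child inherit the descriptor across fork(), so the
// child never has to re-open anything by filename.
fn spawn_with_shmem() -> io::Result<()> {
    let fd = unsafe { libc::memfd_create(b"shmempipe\0".as_ptr().cast(), 0) };
    if fd < 0 {
        return Err(io::Error::last_os_error());
    }
    // NOTE: forking from a multi-threaded process is only safe if the
    // child restricts itself to async-signal-safe calls before exec.
    match unsafe { libc::fork() } {
        -1 => Err(io::Error::last_os_error()),
        0 => {
            // Child: `fd` is still valid here; mmap() it and start the
            // redo loop (or exec, with FD_CLOEXEC left unset on `fd`).
            Ok(())
        }
        _child_pid => {
            // Parent: keeps using the same `fd` for its own mapping.
            Ok(())
        }
    }
}
```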
I don't think it's a lot of duplication. eventfd and a pipe are very similar: you write() to notify the other process, and read() to wait for the signal. I think you just need a small #ifdef.
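A minimal sketch of what that small abstraction could look like in Rust, using a `cfg` switch rather than a C `#ifdef`; this assumes the `libc` crate and elides most error handling:

```rust
use std::io;

pub struct Notifier {
    read_fd: libc::c_int,
    write_fd: libc::c_int,
}

impl Notifier {
    #[cfg(target_os = "linux")]
    pub fn new() -> io::Result<Self> {
        // On Linux, a single eventfd serves as both ends.
        let fd = unsafe { libc::eventfd(0, 0) };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(Notifier { read_fd: fd, write_fd: fd })
    }

    #[cfg(not(target_os = "linux"))]
    pub fn new() -> io::Result<Self> {
        // Elsewhere, use a pipe: read end to wait, write end to notify.
        let mut fds = [0 as libc::c_int; 2];
        if unsafe { libc::pipe(fds.as_mut_ptr()) } < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(Notifier { read_fd: fds[0], write_fd: fds[1] })
    }

    /// Wake up the peer. An 8-byte write works for both backends: eventfd
    /// requires exactly 8 bytes, and a pipe accepts any length.
    pub fn notify(&self) {
        let one: u64 = 1;
        unsafe { libc::write(self.write_fd, &one as *const u64 as *const _, 8) };
    }

    /// Block until the peer notifies us. eventfd resets its counter on
    /// read; the pipe variant drains up to 8 queued token bytes.
    pub fn wait(&self) {
        let mut buf = [0u8; 8];
        unsafe { libc::read(self.read_fd, buf.as_mut_ptr() as *mut _, 8) };
    }
}
```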
An idea on reducing memory copies

The shmempipe implementation has dedicated areas for sending data and receiving data. Both have a fixed buffer size, resulting in a limited amount of data we can fit in these queues. We can, however, split the shared memory into 3 main areas, not 2: a send queue, a receive queue, and a pool of page buffers.

Assuming we think it is OK to limit the queue to N in-flight redo requests, we can make Pageserver put the page image in an empty page buffer in shared memory, transfer ownership of that buffer through the send queue, and receive back ownership with the response from the receive queue. We know we send variable-sized data, but usually at least one 8kB page, and we know we'll (normally) receive a response containing an 8kB block.

#2947 (comment) mentions some interesting numbers: KetteQ consumes 400k pages in 11.5s. Presumably, our storage utilizes DDR4-2933. According to this Wikipedia article, that has a peak data transfer rate of 23466MB/sec, or 23.466GB/sec (note: the linked comment mentions 96Gbps, which feels wrong). 400k pages * 8kiB * 4 copy operations (image -> redo pipe -> buffer -> response pipe -> return value) = 13.1GB. At DDR4-2933 transfer rates, these memcpy operations alone would take 0.558 seconds (not the 30usec mentioned in the comment), or 4.8% of the total time.

Utilizing a shared area for the full-page images that are transferred through the shmempipe, we reduce the number of memcopies by at least half, and with it, presumably, the time spent in WAL redo by a similar, if not larger, fraction.
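A rough sketch of what that three-area layout could look like; all names and sizes here are illustrative, not taken from the actual shmempipe code:

```rust
use std::sync::atomic::AtomicU32;

const N_BUFFERS: usize = 32; // max in-flight redo requests (illustrative)
const PAGE_SIZE: usize = 8192;

// Each queue slot carries only a buffer index; the 8kB page images are
// written and read in place in `page_buffers`, never copied through a pipe.
#[repr(C)]
struct Queue {
    read_pos: AtomicU32,
    write_pos: AtomicU32,
    slots: [u32; N_BUFFERS],
}

#[repr(C)]
struct SharedArea {
    send_queue: Queue, // pageserver -> walredo: "apply WAL to buffer i"
    recv_queue: Queue, // walredo -> pageserver: "buffer i is done"
    page_buffers: [[u8; PAGE_SIZE]; N_BUFFERS],
}
```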
As far as I understand there are 4 channels, and this is why 23GB/sec needs to be multiplied by 4, which gives 96GB/sec. I already wrote that my main argument against this proposal is that the walredo process is not using shared buffers: it uses local buffers to avoid synchronization overhead. The effect is also not so large: I didn't notice a big performance improvement after this optimization. But I am not sure that copying 8kb is significantly less expensive than the synchronization cost.
The point of the proposal is that you don't need extra sync cost, and that "local buffers" is that shared memory section. What we'd do is set up the following flow:

```mermaid
sequenceDiagram
    PageServer->>"redo page buffers": find and reserve next empty page
    "redo page buffers"->>PageServer: "buffer X"
    PageServer->>"buffer X": fill redo buffer
    PageServer->>WalRedo: Redo WAL on buffer X
    WalRedo->>"buffer X": apply changes
    WalRedo->>PageServer: Redo complete in buffer X
    PageServer->>"buffer X": read changes
    PageServer->>"redo page buffers": buffer is now available for reuse
```
The only synchronization required here is in PageServer's code, to make sure it doesn't try to use more buffers than were allocated. WalRedo only touches the buffer that was indicated by PageServer for that redo request. PageServer dictates which buffer WalRedo can use for which redo request, so no further synchronization should be needed.
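For example, the owner-side reservation could be as small as a single atomic bitmap; a sketch under the illustrative N <= 32 buffers assumption from above, not actual shmempipe code:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Only PageServer threads touch this bitmap; WalRedo never needs it.
struct BufferPool {
    free_mask: AtomicU32, // bit i set => buffer i is free
}

impl BufferPool {
    /// Reserve a free buffer, or None if all N are in flight.
    fn reserve(&self) -> Option<usize> {
        let mut mask = self.free_mask.load(Ordering::Acquire);
        loop {
            let idx = mask.trailing_zeros(); // lowest free buffer
            if idx == 32 {
                return None; // all buffers in flight
            }
            match self.free_mask.compare_exchange_weak(
                mask,
                mask & !(1 << idx),
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(idx as usize),
                Err(current) => mask = current, // lost a race; retry
            }
        }
    }

    /// Return a buffer after reading the redo result out of it.
    fn release(&self, idx: usize) {
        self.free_mask.fetch_or(1 << idx, Ordering::Release);
    }
}
```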
Such a protocol will eliminate (or at least complicate) batching of walredo request processing, which currently seems to be the most efficient way to improve performance. Please note that all this walredo activity started with an attempt to break the request-response cycle in pageserver-walredo process communication, i.e. make walredo processing more asynchronous.
That's only the case if you can split the bandwidth across memory channels, which is very unlikely to be the case: a single allocation is extremely likely to be located on only a single DIMM, which is thus limited to the bandwidth of only a single memory channel. When processing across multiple pipes you can get those higher bandwidths, but I don't think we can assume that our memory IOs are always spread evenly across the memory channels.
Why would that be the case? You can have up to N=number-of-buffers WAL redo requests in the pipeline. It won't be much more complicated than the current multiple-producer single-consumer pipeline.
How does that "replaying walredo requests from a file" work? Do you have a reference? Because my earlier calculation shows that 0.55s of those 11s could potentially be attributed to memory copies, which is fairly significant. That does indeed mean that there are still 10.5 seconds of other overhead, but reducing those 0.55s should not be ignored just because it isn't the biggest and most obvious contributor to the time spent.
walredoproc.c:174:
Yes, it can be increased. But why?
It was discussed in a Slack thread. I just dumped to a file all walredo requests which were sent to the walredo process by the pageserver during Q1 execution, and then fed this file as stdin to the walredo process. I do not think that we should optimize something which has no significant impact on performance.
Proposing a path for #3163 to become ready, and open questions, in this message. Apologies for the use of footnotes :)

Current state of #3163

Non-brief description of the current state in #3163. The current implementation works similarly to #2947, but assuming not every reader is familiar with either, I'll recap it here. The two processes share a region of memory. I call the pageserver side the "owner" and the walredo side the "worker". A single thread requesting and the worker responding looks like:
For OLAP1 situations the implementation supports owner-side pipelining of the requests, with each requesting thread waiting its turn to read the response. Threads are parked and woken up in order, or by being lucky and not having to park at all. Both lockless queues are guarded by

Proposed path forward
Later on, in follow-up PRs:
Open questions:
Footnotes
+1 on this plan
Right, it's always a tradeoff to decide how long to busy-wait. If you busy-wait for too long, you waste a lot of resources spinning, but without busy-waiting, it takes a few microseconds for the notification to come through eventfd. I spent some time digging into virtualization, qemu, kvm, and virtio a few weeks back. They have the same problem: the guest VM and the host communicate over a queue in shared memory, but after adding some work to the queue, you need a mechanism to notify the other side. You can use a "vmexit" from the guest VM, which uses eventfd to notify the process in the host that there is something in the queue. Or you can busy-poll. See https://vmsplice.net/~stefan/stefanha-kvm-forum-2017.pdf for a presentation on this. The core idea is that after processing some work, you busy-wait for a few microseconds in case more work arrives quickly enough, and then sleep.
You could do this work on top of #3228. It's still under review and will surely change somewhat before merging, but the core change of
Atomics and busy-waiting are efficient if we have some number of requests which can be combined.
Looking at the relevant sysfs tunables on my machine, they match what is on these slides -- noteworthy that shrink is zero. I did some initial testing with a fixed 100us wait (elapsed time is checked every 1024 rounds; if over 100us, then go to wait on eventfd). This did seem to perform quite well, but @knizhnik had some doubts whether setting a fixed time target was any better than just doing a fixed number of loops. Benchmarking using the OLTP case is a bit tricky, as safekeeper was broken on the commit I started on; the "this seemed to perform quite well" was determined by running it on the same parent commit. When doing the busy-loop tuning, the microbenchmark is worthless, because it seems that you always "win" in it when you busy-loop.
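For reference, a sketch of the spin-then-sleep shape being discussed; the 100us budget and the check-every-1024-rounds cadence are the numbers from above, while `has_work` and the eventfd-backed `Notifier` from the earlier sketch are illustrative stand-ins:

```rust
use std::time::{Duration, Instant};

// Spin briefly checking for work; fall back to blocking on the eventfd
// once the spin budget is exhausted. A real implementation also needs a
// "consumer is sleeping" flag so the producer knows when a wakeup write
// is required and no wakeup can be missed.
fn wait_for_work(has_work: impl Fn() -> bool, notifier: &Notifier) {
    const SPIN_BUDGET: Duration = Duration::from_micros(100);
    let start = Instant::now();
    let mut rounds: u32 = 0;
    loop {
        if has_work() {
            return; // work arrived while spinning; no syscall needed
        }
        rounds += 1;
        // Instant::now() is not free, so only check the clock occasionally.
        if rounds % 1024 == 0 && start.elapsed() > SPIN_BUDGET {
            break; // spin budget exhausted; sleep instead
        }
        std::hint::spin_loop();
    }
    notifier.wait(); // blocks until the producer writes to the eventfd
}
```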
Some of my thoughts about shmpipe after a conversation with @koivunej:
Motivation
Whenever the pageserver needs to replay WAL records, it sends the records to the WAL redo process over a pipe, and the WAL redo process responds over another pipe. The communication over those pipes adds quite a lot of latency. Let's replace the pipes with shared memory, which is faster.
DoD
WAL redo latency is reduced.
Implementation
The idea is to mmap() a piece of shared memory between the pageserver process and the WAL redo process. In that shared memory segment, establish queues for the requests and responses. A ring buffer is the usual way to implement such queues.
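As an illustration, a minimal single-producer/single-consumer ring that could live inside the mmap()ed segment; this is a sketch only -- the real queues must handle variable-length messages, and cross-process use needs `UnsafeCell`/raw pointers rather than `&mut self`:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const CAPACITY: usize = 1 << 16; // bytes; positions grow monotonically and wrap via %

#[repr(C)]
pub struct Ring {
    write_pos: AtomicUsize, // advanced only by the producer
    read_pos: AtomicUsize,  // advanced only by the consumer
    data: [u8; CAPACITY],
}

impl Ring {
    /// Producer: copy `msg` into the ring, or report "full" so the
    /// caller can decide whether to spin or sleep.
    pub fn push(&mut self, msg: &[u8]) -> bool {
        let w = self.write_pos.load(Ordering::Relaxed);
        let r = self.read_pos.load(Ordering::Acquire);
        if CAPACITY - (w - r) < msg.len() {
            return false;
        }
        for (i, &b) in msg.iter().enumerate() {
            self.data[(w + i) % CAPACITY] = b;
        }
        // Publish the data before making it visible to the consumer.
        self.write_pos.store(w + msg.len(), Ordering::Release);
        true
    }

    /// Consumer: how many bytes are ready to be read.
    pub fn available(&self) -> usize {
        self.write_pos.load(Ordering::Acquire) - self.read_pos.load(Ordering::Relaxed)
    }
}
```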
In addition to the shared memory segment, there is a notification mechanism to notify the other process that there is a request or response waiting in the queue. We can continue using a pipe for that: just send a single byte to wake up the other process. Or eventfd(2) or futex(2), but those are Linux-specific. Perhaps use eventfd(2) with a fallback implementation using a regular pipe for other platforms. Inter-process notification over any of those mechanisms adds quite a lot of latency, so it is best to busy-wait for a while, and only go to sleep waiting for the notification after that.
Note that it is not safe to use pthread_mutexes or condition variables for the inter-process notification for security reasons. The pageserver cannot assume that the memory holding the mutex has valid contents.
Requirements
Related work
Initial Proof-of-Concept by Konstantin: #2947
Current work-in-progress PR: #3163
Earlier discussion on WAL redo speed: #1339