
shmempipe in rust #3163

Closed

wants to merge 76 commits into from
Conversation

koivunej
Contributor

@koivunej koivunej commented Dec 20, 2022

Very much an early draft; still has crashes and a lot of refactoring to do. Port/rewrite of the earlier #2947.

Creating a PR to solicit feedback, especially on:

  • Makefile and C integration

And to manage TODOs:

  • make sure worker request read side waits on a condition
  • investigate the rare crash caused by something in ringbuf
  • align write/read mutex and conditions
  • short/short/1 around 10us, Shmem pipe #2947 is 7us, main is 12us
    • without seccomp
  • switch to tokio for the purpose of detecting death, keep the old poll for macOS
    • signal any spinners/sleepers about the death to restart
  • fix miri tests once more (broken with pthread cond)

@hlinnaka
Contributor

Thinking about the security aspects of this a little: the postgres process is untrusted, so we have to assume that it could write arbitrary malicious data to the shared memory area. Is the ringbuf crate robust against that? We need to vet it very carefully; I don't see anything in the crate docs that talks about that scenario. I've got a feeling that we cannot really trust it and will need to write our own implementation from scratch with that in mind. The amount of unsafe in this PR and in ringbuf is scary.

@hlinnaka
Contributor

Makefile and C integration

Those parts look ok to me

@knizhnik
Contributor

Thinking about the security aspects of this a little: the postgres process is untrusted, so we have to assume that it could write arbitrary malicious data to the shared memory area. Is the ringbuf crate robust against that? We need to vet it very carefully; I don't see anything in the crate docs that talks about that scenario. I've got a feeling that we cannot really trust it and will need to write our own implementation from scratch with that in mind. The amount of unsafe in this PR and in ringbuf is scary.

I have to say that my original C implementation of shmpipe also suffers from this problem. The intent was that the walredo process calls just two functions from libshmpipe.a to receive a request and send a response. But in any case we have a region of memory shared between the walredo and pageserver processes, and a cracked walredo process can place arbitrary data in this segment. That means the pageserver must treat the content of this region as absolutely untrusted: any manipulation of the pipe positions or sync primitives should be preceded by verification of these structs. That is not so hard to do for a ring buffer position, but how can I validate, for example, the content of a pthread_mutex and guarantee that calling pthread_mutex_lock will not cause SIGSEGV?

I am afraid the same problem exists in the Rust version of shmpipe, especially if it uses some external crate. Most likely the ringbuf crate is not prepared for somebody trying to corrupt the content of the buffer. The question is whether such verification can be done efficiently at all.

@hlinnaka
Contributor

That is not so hard to do for a ring buffer position, but how can I validate, for example, the content of a pthread_mutex and guarantee that calling pthread_mutex_lock will not cause SIGSEGV?

Ouch, I hadn't thought of that!

Perhaps we should use eventfd(2) for the notifications instead of pthreads. Not sure how its performance compares with pthread condition variables. And I guess eventfd is not available on macOS, but we could emulate it with a regular pipe there.

I am afraid the same problem exists in the Rust version of shmpipe, especially if it uses some external crate. Most likely the ringbuf crate is not prepared for somebody trying to corrupt the content of the buffer. The question is whether such verification can be done efficiently at all.

It's just a few array bounds checks in the right places, so I'm not worried about the performance here. But I'm afraid it rules out using any existing crates, and we really need to design the thing carefully. We should keep the unsafe primitives as small as possible so that they can be reviewed carefully, and build higher level abstractions on top of that.
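A minimal sketch of what eventfd-based wakeups could look like on the Rust side, assuming the libc crate; the function names are hypothetical, not code from this PR. The key property is that the kernel-side counter lives outside the shared mapping, so the untrusted peer cannot corrupt the primitive itself, only fail to signal it:

    use std::io;
    use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

    /// Create an eventfd (Linux only), counter starting at 0.
    fn new_eventfd() -> io::Result<OwnedFd> {
        let fd = unsafe { libc::eventfd(0, libc::EFD_CLOEXEC) };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(unsafe { OwnedFd::from_raw_fd(fd) })
    }

    /// Notify the other side: writing adds to the 64-bit counter,
    /// and the peer's read(2) consumes (and resets) it.
    fn notify(fd: &OwnedFd) -> io::Result<()> {
        let one: u64 = 1;
        let n = unsafe { libc::write(fd.as_raw_fd(), &one as *const u64 as *const libc::c_void, 8) };
        if n != 8 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }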

@koivunej
Contributor Author

koivunej commented Dec 21, 2022

Updated the description. This is actually slower than Konstantin's work: around 10us for short/short/1 vs. 7us for #2947.

pthread_mutex_lock will not cause SIGSEGV?

Ouch, I hadn't thought of that!

I guess lockups and thread starvation would be a better angle, but this is a good point :)

And I guess eventfd is not available on macos

Even trying to find out what is supported and available on macOS (not to mention what actually works) is tedious. I was thinking we could just keep the old stdio-based implementation there for compatibility.

It's just a few array bounds checks in the right places, so I'm not worried about the performance here. But I'm afraid it rules out using any existing crates, and we really need to design the thing carefully. We should keep the unsafe primitives as small as possible so that they can be reviewed carefully, and build higher level abstractions on top of that.

Why does it rule out other crates? The index manipulations are here: https://docs.rs/ringbuf/latest/src/ringbuf/ring_buffer/base.rs.html#110-144, and if we look closer this is all safe code. The head() and tail() calls do atomic reads, and everything is bounds-checked.

(From what I remember reading lately, removing bounds checks from Rust with unsafe has on many occasions made performance worse, so I wasn't even thinking that someone would do that. But this might be because so many people use iterators, which allow better codegen than indexed accesses all over.)

unsafe primitives

If this is related to how I use unsafe all over with Pin::new_unchecked to make sure the pthread primitives are not moved, I don't think it's a valid concern, at least for this usage of Pin.
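For context, a minimal sketch of why Pin shows up at all when placing a pthread mutex into the mapping; this assumes the libc crate, and SharedMutex is a hypothetical name, not the PR's actual type:

    use std::mem::MaybeUninit;
    use std::pin::Pin;

    /// A process-shared pthread mutex must be initialized in place and must
    /// never move afterwards: the kernel and the peer process address it by
    /// its location inside the shared mapping.
    #[repr(C)]
    struct SharedMutex(libc::pthread_mutex_t);

    impl SharedMutex {
        /// Initialize in place inside the shared mapping; the Pin records
        /// the caller's promise that this slot will never be moved.
        unsafe fn init(slot: Pin<&mut MaybeUninit<SharedMutex>>) {
            let mut attr = MaybeUninit::<libc::pthread_mutexattr_t>::uninit();
            libc::pthread_mutexattr_init(attr.as_mut_ptr());
            // PROCESS_SHARED is what makes it usable across the two processes
            libc::pthread_mutexattr_setpshared(attr.as_mut_ptr(), libc::PTHREAD_PROCESS_SHARED);
            let mutex = slot.get_unchecked_mut().as_mut_ptr() as *mut libc::pthread_mutex_t;
            libc::pthread_mutex_init(mutex, attr.as_ptr());
            libc::pthread_mutexattr_destroy(attr.as_mut_ptr());
        }
    }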

@knizhnik
Contributor

I guess lockups and thread starvation would be a better angle, but this is a good point :)

Lockups and starvation may affect only one tenant, while a SIGSEGV will crash the whole pageserver and so affect all tenants.

Frankly speaking, I do not see an acceptable safe solution at this moment.
There is no reliable way to verify a system sync primitive with shared state (i.e. one whose content is located in shared memory and can be directly changed by the walredo process).
Using other system calls for synchronization (stdio pipes, as postgres uses for latches; eventfd; POSIX semaphores; ...) seems to be the more reliable solution.

Verification of the other shmpipe state variables (positions in the buffers) should also be done with special care: since the content of the memory can be changed at any time (whether or not we are in a critical section), it is necessary either to use atomic CAS or to copy the state to local variables and work with those local copies, as sketched below.
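A minimal sketch of that discipline (hypothetical names, not the PR's actual code): load the shared index exactly once, then use only the validated local copy.

    use std::sync::atomic::{AtomicU32, Ordering};

    const RING_CAPACITY: u32 = 64 * 1024; // hypothetical ring size

    /// Snapshot an index the peer may rewrite at any moment. Every later
    /// computation must use the returned local value and never re-read the
    /// shared word; an out-of-range value fails the pipe instead of being
    /// used for indexing.
    fn checked_read_pos(shared_head: &AtomicU32) -> Option<u32> {
        let head = shared_head.load(Ordering::Acquire); // single atomic read
        if head < RING_CAPACITY {
            Some(head)
        } else {
            None // corrupted or malicious: tear down the pipe
        }
    }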

@koivunej
Contributor Author

I guess lockups and thread starvation would be a better angle, but this is a good point :)

Lockups and starvation may affect only one tenant.

We do have a global pool of threads, however, so starving any tokio worker would decrease their number, and after 8 of them page request handling would lock up.

@hlinnaka
Contributor

It's just a few array bounds checks in the right places, so I'm not worried about the performance here. But I'm afraid it rules out using any existing crates, and we really need to design the thing carefully. We should keep the unsafe primitives as small as possible so that they can be reviewed carefully, and build higher level abstractions on top of that.

Why does it rule out other crates? The index manipulations are here: https://docs.rs/ringbuf/latest/src/ringbuf/ring_buffer/base.rs.html#110-144, and if we look closer this is all safe code. The head() and tail() calls do atomic reads, and everything is bounds-checked.

That particular code looks safe. I'm still scared though. Look at RbBase::occupied_len, for example:

    /// The number of items stored in the buffer at the moment.
    fn occupied_len(&self) -> usize {
        let modulus = self.modulus();
        (modulus.get() + self.tail() - self.head()) % modulus
    }

The untrusted process can set 'head' and 'tail' to anything, and with suitable values this integer arithmetic can overflow. Integer overflow panics on debug builds. A panic won't lead to arbitrary code execution, but it's still unpleasant.

I'm not sure if there are more serious cases than that. Maybe not. But the bottom line is that the crate wasn't really designed with such a hostile environment in mind, so it's on us to inspect it and determine which functions are safe to use and what kinds of inconsistencies are possible. For example, with suitable values of 'head' and 'tail', it's possible that is_empty() == false and occupied_len() == 0 at the same time. And whenever the crate is updated, we need to inspect the changes again.
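For illustration, a hedged sketch of how that computation could be hardened; this is not ringbuf's code, just one way to avoid the panic and reject impossible values:

    /// Overflow-proof variant of the arithmetic quoted above: wrapping ops
    /// cannot panic on debug builds, and the final range check rejects
    /// lengths that no well-behaved peer could have produced (in ringbuf the
    /// modulus is twice the capacity, so a valid length is at most half of it).
    fn occupied_len_checked(head: usize, tail: usize, modulus: usize) -> Option<usize> {
        let len = tail.wrapping_sub(head).wrapping_add(modulus) % modulus;
        if len <= modulus / 2 {
            Some(len)
        } else {
            None // hostile head/tail: treat the pipe as broken
        }
    }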

@koivunej
Contributor Author

Microbenchmarking results.

I don't know right now what causes that 3us gap. With short/{8,16} the difference in this PR's favor is probably due to how I wake up the threads in order; none of them spin.

shmem_pipe --disable-seccomp:

short/short/1           time:   [7.7323 µs 7.7366 µs 7.7409 µs]
short/short/2           time:   [25.548 µs 25.715 µs 25.840 µs]
short/short/4           time:   [43.238 µs 43.448 µs 43.663 µs]
short/short/8           time:   [79.991 µs 80.242 µs 80.504 µs]
short/short/16          time:   [206.89 µs 211.08 µs 215.92 µs]

medium/medium/1         time:   [130.35 µs 130.57 µs 130.78 µs]
medium/medium/2         time:   [280.39 µs 280.76 µs 281.11 µs]
medium/medium/4         time:   [533.36 µs 534.83 µs 536.29 µs]
medium/medium/8         time:   [1.0486 ms 1.0519 ms 1.0556 ms]
medium/medium/16        time:   [2.2088 ms 2.2468 ms 2.2849 ms]

shmempipe_rs --disable-seccomp and two eventfds:

short/short/1           time:   [10.811 µs 10.836 µs 10.873 µs]
short/short/2           time:   [29.148 µs 29.410 µs 29.631 µs]
short/short/4           time:   [49.000 µs 50.056 µs 51.787 µs]
short/short/8           time:   [81.239 µs 81.955 µs 82.660 µs]
short/short/16          time:   [173.79 µs 174.95 µs 176.15 µs]

medium/medium/1         time:   [144.36 µs 144.60 µs 144.85 µs]
medium/medium/2         time:   [301.73 µs 302.48 µs 303.45 µs]
medium/medium/4         time:   [582.53 µs 583.93 µs 585.40 µs]
medium/medium/8         time:   [1.1050 ms 1.1080 ms 1.1111 ms]
medium/medium/16        time:   [2.2121 ms 2.2170 ms 2.2218 ms]

@koivunej
Contributor Author

koivunej commented Dec 21, 2022

I think the delta between the solutions has to be within walredoproc.c, in the copying and in possibly enlarging to the wanted size vs. only the additionally needed bytes. Unsure how often that happens, since the microbenchmarks only have examples where will_init == true:

enlargeStringInfo(inBuf, len);

Thinking about this more, it probably doesn't happen, because the StringInfo is reset at the beginning of each round, assuming the reset works and doesn't enlarge needlessly.

However, my hacky attempt at optimistically slicing did not yield any measurable difference: https://gist.github.com/koivunej/4b31df6f8e222019c96ff56a1839398c

Next I'm trying full-message reading with in-band length prefixing.
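A minimal sketch of such in-band framing, with hypothetical helpers (not the PR's actual wire format): a little-endian u32 length prefix, parsed with bounds checks so a hostile length can never index past what has actually arrived.

    /// Prepend a little-endian u32 length to the payload.
    fn frame(payload: &[u8]) -> Vec<u8> {
        let mut out = Vec::with_capacity(4 + payload.len());
        out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        out.extend_from_slice(payload);
        out
    }

    /// Split one framed message off the front of `buf`, returning
    /// (message, rest). Returns None until a complete frame is available;
    /// every access is bounds-checked against the bytes actually received.
    fn parse(buf: &[u8]) -> Option<(&[u8], &[u8])> {
        let len = u32::from_le_bytes(buf.get(..4)?.try_into().ok()?) as usize;
        let end = 4usize.checked_add(len)?;
        let msg = buf.get(4..end)?;
        Some((msg, &buf[end..]))
    }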

@knizhnik
Contributor

There seems to be one more possible vulnerability in both the C and the Rust version of shmpipe. I am afraid it can be a showstopper; I will be happy to be wrong. In both cases mmap is used, so obviously the walredo process must have permission to call mmap.
The mapped segment is determined by a name which includes the tenant ID.

But what can prevent a cracked walredo process from mapping the segment of a pipe belonging to another tenant? It just needs to somehow guess its tenant ID... After that it would be able to monitor all walredo traffic of the other tenant, i.e. be aware of all updates it performs. It just needs to decode the WAL records and store the data in its own database, and voilà: the data is stolen.
This scenario looks really contrived and hard to implement, but actually there is nothing impossible here. Given the sources of the pageserver and the walredo process, it would not take a lot of time to implement such a data grabber. At least it seems easier than getting control of the walredo process.

@hlinnaka
Contributor

But what can prevent a cracked walredo process from mapping the segment of a pipe belonging to another tenant?

seccomp prevents that. After calling seccomp(), the walredo process cannot open any files.
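To illustrate the point (a sketch only; the actual walredo filter is more involved than this, and these are not the PR's functions): even seccomp's strict mode already makes open(2) impossible after the call. Assumes the libc crate.

    use std::io;

    /// Enter SECCOMP_MODE_STRICT: afterwards only read(2), write(2),
    /// _exit(2) and sigreturn(2) are permitted; any other syscall,
    /// including open(2) and mmap(2), kills the process.
    fn enter_strict_seccomp() -> io::Result<()> {
        let rc = unsafe { libc::prctl(libc::PR_SET_SECCOMP, libc::SECCOMP_MODE_STRICT as libc::c_ulong) };
        if rc != 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }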

@knizhnik
Contributor

But what can prevent a cracked walredo process from mapping the segment of a pipe belonging to another tenant?

seccomp prevents that. After calling seccomp(), the walredo process cannot open any files.

Nice to hear. I had not realized that we prohibit open.

@koivunej
Contributor Author

koivunej commented Dec 23, 2022

The Makefile is busted, however: the dependency on target/release/libshmempipe.a cannot be at the top level, because it doesn't necessarily cause anything to be rebuilt or relinked even if cargo build [--release] -p shmempipe gets executed. Though this problem may only have surfaced after I fixed postgres-v14 not to copy c.h on every header build.
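A common workaround (a sketch, not this PR's actual Makefile; the target and crate names are taken from the message above) is to give the archive a phony prerequisite so that make always delegates the up-to-date decision to cargo:

    # cargo tracks staleness itself, so always invoke it; make then only
    # relinks dependents when cargo actually rewrote the archive's mtime.
    .PHONY: cargo-shmempipe
    target/release/libshmempipe.a: cargo-shmempipe
    	cargo build --release -p shmempipe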

@koivunej
Contributor Author

In a hurry, have to push more ugly commits.

As an episodic update, the eventfd usage is now in. Most of the time during xmas was spent understanding the memory corruption caused by using a postponed publisher. I do not know why it happens, but it does, leading to a "stall" as too many bytes are read from the to_worker queue.

Another interesting one was the completely buggy impl of UnparkInorder, which is now heap based. It is faster in microbenchmarks and saves some time with many threads waiting. I still don't know how to make this into a 7us version; currently on my machine with taskset and the performance governor, boost disabled, it's a consistent 9-10us.

Also noting that a memcpy-only --bin producer_consumer --features demo takes at best 4us, which is much faster than the shmem_pipe test executable (>10us for me at -O2).

@shanyp
Contributor

shanyp commented Jan 9, 2023

tracked by #3184

@koivunej koivunej force-pushed the shmempipe_rs branch 2 times, most recently from 40df63c to bc9d6f6 on January 11, 2023 11:24
The largest change is not to try to lock anything from inside the initialization; the other change is refcounts and tombstoning.

This is probably still not good enough: what happens if we don't drop in the parent process but the child is then killed? Will anything of value remain after the robust unlock? Probably not.

Sadly the tempfile crate is not currently supported. To run the tests with miri, just do:

    cargo +nightly miri test -p shmempipe

Next steps:
- flesh out a proper two-thread test case

This is most likely a bogus thing to do ... added a longer FIXME about how it could be avoided. It probably never matters what is done over on the C side; it is wrong that in Rust we just cast it as a mut slice, but there are no really safe alternatives.

I had earlier misprofiled the waits as much smaller than they end up being. Busy-looping for 100us seems to work much better and is on par with shmem_pipe, but I will be calibrating a loop count for this instead of polling the Instant (see the sketch below).

As part of trying out whether this makes sense in a world where we start by having two different implementations: shmempipe for Linux, stdio for !linux.

So far no regressions, perhaps a slight one on oltp select-after-update.

Continuing with trying without allocating the whole message; instead feed it in small chunks at a time. This might leave less time to sleep and thus interleave the start of redoing with the request-sending operation. Earlier as in right after writing the first slice (first message), in case it was sleeping.
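One of the messages above mentions busy-looping for ~100us and calibrating a loop count instead of polling the Instant; a minimal sketch of that spin-then-block pattern (hypothetical names and budget, not the PR's code):

    use std::sync::atomic::{AtomicBool, Ordering};

    const SPIN_BUDGET: u32 = 10_000; // hypothetical; would be calibrated per machine

    /// Spin on the flag for a fixed iteration budget, which is cheaper than
    /// repeatedly reading Instant::now(), then fall back to a blocking wait.
    fn wait_until_ready(flag: &AtomicBool, blocking_wait: impl Fn()) {
        for _ in 0..SPIN_BUDGET {
            if flag.load(Ordering::Acquire) {
                return;
            }
            std::hint::spin_loop(); // lets the CPU relax the busy loop
        }
        blocking_wait(); // budget exhausted: hand the CPU back
    }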
@knizhnik
Contributor

It's just a few array bounds checks in the right places, so I'm not worried about the performance here. But I'm afraid it rules out using any existing crates, and we really need to design the thing carefully. We should keep the unsafe primitives as small as possible so that they can be reviewed carefully, and build higher level abstractions on top of that.

Why does it rule out other crates? The index manipulations are here: https://docs.rs/ringbuf/latest/src/ringbuf/ring_buffer/base.rs.html#110-144, and if we look closer this is all safe code. The head() and tail() calls do atomic reads, and everything is bounds-checked.

That particular code looks safe. I'm still scared though. Look at RbBase::occupied_len, for example:

    /// The number of items stored in the buffer at the moment.
    fn occupied_len(&self) -> usize {
        let modulus = self.modulus();
        (modulus.get() + self.tail() - self.head()) % modulus
    }

The untrusted process can set 'head' and 'tail' to anything, and with suitable values this integer arithmetic can overflow. Integer overflow panics on debug builds. A panic won't lead to arbitrary code execution, but it's still unpleasant.

I'm not sure if there are more serious cases than that. Maybe not. But the bottom line is that the crate wasn't really designed with such a hostile environment in mind, so it's on us to inspect it and determine which functions are safe to use and what kinds of inconsistencies are possible. For example, with suitable values of 'head' and 'tail', it's possible that is_empty() == false and occupied_len() == 0 at the same time. And whenever the crate is updated, we need to inspect the changes again.

Maybe I am wrong, but this function is not currently used by our code (including Consumer::len()/Producer::len()).

@knizhnik
Contributor

Sorry, I found a place where it is used: in a debug assertion when advancing the current position.

@hlinnaka
Contributor

Sorry, I found a place where it is used: in a debug assertion when advancing the current position.

Yeah, the crate contains many traits implementing different parts. For our purposes, I think we should remove many of the abstractions and all the code that we don't need. That makes it easier to see what's going on, and makes it easier to verify that the unsafe code is sound.

It's clearer that we might need to restart these requests with a slice of bytes instead of a clonable iterator or something.
This also relaxes the bound from &[Bytes] to &[impl AsRef<[u8]>] so that we can keep using Vec and slices. Perhaps a compromise would be bytes::Buf? Still, making the bytes dependency optional again.
@vadim2404
Contributor

Closing this PR in favor of async WAL redo

@vadim2404 vadim2404 closed this Jan 30, 2023
@bayandin bayandin deleted the shmempipe_rs branch May 19, 2023 13:06