
shmempipe in rust #3163

Closed

wants to merge 76 commits into from
Conversation

koivunej
Contributor

@koivunej koivunej commented Dec 20, 2022

Very much an early draft; still has crashes and a lot of refactoring to do. Port/rewrite of the earlier #2947.

Creating a PR to solicit feedback, especially on:

  • Makefile and C integration

And to manage TODOs:

  • make sure worker request read side waits on a condition
  • investigate the rare crash caused by something in ringbuf
  • align write/read mutex and conditions
  • short/short/1 around 10us, Shmem pipe #2947 is 7us, main is 12us
    • without seccomp
  • switch to tokio for the purpose of detecting death, keep the old poll for macOS
    • signal any spinners/sleepers about the death to restart
  • fix miri tests once more (broken with pthread cond)

@hlinnaka
Contributor

Thinking about the security aspects of this a little: the postgres process is untrusted, so we have to assume that it could write arbitrary malicious data to the shared memory area. Is the ringbuf crate robust against that? We need to vet it very carefully; I don't see anything in the crate docs that talks about that scenario. I've got a feeling that we cannot really trust it and will need to write our own implementation from scratch with that in mind. The amount of unsafe in this PR and in ringbuf is scary.

@hlinnaka
Contributor

Makefile and C integration

Those parts look ok to me

@knizhnik
Contributor

Thinking about the security aspects of this a little: the postgres process is untrusted, so we have to assume that it could write arbitrary malicious data to the shared memory area. Is the ringbuf crate robust against that? We need to vet it very carefully; I don't see anything in the crate docs that talks about that scenario. I've got a feeling that we cannot really trust it and will need to write our own implementation from scratch with that in mind. The amount of unsafe in this PR and in ringbuf is scary.

I have to say that my original C implementation of shmpipe also suffers from this problem. The intent was that the walredo process calls just two functions from libshmpipe.a to receive a request and send a response. But in any case we have a region of memory shared between the walredo and pageserver processes, and a cracked walredo process can place arbitrary data in this segment. That means the pageserver must treat the content of this region as absolutely untrusted: any manipulation of the pipe positions or sync primitives should be preceded by verification of these structs. That is not so hard to do for a ring buffer position, but how can I validate, for example, the content of a pthread_mutex and guarantee that calling pthread_mutex_lock will not cause SIGSEGV?

I am afraid the same problem exists in the Rust version of shmpipe, especially if it uses some external crate. Most likely the ringbuf crate is not prepared for somebody trying to corrupt the content of the buffer. The question is whether such verification can be done efficiently at all.

@hlinnaka
Contributor

That is not so hard to do for a ring buffer position, but how can I validate, for example, the content of a pthread_mutex and guarantee that calling pthread_mutex_lock will not cause SIGSEGV?

Ouch, I hadn't thought of that!

Perhaps we should use eventfd(2) for the notifications instead of pthreads. Not sure how its performance compares with pthread condition variables. And I guess eventfd is not available on macOS, but we could emulate it with a regular pipe there.

I am afraid the same problem exists in the Rust version of shmpipe, especially if it uses some external crate. Most likely the ringbuf crate is not prepared for somebody trying to corrupt the content of the buffer. The question is whether such verification can be done efficiently at all.

It's just a few array bounds checks in the right places, so I'm not worried about the performance here. But I'm afraid it rules out using any existing crates, and we really need to design the thing carefully. We should keep the unsafe primitives as small as possible so that they can be reviewed carefully, and build higher level abstractions on top of that.
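A minimal sketch of what eventfd-based wakeups could look like on the Rust side, assuming the libc crate; the function names are hypothetical, not code from this PR. The key property is that the kernel-side counter lives outside the shared mapping, so the untrusted peer cannot corrupt the primitive itself, only fail to signal it:

    use std::io;
    use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

    /// Create an eventfd (Linux only), counter starting at 0.
    fn new_eventfd() -> io::Result<OwnedFd> {
        let fd = unsafe { libc::eventfd(0, libc::EFD_CLOEXEC) };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(unsafe { OwnedFd::from_raw_fd(fd) })
    }

    /// Notify the other side: writing adds to the 64-bit counter,
    /// and the peer's read(2) consumes (and resets) it.
    fn notify(fd: &OwnedFd) -> io::Result<()> {
        let one: u64 = 1;
        let n = unsafe { libc::write(fd.as_raw_fd(), &one as *const u64 as *const libc::c_void, 8) };
        if n != 8 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }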

@koivunej
Contributor Author

koivunej commented Dec 21, 2022

Updated the description. This is actually slower than Konstantin's work: around 10us for short/short/1 vs. 7us for #2947.

pthread_mutex_lock will not cause SIGSEGV?

Ouch, I hadn't thought of that!

I guess lockups and thread starvation would be a better angle, but this is a good point :)

And I guess eventfd is not available on macos

Even trying to find out what is supported and available on macOS (not to mention what actually works) is tedious. I was thinking we could just keep the old stdio-based implementation there for compatibility.

It's just a few array bounds checks in the right places, so I'm not worried about the performance here. But I'm afraid it rules out using any existing crates, and we really need to design the thing carefully. We should keep the unsafe primitives as small as possible so that they can be reviewed carefully, and build higher level abstractions on top of that.

Why does it rule out other crates? The index manipulations are here: https://docs.rs/ringbuf/latest/src/ringbuf/ring_buffer/base.rs.html#110-144, and if we look closer this is all safe code. The head() and tail() calls do atomic reads, and everything is bounds-checked.

(From what I remember reading lately, removing bounds checks from Rust with unsafe has on many occasions made performance worse, so I wasn't even thinking that someone would do that. But this might be because so many people use iterators, which allow better codegen than indexed accesses all over.)

unsafe primitives

If this is related to how I use unsafe all over with Pin::new_unchecked to make sure the pthread primitives are not moved, I don't think it's a valid concern, at least for this usage of Pin.
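For context, a minimal sketch of why Pin shows up at all when placing a pthread mutex into the mapping; this assumes the libc crate, and SharedMutex is a hypothetical name, not the PR's actual type:

    use std::mem::MaybeUninit;
    use std::pin::Pin;

    /// A process-shared pthread mutex must be initialized in place and must
    /// never move afterwards: the kernel and the peer process address it by
    /// its location inside the shared mapping.
    #[repr(C)]
    struct SharedMutex(libc::pthread_mutex_t);

    impl SharedMutex {
        /// Initialize in place inside the shared mapping; the Pin records
        /// the caller's promise that this slot will never be moved.
        unsafe fn init(slot: Pin<&mut MaybeUninit<SharedMutex>>) {
            let mut attr = MaybeUninit::<libc::pthread_mutexattr_t>::uninit();
            libc::pthread_mutexattr_init(attr.as_mut_ptr());
            // PROCESS_SHARED is what makes it usable across the two processes
            libc::pthread_mutexattr_setpshared(attr.as_mut_ptr(), libc::PTHREAD_PROCESS_SHARED);
            let mutex = slot.get_unchecked_mut().as_mut_ptr() as *mut libc::pthread_mutex_t;
            libc::pthread_mutex_init(mutex, attr.as_ptr());
            libc::pthread_mutexattr_destroy(attr.as_mut_ptr());
        }
    }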

@knizhnik
Contributor

I guess lockups and thread starvation would be a better angle, but this is a good point :)

Lockups and starvation may affect only one tenant, while a SIGSEGV will crash the whole pageserver and so affect all tenants.

Frankly speaking, I do not see an acceptable safe solution at this moment.
There is no reliable way to verify a system sync primitive with shared state (i.e. one whose content is located in shared memory and can be directly changed by the walredo process).
Using other system calls for synchronization (stdio pipes, as postgres uses for latches; eventfd; POSIX semaphores; ...) seems to be the more reliable solution.

Verification of the other shmpipe state variables (positions in the buffers) should also be done with special care: since the content of the memory can be changed at any time (whether or not we are in a critical section), it is necessary either to use atomic CAS or to copy the state to local variables and work with those local copies, as sketched below.
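A minimal sketch of that discipline (hypothetical names, not the PR's actual code): load the shared index exactly once, then use only the validated local copy.

    use std::sync::atomic::{AtomicU32, Ordering};

    const RING_CAPACITY: u32 = 64 * 1024; // hypothetical ring size

    /// Snapshot an index the peer may rewrite at any moment. Every later
    /// computation must use the returned local value and never re-read the
    /// shared word; an out-of-range value fails the pipe instead of being
    /// used for indexing.
    fn checked_read_pos(shared_head: &AtomicU32) -> Option<u32> {
        let head = shared_head.load(Ordering::Acquire); // single atomic read
        if head < RING_CAPACITY {
            Some(head)
        } else {
            None // corrupted or malicious: tear down the pipe
        }
    }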

@koivunej
Contributor Author

I guess lockups and thread starvation would be a better angle, but this is a good point :)

Lockups and starvation may affect only one tenant.

We do have a global pool of threads, however, so starving any tokio worker would decrease their number, and after 8 of them page request handling would lock up.

@hlinnaka
Contributor

It's just a few array bounds checks in the right places, so I'm not worried about the performance here. But I'm afraid it rules out using any existing crates, and we really need to design the thing carefully. We should keep the unsafe primitives as small as possible so that they can be reviewed carefully, and build higher level abstractions on top of that.

Why does it rule out other crates? The index manipulations are here: https://docs.rs/ringbuf/latest/src/ringbuf/ring_buffer/base.rs.html#110-144, and if we look closer this is all safe code. The head() and tail() calls do atomic reads, and everything is bounds-checked.

That particular code looks safe. I'm still scared though. Look at RbBase::occupied_len, for example:

    /// The number of items stored in the buffer at the moment.
    fn occupied_len(&self) -> usize {
        let modulus = self.modulus();
        (modulus.get() + self.tail() - self.head()) % modulus
    }

The untrusted process can set 'head' and 'tail' to anything, and with suitable values this integer arithmetic can overflow. Integer overflow panics on debug builds. A panic won't lead to arbitrary code execution, but it's still unpleasant.

I'm not sure if there are more serious cases than that. Maybe not. But the bottom line is that the crate wasn't really designed with such a hostile environment in mind, so it's on us to inspect it and determine which functions are safe to use and what kinds of inconsistencies are possible. For example, with suitable values of 'head' and 'tail', it's possible that is_empty() == false and occupied_len() == 0 at the same time. And whenever the crate is updated, we need to inspect the changes again.
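For illustration, a hedged sketch of how that computation could be hardened; this is not ringbuf's code, just one way to avoid the panic and reject impossible values:

    /// Overflow-proof variant of the arithmetic quoted above: wrapping ops
    /// cannot panic on debug builds, and the final range check rejects
    /// lengths that no well-behaved peer could have produced (in ringbuf the
    /// modulus is twice the capacity, so a valid length is at most half of it).
    fn occupied_len_checked(head: usize, tail: usize, modulus: usize) -> Option<usize> {
        let len = tail.wrapping_sub(head).wrapping_add(modulus) % modulus;
        if len <= modulus / 2 {
            Some(len)
        } else {
            None // hostile head/tail: treat the pipe as broken
        }
    }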

@koivunej
Contributor Author

Microbenchmarking results.

I don't know right now what causes that 3us gap. With short/{8,16} the difference in this PR's favor is probably due to how I wake up the threads in order; none of them spin.

shmem_pipe --disable-seccomp:

short/short/1           time:   [7.7323 µs 7.7366 µs 7.7409 µs]
short/short/2           time:   [25.548 µs 25.715 µs 25.840 µs]
short/short/4           time:   [43.238 µs 43.448 µs 43.663 µs]
short/short/8           time:   [79.991 µs 80.242 µs 80.504 µs]
short/short/16          time:   [206.89 µs 211.08 µs 215.92 µs]

medium/medium/1         time:   [130.35 µs 130.57 µs 130.78 µs]
medium/medium/2         time:   [280.39 µs 280.76 µs 281.11 µs]
medium/medium/4         time:   [533.36 µs 534.83 µs 536.29 µs]
medium/medium/8         time:   [1.0486 ms 1.0519 ms 1.0556 ms]
medium/medium/16        time:   [2.2088 ms 2.2468 ms 2.2849 ms]

shmempipe_rs --disable-seccomp and two eventfds:

short/short/1           time:   [10.811 µs 10.836 µs 10.873 µs]
short/short/2           time:   [29.148 µs 29.410 µs 29.631 µs]
short/short/4           time:   [49.000 µs 50.056 µs 51.787 µs]
short/short/8           time:   [81.239 µs 81.955 µs 82.660 µs]
short/short/16          time:   [173.79 µs 174.95 µs 176.15 µs]

medium/medium/1         time:   [144.36 µs 144.60 µs 144.85 µs]
medium/medium/2         time:   [301.73 µs 302.48 µs 303.45 µs]
medium/medium/4         time:   [582.53 µs 583.93 µs 585.40 µs]
medium/medium/8         time:   [1.1050 ms 1.1080 ms 1.1111 ms]
medium/medium/16        time:   [2.2121 ms 2.2170 ms 2.2218 ms]

@koivunej
Contributor Author

koivunej commented Dec 21, 2022

I think the delta between the solutions has to be within walredoproc.c, in the copying and in possibly enlarging to the wanted size vs. only the additionally needed bytes. Unsure how often that happens, since the microbenchmarks only have examples where will_init == true:

enlargeStringInfo(inBuf, len);

Thinking about this more, it probably doesn't happen, because the StringInfo is reset at the beginning of each round, assuming the reset works and doesn't enlarge needlessly.

However, my hacky attempt at optimistically slicing did not yield any measurable difference: https://gist.github.com/koivunej/4b31df6f8e222019c96ff56a1839398c

Next I'm trying full-message reading with in-band length prefixing.
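A minimal sketch of such in-band framing, with hypothetical helpers (not the PR's actual wire format): a little-endian u32 length prefix, parsed with bounds checks so a hostile length can never index past what has actually arrived.

    /// Prepend a little-endian u32 length to the payload.
    fn frame(payload: &[u8]) -> Vec<u8> {
        let mut out = Vec::with_capacity(4 + payload.len());
        out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        out.extend_from_slice(payload);
        out
    }

    /// Split one framed message off the front of `buf`, returning
    /// (message, rest). Returns None until a complete frame is available;
    /// every access is bounds-checked against the bytes actually received.
    fn parse(buf: &[u8]) -> Option<(&[u8], &[u8])> {
        let len = u32::from_le_bytes(buf.get(..4)?.try_into().ok()?) as usize;
        let end = 4usize.checked_add(len)?;
        let msg = buf.get(4..end)?;
        Some((msg, &buf[end..]))
    }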

@knizhnik
Contributor

There seems to be one more possible vulnerability in both the C and the Rust version of shmpipe. I am afraid it can be a showstopper; I will be happy to be wrong. In both cases mmap is used, so obviously the walredo process must have permission to call mmap.
The mapped segment is determined by a name which includes the tenant ID.

But what can prevent a cracked walredo process from mapping the segment of a pipe belonging to another tenant? It just needs to somehow guess its tenant ID... After that it would be able to monitor all walredo traffic of the other tenant, i.e. be aware of all updates it performs. It just needs to decode the WAL records and store the data in its own database, and voilà: the data is stolen.
This scenario looks really contrived and hard to implement, but actually there is nothing impossible here. Given the sources of the pageserver and the walredo process, it would not take a lot of time to implement such a data grabber. At least it seems easier than getting control of the walredo process.

@hlinnaka
Contributor

But what can prevent a cracked walredo process from mapping the segment of a pipe belonging to another tenant?

seccomp prevents that. After calling seccomp(), the walredo process cannot open any files.
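To illustrate the point (a sketch only; the actual walredo filter is more involved than this, and these are not the PR's functions): even seccomp's strict mode already makes open(2) impossible after the call. Assumes the libc crate.

    use std::io;

    /// Enter SECCOMP_MODE_STRICT: afterwards only read(2), write(2),
    /// _exit(2) and sigreturn(2) are permitted; any other syscall,
    /// including open(2) and mmap(2), kills the process.
    fn enter_strict_seccomp() -> io::Result<()> {
        let rc = unsafe { libc::prctl(libc::PR_SET_SECCOMP, libc::SECCOMP_MODE_STRICT as libc::c_ulong) };
        if rc != 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }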

@knizhnik
Contributor

But what can prevent a cracked walredo process from mapping the segment of a pipe belonging to another tenant?

seccomp prevents that. After calling seccomp(), the walredo process cannot open any files.

Nice to hear. I had not realized that we prohibit open.

@koivunej
Contributor Author

koivunej commented Dec 23, 2022

The Makefile is busted, however: the dependency on target/release/libshmempipe.a cannot be at the top level, because it doesn't necessarily cause anything to be rebuilt or relinked even if cargo build [--release] -p shmempipe gets executed. Though this problem may only have surfaced after I fixed postgres-v14 not to copy c.h on every header build.
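A common workaround (a sketch, not this PR's actual Makefile; the target and crate names are taken from the message above) is to give the archive a phony prerequisite so that make always delegates the up-to-date decision to cargo:

    # cargo tracks staleness itself, so always invoke it; make then only
    # relinks dependents when cargo actually rewrote the archive's mtime.
    .PHONY: cargo-shmempipe
    target/release/libshmempipe.a: cargo-shmempipe
    	cargo build --release -p shmempipe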

@koivunej
Contributor Author

In a hurry, have to push more ugly commits.

As an episodic update, the eventfd usage is now in. Most of the time during xmas was spent understanding the memory corruption caused by using a postponed publisher. I do not know why it happens, but it does, leading to a "stall" as too many bytes are read from the to_worker queue.

Another interesting one was the completely buggy impl of UnparkInorder, which is now heap based. It is faster in microbenchmarks and saves some time with many threads waiting. I still don't know how to make this into a 7us version; currently on my machine with taskset and the performance governor, boost disabled, it's a consistent 9-10us.

Also noting that a memcpy-only --bin producer_consumer --features demo takes at best 4us, which is much faster than the shmem_pipe test executable (>10us for me at -O2).

@shanyp
Contributor

shanyp commented Jan 9, 2023

tracked by #3184

@koivunej koivunej force-pushed the shmempipe_rs branch 2 times, most recently from 40df63c to bc9d6f6 on January 11, 2023 11:24
The largest change is not to try to lock anything from inside the initialization; the other change is refcounts and tombstoning.

This is probably still not good enough: what happens if we don't drop in the parent process but the child is then killed? Will anything of value remain after the robust unlock? Probably not.

Sadly the tempfile crate is not currently supported. To run the tests with miri, just do:

    cargo +nightly miri test -p shmempipe

Next steps:
- flesh out a proper two-thread test case

This is most likely a bogus thing to do ... added a longer FIXME about how it could be avoided. It probably never matters what is done over on the C side; it is wrong that in Rust we just cast it as a mut slice, but there are no really safe alternatives.

I had earlier misprofiled the waits as much smaller than they end up being. Busy-looping for 100us seems to work much better and is on par with shmem_pipe, but I will be calibrating a loop count for this instead of polling the Instant (see the sketch below).

As part of trying out whether this makes sense in a world where we start by having two different implementations: shmempipe for Linux, stdio for !linux.

So far no regressions, perhaps a slight one on oltp select-after-update.

Continuing with trying without allocating the whole message; instead feed it in small chunks at a time. This might leave less time to sleep and thus interleave the start of redoing with the request-sending operation. Earlier as in right after writing the first slice (first message), in case it was sleeping.
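One of the messages above mentions busy-looping for ~100us and calibrating a loop count instead of polling the Instant; a minimal sketch of that spin-then-block pattern (hypothetical names and budget, not the PR's code):

    use std::sync::atomic::{AtomicBool, Ordering};

    const SPIN_BUDGET: u32 = 10_000; // hypothetical; would be calibrated per machine

    /// Spin on the flag for a fixed iteration budget, which is cheaper than
    /// repeatedly reading Instant::now(), then fall back to a blocking wait.
    fn wait_until_ready(flag: &AtomicBool, blocking_wait: impl Fn()) {
        for _ in 0..SPIN_BUDGET {
            if flag.load(Ordering::Acquire) {
                return;
            }
            std::hint::spin_loop(); // lets the CPU relax the busy loop
        }
        blocking_wait(); // budget exhausted: hand the CPU back
    }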
@knizhnik
Contributor

It's just a few array bounds checks in the right places, so I'm not worried about the performance here. But I'm afraid it rules out using any existing crates, and we really need to design the thing carefully. We should keep the unsafe primitives as small as possible so that they can be reviewed carefully, and build higher level abstractions on top of that.

Why does it rule out other crates? The index manipulations are here: https://docs.rs/ringbuf/latest/src/ringbuf/ring_buffer/base.rs.html#110-144, and if we look closer this is all safe code. The head() and tail() calls do atomic reads, and everything is bounds-checked.

That particular code looks safe. I'm still scared though. Look at RbBase::occupied_len, for example:

    /// The number of items stored in the buffer at the moment.
    fn occupied_len(&self) -> usize {
        let modulus = self.modulus();
        (modulus.get() + self.tail() - self.head()) % modulus
    }

The untrusted process can set 'head' and 'tail' to anything, and with suitable values this integer arithmetic can overflow. Integer overflow panics on debug builds. A panic won't lead to arbitrary code execution, but it's still unpleasant.

I'm not sure if there are more serious cases than that. Maybe not. But the bottom line is that the crate wasn't really designed with such a hostile environment in mind, so it's on us to inspect it and determine which functions are safe to use and what kinds of inconsistencies are possible. For example, with suitable values of 'head' and 'tail', it's possible that is_empty() == false and occupied_len() == 0 at the same time. And whenever the crate is updated, we need to inspect the changes again.

Maybe I am wrong, but this function is not currently used by our code (including Consumer::len()/Producer::len()).

@knizhnik
Contributor

Sorry, I found a place where it is used: in a debug assertion when advancing the current position.

@hlinnaka
Contributor

Sorry, I found a place where it is used: in a debug assertion when advancing the current position.

Yeah, the crate contains many traits implementing different parts. For our purposes, I think we should remove many of the abstractions and all the code that we don't need. That makes it easier to see what's going on, and makes it easier to verify that the unsafe code is sound.

It's clearer that we might need to restart these requests with a slice of bytes instead of a clonable iterator or something.
This also relaxes the bound from &[Bytes] to &[impl AsRef<[u8]>] so that we can keep using Vec and slices. Perhaps a compromise would be bytes::Buf? Still, making the bytes dependency optional again.
@vadim2404
Contributor

Closing this PR in favor of async WAL redo

@vadim2404 vadim2404 closed this Jan 30, 2023
@bayandin bayandin deleted the shmempipe_rs branch May 19, 2023 13:06