Weak memory emulation using store buffers #1963

cbeuw · 2022-01-17T23:44:28Z

This implements the second half of the Lidbury & Donaldson paper: weak memory emulation using store buffers. A store buffer is created over a memory range on atomic access. Stores will push store elements into the buffer and loads will search through the buffer in reverse modification order, determine which store elements are valid for the current load, and pick one randomly.

This implementation will never generate weak memory behaviours forbidden by the C++11 model, but it is incapable of producing all possible weak behaviours allowed by the model. There are certain weak behaviours observable on real hardware but not while using this.

Note that this implementation does not take into account of C++20's memory model revision to SC accesses and fences introduced by P0668. This implementation is not fully correct under the revised C++20 model and may generate behaviours C++20 disallows.

Rust follows the C++20 memory model (except for the Consume ordering and some operations not performable through C++'s std::atomic API). It is therefore possible for this implementation to generate behaviours never observable when the same program is compiled and run natively. Unfortunately, no literature exists at the time of writing which proposes an implementable and C++20-compatible relaxed memory model that supports all atomic operation existing in Rust. The closest one is A Promising Semantics for Relaxed-Memory Concurrency by Jeehoon Kang et al. However, this model lacks SC accesses and is therefore unusable by Miri (SC accesses are everywhere in library code).

Safe/sound Rust allows for more operations on atomic locations than the C++20 atomic API was intended to allow, such as non-atomically accessing a previously atomically accessed location, or accessing previously atomically accessed locations with a differently sized operation (such as accessing the top 16 bits of an AtomicU32). These scenarios are generally left undefined in formalisations of C++ memory model, even though they became possible in C++20 with std::atomic_ref<T>. In Rust, these operations can only be done through a &mut AtomicFoo reference or one derived from it, therefore these operations can only happen after all previous accesses on the same locations. This implementation is adapted to accommodate these.

TODOs:

Add tests cases that actually demonstrate weak memory behaviour (even if they are scheduler dependent)
Change {mutex, rwlock, cond, srwlock}_get_or_create_id functions under src/shims to use atomic RMWs instead of separate read -> check if need to create a new one -> write steps
Make sure Crossbeam tests still pass (Add spin loop hints in tests for Miri crossbeam-rs/crossbeam#831)
Move as much weak-memory related code as possible into weak_memory.rs
Remove "weak memory effects are not emulated" warnings
Accommodate certain mixed size and mixed atomicity accesses Rust allows on top of the C++ model

RalfJung · 2022-01-18T00:36:35Z

Oh wow, this is amazing! Thanks so much for working on this. :)

Unfortunately I will have to put off a detailed review until mid-February or maybe even March, since interview season is approaching rapidly for me. So for now I can just give some high-level comments:

Adding store buffers means that loads have a choice wrt which value they will return. How do you resolve that choice? It might even be reasonable to have flags that control this, I guess.
Unfortunately I don't know which effect the SCfix has on operational models, but the Promising Semantics v2 is newer than SCfix so maybe they incorporated that? We could also email the SCfix authors -- and at least Jeehoon has a GitHub account.

In any case it's impossible to explore all weak memory behaviours allowed by the standard for various reasons, such as Miri's interpreter not being preemptive.

I assume that would also require some form of load buffering / speculation / "promises" for relaxed behaviors?

cbeuw · 2022-01-18T01:04:17Z

Of course! My term has started now as well so I will have less time to work on this than I did over Christmas

Adding store buffers means that loads have a choice wrt which value they will return. How do you resolve that choice? It might even be reasonable to have flags that control this, I guess.

Right now it's a uniform random choice using the existing rng in MemoryExtra. The race detection paper's author Christopher implemented it as an exponential distribution with the earliest one in mo having the greatest probability.
https://github.com/ChrisLidbury/tsan11/blob/ecbd6b81e9b9454e01cba78eb9d88684168132c7/lib/tsan/rtl/tsan_relaxed.cc#L290-L297

I agree we could have a flag for this with possible strategies like last mo, uniform, earliest valid etc.

but the Promising Semantics v2 is newer than SCfix so maybe they incorporated that?

I saw Promising Semantics was used to handle SC fences over in loom tokio-rs/loom#220, I didn't realise there's a v2. Will have a look! Both the author (Christopher) and the supervisor (Alastair, who actually taught my concurrency course last term) of the race detector paper are on GitHub as well. If we could incorporate something that works with scfix (plus the already-implemented change on simplified release sequences) into the model I'm sure they'll be interested.

I assume that would also require some form of load buffering / speculation / "promises" for relaxed behaviors?

Yes. Although by non-preemptive here I meant the scheduler not context switching until an explicit yield (which rules out a lot of weak behaviours that requires certain interleavings), but yes, for speculative loads it will require speculative execution in the interpreter

PeterWrighten

I'm sorry because it seems that there are some mistakes occurred, which let me approve this request without review. I consider I am not capable now to review this snippet, so maybe it is a greater idea to seek others to review it. Thanks:)

bors · 2022-04-01T23:00:42Z

☔ The latest upstream changes (presumably #2045) made this pull request unmergeable. Please resolve the merge conflicts.

RalfJung

Sorry again for the long delay! It is hard to find time to review big complicated changes like this. This is some impressive work. :-)

I went over the code and left some comments. However, I honestly don't understand at least half of it, and I won't have time any time soon to dig into the paper and convince myself that the algorithm makes sense, or to meticulously check that you implemented the algorithm faithfully. This concerns me, because it means in case of any problems I have no way of debugging this. In that sense I quite appreciate that you put most of this into a new file, because it means if it all goes horribly wrong one day I can at least rip it out fairly easily. In fact it might be a good idea to further reduce the data_race.rs changes and move basically everything that works on AllocExtra::weak_memory into the new file (including the let ... if that even test whether that data is present).

So, under these caveats, I think I am okay with landing this. Ideally, if/when a problem comes up, I'll just ping you and you can help. :)

What is the situation with that C++20 revised definition? Is it fair to say that this simply implements C++11 semantics instead? That seems fine for me, it just needs documenting (including in the README). It's probably better to stick to a published algorithm rather than trying to adjust it... I certainly couldn't review those adjustments.
However you say

(CppMem still says it's consistent, but that's because it uses C++11 model)

so does what you implemented disagree with CppMem, or does "The test case RWC+syncs given in Lahav et al.'s paper fails" actually mean "it says it is consistent"? I am confused.
Is there a testcase you can add that should pass/fail but does the opposite, due to this?

It also would be good to have some test case that shows that store buffering is actually happening. Obviously this might have to run like 10 times and assert that it happens at least once, but still, at least a smoke test should be there -- right now, the test case you added would have already passed before your PR, so that's kind of a shame.

And finally, can you make sure that the crossbeam tests still pass with your patch? They have Miri on CI, and it'd be a shame if we broke their CI by landing this.

src/data_race.rs

src/machine.rs

src/weak_memory.rs

tests/run-pass/concurrency/weak_memory.rs

src/weak_memory.rs

tests/run-pass/concurrency/weak_memory.rs

cbeuw · 2022-04-14T22:49:45Z

Thanks for coming back to this!

In fact it might be a good idea to further reduce the data_race.rs changes and move basically everything that works on AllocExtra::weak_memory into the new file (including the let ... if that even test whether that data is present).

I'll look into this, though I suspect lifting code that accesses InterpCx, GlobalState, AllocExtra etc. simultaneously into separate functions may enrage the borrow checker.

What is the situation with that C++20 revised definition? Is it fair to say that this simply implements C++11 semantics instead?

All possible behaviours under the current implementation are allowed under C++11 semantics, this is what I would call correct or sound (albeit incomplete).

Very briefly, C++20 weakened SC loads and stores (accesses), and strengthened SC fences. This gives rises to 3 scenarios

Pure SC accesses: since it's weakened in C++20, the permitted behaviours under C++11 is a strict subset of that of C++20. So the current implementation is still correct
Pure SC fences: this is what Promising Semantics addressed and implemented in Implement Release, AcqRel, and SeqCst fences tokio-rs/loom#220. It would be very easy to do here as well.
Interaction between SC accesses and SC fences: this is the part no one has figured out yet.

Note that loom doesn't support SC accesses (they are weakened to Acquire/Release). I don't think it's acceptable for us to do the same here because Miri as a dynamic checker shouldn't produce false positives (behaviours not allowed under the memory model), whereas Loom is a static checker and can favour coverage at the expense of overly-weak behaviours.

As such, we have to stick to what's given in the Lidbury & Donaldson paper and therefore C++11 model. I'm glad you are on board with this.

so does what you implemented disagree with CppMem, or does "The test case RWC+syncs given in Lahav et al.'s paper fails" actually mean "it says it is consistent"?

My implementation agrees with CppMem. The trouble is Rust's documentation specifies that Rust follows C++20 model, not C++11. Ideally we want to do C++20 (i.e. RWC+syncs passes), but we can only do C++11 so this will have to be documented as a caveat of Miri.

Is there a testcase you can add that should pass/fail but does the opposite, due to this?

test_rwc_syncs from the SCFix paper should pass if compiled by rustc using C++20 model, but fails with our implementation. When I added this I was thinking that I can make the implementation consistent with C++20, but now I'll treat it as known issues.

It also would be good to have some test case that shows that store buffering is actually happening.

The issue is that, whether a buffered store gets read at all is dependent on Miri's scheduler and the RNG that picks one store from potentially many candidates. The randomness can be dealt with by running the test many times, but the scheduler is completely deterministic and multiple runs make no difference. Of course, I can find an assertion that succeeds under the current version, but it's not robust against potentially subtle perturbations to the scheduler behaviour. Nonetheless I'm happy to add such test with extensive comments on why it may fail.

can you make sure that the crossbeam tests still pass with your patch

Will do

Quite a lot of UI tests are broken after rebasing. Looks like they're due to the same issue. Will investigate later.

src/weak_memory.rs

RalfJung · 2022-04-17T01:16:38Z

I'll look into this, though I suspect lifting code that accesses InterpCx, GlobalState, AllocExtra etc. simultaneously into separate functions may enrage the borrow checker.

Thanks! I was mostly imagining changes like moving this entire block into the other file, which I don't think should cause borrow checker trouble?

All possible behaviours under the current implementation are allowed under C++11 semantics, this is what I would call correct or sound (albeit incomplete).

So in contrast, the implementation is not 'correct' under C++20, in the sense of generating behaviors that C++20 disallows?

What about 'completeness' (generating all possible behaviors, given enough different random seeds)?

My implementation agrees with CppMem. The trouble is Rust's documentation specifies that Rust follows C++20 model, not C++11. Ideally we want to do C++20 (i.e. RWC+syncs passes), but we can only do C++11 so this will have to be documented as a caveat of Miri.

Seems fair, given that that's what other tools also do and that the entire weak memory support can be disabled.

test_rwc_syncs from the SCFix paper should pass if compiled by rustc using C++20 model, but fails with our implementation. When I added this I was thinking that I can make the implementation consistent with C++20, but now I'll treat it as known issues.

Okay. Please add this as a compile-fail testcase with comments explaining the situation.

The issue is that, whether a buffered store gets read at all is dependent on Miri's scheduler and the RNG that picks one store from potentially many candidates. The randomness can be dealt with by running the test many times, but the scheduler is completely deterministic and multiple runs make no difference. Of course, I can find an assertion that succeeds under the current version, but it's not robust against potentially subtle perturbations to the scheduler behaviour. Nonetheless I'm happy to add such test with extensive comments on why it may fail.

The scheduler can be 'nudged' by calling thread::yield_now. Is that sufficient?
Anyway we already have some other scheduler-dependent tests, so with sufficient comments that is fine. Thanks!

cbeuw · 2022-04-17T11:40:27Z

So in contrast, the implementation is not 'correct' under C++20, in the sense of generating behaviors that C++20 disallows?

Indeed it will generate behaviors that C++20 disallows.

What about 'completeness' (generating all possible behaviors, given enough different random seeds)?

It's not complete. There are categories of allowed weak behaviour this implementation will never produce, such as load buffering.

RalfJung · 2022-04-17T12:30:31Z

Okay. That sounds like a good thing to include into the summary at the top of weak_memory.rs. :)

bors · 2022-05-10T07:55:29Z

☔ The latest upstream changes (presumably #2084) made this pull request unmergeable. Please resolve the merge conflicts.

RalfJung

I left some more comments; my understanding is a bunch of my previous ones are also still open.

This PR is getting huge! Do you think there is any way to stage this over multiple PRs, for ease of reviewing?

src/allocation_map.rs

README.md

src/machine.rs

src/shims/posix/sync.rs

src/sync.rs

src/weak_memory.rs

cbeuw · 2022-05-11T17:44:51Z

my understanding is a bunch of my previous ones are also still open.

Sorry I was thinking of adding my would-be responses as comments, though I tend to procrastinate them until after code changes...

Do you think there is any way to stage this over multiple PRs

Everything in 0b9c8b0 is standalone. I'll cherry-pick it into another PR.

All the changes to scheduler-dependent race detector tests are simply adding -Zmiri-disable-weak-memory-emulation. I could leave them (along with amending "thread support is experimental and incomplete: weak memory effects are not emulated." everywhere) until all other reviews are done if you find them noisy

RalfJung · 2022-05-11T17:54:17Z

All the changes to scheduler-dependent race detector tests are simply adding -Zmiri-disable-weak-memory-emulation. I could leave them (along with amending "thread support is experimental and incomplete: weak memory effects are not emulated." everywhere) until all other reviews are done if you find them noisy

I don't mind them. The hard to review part is in src, not tests. :) But I did wonder why that flag is required there?

README.md

cbeuw · 2022-05-11T19:09:37Z

But I did wonder why that flag is required there?

Take one for example

pub fn test_simple_release() {
    let mut a = 0u32;
    let b = &mut a as *mut u32;
    let c = EvilSend(b);

    unsafe {
        let j1 = spawn(move || {
            *c.0 = 1;
            SYNC.store(1, Ordering::Release);
        });

        let j2 = spawn(move || {
            if SYNC.load(Ordering::Acquire) == 1 {
                *c.0
            } else {
                0
            }
        });

        j1.join().unwrap();
        assert_eq!(j2.join().unwrap(),1);
    }
}

This test won't work on real hardware: nothing is guaranteeing SYNC.load(Ordering::Acquire) reads 1 on first go in the second thread. It works in Miri because the scheduler makes sure SYNC.store(1, Ordering::Release) is ran before thread 2 even started, PLUS the fact that everything reads the latest value without weak memory.

Although my PR doesn't change the scheduler, meaning SYNC.store(1, Ordering::Release) is still executed before SYNC.load(Ordering::Acquire), due to weak memory effects SYNC.load(Ordering::Acquire) can read many 0s before it observes the 1. This is why in my tests a synchronising load(Ordering::Acquire) is always in a loop until it reads the desired value.

RalfJung · 2022-05-11T19:35:19Z

I see, makes sense.

That makes me wonder if we should let users control the probability of reading an old value, similar to the knob we have to control the probability of cmpxchg_weak spuriously failing.

src/data_race.rs

Use atomic RMW for `{mutex, rwlock, cond, srwlock}_get_or_create_id` functions This is required for #1963 `{mutex, rwlock, cond, srwlock}_get_or_create_id()` currently checks whether an ID field is 0 using an atomic read, allocate one and get a new ID if it is, then write it in a separate atomic write. This is fine without weak memory. For instance, in `pthread_mutex_lock` which may be called by two threads concurrently, only one thread can read 0, create and then write a new ID, the later-run thread will always see the newly created ID and never 0. ```rust fn pthread_mutex_lock(&mut self, mutex_op: &OpTy<'tcx, Tag>) -> InterpResult<'tcx, i32> { let this = self.eval_context_mut(); let kind = mutex_get_kind(this, mutex_op)?.check_init()?; let id = mutex_get_or_create_id(this, mutex_op)?; let active_thread = this.get_active_thread(); ``` However, with weak memory behaviour, both threads may read 0: the first thread has to see 0 because nothing else was written to it, and the second thread is not guaranteed to observe the latest value, causing a duplicate mutex to be created and both threads "successfully" acquiring the lock at the same time. This is a pretty typical pattern requiring the use of atomic RMWs. RMW *always* reads the latest value in a location, so only one thread can create the new mutex and ID, all others scheduled later will see the new ID.

bors · 2022-05-13T19:43:08Z

☔ The latest upstream changes (presumably #2114) made this pull request unmergeable. Please resolve the merge conflicts.

…ction

Co-authored-by: Ralf Jung <post@ralfj.de>

…f UB Co-authored-by: Ralf Jung <post@ralfj.de>

cbeuw · 2022-06-06T18:34:52Z

Squashed a bit and rebased

RalfJung · 2022-06-06T19:30:32Z

Awesome! Thank you so much for working on this and sticking with me through the many rounds of review. I feel like Miri will be an excellent tool for finding bugs in concurrent software. :)

@bors r+

bors · 2022-06-06T19:30:34Z

📌 Commit 1b32d14 has been approved by RalfJung

bors · 2022-06-06T19:30:40Z

⌛ Testing commit 1b32d14 with merge e6d3d98...

cbeuw · 2022-06-06T19:47:59Z

Thank you for reviewing this massive PR! I learnt quite a lot implementing this, and I'm glad we got to scrutinise Rust's atomics specification as a bonus

bors · 2022-06-06T19:57:15Z

☀️ Test successful - checks-actions
Approved by: RalfJung
Pushing e6d3d98 to master...

clarify how Rust atomics correspond to C++ atomics `@cbeuw` noted in rust-lang/miri#1963 that the correspondence between C++ atomics and Rust atomics is not quite as obvious as one might think, since in Rust I can use `get_mut` to treat previously non-atomic data as atomic. However, I think using C++20 `atomic_ref`, we can establish a suitable relation between the two -- or do you see problems with that `@cbeuw?` (I recall you said there was some issue, but it was deep inside that PR and Github makes it impossible to find...) Cc `@thomcc;` not sure whom else to ping for atomic memory model things.

clarify how Rust atomics correspond to C++ atomics ``@cbeuw`` noted in rust-lang/miri#1963 that the correspondence between C++ atomics and Rust atomics is not quite as obvious as one might think, since in Rust I can use `get_mut` to treat previously non-atomic data as atomic. However, I think using C++20 `atomic_ref`, we can establish a suitable relation between the two -- or do you see problems with that ``@cbeuw?`` (I recall you said there was some issue, but it was deep inside that PR and Github makes it impossible to find...) Cc ``@thomcc;`` not sure whom else to ping for atomic memory model things.

PeterWrighten approved these changes Jan 28, 2022

View reviewed changes

PeterWrighten reviewed Jan 28, 2022

View reviewed changes

cbeuw force-pushed the weak-memory branch from ec5c4ab to dbde843 Compare January 30, 2022 22:05

RalfJung reviewed Apr 14, 2022

View reviewed changes

cbeuw force-pushed the weak-memory branch from 1b73d67 to c171242 Compare April 14, 2022 20:48

RalfJung reviewed Apr 17, 2022

View reviewed changes

src/weak_memory.rs Outdated Show resolved Hide resolved

cbeuw force-pushed the weak-memory branch 2 times, most recently from c0be33b to 20db949 Compare May 7, 2022 00:46

cbeuw force-pushed the weak-memory branch from edaf7d1 to c7ccb2a Compare May 10, 2022 22:27

RalfJung reviewed May 11, 2022

View reviewed changes

README.md Outdated Show resolved Hide resolved

cbeuw force-pushed the weak-memory branch from f0858c9 to 40bd1d0 Compare May 11, 2022 18:56

cbeuw mentioned this pull request May 11, 2022

Use atomic RMW for {mutex, rwlock, cond, srwlock}_get_or_create_id functions #2114

Merged

cbeuw force-pushed the weak-memory branch from 40bd1d0 to 22a1674 Compare May 11, 2022 21:34

RalfJung reviewed May 12, 2022

View reviewed changes

src/data_race.rs Outdated Show resolved Hide resolved

cbeuw force-pushed the weak-memory branch from 99a6ca4 to 20f1a88 Compare May 12, 2022 22:58

cbeuw and others added 16 commits June 6, 2022 19:15

Differentiate between not multithreading and temp disabling race dete…

2321b15

…ction

Destroy store buffers on non-racy non-atomic accesses

226ed41

Allow non-racy mixed size accesses

613d60d

Split extra_cpp tests into sound and unsafe

bfa5645

Update experimental threading warning

6a73ded

Wording improvements

a7c832b

Co-authored-by: Ralf Jung <post@ralfj.de>

Move logic out of machine.rs

ceb173d

Forbade all racing mixed size atomic accesses

4a07f78

Refer to GitHub issue on overwritten init value

8215702

Give flag temp disabling race detector a better name

c731071

Specify only perfectly overlapping accesses can race

6d0c76e

Move tests to new directories

65f39bd

Simplify known C++20 inconsistency test

1379036

Remove unused lifetimes

6fb7c13

Add more backgrounds on lazy store buffers

bf7a5c4

Co-authored-by: Ralf Jung <post@ralfj.de>

Make racy imperfectly overlapping atomic access unsupported instead o…

1b32d14

…f UB Co-authored-by: Ralf Jung <post@ralfj.de>

cbeuw force-pushed the weak-memory branch 2 times, most recently from 997e5fb to 1b32d14 Compare June 6, 2022 18:27

bors merged commit e6d3d98 into rust-lang:master Jun 6, 2022

RalfJung mentioned this pull request Jul 2, 2022

Detect non-perfectly-overlapping unordered atomic accesses as UB? #2303

Closed

cbeuw deleted the weak-memory branch June 27, 2023 12:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Weak memory emulation using store buffers #1963

Weak memory emulation using store buffers #1963

cbeuw commented Jan 17, 2022 •

edited

RalfJung commented Jan 18, 2022

cbeuw commented Jan 18, 2022

PeterWrighten left a comment •

edited

bors commented Apr 1, 2022

RalfJung left a comment •

edited

cbeuw commented Apr 14, 2022 •

edited

RalfJung commented Apr 17, 2022

cbeuw commented Apr 17, 2022

RalfJung commented Apr 17, 2022

bors commented May 10, 2022

RalfJung left a comment

cbeuw commented May 11, 2022

RalfJung commented May 11, 2022

cbeuw commented May 11, 2022

RalfJung commented May 11, 2022

bors commented May 13, 2022

cbeuw commented Jun 6, 2022

RalfJung commented Jun 6, 2022

bors commented Jun 6, 2022

bors commented Jun 6, 2022

cbeuw commented Jun 6, 2022

bors commented Jun 6, 2022

Weak memory emulation using store buffers #1963

Weak memory emulation using store buffers #1963

Conversation

cbeuw commented Jan 17, 2022 • edited

RalfJung commented Jan 18, 2022

cbeuw commented Jan 18, 2022

PeterWrighten left a comment • edited

Choose a reason for hiding this comment

bors commented Apr 1, 2022

RalfJung left a comment • edited

Choose a reason for hiding this comment

cbeuw commented Apr 14, 2022 • edited

RalfJung commented Apr 17, 2022

cbeuw commented Apr 17, 2022

RalfJung commented Apr 17, 2022

bors commented May 10, 2022

RalfJung left a comment

Choose a reason for hiding this comment

cbeuw commented May 11, 2022

RalfJung commented May 11, 2022

cbeuw commented May 11, 2022

RalfJung commented May 11, 2022

bors commented May 13, 2022

cbeuw commented Jun 6, 2022

RalfJung commented Jun 6, 2022

bors commented Jun 6, 2022

bors commented Jun 6, 2022

cbeuw commented Jun 6, 2022

bors commented Jun 6, 2022

cbeuw commented Jan 17, 2022 •

edited

PeterWrighten left a comment •

edited

RalfJung left a comment •

edited

cbeuw commented Apr 14, 2022 •

edited