
On a wip branch, there are some subtle soundness issues with Lazy #30

Closed
matklad opened this issue Mar 30, 2023 · 17 comments

Comments

@matklad

matklad commented Mar 30, 2023

Just couldn't resist taking a look at a couple of safety issues that have burnt me in the past in this area :-)

This type

pub(super) struct Lazy<T, F> {

has two problems:

First, it misses unsafe impl<T: Sync + Send, F: Sync> Sync for Lazy<T, F> (the inferred impl misses + Send). This can be weaponized to turn Lazy into a channel for sending Sync+!Send types across threads:

fn sync_not_send<T: Sync + Default>() {
    let lazy = Lazy::new(move || T::default());
    std::thread::scope(|scope| {
        scope.spawn(|| {
            lazy.get(); // We create T in this thread
        });
    });
    // And drop it in this thread.
    drop(lazy);
    // So we have sent a !Send type across threads. (With some more
    // legwork, it's possible to even sneak the value out of drop through
    // a thread local.)
}

Second, the type owns T (it can drop it), but this isn't visible to the compiler. That is, PhantomData<Box<T>> is missing. In principle, this should be weaponizable by subverting dropck: if the compiler believes that dropping Lazy can't drop T, then it might accept code where the T inside the Lazy holds dangling pointers by the time the Lazy is dropped, something along the lines of

fn no_dropchk() {
    struct Sneaky<'a>(&'a str);
    impl<'a> Drop for Sneaky<'a> {
        fn drop(&mut self) {
            println!(".0 = {}", self.0);
        }
    }

    let lazy;
    let s = " hello".to_string();
    lazy = Lazy::new(|| Sneaky(s.as_str()));
}

Now, this exact code doesn't compile, because, although the compiler doesn't see that we can drop T, it sees that we can drop F, and those share the problematic lifetime. I am maybe 70% confident that, because of this, there's no way to weaponize it, but adding the phantom data would still seem like a good idea.
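
Concretely, both fixes might look something like this (a sketch only; the real field layout of Lazy isn't shown in this issue, so the struct body here is invented):

use core::cell::UnsafeCell;
use core::marker::PhantomData;

pub struct Lazy<T, F> {
    // Invented stand-ins for the real fields.
    data: UnsafeCell<Option<T>>,
    create: UnsafeCell<Option<F>>,
    // Fix 2: tell dropck that dropping a `Lazy<T, F>` may drop a `T`.
    owns: PhantomData<Box<T>>,
}

// Fix 1: require `T: Send` too, since the thread that drops the `Lazy`
// (and hence the `T` inside it) may not be the thread that created it.
unsafe impl<T: Sync + Send, F: Sync> Sync for Lazy<T, F> {}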

(Newer versions of the once_cell crate also have this flavor of synchronization primitive: https://docs.rs/once_cell/latest/once_cell/race/struct.OnceBox.html. We don't have a spinning no-alloc version though, as I somewhat firmly believe that, for all code that could spin lock without yielding to the OS/blocking interrupts/etc., the caller should explicitly type out .spin somewhere.)
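
For reference, a minimal usage sketch of that OnceBox type (assuming the once_cell crate; the static and its contents here are made up):

use once_cell::race::OnceBox;

static WORDS: OnceBox<Vec<u32>> = OnceBox::new();

fn words() -> &'static [u32] {
    // Threads may race to run the closure; exactly one Box wins and the
    // losers are dropped. Crucially, no thread ever spins waiting on
    // another thread's progress.
    WORDS.get_or_init(|| Box::new(vec![1, 2, 3]))
}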

BurntSushi added a commit that referenced this issue Mar 31, 2023
@BurntSushi
Owner

BurntSushi commented Mar 31, 2023

Ahhhhhh yesss! My secret plan to have someone review this code worked. :D

Could I convince you to review Pool too? It's not quite the same as Lazy, but it is similarish. And I just caught a soundness hole in it a few weeks ago. The last three tests in src/util/pool.rs tell that story.

We don't have a spinning no-alloc version though, as I somewhat firmly believe that, for all code that could spin lock without yielding to the OS/blocking interrupts/etc., the caller should explicitly type out .spin somewhere.)

Yeah, I've read your blog post a few times and I find it pretty compelling. Buuuuuuuuuuut, I don't really know what else to do. For Lazy, the core-only functionality is there so that folks in no-std no-alloc environments have an easier path. Namely, one of the main use cases of regex-automata is the ability to serialize a DFA, and then deserialize it in a much more constrained environment (no-std and no-alloc). I actually don't have a ton of experience in said environments, so I don't quite know just how useful having something like a Lazy type is. But it certainly makes things like regex-cli generate serialize easier, because it just writes out Rust code that uses library types from regex-automata to make a DFA globally available with zero fuss. But I don't feel too bad about this, because nobody has to use Lazy. It isn't used inside the regex internals anywhere. (Currently, anyway. There are use cases for it, but for a variety of reasons unrelated to Lazy, I haven't been able to make it work as well as I've wanted.)

With that said, I'm also using a spinlock in Pool, and that gets used only when std is disabled. So if you have alloc but not std, then you get a Pool with a spinlock. (Pool is unavailable in no-std no-alloc environments because it's pretty heavily tied to the idea of allocating something.) In practice, where this is going to show up is when folks disable the std feature in the regex crate (which folks have been asking for for a while). So when they do that, a call to, say, Regex::find will ask the pool for some memory to run a search. And that's going to hit the spinlock.

Now, in regex-automata, there is an escape hatch for all of this. You can choose to use a lower level API that lets you pass your own mutable memory, which means the regex doesn't need to talk to a pool at all. But, regex proper doesn't have those APIs, and I don't really want it to have those APIs because it fattens up an already big API for a very niche use case. Maybe eventually I'll cave on that point, but at least at present, I don't feel strongly enough about spinlocks to choose a different point in the design space as I currently conceive it.

I had also been hoping to get to the section on spinlocks in @m-ou-se's book (pinging you on the off chance you want to chime in here) to see if there was any extra insight to be gathered there. I am traveling tomorrow, so I might actually get that chance!

Mara: big picture thing here is that, at some point in the next few months, the regex crate will provide a way to build without std. When it does this, it still needs to use a synchronization primitive to get some mutable scratch space in order to execute a search. Basically, think of it as a Mutex<Vec<SomeRegexMemory>>. It plucks something out at the start of a search, does some reading & writing to it, and then stuffs it back in when it's done so that another search can reuse it. Now with std enabled, it does actually use a Mutex (with some other tricks). But when you disable std, currently, my plan is to have it use a spinlock because I don't really know what else to do if I want to keep providing the nice high level Regex::find APIs (and I do).
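
In code, the pattern being described is roughly this (the types and names here are hypothetical, just to pin down the shape):

use std::sync::Mutex;

// Hypothetical mutable scratch space a search needs.
struct SomeRegexMemory {
    stack: Vec<usize>,
}

fn find(pool: &Mutex<Vec<SomeRegexMemory>>, haystack: &[u8]) -> Option<usize> {
    // Pluck some memory out of the pool (or create it on first use)...
    let mut mem = pool
        .lock()
        .unwrap()
        .pop()
        .unwrap_or_else(|| SomeRegexMemory { stack: Vec::new() });
    // ...do some reading & writing to it while running the search...
    mem.stack.push(haystack.len());
    let result = mem.stack.pop();
    // ...and stuff it back in so another search can reuse it.
    pool.lock().unwrap().push(mem);
    result
}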

@BurntSushi
Owner

@matklad Also, I've pushed up fixes to your feedback. Thank you. I've also added big warnings to Pool and Lazy (with a link to your blog post).

@BurntSushi
Owner

This is the current code that implements the spinlock mentioned above for a Pool:

/// A spin-lock based mutex. Yes, I have read spinlocks considered
/// harmful[1], and if there's a reasonable alternative choice, I'll
/// happily take it.
///
/// I suspect the most likely alternative here is a Treiber stack, but
/// implementing one correctly in a way that avoids the ABA problem looks
/// subtle enough that I'm not sure I want to attempt that. But otherwise,
/// we only need a mutex in order to implement our pool, so if there's
/// something simpler we can use that works for our `Pool` use case, then
/// that would be great.
///
/// Note that this mutex does not do poisoning.
///
/// [1]: https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html
#[derive(Debug)]
struct Mutex<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

// SAFETY: Since a Mutex guarantees exclusive access, as long as we can
// send it across threads, it must also be Sync.
unsafe impl<T: Send> Sync for Mutex<T> {}

impl<T> Mutex<T> {
    /// Create a new mutex for protecting access to the given value across
    /// multiple threads simultaneously.
    const fn new(value: T) -> Mutex<T> {
        Mutex {
            locked: AtomicBool::new(false),
            data: UnsafeCell::new(value),
        }
    }

    /// Lock this mutex and return a guard providing exclusive access to
    /// `T`. This blocks if some other thread has already locked this
    /// mutex.
    fn lock(&self) -> MutexGuard<'_, T> {
        while self
            .locked
            .compare_exchange(
                false,
                true,
                Ordering::AcqRel,
                Ordering::Acquire,
            )
            .is_err()
        {
            core::hint::spin_loop();
        }
        // SAFETY: The only way we're here is if we successfully set
        // 'locked' to true, which implies we must be the only thread here
        // and thus have exclusive access to 'data'.
        let data = unsafe { &mut *self.data.get() };
        MutexGuard { locked: &self.locked, data }
    }
}

/// A guard that derefs to &T and &mut T. When it's dropped, the lock is
/// released.
#[derive(Debug)]
struct MutexGuard<'a, T> {
    locked: &'a AtomicBool,
    data: &'a mut T,
}

impl<'a, T> core::ops::Deref for MutexGuard<'a, T> {
    type Target = T;

    fn deref(&self) -> &T {
        self.data
    }
}

impl<'a, T> core::ops::DerefMut for MutexGuard<'a, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.data
    }
}

impl<'a, T> Drop for MutexGuard<'a, T> {
    fn drop(&mut self) {
        // Drop means 'data' is no longer accessible, so we can unlock
        // the mutex.
        self.locked.store(false, Ordering::Release);
    }
}
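
And here's an illustrative use of it (not code from the crate; the static is made up):

// The guard derefs to the data and releases the lock when dropped.
static SCRATCH: Mutex<Vec<u32>> = Mutex::new(Vec::new());

fn example() {
    let mut guard = SCRATCH.lock(); // spins until the lock is acquired
    guard.push(42); // DerefMut gives us &mut Vec<u32>
} // guard dropped here: 'locked' is stored back to false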

@m-ou-se

m-ou-se commented Mar 31, 2023

Mara: big picture thing here is that, at some point in the next few months, the regex crate will provide a way to build without std. When it does this, it still needs to use a synchronization primitive to get some mutable scratch space in order to execute a search. Basically, think of it as a Mutex<Vec<SomeRegexMemory>>. It plucks something out at the start of a search, does some reading & writing to it, and then stuffs it back in when it's done so that another search can reuse it. Now with std enabled, it does actually use a Mutex (with some other tricks). But when you disable std, currently, my plan is to have it use a spinlock because I don't really know what else to do if I want to keep providing the nice high level Regex::find APIs (and I do).

If that Vec is just used as a stack, to which you push and pop only one element at a time, and where each element will only be used by one thread, it might be simple to do that lock-free. :)

(Lock free lists are usually quite tricky, but I think your use case might be one that doesn't run into most of the common tricky issues.)

@matklad
Author

matklad commented Mar 31, 2023

My general stance on spinlocks is driven primarily by aesthetic reasons: it's not that I debugged some curious latency spikes in a server; it's just that I understand the mechanics of how spin locks could lead to issues, and that shifts me to "it's not up to the library to make this choice". Maybe in practice it is actually fine! But maybe not: here's someone hitting an actual deadlock in crossbeam. I guess that issue can be used to create a litmus test here:

  • the user is using a no_std regex in user-land code (e.g., they are using some library which doesn't need std and doesn't have a std version, and that library depends on Regex with default-features = false).
  • the application uses two threads, pinned to the same CPU, with one thread being higher priority
  • two threads concurrently try to do a regex search, and, with an unfortunate interleaving, the result is a deadlock

That does seem quite unlikely, but it's also not impossible, and the result (deadlock) is almost as bad as it can get. Is it OK for a library to potentially open the user to this scenario? On aesthetic grounds, definitely not! On practical grounds, it might actually be yes?


One thing I've realized here is that no_std+alloc is in some sense an incoherent world. If you have a global allocator, you must have a proper implementation of Mutex to protect it (that impl can be a no-op for threadless environments). This is somewhat ironic in this situation --- you are building a global memory pool, a specialized version of a global allocator; you need the same kinds of tools to build that; the tools are actually in your address space (as they are used by the global allocator); but you can't reach them!


I think for Pool, what Mara suggests makes the most sense. A lock-free pool + racy lazy should be totally fine. I am not sure what to do with the no_std no_alloc Lazy... I see that it is used to initialize an &[u32]. As the initialization is deterministic, I am wondering if it would be correct to do something like:

if !initialized.load(Acquire) {
  for byte in bytes.iter() { byte.store(deterministic_value, Relaxed) }
  initialized.store(true, Release)
}
for byte in bytes.iter() {
  byte.load(NonAtomic);
}

That is, threads race to overwrite the byte array, but, because they are guaranteed to write the same value, it is actually ok? My reading of the C++ memory model tells me that no, this won't be ok. On the high level, I believe the memory model doesn't have anything specific about "same value", so reasoning relying on that would be wrong. More specifically, as we ultimately want to do non-atomic reads, our atomic relaxed writes do nothing. And, if I remove that relaxed store, it seems clear that this is UB --- you have a non-atomic read and write racing with each other, and that's definitely UB, even if the write overwrites the same value.

Curiously though, if you also make sure that reads are atomic relaxed, that should be fine! (but then you'd have to return a &[AtomicU32]).
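
For completeness, that all-relaxed variant might look something like this (deterministic_value is a made-up stand-in for the deterministic initializer):

use core::sync::atomic::{AtomicBool, AtomicU32, Ordering};

fn deterministic_value(i: usize) -> u32 {
    i as u32 // stand-in: any pure function of the index
}

fn get<'a>(init: &AtomicBool, words: &'a [AtomicU32]) -> &'a [AtomicU32] {
    if !init.load(Ordering::Acquire) {
        // Threads race here, but every access is atomic and every thread
        // writes the same values, so there is no data race.
        for (i, w) in words.iter().enumerate() {
            w.store(deterministic_value(i), Ordering::Relaxed);
        }
        init.store(true, Ordering::Release);
    }
    // Callers must then read each word with w.load(Ordering::Relaxed).
    words
}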


pub static {name}: Lazy<DFA<&'static [u32]>> = Lazy::new(|| {{
    {deserialize}
}});

I'd maybe write this as

pub fn {name}() -> &'static DFA<&'static [u32]> {
    static LAZY: Lazy<DFA<&'static [u32]>> = Lazy::new(|| {{
        {deserialize}
    }});
    &*LAZY
}

to not expose specific synchronization in the API.


As to what to do here in general, I think there are these choices:

  • Figuring out what the users actually want when they ask for "no_std regex". It might be either of the two:

    • I want to match a string with a regex on my embedded device
    • I want to re-use the api of the regex crate as is, such that I get access to oodles of regex-using crates for free

    My personal stance here is that no_std is a different, resource constrained environment and so in the general case it just needs to have a different set of APIs. This makes composing libraries hard, but, if you care about resource-constrained environments, you probably care about zero-cost abstractions more than you care about seamless composition and would be fine with writing some amount of glue code yourself.

  • If there's a requirement to keep the API, then you can either silently default to a spinlock, or you can make the whole thing pluggable from the outside. I think https://docs.rs/critical-section/latest/critical_section/ and https://docs.rs/getrandom/latest/getrandom/#custom-implementations do this, and the trick boils down to, at the lowest level, an extern "C" { fn hook(); } declaration in the upstream crate which is satisfied by the downstream crate and ultimately resolved by the linker (sketched after this list).

  • You can also make this someone else's problem! In particular, lazy_static I believe falls back to spinlocks for no_std, so, by (optionally) depending on it, you absolve your own crate from questionable code. "If you just want spinlocks, use lazy_static with all of the caveats" I think is a fine strategy.
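
To make that linker trick concrete, a sketch (all names here are invented for illustration; see critical-section and getrandom for the real mechanisms):

// In the upstream crate (e.g., a hypothetical no_std regex build):
extern "C" {
    // The downstream application must provide these symbols.
    fn regex_pool_lock();
    fn regex_pool_unlock();
}

fn with_pool<R>(f: impl FnOnce() -> R) -> R {
    unsafe { regex_pool_lock() };
    let result = f();
    unsafe { regex_pool_unlock() };
    result
}

// In the downstream crate, #[no_mangle] definitions satisfy the
// declarations above, and the linker wires them together:
#[no_mangle]
extern "C" fn regex_pool_lock() {
    // e.g., disable interrupts, take an RTOS mutex, ...
}

#[no_mangle]
extern "C" fn regex_pool_unlock() {
    // ...and undo it here.
}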

@m-ou-se

m-ou-se commented Mar 31, 2023

Here's a lock-free-ish pool that I think suits your needs: https://gist.github.com/m-ou-se/5fdcbdf7dcf4585199ce2de697f367a4

It does still (very briefly) lock the stack when popping an element, but it does not lock to push an element, nor does it keep anything locked while allocating and creating a new element.
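
Roughly, the shape is this (a simplified sketch, not the actual gist code; names and details are invented):

use core::marker::PhantomData;
use core::ptr;
use core::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

/// A stack that is lock-free for push and only briefly "locked" for pop.
struct Stack<T> {
    head: AtomicPtr<Node<T>>,
    // Suppress the (too permissive) auto impls; see the manual ones below.
    _marker: PhantomData<*mut T>,
}

// As discussed earlier in this thread: owning T across threads needs T: Send.
unsafe impl<T: Send> Send for Stack<T> {}
unsafe impl<T: Send> Sync for Stack<T> {}

// A sentinel meaning "a pop is in progress". Never dereferenced.
fn locked<T>() -> *mut Node<T> {
    1 as *mut Node<T>
}

impl<T> Stack<T> {
    const fn new() -> Stack<T> {
        Stack { head: AtomicPtr::new(ptr::null_mut()), _marker: PhantomData }
    }

    /// Lock-free: a plain compare-exchange loop that never takes the
    /// "lock". ABA is harmless here, because `next` is only read by
    /// `pop` while it holds the brief lock.
    fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        let mut head = self.head.load(Ordering::Relaxed);
        loop {
            if head == locked() {
                // A pop holds the lock; it will release it very soon.
                core::hint::spin_loop();
                head = self.head.load(Ordering::Relaxed);
                continue;
            }
            unsafe { (*node).next = head };
            match self.head.compare_exchange_weak(
                head, node, Ordering::Release, Ordering::Relaxed,
            ) {
                Ok(_) => return,
                Err(cur) => head = cur,
            }
        }
    }

    /// "Locked" only between the swap and the store below: no allocation,
    /// no syscalls, and no user code run while the sentinel is in place.
    fn pop(&self) -> Option<T> {
        loop {
            let head = self.head.swap(locked(), Ordering::Acquire);
            if head == locked() {
                core::hint::spin_loop();
                continue;
            }
            if head.is_null() {
                self.head.store(ptr::null_mut(), Ordering::Release);
                return None;
            }
            let next = unsafe { (*head).next };
            self.head.store(next, Ordering::Release);
            let node = unsafe { Box::from_raw(head) };
            return Some(node.value);
        }
    }
}

// (Drop impl elided: a real version must pop and free any remaining nodes.)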

@BurntSushi
Owner

BurntSushi commented Mar 31, 2023

@m-ou-se thank you! The reason why I hadn't used the approach of a lock-free structure was indeed because of the ABA problem. It looks like you get around that by still using a spin lock. Does that still fall afoul of the problems with spin locks in general, though? (Yours has much smaller scope than mine, of course.)

@m-ou-se

m-ou-se commented Mar 31, 2023

A spinlock isn't generally bad. A regular Mutex also spins for quite a while before it blocks. Using a spinlock when it can sometimes spin for a long time is where it starts becoming a bad idea. In this case, it is only locked between the compare-exchange and the store that immediately follows, not while performing any allocations or syscalls etc. (An Arc does exactly the same in Arc::downgrade().)

@m-ou-se

m-ou-se commented Mar 31, 2023

(But as usual, only a benchmark will provide real answers. ^^)

@m-ou-se

m-ou-se commented Mar 31, 2023

It looks like you get around that by still using a spin lock.

(I would say "by using spinning", not "by using a spin lock". A SpinLock<Vec<T>> would need to be locked on every access. My implementation only briefly 'locks' for popping, not for pushing or allocating (or resizing a Vec).)

@BurntSushi
Owner

Hmmm interesting. I guess what I had in mind is whether @matklad's deadlock scenario can happen in your pool.

@matklad Thinking about it more, why doesn't the core::hint::spin_loop() hint prevent the deadlock from happening? I assume the deadlock happens because the OS doesn't know user code is in the spin lock, but I guess core::hint::spin_loop() should resolve that?

@m-ou-se

m-ou-se commented Mar 31, 2023

I assume the deadlock happens because the OS doesn't know user code is in the spin lock, but I guess core::hint::spin_loop() should resolve that?

A spin loop hint doesn't communicate with the operating system. It's basically just a powerful "200 × nop" that, depending on the architecture, also instructs the CPU to switch to the other hyperthread of the same core.

(The operating system equivalent is std::thread::yield_now(), but if you are going to do such a syscall you should consider just doing a futex/locking syscall instead (i.e. use a Mutex).)

@matklad
Author

matklad commented Mar 31, 2023

I guess what I had in mind is whether @matklad's deadlock scenario can happen in your pool.

Yeah, I believe it could happen if:

  • there are two threads pinned to a single core
  • the OS facilities are used to enforce "thread 1 has higher priority than thread 2, so, if both are runnable, 1 should always receive higher priority"

https://github.com/matklad/spin-of-death/blob/master/src/main.rs (from my post) might be a good starting point to create the above situation.

A spinlock isn't generally bad.

I think it's important to make a distinction between "spinning a bit before syscall" (awesome perf optimization!) and an actual spinlock, where we burn the CPU, blocked on the progress of some other thread (can create a deadlock if thread priorities are unfair).

@m-ou-se

m-ou-se commented Mar 31, 2023

That deadlock situation is one that could happen when a thread gets preempted after it has locked but before it has unlocked the list, the newly running thread starts waiting for the list to be unlocked, and the scheduler will never preempt the second thread to let the first one make progress.

This won't happen with purely cooperative scheduling (only switching threads on yield/syscall), since there is no syscall or any user code between the lock and unlock. (Unlike with a SpinLock<Vec<T>>, where a lot can happen while it is locked.) Nor will it happen with 'normal' preemptive scheduling. It could happen when using the same regex in a main thread and a signal handler, interrupt handler or other thread when that second thread fully blocks the main thread until it finishes (e.g. when using priorities and real time scheduling without limits). This is a rare scenario, but not impossible/unthinkable in some specific (embedded/low level) applications.

This is avoided by putting a maximum on the number of spin cycles and falling back to something else. If you fall back to blocking using the operating system, you've implemented a mutex. But in this case you could also just fall back to not using the pool and allocating/deallocating directly.

Note that using malloc/Box::new in a signal handler on Unix is actually not allowed, because it can just result in the exact same deadlock with the lock(s) in the allocator implementation, so that doesn't really solve anything for this case. In other words, for Unix signal handlers, not using the Pool and always doing Box::new would result in the exact same issue. Doing anything that allocates in a signal handler is simply not supported.

So what's left is (non-preempted/"real time") threads with different priorities, and custom/embedded systems that do not use libc but have their own allocator that is safe to be used in such situations. In those cases, the Pool::get I sketched indeed has an issue that Box::new wouldn't have.

To solve that issue we need to interact with the operating system in some way, which is basically what we're trying to avoid in no_std. The only real operating system operations we have are the ones from the alloc crate: allocation and deallocation. So if the fallback options after spinning for too long are a mutex, a yield() call, thread parking, a futex call, or Box::new/drop, then falling back to Box::new/drop after e.g. 100 spin cycles might be the most obvious option. Not the most performant, but maybe that's okay for a rare situation. (Such trade-offs are always super tricky to make.) A sketch of that fallback, with hypothetical Pool/Scratch types:
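
// Hypothetical types, for illustration only.
struct Scratch;

impl Scratch {
    fn new() -> Scratch {
        Scratch
    }
}

struct Pool {/* e.g. the briefly-locking stack from the gist */}

impl Pool {
    fn try_pop(&self) -> Option<Box<Scratch>> {
        None // stub; a real pool would pop without unbounded spinning
    }
}

// Spin a bounded number of times; if the pool stays contended, give up
// and allocate directly rather than blocking on another thread's progress.
fn get_scratch(pool: &Pool) -> Box<Scratch> {
    for _ in 0..100 {
        if let Some(scratch) = pool.try_pop() {
            return scratch;
        }
        core::hint::spin_loop();
    }
    // Rare slow path: costs an allocation, but can never deadlock.
    Box::new(Scratch::new())
}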

@m-ou-se

m-ou-se commented Mar 31, 2023

Note that alloc::sync::Arc's downgrade() and get_mut() have the exact same problem in those situations. I'll leave it up to you to decide whether that means that "Arc has a bug" or that "those situations are apparently unsupported if you use alloc" and what that means for regex. :)

@matklad
Author

matklad commented Apr 1, 2023

Got this to deadlock without signals, just using realtime thread priorities:

matklad/spin-of-death@7ed9770

But yeah, it indeed seems like Arc::downgrade can deadlock in a similar way if a different lower priority thread does get_mut in a bad moment.

@BurntSushi
Owner

Aye, thanks for this everyone! I think I'm just going to swap out my alloc-only implementation of Pool with @m-ou-se's from above. I think I'm comfortable being in a similar spot as Arc::downgrade in exchange for keeping the nicer higher level API. And I'll add warnings to the docs. I also feel okay about this because if someone is concerned about it, they will have an escape route by using lower level APIs that don't make use of a Pool at all.
