Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pyo3::once_cell::GILOnceCell #975

Merged
merged 1 commit into from Jun 19, 2020
Merged

Conversation

davidhewitt
Copy link
Member

This is a draft idea of how to get something similar to lazy_static which avoids deadlocks related to the GIL.

It's based on once_cell API, but uses GIL to guarantee thread safety instead of its own locks.

I was able to prove it works by replacing both our datetime's static mut as well as LazyHeapType. I think LazyStaticType should be able to be replaced too, but I had some complications around GIL safety which I think #970 will solve - so I'll wait for that to merge first.

src/exceptions.rs Outdated Show resolved Hide resolved
src/once_cell.rs Outdated Show resolved Hide resolved
@davidhewitt
Copy link
Member Author

I managed to use OnceCell for the implementation of LazyStaticType too.

This led to an interesting discovery. I think our original assessment about "recursive" class attributes, as discussed here, might be wrong: #905 (comment)

The assumption there was that PyCell::new does not use the type object. But it turns out that it does - PyCell::new eventually calls T::alloc which uses the type object to create a new instance.

For "recursive" class attributes I think this shows they are unfortunately not sound, because I don't think we should be using the incomplete type object to create new instances.

As a result I now disable recursive class attributes in this PR.

cc @scalexm - as original implementor of class attributes, what do you think of this?

@davidhewitt davidhewitt force-pushed the once-cell branch 2 times, most recently from 4f95169 to 885b428 Compare June 15, 2020 08:20
@kngwyu
Copy link
Member

kngwyu commented Jun 15, 2020

I'm sorry I cannot afford to give a detailed review now. I'll give it later.

At a glance, this looks like a really nice refactoring for our internal implementation 👍
But I'm not sure this should be a public API and how to explain what this is to users 🤔

This led to an interesting discovery. I think our original assessment about "recursive" class attributes, as discussed here, might be wrong: #905 (comment)

Sign... 😓 sorry for my incorrect comment. But could you please leave it as is for now?
I think it's better if we can set tp_dict after initializing the class, which could be out of this PR.

@scalexm
Copy link
Contributor

scalexm commented Jun 15, 2020

First of all, I’ve just read the current implementation (before this PR) of LazyStaticType. The public API was unsound, as one caller could set the initialized flag, then start calling initialize_type_object with a &mut ref to the cell interior, then while that call is still ongoing, an other caller can call get_or_init, see that initialized is already set to true and create a & ref to the cell interior, which incorrectly aliases with the former &mut reference.

In particular get_or_init was not reentrant. A correct implementation should have deadlocked or panicked, and we would have caught the « recursive class attribute » problem thanks to the tests I added specifically for that.

This PR now does the correct thing, in that it takes a lock token as an argument to get_or_init.

Now, according to what @davidhewitt found, it seems to me that a solution for class attributes needs to happen in two steps: for each type object, we’d have two distinct initialization procedures, each one encapsulated in some kind of Once:

  • initialize_type_object, which does everything except initializing tp_dict, and which disallows reentrancy to be sure that we don’t introduce other bugs
  • initialize_tp_dict, which initializes the tp_dict, and which resolves reentrancy cycles by simply not calling the function again

I know that @programmerjake expressed concern about changing the tp_dict after calling PyType_ready, but I think it’s perfectly allowed: see for example https://github.com/python/cpython/blob/25f38d7044a3a47465edd851c4e04f337b2c4b9b/Python/sysmodule.c#L2757

@scalexm
Copy link
Contributor

scalexm commented Jun 15, 2020

To clarify, I was thinking about something like this:

struct LazyStaticType {
    value: OnceCell<Box<ffi::PyTypeObject>>,

    // Does not need to be atomic, an `UnsafeCell` would suffice in reality,
    // but it's less convenient. Initially set to `false`.
    init_dict: AtomicBool,
}

impl LazyStaticType {
    ...
    
    pub fn get_or_init<T: PyClass>(&self, py: Python) -> &ffi::PyTypeObject {
        let type_object = self.value
            .get_or_init(py, || {
                let mut type_object = Box::new(ffi::PyTypeObject_INIT);
                // Recursive call inside `initialize_type_object` will loop indefinitely.
                // We can also make it panic if wanted, as in the PR.
                initialize_type_object::<T>(py, T::MODULE, type_object.as_mut()).unwrap_or_else(
                    |e| {
                        e.print(py);
                        panic!("An error occurred while initializing class {}", T::NAME)

                    },
                );
                type_object
            })
            .as_ref()
        
        // Note: we're still holding the GIL lock here.
        if self.init_dict.load(Ordering::Relaxed) {
            // Re-entrant call: just return the type object even if `tp_dict` is
            // not yet filled. Since `ffi::pyTypeObject` is not `Send`, there are
            // also no risks of concurrent read/writes to the `tp_dict`.
            return type_object;
        }
        self.init_dict.store(true, Ordering::Relaxed);
        // Maybe add a RAII guard here for cleanup in case of panic.
        init_tp_dict(type_object.tp_dict, py);
        type_object
    }
    
    ...
}

fn init_tp_dict(tp_dict: *mut PyObject, py: Python) {
    // Only use FFI functions to manipulate the dict here.
   // Note: we must not release the GIL while we’re making changes to the dict!
    ...
}

———————
Edit: @davidhewitt pointed out that because we’re going to run arbitrary user code in init_tp_dict (class attribute methods), the GIL can be temporarily released and the assumptions I make cannot hold.

I still believe a correct solution will be similar to what I proposed, either by managing to guarantee some synchronization invariants about the tp_dict (for example every read or write access, either direct or indirect through FFI functions, needs to hold a GIL), or by adding some kind of unsafe fn get_type_object_but_tp_dict_not_yet_filled(&self) -> &ffi::PyTypeObject which can only be used during the initialization of the tp_dict and which basically will be used only for allocating new objects to fill the tp_dict.

src/once_cell.rs Show resolved Hide resolved
//
// That could lead to all sorts of unsafety such as using incomplete type objects
// to initialize class instances, so recursive initialization is prevented.
let thread_not_already_initializing = self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that, in conjunction with an appropriate RAII guard in case of panic (so that one thread which panicks can still let another thread perform the initialization), we could just use a boolean flag: we have an exclusive GIL so we are the only one which can recursively call the method.

I believe that a map keyed with thread IDs is only necessary for e.g. multi reader locks.

Also, without this check, a recursive call will just overflow the stack, so maybe we don’t need the check at all, unless you prefer it for debugging purposes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think because of the above issue that f() can release the GIL, we can't necessarily prevent multiple threads from attempting to initialize the type object. What the OnceCell implementation above can guarantee is that all values returned from get_or_init are the same reference - even if init runs multiple times...

Trying to work around this by blocking all threads except the first with a lock is exactly how we get to the deadlock scenario in lazy_static 😢

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, without this check, a recursive call will just overflow the stack, so maybe we don’t need the check at all, unless you prefer it for debugging purposes.

I'm torn. I don't expect the check to cause much overhead (will usually be a HashSet with just one element and never read.)

The stack overflow is a little harder to debug, but I guess it's fine :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it’s not much a matter of overhead but rather of code simplification :)

also about the boolean flag, in that case we control the f that is being called (initialize_type_object) so we can assume it won’t release the GIL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We almost control the f being called but I don't know if we can guarantee it won't release the GIL. User could write a custom implementation of just one part (e.g. class attribute initialization) which might easily break this assumption in a very subtle way...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we don’t give the user access to the Python token (which we don’t in class attributes, exactly because we were afraid to do so!), we’re safe.

I believe Python::acquire_gil().python().allow_threads(|| { ... }) will temporarily release the GIL, even in class attributes. So I think we don't have to give class attribute user the Python token for stuff to happen!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thanks for the clarification. Then I understand that the discussion we had on #905 of whether or not to allow passing a Python parameter to class attr methods was a bit pointless :)

Also the solution I proposed cannot work as is, as I need to make the assumption that the GIL is held during the whole tp_dict initialization.

Do you know whether we have some kind of guarantees about the tp_dict synchronization? For example, do we guarantee that whenever we read or write to it (either directly, or indirectly through FFI functions), we hold a GIL?

Also, I’m in favor of leaving your check for recursive initialization as is. We can revisit when we sort out this recursive class attribute thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling it difficult to understand why f() matters here. Now f() is this closure and we don't call any user-defined function in this closure, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We initialize class attributes inside initialize_type_object, and the initialization calls into user-defined code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We initialize class attributes inside initialize_type_object, and the initialization calls into user-defined code.

Ah, I understand, thanks.
But it looks the only candidate is #[classattr] 🤔 ...

@davidhewitt
Copy link
Member Author

davidhewitt commented Jun 15, 2020

I know that @programmerjake expressed concern about changing the tp_dict after calling PyType_ready, but I think it’s perfectly allowed: see for example https://github.com/python/cpython/blob/25f38d7044a3a47465edd851c4e04f337b2c4b9b/Python/sysmodule.c#L2757

I'm not sure exactly how to interpret the docs:

@davidhewitt
Copy link
Member Author

But I'm not sure this should be a public API and how to explain what this is to users 🤔

It gives users a way to avoid the lazy_static deadlock seen in #973, so it would be nice if it is a public API.

@scalexm
Copy link
Contributor

scalexm commented Jun 16, 2020

I’m in favor of merging this PR as it is (modulo review from @kngwyu), as it fixes an important soundness issue in LazyStaticType::get_or_init.

It’s ok to disable recursive class attributes for now, as it seems clear that a solution to this problem is out of scope for this PR.

Copy link
Member

@kngwyu kngwyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments.

In addition, can the name OnceCell be confusing?

pyo3-derive-backend/src/pyclass.rs Show resolved Hide resolved
src/type_object.rs Outdated Show resolved Hide resolved
src/type_object.rs Outdated Show resolved Hide resolved
@kngwyu
Copy link
Member

kngwyu commented Jun 16, 2020

Thank you @davidhewitt and @scalexm for the discussion and pointing out the unsoundness! 👍

It’s ok to disable recursive class attributes for now, as it seems clear that a solution to this problem is out of scope for this PR.

Agreed. I want to adopt @scalexm's 'two-stage' initialization strategy if it's possible, but in this PR it's more important to concentrate on how OnceCell should be.

@davidhewitt
Copy link
Member Author

In addition, can the name OnceCell be confusing?

I chose the name OnceCell because it's a write-once cell exactly like the ones found in the once_cell crate.

But I would be very happy to use a different name if anyone can find one. A nice thing about calling it OnceCell is we can document this API as being a drop-in replacement for once_cell::sync::OnceCell which won't deadlock due to the GIL :)

@kngwyu
Copy link
Member

kngwyu commented Jun 17, 2020

What I meant is about using the same name as the famous OnceCell.
Our OnceCell and their one can be used at the same time. If one initializes their OnceCell before getting GIL, it works without deadlock.

@davidhewitt
Copy link
Member Author

If one initializes their OnceCell before getting GIL, it works without deadlock.

Not sure exactly what you mean by this? I can create a deadlock with once_cell::sync and the GIL pretty easily:

Click to expand
use pyo3::prelude::*;
use once_cell::sync::OnceCell;
use std::thread::sleep;

static OC: OnceCell<()> = OnceCell::new();
static DURATION: std::time::Duration = std::time::Duration::from_secs(1);

#[test]
fn test1() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    OC.get_or_init(|| py.allow_threads(|| sleep(DURATION)));
}

#[test]
fn test2() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    OC.get_or_init(|| py.allow_threads(|| sleep(DURATION)));
}

What I meant is about using the same name as the famous OnceCell.

That's fair. What about GILOnceCell ? I am happy to take any suggestions too!

davidhewitt added a commit to davidhewitt/pyo3 that referenced this pull request Jun 17, 2020
@kngwyu
Copy link
Member

kngwyu commented Jun 17, 2020

I like GILOnceCell or GILCell.
GILLazyCell is also good but maybe should be used for our Lazy equivalent.

@davidhewitt
Copy link
Member Author

I think let's have GILOnceCell and maybe later we can make GILLazy. I think those names have nice symmetry with the well-known OnceCell and Lazy (just stick GIL on the front 😄)

@davidhewitt
Copy link
Member Author

(I was thinking maybe in the future GILCell could be something which is a bit more like RefCell but can be shared between threads - kind of like PyCell we already have but for all types, not just for PyClass)

@davidhewitt
Copy link
Member Author

Renamed to GILOnceCell, added documentation and CHANGELOG, and rebased on master. This is now ready for final review / consideration for merging.

Let's create a follow-up issue to solve self-typed class attributes?

@davidhewitt davidhewitt marked this pull request as ready for review June 18, 2020 07:31
Copy link
Member

@kngwyu kngwyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly on doc comments.

src/once_cell.rs Outdated Show resolved Hide resolved
src/once_cell.rs Outdated Show resolved Hide resolved
src/once_cell.rs Outdated Show resolved Hide resolved
src/once_cell.rs Show resolved Hide resolved
guide/src/faq.md Show resolved Hide resolved
@davidhewitt davidhewitt changed the title Add pyo3::once_cell::OnceCell Add pyo3::once_cell::GILOnceCell Jun 18, 2020
@davidhewitt
Copy link
Member Author

Rebased on master and reworded the docs as suggested - thanks for the feedback!

@kngwyu
Copy link
Member

kngwyu commented Jun 19, 2020

LGTM, thanks!

@kngwyu kngwyu merged commit 44b2247 into PyO3:master Jun 19, 2020
@kngwyu
Copy link
Member

kngwyu commented Jun 19, 2020

Opened #982.

scalexm added a commit to scalexm/pyo3 that referenced this pull request Jun 22, 2020
Use some kind of two-stage initialization as described in PyO3#975, by
being very cautious about when to allow the GIL to be released.
scalexm added a commit to scalexm/pyo3 that referenced this pull request Jun 23, 2020
Use some kind of two-stage initialization as described in PyO3#975, by
being very cautious about when to allow the GIL to be released.
@davidhewitt davidhewitt deleted the once-cell branch December 24, 2021 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants