Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

__traverse__ should not alter any refcounts, at least in CPython #3165

Closed
lifthrasiir opened this issue May 19, 2023 · 9 comments · Fixed by #3168
Closed

__traverse__ should not alter any refcounts, at least in CPython #3165

lifthrasiir opened this issue May 19, 2023 · 9 comments · Fixed by #3168
Labels

Comments

@lifthrasiir
Copy link
Contributor

lifthrasiir commented May 19, 2023

Bug Description

Back when we've got a support for the GC protocol, while it is not usual for __traverse__ to do anything more than calling visit.call, it was assumed that running any Python code is not harmful and currently __traverse__ trampoline is no more than an ordinary trampoline with a slightly different signature.

It turns out that not only this assumption is wrong but also any currently generated tp_traverse implementation is prone to crash. CPython has reused PyGC_Head fields to store intermediate refcounts for a long time, so tp_traverse should be never able to read or write any PyGC_Head fields.

In particular the __traverse__ trampoline would first allocate a new GILPool and apply deferred refcount updates at once, possibly deallocating some objects in the pool and in turn somehow touching temporarily corrupted PyGC_Head (e.g. PyObject_GC_Untrack or PyObject_GC_Del). For this reason I believe this is the root cause of #1623 and #3064; both suggests that PyGC_Head is corrupted and threading seems to be relevant here (it is much easier to defer refcount updates in non-interpreter threads).

Reproduction

Too lengthy to open by default

I was able to reliably trigger SIGSEGV on Python 3.8 through 3.11 in x86-64 Linux. Given the following src/lib.rs and typical Cargo.toml (which has been omitted):

use pyo3::{prelude::*, PyTraverseError, PyVisit};
use std::{cell::RefCell, thread};

#[pyclass]
struct Ref {
    inner: RefCell<Option<Py<Self>>>,
    clear_on_traverse: bool,
}

impl Ref {
    fn new(clear_on_traverse: bool) -> Self {
        Self {
            inner: RefCell::new(None),
            clear_on_traverse,
        }
    }
}

#[pymethods]
impl Ref {
    fn __traverse__(&self, visit: PyVisit<'_>) -> Result<(), PyTraverseError> {
        if self.clear_on_traverse {
            self.inner.take();
        } else {
            if let Some(inner) = &*self.inner.borrow() {
                visit.call(inner)?;
            }
        }
        Ok(())
    }

    fn __clear__(&mut self) {
        self.inner.take();
    }
}

#[pyfunction]
fn crash1() {
    thread::spawn(move || {
        _ = Python::with_gil(|py| -> PyResult<_> {
            let a = Py::new(py, Ref::new(false))?;
            a.borrow_mut(py).inner.replace(Some(a.clone_ref(py)));

            let b = Py::new(py, Ref::new(false)))?;
            Ok((a, b))
        });
        eprintln!("thread finished");
    });

    Ok(())
}

#[pyfunction]
fn crash2() {
    thread::spawn(move || {
        _ = Python::with_gil(|py| -> PyResult<_> {
            let a1 = Py::new(py, Ref::new(true))?;
            let a2 = Py::new(py, Ref::new(false))?;
            a1.borrow_mut(py).inner.replace(Some(a2.clone_ref(py)));
            a2.borrow_mut(py).inner.replace(Some(a1.clone_ref(py)));

            let b = Py::new(py, Ref::new(false)))?;
            Ok((a1, b))
        });
        eprintln!("thread finished");
    });
}

#[pymodule]
#[pyo3(name = "pyo3_crash_test")]
fn module_entry(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(crash1, m)?)?;
    m.add_function(wrap_pyfunction!(crash2, m)?)?;
    Ok(())
}

Running the following Python code will crash inside gc.collect():

import sys
import gc
import time
import pyo3_crash_test

pyo3_crash_test.crash1() # or crash2
time.sleep(0.1) # not strictly required, but ensures that the thread finishes before GC
print('collecting...', file=sys.stderr)
gc.collect()
assert False, 'crash expected during GC'

Here Ref is holding an optional reference to other Ref, possibly itself. Setting clear_on_traverse will clear the reference during the traversal, simulating an otherwise totally safe operation to do.

In the crash1 the refcounts for a and b will be 2 and 1 respectively at the time of GC. GC will overwrite PyGC_Head._gc_prev for both a and b (effectively converting the list to singly-linked), and then __traverse__ will be called for a, creating a new GILPool and triggering ReferencePool::update_counts. As both values are registered for Py_DECREF the refcount of b will be now 0, and this will eventually call PyObject_GC_Del which assumes that the list is still doubly-linked and crashes.

The situation is similar for crash2: a2 gets dropped with the GIL still acquired, so the refcounts for a1, a2 and b are 2, 1 and 1 respectively. Now assume that GILPool::new somehow recognizes __traverse__ and doesn't call update_counts. Even in this case a1 will now remove a reference to a2 during __traverse__, falsely beliving Py_DECREF is safe to call as the GIL is acquired, so the same thing will happen for a2.

Possible Solution

It seems evident to me that __traverse__ should be handled as if no GIL is (and can be) acquired. This implies that Python::with_gil should probably fail and gil_is_acquired should return false inside __traverse__.

I believe it is also possible to have multiple nested __traverse__ calls. The GC process drains the linked list of objects while updating PyGC_Head and it sets gcstate->collecting to 1 during the process anyway, so __traverse__ cannot be called twice for the same object and nested calls should be rare, but there are at least three possible paths (as of the main branch of CPython):

  • Creating new objects may trigger PyObject_GC_Track, which implicitly calls tp_traverse in debug builds.
  • gc.get_referrers and gc.get_referrents use tp_traverse to collect objects, and are not affected by gcstate->collecting. They can be independently nested if __traverse__ somehow calls those functions.
  • While directly calling gc.collect() checks for gcstate->collecting and returns immediately, the automatic GC scheduling may result in a double GC. Specifically there is an interval between the scheduling (_Py_ScheduleGC) and the actual GC (_Py_RunGC) and some opcode may end up calling gc.collect() inbetween. This is extremely rare though, as this interval can be at most a single opcode, and is otherwise safe.

Probably it is sufficient to add the second "traversal" counter for the nested __traverse__ calls in addition to GIL_COUNT. If the traversal counter is non-zero the GIL cannot be acquired and gil_is_acquired should return false regardless of GIL_COUNT. The GIL_COUNT is still updated as usual, but it is effectively ignored until the traversal counter drops to zero, at which point update_counts should be considered (but only when GIL_COUNT was non-zero). But I'm not very much confident about this solution for multiple reasons, including that there is no actual guarantee that we can call any other Python API from tp_traverse and someday CPython may touch PyGC_Head from unexpected functions---I mean, even gcstate->collecting check is not complete so who knows? If possible, a fully static solution to prevent anything but PyVisit would be much more desirable.

Suggestions

Given a full solution needs much testing at least, I looked for possible workarounds without a local PyO3 update. Unfortunately GILPool::new call is mandatory for the trampoline and the only indirect approach would be temporarily untracking objects via PyObject_GC_UnTrack, but that pretty much contradicts why we have tp_traverse in the first place.

So it seems that PyO3 needs a minimal update to ensure that you remain safe if you don't do weird things inside __traverse__. No promises here, but I hope to file a small PR to disable update_counts from the trampoline soon. In my local testing this fixes crash1 but not crash2.

Metadata

Your operating system and version

x86-64 Linux (at least)

Your Python version (python --version)

3.8.10, 3.9.16, 3.10.11, 3.11.13

Your Rust version (rustc --version)

1.69.0

Your PyO3 version

0.18.3

How did you install python? Did you use a virtualenv?

deadsnakes PPA, the bundled version of venv, pip and the most recent version of maturin (0.15.2).

@adamreichold
Copy link
Member

It seems evident to me that traverse should be handled as if no GIL is (and can be) acquired. This implies that Python::with_gil should probably fail and gil_is_acquired should return false inside traverse.

I think so too.

What I did not yet understand in your exposition is whether it is actually necessary to change GIL_COUNT while inside __traverse__? Meaning that if we prevent the GIL from being acquired within __traverse__, wouldn't that imply GIL_COUNT staying at its initial value until we leave traverse?

What is also not clear to me is how the implementation of __traverse__ will look like under the new regime. Will it take slf: *const PyCell<Self> or slf: *const Self instead of &self since we cannot borrow the interior of the PyCell? Will this imply PyVisit::call becoming unsafe and taking raw pointers as well? Is the resulting API still actually better than the raw FFI then?

@adamreichold
Copy link
Member

adamreichold commented May 20, 2023

@davidhewitt I am currently thinking how to get a grip on mod gil to be somewhat confident in making any changes to it. While we already decreased our API surface w.r.t. GILGuard with this release, I wonder if GILPool could go away as well?

One option would be to provide any API similar to Python::with_gil, e.g. the example code

// Some long-running process like a webserver, which never releases the GIL.
loop {
    // Create a new pool, so that PyO3 can clear memory at the end of the loop.
    let pool = unsafe { py.new_pool() };

    // It is recommended to *always* immediately set py to the pool's Python, to help
    // avoid creating references with invalid lifetimes.
    let py = pool.python();

    // do stuff...
}

could be replaced by

// Some long-running process like a webserver, which never releases the GIL.
loop {
    Python::with_pool(|py| {
       // do stuff...
    });
}

which creates the pool behind the scenes and just exposes its GIL token.

I mainly wonder whether this would be just too much breakage for 0.19 together with the with_gil change? And whether this would be flexible enough for our users while actually giving us more room to manoeuvre with the internals? Or would all of this be superseded by the owned objects API anyway?

EDIT: I think this could also significantly simplify the implementation of GILPool and OWNED_OBJECTS as proper nesting would be ensure, i.e. we could just stored Cell<*mut GILPool> in thread-local storage to contain the current pool.

@lifthrasiir
Copy link
Contributor Author

What I did not yet understand in your exposition is whether it is actually necessary to change GIL_COUNT while inside __traverse__? Meaning that if we prevent the GIL from being acquired within __traverse__, wouldn't that imply GIL_COUNT staying at its initial value until we leave traverse?

You are right that it is not necessary. That's about all though, it doesn't give you any additional optimization opportunity. I just added GILPool::new_without_update_counts in my testing.

What is also not clear to me is how the implementation of __traverse__ will look like under the new regime. Will it take slf: *const PyCell<Self> or slf: *const Self instead of &self since we cannot borrow the interior of the PyCell?

I believe you can still borrow PyCell from __traverse__. We can't use the GIL, but we can still make use of the fact that there is at most one thread with the GIL. (Compare with SuspendGIL, which actually releases the GIL during its lifetime and has no such guarantee.) My example just happens to use borrowing to trigger Py::drop.

Will this imply PyVisit::call becoming unsafe and taking raw pointers as well? Is the resulting API still actually better than the raw FFI then?

I didn't think too much about making anything unsafe but that sounds like another possible approach. In fact, as we are not super confident to allow any other function calls in __traverse__, we can require unsafe fn __traverse__(&self, ...) to signal that it needs a manual verification. While not as good as an additional counter, I guess it's better than making PyVisit::call unsafe because PyVisit is not even necessary to trigger a crash. In my example, you can refactor Ref with clear_on_traverse into a separate class which don't even touch PyVisit in its __traverse__.

@adamreichold
Copy link
Member

I just added GILPool::new_without_update_counts in my testing.

I think we should push that as a #[doc(hidden)] internal API which can be used from impl_traverse_slot as soon as possible. AFAIU, this will be required in any case short of not creating a GILPool at all, i.e. having traverse implementation life completely outside of GIL land?

I believe you can still borrow PyCell from traverse. We can't use the GIL, but we can still make use of the fact that there is at most one thread with the GIL.

The problem is that PyCell derefs into PyAny which provides implicit access to a GIL token via PyAny::py, i.e. we cannot even expose &PyCell<Self> or any &PyAny to the traverse implementation if it is not supposed to safely assume holding the GIL.

@adamreichold
Copy link
Member

I believe you can still borrow PyCell from traverse. We can't use the GIL, but we can still make use of the fact that there is at most one thread with the GIL.

The problem is that PyCell derefs into PyAny which provides implicit access to a GIL token via PyAny::py, i.e. we cannot even expose &PyCell<Self> or any &PyAny to the traverse implementation if it is not supposed to safely assume holding the GIL.

Ah, but we can still expose &self but just not how that borrow was created which is basically what we are already doing. So this leaves with with_gil not yielding access to a GIL token when calling from within __traverse__.

@adamreichold
Copy link
Member

@lifthrasiir Could you try whether #3168 would also work for you?

@lifthrasiir
Copy link
Contributor Author

lifthrasiir commented May 20, 2023

The problem is that PyCell derefs into PyAny which provides implicit access to a GIL token via PyAny::py, i.e. we cannot even expose &PyCell<Self> or any &PyAny to the traverse implementation if it is not supposed to safely assume holding the GIL.

Ah, but we can still expose &self but just not how that borrow was created which is basically what we are already doing. So this leaves with with_gil not yielding access to a GIL token when calling from within __traverse__.

Yeah, the only self parameter that is safe to expose from __traverse__ would be &self. And I think __traverse__ is the only method that does not accept any other self parameter because no wrapper function is generated for it. Maybe we should explicitly test that.1

Could you try whether #3168 would also work for you?

I haven't checked yet---I need an access to a company machine for the original code, so was making an independent test for the PR.

I believe your draft still creates and drops Py<PyCell<T>>. This doesn't directly pose a problem unless the refcount was very large (u32/u64::MAX), but I'm still uneasy about calling Py_INCREF and Py_DECREF at all, given that they are already slated to change a bit in Python 3.12. Generally speaking: is it safe to construct &PyCell<T> directly from a borrowed reference? If so that can be used to completely bypass Py<PyCell<T>>.

Footnotes

  1. Actually... we should also make sure that we are calling the __traverse__ method defined in #[pymethods] impls, because currently the method is called to PyRef<'_, T>, so it might be possible to redefine __traverse__ to get a reference to PyRef and thus the GIL token.

@adamreichold
Copy link
Member

I believe your draft still creates and drops Py<PyCell>.

py.from_borrowed_ptr::<PyCell<T>>(slf) produces a &PyCell<T>, so the Py<PyCell<T>> was already bypassed in the code shipped as 0.18.3.

Actually... we should also make sure that we are calling the traverse method defined in #[pymethods] impls, because currently the method is called to PyRef<'_, T>, so it might be possible to redefine traverse to get a reference to PyRef and thus the GIL token.

The latest version of my draft enforces the signature

fn(&T, PyVisit<'_>) -> Result<(), PyTraverseError>,

instead of just calling borrow.__traverse__.

@adamreichold adamreichold mentioned this issue May 21, 2023
7 tasks
lifthrasiir added a commit to lifthrasiir/pyo3 that referenced this issue May 22, 2023
@lifthrasiir
Copy link
Contributor Author

Now I have an access to the company machine again and I confirmed that #3168 fixes my crash. Great!

@adamreichold I've added more tests (#3175) that I was originally planning to use for my PR, feel free to use them.

lifthrasiir added a commit to lifthrasiir/pyo3 that referenced this issue May 22, 2023
lifthrasiir added a commit to lifthrasiir/pyo3 that referenced this issue May 22, 2023
bors bot added a commit that referenced this issue May 25, 2023
3168: Do not apply deferred ref count updates and prevent the GIL from being acquired inside of __traverse__ implementations. r=davidhewitt a=adamreichold

Closes #2301
Closes #3165


Co-authored-by: Adam Reichold <adam.reichold@t-online.de>
@bors bors bot closed this as completed in 32c335e May 25, 2023
adamreichold pushed a commit that referenced this issue May 25, 2023
bors bot added a commit that referenced this issue May 25, 2023
3184: Additional tests for #3168 r=adamreichold a=adamreichold

These were a part of tests `@lifthrasiir` was preparing for #3165, and I believe it's worthy to add them (any single of them fails in the current main branch).

Co-authored-by: Kang Seonghoon <public+git@mearie.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants