
segmentation fault second run of rust function in python pyo3 0.21.2 #4105

Open
Crispy13 opened this issue Apr 21, 2024 · 24 comments · May be fixed by #4095

@Crispy13

Crispy13 commented Apr 21, 2024

Hi. I'm migrating my code from PyO3 0.20 to 0.21.2.

With the 0.20 code my Python app works, but with the new code it does not.

The current situation:

  • Python version: 3.9.15
  1. The first call of the Rust function in Python succeeds, but the second call fails.
  2. The error is a segmentation fault.
  3. The error messages differ slightly from run to run:

vk[d].setdefault(a, dict())[p_key] = sound_data
KeyError: 'chr11:871499/2' // <- this string cannot be any of p_key, a, or d; strings like this come from another variable.

Exception ignored deletion of interned string failed:
Traceback (most recent call last):
  File "python_project.py", line 333, in rust_function
    self.rust_have_been_failed = True
KeyError: 'À\xad\x8dU' // <- a value that should not exist if things were working properly.

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
--Type <RET> for more, q to quit, c to continue without paging--
pymalloc_free (ctx=0x0, p=0x1)
    at /usr/local/src/conda/python-3.9.15/Objects/obmalloc.c:1866
1866    /usr/local/src/conda/python-3.9.15/Objects/obmalloc.c: No such file or directory.
/usr/local/src/conda/python-3.9.15/Objects/object.c: No such file or directory.

  4. Returning (String, i32) instead of (Py<PyString>, i32) does not raise the error.

The codes here: https://gist.github.com/Crispy13/03d25f8d556e644b5425ddf1decff6dc

The Rust function takes an input string from Python and returns it back together with an additional i32.

Is this something on my end? What is the cause of the error?

@adamreichold
Member

The codes here:

Could you provide a minimal reproducer? The linked code does not contain the definition of do_something. Does it crash without that function being called?

@Crispy13
Author

Crispy13 commented Apr 21, 2024

It crashed even without using do_something (I just returned dummy output).

I found that it does not crash when I don't use rayon and just use a plain iterator.

I have also modified the gist code to more closely resemble the actual code.

@Crispy13
Author

Crispy13 commented Apr 21, 2024

Could you try this? I have written a new minimal reproducible example:

https://gist.github.com/Crispy13/320129862c5fe80740a14dcea50519fd

The error does not occur when using the global thread pool instead of a local one.

[dependencies]
anyhow = "1.0.82"
pyo3 = { version = "0.21.2", features = ["extension-module"] }
rayon = "1.10.0"

python=3.9.15
CentOS Linux 8
Linux 4.18.0-305.3.1.el8.x86_64
x86_64 GNU/Linux

@adamreichold
Member

Thank you for making it more self-contained. Sadly, I still cannot reproduce the crash. Calling test repeatedly like

>>> foo.test([["foo", "bar"], ["foo1", "bar1"]], [("foo", 1), ("bar", 2)])
x=foo, y=bar, z=foo, n=1
x=foo, y=bar, z=bar, n=2
x=foo1, y=bar1, z=foo, n=1
x=foo1, y=bar1, z=bar, n=2
[('foo', 2), ('bar', 4), ('foo', 2), ('bar', 4)]

works here both using debug and release builds.

Before I try to more closely reproduce your test environment (Python version, distribution, etc.), could you provide a test vector which triggers this for you?

@Crispy13
Author

Crispy13 commented Apr 21, 2024

Would you try this Python code?

https://gist.github.com/Crispy13/c3acaef979ed93567f5e14e1b34add47

I found a few things:

  • When not using the Python multiprocessing code, the error does not occur.
  • When not using gc.collect(), the error does not occur.
  • The error occurs randomly from run to run.
  • Python 3.12.3 does not raise the error.
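The two reproduction conditions above (worker processes plus an explicit gc.collect() after each call) can be sketched in plain Python. Here rust_function is a hypothetical stub standing in for the compiled PyO3 extension function, so the skeleton runs on its own; the actual reproducer is in the linked gist.

```python
import gc
import multiprocessing as mp

def rust_function(pairs, extras):
    # Stub standing in for the compiled extension function; it mimics the
    # output shown earlier in the thread: (z, n * 2) for every combination.
    return [(z, n * 2) for _ in pairs for (z, n) in extras]

def worker(_):
    out = rust_function([["foo", "bar"]], [("foo", 1), ("bar", 2)])
    # Forcing a collection after each call was one of the two conditions
    # needed to reproduce the crash.
    gc.collect()
    return out

if __name__ == "__main__":
    # Running the calls from worker processes was the other condition.
    with mp.Pool(2) as pool:
        print(pool.map(worker, range(4)))
```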

@adamreichold
Member

Thank you, I was able to reproduce the crash using Python 3.11 running on OpenSUSE Tumbleweed:

#0  0x00007ffff7ba132c in pymalloc_alloc (ctx=<optimized out>, nbytes=55) at Objects/obmalloc.c:1979
        size = 3
        pool = 0x7ffff740c000
        bp = 0xfff740f92d000000 <error: Cannot access memory at address 0xfff740f92d000000>
        size = <optimized out>
        pool = <optimized out>
        bp = <optimized out>
#1  _PyObject_Malloc (ctx=<optimized out>, nbytes=55) at Objects/obmalloc.c:1998
        ptr = <optimized out>
#2  0x00007ffff7ba2b8b in PyObject_Malloc (size=<optimized out>) at Objects/obmalloc.c:712
No locals.
#3  PyUnicode_New (size=6, maxchar=<optimized out>) at Objects/unicodeobject.c:1425
        obj = <optimized out>
        unicode = <optimized out>
        data = <optimized out>
        kind = PyUnicode_1BYTE_KIND
        is_sharing = 0
        is_ascii = <optimized out>
        char_size = 1
        struct_size = <optimized out>
#4  0x00007ffff7ba1ee7 in unicode_decode_utf8 (s=<optimized out>, s@entry=0x7ffff7cc9945 "closed", size=6, error_handler=error_handler@entry=_Py_ERROR_UNKNOWN, errors=errors@entry=0x0, consumed=consumed@entry=0x0) at Objects/unicodeobject.c:5117
        starts = 0x7ffff7cc9945 "closed"
        end = 0x7ffff7cc994b ""
        u = <optimized out>
        writer = {buffer = <_io.TextIOWrapper at remote 0x7ffff751fed0>, data = 0x1, kind = (unknown: 0xf7f0b8d8), maxchar = 32767, size = 140737488344576, pos = 0, min_length = 140737350357639, min_char = 4149578032, overallocate = 255 '\377', readonly = 127 '\177'}
        startinpos = 140737349642402
        endinpos = 140737343273328
        errmsg = <optimized out>
        error_handler_obj = <unknown at remote 0x7ffff7f0b8d8>
        exc = <unknown at remote 0x7fffffffd560>
#5  0x00007ffff7bd7a77 in PyUnicode_DecodeUTF8Stateful (consumed=0x0, errors=0x0, size=<optimized out>, s=0x7ffff7cc9945 "closed") at Objects/unicodeobject.c:5250
No locals.
#6  PyUnicode_FromString (u=0x7ffff7cc9945 "closed") at Objects/unicodeobject.c:2277
        size = <optimized out>
#7  PyObject_GetAttrString (v=<_io.TextIOWrapper at remote 0x7ffff75f8040>, name=0x7ffff7cc9945 "closed") at Objects/object.c:798
        w = <optimized out>
        res = <optimized out>
#8  0x00007ffff7c6501f in file_is_closed (fobj=fobj@entry=<_io.TextIOWrapper at remote 0x7ffff75f8040>) at Python/pylifecycle.c:1610
        r = <optimized out>
        tmp = <optimized out>
#9  0x00007ffff7c64d51 in flush_std_files () at Python/pylifecycle.c:1642
        tstate = <optimized out>
        fout = <optimized out>
        ferr = <_io.TextIOWrapper at remote 0x7ffff75f8040>
        tmp = <optimized out>
        status = 0
#10 0x00007ffff7c55017 in Py_FinalizeEx () at Python/pylifecycle.c:1811
        status = 0
        runtime = 0x7ffff7ee2f20 <_PyRuntime>
        tstate = 0x7ffff7f0b8d8 <_PyRuntime+166328>
        malloc_stats = <optimized out>
#11 0x00007ffff7c618d6 in Py_RunMain () at Modules/main.c:682
        exitcode = 0
#12 0x00007ffff7c29c47 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:734
        args = {argc = 2, use_bytes_argv = 1, bytes_argv = 0x7fffffffd8f8, wchar_argv = 0x0}
#13 0x00007ffff762a1f0 in __libc_start_call_main () from /lib64/libc.so.6
No symbol table info available.
#14 0x00007ffff762a2b9 in __libc_start_main_impl () from /lib64/libc.so.6
No symbol table info available.
#15 0x0000555555555085 in _start ()
No symbol table info available.

so we can be sure that this has nothing to do with the specific environment and affects at least two, but not all, Python versions.

@adamreichold
Member

When not using the Python multiprocessing code, the error does not occur.
When not using gc.collect(), the error does not occur.

For me it also crashes when keeping just one of the GC part or the multiprocessing part, but not if I remove both, and it crashes reliably. I suspect that the Python heap gets corrupted and that something sufficiently involved needs to happen after the corruption to trigger the crash.

@adamreichold
Member

The error does not occur when using global thread pool instead of local one.

Rayon is not involved at all; it is just about spawning a thread which exits before the main thread does. Doing that using plain std::thread::spawn suffices to trigger the crash.

@adamreichold
Member

Ok, I think I understand this better now, and it appears that the global reference pool is the culprit: after returning from test, we have 16 reference count increments and 28 reference count decrements pending due to cloning/dropping Py instances on the background thread. If I wrap that part of test in allow_threads, e.g.

py.allow_threads(move || {
    std::thread::spawn(move || {
        input_unbound
            .iter()
            .flat_map(|a| {
                input2_unbound
                    .iter()
                    .cloned()
                    .map(move |b| (a, b))
                    .collect::<Vec<_>>()
            })
            .map(|(e, r)| {
                let res = rust_func(e[0].as_str(), e[1].as_str(), r.0.as_str(), r.1);
                (r.0.into_py_bytes(), res)
            })
            .collect::<Vec<_>>()
    }).join().unwrap()
})

then the pending reference counts are applied and the segfault goes away.

@adamreichold
Member

@Crispy13 Could you try using a Git dependency for PyO3 based on the branch proposed in #4106? For example, using

pyo3 = { git = "https://github.com/PyO3/pyo3.git", branch = "apply-pending-refcnt-updates-before-release", features = ["extension-module"] }

in your dependencies section?

@adamreichold
Member

As a workaround for your existing code base, you can wrap all of your Rayon invocations with Python::allow_threads which should give you a stable application without any dependency changes.

@davidhewitt
Member

Uff, thanks both for reporting and digging into this further. I haven't attempted to reproduce the exact case above, but following on from what you've uncovered, I think I see at least one fundamental challenge with the reference pool.

The problem, as I understand it, is that the reference pool can hold clone/drop pairs for an object. (Maybe there are other problems too which we're missing.) Normal code can free that object before the reference pool sync applies those pairs, and then when the pool runs, the object gets resurrected and re-dropped. Or worse. Either way, it's some form of use-after-free.

The following test panics; essentially it demonstrates a double free.

#[test]
fn test_clone_drop_crash() {
    use crate::prelude::*;

    #[pyclass(crate = "crate")]
    struct TestClass {
        dropped: bool,
    }

    impl Drop for TestClass {
        fn drop(&mut self) {
            let _ = dbg!(std::backtrace::Backtrace::force_capture());
            assert!(!self.dropped);
            self.dropped = true;
        }
    }

    Python::with_gil(|py| {
        let obj = Py::new(py, TestClass { dropped: false }).unwrap();

        assert_eq!(obj.get_refcnt(py), 1);

        // put a pair of clone / drop operations on the release pool for obj
        std::thread::scope(|s| {
            let _ = s
                .spawn(|| {
                    let _ = obj.clone();
                })
                .join();
        });

        assert_eq!(obj.get_refcnt(py), 1);
        assert!(!obj.borrow(py).dropped);
        assert_eq!(obj.get_refcnt(py), 1);

        drop(obj); // <-- crashes here, because
        // 1 - decreasing obj refcount to 0 immediately causes tp_dealloc to get called
        // 2 - calling tp_dealloc for this pyclass calls GILPool::new which calls update_counts
        // 3 - update_counts increases the count to 1 and then back to 0, causing a second nested tp_dealloc call & double drop
    });
}

@davidhewitt
Member

I think we have two options to fix this problem of clone/drop pairs causing temporary resurrection:

1 - remove the Clone implementation from Py
2 - make the reference pool update more algorithmically complex by identifying such pairs and eliding reference count operations involving them

If we can find a design which is not prohibitively expensive for performance, option 2 seems more user-friendly. So it'd be worth at least considering what implementation is available to us.
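Option 2 could, in principle, coalesce pending operations before applying them. The sketch below is hypothetical and std-only, not PyO3's actual implementation: it nets out deferred increments and decrements per object address, so a matched clone/drop pair cancels without ever touching the real reference count. (As the following comments point out, this only helps if both halves of a pair are actually in the pool when it is flushed.)

```rust
use std::collections::HashMap;

/// Pending deferred reference-count operations, keyed by the object's
/// address: +1 for a deferred clone, -1 for a deferred drop.
#[derive(Default)]
struct ReferencePool {
    pending: HashMap<usize, i64>,
}

impl ReferencePool {
    fn record_clone(&mut self, obj: usize) {
        *self.pending.entry(obj).or_insert(0) += 1;
    }

    fn record_drop(&mut self, obj: usize) {
        *self.pending.entry(obj).or_insert(0) -= 1;
    }

    /// Flush the pool, returning the *net* delta per object. Matched
    /// clone/drop pairs cancel out and produce no refcount traffic at
    /// all, so a freed object with a balanced pair is never touched.
    fn flush(&mut self) -> Vec<(usize, i64)> {
        self.pending
            .drain()
            .filter(|&(_, delta)| delta != 0)
            .collect()
    }
}

fn main() {
    let mut pool = ReferencePool::default();
    pool.record_clone(0x1000); // deferred clone of object A
    pool.record_drop(0x1000);  // deferred drop of object A: cancels out
    pool.record_clone(0x2000); // deferred clone of object B
    println!("{:?}", pool.flush()); // only object B produces a delta
}
```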

@adamreichold
Member

calling tp_dealloc for this pyclass calls GILPool::new which calls update_counts

This happens because you use a #[pyclass] for the test, right? If you elide the dropped flag and use e.g. a built-in type like PyString, the double-dealloc would still be there, but manifest itself much later when some random taking of the GIL updates the counts?

@adamreichold
Member

adamreichold commented Apr 21, 2024

make the reference pool update more algorithmically complex by identifying such pairs and eliding reference count operations involving them

I don't think this will work as we cannot rely on getting these updates in pairs, e.g. the following reduced test case

#[test]
fn test_clone_drop_crash() {
    use crate::prelude::*;

    #[pyclass(crate = "crate")]
    struct TestClass {
        dropped: bool,
    }

    impl Drop for TestClass {
        fn drop(&mut self) {
            self.dropped = true;
        }
    }

    Python::with_gil(|py| {
        let obj = Py::new(py, TestClass { dropped: false }).unwrap();

        assert_eq!(obj.get_refcnt(py), 1);

        // put a clone operation on the release pool for obj
        let obj_clone = std::thread::scope(|s| s.spawn(|| obj.clone()).join().unwrap());

        drop(obj_clone); // immediately apply the drop here with the GIL

        assert_eq!(obj.get_refcnt(py), 1); // fails because the reference count underflowed
    });
}

will fail with

thread 'gil::tests::test_clone_drop_crash' panicked at src/gil.rs:968:9:
assertion `left == right` failed
  left: 140367188162640
 right: 1

because we apply the decrement before we had any chance to actually look at the increment.

I think we never ran into this because the common pattern is to wrap such sections in allow_threads, which works around this. But my conclusion here would be that to provide the service the global reference pool promises, we would actually have to "update the counts" before each and every reference count operation. (And I am not even sure whether this means only operations performed by us via Py, or any reference count update at all.)
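The ordering problem can be modeled without PyO3 at all. The following is a std-only simulation, not PyO3's code: a deferred increment sits in a pool while a decrement is applied eagerly, so the visible count dips below the true number of live references until the pool is flushed — which, in CPython, would already have triggered tp_dealloc.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Mutex;

// Simulated object refcount plus a pool of deferred increments, standing
// in for Py_INCREF/Py_DECREF and PyO3's global reference pool.
static REFCNT: AtomicI64 = AtomicI64::new(1);
static DEFERRED_INCREFS: Mutex<i64> = Mutex::new(0);

fn clone_without_gil() {
    // No GIL held: the increment cannot be applied, so it is only queued.
    *DEFERRED_INCREFS.lock().unwrap() += 1;
}

fn drop_with_gil() {
    // GIL held: the decrement is applied immediately.
    REFCNT.fetch_sub(1, Ordering::SeqCst);
}

fn flush_pool() {
    // Applying the queued increments restores the true count.
    let mut pending = DEFERRED_INCREFS.lock().unwrap();
    REFCNT.fetch_add(*pending, Ordering::SeqCst);
    *pending = 0;
}

fn main() {
    clone_without_gil();
    drop_with_gil();
    // The count reads 0 even though one live reference still exists; in
    // CPython this would already trigger deallocation.
    println!("before flush: {}", REFCNT.load(Ordering::SeqCst));
    flush_pool();
    println!("after flush: {}", REFCNT.load(Ordering::SeqCst));
}
```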

@adamreichold
Member

Considering the loopholes, the performance impact, and the possibility of solving this problem via a cooperating GIL-less Python interpreter, I fear we need to drop the reference pool unconditionally.

I think we might still want to keep the Clone impl though because we cannot remove the Drop impl, so we have to decide on what to do about the reference count updates without the GIL in any case, i.e. finish our discussion in #4095 albeit under different circumstances.

(Funny coincidence to get this bug report after discussing what to do about the reference pool. But then again, with the GIL pool on the way out, the reference pool is the last remnant of the "global state" memory management style we seem to be moving away from.)

@davidhewitt
Member

I think we never ran into this because the common pattern is to wrap such sections in allow_threads which works around this.

This, and I think what has probably happened is that by removing the overhead and delayed reference counting of the GILPool, we have now unmasked these other problems.

@davidhewitt
Member

Considering the loopholes, its performance impacts and the possibility of solving this problem via a cooperating GILless Python interpreter, I fear we need to drop the reference pool unconditionally.

I agree that the long term seems to be that we will remove the reference pool, assuming that GIL-less Python is successful.

I think we might still want to keep the Clone impl though because we cannot remove the Drop impl, so we have to decide on what to do about the reference count updates without the GIL in any case, i.e. finish our discussion in #4095 albeit under different circumstances.

Based on what we've seen here, my opinion is that:

  • Drop is necessary, even if it is awkward to handle without the GIL
  • Clone is convenient, but I now believe it is fundamentally incorrect to defer it without the GIL

@adamreichold
Member

Clone is convenient, and I now believe is fundamentally incorrect to defer without the GIL

I agree that we should not try to defer reference count updates. I think the convenience is warranted though, as user code cannot do anything else but call with_gil and then clone_ref. That has known deadlock issues, so if we remove the Clone impl, we should probably at least provide a try_with_gil so that user code can emulate its current mode of operation.

@Icxolu
Contributor

Icxolu commented Apr 21, 2024

I don't really have much experience here, but would it help if we removed the fast path from these operations and made reference count updates of Py always go through the pool, regardless of whether the GIL is held? If I understand the problem correctly, this could solve the two examples above (but there might be more fundamental problems I haven't thought of).

@adamreichold
Member

I see two problems: we do not control all reference count updates, as some will happen through Python/C code instead of our code paths. And this would undo the performance gains you worked so hard for as part of the Bound API.

@Crispy13
Author

@adamreichold

It seems that the segmentation fault does not occur anymore.

I'll leave a comment if I find something.

@Crispy13
Author

Crispy13 commented Apr 22, 2024

Sadly, in my case this didn't solve the error.

Python 3.9.15
If I return PyBytes instead of PyString, the error seems to go away.

* edit
In the code I actually use, the inner function (rust_function in the reproducer) uses a Cython library. In that case PyBytes also failed.

@adamreichold
Member

Sadly, In my case, this didn't solve the error.

You could give #4095 a try if you feel adventurous. But the semantics are different and you probably need to change your code since you cannot Clone in background threads any longer.

If I return PyBytes instead of PyString, the error seems to go away.

I don't think this has anything to do with the specific types in question, as our global reference pool is generally unsound and you should be able to hit this with any type.

I would expect that wrapping any background processing in allow_threads should fix most of these issues, but until #4095 is finished there are no guarantees.
