Changed PyByte::new_init and PyByteArray::new_init such that init can fail #1083

juntyr · 2020-08-06T08:50:02Z

This is a followup from #1074 to allow PyBytes::new_with and PyByteArray::new_with to fail during initialisation.

juntyr · 2020-08-06T08:58:57Z

The failing test is an interesting one. When I run that test in isolation, it succeeds. When I run it in combination with other tests it mostly fails but sometimes succeeds. Do you have an idea of where I introduced that probabilistic bug?

kngwyu

Thank you, but isn't it too complex?

kngwyu · 2020-08-06T08:56:18Z

src/types/bytearray.rs

@@ -23,36 +23,55 @@ impl PyByteArray {

    /// Creates a new Python `bytearray` object with an `init` closure to write its contents.
    /// Before calling `init` the bytearray is zero-initialised.
+    /// * If `init` returns `Err(e)`, `new_with` will return `Err(e)`.
+    /// * If `init` returns `Ok(new_len)`, the allocated bytearray will be truncated to length


Do we really want to truncate the bytes?
I cannot imagine any use case.

We can easily remove the truncation again. It was mentioned by @davidhewitt for a potential API, so I included it so we can see what it would look like. For the bytes, the idea is that since init implies you are still generating the content, you might not know the exact size of the bytearray / bytestring up front, just an upper bound.

Imagine you want to serialize some large dataset directly to Python. You may not know its final allocated length, so with this API you can pre-allocate the memory space in Python, write to it, and then return the amount of bytes written so that the array can be safely truncated.
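The use case above can be sketched without an interpreter. This is only an illustration: `new_with_truncate` is a hypothetical name, and a plain `Vec<u8>` stands in for the Python-allocated buffer; it is not the PyO3 API.

```rust
/// Hypothetical stand-in for a truncating `new_with`: allocates `max_len`
/// zeroed bytes, lets `init` fill them, then truncates to the length that
/// `init` reports. A `Vec<u8>` plays the role of the Python allocation.
fn new_with_truncate<E>(
    max_len: usize,
    init: impl FnOnce(&mut [u8]) -> Result<usize, E>,
) -> Result<Vec<u8>, E> {
    let mut buf = vec![0u8; max_len];
    // `?` forwards an `Err` from `init`; the buffer is dropped in that case.
    let written = init(buf.as_mut_slice())?;
    // Guard against a closure that reports more bytes than were allocated.
    assert!(written <= max_len, "init reported more bytes than allocated");
    buf.truncate(written);
    Ok(buf)
}
```

The caller only needs an upper bound on the size; the closure returns how many bytes it actually produced.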

I suppose that an alternative approach would be for the argument to init to be a mutable reference to some custom struct instead of &mut [u8]:

pub struct UninitializedPyBytes<'a>(*mut ffi::PyObject, Python<'a>);

impl<'a> UninitializedPyBytes<'a> {
    /// Resize the byte buffer, zero-initializing any additional storage allocated.
    /// May return Err on OOM.
    pub fn resize(&mut self, new_len: usize) -> PyResult<()> { /* ... */ }

    /// Access the underlying bytes.
    pub fn as_bytes(&mut self) -> &mut [u8] { /* ... */ }
}

We could also implement DerefMut<Target = [u8]>, and maybe Read and Write for this. Write could even support automatically growing the allocation.

Would we support the same interface for both PyBytes and PyByteArray? I don't think we need to implement io::Read here, though. Regarding automatic resizing by io::Write, should we follow the simple size × 2 rule? Would we also automatically truncate the bytes after writing? I found that this requires an additional struct that implements Write, references the bytes, and implements Drop, so that the truncation to the final length is guaranteed to happen once writing has completed.

Yep I guess Read is redundant. And yeah I was thinking the struct could have a private method which finishes the writing, truncates if needed, and returns a finished PyBytes object.
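A rough sketch of the writer idea discussed here, assuming a doubling growth rule and a `finish` method that truncates. `GrowingBytesWriter` and its methods are hypothetical names, and a `Vec<u8>` again stands in for the Python-owned allocation.

```rust
use std::io::{self, Write};

/// Hypothetical writer over a pre-allocated, zero-initialised buffer.
struct GrowingBytesWriter {
    buf: Vec<u8>, // allocated storage, zero-filled like the bytes in `new_with`
    len: usize,   // number of bytes actually written so far
}

impl GrowingBytesWriter {
    fn with_capacity(capacity: usize) -> Self {
        GrowingBytesWriter { buf: vec![0u8; capacity], len: 0 }
    }

    /// Finish writing: truncate to the written length and hand back the bytes.
    fn finish(mut self) -> Vec<u8> {
        self.buf.truncate(self.len);
        self.buf
    }
}

impl Write for GrowingBytesWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        let needed = self.len + data.len();
        if needed > self.buf.len() {
            // Double the allocation (the "size x 2" rule), or grow to fit
            // if doubling is still not enough.
            let new_size = needed.max(self.buf.len() * 2);
            self.buf.resize(new_size, 0);
        }
        self.buf[self.len..needed].copy_from_slice(data);
        self.len = needed;
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Making `finish` consume the writer means the truncation cannot be forgotten by anyone who wants the bytes back, which is one alternative to doing it in a `Drop` impl.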

Truncation has been removed in 4b3422e

kngwyu · 2020-08-06T08:59:16Z

src/types/bytearray.rs

+        let py = gil.python();
+        let locals = PyDict::new(py);
+
+        py.run(


I don't like this kind of test since it can be really unstable because of GC.
We can confirm that DECREF is called, and that is sufficient. I think testing the memory difference is overly cautious.

If just the comment is ok, I'll gladly remove that test :)

This test has been removed in 4b3422e

kngwyu · 2020-08-06T09:00:11Z

> The failing test is an interesting one. When I run that test in isolation, it succeeds. When I run it in combination with other tests it mostly fails but sometimes succeeds. Do you have an idea of where I introduced that probabilistic bug?

Simply, please do not assert the memory difference.

kngwyu · 2020-08-06T09:04:39Z

And I'm also doubtful about assuming init can fail.
Since the contents of PyBytes are always the same (= all bytes are zero), the failure can only depend on context other than the PyBytes.
In such a case, I think one should detect the failure before initializing the bytes.

juntyr · 2020-08-06T09:04:57Z

> > The failing test is an interesting one. When I run that test in isolation, it succeeds. When I run it in combination with other tests it mostly fails but sometimes succeeds. Do you have an idea of where I introduced that probabilistic bug?
>
> Simply, please do not assert the memory difference.

I don't think the memory difference assert is the problem (unless I'm misreading the error log). I think it is a more fundamental memory access issue.

juntyr · 2020-08-06T09:10:12Z

> And I'm also doubtful about assuming init can fail.
> Since the contents of PyBytes are always the same (= all bytes are zero), the failure can only depend on context other than the PyBytes.
> In such a case, I think one should detect the failure before initializing the bytes.

Personally, I am using the Python allocated bytes to serialise to with serde. As I only get write access to those bytes inside init, I have to serialise inside init. Serialisation can fail, however, in which case I would like new_with to gracefully fail and deallocate the bytes as well so I can handle the error outside. In this use case, I am unable to detect the failure before calling new_with.
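To make this use case concrete, here is a minimal sketch in plain Rust. Both `new_with` (a `Vec<u8>`-backed stand-in, not the PyO3 function) and `serialise_into` (a toy replacement for a serde serialiser) are hypothetical; the point is only that the failure is discovered while writing into the buffer, not before.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the fallible `new_with` (no truncation).
fn new_with<E>(
    len: usize,
    init: impl FnOnce(&mut [u8]) -> Result<(), E>,
) -> Result<Vec<u8>, E> {
    let mut buf = vec![0u8; len];
    // On `Err` the buffer is dropped and the error is forwarded to the caller.
    init(buf.as_mut_slice())?;
    Ok(buf)
}

/// Toy "serialiser": writes a length byte followed by the payload. It fails
/// if the output slice is too small, much like a serialiser running out of
/// space halfway through.
fn serialise_into(mut out: &mut [u8], payload: &[u8]) -> io::Result<()> {
    if payload.len() > usize::from(u8::MAX) {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too long"));
    }
    // `Write` for `&mut [u8]` advances the slice as bytes are written.
    out.write_all(&[payload.len() as u8])?;
    out.write_all(payload)
}
```

With this shape, `new_with(4, |b| serialise_into(b, b"abc"))` succeeds, while the same call with a 2-byte buffer returns the serialiser's error and leaves nothing allocated.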

davidhewitt · 2020-08-06T09:14:12Z

src/types/bytearray.rs

+                Err(e) => {
+                    // Deallocate pyptr to avoid leaking the bytearray
+                    ffi::Py_DECREF(pyptr);
+                    return Err(e);
+                }


An interesting thought about the UninitializedPyBytes API is that it could have a Drop impl which automatically deallocates the pointer, which would mean that it'd happen automatically on Err return without us having to do anything.
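A minimal sketch of that idea, assuming a guard type whose `Drop` impl does the release. `UninitBytes` is a hypothetical stand-in for `UninitializedPyBytes`, and a shared counter plays the role of the reference count so the effect is observable without Python.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical guard owning the buffer. `live_allocations` counts how many
/// buffers currently exist, standing in for the Python reference count.
struct UninitBytes {
    bytes: Vec<u8>,
    live_allocations: Rc<Cell<usize>>,
}

impl UninitBytes {
    fn new(len: usize, live_allocations: Rc<Cell<usize>>) -> Self {
        live_allocations.set(live_allocations.get() + 1);
        UninitBytes { bytes: vec![0u8; len], live_allocations }
    }
}

impl Drop for UninitBytes {
    fn drop(&mut self) {
        // In a real implementation this is where the DECREF would go.
        self.live_allocations.set(self.live_allocations.get() - 1);
    }
}

fn new_with<E>(
    len: usize,
    live_allocations: Rc<Cell<usize>>,
    init: impl FnOnce(&mut [u8]) -> Result<(), E>,
) -> Result<UninitBytes, E> {
    let mut guard = UninitBytes::new(len, live_allocations);
    // `?` returns early on `Err`; `guard` is dropped on the way out, so no
    // explicit cleanup branch is needed.
    init(guard.bytes.as_mut_slice())?;
    Ok(guard)
}
```

Compared with the `match` in the diff above, the error branch disappears entirely: cleanup is tied to the value's lifetime rather than to a particular control-flow path.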

davidhewitt · 2020-08-06T09:16:39Z

Segmentation fault is an interesting one - did you observe this locally?

juntyr · 2020-08-06T09:22:33Z

> Segmentation fault is an interesting one - did you observe this locally?

For me, the type of error also changes randomly but has included SEGFAULTs, the test returning Err (when an unwrap fails), etc.

davidhewitt · 2020-08-06T09:32:37Z

Having seen how this API evolves, it seems there are a few different tradeoffs in usage, plus some open questions about the best design for the API and what users want from it.

I'm beginning to wonder if this API should be moved to a separate crate for now, maybe pyo3-bytes-utils, which could be used to expose a few variants of the API, with/without truncate, error handling, etc. Maybe this way we can learn what's popular before we add a final design of this API to the pyo3 core.

If so, I'd be very happy to mention this crate in the README

juntyr · 2020-08-06T09:51:42Z

Exposing this in a separate crate would be an interesting approach, though I am not sure if I can promise to have the time to support it (I am currently doing an internship during which the original issue came up and will be doing my final year project at university next year). Still, I would be happy to explore different API designs. For the main pyo3 crate, should we leave new_with as it is or extend it with the reduced API update which allows init to return an Err which is forwarded (truncation would be removed from the API)?

davidhewitt · 2020-08-06T09:58:22Z

A small crate with just a couple of functions hopefully won't generate many support requests ;)

Perhaps removing the truncate option from pyo3 but allowing the API to be fallible is a good compromise. @kngwyu what route do you prefer?

kngwyu · 2020-08-06T13:36:49Z

> Personally, I am using the Python allocated bytes to serialise to with serde. As I only get write access to those bytes inside init, I have to serialise inside init. Serialisation can fail, however, in which case I would like new_with to gracefully fail and deallocate the bytes as well so I can handle the error outside. In this use case, I am unable to detect the failure before calling new_with.

OK, I'm happy to know the use-case.

> Perhaps removing the truncate option from pyo3 but allowing the API to be fallible is a good compromise.

Agreed.

> I'm beginning to wonder if this API should be moved to a separate crate for now, maybe pyo3-bytes-utils, which could be used to expose a few variants of the API, with/without truncate, error handling, etc.

Hmm ... 🤔
It sounds overdoing for me (though I'm not sure the word 'overdoing' is appropriate here).

kngwyu · 2020-08-06T13:38:09Z

For SIGSEGV: since this kind of bug is really difficult to debug, I'm going to debug it myself.

davidhewitt · 2020-08-06T14:42:08Z

src/types/bytearray.rs

            let buffer = ffi::PyByteArray_AsString(pyptr) as *mut u8;
            debug_assert!(!buffer.is_null());
            // Zero-initialise the uninitialised bytearray
            std::ptr::write_bytes(buffer, 0u8, len);
            // (Further) Initialise the bytearray in init
-            init(std::slice::from_raw_parts_mut(buffer, len));
-            pybytearray
+            match init(std::slice::from_raw_parts_mut(buffer, len)) {


I've just noticed that if init panics, then this will leak memory =(

I'm going to add Py::into_ref later today, which I think you should be able to use to avoid this case.

(Solution I'm thinking would be that instead of if pyptr.is_null(), you can call Py::from_owned_ptr_or_err to take ownership of the pointer. And then Py::into_ref(py) at the end of the function.)

Thanks for finding this issue! What will Py::into_ref do? Will I still have to call ffi::Py_DECREF(pyptr) on an Err in init to deallocate it or will this be taken care of as I now have a &PyBytes reference from the beginning?

You'll have Py<PyBytes> from the beginning, and .into_ref(py) will convert this into &PyBytes.

It'll take care of deallocating automatically 🚀
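The difference between manual cleanup and taking ownership up front can be sketched in plain Rust. Everything here is a simplified model: `Owned` is a hypothetical stand-in for an owning handle such as `Py<PyBytes>`, and a counter replaces the Python reference count.

```rust
use std::cell::Cell;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical owning handle: acquires on creation, releases on drop.
struct Owned<'a> {
    live: &'a Cell<usize>,
}

impl<'a> Owned<'a> {
    fn acquire(live: &'a Cell<usize>) -> Self {
        live.set(live.get() + 1);
        Owned { live }
    }
}

impl Drop for Owned<'_> {
    fn drop(&mut self) {
        self.live.set(self.live.get() - 1);
    }
}

/// Manual bookkeeping: the release is skipped if `init` panics.
fn init_manual(live: &Cell<usize>, init: impl FnOnce()) {
    live.set(live.get() + 1); // "allocate"
    init();
    live.set(live.get() - 1); // never reached when `init` unwinds
}

/// Ownership taken up front: the destructor also runs during unwinding.
fn init_owned(live: &Cell<usize>, init: impl FnOnce()) {
    let _owned = Owned::acquire(live);
    init();
}

/// Runs both variants with a panicking `init` and reports how many
/// allocations each one left behind, as `(manual, owned)`.
fn leaked_after_panic() -> (usize, usize) {
    let manual = Cell::new(0);
    let owned = Cell::new(0);
    let _ = catch_unwind(AssertUnwindSafe(|| {
        init_manual(&manual, || panic!("init panicked"))
    }));
    let _ = catch_unwind(AssertUnwindSafe(|| {
        init_owned(&owned, || panic!("init panicked"))
    }));
    (manual.get(), owned.get())
}
```

Under the default unwinding panic strategy, the manual variant is left with one outstanding allocation and the owned variant with none, which is the leak described above.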

Ah, I was slightly confused by having both Py::from_owned_ptr_or_err(py, ptr) and py.from_owned_ptr_or_err(ptr) :)

The code for the new_with function is actually starting to look very clean and concise - I look forward to using Py<T>::into_ref(py)

See #1098 - hopefully you'll have this API soon!

Py::into_ref is now available to use 🚀

I just hope I didn't mess up merging anything ...

I did fail in one way though and had to rewrite some of my changes (I had prototyped everything with a declaration of Py::into_ref) - so I might have missed something

kngwyu · 2020-08-08T05:23:08Z

We don't observe the segfault now.
So in the end it was because of the memory difference test?

juntyr · 2020-08-08T09:59:31Z

> We don't observe the segfault now.
> So in the end it was because of the memory difference test?

It must have been related to that test somehow - maybe the dropping of the PyByteArray created problems.

davidhewitt

LGTM! Thanks for your continued improvements to this API!

juntyr · 2020-08-12T18:43:55Z

@kngwyu What do you think about the modified API?

kngwyu · 2020-08-13T04:18:51Z

LGTM, thanks!

juntyr · 2020-08-13T08:33:39Z

@davidhewitt @kngwyu Thank you for your continued feedback! Just out of interest, do you have a rough schedule already for the next release of pyo3?

davidhewitt · 2020-08-13T20:46:46Z

I was wondering about this myself yesterday, so have opened an issue to collect opinions. See #1104

Changed PyByte::new_init and PyByteArray::new_init such that init can fail

aeceb18

kngwyu requested changes Aug 6, 2020

davidhewitt reviewed Aug 6, 2020

Simplified fallible PyBytes::new_with and PyByteArray::new_with API

4b3422e

davidhewitt reviewed Aug 6, 2020

davidhewitt mentioned this pull request Aug 10, 2020

Py::as_ref and Py::into_ref (remove AsPyRef) #1098

Merged

Merge remote-tracking branch 'upstream/master' into fallible_py_bytes_bytearray_new_with

e6dc4b2

juntyr requested a review from kngwyu August 11, 2020 21:53

davidhewitt approved these changes Aug 11, 2020

kngwyu approved these changes Aug 13, 2020

kngwyu merged commit 9ab7225 into PyO3:master Aug 13, 2020
