OsStr and Path conversions #1379

kangalio · 2021-01-10T17:54:41Z

Implements conversions between OsStr/OsString/Path/PathBuf and Python strings: #1377

This is my first PR on this project and my first interaction with the codebase.

This PR is a draft because in several places of the code I'm not sure if I did the right thing. Those places are marked with a // HELP comment. I'd greatly appreciate it if an experienced PyO3 maintainer could check those places.

davidhewitt

Thanks very much for this PR! This is a really great start. I've given it a thorough read and got various feedback and suggestions from my time spent in battle with the PyO3 codebase 💂.

CHANGELOG.md

guide/src/conversions/tables.md

src/types/mod.rs

src/types/osstr.rs

src/types/path.rs

src/types/osstr.rs

kangalio · 2021-01-10T22:45:58Z

Wow, thank you!! That is one detailed code review. I will go over it tomorrow and implement all the suggestions.

src/types/osstr.rs

davidhewitt · 2021-01-11T07:56:03Z

I was thinking this morning that we should check to understand carefully the differences in behavior on unix vs windows. If String -> OsString -> PyString gives different results on the two platforms this might trip up users.

Also pathlib.Path doesn't have a C-API, but if it gets one we probably want to make Rust paths convert back and forth with those instead of strings. Potentially worth keeping in mind.

kangalio · 2021-01-11T10:08:47Z

I was thinking this morning that we should check to understand carefully the differences in behavior on unix vs windows. If String -> OsString -> PyString gives different results on the two platforms this might trip up users.

Also pathlib.Path doesn't have a C-API, but if it gets one we probably want to make Rust paths convert back and forth with those instead of strings. Potentially worth keeping in mind.

I think this can be taken care of using unit or integration tests. We could feed many different strings through the String -> OsString -> PyString conversion and back. This test will be run on all OSes by CI, so we know if something breaks somewhere.

Also, is there a reason you reacted with "confused" to my previous comment? Now I am confused 😄
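The round-trip check suggested above could look roughly like the following. This is a minimal sketch using only the standard library, so it covers String -> OsString -> String and leaves out the PyString leg that the real test would add. The helper names `roundtrip`, `all_roundtrip` and `non_utf8_is_lossy` are made up for illustration and are not part of the PR.

```rust
use std::ffi::OsString;

/// String -> OsString -> String. Returns None if the OS string is not valid Unicode.
/// (Hypothetical helper; the real test would also pass through PyString.)
fn roundtrip(s: &str) -> Option<String> {
    let os: OsString = OsString::from(s);
    os.into_string().ok()
}

/// True if every sample survives the round trip unchanged.
fn all_roundtrip(samples: &[&str]) -> bool {
    samples.iter().all(|s| roundtrip(s).as_deref() == Some(*s))
}

/// Unix only: an OsStr built from non-UTF-8 bytes cannot be viewed as &str,
/// and the lossy conversion substitutes U+FFFD for the invalid byte.
#[cfg(unix)]
fn non_utf8_is_lossy() -> bool {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;

    let raw = OsStr::from_bytes(b"caf\xE9");
    raw.to_str().is_none() && raw.to_string_lossy() == "caf\u{FFFD}"
}
```

Calling `all_roundtrip(&["", "ascii", "héllo", "日本語", "🦀"])` on every CI platform would flag any string that does not survive the conversion, while the Unix-only helper shows the kind of platform-specific input that valid Rust strings alone never exercise.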

kangalio · 2021-01-11T15:13:23Z

By the way, why do all Rust->Python conversions have to be duplicated? There's two traits to implement: IntoPy<PyObject> and ToPyObject and I don't really see the meaningful difference.

This is a bit awkward in the tests because I essentially have to test everything twice to account for all the trait implementations.

kngwyu · 2021-01-11T16:34:46Z

why do all Rust->Python conversions have to be duplicated?

It's the common consume / don't-consume pattern in Rust, though it sometimes does not work nicely.
I think there are two main problems here: (1) the generic parameter of IntoPy is not used effectively, and (2) we rarely need the consuming "into" operation. Both should be discussed at some point.

I essentially have to test everything twice

Since they share the implementation, I don't think we need to test all things twice.
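To make the consume / don't-consume split concrete, here is a small sketch. The traits `ToObj` and `IntoObj` are hypothetical stand-ins (they are not PyO3's `ToPyObject` and `IntoPy`, and they return a `String` instead of a Python object); the point is only that one takes `&self`, the other takes `self`, and the consuming impl can delegate to the borrowing one.

```rust
// Hypothetical stand-ins for a borrowing/consuming conversion pair.
// These are NOT PyO3's real traits.
trait ToObj {
    /// Converts without consuming the value.
    fn to_obj(&self) -> String;
}

trait IntoObj {
    /// Converts by consuming the value.
    fn into_obj(self) -> String;
}

struct Name(String);

impl ToObj for Name {
    fn to_obj(&self) -> String {
        format!("py:{}", self.0)
    }
}

impl IntoObj for Name {
    fn into_obj(self) -> String {
        // Delegates to the borrowing conversion, so both paths share one implementation.
        self.to_obj()
    }
}
```

When the consuming impl is a one-line delegation like this, testing the borrowing path thoroughly and the consuming path once is arguably enough.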

kngwyu · 2021-01-11T16:36:25Z

src/conversions/path.rs

+            test_roundtrip::<&Path>(py, path);
+            test_roundtrip::<Cow<'_, Path>>(py, Cow::Borrowed(path));
+            test_roundtrip::<Cow<'_, Path>>(py, Cow::Owned(path.to_path_buf()));
+            test_roundtrip::<PathBuf>(py, path.to_path_buf());


If .as_ref() is called at the end anyway, do we need to test all of these?

.as_ref() is only called to verify the end result. The conversion is done on the types as is.

kangalio · 2021-01-11T16:49:17Z

It seems like the codecov check failure is a false positive. The places it marked as "not covered by tests" are actually covered by the tests. Perhaps codecov is getting confused by the generic trickery in the tests.

kngwyu · 2021-01-11T19:26:30Z

src/conversions/osstr.rs

+
+// TODO: move to other module to prevent accidentally circumventing the new function?
+#[cfg(windows)]
+struct DropGuard<T>(*mut T);


I don't think it's good to abstract such a specific operation.
In this struct, *mut T must implicitly be a pointer into the Python heap without any reference to Python objects, so constructing this object is unsafe.
I therefore recommend making it more limited, without a new constructor, and placing it in the function scope, if you prefer this solution to using a Vec.

Agreed, this type is super unsafe so I'd rather it was inside the function too. That way only the function can misuse it.
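A function-local guard of the kind discussed here might look like the sketch below. It is an illustration only: the buffer is allocated with `Box` rather than by the Python allocator, and the names `sum_via_raw_buffer`, `Guard` and `FREED` are invented for the example. What it shows is that a guard type declared inside the function cannot be constructed (or misused) anywhere else.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many buffers the guard has freed (only used to observe the drop).
static FREED: AtomicUsize = AtomicUsize::new(0);

fn sum_via_raw_buffer(values: &[u16]) -> u32 {
    // The guard is declared inside the function, so no other code can name it.
    struct Guard(*mut [u16]);

    impl Drop for Guard {
        fn drop(&mut self) {
            // SAFETY: the pointer came from Box::into_raw below and is freed exactly once.
            unsafe { drop(Box::from_raw(self.0)) };
            FREED.fetch_add(1, Ordering::SeqCst);
        }
    }

    // Stand-in for a buffer handed out by a C allocator.
    let raw: *mut [u16] = Box::into_raw(values.to_vec().into_boxed_slice());
    let _guard = Guard(raw);

    // SAFETY: `raw` stays valid until `_guard` is dropped at the end of the function.
    let slice: &[u16] = unsafe { &*raw };
    slice.iter().map(|&v| u32::from(v)).sum()
}
```

The buffer is released on every exit path, including early returns and panics, without exposing an unsafe-to-construct type to the rest of the module.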

davidhewitt · 2021-01-11T22:06:36Z

Also, is there a reason you reacted with "confused" to my previous comment? Now I am confused 😄

Hehe nope that's just a fat-finger error when checking on the conversation on my phone!

There's two traits to implement: IntoPy<PyObject> and ToPyObject and I don't really see the meaningful difference.

Yes; I would love to rework this pair to also potentially model fallible conversions, but to avoid ecosystem churn until we're really sure what a better design is I think best to leave as-is!

It seems like the codecov check failure is a false positive. The places it marked as "not covered by tests" are actually covered by the tests. Perhaps codecov is getting confused by the generic trickery in the tests.

I think #[inline] functions can have issues with coverage. Probably the generics in the tests got inlined too. Don't need to worry too hard about it; it's annoying and I'd love to improve that one day, but it's fine for now.

Also, I was thinking we might want to consider whether PyString::to_string_lossy gives the same output as converting first to OsString and then calling OsStr::to_string_lossy. Users might assume they're equivalent, though I'm not sure if we should expect them to be. Perhaps we can consider changing PyString::to_string_lossy implementation in this PR.

konstin · 2021-01-11T22:59:01Z

There's two traits to implement: IntoPy<PyObject> and ToPyObject and I don't really see the meaningful difference.

Yes; I would love to rework this pair to also potentially model fallible conversions, but to avoid ecosystem churn until we're really sure what a better design is I think best to leave as-is!

My original plan was to have IntoPy<T> and FromPy<T> mirror std's excellent Into<T> and From<T>, with everything equivalent except for an additional GIL token parameter. I also removed some of the conversion traits (such as IntoPyObject and IntoPyTuple), but I never finished so the current design is kinda half-baked.

davidhewitt · 2021-01-12T07:35:18Z

However I'm wondering why we can assume that PyUnicode_AsWideChar won't return any error in the second invocation?

Also, if I understand the Python docs correctly, the return value of PyUnicode_AsWideChar is the number of bytes read, not the total number of bytes in the string. If that is true, won't the size variable be zero and the whole code blow up?

Docs seem to be slightly misleading in this case - the implementation of PyUnicode_AsWideChar has slightly more helpful docs: https://github.com/python/cpython/blob/fb35fa49d192368e94ffec09c092260ed0fea2e1/Objects/unicodeobject.c#L3270

From that implementation it looks like that function will never raise an exception (as long as the first argument is a valid unicode object), and also that doc describes that passing NULL for the buffer causes the full string size to be returned.
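The "ask for the size first, then fill the buffer" calling convention described above can be sketched in plain Rust. The function `as_wide` below is a hypothetical stand-in that only mimics that contract; it is not the C-API function, and for simplicity it ignores the terminating null that the real size query accounts for.

```rust
/// Hypothetical stand-in for a size-query-then-fill API.
/// With no buffer it returns the number of UTF-16 units required;
/// with a buffer it copies as many units as fit and returns the count written.
fn as_wide(src: &str, buf: Option<&mut [u16]>) -> usize {
    match buf {
        None => src.encode_utf16().count(),
        Some(out) => {
            let mut written = 0;
            for (slot, unit) in out.iter_mut().zip(src.encode_utf16()) {
                *slot = unit;
                written += 1;
            }
            written
        }
    }
}

/// Two-call pattern: query the size, allocate, then fill.
fn to_wide_vec(src: &str) -> Vec<u16> {
    let size = as_wide(src, None);
    let mut buf = vec![0u16; size];
    let written = as_wide(src, Some(buf.as_mut_slice()));
    debug_assert_eq!(written, size);
    buf
}
```

Because the first call reports the full required size rather than zero, the buffer allocated for the second call is always large enough.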

kangalio · 2021-01-12T17:56:12Z

I'm currently working on something else but I hope to continue work on this in the next few days

kangalio · 2021-01-12T22:18:27Z

Also, I was thinking we might want to consider whether PyString::to_string_lossy gives the same output as converting first to OsString and then calling OsStr::to_string_lossy. Users might assume they're equivalent, though I'm not sure if we should expect them to be. Perhaps we can consider changing PyString::to_string_lossy implementation in this PR.

I looked at the source code of Python (unicodeobject.c) in an attempt to figure out what exactly the "surrogatepass" error handler does, which is used in PyO3's PyString::to_string_lossy, but I couldn't work it out. The Python documentation isn't helping either.

In other words, I'm not able to verify if PyString::to_string_lossy does the same thing as PyString -> OsString -> OsString::to_string_lossy, so at least I can't do it in this PR.

kangalio · 2021-01-20T18:17:51Z

Is there anything left to work on in this PR?

kngwyu

LGTM, thanks!

davidhewitt

I think we need to adjust the set of traits provided just slightly, see other comments.

Re PyString::to_string_lossy - I'm going to open a separate PR for that case. I ran a quick test script:

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyString};
use std::borrow::Cow;
use std::error::Error;
use std::ffi::OsString;

fn main() -> PyResult<()> {
    Python::with_gil(|py| -> PyResult<()> {
        let locals = PyDict::new(py);
        py.run(
            r#"x = '\udcfa\udcfb\udcfc\udcfd\udcfe\udcff'"#,
            None,
            Some(locals),
        )?;
        let py_str = locals.get_item("x").unwrap().downcast::<PyString>()?;
        dbg!(py_str, py_str.len()?);
        let os_string: OsString = py_str.extract()?;
        dbg!(&os_string, &os_string.len());
        let string: Cow<str> = py_str.to_string_lossy();
        dbg!(&string, &string.chars().count());
        Ok(())
    })
}

The output was this:

[src/main.rs:16] py_str = '\udcfa\udcfb\udcfc\udcfd\udcfe\udcff'
[src/main.rs:16] py_str.len()? = 6
[src/main.rs:18] &os_string = "\xFA\xFB\xFC\xFD\xFE\xFF"
[src/main.rs:18] &os_string.len() = 6
[/home/david/dev/pyo3/src/types/string.rs:86] bytes = b'\xed\xb3\xba\xed\xb3\xbb\xed\xb3\xbc\xed\xb3\xbd\xed\xb3\xbe\xed\xb3\xbf'
[src/main.rs:20] &string = "������������������"
[src/main.rs:20] &string.chars().count() = 18

It looks like the "surrogatepass" error handler literally hands the three-byte surrogate sequences to Rust, which then results in three replacement characters being produced by String::from_utf8_lossy for each one. I think this is wrong - Python thinks each of these surrogates is a single code point, but each is currently treated as three by PyString::to_string_lossy.

I think that using PyUnicode_EncodeFSDefault is probably better and will then match PyString -> OsString -> String.
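The replacement-character count can be reproduced with the standard library alone, without Python. The sketch below feeds the same bytes shown in the debug output to `String::from_utf8_lossy`; the names `SURROGATEPASS_BYTES`, `lossy_char_count` and `only_replacement_chars` are invented for the example.

```rust
/// The bytes from the debug output above: six lone surrogates
/// (U+DCFA..=U+DCFF), each encoded as three bytes.
const SURROGATEPASS_BYTES: &[u8] =
    b"\xed\xb3\xba\xed\xb3\xbb\xed\xb3\xbc\xed\xb3\xbd\xed\xb3\xbe\xed\xb3\xbf";

/// Number of chars produced by the lossy UTF-8 decode.
fn lossy_char_count(bytes: &[u8]) -> usize {
    String::from_utf8_lossy(bytes).chars().count()
}

/// True if the lossy decode produced nothing but U+FFFD.
fn only_replacement_chars(bytes: &[u8]) -> bool {
    String::from_utf8_lossy(bytes).chars().all(|c| c == '\u{FFFD}')
}
```

Here `lossy_char_count(b"\xed\xb3\xba")` is 3 and `lossy_char_count(SURROGATEPASS_BYTES)` is 18, matching the count in the output above: a 0xED lead byte followed by 0xB3 is not valid UTF-8, so each of the three bytes is replaced separately.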

davidhewitt · 2021-01-23T09:16:08Z

src/conversions/osstr.rs

+    }
+}
+
+impl<'a> IntoPy<PyObject> for &'a OsString {


This seems like an implementation that's not strictly necessary. What's the motivation for having it?

types/string.rs has an equivalent implementation. I modeled mine on those trait implementations.

Ahh ok.

I'm not entirely sure why that impl exists... it might be because #[pyo3(get)] on String fields needs it? Not sure.

If we can't figure out why we need this impl, I'd rather skip it for now. We can always add it later!

davidhewitt · 2021-01-23T09:19:00Z

src/conversions/path.rs

+    }
+}
+
+impl<'a> IntoPy<PyObject> for &'a PathBuf {


Same question as for &'_ OsString.

Same thing; I modeled mine on the trait implementations for String.

davidhewitt · 2021-01-23T09:20:30Z

src/conversions/osstr.rs

+    }
+}
+
+impl ToPyObject for Cow<'_, OsStr> {


Also needs IntoPy<PyObject> for Cow<'_, OsStr>.

Haha same thing, seems like types/string.rs is missing this as well

Ah - if you're willing, can you add it there also? 🙏

I think that adding IntoPy for Cow makes sense (as it's the trait needed to be able to return this type from #[pyfunction]).

guide/src/conversions/tables.md

davidhewitt · 2021-01-23T10:11:50Z

src/conversions/osstr.rs

+            let fs_encoded_bytes: &crate::types::PyBytes = unsafe {
+                ob.py()
+                    .from_borrowed_ptr(ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr()))
+            };


PyUnicode_EncodeFSDefault returns a new reference, so we need to own this pointer or we leak memory. I suggest using Py<PyBytes>.

Suggested change

-            let fs_encoded_bytes: &crate::types::PyBytes = unsafe {
-                ob.py()
-                    .from_borrowed_ptr(ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr()))
-            };
+            let fs_encoded_bytes: Py<crate::types::PyBytes> = unsafe {
+                Py::from_owned_ptr(ob.py(), ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr()))
+            };

I will apply the suggestion. If you have time, I would be interested to hear why one couldn't just replace from_borrowed_ptr with from_owned_ptr.

Sure thing. from_owned_ptr returns &PyAny (or another native type reference) where the owned pointer has to be stored by PyO3 in a thread-local vector. In comparison, Py (and PyObject, aka Py<PyAny>) directly hold the owned pointer inside them. This is marginally more efficient, and also means that the temporary bytes will be cleaned up immediately, instead of when PyO3 has a chance to clean up its internal vector safely.

This has been a long-standing thorn in PyO3's API imo - there's discussion at #1056 and #1308 where I hope to eventually remove this difference and make everything as efficient as possible!
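The difference in cleanup timing can be illustrated with a toy model in plain Rust. Nothing here is PyO3 code: `Tracked` stands in for an owned Python object, a `Vec` stands in for the thread-local pool, and the function names are invented. The sketch only shows that a value held directly is released at the end of its scope, while a value parked in a pool lives until the pool is cleared.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Toy stand-in for an owned object; bumps a shared counter while alive.
struct Tracked {
    live: Rc<Cell<usize>>,
}

impl Tracked {
    fn new(live: &Rc<Cell<usize>>) -> Self {
        live.set(live.get() + 1);
        Tracked { live: Rc::clone(live) }
    }
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.live.set(self.live.get() - 1);
    }
}

/// Direct ownership: the object is released as soon as its owner goes out of scope.
fn live_after_direct_owner() -> usize {
    let live = Rc::new(Cell::new(0));
    {
        let _owned = Tracked::new(&live);
    }
    live.get()
}

/// Pooled ownership: the object stays alive until the pool itself is cleared.
/// Returns the live count before and after clearing the pool.
fn live_while_pooled() -> (usize, usize) {
    let live = Rc::new(Cell::new(0));
    let mut pool: Vec<Tracked> = Vec::new();
    {
        pool.push(Tracked::new(&live));
    }
    let before_clear = live.get();
    pool.clear();
    (before_clear, live.get())
}
```

In this model `live_after_direct_owner()` returns 0, while `live_while_pooled()` returns `(1, 0)`: the pooled object outlives the scope that created it.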

src/conversions/path.rs

davidhewitt · 2021-01-23T13:36:54Z

I think that using PyUnicode_EncodeFSDefault is probably better and will then match PyString -> OsString -> String.

I changed my mind on this - it actually would change PyString::to_string_lossy to potentially throw exceptions, which would not be great.

Also the inconsistency does not seem to matter; I found someone else who stumbled across similar conversions at rust-lang/rust#56786 - in the end they decided it was better left as-is.

At least the current implementation of PyString::to_string_lossy is infallible and exactly consistent with what String::from_utf8_lossy does.

davidhewitt · 2021-01-24T23:37:50Z

src/conversions/osstr.rs

+            unsafe {
+                // This will not panic because the data from encode_wide is well-formed Windows
+                // string data
+                py.from_owned_ptr::<PyString>(ffi::PyUnicode_FromWideChar(


Just noticed that this can use PyObject::from_owned_ptr, which will remove the need for .into() also.

Suggested change

-                py.from_owned_ptr::<PyString>(ffi::PyUnicode_FromWideChar(
+                PyObject::from_owned_ptr(py, ffi::PyUnicode_FromWideChar(

davidhewitt · 2021-01-24T23:38:40Z

src/conversions/osstr.rs

+
+            // Decode from Python's lossless bytes string representation back into raw bytes
+            let fs_encoded_bytes: Py<crate::types::PyBytes> = unsafe {
+                Py::from_owned_ptr(ob.py(), ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr()))


Using crate::Py here may allow you to avoid the painful OS-specific import at the top.

(sorry about the -D warnings in CI - I think on balance it's useful to help keep the PyO3 code health up, even if it's a little frustrating at times 😬)

davidhewitt · 2021-01-24T23:39:16Z

src/conversions/osstr.rs

+    fn extract(ob: &PyAny) -> PyResult<Self> {
+        #[cfg(not(windows))]
+        {
+            let pystring = <PyString as PyTryFrom>::try_from(ob)?; // Cast PyAny to PyString


This line is shared between the two implementations so could be pulled out to the top before them both.
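Hoisting a shared step above platform-specific blocks could look like the following sketch. The function `platform_len` and its body are invented for illustration and have nothing to do with the actual extraction code; only the shape (one shared line, then one `#[cfg]` block per platform) mirrors the suggestion.

```rust
/// Hypothetical example: the shared step runs once, then exactly one
/// platform-specific block is compiled in and becomes the return value.
fn platform_len(s: &str) -> usize {
    // Shared between both platforms, so it lives above the cfg blocks.
    let trimmed = s.trim();

    #[cfg(windows)]
    {
        // Windows strings are counted in UTF-16 code units here.
        trimmed.encode_utf16().count()
    }
    #[cfg(not(windows))]
    {
        // Elsewhere the byte length is used.
        trimmed.len()
    }
}
```

For ASCII input such as `"  ab "` both branches give 2, so the function behaves the same on either platform while keeping the shared line in one place.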

davidhewitt

👍 I think there's some final decisions to be made on the traits to provide, and then this is good to merge. Thanks again!

davidhewitt

Thank you for the many rounds of iteration - this PR is looking great to me now 🚀

davidhewitt · 2021-02-14T08:15:00Z

I'm going to rebase and merge this. Thanks again!

kangalio force-pushed the master branch from 2222c1a to 9586c8a Compare January 10, 2021 21:06

kangalio marked this pull request as ready for review January 10, 2021 22:32

davidhewitt reviewed Jan 10, 2021

kngwyu reviewed Jan 11, 2021

kngwyu reviewed Jan 11, 2021

kngwyu approved these changes Jan 21, 2021

davidhewitt requested changes Jan 23, 2021

This was referenced Jan 23, 2021

pystring: use PyUnicode_AsUTF8AndSize always from Python 3.10 and up #1399

Merged

release: 0.13.2 #1400

Merged

davidhewitt reviewed Jan 24, 2021

davidhewitt approved these changes Jan 24, 2021

davidhewitt approved these changes Jan 25, 2021

Implement conversions for Path/PathBuf

fe9b462

davidhewitt force-pushed the master branch from 67df7bb to fe9b462 Compare February 14, 2021 08:22

davidhewitt merged commit 07c7624 into PyO3:master Feb 14, 2021

davidhewitt mentioned this pull request Feb 14, 2021

Implement PyClass for PathBuf and Path #1377

Closed
