Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ffi: define some cpython/unicodeobject bindings #1777

Merged
merged 4 commits into from Aug 14, 2021

Conversation

indygreg
Copy link
Contributor

pyo3 doesn't currently define various Unicode bindings that allow the
retrieval of raw data from Python strings. Said bindings are a
prerequisite to possibly exposing this data in the Rust API (#1776).
Even if those high-level APIs never materialize, the FFI bindings are
necessary to enable consumers of the raw C API to utilize them.

This commit partially defines the FFI bindings as defined in
CPython's Include/cpython/unicodeobject.h file.

I used the latest CPython 3.9 Git commit for defining the order
of the symbols and the implementation of various inline preprocessor
macros. I tried to be as faithful as possible to the original
implementation, preserving intermediate #defines as inline functions.

The structs are a bit wonky and probably warrant the most review
scrutiny. I haven't tested this code thoroughly.

Missing symbols have been annotated with skipped and symbols currently
defined in src/ffi/unicodeobject.rs have been annotated with move.

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. It's a particularly gnarly one; the representation of unicode objects is one of the more aggressively optimized parts of the Python interpreter. On top of that, any work here needs to be aware of PEP 623 (affecting Python 3.12).

As-is, it looks to me that the PyASCIIObjectState struct is incorrect. I'm not entirely sure what the correct solution is. (See the review comment below.)

Because there's some very complex inline macros going on here, I would like to either see tests within this ffi module, or a thoroughly-tested safe API for #1776. I'm particuarly worried there's OS, arch and Python-version edge cases for this low-level code. For example this'll presumably crash on Python 3.12, when that comes along, so we'll need to catch that with tests.

symbols currently defined in src/ffi/unicodeobject.rs have been annotated with move.

I understand not wanting to edit the existing code, however I'd much rather that these were moved as part of this PR, perhaps as a standalone commit.

Comment on lines 31 to 42
// This type is defined inline in CPython source.
#[repr(C)]
pub struct PyASCIIObjectState {
// SSTATE_* constants.
pub interned: c_uint,
// PyUnicode_*_KIND constants.
pub kind: c_uint,
pub compact: c_uint,
pub ascii: c_uint,
pub ready: c_uint,
pub _a: c_uint,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct looks wrong; the upstream CPython source implements this state as a 32-bit bit field.

It's not possible to have an equivalent bitfield definition in Rust (e.g. see rust-lang/rfcs#314)

Looks like the cpython crate defines the state member as a single u32 field and uses bitwise operators to read values at various offsets. That seems to make sense, but as far as I'm aware the bitfield order isn't actually well-defined (https://stackoverflow.com/a/19376742) so that seems like it's actually a bug. I guess it'll work on some platforms and be UB on others?

I think some more research is needed on the correct way to handle this.

@indygreg
Copy link
Contributor Author

I was just typing up a comment talking about the limitations of this PR and your review beat me to it :)

Yes, the struct definition is wrong. And the macros could use some tests. I'll address your review feedback.

Also, I noticed that PyO3 is using soon-to-be-deprecated Python Unicode APIs. I'm unsure how much you care about that. If I proceed with #1776, I may have a go at flushing things out and improving the string handling story. Time will tell.

@davidhewitt
Copy link
Member

Also, I noticed that PyO3 is using soon-to-be-deprecated Python Unicode APIs.

Oh interesting, I'd mistakenly thought #1370 flushed them out! Yes, I'd really appreciate any help moving off deprecated APIs.

@indygreg indygreg force-pushed the unicode-apis branch 2 times, most recently from 3387187 to ce91e47 Compare August 11, 2021 00:23
Copy link
Contributor Author

@indygreg indygreg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the bitfield definition with this version.

I've added accessor functions for resolving each logical component of the bitfield to a c_uint. The macros/functions now use these accessor functions. I figured it was clearer this way rather than doing bitwise manipulation inline in each macro function.

Tests over bitfield APIs and macros have been added. The behavior asserted in the ASCII and UCS-4 tests seem reasonable to me. Although I wouldn't be surprised if the assertions are non-portable across Python implementations.

A 2nd commit has been introduced to move limited API symbols to the appropriate file.

@davidhewitt
Copy link
Member

👍 thanks, I'll give this a proper review when I get a chance, got a busy few days ahead. Would be great to have it rebased once #1780 is merged so we can see whether the bitfield order is indeed stable on all platforms tested on CI.

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the CI is passing on all platforms, so I'm reasonably convinced this is a safe way to handle the bitfield 👍

I've added some tweaks for PyPy and Python 3.10 below, I'll add them as a commit.

src/ffi/cpython/unicodeobject.rs Show resolved Hide resolved
src/ffi/cpython/unicodeobject.rs Show resolved Hide resolved
src/ffi/cpython/unicodeobject.rs Show resolved Hide resolved
src/ffi/cpython/unicodeobject.rs Show resolved Hide resolved
src/ffi/cpython/unicodeobject.rs Show resolved Hide resolved
src/ffi/cpython/unicodeobject.rs Show resolved Hide resolved
src/ffi/cpython/unicodeobject.rs Show resolved Hide resolved
@davidhewitt
Copy link
Member

Looks like I broke this - I'll fixup and merge later, sorry!

pyo3 doesn't currently define various Unicode bindings that allow the
retrieval of raw data from Python strings. Said bindings are a
prerequisite to possibly exposing this data in the Rust API (PyO3#1776).
Even if those high-level APIs never materialize, the FFI bindings are
necessary to enable consumers of the raw C API to utilize them.

This commit partially defines the FFI bindings as defined in
CPython's Include/cpython/unicodeobject.h file.

I used the latest CPython 3.9 Git commit for defining the order
of the symbols and the implementation of various inline preprocessor
macros. I tried to be as faithful as possible to the original
implementation, preserving intermediate `#define`s as inline functions.

Missing symbols have been annotated with `skipped` and symbols currently
defined in `src/ffi/unicodeobject.rs` have been annotated with `move`.

The `state` field of `PyASCIIObject` is a bitfield, which Rust doesn't
support. So we've provided accessor functions for retrieving these
fields' values. No accessor functions are present because you shouldn't
be touching these values from Rust code.

Tests of the bitfield APIs and macro implementations have been added.
All symbols which are canonically defined in cpython/unicodeobject.h
and had been defined in our unicodeobject.rs have been moved to our
corresponding cpython/unicodeobject.rs file.

This module is only present in non-limited build configurations, so
we were able to drop the cfg annotations as part of moving the
definitions.
@indygreg
Copy link
Contributor Author

Looks like I broke this - I'll fixup and merge later, sorry!

Twas a missing ". I amended your commit and force pushed.

@indygreg
Copy link
Contributor Author

Rust didn't seem to honor [allow(deprecated)] on an assert_eq!(...) macro. Unsure why. But moving the annotation to the #[test] function seems to make Python 3.10 happy. I dunno.

@indygreg
Copy link
Contributor Author

Ugh, something doesn't like PyUnicode_AsUTF8AndSize. Something about the annotations on its availability is off.

This API is available in all supported Python versions and isn't
deprecated.
@davidhewitt
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants