Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment: Allows to introspect Python modules from cdylib #3977

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Tpt
Copy link
Contributor

@Tpt Tpt commented Mar 22, 2024

This is a prototype for discussion. Code is not clean, buggy, and only a few features is implemented.

A missing piece in the story listed in #2454 is how tools like Maturin move the introspection information generated how by PyO3 into to type stubs files included in the built wheels.

I see three approaches for it:

  1. Make pyo3-macros generate a file with the stubs after having processed all macros of the crate. This has the advantage of being self-contained in the crate but falls short in cases like python classes declared in a crate but exposed in an other crate: there is no guarantees that proc macros of a crate and its dependencies are compiled in the same process and that proc macros will still be able to write files in the future (like with proposal to run them a WASM sandbox).
  2. Make pyo3-macros exports a C function that returns the stubs. The built libraries can then be loaded by Maturin that would call the function and write the stubs to a file. I wrote a quick experiment with a patch to PyO3 that generate a public C ABI __pyo3_stubs_MODULE_NAME function However, for the build system to execute it, a compatible Python interpreter must be present to link with and a compatible CPU or VM to run it, making generation when doing cross-compilation very hard. I guess it's what Python Interface (.pyi) generation and runtime inspection #2454 was heading toward.
  3. Add the introspection data in a custom section of the built cdylib. It requires the introspection data to be completely static. However, we can easily build a library that extract and parses these sections and generate the stub file. Build tool like Maturin would just need to call this library to get the stubs. Cross compilation support is easy as soon as a parser for the built cdylib format exists. However, this might make the generated binaries bigger because of the extra data. This is what the MR experiments with.

Architecture:

  • Each PyO3 element (pymodule, pyclass, pyfunction...) generates a segment in the built binary pyo3_data0 section that contains a JSON "chunk" with the introspection data. Code in pyo3-macros-backend/src/introspection.rs. I had to do some bad hack to generate the segments properly via Rust static elements.
  • JSON chunks can refer to other chunks via global ids. These ids are stored in the PYO3_INTROSPECTION_ID constants, allowing the code building the JSON chunk to get the global id of eg. a class C via C::PYO3_INTROSPECTION_ID. This allows chunks to refer to other chunks easily (eg. to list module elements). A bad hack is used to generate the ids (TypeId::of would have been a nicer approach but is not const on Rust stable yet).
  • The 0 at the end of pyo3_data0 is a version number, allowing breaking changes in the introspection format.
  • The pyo3-introspection crate parses the binary using goblin (library also used by Maturin), fetches all the pyo3_data0 segments (only implemented for macOS Match-O in this experiment), and builds nice data structures.
  • Not done yet: the pyo3-introspection crate would implement a to_stubs function converting the data structures to Type stubs.
  • The pyo3-introspection has an integration tests doing introspection of the pyo3-pytests library.

Experiment limitations (ie TODO if we want to move this forward):

  • only macOS Match-O binaries are supported
  • the only introspection data created is the list of function, classes and submodules inside of a module created using the new declarative module syntax
  • code to generate the segments content inside of Rust static elements is awful.
  • no stub file serialization code yet (but this should be easy)
  • broken CI
  • this approach requires to convert FromPyObject::type_input into an associated constant or a const function and similarly for IntoPy::type_output. This is mandatory in order to allow to make use of them in the static values added to the generated binary.

@davidhewitt
Copy link
Member

Thanks for moving this forward! The idea of using custom data sections is new to me. I see the upsides of it, though I am slightly worried by the extra complexity of needing to worry about linker details in yet another way.

1. Make pyo3-macros generate a file with the stubs after having processed all macros of the crate.

I agree that having the macros generate file(s) is unlikely to be the right solution 👍

2. However, for the build system to execute it, a compatible Python interpreter must be present to link with and a compatible CPU or VM to run it, making generation when doing cross-compilation very hard. I guess it's what Python Interface (.pyi) generation and runtime inspection #2454 was heading toward.

For the library which converts the metadata to .pyi files, I think rather than expecting the build system like maturin to execute it, it would work better if we asked users to keep their .pyi file committed in the repository and they used the library to generate / update the file as part of their test suite. I think in that case then local development should be enough and cross-compiling can be ignored from the picture.

In fact, I wonder if for option (3) explored here then the using a test to generate and update the .pyi file is still better than to run it during maturin build. Here's why:

  • I think it's easier for users to understand what their .pyi file contains if it's committed in the repository rather than generated by maturin as part of the packaging process.
  • With a test, users are hopefully reminded in CI if they forgot to regenerate their .pyi files.
  • No need for us to include this code / information from the binary if it's only needed during development. (Probably we use a cargo feature which users only enable in their dev-dependencies?)

If we agree that using a test to update stubs is a good solution, then I think the choice between runtime code like (2) and data segments like (3) is probably just influenced by whatever is easier for us to implement. We might even be able to swap back and forth between these two options as an implementation detail as we learn.

@Tpt
Copy link
Contributor Author

Tpt commented Mar 25, 2024

If we agree that using a test to update stubs is a good solution, then I think the choice between runtime code like (2) and data segments like (3) is probably just influenced by whatever is easier for us to implement. We might even be able to swap back and forth between these two options as an implementation detail as we learn.

Yes! To have played a bit with both, runtime code like (2) is way easier (the difference between this MR and #2454 is quite significant).

If I try to summarize the pros of each approach:

Approach 3: add introspection to the cdylib and let maturin write the stubs on build:

  • no extra user code/configuration: they upgrade maturin and pyo3 and it just works instead of having to setup a test, make extension-module not a default...
  • support of Rust features: if a feature is enabled, extra stubs might get generated and if it is disabled stubs won't get generated. This seems impossible to get with approach 4
  • easier Python version and platform customization: stubs are generated per Python version and platform

Approach 4: use a test to write/update stubs:

  • hopefully simpler code in PyO3 (not data segment generation and retrieval from binary...)
  • easier debugging and change inspection: stubs are commited as part of repository content
  • Python version and platform specific elements are explicit in the stubs. But it might be painful to implement because it might be a bit of cat-and-mouse game with Rust #[cfg] annotation.
  • smaller generated cdylib (no introspection data in them)

@davidhewitt
Copy link
Member

You make a very good point that automated tests to update a .pyi file struggle with cfg declarations. I suppose one way around that would be to make it possible to have multiple .pyi files committed for different Python / OS combinations and select the correct one in packaging.

One thing that I expect is that .pyi files might want to include extra user customisations beyond what the autogenerated stubs contain. For example, I expect generic code may be tricky to define on the Rust side and may benefit from some handwritten customisation of stubs. (e.g. making a user-defined dict[K, V] class.) I don't have a good answer to how we can solve that.

Overall I don't have a good sense of whether option 3 or 4 is better. In an ideal world we might offer both options. Which one do you think would meet your needs better at present? Maybe we start by implementing that and we learn a lot by doing so!

@Tpt
Copy link
Contributor Author

Tpt commented Mar 27, 2024

One thing that I expect is that .pyi files might want to include extra user customisations beyond what the autogenerated stubs contain. For example, I expect generic code may be tricky to define on the Rust side and may benefit from some handwritten customisation of stubs. (e.g. making a user-defined dict[K, V] class.) I don't have a good answer to how we can solve that.

I would love to avoid people to handwrite customization in stubs because it makes automatically updating the stubs when Rust code is changed very hard. Imho automatically updating stubs to reflect changes in the Rust code is the main value proposition of auto-generating stubs in the first place.

An idea: Add entry points in PyO3 macros to extend the stubs. For example (rought idea, not sure about the actual details):

#[pymodule]
#[py(extra_stub_content = "
K = VarType('T')
V = VarType('V')
")]
mod module {
    #[pyclass(stub_parent = [collections::abc::Mapping::<K,V>])]
    struct CustomDict;
    
    #[pymethods]
    impl CustomDict {
       #[pyo3(signature = (key: K), return_type = V | None)]
       fn get(key: &Bound<'_, PyAny>) -> Option<Bound<'_, PyAny>> {
       }
    }
}

would generate

K = VarType('T')
K = VarType('V')

class CustomDict(collections.abc.Mapping[K,V]):
     def get(key: K) -> V | None: ...

This way stubs would stay auto generated but can be improved by the author.

A possible way to mix options 3 and 4:

  • When building a wheel, Maturin uses by default the present stubs file or, if not present, falls back to automated introspection (option 3)
  • Maturin provides a generate-stubs command that takes for input arguments like --target and --features and generates stubs file, allowing users to pick option 4 by commiting the command result to git. Users are also able to use this command to check that commited to git are up to date (run the command again and check if anything changed) or run some validation on the stubs in the CI using eg. mypy.stubtests even if the stubs are not commited to git.

@davidhewitt
Copy link
Member

Agreed that having the proc macros be able to collect all the necessary information would be nice. I think only time will tell whether they can meet all user needs!

I'm slightly wary of coupling to maturin for all stub generation, because some projects use setuptools-rust or their own build options for good reasons. I think that offering maturin generate-stubs as an alternative development command would be a good way to avoid the problem for projects that don't want to do their packaging with maturin.

cc @messense do you see any concerns with adding this to maturin?

So what's next steps here? Do you want me to start reviewing this code, or will you push more first?

Regarding the data sections, I happened to hear yesterday that UniFFI's proc macros can do something similar about shipping definitions in the shared library, so it might be interesting to look at / ask them how that was implemented.

@messense
Copy link
Member

do you see any concerns with adding this to maturin?

No concern, I think a generate-stubs command will be very useful for users wanting to commit pyi files in git. We can also add a --check option to fail the command when existing pyi file is outdated.

@Tpt
Copy link
Contributor Author

Tpt commented Mar 27, 2024

Thank you!

Agreed that having the proc macros be able to collect all the necessary information would be nice. I think only time will tell whether they can meet all user needs!

Yes! My hope is to cover as many as possible.

I'm slightly wary of coupling to maturin for all stub generation, because some projects use setuptools-rust or their own build options for good reasons. I think that offering maturin generate-stubs as an alternative development command would be a good way to avoid the problem for projects that don't want to do their packaging with maturin.

Additionaly to maturin generate-stubs, I would go one step further and suggest we also follow the approach started in this MR, ie. build a pyo3-introspection stand-alone crate that contains all the code to generate the actual stubs from the binary annotations. This way other build systems would be able to ship integrated automated stub generation if they want. This would also enable other tools like doing breaking change checking based on the introspection data.

Regarding the data sections, I happened to hear yesterday that UniFFI's proc macros can do something similar about shipping definitions in the shared library, so it might be interesting to look at / ask them how that was implemented.

Thank you! I'm going to have a look at it.

So what's next steps here? Do you want me to start reviewing this code, or will you push more first?

I think the current draft already shows the relevant direction, a very high level code review to check if it's going in the good direction would be welcome. Maybe wait for me to have a look at UniFFI, I might change a bit this MR if I find interesting things there. Thank you!

@Rigidity
Copy link

This is very exciting, looking forward to being able to generate type stubs! Currently we have this lengthy and hard to maintain Python script for doing so, which we have to update by hand: https://github.com/Chia-Network/chia_rs/blob/main/wheel/generate_type_stubs.py

This would be a major improvement. Happy to help out however I can (testing, implementation, whatever) as time allows, to hopefully get this out the door 😄

@Tpt
Copy link
Contributor Author

Tpt commented Mar 27, 2024

@Rigidity Thank you! I plan to work on this MRs to get the basics done. Then there will be a lot of features to incrementally add on it (support for all PyO3 features...) so help will coding and testing will be much welcome!

@Tpt Tpt force-pushed the stub-generation-static branch 3 times, most recently from bed5bc2 to e82fc35 Compare May 10, 2024 15:11
@Tpt
Copy link
Contributor Author

Tpt commented May 10, 2024

Sorry for the very long reaction delay (a lot of priorities + vacations).

Regarding the data sections, I happened to hear yesterday that UniFFI's proc macros can do something similar about shipping definitions in the shared library, so it might be interesting to look at / ask them how that was implemented.

I had a look at uniffi, they basically use the same approach as us: embedding the metadata in the binary and then parsing the binary using the same goblin library. The major different I see is that my prototype is putting the metadata into a custom section named set using link_section whereas uniffi is using regular const without custom link_section. I would tend to think custom link_section is a bit clearer when analyzing the binary but I am not sure how much it changes things.

@davidhewitt If you have time, may you have a quick look at the MR to see if the global design goes into a good direction? If yes, I will fix a lot of shortcut I took and get the MR ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants