Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement PEP 643 to optimise pip download --no-binary #10195

Open
djpavon opened this issue Jul 24, 2021 · 18 comments
Open

Implement PEP 643 to optimise pip download --no-binary #10195

djpavon opened this issue Jul 24, 2021 · 18 comments
Labels
PEP implementation Involves some PEP type: feature request Request for a new feature

Comments

@djpavon
Copy link

djpavon commented Jul 24, 2021

Currently, when pip downloads sdists it is sometimes necessary for it to build a wheel in order to get trustworthy metadata #7995 #1884. Ideally it wouldn't need to do so, but that is a difficult problem to solve.

I would instead suggest that when pip does need to build a wheel during download it should default to saving that resulting wheel along-side the sdist package. This would let users avoid having to build the wheel twice in some cases, which could save a considerable amount of time.

While there are other cases where that wheel wouldn't be useful to the user (they want to make modifications first, control build settings, etc) it doesn't hurt them to have it, as it was being built anyway, and thus the time and disc space requirements are unchanged. People using pip download in scripts and the like may have to make changes to account for the additional file, so a transition plan would be needed.

I think the closest existing alternative to this is pip download --build ./tmp --no-clean. The main downside to that is knowing and remembering that it is necessary. Users are often surprised that pip download builds packages. Even when aware of the issue it can be easy to forget, especially when an sdist is downloaded because no appropriate wheels are available rather than because --no-binary was used.

This might dovetail with #9769 which aims to always make wheels as an intermediate step in installing sdists.

@uranusjr
Copy link
Member

uranusjr commented Jul 24, 2021

There are a few things we need to unpack here. First of all, pip does not build a wheel for metadata, but only requests to build package metadata, which is expected to require minimal effort. Some packages are configured in a, well, let’s say suboptimal way that a metadata build request results in an entire package build, but that fact is opaque to pip, and the build result is usually still short of a wheel, so technically there’s no wheel to save.

Even if we ignore all of that and say let's just make pip build a wheel for metadata (which means we're unreasonably punishing well-configured projects for less cooperative ones, but let's assume we can magically avoid that), there is still the fundamental issue that a wheel build is not guaranteed to be stable. If you ever noticed, pip does not cache wheels for installation it built from source, because a compilation step is essentially remote code execution and there's nothing stopping a package to produce a different wheels when built a second time (and a lot of projects do that in the real world in forms like build-time feature detection and conditional compilation). So making pip cache wheels built from source is introducing a cache invalidation issue (when can a wheel be reused?), one of the “two most difficult things” in computer science.

In the end, everything boils down to one fundamental fact that, without outside input, it is impossible to make sure a collection of source code is trustworthy in any way without building it from scratch. What pip needs is some additional flags in that source tree to tell pip what it can cache. And there is already a standard for that: PEP 643. If a source distribution implements that PEP, pip can determine the package metadata in it is trustworthy in fine granularity, and avoids the build step when it can. So what I would suggest is to contribute to wheel builders (e.g. setuptools) to see PEP 643 implemented (see pypa/setuptools#2685), and once those projects start generating sdists with appropriate metadata, pip can start doing what you want. Without that, anything pip does to solve your problems will cause us to get complaints from another group of people.

@uranusjr
Copy link
Member

I’ll just re-purpose this issue to track pip’s implementation of PEP 643. Even if there’s currently no-one producing such sdists, we can do the work upfront so one anyone does, they can get the benefits immediately.

@uranusjr uranusjr reopened this Jul 24, 2021
@uranusjr uranusjr changed the title When pip download --no-binary must build a wheel for metadata, save it. Implement PEP 643 to optimise pip download --no-binary Jul 24, 2021
@uranusjr uranusjr added PEP implementation Involves some PEP type: feature request Request for a new feature labels Jul 24, 2021
@pfmoore
Copy link
Member

pfmoore commented Jul 24, 2021

+1 on this. It would be great to get PEP 643 support in place, even if backends are not yet producing PEP 643 compliant sdists, as having support in pip would act as encouragement for them to do so.

I may try to take a look at this, but my open source time is pretty limited at the moment, so if someone else wants to pick this up, I'm fine with that.

@djpavon
Copy link
Author

djpavon commented Jul 24, 2021

Thanks for the detailed explanation. If I understand correctly, when pip wheel is used to download an sdist and build a wheel (rather than doing download separately), it will have the same issue where any partial build artifacts created while gathering metadata won't be suitable for using during the wheel build process, correct?

@uranusjr
Copy link
Member

pip wheel is special because we need to output a wheel anyway, so pip can just build that wheel and inspect metadata in it without a separate metadata-building step (I don't know if that's what pip is doing right now, but theoratically we can do that).

@djpavon
Copy link
Author

djpavon commented Jul 28, 2021

After testing, it looks like pip wheel does indeed avoid building twice for packages that have this problem. The sdist package doesn't appear to be retained with pip wheel --no-clean. For my purposes, where we would like to archive both the sdist and wheels, scraping the sdist urls from stdout of pip wheel and redownloading them (with curl or some other tool) was quite a bit faster than calling pip download then pip wheel, and is a good workaround until PEP 643 support lands.

@tvalentyn
Copy link

First of all, pip does not build a wheel for metadata, but only requests to build package metadata, which is expected to require minimal effort.

#8387 (comment) says the contrary. Is the comment incorrect?

Some packages are configured in a, well, let’s say suboptimal way that a metadata build request results in an entire package build, but that fact is opaque to pip, and the build result is usually still short of a wheel, so technically there’s no wheel to save.

What are the best practices package maintainers can do to speed up building package metadata and avoid the 'suboptimal' way? Do the projects using PEP-517 need to implement a special 'prepare_metadata_for_build_wheel' hook to avoid building a wheel? I am looking at

def prepare_metadata_for_build_wheel(
self, metadata_directory, config_settings=None,
_allow_fallback=True):
"""Prepare a ``*.dist-info`` folder with metadata for this project.
Returns the name of the newly created folder.
If the build backend defines a hook with this name, it will be called
in a subprocess. If not, the backend will be asked to build a wheel,
and the dist-info extracted from that (unless _allow_fallback is
False).
"""
return self._call_hook('prepare_metadata_for_build_wheel', {
'metadata_directory': abspath(metadata_directory),
'config_settings': config_settings,
'_allow_fallback': _allow_fallback,
})
and if I understand it right, pip falls back to building the wheel. Is there an example of a project that correctly implements such a hook or we need PEP 643 to make it possible?

@pradyunsg
Copy link
Member

See https://www.python.org/dev/peps/pep-0517/#prepare-metadata-for-build-wheel

Overall, given a source distribution, pip currently will always generate metadata for that package. If that package does not have a metadata generation hook, pip will generate an entire wheel instead.

@tvalentyn
Copy link

Thanks, @pradyunsg , this makes sense.

It seems that before PEP-517, pip was generating package metadata by calling python setup.py egg_info (#8387 (comment)).

Would it be appropriate for packages that use setuptools to implement prepare_metadata_for_build_wheel as a call to python setup.py egg_info? Here's an example (numpy) where I'd like to have this behavior to avoid the regression in pip download --no-binary command: numpy/numpy#14053 (comment).

If calling egg_info is not a good way to go, do you by chance have an example of a good implementation of the prepare_metadata_for_build_wheel hook?

@uranusjr
Copy link
Member

uranusjr commented Oct 13, 2021

The problem is not if they have prepare_metadata_for_build_wheel or not, but how expensive that metadata preparation step is. For some packages using an older, heavily customised toolchain like numpy, preparing metadata can be an extremely expensive step. PEP 517's prepare_metadata_for_build_wheel should be of similar performance to setup.py egg_info if the package is implementing it correctly, but that can be less easy with a legacy build chain.

Edit: And to answer your original question, no, you can't just call setup.py egg_info in the prepare_metadata_for_build_wheel hook because the output format is different. You can potentially build on it though.

@tvalentyn
Copy link

tvalentyn commented Oct 13, 2021

The numpy regression happened right when they switched to PEP-517 in numpy/numpy#14053. Running setup.py egg_info on numpy's code is still fast as far as I can tell.

@pradyunsg
Copy link
Member

Huh? I feel like it's useful to point out that setuptools does implement prepare_metadata_for_build_wheel, and that numpy uses it.

https://github.com/pypa/setuptools/blob/3dd1af099fc1379935a975a2a5f1561870c9dc92/setuptools/build_meta.py#L275

@pradyunsg
Copy link
Member

And that numpy correctly handles dist_info as well.

@pfmoore
Copy link
Member

pfmoore commented Oct 13, 2021

Running setup.py egg_info on numpy's code is still fast as far as I can tell.

That sounds more like there's an issue with setuptools' prepare_metadata_for_build_wheel implementation, then. Maybe it doesn't work as efficiently as the setup.py egg_info path when a project has a heavily customised build like I believe numpy does? But either way, I think this is something setuptools and numpy need to resolve between them - there's nothing for pip to do here.

@tvalentyn
Copy link

I am confused as to where prepare_metadata_for_build_wheel needs to be implemented. Should this be done within the package code, such as somewhere in setup.py, or if I am using setuptools, then this is hook is implemented in setuptools and beyond my control as a package maintainer?

@pradyunsg
Copy link
Member

Can we move this discussion to a new issue please? This seems to be purely about numpy / PEP 517, and has nothing to do with PEP 643's implementation work.

@pfmoore
Copy link
Member

pfmoore commented Oct 13, 2021

this is hook is implemented in setuptools and beyond my control as a package maintainer?

Yes, it's implemented by the build backend, not the package maintainer. And I agree with @pradyunsg, this should be taken to a different issue (as I say, maybe on the numpy or setuptools tracker).

@tvalentyn
Copy link

Thanks for the feedback, opened pypa/setuptools#2814.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PEP implementation Involves some PEP type: feature request Request for a new feature
Projects
None yet
Development

No branches or pull requests

5 participants