Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace pkg_resources with importlib_metadata #7413

Closed
chrahunt opened this issue Nov 29, 2019 · 14 comments
Closed

Replace pkg_resources with importlib_metadata #7413

chrahunt opened this issue Nov 29, 2019 · 14 comments
Labels
project: vendored dependency Related to a vendored dependency type: maintenance Related to Development and Maintenance Processes

Comments

@chrahunt
Copy link
Member

chrahunt commented Nov 29, 2019

What's the problem this feature will solve?

Currently, pip uses a vendored version of pkg_resources in several places.

pkg_resources has a few downsides for our use cases, including:

  1. unconditional scan of entire sys.path on import (Avoid full path enumeration on import of setuptools or pkg_resources? setuptools#510), when we may only need it for a subset of those directories
  2. various issues filed regarding corrupt METADATA files in site-packages preventing pip from starting
  3. complicated to vendor vs a standalone library, since we need to extract it from setuptools
  4. higher barrier of entry for other package management tools that want pip-compatible behavior

Describe the solution you'd like

Replace pkg_resources with importlib.metadata (via the importlib-metadata backport for Python < 3.8).

Our current usages for pkg_resources are:

  1. build_env.BuildEnvironment: identify the set of packages available/conflicting in the isolated environment directories
  2. req.req_install.InstallRequirement: locate an already-installed package
  3. commands.search, commands.show: iterate over the set of installed packages and access metadata
  4. wheel:
    1. parse entrypoints
    2. check wheel version of unpacked Wheel
  5. ad-hoc usage of parsing/canonicalization functions

A combination of importlib_metadata and packaging can satisfy these requirements:

  1. importlib.metadata.distributions(path=["dir1", "dir2"]) then comparison of the version using packaging.specifiers.SpecifierSet
  2. importlib.metadata.distribution("package_name") or importlib.metadata.distributions(path=["..."], name="package_name")
  3. [d.metadata["..."] for d in importlib.metadata.distributions()]
    1. importlib.metadata.Distribution.at(info_dir).entry_points
    2. next(importlib.metadata.distributions(path=["..."])).read_text("WHEEL")
  4. packaging.utils.canonicalize_name, etc

As a vendored library, importlib-metadata would have the following characteristics:

  1. Apache-2.0 license
  2. Several backports for Python 2:
    1. contextlib2, which we already vendor
    2. pathlib2 - we have looked at for tests. There is one issue we need to investigate the impact of: Unusable on Windows/Python2.7 with unicode paths jazzband/pathlib2#56 - MIT licensed, pure Python with no dependencies
    3. configparser - configparser backport - MIT licensed, pure Python with no dependencies
  3. One dependency otherwise:
    1. zipp for making the zipfile module Path-aware - MIT licensed, pure Python with no dependencies

Alternative Solutions

  • Continue using pkg_resources - this approach has the downsides mentioned above
  • Use another metadata-providing library - none researched, I don't imagine we'd choose something else over either of these two

See also

@chrahunt chrahunt added project: vendored dependency Related to a vendored dependency type: maintenance Related to Development and Maintenance Processes state: needs discussion This needs some more discussion labels Nov 29, 2019
@chrahunt
Copy link
Member Author

chrahunt commented Dec 12, 2019

Regarding jazzband/pathlib2#56:

Summary: pathlib2.Path stores path parts as bytes. On Python 2 when provided a unicode object, it is converted to bytes using sys.getfilesystemencoding(), which unconditionally returns 'mbcs' on Windows (in Python 2). 'mbcs' encoding on Python 2 is broken and cannot convert wide characters into the equivalent bytes.

Impact: Medium/Low. Internally in pip we do not deal with unicode except in a small handful of places. In importlib_metadata there is also no explicit use of unicode objects, so it would vary depending on what we provided to the APIs. If we simply do not pass unicode to importlib_metadata or pathlib2.Path then it should be fine. On the other hand, this also would prevent us from fixing e.g. #5972 and probably other issues related to unicode filenames on Windows in Python 2. That isn't a deal-breaker IMO since: it's already broken, that's the only issue I can find about it, and even the issue creator in that case is OK letting it die since it doesn't happen on Python 3.

Mitigation: We can patch pathlib2.Path to raise an exception if provided a unicode object in Python 2, to prevent confusing behavior and enforce the statements above.

@pradyunsg
Copy link
Member

We can patch pathlib2.Path to raise an exception if provided a unicode object in Python 2, to prevent confusing behavior and enforce the statements above.

I'm on board for this. This seems like the least-worst option here.

@uranusjr
Copy link
Member

I’m wondering, are there behavioural differences between pkg_resources and importlib.metadata? Maybe it would be possible to implement a shim to use importlib.metadata if present (so new Python versions), and fall back to pkg_resources otherwise. That way we can choose to freeze pkg_resources in a version that’s known to support old Pythons, and provide improvements only for new Python versions.

@pfmoore
Copy link
Member

pfmoore commented Jan 13, 2020

+1 on this option. Adding a load more vendored code just to support Python 2 is not a route I want to go down, TBH.

@pradyunsg
Copy link
Member

pradyunsg commented Jan 13, 2020

pkg_resources isn't the only one where moving to a more modern option is trickier due to Python 2: #7592 (comment)

@uranusjr
Copy link
Member

uranusjr commented Feb 6, 2020

I briefly looked into this and immediately hit a roadblock. pip checks how a distribution is installed by checking the INSTALLER entry in the dist-info. importlib-metadata does not seem to have an interface for that.

@pradyunsg
Copy link
Member

@jaraco ⬆️

@uranusjr
Copy link
Member

uranusjr commented Feb 6, 2020

I think gitlab/python-devs/importlib_metadata#23 more or less covers the problem. pip will also need to read direct_url.json once PEP 610 is implemented (#7612).

In general I think pip needs a to be able to reliably access arbitrary files in .dist-info, so we wouldn’t be blocked to implement future additions to the package databases.

@chrahunt
Copy link
Member Author

chrahunt commented Feb 6, 2020

It looks like Distribution.get_text gives us that access.

@uranusjr
Copy link
Member

uranusjr commented Feb 6, 2020

Yup, read_text() is the one to use, thanks for the pointer!

@uranusjr
Copy link
Member

uranusjr commented Apr 22, 2020

Writing up my personal thoughts on this (by suggestion of @pradyunsg) since I don’t expect myself to work on this personally for a while, but the plan (from what I can tell) is reletively straightforward and anyone interested in this can try to carry it out.

So the plan is to use importlib-metadata on supported platforms, but keep pkg_resources as a fallback on lower versions. But these two libraries provide vastly different interfaces, and pip currently uses pkg_resources directly in a lot of places that don’t translate well directly into importlib-metadata. What I have in mind is a three-step process:

  1. Introduce abstract classes that hide pkg_resources usages as implementation detail. This would likely include two classes (my plan is to put them in pip._internal.metadata.base), Distribution holding distribution information (e.g. .dist-info), and WorkingSet to introspect the environment. (I borrowed these names from pkg_resources, we can discuss names when someone actually starts implementing them.)
  2. Implement a set of concrete classes to the abstraction using pkg_resources. Identify direct pkg_resources in pip, give each operation a meaningful name, and implement them in those concrete classes. Replace direct usages with these meaningully-named functions.
  3. After all direct pkg_resources usages are purged, actually bring in importlib-metadata and implement another set of concrete classes for them. Implement a “switch” to conditionally return either set of implementation based on Python version.

I believe this work would be valuable even if it happens after we drop Python 2. It is extremely tedious to rewrite all the pkg_resources usages to importlib-metadata (the APIs are very different), so the abstractiion process would still be a good way to make it work, even if the pkg_resources implementation eventually gets dropped immediately after we finish step 3 above.

@jaraco
Copy link
Member

jaraco commented Apr 27, 2020

So the plan is to use importlib-metadata on supported platforms, but keep pkg_resources as a fallback on lower versions.

I'd recommend moving entirely to importlib-metadata, which should support all platforms that pip supports. I'd highly recommend porting to that rather than trying to bridge pkg_resources and importlib.metadata.

Also, I'm happy to help with this effort by providing technical support on the implementation, as long as someone can lead the effort.

@pradyunsg
Copy link
Member

@uranusjr With the above PRs, is there anything more to do here, given that we're using importlib.metadata on 3.11+?

By default, pip uses ``importlib.metadata`` on Python 3.11+, and
``pkg_resourcess`` otherwise. This can be overridden by a couple of ways:
* If environment variable ``_PIP_USE_IMPORTLIB_METADATA`` is set, it
dictates whether ``importlib.metadata`` is used, regardless of Python
version.
* On Python 3.11+, Python distributors can patch ``importlib.metadata``
to add a global constant ``_PIP_USE_IMPORTLIB_METADATA = False``. This
makes pip use ``pkg_resources`` (unless the user set the aforementioned
environment variable to *True*).

@uranusjr
Copy link
Member

Yeah I think everything is in place, the only thing we need now is for time to take over.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
project: vendored dependency Related to a vendored dependency type: maintenance Related to Development and Maintenance Processes
Projects
None yet
Development

No branches or pull requests

5 participants