Increasing pip's & PyPI's metadata strictness #264

Open
brainwane opened this issue Jun 12, 2019 · 35 comments

Comments

@brainwane
Contributor

brainwane commented Jun 12, 2019

(Followup to discussion during packaging minisummit at PyCon North America in May 2019.)

Conda has a metadata solver. pip and PyPI already know of, or will soon know of, cases where package metadata is incorrect; how firmly/strictly should metadata correctness be enforced?

There's general agreement that strictness should be increased; the question is: how quickly and to what extent?

Issues

  1. Metadata (largely dependency metadata) can be incorrect initially, or may become incorrect in the future as new packages are released. This can happen in wheels or in the source distribution.
    a. Agreement among PyPI working group to better enforce manylinux (Donald Stufft, Dustin Ingram, EWDIII) -- see "Run auditwheel on new manylinux uploads, reject if it fails" pypi/warehouse#5420
    b. EWDIII - There is no technical barrier to PyPI updating its metadata. Updating the package itself (changing the metadata inside the artifact) is a non-starter.
    c. Also possibility for staged releases. This would allow composition and checks before release. (Nick + Ernest)
    d. Can metadata be corrected by producing a new wheel or a post-release? Likely not by uploading a wheel.
    e. Ability to yank packages (will not install for interval-based version specifications, but would still be available to install under exact/equality pinning) per PEP 592. New simple API change to support yanking (a la Ruby gems)
    f. Metadata should not differ between the artifact and PyPI
    • The artifact metadata is canonical, and the metadata exposed by PyPI should never diverge from it
  2. Non-standards-compliant wheels tagged manylinux-*
  3. Can pip stop an install if it is going to break existing requirements?
    a. will/may require the solver

Action Items:

  1. PyPI stricter on upload, write a rollout plan
    a. Chris Wilcox said "I am going to start on a validator in pypa/validator that can be leveraged at twine/setuptools/warehouse" -- he has started it -- also see "twine check should guard against things not accepted by PyPI like version format" pypa/twine#430
    b. Warehouse to start hard failing package uploads on invalid markup with an explicit description type -- see pypi/warehouse#3285
  2. Determine the behaviour of wheel build numbers
  3. Simple API for yanking; underway
  4. Finish pip solver
  5. Warning about lack of python_requires (see the example sketch at the end of this comment)
    a. Or could the spec/setuptools be updated to fail on this
    b. Also, can we fail when author, author_email, or URL is missing? Currently these are warnings at setup time. (Chris Wilcox)
    c. For packages where no restrictions on Python version are desired, a “python_requires==*” would be satisfactory
    d. also see WIP: Add metadata validation setuptools#1562
  6. Could we explore banning old upload clients from PyPI?
    a. Yes; support needs to be added; an issue needs to be created
  7. General soft-fail support [if you opt in to compliance, we’ll block your upload; otherwise we will send you warnings] -- requires Draft release feature on main archive to allow testing a release before it goes live pypi/warehouse#726

This is meant as a tracking issue covering the various TODOs necessary to plumb this through the parts of the toolchain.
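To make action item 5 concrete, here is a minimal sketch of what explicitly declaring a supported Python range looks like in a project's setup.py; the project name, version, and URLs are placeholders:

```python
# Placeholder project -- the interesting part is python_requires, which
# setuptools writes into the distribution metadata as Requires-Python.
from setuptools import setup

setup(
    name="example-project",            # hypothetical name
    version="0.1.0",
    python_requires=">=3.6",           # the supported Python range
    author="Example Author",
    author_email="author@example.com",
    url="https://example.com/example-project",
)
```

pip reads the resulting Requires-Python field from the index and skips releases whose specifier excludes the running interpreter, which is why a missing value is worth warning about.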

@brainwane
Contributor Author

I linked to this on Discourse to give people a heads-up.

Some questions:

Determine the behaviour of wheel build numbers

Who needs to do this? This will affect Twine & Warehouse in particular, right?

Warning about lack of python_requires

Would this be in setuptools? @pganssle @jaraco? @crwilcox I see your name in the notes here - anything you could add here would be great.

@pganssle
Member

Would this be in setuptools? @pganssle @jaraco? @crwilcox I see your name in the notes here - anything you could add here would be great.

We could do it in setuptools, but it's more important to do it in warehouse, see pypi/warehouse#3889, since ideally the end goal is that everyone is properly annotating what versions of Python they support (or at least making an affirmative choice to put in something like '*.*'). A warning in setuptools will probably help some people, but more likely it will be largely ignored if it's seen at all.

I think one problem with the "PyPI raises a warning" approach is that IIUC twine has no mechanism to display any such warning. @di and @dstufft would know better than me whether adding such a capability is desirable for this or other reasons.

@crwilcox

I have a branch started on this and intend it to be a part of packaging so it can be used from multiple sources. I think starting by using it in warehouse would make sense.
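To make that concrete, here is a purely hypothetical sketch of the shape such a shared validator could take -- this is not Chris's branch or an existing packaging API, just an illustration built on the parsing helpers packaging already provides:

```python
from packaging.specifiers import InvalidSpecifier, SpecifierSet
from packaging.version import InvalidVersion, Version


def validate_core_metadata(meta):
    """Return a list of problems for a dict of core metadata fields.

    `meta` is an assumed input shape for this sketch, e.g.
    {"name": "example", "version": "1.0", "requires_python": ">=3.6"}.
    """
    problems = []
    if not meta.get("name"):
        problems.append("missing Name")
    try:
        Version(meta.get("version", ""))
    except InvalidVersion:
        problems.append("Version is not PEP 440 compliant")
    requires_python = meta.get("requires_python")
    if requires_python is None:
        problems.append("missing Requires-Python")
    else:
        try:
            SpecifierSet(requires_python)
        except InvalidSpecifier:
            problems.append("Requires-Python is not a valid specifier set")
    return problems
```

The same checks could then surface as warnings in setuptools/twine and as hard failures in Warehouse, which matches the soft-fail rollout described above.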

@dstufft
Member

dstufft commented Jun 13, 2019

The upload API does not support a warning mechanism, only error. Arguably it's better to send a warning via email though? Or at least one could make the argument that it is. Lots of people publish in an automated fashion and won't ever see a command line warning anyways.

@pradyunsg
Member

Warnings from PyPI via email make a lot of sense to me.

It's definitely a better option than adding a warnings mechanism that twine then exposes to the user.

We should also use something (like packaging) which can also be used across projects to do the metadata validation.

@ssbarnea

Not really related to packaging itself, but I wonder if PyPI could start archiving obsolete/unmaintained packages from the main index. PyPI UX could be improved if we had a curated index with currently maintained packages. At this moment, even if you search for a "foo" package, you will get a very poor search result, as the default listing (relevance) does not display the last release date.

Maybe we could use the metadata compliance as a way to filter the package indexes, motivating people to migrate?

Regarding twine upload warnings, I am even more drastic: default to error and allow a temporary bypass option. Sadly, 99% of people don't even read warnings, so make them errors. The only trick is to include a link to a ticket where people can see how to fix it and can also comment.

@dholth
Member

dholth commented Jun 14, 2019

I've released a couple of wheels with a build number. Every layer removed from the source tends to want its own version number; RPM has epochs. Alternatively you could make 7 corrections by uploading a py30-none-any wheel and then incrementing the tag all the way to py38-none-any.
@ssbarnea I don't think PyPI can curate. Is someone maintaining an awesome-python website with a list of great stuff?
How will strict metadata benefit the authors and not just the consumers of a package, and not just in their role as consumers of other packages?
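For readers who haven't used them: the build number is the optional tag between the version and the Python tag in a wheel filename, and a higher build number is preferred when everything else is equal. A small sketch, assuming a packaging release that provides parse_wheel_filename (the filenames are made up):

```python
from packaging.utils import parse_wheel_filename

# Two wheels for the same release: the original, and a rebuilt one tagged "1".
for filename in (
    "example_pkg-1.0-py3-none-any.whl",    # no build tag
    "example_pkg-1.0-1-py3-none-any.whl",  # build tag 1 (e.g. a repackaging fix)
):
    name, version, build, tags = parse_wheel_filename(filename)
    print(name, version, build)
# example-pkg 1.0 ()
# example-pkg 1.0 (1, '')
```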

@dstufft
Member

dstufft commented Jun 14, 2019

Not really related to packaging itself, but I wonder if PyPI could start archiving obsolete/unmaintained packages from the main index. PyPI UX could be improved if we had a curated index with currently maintained packages. At this moment, even if you search for a "foo" package, you will get a very poor search result, as the default listing (relevance) does not display the last release date.

The biggest problem with any sort of system that tries to determine if something is obsolete/unmaintained or not... is how do you actually determine if something is obsolete and/or unmaintained and ensure that you don't get false positives for software that is just "done" and just doesn't need any further updates?

@brainwane
Contributor Author

@ssbarnea I'd like to keep this issue focused on the metadata strictness issue. For more on archiving unmaintained projects or excluding them from search or otherwise making frequently-maintained packages easier to find, you might want to follow up in pypi/warehouse#4004 , pypi/warehouse#1388 , pypi/warehouse#4319 , pypi/warehouse#4021 , or pypi/warehouse#1971 . Thanks for your ideas!

@brainwane
Contributor Author

@dstufft is "Disallow runs of special characters in project names" pypi/warehouse#469 or "Clean database of UNKNOWN and validates against it" pypi/warehouse#69 part of what's necessary, or part of what we want to do, in this increasing-metadata-strictness work?

@brainwane
Contributor Author

@crwilcox perhaps you'd like to take a look at pypi/warehouse#194 where people discuss what automated checks, including on metadata, they'd like performed on uploads to PyPI.

@pradyunsg
Member

We should also use something (like packaging) which can also be used across projects to do the metadata validation.

Coming from @di's comment pypa/twine#430 (comment), by "like packaging" here, I wasn't suggesting we should add this to packaging itself. Rather, I meant that we should have a well-scoped library that does just this one thing -- package validation.

packaging is (currently) scoped to implementing PEP-backed stuff that has only one way to do it, and I'd like to preserve that.

@di
Sponsor Member

di commented Sep 17, 2019

packaging is (currently) scoped to implementing PEP-backed stuff that has only one way to do it, and I'd like to preserve that.

I was proposing adding this to packaging itself. IMO the same is true for metadata: it is all (more or less) defined in PEPs, and there's only one way to do it (the metadata is either valid, or it isn't).

@brainwane
Contributor Author

One of the TODOs here is to finish the pip solver.

The Python Software Foundation's Packaging Working Group has secured funding to help finish the new dependency resolver, and is seeking two contract developers to aid the existing maintainers for several months. Please take a look at the request for proposals and, if you're interested, apply by 22 November 2019. And please spread the word to freelance developers and consulting firms.

@brainwane
Contributor Author

We're progressing in the pip resolver work; in a recent meeting, @uranusjr @pfmoore and @pradyunsg started talking about how better Warehouse metadata would help pip with its new resolver, which we'd like to roll out around May.

So pypi/warehouse#726, pypa/packaging#147, pypa/twine#430, and pypa/setuptools#1562 would really help; would anyone like to step up and help get those moving?

@di
Sponsor Member

di commented Jan 25, 2020

How is pypi/warehouse#726 related to metadata strictness here?

@ncoghlan
Member

One aspect I'm aware of is that standardised two-phase upload makes it easier to test installation metadata accuracy prior to release, since it also allows the testing process to be standardised.

@brainwane
Contributor Author

+1 to @ncoghlan. Two-phase upload/package preview gives us

  • Ability to compose and check a package, including metadata, before release
  • Ability to slowly roll out, first, a stern warning "hey this metadata isn't compliant but we'll let you publish it anyway," and then an actual block of noncompliant packages
  • General soft-fail support [if you opt in to compliance, we’ll block your upload; otherwise we will send you warnings]

@di
Sponsor Member

di commented Jan 27, 2020

I think it's probably worth making a distinction here: two-phase upload could help prevent metadata that is "not the author's intention, but technically compliant" (such as typos, incorrect python_requires, etc) by allowing them to review and test it. But any metadata that is truly "non compliant" currently won't pass validation and already can't be uploaded.

I think a number of things listed here as "metadata inaccuracy" (like incorrect dependencies, invalid manylinux wheels) might be true, but would require some form of code execution / introspection / auditing (e.g. auditwheel) that a two-phase upload doesn't provide (but might be a precursor to).

For things like "warn on missing python_requires", we could have this today if we made this field required in the metadata specification. Implementing two-phase upload would just give us a better way to inform users about this and gracefully change the specification.

Which of these is causing the biggest issue for the resolver work?

@pradyunsg
Member

pradyunsg commented Jan 27, 2020

This GitHub issue bundles together the whole set of metadata concerns as they were voiced during the (only? main?) metadata-related topic at the Packaging Mini-Summit 2019 -- most of which ended up being about metadata validation -- however, as noted below, the originally proposed topic was oriented around the resolver's operation/UX.

Additionally, two-phase uploads also came up in both of the resolver-related planning calls: in the discussion of how the changes needed to get better metadata on PyPI could be rolled out (in the early January call, as a long-term strategy for improving the state of metadata on PyPI), and in the discussion of how it could catch release issues (like pip 20.0, in the more recent call).

IIUC, a comment by me in a resolver-planning meeting recently prompted the follow-up here:

PyPI does not have excellent metadata, and the pip resolver dealing with bad metadata .... would be difficult.


Which of these is causing the biggest issue for the resolver work?

The main issue for the resolver is the metadata inaccuracies, which we really have no option other than to deal with directly in the resolver somehow. A new resolver has to work with the already-published-with-inaccurate-metadata packages -- so even if we somehow end up with a build farm to build packages on PyPI in the future and can make sure every new release presents correct metadata, that won't change the fact that the resolver still needs to deal with potentially-incorrect metadata from past releases (unless we're planning to tackle the largely intractable problem of back-filling this metadata).

Metadata validation isn't super relevant to the resolution process (directly anyway).


Looking back, the original topic suggestion was more oriented around exactly this -- dealing with inaccuracies in the PyPI metadata during the resolution process. The actual discussion at the event shifted to be more about the other parts of the workflow, which generate/publish metadata (vs. use it, as the resolver does), because of the audience-driven nature of the discussion.

The PyCon 2019 notes don't seem to mention this, so I'm going off my memory now: the reason the resolver is mentioned in the summary notes above is that during the discussion at the summit of "should pip stop when it is installing a package that may break my environment", we discussed that, theoretically, a pip-with-better-resolver would come up with a solution that the current pip would not be able to find in those situations. (I remember this because we'd joked about the resolver's complexity when we added it to the action items: https://twitter.com/EWDurbin/status/1125447285272395776/photo/3) :)

(and, that makes it 4am IST)

@dstufft
Member

dstufft commented Jan 27, 2020

The main issue for the resolver is the metadata inaccuracies, which we really have no option other than to deal with directly in the resolver somehow. A new resolver has to work with the already-published-with-inaccurate-metadata packages -- so even if we somehow end up with a build farm to build packages on PyPI in the future and can make sure every new release presents correct metadata, that won't change the fact that the resolver still needs to deal with potentially-incorrect metadata from past releases (unless we're planning to tackle the largely intractable problem of back-filling this metadata).

It wouldn't be that intractable to backfill metadata for wheels -- but for sdists there's not much we're going to be able to do. Would the resolver be smart enough to know, in a hypothetical scenario, that it can prefetch dependency information for a wheel, but not for an sdist? Would that help?
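As a rough illustration of why wheels are the easy case here: their metadata is a static METADATA file inside the archive, so dependency information can be read without executing anything. A minimal standard-library sketch (the wheel path is a placeholder):

```python
import zipfile
from email.parser import Parser


def wheel_dependencies(path):
    """Return the Requires-Dist entries from a wheel's METADATA, without installing.

    This only works for wheels; an sdist may compute its dependencies at
    build time, which is exactly the gap described above.
    """
    with zipfile.ZipFile(path) as wheel:
        metadata_name = next(
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        metadata = Parser().parsestr(wheel.read(metadata_name).decode("utf-8"))
    return metadata.get_all("Requires-Dist") or []


# e.g. wheel_dependencies("example_pkg-1.0-py3-none-any.whl")
```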

@pradyunsg
Member

Would the resolver be smart enough to know, in a hypothetical scenario, that it can prefetch dependency information for a wheel, but not for an sdist?

Yep yep -- we can do that on the "pip side" of the abstractions.

@brainwane
Contributor Author

pypi/warehouse#3889 is the issue requesting that Warehouse reject package uploads that lack a Requires-Python.

@brainwane
Contributor Author

brainwane commented Apr 8, 2020

A followup after some discussion in IRC a few days ago.

On some open issues and TODOs:

  1. Metadata (largely dependency metadata) can be incorrect initially, or may become incorrect in the future as new packages are released. This can happen in wheels or in the source distribution.
    a. Agreement among PyPI working group to better enforce manylinux (Donald Stufft, Dustin Ingram, EWDIII) -- see "Run auditwheel on new manylinux uploads, reject if it fails" pypa/warehouse#5420

Still working on this.

c. Also possibility for staged releases. This would allow composition and checks before release. (Nick + Ernest)

We'd love help with this feature.

e. Ability to yank packages (will not install for interval-based version specifications, but would still be available to install under exact/equality pinning) per PEP 592. New simple API change to support yanking (a la Ruby gems)

Simple API for yanking; underway.
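For context on what that Simple API change looks like to a client: PEP 592 marks yanked files with a data-yanked attribute on the file's anchor tag on the PEP 503 project page. A rough sketch of how an installer might detect it (not pip's actual implementation):

```python
from html.parser import HTMLParser


class YankedFileCollector(HTMLParser):
    """Collect (filename, yanked) pairs from a PEP 503 simple project page.

    Per PEP 592, a file is yanked when its anchor carries a data-yanked
    attribute; installers are expected to skip yanked files unless the user
    pinned that exact version.
    """

    def __init__(self):
        super().__init__()
        self.files = []
        self._yanked = None  # None means we are not inside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._yanked = "data-yanked" in dict(attrs)

    def handle_data(self, data):
        if self._yanked is not None and data.strip():
            self.files.append((data.strip(), self._yanked))

    def handle_endtag(self, tag):
        if tag == "a":
            self._yanked = None


# Usage: collector = YankedFileCollector(); collector.feed(project_page_html)
```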

The To-do items:

  1. PyPI stricter on upload, write a rollout plan
    a. Chris Wilcox said "I am going to start on a validator in pypa/validator that can be leveraged at twine/setuptools/warehouse" -- he has started it -- also see "twine check should guard against things not accepted by PyPI like version format" pypa/twine#430

Relevant work is in progress -- see pypa/packaging#147 (comment) .

b. pypa/warehouse#3285 Warehouse to start hard failing package uploads on invalid markup with explicit description type

This is now done.

  2. Determine the behaviour of wheel build numbers

I don't know whether anyone has made progress on this.

TODO #3 was about the Simple API for yanking, which is underway.

  4. Finish pip solver

In progress.

  5. Warning about lack of python_requires
    a. Or could the spec/setuptools be updated to fail on this

We need to further discuss lack of python_requires at pypi/warehouse#3889 .

b. Also, can we fail when author, author_email, or URL is missing? Currently these are warnings at setup time. (Chris Wilcox)
c. For packages where no restrictions on Python version are desired, a “python_requires==*” would be satisfactory
d. also see pypa/setuptools#1562

I think we still need to discuss this in pypi/warehouse#194 .

  6. Could we explore banning old upload clients from PyPI?
    a. Yes; support needs to be added; an issue needs to be created

Per the IRC discussion, it sounds like this may or may not be necessary, depending on whether we enforce more specific rules about metadata that must be included in packages, minimum metadata versions, etc.

  7. General soft-fail support [if you opt in to compliance, we’ll block your upload; otherwise we will send you warnings] -- requires pypa/warehouse#726

Again, we would love help implementing package preview/staged releases.

@brainwane
Contributor Author

@alanbato is working on package preview/staged releases (now "draft releases"), and the yanking feature is now implemented pypi/warehouse#5837.

We still need further discussion and help with

@dholth
Member

dholth commented Apr 23, 2020

I'm using wheel build numbers for an experimental re-compressed wheels repository and they are working correctly. pip knows that the re-compressed wheels found on that --extra-index-url should be preferred. This is closer to the "binary packager is not upstream, might need their own version number" use case, just like RPM's or .deb's packager-level version numbers.

@ssbarnea

What @dholth said made me wonder if in the future we may be able to repackage wheels, maybe even adding extra constraints that prevent incompatibilities with dependencies that were released after the initial wheel was published. Maybe that would be too much, but the idea of being able to have an increasing packaging number (aka release number) is great.

@pfmoore
Member

pfmoore commented Apr 23, 2020

@ssbarnea We'd have to be extremely careful here from a security point of view. It would be bad, for example, if an untrusted party were able to repackage the wheel for numpy, claiming it's just recompressed to save bandwidth, but in fact they also introduce a new dependency on maliciouspackage==1.0.

Obviously any new repository supplying repackaged wheels would be opt-in, so the exposure is limited to people who do opt in, but as a term, "repackaging" implies no changes to what gets installed on the user's machine, and we don't (yet) have mechanisms to ensure that.
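Purely as a sketch of what such a mechanism might look like: compare per-file digests of the original and the repackaged wheel, so that "repackaged" provably means "same installed contents, different compression". Who would be trusted to run and publish that comparison is the harder, unaddressed problem.

```python
import hashlib
import zipfile


def wheel_content_digests(path):
    """Map each member of a wheel (a zip archive) to the SHA-256 of its contents."""
    with zipfile.ZipFile(path) as wheel:
        return {
            name: hashlib.sha256(wheel.read(name)).hexdigest()
            for name in wheel.namelist()
        }


def same_installed_contents(original_path, repackaged_path):
    """True only if both wheels would install exactly the same bytes."""
    return wheel_content_digests(original_path) == wheel_content_digests(repackaged_path)
```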

@dholth
Member

dholth commented Apr 23, 2020 via email

@pradyunsg
Member

You'd better trust the mirror

Recompressing wheels is inherently incompatible with a threat model where an attacker has the ability to respond to client requests.

This is a threat model we're protecting PyPI from, with future security enhancements: https://www.python.org/dev/peps/pep-0458/#threat-model. Note that this PEP is accepted and, as far as I know, there is funding for implementation of this functionality as well.

If the user has to download multiple wheels (like the original wheel, to check the contained file hashes against), I'm pretty sure we've thrown away any bandwidth gains we'd have made. :)

@dholth
Member

dholth commented Apr 23, 2020 via email

@nlhkabu
Member

nlhkabu commented May 20, 2020

As this ticket is blocked by the development of the dependency resolver (#988), I thought I would mention here that the team is looking for help from the community to move forward on that subject.

We need to better understand the circumstances under which the new resolver fails, so we are asking pip users with complex dependencies to:

  1. Try the new resolver (use version 20.1, run --unstable-feature=resolver)
  2. Break it :P
  3. File an issue

You can find more information and more detailed instructions here

@brainwane
Contributor Author

Now that pypa/pip#988 is resolved, folks have started taking a fresh look at related issues starting at pypa/pip#9187 (comment) .

@johnlabarge

I guess a terribly dumb question, but can anyone point me to a place where I can control the metadata? For some reason the published package (uploaded as a tgz) does not have the same metadata as I put in setup.py.

@merwok

merwok commented Jul 21, 2021

You can inspect the metadata locally by building a distribution (python setup.py sdist bdist_wheel; the latter command requires pip install wheel first) and looking at the METADATA file inside the wheel (a zip archive) or the PKG-INFO file inside the sdist (a tar.gz archive).
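If you'd rather not unpack the archive by hand, a small helper like this one reads the sdist's PKG-INFO directly; the dist/ filename is a placeholder, and for a wheel you would use zipfile and the *.dist-info/METADATA member instead:

```python
import tarfile


def sdist_metadata(path):
    """Return the top-level PKG-INFO text from a .tar.gz source distribution."""
    with tarfile.open(path, "r:gz") as sdist:
        for member in sdist.getmembers():
            # The project-level PKG-INFO lives at <name>-<version>/PKG-INFO.
            if member.name.endswith("/PKG-INFO") and member.name.count("/") == 1:
                return sdist.extractfile(member).read().decode("utf-8")
    raise FileNotFoundError("no top-level PKG-INFO in %s" % path)


# e.g. print(sdist_metadata("dist/example-project-0.1.0.tar.gz"))
```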
