Increasing pip's & PyPI's metadata strictness #264

brainwane · 2019-06-12T19:45:28Z

(Followup to discussion during packaging minisummit at PyCon North America in May 2019.)

Conda has a metadata solver. Pip and PyPI already or will soon know cases where package metadata is incorrect; how firmly/strictly should metadata correctness be enforced?

There's general agreement that strictness should be increased; the question is: how quickly and to what extent?

Issues

Metadata (largely dependency) can be incorrect initially, or may become incorrect in the future as new packages are released. This can be in wheels or in the source distribution.
a. Agreement among PyPI working group to better enforce manylinux (Donald Stufft, Dustin Ingram, EWDIII) -- see "Run auditwheel on new manylinux uploads, reject if it fails #5420" Run auditwheel on new manylinux uploads, reject if it fails pypi/warehouse#5420
b. EWDIII - There is no technical barrier for PyPI updating its metadata. There is a non-starter on updating the package/changing the metadata.
c. Also possibility for staged releases. This would allow composition and checks before release. (Nick + Ernest)
d. Can metadata can be corrected by producing a new wheel or post-release? Likely not by uploading a wheel.
e. Ability to yank packages (will not install for interval-based version specifications, but would still be available to install under exact/equality pinning) per PEP 592. New simple API change to support yanking (a la Ruby gems)
f. Metadata should not be able to differentiate between artefact and PyPI
- The artifact metadata is canonical; and, the metadata exposed by PyPI should never diverge
Non-standards-compliant wheels tagged manylinux-*
Can pip stop (if an install is going to be?) breaking existing requirements
a. will/may require the solver

Action Items:

PyPI stricter on upload, write a rollout plan
a. Chris Wilcox said "I am going to start on a validator in pypa/validator that can be leveraged at twine/setuptools/warehouse" -- has started it -- also see twine check should guard against things not accepted by PyPI like version format twine#430 twine check should guard against things not accepted by PyPI like version format endless timeouts #430
b. Hard fail on invalid markup with explicit description type pypi/warehouse#3285 Warehouse to start hard failing package uploads on invalid markup with explicit description type
Determine the behaviour of wheel build numbers
Simple API for yanking; underway
Finish pip solver
Warning about lack of python_requires
a. Or could the spec/setuptools be updated to fail on this
b. Also, can we fail when there's missing author, author_email, URL. Currently warnings on setup. (chris wilcox)
c. For packages where no restrictions on Python version are desired, a “python_requires==*” would be satisfactory
d. also see WIP: Add metadata validation setuptools#1562
Could we explore banning old upload clients from PyPI?
a. Yes; support needs to be added; an issue needs to be created
General soft-fail support [if you opt in to compliance, we’ll block your upload; otherwise we will send you warnings] -- requires Draft release feature on main archive to allow testing a release before it goes live pypi/warehouse#726

This is meant as as a tracking issue covering the various TODOs necessary to plumb this through the parts of the toolchain.

The text was updated successfully, but these errors were encountered:

brainwane · 2019-06-12T20:06:07Z

I linked to this on Discourse to give people a heads-up.

Some questions:

Determine the behaviour of wheel build numbers

Who needs to do this? This will affect Twine & Warehouse in particular, right?

Warning about lack of python_requires

Would this be in setuptools? @pganssle @jaraco? @crwilcox I see your name in the notes here - anything you could add here would be great.

pganssle · 2019-06-12T20:22:57Z

Would this be in setuptools? @pganssle @jaraco? @crwilcox I see your name in the notes here - anything you could add here would be great.

We could do it in setuptools, but it's more important to do it in warehouse, see pypi/warehouse#3889, since ideally the end goal is that everyone is properly annotating what versions of Python they support (or at least making an affirmative choice to put in something like '*.*'. A warning in setuptools will probably help some people, but more likely it will be largely ignored if it's seen at all.

I think one problem with the "PyPI raises a warning" approach is that IIUC twine has no mechanism to display any such warning. @di and @dstufft would know better than me whether adding such a capability is desirable for this or other reasons.

crwilcox · 2019-06-13T01:01:35Z

I have a branch started on this and intend it to be a part of packaging so it can be used from multiple sources. I think starting by using it in warehouse would make sense.

dstufft · 2019-06-13T14:44:16Z

The upload API does not support a warning mechanism, only error. Arguably it's better to send a warning via email though? Or at least one could make the argument that it is. Lots of people publish in an automated fashion and won't ever see a command line warning anyways.

pradyunsg · 2019-06-13T19:15:07Z

Warnings from PyPI via email make a lot of sense to me.

It's definitely a better option than adding a warnings mechanism that twine then exposes to the user.

We should also use something (like packaging) which can also be used across projects to do the metadata validation.

ssbarnea · 2019-06-14T08:17:52Z

Not really related to packaging itself but I wonder if PYPI could start archiving obsolete/unmaintained packages from the main index. PYPI UX could be improved if we would have a curated index with currently maintained packages. At this moment even if you search for a "foo" package, you will get a very poor search result as the default listing (relevance) does not display the last release date.

Maybe we could use the metadata compliance as a way to filter the package indexes, motivating people to migrate?

Regarding twine upload warnings, I am even more drastic: default to error and allow a temporary bypass option. Sadly 99% of people not even read the warnings, so make them error. The only trick is to include a link to ticket where people can see how to fix it and also comment.

dholth · 2019-06-14T13:30:20Z

I've released a couple of wheels with a build number. Every layer removed from the source tends to want its own version number; RPM has epochs. Alternatively you could make 7 corrections by uploading a py30-none-any wheel and then incrementing the tag all the way to py38-none-any.
@ssbarnea I don't think pypi can curate. Is someone maintaining an awesome-python website with a list of great stuff?
How will strict metadata benefit the authors and not just the consumers of a package, and not just in their role as consumers of other packages?

dstufft · 2019-06-14T13:52:46Z

Not really related to packaging itself but I wonder if PYPI could start archiving obsolete/unmaintained packages from the main index. PYPI UX could be improved if we would have a curated index with currently maintained packages. At this moment even if you search for a "foo" package, you will get a very poor search result as the default listing (relevance) does not display the last release date.

The biggest problem with any sort of system that tries to determine if something is obsolete/unmaintained or not... is how do you actually determine if something is obsolete and/or unmaintained and ensure that you don't get false positives for software that is just "done" and just doesn't need any further updates?

brainwane · 2019-06-19T15:33:19Z

@ssbarnea I'd like to keep this issue focused on the metadata strictness issue. For more on archiving unmaintained projects or excluding them from search or otherwise making frequently-maintained packages easier to find, you might want to follow up in pypi/warehouse#4004 , pypi/warehouse#1388 , pypi/warehouse#4319 , pypi/warehouse#4021 , or pypi/warehouse#1971 . Thanks for your ideas!

brainwane · 2019-06-19T15:35:06Z

@dstufft is "Disallow runs of special characters in project names" pypi/warehouse#469 or "Clean database of UNKNOWN and validates against it" pypi/warehouse#69 part of what's necessary, or part of what we want to do, in this increasing-metadata-strictness work?

brainwane · 2019-06-19T20:00:32Z

@crwilcox perhaps you'd like to take a look at pypi/warehouse#194 where people discuss what automated checks, including on metadata, they'd like performed on uploads to PyPI.

pradyunsg · 2019-09-09T05:31:11Z

We should also use something (like packaging) which can also be used across projects to do the metadata validation.

Coming from @di's comment pypa/twine#430 (comment), by "like packaging" here, I wasn't suggesting we should add this to packaging itself. Rather, I meant that we should have a well-scoped library that does just this one thing -- package validation.

packaging is (currently) scoped to implementing PEP-backed stuff, that has only-one-way-to-do-it and I'd like to preserve that.

di · 2019-09-17T05:58:52Z

packaging is (currently) scoped to implementing PEP-backed stuff, that has only-one-way-to-do-it and I'd like to preserve that.

I was proposing adding this to packaging itself. IMO the same is true for metadata: they are all (more or less) defined in PEPs, and there's only one way to do it (the metadata is either valid, or is isn't).

brainwane · 2019-11-12T00:11:39Z

One of the TODOs here is to finish the pip solver.

The Python Software Foundation's Packaging Working Group has secured funding to help finish the new dependency resolver, and is seeking two contract developers to aid the existing maintainers for several months. Please take a look at the request for proposals and, if you're interested, apply by 22 November 2019. And please spread the word to freelance developers and consulting firms.

brainwane · 2020-01-25T01:44:38Z

We're progressing in the pip resolver work; in a recent meeting, @uranusjr @pfmoore and @pradyunsg started talking about how better Warehouse metadata would help pip with its new resolver, which we'd like to roll out around May.

So pypi/warehouse#726, pypa/packaging#147, pypa/twine#430, and pypa/setuptools#1562 would really help; would anyone like to step up and help get those moving?

di · 2020-01-25T15:26:54Z

How is pypi/warehouse#726 related to metadata strictness here?

ncoghlan · 2020-01-25T15:50:26Z

One aspect I'm aware of is that standardised two-phase upload makes it easier to test installation metadata accuracy prior to release, since it also allows the testing process to be standardised.

brainwane · 2020-01-27T17:13:09Z

+1 to @ncoghlan. Two-phase upload/package preview gives us

Ability to compose and check a package, including metadata, before release
Ability to slowly roll out, first, a stern warning "hey this metadata isn't compliant but we'll let you publish it anyway," and then an actual block of noncompliant packages
General soft-fail support [if you opt in to compliance, we’ll block your upload; otherwise we will send you warnings]

di · 2020-01-27T21:01:59Z

I think it's probably worth making a distinction here: two-phase upload could help prevent metadata that is "not the author's intention, but technically compliant" (such as typos, incorrect python_requires, etc) by allowing them to review and test it. But any metadata that is truly "non compliant" currently won't pass validation and already can't be uploaded.

I think a number of things listed here as "metadata inaccuracy" (like incorrect dependencies, invalid manylinux wheels) might be true, but would require some form of code execution / introspection / auditing (e.g. audithweel) that a two-phase upload doesn't provide (but might be a precursor to).

For things like "warn on missing python_requires", we could have this today if we made this field required in the metadata specification. Implementing two-phase upload would just give us a better way to inform users about this and gracefully change the specification.

Which of these is causing the biggest issue for the resolver work?

pradyunsg · 2020-01-27T22:30:04Z

This GitHub issue bundles together the entire bunch of metadata concerns as they were voiced during the (only? main?) metadata-related topic at the Pacakging Mini-Summit 2019 -- most of which ended up being metadata validation related -- however, as noted below the originally proposed topic was oriented around the resolver's operation/UX.

Additionally, two-phase uploads also came up both the resolver-related planning calls; in the discussion of how rolling out the changes to get better metadata on PyPI could be done (in the early Jan call, as a long-term strategy for improving the state of metadata on PyPI) and how it could catch release issues (like pip 20.0, in the more recent call).

IIUC, a comment by me in a resolver-planning meeting recently prompted the follow-up here:

PyPI does not have excellent metadata, and the pip resolver dealing with bad metadata .... would be difficult.

Which of these is causing the biggest issue for the resolver work?

The main issue for the resolver is the metadata inaccuracies, which we really have no option other than to deal with directly in the resolver somehow. A new resolver has to work with the already-published-with-inaccurate-metadata packages -- so even if we somehow end up with a build farm to build packages on PyPI in the future and we can make sure every new release actually presents correct metadata -- that's not gonna change that the resolver still needs to deal with potentially-incorrect metadata from past releases (unless we're planning to tackle the super untractable problem of back-filling this metadata).

Metadata validation isn't super relevant to the resolution process (directly anyway).

Looking back, the original topic suggestion was more oriented around exactly this -- dealing with inaccuracies in the PyPI metadata in the resolution process. The actual discussion at the event, transitioned to be more around other parts of the workflow, which generate/publish metadata (vs use it, like in the resolver) because of the audience-driven nature of the discussion.

The PyCon 2019 notes don't seem to mention this, so I'm going off my memory now: the reason the resolver is mentioned in the summary-notes above is that during the discussion at the summit "should pip stop when it is installing a package that my break my environment" -- we discussed that theoretically, a pip-with-better-resolver would come up with a solution that the current pip would not be able to find in those situations. (I remember this, because we'd joked about the resolver's complexity when we added it to the action items: https://twitter.com/EWDurbin/status/1125447285272395776/photo/3) :)

(and, that makes it 4am IST)

dstufft · 2020-01-27T23:22:54Z

The main issue for the resolver is the metadata inaccuracies, which we really have no option other than to deal with directly in the resolver somehow. A new resolver has to work with the already-published-with-inaccurate-metadata packages -- so even if we somehow end up with a build farm to build packages on PyPI in the future and we can make sure every new release actually presents correct metadata -- that's not gonna change that the resolver still needs to deal with potentially-incorrect metadata from past releases (unless we're planning to tackle the super untractable problem of back-filling this metadata).

It wouldn't be that intractable to backfill metadata for wheels-- but for sdists there's not much we're going to be able to do. Would the resolver be smart enough to be able to know, in a hypothetical scenario, that it can prefetch dependency information for a wheel, but not for a sdist? Would that help?

pradyunsg · 2020-01-28T04:55:14Z

Would the resolver be smart enough to be able to know, in a hypothetical scenario, that it can prefetch dependency information for a wheel, but not for a sdist?

Yep yep -- we can do that on the "pip side" of the abstractions.

brainwane · 2020-04-03T21:57:05Z

pypi/warehouse#3889 is the issue requesting that Warehouse reject package uploads that lack a Requires-Python.

brainwane · 2020-04-08T15:32:39Z

A followup after some discussion in IRC a few days ago.

On some open issues and TODOs:

Metadata (largely dependency) can be incorrect initially, or may become incorrect in the future as new packages are released. This can be in wheels or in the source distribution.
a. Agreement among PyPI working group to better enforce manylinux (Donald Stufft, Dustin Ingram, EWDIII) -- see "Run auditwheel on new manylinux uploads, reject if it fails #5420" pypa/warehouse#5420

Still working on this.

c. Also possibility for staged releases. This would allow composition and checks before release. (Nick + Ernest)

We'd love help with this feature.

e. Ability to yank packages (will not install for interval-based version specifications, but would still be available to install under exact/equality pinning) per PEP 592. New simple API change to support yanking (a la Ruby gems)

Simple API for yanking; underway.

The To-do items:

PyPI stricter on upload, write a rollout plan
a. Chris Wilcox said "I am going to start on a validator in pypa/validator that can be leveraged at twine/setuptools/warehouse" -- has started it -- also see pypa/twine#430 twine check should guard against things not accepted by PyPI like version format endless timeouts #430

Relevant work is in progress -- see pypa/packaging#147 (comment) .

b. pypa/warehouse#3285 Warehouse to start hard failing package uploads on invalid markup with explicit description type

This is now done.

Determine the behaviour of wheel build numbers

I don't know whether anyone has made progress on this.

TODO #3 was about the Simple API for yanking, which is underway.

Finish pip solver

In progress.

Warning about lack of python_requires
a. Or could the spec/setuptools be updated to fail on this

We need to further discuss lack of python_requires at pypi/warehouse#3889 .

b. Also, can we fail when there's missing author, author_email, URL. Currently warnings on setup. (chris wilcox)
c. For packages where no restrictions on Python version are desired, a “python_requires==*” would be satisfactory
d. also see pypa/setuptools#1562

I think we still need to discuss this in pypi/warehouse#194 .

Could we explore banning old upload clients from PyPI?
a. Yes; support needs to be added; an issue needs to be created

Per the IRC discussion, it sounds like this may or may not be necessary, depending on whether we enforce more specific rules about metadata that must be included in packages, minimum metadata versions, etc.

General soft-fail support [if you opt in to compliance, we’ll block your upload; otherwise we will send you warnings] -- requires pypa/warehouse#726

Again, we would love help implementing package preview/staged releases.

brainwane · 2020-04-23T01:28:57Z

@alanbato is working on package preview/staged releases (now "draft releases"), and the yanking feature is now implemented pypi/warehouse#5837.

We still need further discussion and help with

"Run auditwheel on new manylinux uploads, reject if it fails #5420" Run auditwheel on new manylinux uploads, reject if it fails pypi/warehouse#5420
Determine the behaviour of wheel build numbers -- does this still need investigation?
possibly barring uploads that lack python_requires at Reject packages without a Requires-Python pypi/warehouse#3889
What metadata/installability checks should Warehouse check uploads for? or call settings.configure #194

dholth · 2020-04-23T01:46:04Z

I'm using wheel build numbers for an experimental re-compressed wheels repository and they are working correctly. pip knows the re-compressed wheels found on that --extra-index-url should be preferred. This is closer to the "binary packager is not upstream, might need their own version number" use case just like RPM or .deb's packager level version numbers.

ssbarnea · 2020-04-23T08:23:12Z

What @dholth said made me wonder if in the future we may be able to repackage wheels, maybe even adding extra constraints that prevent incompatibilities with dependencies that where released after the initial wheel was published. Maybe that would be too much, but the idea of being able to have an increasing packaging number (aka release number), is great.

pfmoore · 2020-04-23T09:37:27Z

@ssbarnea We'd have to be extremely careful here from a security point of view. It would be bad, for example, if an untrusted party were able to repackage the wheel for numpy, claiming it's just recompressed to save bandwidth, but in fact they also introduce a new dependency on maliciouspackage==1.0.

Obviously any new repository supplying repackaged wheels would be opt-in, so the exposure is limited to people who do opt in, but as a term, "repackaging" implies no changes to what gets installed on the user's machine, and we don't (yet) have mechanisms to ensure that.

dholth · 2020-04-23T11:22:07Z

You'd better trust the mirror however it would be possible to check all the contained file hashes against the original or otherwise go wild adding security features.

…

On Thu, Apr 23, 2020, at 5:37 AM, Paul Moore wrote: @ssbarnea <https://github.com/ssbarnea> We'd have to be extremely careful here from a security point of view. It would be bad, for example, if an untrusted party were able to repackage the wheel for numpy, claiming it's just recompressed to save bandwidth, but in fact they also introduce a new dependency on `maliciouspackage==1.0`. Obviously any new repository supplying repackaged wheels would be opt-in, so the exposure is limited to people who do opt in, but as a term, "repackaging" implies no changes to what gets installed on the user's machine, and we don't (yet) have mechanisms to ensure that. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#264 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABSZEQ45ARCTM5N5TKD4BTROAD6NANCNFSM4HXQGZUA>.

pradyunsg · 2020-04-23T20:39:47Z

You'd better trust the mirror

Recompressing wheels is inherently incompatible with a threat model where an attacker has the ability to respond to client requests.

This is a threat model we're protecting PyPI from, with future security enhancements: https://www.python.org/dev/peps/pep-0458/#threat-model. Note that this PEP is accepted and, as far as I know, there is funding for implementation of this functionality as well.

If the user should download multiple wheels (like the original wheel, for checking contained file hashes against), I'm pretty sure we've thrown away any bandwidth gains we'd have made. :)

dholth · 2020-04-23T20:52:51Z

This bug is really about whether the build tags work. They do. If you were willing to sign the RECORD (or any document containing hashes of all the individual files within the package) and if you were willing to let the recompressed wheel be unpacked before it was totally verified, then your extra bandwidth would be the list of filenames and hashes of the original. Not the entire original wheel. If you're more worried about making a silly mistake than withstanding a sophisticated attack then you're in a much better position. You can use a HTTP range request to download the original wheel's RECORD from pypi, and compare it with what you got from the recompressor. You could have endless fun going down the rabbit hole of having the mirror say "I zipped this correctly" with one signature in which case you would feel confident enough to run unzip on the wheel, and then make sure the individual files were still intact. No matter the security it's really common to have your own source of packages that you want overlaid on top of the public one.

…

On Thu, Apr 23, 2020, at 4:40 PM, Pradyun Gedam wrote: > You'd better trust the mirror Recompressing wheels is inherently incompatible with a threat model where an attacker has the ability to respond to client requests. This is a threat model we're protecting PyPI from, with future security enhancements: https://www.python.org/dev/peps/pep-0458/#threat-model. Note that this PEP is accepted and, as far as I know, there is funding for implementation of this functionality as well. If the user should download multiple wheels (like the original wheel, for checking contained file hashes against), I'm pretty sure we've thrown away any bandwidth gains we'd have made. :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#264 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABSZEWZVS7J4FRHHUA25MTROCRSHANCNFSM4HXQGZUA>.

nlhkabu · 2020-05-20T20:35:27Z

As this ticket is blocked by the development of the dependency resolver (#988), I thought I would mention here that the team is looking for help from the community to move forward on that subject.

We need to better understand the circumstances under which the new resolver fails, so are asking for pip users with complex dependencies to:

Try the new resolver (use version 20.1, run --unstable-feature=resolver)
Break it :P
File an issue

You can find more information and more detailed instructions here

brainwane · 2020-12-03T17:11:52Z

Now that pypa/pip#988 is resolved, folks have started taking a fresh look at related issues starting at pypa/pip#9187 (comment) .

johnlabarge · 2021-07-21T00:28:33Z

I guess a terribly dumb question is can anyone point me to a place where I can control the metadata. For some reason the published package (using tgz) is not having the same metadata as I put in setup.py?

merwok · 2021-07-21T02:06:48Z

You can inspect the metadata locally by building a distribution (python setup.py sdist bdist_wheel, requires pip install wheel first for the latter command) and looking at the METADATA or PKG-INFO file inside the dists (which are zip archives).

brainwane added the toolchain label Jun 12, 2019

brainwane mentioned this issue Jun 12, 2019

Add metadata validation pypa/packaging#147

Closed

Bachmann1234 mentioned this issue Jul 9, 2019

Adds a type check to ensure the description type provided is valid pypa/twine#473

Closed

brainwane mentioned this issue Aug 15, 2019

New Resolver: Rollout, Feedback Loops and Development Flow pypa/pip#6536

Closed

brainwane mentioned this issue Apr 8, 2020

What metadata/installability checks should Warehouse check uploads for? pypi/warehouse#194

Open

brainwane mentioned this issue Dec 3, 2020

New resolver takes a very long time to complete pypa/pip#9187

Closed

xmunoz mentioned this issue Apr 26, 2021

Package preview feature for PyPI psf/fundable-packaging-improvements#32

Open

Increasing pip's & PyPI's metadata strictness #264

Increasing pip's & PyPI's metadata strictness #264

Comments

brainwane commented Jun 12, 2019 • edited

brainwane commented Jun 12, 2019

pganssle commented Jun 12, 2019

crwilcox commented Jun 13, 2019

dstufft commented Jun 13, 2019

pradyunsg commented Jun 13, 2019

ssbarnea commented Jun 14, 2019

dholth commented Jun 14, 2019

dstufft commented Jun 14, 2019

brainwane commented Jun 19, 2019

brainwane commented Jun 19, 2019

brainwane commented Jun 19, 2019

pradyunsg commented Sep 9, 2019

di commented Sep 17, 2019

brainwane commented Nov 12, 2019

brainwane commented Jan 25, 2020

di commented Jan 25, 2020

ncoghlan commented Jan 25, 2020

brainwane commented Jan 27, 2020

di commented Jan 27, 2020

pradyunsg commented Jan 27, 2020 • edited

dstufft commented Jan 27, 2020

pradyunsg commented Jan 28, 2020

brainwane commented Apr 3, 2020

brainwane commented Apr 8, 2020 • edited

brainwane commented Apr 23, 2020

dholth commented Apr 23, 2020

ssbarnea commented Apr 23, 2020

pfmoore commented Apr 23, 2020

dholth commented Apr 23, 2020 via email

pradyunsg commented Apr 23, 2020

dholth commented Apr 23, 2020 via email

nlhkabu commented May 20, 2020

brainwane commented Dec 3, 2020

johnlabarge commented Jul 21, 2021

merwok commented Jul 21, 2021

brainwane commented Jun 12, 2019 •

edited

pradyunsg commented Jan 27, 2020 •

edited

brainwane commented Apr 8, 2020 •

edited