WIP: Add metadata validation #1562

bittner · 2018-10-29T00:27:30Z

Summary of changes

Adds metadata validation that aborts packaging with invalid package information.

Pull Request Checklist

Changes have tests
News fragment added in changelog.d. See documentation for details

bittner · 2018-10-29T00:32:28Z

As suggested in #1390 (comment) I've added the information about which field should satisfy which validation to the new validation module directly.

I still have to write the test for the new module. This is meant as WIP, so I can have the code peer-reviewed more quickly. (Thank you for your understanding!)

bittner · 2018-10-29T00:49:23Z

Looks like I have to make the tests pass for Python 2.

Also, the Exception details don't seem to be printed out in Python 2 -- while they are in Python 3, e.g.

E   setuptools.validation.InvalidMetadataError: [('provides_extras', AssertionError("is of type <class 'set'> but should be of type list(str)",))]

versus Python 2:

E   InvalidMetadataError

    setuptools/validation.py:97: InvalidMetadataError

pganssle

I don't like the idea of adding any sort of field validation to the public interface of setuptools, and I see no reason to wrap the DistributionMetadata in a validation.Metadata class.

If we're going to continue on this path, we should put validators in a setuptools._validation module. I would start with adding a validate_metadata function that acts on DistributionMetadata objects, though in the long run we may need to tweak how precisely it works.

Also, all that said, I think @di's suggestion of putting a metadata validator in packaging instead of in setuptools is a good idea. This is a function that all build backends will likely need.

pganssle · 2018-10-29T13:11:49Z

setuptools/dist.py

@@ -55,17 +56,41 @@ def get_metadata_version(dist_md):
 def write_pkg_file(self, file):
    """Write the PKG-INFO format data to a file object.
    """
+    metadata = Metadata(


Why are you wrapping DistributionMetadata in a Metadata object? Even if there were not other problems with this approach, I would think that the validator could work just as well on DistributionMetadata.

I see. You're right.

On the other hand, consider the validation functionality suggested by @di: it requires that you pass a dictionary to the validation class constructor. How would you solve this with the DistributionMetadata constructor?

I believe @di was referring to more generic functionality for a package other than setuptools, so it depends on the details there. One option is to just generate a non-validated PKG-INFO and pass it to a function that validates PKG-INFO files.

Another is to have an intermediate format that represents PKG-INFO as JSON (thus retaining some type information and avoiding the problem where content and format are indistinguishable) which is convertible by packaging into a PKG-INFO file. packaging could then validate the JSON file before a PKG-INFO file is generated.

If we're already trying to validate a DistributionMetadata file, then I would just validate the results of the actual getter operations.
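A minimal sketch of what "validate the results of the getter operations" could look like. `StubMetadata` is a stand-in for `DistributionMetadata` so the example runs without distutils; the single-line rule and the body of `InvalidMetadataError` are assumptions made for illustration, not code from this PR.

```python
class InvalidMetadataError(ValueError):
    """Hypothetical error type; the name follows the PR, the body is assumed."""


class StubMetadata:
    """Stand-in for DistributionMetadata, exposing two of its getters."""

    def __init__(self, name, version):
        self.name = name
        self.version = version

    def get_name(self):
        return self.name or "UNKNOWN"

    def get_version(self):
        return self.version or "0.0.0"


def _single_line(value):
    # Assumed rule: these fields must be one-line strings.
    return isinstance(value, str) and "\n" not in value


def validate_metadata(dist_md):
    """Check getter results and raise once, naming every failing field."""
    checks = {"name": dist_md.get_name, "version": dist_md.get_version}
    errors = [
        field for field, getter in checks.items() if not _single_line(getter())
    ]
    if errors:
        raise InvalidMetadataError("invalid fields: " + ", ".join(errors))


validate_metadata(StubMetadata("example", "1.0"))  # passes silently
```

No wrapper class is needed in this shape: the function only relies on the getters the metadata object already provides.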

pganssle · 2018-10-29T13:17:22Z

setuptools/validation.py

+            setattr(self, key, kwargs[key])
+            self.fields += [key]
+
+    def validate(self, throw_exception=False):


There's no need for throw_exception. If people want to suppress the exceptions from this function, they can do:

    try:
        metadata.validate()
        invalid = False
    except InvalidMetadataError:
        invalid = True

Please note that @di suggested to add validation functionality that behaves like validate().errors, so that it integrates nicely into Warehouse code.

I felt that throwing an exception by default would make code more clumsy. It's just not beautiful. -- I wouldn't want to miss that possibility in general, though.

Please note that @di suggested to add validation functionality that behaves like validate().errors, so that it integrates nicely into Warehouse code.

He suggested no such thing. He suggested that something like the warehouse functionality go into a different project, pypa/packaging, not this one. I think you may have misunderstood him.

I felt that throwing an exception by default would make code more clumsy. It's just not beautiful. -- I wouldn't want to miss that possibility in general, though.

This is in general not an example of good design, because there are now two different error handling pathways. Additionally, the Pythonic way to signal an error condition is to raise an exception. Raising is not clumsy; in fact, it is returning error codes that leads to a lot of clumsy code in downstream consumers.

The reason is that in most cases, the functions calling your function can do nothing about an error except abort early and notify the user. If you raise an exception in error conditions, anyone who isn't going to try to recover from the error can just use your function as if it always works, because the exception you raise will bubble up the stack until either someone catches it or the program aborts and the user is notified of the error. If you return an error code instead, you are relying on your consumers to know that your function may return an error code, to check for it, and then to raise an error or call a handler themselves.
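The argument can be sketched as follows. All names here (`validate`, `build`, `is_valid`) and the single-line `Summary` rule are hypothetical; only `InvalidMetadataError` borrows its name from the PR.

```python
class InvalidMetadataError(ValueError):
    """Hypothetical error type named after the one in this PR."""


def validate(metadata):
    # Raise on failure; callers that cannot recover need no extra code.
    if "\n" in metadata.get("Summary", ""):
        raise InvalidMetadataError("Summary must be a single line")


def build(metadata):
    # No error handling here: a failure in validate() propagates upward.
    validate(metadata)
    return "built " + metadata["Name"]


def is_valid(metadata):
    # A caller that does want a boolean can opt in explicitly.
    try:
        validate(metadata)
    except InvalidMetadataError:
        return False
    return True
```

With a single raising code path, `build` stays free of error checks, and the rare caller that wants a flag wraps the call itself.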

pganssle · 2018-10-29T13:48:26Z

setuptools/validation.py

+    """A string that can contain newline characters"""
+    if val is None:
+        return
+    assert isinstance(val, str), \


Do not use assert for control flow. Assertions are disabled when Python is run in optimized mode.

What would be similarly elegant? I don't want to do an if foo: raise Bar() in all that many places.

assert is not intended to be an "elegant" shorthand for raising an exception, it's a way of programmatically making assertions about how your code should work. You use it for things like enforcing a function's contract or documenting assumptions made by the programmer (which are then validated at runtime).

In this specific case, you cannot use it because it will not work. Assertions are to be treated as optional statements and if you run Python in optimized mode, they will simply not run at all, for example:

$ python -Oc "assert False"
$ python -c "assert False"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError

In this particular case, I would either raise ValueError or convert validators to functions returning a boolean, depending on the details. I'm not sure it's worth working out the details in this case, though, because as mentioned elsewhere this functionality rightly belongs in another library.
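For illustration, the first option (raising `ValueError`) might look like the sketch below. The function name `validate_multi_line_str` is hypothetical, the message wording is modelled on the traceback quoted earlier in this thread, and the check assumes Python 3's `str`.

```python
def validate_multi_line_str(val):
    """A string that can contain newline characters (None means 'not set')."""
    if val is None:
        return
    if not isinstance(val, str):
        # An explicit raise still runs under `python -O`, unlike assert.
        raise ValueError(
            "is of type %s but should be of type str" % type(val).__name__
        )
```

This costs one extra line per check compared with `assert`, but the check cannot be silently stripped by the interpreter.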

@bittner Please update the code and use raise exception. Assert should be used only inside tests and I am sure that some linters are even able to identify accidental use outside tests.

@ssbarnea I think at this point this PR needs a much bigger overhaul, and most of it may need to live in this repo. Probably best to hold off on cleaning up this specific code until the overhaul is done.

di · 2018-10-29T15:16:36Z

Agree with @pganssle here, there should be no actual validation logic added to setuptools unless it is somehow setuptools-specific.

bittner · 2018-11-01T09:45:36Z

I agree with you. There are some things I still need to clarify for myself, though:

setuptools is just a "tool" to generate metadata to publish a package (among other things)
distutils is just another tool (like setuptools, see above) -- let's ignore the unlucky monkey-patch inheritance nature of setuptools for now
distutils is mentioned as a deprecation candidate in other discussions
other places, such as Warehouse, may also require validating metadata (i.e. the validation code)
if cleanly implemented, there should probably be a separate, independent module, specifically dedicated to just the metadata used in Python or for Python modules, and their validation.

TL;DR

Is the write_pkg_file function the right place to add the use of metadata validation, like implemented in this PR?
Where should the validation business logic be added -- if not in distutils.dist.DistributionMetadata (as of item 3 above)? Should I move the validation module I introduced with this PR to the packaging package?

pganssle · 2018-11-01T15:50:26Z

@bittner
Regarding points 1, 2 and 3, setuptools is an extension of distutils. Both are tools for building Python packages - i.e. converting the source repository into something distributable/installable. distutils is basically deprecated in the sense that it is supplanted by setuptools and likely distutils will move into setuptools. PEP 517 adds a mechanism for freeing the tight coupling between installer tools like pip and distutils and setuptools. This will allow for the use of other build backends and in general is a sounder design because it's based on well-defined interfaces.

Regarding point 4, there are many points at which you may want to validate the metadata:

Immediately before or after generation (setuptools, flit, other build backends)
Before upload (twine)
Before accepting it to a package index (warehouse, bandersnatch)
Before installation (pip)

The metadata fields and formats are a standard, and as such it is likely that anyone who uses it may want to be able to know the difference between metadata that is compliant with the standard and not compliant. We could write separate validation logic for each of these (and possibly one or more of them will re-implement validation), but it may be a needless duplication of effort, and you may end up with metadata producers generating something that metadata consumers consider invalid if one is more strict than the other.

Regarding your questions:

Is the write_pkg_file function the right place to add the use of metadata validation, like implemented in this PR?

This depends on the nature of the common validator, but I think so, yes.

Where should the validation business logic be added -- if not in distutils.dist.DistributionMetadata (as of item 3 above)? Should I move the validation module I introduced with this PR to the packaging package?

I don't know what you mean by "validation business logic", but I'm guessing you mean the validate method? If so, I would say it belongs in another repo. Possibly pypa/packaging, but I would open an issue there to figure out if that's something they want to include or if we should possibly make it a completely separate package, and if so what the details would be.

If it were me designing it, I would probably design an interface that looks like this:

from typing import Dict


def validate_metadata(metadata: Dict[str, str]):
    """Function to validate PKG-INFO metadata

    Raises `InvalidMetadataError` for invalid metadata.
    """
    metadata_version = metadata.get("Metadata-Version", None)
    if metadata_version not in VALID_VERSIONS:
        raise InvalidMetadataError("No valid Metadata-Version found!")

    for key, value in metadata.items():
        validate = _get_key_validator(key, metadata_version)
        validate(value)

metadata would be a dictionary mapping the keys from the appropriate version of the Core Metadata Specification to a function that checks whether value is a valid value for that key in that version of the metadata and throws InvalidMetadataError otherwise.

If such an interface were added to a third-party package, write_pkg_info would need to be modified such that it generates a dictionary (possibly ordered) of key-value pairs, then writes the key-value pairs to file in the PKG-INFO format only at the end of the function. This dictionary could then be passed to the validate_metadata function.
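A rough sketch of that restructuring, under stated assumptions: this `write_pkg_file` takes a plain `fields` dict and an optional `validate` callable instead of operating on `self`, and it emits only three keys. None of this is the actual setuptools code.

```python
import io
from collections import OrderedDict


def write_pkg_file(fields, file, validate=None):
    """Collect fields first, optionally validate, and only then write them."""
    metadata = OrderedDict()
    metadata["Metadata-Version"] = fields.get("metadata_version", "1.0")
    metadata["Name"] = fields["name"]
    metadata["Version"] = fields["version"]

    if validate is not None:
        validate(metadata)  # expected to raise before anything is written

    # Serialization happens only after validation has succeeded.
    for key, value in metadata.items():
        file.write("%s: %s\n" % (key, value))


buffer = io.StringIO()
write_pkg_file({"name": "example", "version": "1.0"}, buffer)
```

Because nothing is written until the dictionary is complete, a failing validator leaves the output file untouched.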

Note that in this design, you can easily validate a JSON file containing the metadata by reading it into a dictionary with json.load.
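To make the design concrete, here is a runnable version of the interface above applied to JSON input. The contents of `VALID_VERSIONS`, the `_single_line` and `_anything` validators, and the rule table inside `_get_key_validator` are all assumptions made for the sake of the example; `json.loads` is used on an inline string where a file would call for `json.load`.

```python
import json
from typing import Callable, Dict

VALID_VERSIONS = {"1.0", "1.1", "1.2", "2.1"}  # assumed set, for illustration


class InvalidMetadataError(ValueError):
    """Hypothetical error type named after the one in this PR."""


def _single_line(value: str) -> None:
    if not isinstance(value, str) or "\n" in value:
        raise InvalidMetadataError("expected a single-line string")


def _anything(value: str) -> None:
    """Placeholder validator for keys without a rule in this sketch."""


def _get_key_validator(key: str, metadata_version: str) -> Callable[[str], None]:
    # A real implementation would vary the rules by metadata_version.
    rules = {"Name": _single_line, "Version": _single_line, "Summary": _single_line}
    return rules.get(key, _anything)


def validate_metadata(metadata: Dict[str, str]) -> None:
    """Raise InvalidMetadataError for invalid metadata."""
    metadata_version = metadata.get("Metadata-Version", None)
    if metadata_version not in VALID_VERSIONS:
        raise InvalidMetadataError("No valid Metadata-Version found!")
    for key, value in metadata.items():
        validate = _get_key_validator(key, metadata_version)
        validate(value)


document = '{"Metadata-Version": "2.1", "Name": "example", "Version": "1.0"}'
validate_metadata(json.loads(document))  # passes silently
```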

ssbarnea · 2018-12-27T10:44:22Z

Any chance to fix this?

bittner · 2018-12-31T01:11:08Z

Any chance to fix this?

I don't think it will be from this PR. I would have loved to get this fixed. It doesn't depend on me now, I'm afraid. 😞

@ssbarnea Go ahead in pypa/packaging#147 asking for progress, please!

ssbarnea

If you could update this it would be great as we really need to add metadata validation.

ssbarnea · 2019-01-14T20:14:32Z

setuptools/validation.py

+    """A string that can contain newline characters"""
+    if val is None:
+        return
+    assert isinstance(val, str), \


@bittner Please update the code and use raise exception. Assert should be used only inside tests and I am sure that some linters are even able to identify accidental use outside tests.

jaraco · 2020-03-21T09:28:25Z

This contribution seems to have stalled, at an impasse due to lack of clarity on where the validation should be implemented or whose implementation should be used. I'm uninterested in merging functionality here that's going to be implemented redundantly in another package. As a result, I'm going to defer this merge indefinitely, though I'm happy to revive the PR or review a new one at such a point that a (shared) implementation is accepted.

Thanks @bittner for your contribution, which will remain here even if not merged.

bittner · 2020-03-23T10:08:09Z

Well, at least I tried. TBH, I'm a bit disappointed about how this PR went. I would have continued elsewhere if directed properly.

I'm keen on finding the "right place" and implementing an elegant, sustainable and beautiful solution (honoring PEP 20 and Clean Code). Clearly, you all know the code base better, but you could have leveraged my motivation. Instead, I found myself blocked by someone else's intention of "I'll do it myself" (which doesn't seem to have resulted in a working solution to this day, half a year later).

This is not meant as a personal rant. I guess, most of us do that kind of programming in their spare time, sacrificing their families. Honor this! And, please, open yourselves to contributions from people that are not yet part of "your club". (At least that was my, clearly personal, impression. Sorry about that.)

ssbarnea · 2020-03-23T10:42:36Z

With all due respect, I have seen the dead cat (metadata validation) passed between several projects, and this is really sad for multiple reasons: it discourages new contributors and fails to deliver a feature that is highly desired and needed.

I can understand that the maintainers are worried about added complexity and prefer to focus on other things, but once they realise that metadata needs to be validated by another project, they should be the first to set up this placeholder project and link to it.

The other approach, "build it and we will use it if we like it", is not really constructive, as it is used in many cases as an excuse to do nothing (I'm not saying that is the case here).

di · 2020-03-24T00:28:33Z

@bittner Really sorry that you feel like your time/motivation was wasted here, I definitely feel like that's my fault. Unfortunately I haven't had much time (until now) to work on this myself, or help someone else work on it. If you're interested, there will surely still be work to be done here. I posted an update on pypa/packaging#147 outlining what's been holding this up and the current status.

@ssbarnea I'm not sure what you mean. At least on this issue, I think @pganssle and I have agreed that metadata validation should exist in packaging, and this has been the plan from PyPI's perspective as well for quite a while. This isn't a case of "setuptools maintainers" saying "packaging maintainers" should do it instead... we're all working on the same thing here.

ssbarnea · 2020-03-24T10:34:13Z

@di Super! I am happy to hear we reached an agreement regarding where this feature will land.

pganssle · 2020-03-24T18:36:33Z

@ssbarnea To be clear, there was never any disagreement here. The submitter merely misunderstood what was being asked of them.

bittner · 2020-03-25T00:09:01Z

The submitter merely misunderstood what was being asked of them.

Yeah, sure.

Add metadata validation

914cf24

bittner mentioned this pull request Oct 29, 2018

Newlines in the description field produce a malformed PKG-INFO #1390

Closed

pganssle requested changes Oct 29, 2018

pganssle added draft deferred labels Oct 29, 2018

pganssle reviewed Oct 29, 2018

bittner mentioned this pull request Dec 5, 2018

Add metadata validation pypa/packaging#147

Closed

ssbarnea suggested changes Jan 14, 2019

brainwane mentioned this pull request Jun 12, 2019

Increasing pip's & PyPI's metadata strictness pypa/packaging-problems#264

Open

JuliaSprenger mentioned this pull request Sep 17, 2019

Fix packaging for upload on pypi INM-6/python-odmltables#112

Closed

jaraco closed this Mar 21, 2020
