Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguous version parsing in parse_sdist_filename #527

Open
woodruffw opened this issue Mar 28, 2022 · 32 comments
Open

Ambiguous version parsing in parse_sdist_filename #527

woodruffw opened this issue Mar 28, 2022 · 32 comments

Comments

@woodruffw
Copy link
Member

Hi there! First of all, thanks for continuing to maintain this package -- it's extremely useful 🙂

I'm one of the maintainers of pip-audit, and we had a user report some strange dependency resolution behavior: pypa/pip-audit#248

We were able to root-cause the bug down to a release of cffi (1.0.2-2) that uses the implicit post releases syntax for specifying the post-release number, rather than the canonicalized postN format. This release of cffi is published on PyPI here, without canonicalization, so it's likely that it was uploaded before PyPI began normalizing versions.

Because 1.0.2-2 contains a dash, the following body of packaging.utils.parse_sdist_filename contains an incorrect assumption and parses the source distribution name incorrectly:

def parse_sdist_filename(filename: str) -> Tuple[NormalizedName, Version]:
    if filename.endswith(".tar.gz"):
        file_stem = filename[: -len(".tar.gz")]
    elif filename.endswith(".zip"):
        file_stem = filename[: -len(".zip")]
    else:
        raise InvalidSdistFilename(
            f"Invalid sdist filename (extension must be '.tar.gz' or '.zip'):"
            f" {filename}"
        )

    # We are requiring a PEP 440 version, which cannot contain dashes,
    # so we split on the last dash.
    name_part, sep, version_part = file_stem.rpartition("-")
    if not sep:
        raise InvalidSdistFilename(f"Invalid sdist filename: {filename}")

    name = canonicalize_name(name_part)
    version = Version(version_part)
    return (name, version)

yielding:

>>> from packaging.utils import parse_sdist_filename
>>> parse_sdist_filename("cffi-1.0.2-2.tar.gz")
('cffi-1-0-2', <Version('2')>)

whereas we expected:

>>> from packaging.utils import parse_sdist_filename
>>> parse_sdist_filename("cffi-1.0.2-2.tar.gz")
('cffi', <Version('1.0.2.post2')>)

TL;DR: parse_sdist_filename shouldn't rely on the last dash as a separator between the distribution name and the version, since PEP 440 allows dashes in non-normalized versions. Parsing this correctly poses a bit of a challenge, since distribution names can also contain dashes and numbers and might even contain them in pathological ways, such as:

# package foo3, version 1.0.0.post1
foo3-1.0.0-1.tar.gz

# package foo-3, version 1.0.0.post1
foo-3-1.0.0-1.tar.gz

# package 3_3, version 1.0.0.post1
# i'm not sure this one is valid, but i can't find a countervailing spec in any of the packaging PEPs
3-3-1.0.0-1.tar.gz
@woodruffw
Copy link
Member Author

(Another possible solution here is to normalize all of the current distribution releases hosted on PyPI, and emphasize that parse_sdist_filename only supports normalized PEP 440 versions. But I'm not sure about the blast radius/impact of doing so -- it would mean changing a lot of the index filenames, if not their content-addressed distribution filenames on the CDN.)

@woodruffw
Copy link
Member Author

Here's the valid distribution regexp in PEP 508:

^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$

So 3-3 and 3_3 are valid distribution names, making 3-3-1.0.0-1.tar.gz and 3_3-1.0.0-1.tar.gz valid pre-normalized sdist filenames.

@di
Copy link
Sponsor Member

di commented Mar 28, 2022

I'm not sure I would consider this to be a bug in packaging - we are assuming the filename has a normalized PEP 440 version, which technically cffi-1.0.2-2.tar.gz does, and have no way to determine that this is an implicit post release and not a project named cffi-1.0.2 with a version 2.

One thing we could do here is to allow the caller to optionally provide the distribution name they were expecting to be present, so this function could either use this as a hint during parsing, or just raise an exception when what it found isn't what was expected. I'm not sure how useful this would be broadly, though.

(Another possible solution here is to normalize all of the current distribution releases hosted on PyPI, and emphasize that parse_sdist_filename only supports normalized PEP 440 versions. But I'm not sure about the blast radius/impact of doing so -- it would mean changing a lot of the index filenames, if not their content-addressed distribution filenames on the CDN.)

Fairly unlikely we would do this, IMO.

@woodruffw
Copy link
Member Author

One thing we could do here is to allow the caller to optionally provide the distribution name they were expecting to be present, so this function could either use this as a hint during parsing, or just raise an exception when what it found isn't what was expected. I'm not sure how useful this would be broadly, though.

This seems like a reasonable workaround to me, but +1 on the general usefulness concern (it's definitely useful for our needs on pip-audit, but I'm not sure about anyone else's needs).

woodruffw added a commit to pypa/pip-audit that referenced this issue Mar 28, 2022
woodruffw added a commit to pypa/pip-audit that referenced this issue Mar 28, 2022
)

* pip_audit/dependency_source: match candidate names against project

See: pypa/packaging#527.

Fixes #248.

* pip_audit/dependency_source: remove redundant `is_satisfied_by` test

* test: add tests for vexing sdist parses

* test: update comment

* CHANGELOG: record fixes

* setup: pin `click`

Works around psf/black#2964

* setup: add note about pinned click
@brettcannon
Copy link
Member

I have opened pypa/packaging.python.org#1066 to strengthen the spec around sdist file names. That implicitly updates our docs since we reference it for what the function can parse.

@uranusjr
Copy link
Member

For some additional information. pip is able to install the package because the Simple API also provides the project name in the URL (for cffi, the endpoint is https://pypi.org/simple/cffi/), so pip is able to “know” that the project name is cffi instead of cffi-1-0-2 and use that information to assist parsing. Perhaps parse_sdist_filename could do the same by accepting an optional second argument? (And if packaging does not, pip-audit can still add additional logic around the function to handle edge cases for compatibility.)

We should probably loop some of the cffi maintainers to inform them about this.

@woodruffw
Copy link
Member Author

pip-audit can still add additional logic around the function to handle edge cases for compatibility.)

Yep, that's what we ended up doing: pypa/pip-audit#249

That suggestion otherwise sounds similar to what @di suggested in #527 (comment), which sounds good to me! I'm happy to make those changes in a PR if other consumers would find them useful 🙂

@woodruffw woodruffw changed the title Incorrect version parsing in parse_sdist_filename Ambiguous version parsing in parse_sdist_filename Mar 31, 2022
@woodruffw
Copy link
Member Author

Updated the title to emphasize that this is an ambiguity, not a spec violation per se.

@woodruffw
Copy link
Member Author

Related: the ambiguous parse here could also probably be avoided by enforcing the current guidance on sdist filenames, which suggests that distribution names be normalized from - to _. I opened pypa/packaging.python.org#1077 as part of that, since PyPI and other hosts don't currently perform that normalization.

@uranusjr
Copy link
Member

uranusjr commented May 3, 2022

Well… https://peps.python.org/pep-0625/

@woodruffw
Copy link
Member Author

Well… https://peps.python.org/pep-0625/

Yep! I was just reading that PEP 🙂. Right now it says that distributions are normalized according to 503, which defers to 426, which means that they can still contain -. Unless I'm misunderstanding, we'd probably need a slightly stronger normalization to avoid the ambiguous parse in this case.

(But also NB that it's not currently an issue, since all recent sdists are being published with normalized versions. This would strictly be a "cherry on top" in terms of reducing ambiguity.)

@dstufft
Copy link
Member

dstufft commented May 3, 2022

You missed a part:

The name of an sdist should be {distribution}-{version}.sdist.

  • distribution is the name of the distribution as defined in PEP 345, and normalised according to PEP 503, e.g. 'pip', 'flit-core'.
  • version is the version of the distribution as defined in PEP 440, e.g. 20.2.

Each component is escaped according to the same rules as PEP 427.

Specifically:

Each component is escaped according to the same rules as PEP 427.

So the names get normalized as per PEP 503, which will convert any runs of non alphanum characters into a single -, then PEP 427 escaping - into _.

@dstufft
Copy link
Member

dstufft commented May 3, 2022

TBH, PEP 503 normalization really only means the name gets lowercased, since PEP 427 implements the exact same (effectively anyways) normalization as PEP 503 does, just uses _ instead of -, other than PEP 503 adds lower casing as well.

@woodruffw
Copy link
Member Author

Whoops, I also missed 427.

Just to clarify:

So the names get normalized as per PEP 503, which will convert any runs of non alphanum characters into a single -, then PEP 427 escaping - into _.

This only applies to wheels, not sdists, correct? Experimentally PyPI does not currently normalize - into _ for sdist distribution names. Example: https://pypi.org/simple/pip-audit/

(That's where the ambiguity currently appears to me. The current PyPA spec says that that normalization does occur for sdists, whereas it doesn't appear to.)

@uranusjr
Copy link
Member

uranusjr commented May 3, 2022

PEP 427 only applies to wheels, and PEP 625 says “we’ll use the same rules for sdist”.

@woodruffw
Copy link
Member Author

That's this language, right?

Each component is escaped according to the same rules as PEP 427.

If so, that would mean a format like this, correct?

foo_bar-1.2.3.sdist

If so, that all makes sense so me, and that would be a good improvement 🙂

(This is a significant tangent at this point, but I wonder if we couldn't avoid all of this ad-hoc parsing entirely by updating PEP 503 to include more data-... attributes. This could be done both both sdists and wheels: have attributes like data-distribution=..., data-version=..., and so forth.)

@uranusjr
Copy link
Member

uranusjr commented May 3, 2022

Hopefully all these would go away when we finally standardise the JSON API. Regarding the PEP… happy to hear it all make sense for you, unfortunately not many people are entirely in agreement with the proposal (read the Discourse thread linked in the PEP).

@brettcannon
Copy link
Member

Would a utils.escape_name() function be useful in this case? I assume it would just be utils.canonicalize_name(name).replace("-", "_"), but at least that's easier than saying "do PEP 503 with PEP 427". 😉

@woodruffw
Copy link
Member Author

Would a utils.escape_name() function be useful in this case? I assume it would just be utils.canonicalize_name(name).replace("-", "_"), but at least that's easier than saying "do PEP 503 with PEP 427". 😉

I think it'd be useful for 625, although the current sdist filename format is probably foregone (since PyPI and other hosts don't do that escaping and lots of packages are already hosted with - in the filename).

@brettcannon
Copy link
Member

the current sdist filename format is probably foregone (since PyPI and other hosts don't do that escaping and lots of packages are already hosted with - in the filename).

For old sdists, yes, but there's nothing saying that naming can't be enforced going forward to slowly update the community through attrition.

@brettcannon
Copy link
Member

I opened #542 for escape_name().

@dstufft
Copy link
Member

dstufft commented May 5, 2022

I would have to think about it some, but there's a high chance that we could just force normalize sdist names on upload in Warehouse. There's a lower chance, but still reasonable, that we could rename existing sdists to be normalized. You could open a Warehouse ticket about it to figure it out.

@woodruffw
Copy link
Member Author

Yep, I can open that ticket in a bit.

@dstufft
Copy link
Member

dstufft commented May 9, 2022

FWIW, I was reading the write up about this, and FWIW I do in fact think that this is a bug with the current parse_sdist_filename.

Unfortunately parse_sdist_filename has made the assumption that you can parse a sdist filename without any additional context, which absent PEP 625 is not true. Without something like PEP 625, we have never required sdists normalize or escape any part of their name.

The sdist page on packaging.python.org suggests that canonicalizing the name and version is a de facto standard, which does not appear to actually be true? Looking through a number of projects, I don't see hardly any canonicalizing really happening.

While I agree that we should be, at a minimum, escaping - to _ in sdists, there is no actual standard or practice currently that requires it. I believe that this util function has gone unnoticed because I don't think that many projects utilize post releases, and pip itself doesn't use this function so it's not had enough use to have someone trigger the edge case until now.

Unfortunately I don't have a great answer here, parse_sdist_filename inherently can't work with all currently valid sdist names unless you break compatability to require passing in the additional context of the package name in question. So without breaking compatability, it is a fundamentally broken API. Ideally I guess PEP 625 gets accepted and then at least this function could be documented as only support PEP 625 compatabile file names.

Having parse_sdist_filename exist in this library is kind of weird in general, since the libraries stated goals are:

This library provides utilities that implement the interoperability specifications which have clearly one correct behaviour (eg: PEP 440) or benefit greatly from having a single shared implementation (eg: PEP 425).

Which doesn't actually exist for sdist filenames.

@brettcannon
Copy link
Member

Unfortunately parse_sdist_filename has made the assumption that you can parse a sdist filename without any additional context, which absent PEP 625 is not true.

I think you're right, so we should probably update the docs to say that it implements PEP 625 even though it's not accepted.

Having parse_sdist_filename exist in this library is kind of weird in general

I think when we accepted the function we didn't realize how many nasty edge cases that are all technically legal we had missed, or that they are in any way widespread. Maybe this is more motivation to get PEP 625 accepted. 😁

@dstufft
Copy link
Member

dstufft commented May 10, 2022

Yea, it happens. packaging is full of them :P (though increasingly less of them!)

I think documenting that it implements PEP 625, possibly with warning that PEP 625 isn't accepted or enforced anywhere yet, so that there are edge cases where it will not correctly parse otherwise currently valid names to warn people is probably a reasonable option.

It's also probably the best option for what to do, since it's somewhat silly to just egregiously break compatibility for an edge case, particularly one that I think is somewhat hard to trigger in practice using modern tooling (IIRC setuptools normalizes the version by default anyways, and the cffi version that actually broke this, isn't a valid package for other reasons).

@dstufft

This comment was marked as outdated.

@dstufft

This comment was marked as outdated.

@brettcannon

This comment was marked as outdated.

@brettcannon
Copy link
Member

FYI https://peps.python.org/pep-0625/ has been accepted!

@pradyunsg
Copy link
Member

pradyunsg commented Dec 7, 2022

Do we want this function to serve as a validating parser for PEP 625 (i.e. checking that the distribution is canonically named, checking that the version is normalised) or just a plain "split and call Version"?

@brettcannon
Copy link
Member

I'm personally fine with following PEP 625 since sdists are otherwise just a "good luck to you" situation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants