
stop using the json API at pypi #9170

Open

dimbleby wants to merge 2 commits into main from stop-using-pypi-json-api
Conversation

dimbleby
Contributor

@dimbleby dimbleby commented Mar 17, 2024

This is intended mostly as a simplification: now that the PEP 658 backfill is complete, we have two ways of accessing essentially the same information, and it is better to have just one. This also allows the relevant code to be shared among all HttpRepository classes.

This should be a small win for any packages with no dependencies. For these packages, poetry reads the json API, correctly mistrusts it when it says that the package has no dependencies, and ends up at the metadata anyway. Now we just go straight to the metadata.

Otherwise I suspect this is largely a wash: most of the time it swaps one network request for another, and instead of parsing a json response it parses a PKG-INFO.
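For concreteness, here is a minimal sketch of the metadata side of that swap: a PEP 658 metadata file (or an sdist's PKG-INFO) is a Core Metadata document that the stdlib email parser can read. The function name and sample text are illustrative, not poetry's actual code.

```python
# Sketch of reading dependencies from a Core Metadata document, as served
# by PEP 658 (the wheel's METADATA at <file-url>.metadata) or found in an
# sdist's PKG-INFO. Names and sample text are illustrative.
from email.parser import Parser

def requires_dist(metadata_text: str) -> list[str]:
    """Return the Requires-Dist entries from Core Metadata text."""
    msg = Parser().parsestr(metadata_text)
    return msg.get_all("Requires-Dist") or []

sample = (
    "Metadata-Version: 2.1\n"
    "Name: example\n"
    "Version: 1.0\n"
    "Requires-Dist: requests (>=2.0)\n"
    "Requires-Dist: attrs ; extra == \"tests\"\n"
)
print(requires_dist(sample))
```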

To be honest I might not have started on this if I had known how tedious sorting out the tests was going to be...

The test script changes were larger than hoped, mostly because it turns out the pypi-repository had a "fallback" flag. The meaning of this flag seems to be "behave correctly", yet it was set to False in unit tests. The refactor eliminates that flag, and as a result the unit tests now find themselves using more fixtures - because dependencies are correctly being chased.

I have removed two testcases rather than patch them up:

  • test_fallback_pep_658_metadata() becomes somewhat redundant since lots of tests are now using metadata
  • test_solver_skips_invalid_versions()
    • This test was added at 1ebd629, the code that it then tested survives I think only in PypiRepository.search() - which is not what the test covers at all
    • this one testcase was pulling in a large number of fixtures in the pandas / numpy / scipy / matplotlib family, I judge that it just is not worth it

@dimbleby dimbleby force-pushed the stop-using-pypi-json-api branch 4 times, most recently from c3d7060 to d5be499 Compare March 18, 2024 00:20
@abn
Member

abn commented Mar 18, 2024

This is great work @dimbleby. A few notes:

  1. Can we please add the metadata file fetching to https://github.com/python-poetry/poetry/blob/main/tests/repositories/fixtures/pypi.org/generate.py so that we can manage them going forward?
  2. Let's remove the descriptions from the files, as we do not really use them in our tests.
  3. The binaries added in dist should really be added by the generate script, emptying out unnecessary content and updating hashes in the relevant files.

@dimbleby
Contributor Author

dimbleby commented Mar 18, 2024

I had a look at generate.py, but it did not work for me, and when I tried to fix it, it did not do the things that I wanted it to do. The fixture updates here were made without it.

I am not motivated to update it to do the work that I have already done. My view would be that if it was useful for you at #9132: super, but I am not interested in updating it. If you think that it is likely to be useful to you then I have no objection to you updating it...

(fwiw my method for downloading metadata files to fixtures was to add some temporary code in the mocking: ie I arranged that when a testcase wanted a missing fixture, it just added it).
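A hypothetical version of that capture-on-miss trick (all names and layout here are made up for illustration, not poetry's actual test code): the mocked fetch serves a fixture from disk and, when the fixture is missing, calls through to a real fetcher once and saves the response as the new fixture.

```python
# Capture-on-miss sketch: serve a local fixture if present, otherwise
# fetch it for real and save it so the next run finds it on disk.
# fetch_with_capture and real_fetch are illustrative names.
from pathlib import Path

def fetch_with_capture(url, fixture_dir, real_fetch):
    fixture = Path(fixture_dir) / url.rsplit("/", 1)[-1]
    if not fixture.exists():
        fixture.write_bytes(real_fetch(url))  # capture the missing fixture
    return fixture.read_bytes()
```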

Removing descriptions etc. messes with the hashes; IMO this sort of thing just generates unnecessary work compared to using the real files.

@abn
Member

abn commented Mar 18, 2024

I am not motivated to update it to do the work that I have already done. My view would be that if it was useful for you at #9132: super, but I am not interested in updating it. If you think that it is likely to be useful to you then I have no objection to you updating it...

I can appreciate that it is added effort for this sort of change, but for us as maintainers it does remove certain pain points and allows us to manage the sprawl of randomly added files. While it might not be "useful" or "interesting" to you, it does make life easier for us.

(fwiw my method for downloading metadata files to fixture was to add some temporary code in the mocking: ie I arranged that when a testcase wanted a missing fixture, it just added it).

I think that is a great way of doing it. The original list of things I used in #9132 was similarly generated, but I opted for a static list in generate.py as it won't lead to tests succeeding when they shouldn't. But yes, for discovery and filling in items it makes a lot of sense to do it this way.

@dimbleby
Contributor Author

If updating generate.py is something that you want then I am not going to try to stop you. But "useful" or "interesting" are pretty much exactly my deciders for poetry contributions and my judgement is that this is neither of those things.

I would be willing to update this PR to remove generate.py; I acknowledge that the merge has left us in an inconsistent place where the file exists but does not do what it promises.

abn
abn previously requested changes Mar 18, 2024
Member

@abn abn left a comment


As this change is in the direction we want to go as a project, I am sure it will eventually get across the line.

The following, in my opinion, will need to be resolved prior to merging this.

  1. The mocked test fixtures should be generated such that they can be regenerated along with the existing fixtures - i.e. the generate.py script must be updated.
  2. The fixtures in dist should only be for distributions that are fully preserved for the purposes of the test. The rest must be stubbed.
  3. Optionally, any data that is not required for the purposes of the test should be removed.

If you do not want to make those changes yourself; feel free to leave this PR for another contributor to carry/address the code review comments.

@dimbleby
Contributor Author

I am sceptical about the value of "stubbing" files in dists. The files added there in this pull request are - combined - considerably smaller than the existing setuptools wheel: the extra complication that stubbing causes in making hashes consistent across metadata and scripts is to my mind not at all worthwhile for the small number of bytes saved.

Anyway I think I probably have been clear enough but so as to leave no doubt: I do not agree that the effort/reward ratio in updating generate.py is favourable and I do not intend to do it.

@abn
Member

abn commented Mar 19, 2024

I am sceptical about the value of "stubbing" files in dists. The files added there in this pull request are - combined - considerably smaller than the existing setuptools wheel: the extra complication that stubbing causes in making hashes consistent across metadata and scripts is to my mind not at all worthwhile for the small number of bytes saved.

I can appreciate your skepticism. The existing wheels are there because they have been required by some test cases - ideally I would like to remove them as well. The main intent of stubbing the files has nothing to do with saving bytes (that is a good side effect): we want to avoid hosting and distributing usable binaries, especially ones that contain vulnerable code. The modified hashes also have a useful property, albeit not required at the moment, of making sure we are testing against the fixtures and not accidentally fetching files from the internet. The hash computation steps also allow us to add custom fixtures and generate hashes in the links we care about.

Anyway I think I probably have been clear enough but so as to leave no doubt: I do not agree that the effort/reward ratio in updating generate.py is favourable and I do not intend to do it.

Understood. We simply see it differently.

@abn abn dismissed their stale review March 19, 2024 20:17

Requested changes were implemented in #9175

@dimbleby dimbleby force-pushed the stop-using-pypi-json-api branch 3 times, most recently from 39f2c9d to a252b96 Compare March 19, 2024 23:10
@abn
Member

abn commented Mar 20, 2024

@dimbleby I will fix up the generate script, there are a few bad assumptions made that need fixing.

@abn
Member

abn commented Mar 21, 2024

@dimbleby made the required changes at #9186 feel free to pick them up from there.

One thing I noticed while getting that working was that this change forces Poetry to pull sdists in the cases a package does not distribute wheels - which surprisingly seems to be a fair few projects. This leads to an awkward situation where the average user's UX is likely to be negatively impacted by this change.

I am no longer sure that removing it entirely is the best course of action until sdists are also populated. Although, I should also point out that fetching and inspecting the sdist is likely going to yield more accurate results than the JSON API, based on previous issues handled.

cc: @python-poetry/core

@dimbleby
Contributor Author

dimbleby commented Mar 21, 2024

this change forces Poetry to pull sdists in the cases a package does not distribute wheels.

That is not changed by this pull request. Project dependencies on the json API are populated only from wheels (specifically, they are populated only if the first uploaded distribution for a version is a wheel). Therefore poetry already has to pull the sdist to determine dependencies for sdist-only releases, and this is unavoidable.

what has changed is that previously the unit tests set fallback = False on the PyPiRepository. The meaning of this flag seems to have been "pretend that the json API is reliable when it says no-dependencies", but since that is a bogus assumption the flag was never set outside of unit tests.
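The mistrust logic being described could be sketched roughly like this (the dict keys match entries in the JSON API's "urls" list; the function itself is hypothetical, not poetry's actual code):

```python
# Illustrative sketch: the JSON API's dependency list is only usable when
# it is non-empty and the first uploaded file for the release was a wheel.
# An empty requires_dist is ambiguous: "no deps" or "never extracted".
def json_deps_usable(urls, requires_dist):
    if not requires_dist or not urls:
        return False
    first = min(urls, key=lambda u: u["upload_time"])
    return first["filename"].endswith(".whl")
```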

@abn
Member

abn commented Mar 21, 2024

The fallback = False case is only one instance of this. For example, https://pypi.org/pypi/trackpy/0.6.2/json lists the requires_dist value; my understanding (granted, I have not looked at the code path much recently) is that the values there are presently respected, as the field is not empty. This in turn means we download an sdist that we otherwise would not have.

@dimbleby
Contributor Author

dimbleby commented Mar 21, 2024

I am surprised to see useful values for an sdist-only release.

Perhaps I am out of date about how things work - it looks as though something changed recently to make this possible: https://pypi.org/pypi/trackpy/0.6.1/json does not provide requires_dist but https://pypi.org/pypi/trackpy/0.6.2/json does.

I'm not sure whether I agree that this is enough reason to stick with the json API, but I do agree that the balance is different than I thought it was an hour ago!

I haven't seen pypi provide metadata for any source distributions, only for wheels. Does that agree with your experience? Maybe worth a report on their issue tracker to ask whether this is expected or fixable.

@abn
Member

abn commented Mar 21, 2024

I'm not sure whether I agree that this is enough reason to stick with the json API, but I do agree that the balance is different than I thought it was an hour ago!

I think others on core proposed that we might consider keeping the JSON API as a fallback option if we detect an sdist only scenario. That will mean keeping the logic around longer.

I haven't seen pypi provide metadata for any source distributions, only for wheels. Does that agree with your experience? Maybe worth a report on their issue tracker to ask whether this is expected or fixable.

My understanding was that the metadata would only be available for wheels. The PEP is a bit ambiguous on that, I believe. So the question might be worth posing.

@dimbleby
Contributor Author

My understanding was that the metadata would only be available for wheels

I think that's right, I think it is returning METADATA rather than PKG-INFO.

I think my view would be that the json API is sufficiently un-approved by the warehouse folk, and sdist-only packages are relatively infrequent, that we should go ahead anyway and encourage anyone who is annoyed by the slowdown to go and bug maintainers to publish wheels. But I'm not 100% committed to that view.

As I said at the start, this pull request has altogether been more trouble than I had hoped! I doubt I am likely to work up the enthusiasm any time soon to rearrange it so that it partially reinstates the json api: if y'all think that is a good direction then please don't wait on me to do it!

@dimbleby dimbleby force-pushed the stop-using-pypi-json-api branch 4 times, most recently from 85b3715 to 72a920f Compare March 23, 2024 19:03
@dimbleby
Contributor Author

dimbleby commented Mar 23, 2024

notes on recent commits:

  • thanks for the work on generate.py.
  • that the "test fixture consistency" test is passing only shows that it does not spot all the files that have been deleted
  • I've continued to remove test_fallback_pep_658_metadata(); it is surely redundant now that pretty much everything uses PEP 658 metadata
  • I've tweaked test_find_packages_does_not_select_prereleases_if_not_allowed() so that it uses a package where the fixtures actually list some prereleases

re sdists and the json API: I haven't understood what, if anything, has changed recently.

But I have been keeping half an eye out; and have not encountered any other examples where an sdist-only release has its requirements listed on that API (but several examples where they do not).

I still do not expect to do anything about the sometimes-use-the-json-api-after-all idea, and remain content that not using it is reasonable.

@dimbleby dimbleby force-pushed the stop-using-pypi-json-api branch 2 times, most recently from c448f05 to da5ca99 Compare March 23, 2024 19:27
@abn
Member

abn commented Mar 23, 2024

  • thanks for the work on generate.py.
  • that the "test fixture consistency" test is passing only shows that it does not spot all the files that have been deleted

Yeah 😂, noticed that. Will solve that later.

In this case you can simply flip the value here after the delete, and hopefully that should work as expected.

https://github.com/python-poetry/poetry/blob/main/tests%2Frepositories%2Ffixtures%2Fpypi.org%2Fgenerate.py#L71

re sdists and the json API: I haven't understood what, if anything, has changed recently.

But I have been keeping half an eye out; and have not encountered any other examples where an sdist-only release has its requirements listed on that API (but several examples where they do not).

I'm a bit lost on this one myself. But I'm going to let others on the team chime in on how they want to proceed. I'm on the fence. Think @neersighted had some thoughts on this as well.

I still do not expect to do anything about the sometimes-use-the-json-api-after-all idea, and remain content that not using it is reasonable.

I understand, and if we have consensus that we really do not want to keep the json API around, we might merge this as is. Let's see.

@neersighted
Member

Trying to summarize my thoughts:

This PR can introduce a fair performance regression for users, because sdist-only distros will require a download of the distfile. Basically, the JSON API was representing the metadata in PKG-INFO, but PEP 658 support on PyPI does not extend to sdists/PKG-INFO.

This is actually a deficiency in PyPI; PEP 658 extends to all distfiles:

The metadata file defined by the Core Metadata Specification [core-metadata] will be served directly by repositories since it contains the necessary information for common use cases. The metadata must only be served for standards-compliant distributions such as wheels [wheel] and sdists [sdist], and must be identical to the distribution’s canonical metadata file, such as a wheel’s METADATA file in the .dist-info directory [dist-info].

From the sdist format, it appears that PKG-INFO is mandatory:

A .tar.gz source distribution (sdist) contains a single top-level directory called {name}-{version} (e.g. foo-1.0), containing the source files of the package. The name and version MUST match the metadata stored in the file. This directory must also contain a pyproject.toml in the format defined in pyproject.toml specification, and a PKG-INFO file containing metadata in the format described in the Core metadata specifications specification. The metadata MUST conform to at least version 2.2 of the metadata specification.

Of course, I am sure there are many sdists on PyPI that don't make PKG-INFO available, but I'd be okay with accepting the reversion to downloading sdists for those as an edge case.
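For a compliant sdist, the fallback path is at least mechanical: once the tarball is downloaded, the mandatory top-level PKG-INFO carries the metadata. A stdlib-only sketch (the function name is illustrative):

```python
# Sketch of the sdist path: pull the top-level PKG-INFO out of a
# downloaded .tar.gz sdist. Returns None for non-compliant sdists
# that ship no top-level PKG-INFO.
import io
import tarfile

def read_pkg_info(sdist_bytes):
    """Return the top-level PKG-INFO text from a .tar.gz sdist, or None."""
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode="r:gz") as tf:
        for member in tf.getmembers():
            parts = member.name.split("/")
            # the spec mandates {name}-{version}/PKG-INFO at the top level
            if member.isfile() and len(parts) == 2 and parts[1] == "PKG-INFO":
                return tf.extractfile(member).read().decode()
    return None
```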

I think we've been considering pypi/warehouse#8254 equivalent to full PEP 658 support in PyPI, but it is clear there is more work to do to get data-core-metadata to also support sdists.

In the meantime, I think we might want to consider using PEP 658 data first, but including the JSON API as a fallback flow, to avoid a user-facing regression in performance.
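That fallback flow amounts to a simple preference order; a purely illustrative sketch, not an actual implementation:

```python
# Hypothetical preference order for dependency metadata: prefer PEP 658,
# fall back to the JSON API only for sdist-only releases, and download
# the distribution itself as a last resort. All names are illustrative.
def metadata_strategy(serves_pep658: bool, sdist_only: bool) -> str:
    if serves_pep658:
        return "pep-658-metadata"
    if sdist_only:
        return "json-api-fallback"
    return "download-distribution"
```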

@dimbleby
Contributor Author

dimbleby commented Mar 25, 2024

fwiw my best guess is that it's a recent version of twine that has for some reason become capable of uploading more complete metadata information alongside an sdist - there was a release at the start of February.

That would be consistent with trackpy 0.6.2 being the only sdist-only release that I know of where the json API returns dependency information - that was released a couple of weeks after the twine release.

If that's correct - and it might not be - then abandoning the JSON API now doesn't give up anything, because up until very recently there would have been no sdist-only cases where it was useful anyway. But it could represent a lost opportunity for freshly uploaded sdist-only releases.

@dimbleby
Contributor Author

dimbleby commented Mar 25, 2024

A bit of context on "From the sdist format, it appears that PKG-INFO is mandatory"

This describes the "new" sdist format, but there's a massive get-out clause "There is also the legacy source distribution format ..."

The "new" format requires pyproject.toml and metadata 2.2. Currently, it's only a small exaggeration to say that roughly nothing is compliant with that - it is only recently that pypi even accepted metadata 2.2.

As of today I think it is only packages using "modern" tools like flit and hatch that can be compliant with that, and I bet nearly all of those also upload wheels anyway.

I am not sure we will ever see a significant number of packages in the intersection of savvy-enough-to-use-pyproject.toml-and-new-metadata and old-school-enough-to-not-upload-wheels.

So waiting on pypi to serve metadata for those sdists is imo low value.

That cuts both ways: if you think that the tail of old-school sdist-only packages is long enough, you could read this as making the case for never abandoning the json api.

@dimbleby
Contributor Author

dimbleby commented Apr 11, 2024

A couple of recent examples of the extra confusion that we cause by using the JSON API at pypi - and therefore introducing code paths for pypi that are quite different from the non-pypi case.

The failure to parse metadata 2.3 with older pkginfo is masked when reading dependencies from the json API:

  • on the plus side, this issue would be more severe if it affected pypi users
  • however it is much more confusing for users who encounter it only on mirrors or other non-pypi repositories. Much harder, for example, to provide a repro when the supposed repro works just fine against the major public repository.
  • there probably ought already to have been a poetry patch release requiring pkginfo 1.10; I am pretty sure that there would have been if pypi users were impacted

The mis-publishing of docutils 0.21 as described in #9293, #9297

  • we get different behaviour depending on whether docutils is found on pypi or, say, a mirror
  • again, causing extra confusion

I expect there are and will be more, perhaps I will update this comment as I find them.

No knockout blow saying that "we must get off the JSON API" here, just adding to the accumulation of reasons that it would be helpful.
