lazy-wheel causes poetry to use all the RAM #8929

Closed
dimbleby opened this issue Jan 31, 2024 · 18 comments · Fixed by #8941
Labels
kind/bug Something isn't working as expected

Comments

@dimbleby
Contributor

dimbleby commented Jan 31, 2024

```toml
[tool.poetry]
name = "repro"
version = "0.1.0"
description = ""
authors = []
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
ruff = "^0.1.15"

[tool.poetry.group.jax.dependencies]
jaxlib = { extras = [
    "cuda12-pip",
], version = "^0.4.21+cuda12.cudnn89", source = "jax-releases" }
jax = { extras = ["cuda12-pip"], version = "^0.4.21" }

[[tool.poetry.source]]
name = "jax-releases"
url = "https://storage.googleapis.com/jax-releases/jax_cuda_releases.html"
priority = "explicit"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

If we do not disable lazy-wheel, then trying to poetry install this (with an empty cache) causes Poetry to use all the RAM until the OS kills it.

@dimbleby dimbleby added kind/bug Something isn't working as expected status/triage This issue needs to be triaged labels Jan 31, 2024
@Secrus
Member

Secrus commented Jan 31, 2024

@radoering

Maybe there is some kind of size limit we could introduce? This is for sure a release blocker. We need to either fix this or revert that feature.

@Secrus Secrus mentioned this issue Jan 31, 2024
@Secrus Secrus removed the status/triage This issue needs to be triaged label Jan 31, 2024
@radoering
Member

> then trying to poetry install this

Is it only poetry install or is poetry lock sufficient to trigger the issue?

I just noticed that Poetry has to download all wheels during locking despite lazy-wheel, because the source does not provide hashes, but so far I cannot observe the memory usage increasing during locking.

@radoering
Member

OK, now it happened. poetry lock seems to be sufficient. The memory usage does not seem to grow continuously; it shoots up abruptly.

@dimbleby
Contributor Author

I think the issue is HTTPRepository._get_info_from_wheel() recursing into itself

  • the attempt to do it with lazy wheel fails
  • the attempt to do it without lazy wheel goes: huh, looks like ranges are supported, let's recurse (sketched below)
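A minimal, self-contained sketch of that loop. This is purely illustrative: the exception names mirror the thread, but the function names and control flow are simplified stand-ins, not Poetry's actual code.

```python
class HTTPRangeRequestUnsupported(Exception):
    pass


class HTTPRangeRequestSupported(Exception):
    pass


def fetch_metadata_lazily(url: str) -> str:
    # Lazy-wheel path: in this scenario the range request always fails.
    raise HTTPRangeRequestUnsupported(url)


def download_whole_wheel(url: str, raise_accepts_ranges: bool) -> str:
    # Fallback path: the (cached) response still advertises range support,
    # so with raise_accepts_ranges=True it bails out instead of downloading.
    if raise_accepts_ranges:
        raise HTTPRangeRequestSupported(url)
    return "metadata from the fully downloaded wheel"


def get_info_from_wheel(url: str, lazy_wheel: bool = True) -> str:
    if lazy_wheel:
        try:
            return fetch_metadata_lazily(url)
        except HTTPRangeRequestUnsupported:
            pass
    try:
        return download_whole_wheel(url, raise_accepts_ranges=lazy_wheel)
    except HTTPRangeRequestSupported:
        # "huh, looks like ranges are supported" -> try lazy wheel again,
        # which fails again, and so on without ever terminating.
        return get_info_from_wheel(url, lazy_wheel=True)


# get_info_from_wheel("https://example.com/some.whl")  # never terminates
```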

@radoering
Member

We probably just have to make sure that False is passed to raise_accepts_ranges in

```python
with self._cached_or_downloaded_file(
    link, raise_accepts_ranges=self._lazy_wheel
) as filepath:
```

if we hit this branch earlier in the same function call:

```python
except HTTPRangeRequestUnsupported:
    # Do not set to False if we already know that the domain supports
    # range requests for some URLs!
    if netloc not in self._supports_range_requests:
        self._supports_range_requests[netloc] = False
```

to break the recursion.
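Reusing the stub names from the sketch above, the proposed guard could look roughly like this; it is only an illustration of the idea, not the actual patch:

```python
def get_info_from_wheel_fixed(url: str, lazy_wheel: bool = True) -> str:
    lazy_attempt_failed = False
    if lazy_wheel:
        try:
            return fetch_metadata_lazily(url)
        except HTTPRangeRequestUnsupported:
            lazy_attempt_failed = True
    # Once the lazy attempt has already failed in this call, pass False so the
    # fallback can never raise "ranges supported" and re-enter the lazy path.
    return download_whole_wheel(
        url, raise_accepts_ranges=lazy_wheel and not lazy_attempt_failed
    )
```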

It seems like we get 304 (not modified) at some point when trying range requests in this example. I'm wondering if we can avoid this with some HTTP header?

@dimbleby

This comment was marked as outdated.

@dimbleby

This comment was marked as outdated.

@dimbleby
Contributor Author

dimbleby commented Jan 31, 2024

ok, I think the confusing server behaviour is actually a bug in cachecontrol

so this unconditionally loads the cached response, and sees a Last-Modified header. That causes cachecontrol to add an If-Modified-Since header, and that causes the server to return 304.

Then we end up using the cached response anyway, which was for the whole wheel and not for the range request at all
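A hedged way to observe the suspected behaviour outside Poetry (the URL is a placeholder and the outcome depends on the server's caching headers): send a range request through a cachecontrol-wrapped session after the full file has already been cached.

```python
import requests
from cachecontrol import CacheControl

url = "https://example.com/some-large-wheel.whl"  # placeholder URL
session = CacheControl(requests.Session())

# First request downloads the whole file; cachecontrol stores the 200 response.
full = session.get(url)
print(full.status_code, len(full.content))

# Second request only wants the last 8 KiB. With the bug described above, the
# cached entry's Last-Modified header turns this into an If-Modified-Since
# revalidation, the server replies 304, and cachecontrol hands back the cached
# *full* body instead of a 206 partial response.
partial = session.get(url, headers={"Range": "bytes=-8192"})
print(partial.status_code, len(partial.content))  # expected 206, observed 200
```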

@dimbleby
Contributor Author

dimbleby commented Jan 31, 2024

psf/cachecontrol#327, though I imagine if poetry cares to do anything about this then someone will need to contribute.

alternatively, perhaps we should find a way of avoiding cachecontrol altogether in the lazy-wheel parts; not sure if that's more or less painful

(Also I wonder whether this is what was happening all along when you saw

```python
# However, some servers that support negative range requests also return a
# 200 OK if the requested range from the end was larger than the file size.
```

perhaps it was cachecontrol confusing things the whole time?)
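For context, a "negative" range request asks for the last N bytes of a file; a hedged probe of the quoted quirk (placeholder URL) could look like this:

```python
import requests

url = "https://example.com/small-wheel.whl"  # placeholder URL
# Ask for the last 100 MB of a file that is much smaller than that: some
# servers answer 206 with the whole file, while the quoted comment describes
# servers answering 200 OK with the whole file.
resp = requests.get(url, headers={"Range": "bytes=-104857600"}, timeout=30)
print(resp.status_code, resp.headers.get("Content-Range"))
```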

@radoering
Member

I wonder whether we want to use range requests to fetch metadata even when we have to download the whole wheel anyway to calculate the hash, or whether we should just use the cached response. We would trade a huge local copy operation (cf #7916, which was reverted) for some small network requests (at least in this example). I tend to say we can use the cached response.

@dimbleby
Contributor Author

dimbleby commented Feb 1, 2024

as #7916 accidentally proved, using the cached response requires either cooperation with cachecontrol - or undesirable knowledge of how cachecontrol works. #7961 (comment)

by the time you have made that work you might almost have done the second ("hard") fix proposed on the cachecontrol issue: in principle at least, cachecontrol could respond to partial content requests if it already has the entire content

the "easy" proposed fix at cachecontrol does look easy. I do not expect to get to it for a day or few, if you - or any other - wants to offer them an MR then please go ahead. Obvs we have no control over how quickly they would then make a release, but I would be reasonably optimistic about that.

even though it turns out this particular server is not behaving with flip-flop "I support range requests, oops no I dont", breaking the recursion per #8929 (comment) is clearly desirable.

@dimbleby
Contributor Author

dimbleby commented Feb 1, 2024

actually I again wonder whether any server actually behaves in the complicated ways that the lazy-wheel implementation suggests (eg accepts range requests at some urls but not others), or whether this is all an illusion created by cachecontrol confusion

perhaps there is simplification to be made.

@Secrus
Member

Secrus commented Feb 1, 2024

@radoering do you think this should block the release until fixed or should we revert this commit and move that to the next release, fixing the issue in the meantime? I would go with the revert personally.

@dimbleby
Contributor Author

dimbleby commented Feb 1, 2024

when it works, and in the right circumstances, the lazy-wheel stuff is a massive win. eg #6409 (comment)

so it would be a shame if it didn't make it into a release soon: though if you're in a hurry to get something out then there can always be two releases with only a modest gap...

@radoering
Member

> as #7916 accidentally proved, using the cached response requires either cooperation with cachecontrol - or undesirable knowledge of how cachecontrol works.

I probably did not make it clear enough. I did not mean explicitly using the cached response (as #7916 tried and failed) but implicitly: if a link does not have a hash, we know that we will download / have downloaded the wheel anyway, so we can skip range requests and just use the download function, which will implicitly use the cached response. Of course, it will not use it directly but will do a copy operation first, so in theory it could still be slower than range requests - but that's exactly what we did before lazy-wheel. I think I will do some measurements to check if there is a significant difference.
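A compact sketch of that decision (illustrative names only, not Poetry's API):

```python
def fetch_metadata(url: str, has_hash: bool, lazy_wheel: bool) -> str:
    if lazy_wheel and has_hash:
        # Hash already known: only the metadata is needed, so a few small
        # range requests can avoid downloading a ~100 MB wheel.
        return f"metadata of {url} via range requests"
    # No hash: the whole wheel has to be downloaded anyway to compute one, so
    # skip range requests and use the normal download path, which implicitly
    # reuses the cached response on later calls.
    return f"metadata of {url} from the fully downloaded (cached) wheel"
```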

> actually I again wonder whether any server actually behaves in the complicated ways that the lazy-wheel implementation suggests (eg accepts range requests at some urls but not others), or whether this is all an illusion created by cachecontrol confusion

That's something I observed while doing the first steps with curl (so no cachecontrol involved) - so yes, it's real. I can even give some details: it's a devpi-server behind nginx (IIUC devpi-server does not support range requests on its own), specifically a PyPI mirror. The server supports range requests for wheels that have already been downloaded before, but it does not support range requests for wheels that have not been requested before.
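A hedged illustration of that probe (hypothetical mirror URL): the same range request can behave differently before and after the mirror has pulled the wheel from upstream.

```python
import requests

url = "https://devpi.example.internal/root/pypi/+f/abc/foo-1.0-py3-none-any.whl"  # placeholder
resp = requests.get(url, headers={"Range": "bytes=0-0"}, timeout=30)
# 200 OK while the mirror has not cached the file yet,
# 206 Partial Content once it has been requested (and cached) before.
print(resp.status_code)
```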

@Secrus It's not like this is the only open blocker for the release. There is still some work to do after the poetry-core release. Further, IMO this feature will be one of the highlights of the next version. (I think it's currently far more relevant than PEP 658 support for real use cases.) Thus, I don't want to revert it if I can fix it within the next days. I'm quite confident I can provide a fix (even without a cachecontrol release) to make it robust against this cachecontrol bug.

@radoering
Member

> I think I will do some measurements to check if there is a significant difference.

For this example (wheel with size of about 100 MB), the copy operation of the cached wheel seems to be a little bit faster than range requests. Copy is probably better for smaller wheels and range requests might be better for larger wheels, but I don't think it really matters here.

> If a link does not have a hash

Actually, we lose that information at the moment because we only pass the URL instead of the link. However, this will change with the refactoring done in #5509 anyway.

@dimbleby
Contributor Author

dimbleby commented Feb 2, 2024

cachecontrol 0.14 is released with the fix, so I think the only fixes that really need making before release are

  • take that update
  • the recursion-breaking (anticipating further surprises, rather than because we know of another way to trigger this)


github-actions bot commented Mar 5, 2024

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2024