
Add --retry and --retry-timeout #1017

Open

ashwoods opened this issue Oct 5, 2023 · 4 comments


ashwoods commented Oct 5, 2023

The Issue

We are using GitHub Actions and uploads have been failing fairly consistently. We have already reported this to PyPI, but we would like to ask if you'd consider a PR adding --retry and --retry-timeout, similar to curl's options. Thanks!

@sigmavirus24 (Member)

Thoughts on --retry

I'm vaguely -0 on --retry. The only reason I'm considering it, however, is that I know the maintained pypi-publish action is widely used and isn't something people typically instrument themselves.

I would otherwise suggest something like:

for counter in $(seq 1 "$RETRY_MAX"); do
  twine upload --skip-existing dist/*
  exit_code=$?
  if (( exit_code == 0 )); then
    break
  fi
  echo "Twine upload failed... retrying $counter of $RETRY_MAX"
done

Further, what classes of errors are worthy of a retry? Intermittent network errors can appear to be a number of things (connection reset by peer, EOF occurred in violation of protocol, etc.) and then there are other things that can look like intermittent network errors that aren't. Distinguishing them can be incredibly difficult and you may end up lumping in things we probably shouldn't bother retrying with things we ostensibly should. And that will cause other problems for folks.
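
To make that ambiguity concrete, here is a naive sketch (not anything Twine does today) of how one might try to classify requests exceptions as retryable; the grouping itself is exactly the judgement call being described:

import requests

# Illustration only -- not Twine code. The exception types are real
# `requests` exceptions, but the grouping is a judgement call: a
# ConnectionError covers both a transient "connection reset by peer" and a
# DNS failure that no amount of retrying will fix, and an SSLError covers
# both "EOF occurred in violation of protocol" (plausibly transient) and a
# genuine certificate problem (definitely not).
def looks_retryable(exc: Exception) -> bool:
    # SSLError subclasses ConnectionError, so it must be checked first.
    if isinstance(exc, requests.exceptions.SSLError):
        # Conservative: give up, even though some SSL errors are transient.
        return False
    return isinstance(
        exc, (requests.exceptions.ConnectionError, requests.exceptions.Timeout)
    )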

Finally, when do retries start counting and do they reset? If we have 5 artifacts we're uploading, and 3 of those need to retry, does each get a max of --retry or do all 5 have a max number of retries set to --retry? We can decide that, but no matter which we pick, someone's going to expect something different and it's going to lead to stubbed toes and invalid bug reports.

The question on --retry is whether it will benefit more than 95% of our current users. So far, it seems you and your colleagues are a very vocal minority. To me, that seems like a "no for now, we can re-evaluate later", but I don't think we have sufficient data to say "no, we'll never do that".

Finally, pip's --retries option works per package it attempts to download/install, but I've seen people get endlessly frustrated with that too: it means pip can take on the order of num_packages * num_retries attempts (and a correspondingly long time) to complete, which is far longer than they otherwise expected.


Thoughts on --retry-timeout

I'm a strong -1 on --retry-timeout because I believe the option you're thinking of in curl is --retry-max-time. A --retry-timeout would require re-architecting how Twine works today, which would be a significant amount of work for diminishing returns and would likely increase the number of bugs and the instability. To implement something like curl's --retry-max-time, we'd need to spin the work off into a separate thread that we can monitor and kill after the allotted amount of time. This would work fine given that everything in Twine is IO bound, so threading wouldn't be affected by the GIL. That said, you now have to consider the semantics of what that option means for someone doing twine upload dist/*.
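
For concreteness, a minimal sketch of that threading idea, under assumptions: upload_one is a hypothetical stand-in for a single-artifact upload (not a Twine function), and max_time is applied per artifact, which is itself one of the semantic choices discussed below:

import concurrent.futures

def upload_with_max_time(upload_one, artifacts, max_time):
    # Run each upload in a worker thread and bound how long we wait for it.
    # Twine's work is IO bound, so the GIL is not the limiting factor here.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        for artifact in artifacts:
            future = pool.submit(upload_one, artifact)
            # Raises concurrent.futures.TimeoutError once max_time elapses.
            # The worker thread itself cannot be killed, though: it keeps
            # running in the background, which is part of the extra
            # complexity and instability described above.
            future.result(timeout=max_time)
    finally:
        pool.shutdown(wait=False, cancel_futures=True)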

I say that because, let's say you have 5 artifacts in dist to upload (like the example above). If you upload the first 3, and start retrying on the 4th, when does your timer start on retries? Do you start it when you start retrying? Do you stop the timer once successful? What happens on the 5th if it also needs to start retrying? Do you restart the timer?

In curl, it's an easy semantic because one curl command maps (more or less) to a single HTTP request (ignoring redirects, because by default it will not follow them). Twine has variable behaviour. I'd argue that no matter what semantic we pick, it will not be intuitive for a large percentage of users.

Finally, there are shell ways of doing this as above: http://www.bashcookbook.com/bashinfo/source/bash-4.0/examples/scripts/timeout3
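
For folks scripting a release in Python rather than bash, a rough equivalent is to run twine in a subprocess with a hard timeout (the 300-second ceiling here is an arbitrary illustration, not a recommendation):

import glob
import subprocess

# Bound the total wall-clock time of the upload from outside Twine.
# subprocess.run() kills the child and raises TimeoutExpired if the timeout
# elapses; glob expands dist/* because there is no shell doing it for us.
subprocess.run(
    ["twine", "upload", "--skip-existing", *glob.glob("dist/*")],
    check=True,
    timeout=300,
)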


Summary

I'm (more or less) opposed to implementing this strictly in twine (but open to @bhrutledge convincing me otherwise) because:

  • I don't think we have enough data to warrant implementing this
  • I don't believe the benefits will outweigh the risks and added complexity
  • I don't believe there is one clear semantic for these options that will make sense to most people, regardless of how well we document it.
  • Implementing one semantic for either (or both) of these options is going to lead to someone wanting the other semantic, and a fair amount of strife
  • There are ways to implement this outside of twine today (see references above), but those don't work for folks who aren't using twine directly and are instead using a GitHub action from someone else.

@asottile-sentry

fwiw there are already retries in upload:

twine/twine/repository.py

Lines 190 to 196 in afc37f8

if 500 <= resp.status_code < 600:
    number_of_redirects += 1
    logger.warning(
        f'Received "{resp.status_code}: {resp.reason}"'
        "\nPackage upload appears to have failed."
        f" Retry {number_of_redirects} of {max_redirects}."
    )

I think the actual ask from @ashwoods would be to also retry on a few other protocol-level errors.
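
Roughly what that might look like (illustration only; the real loop lives in twine/repository.py and is structured differently -- post_once stands in for whatever performs the single POST, and logger / max_redirects mirror the names in the snippet above):

import requests

# Protocol-level exceptions that would be treated like a 5xx response.
RETRYABLE = (
    requests.exceptions.ConnectionError,       # e.g. connection reset by peer
    requests.exceptions.ChunkedEncodingError,  # response cut off mid-stream
)

def upload_with_retries(post_once, logger, max_redirects):
    for attempt in range(1, max_redirects + 1):
        try:
            resp = post_once()
        except RETRYABLE as exc:
            logger.warning(f"Upload raised {exc!r}. Retry {attempt} of {max_redirects}.")
            continue
        if not (500 <= resp.status_code < 600):
            return resp
        logger.warning(
            f'Received "{resp.status_code}: {resp.reason}". '
            f"Retry {attempt} of {max_redirects}."
        )
    raise RuntimeError("Upload did not succeed within the retry budget.")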

@bhrutledge (Contributor)

If this is a matter of tweaking the existing retry logic as @asottile-sentry suggests, I'm game to review a PR, but otherwise, I agree with @sigmavirus24's summary.

@sigmavirus24 (Member)

> If this is a matter of tweaking the existing retry logic as @asottile-sentry suggests, I'm game to review a PR

Hard to say because that's not at all what was asked for.
