
Feature Request: Automated Recovery #719

Open
navarr opened this issue Nov 4, 2022 · 14 comments

Comments

@navarr

navarr commented Nov 4, 2022

I have a large GitHub Actions workflow that pushes over 600 images to the GitHub Container Registry.

This mostly works fine, except that I have to set max-parallel based on how many images I expect to be building at a time, and even then I sometimes hit APIs too fast or run into a rare error.

For example:

buildx failed with: ERROR: failed to solve: failed to compute cache key: failed to copy: httpReadSeeker: failed open: unexpected status code https://ghcr.io/v2/swiftotter/den-php-fpm/blobs/sha256:456f646c7993e2a08fbdcbb09c191d153118d0c8675e1a0b29f83895c425105f: 500 Internal Server Error - Server message: unknown

or

buildx failed with: ERROR: failed to solve: failed to compute cache key: failed to copy: read tcp 172.17.0.2:59588->185.199.111.154:443: read: connection timed out

or

buildx failed with: ERROR: failed to solve: failed to do request: Head "https://ghcr.io/v2/swiftotter/den-php-fpm-debug/blobs/sha256:d6b642fadba654351d3fc430b0b9a51f7044351daaed3d27055b19044d29ec66": dial tcp: lookup ghcr.io on 168.63.129.16:53: read udp 172.17.0.2:40862->168.63.129.16:53: i/o timeout

These are all transient errors that disappear the moment I re-run the job. What I wish for is that in such cases - timeouts, server errors, or too-many-requests errors - some sort of automated backoff-and-retry system kicked in, with configurable limits.
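
In the meantime, a rough workaround sketch is to bypass the action and drive buildx directly from a run step, retrying at the shell level with exponential backoff. The image name, attempt count, and delays below are just placeholders, and this assumes re-running the build is cheap (e.g. fully cached):

- name: Build and push with retry
  run: |
    # Retry `docker buildx build --push` up to 3 times with exponential backoff.
    delay=15
    for attempt in 1 2 3; do
      if docker buildx build --push -t ghcr.io/example-org/example-image:latest .; then
        exit 0
      fi
      echo "Attempt ${attempt} failed" >&2
      sleep "${delay}"
      delay=$(( delay * 2 ))
    done
    echo "All push attempts failed" >&2
    exit 1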

@Tenzer

Tenzer commented Nov 16, 2022

We are occasionally seeing errors like these as well. For the cache-related ones, I guess another option could be to continue despite the failure, as writing to the cache probably doesn't constitute a critical failure that would require the entire workflow to fail.
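
For the cache side specifically, newer BuildKit versions appear to support an ignore-error attribute on cache exporters, which could be passed through the action's cache-to input so a failed cache write doesn't fail the build. A minimal sketch; the registry refs and action version tag are placeholders:

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: ghcr.io/example-org/example-image:latest
    cache-from: type=registry,ref=ghcr.io/example-org/example-image:buildcache
    # ignore-error=true asks BuildKit to treat a failed cache export as non-fatal,
    # assuming the BuildKit version in use supports this attribute.
    cache-to: type=registry,ref=ghcr.io/example-org/example-image:buildcache,mode=max,ignore-error=true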

@nick-at-work

We run into this somewhat often. Is a retry/retries option feasible for pushing images?

@ddelange

some additional info: https://github.com/ddelange/pycuda/actions/runs/3972373867/jobs/6830922090

#22 DONE 31.0s

#23 exporting to image
#23 exporting layers
#23 exporting layers 9.9s done
#23 exporting manifest sha256:5cc09704d37dcab52f35d0dc1163acdae52fbb9a265cbb1fe9d55625a61307e9 done
#23 exporting config sha256:50aea776612d2a2916b237afb2e1e59c96b8134105791f65029ac253797dc840 done
#23 exporting attestation manifest sha256:c76e2b4622ef918f2ecc26fd4710f0a68f3f9b57f744ba23ff8130a21b5b3a7d done
#23 exporting manifest sha256:b82c6873647b10e6c1d13f754a225756d879380d6d51983250c0167b1af84874 done
#23 exporting config sha256:af2a1fee215b05d4d8b7893bc20a305a7a257fed7cb59c1ab21715286c89d08f done
#23 exporting attestation manifest sha256:52a9106f51393e6435cbed771d7b4e288c820d84e745c681dbdcc3d3a72bc67d done
#23 ...

#24 [auth] ddelange/pycuda/jupyter:pull,push token for ghcr.io
#24 DONE 0.0s

#23 exporting to image
#23 exporting manifest list sha256:dd095d62b30f27dd9ee27b81a0eabd77ab15387dc44f6833686ff20a005452a2 done
#23 pushing layers
#23 pushing layers 2.0s done
#23 ERROR: failed to push ghcr.io/ddelange/pycuda/jupyter:3.9-master: failed to copy: io: read/write on closed pipe

These images are each around 2 GB or larger, so a retry might error again if there's no 'resume', i.e. a mechanism where layers that were already pushed successfully don't need to be pushed again.

@kkom

kkom commented Jun 14, 2023

I would like to +1 this! For us, this happens maybe once a day just during continuous deployment (so not counting PRs).

The errors we're seeing look like transient infrastructure errors:

buildx failed with: ERROR: failed to solve: failed to push ghcr.io/<our_org_name>/<our_repo_name>/<our_image_name>:2023.06.14-1605-f0c78f6: failed to copy: failed to do request: Put "https://ghcr.io/v2/<our_org_name>/<our_repo_name>/<our_image_name>/blobs/upload/9508e842-68bb-4779-9f40-6d8cf25357ff?digest=sha256%3A7cb00f153a2766267a4fbe7b14f830de29010a56c96486af21b7b9bf3c8838f0": write tcp 172.17.0.2:35594->140.82.114.33:443: write: broken pipe

Having an internal option to retry the step would be fantastic. We don't want to retry the whole job, as that could mean re-running things that should not be retried, like the test suite.
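
As a stopgap with plain workflow syntax, the build step can be marked continue-on-error and re-run once behind an outcome check, so only the build/push is retried rather than the whole job. A sketch; the step id, tags, and action version are placeholders:

- name: Build and push
  id: build
  continue-on-error: true
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: ghcr.io/example-org/example-image:latest

- name: Build and push (retry on failure)
  # Runs only if the first attempt failed; continue-on-error above keeps the
  # job alive so this conditional second attempt gets a chance to run.
  if: steps.build.outcome == 'failure'
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: ghcr.io/example-org/example-image:latest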

@patrick-stephens

We see this too, both with simple container image promotion for Fluent Bit releases - when pushing the 3 supported architectures to ghcr.io in parallel, at least one of them usually fails - and when building the multi-arch images, which is a huge time sink: the build takes a long time under QEMU and then just fails at the push, so we have to restart the whole lot again.

@dinvlad

dinvlad commented Jul 25, 2023

I'm seeing this very often now, especially with parallel builds.
This is without any flakiness reported on the GH side: https://www.githubstatus.com

I have tried pinning BuildKit to v0.10.6 and v0.12.0, but that didn't seem to help much:
#761 (comment)

It would be good to have more resilient retries (with the understanding that 100% reliability is obviously not achievable).
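
For reference, pinning BuildKit as mentioned above can be done through docker/setup-buildx-action by selecting a specific moby/buildkit image; the action and BuildKit version tags below are illustrative:

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
  with:
    # Pin the BuildKit daemon image used by the docker-container driver.
    driver-opts: image=moby/buildkit:v0.12.0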

@mfridman

This comment was marked as off-topic.

@crazy-max
Member

crazy-max commented Oct 18, 2023

But I'm not convinced this would fix the reported issue.

Does it not work on your side?

Edit: Oh, I think you mean when fetching the cache, right?

@mfridman

This comment was marked as off-topic.

@mfridman

I marked my comments above as off-topic since they related to an isolated incident with GCP Artifact Registry.

However, I'd still +1 this feature request. During that incident, retries would have been extremely useful.

@Santas

Santas commented Nov 10, 2023

Seeing this regularly.

#12 [backend 5/7] COPY Website .
#12 ERROR: failed to copy: read tcp 172.17.0.2:59196->20.209.147.161:443: read: connection timed out
------
...
--------------------
ERROR: failed to solve: failed to compute cache key: failed to copy: read tcp 172.17.0.2:59196->20.209.147.161:443: read: connection timed out

@tonynajjar

+1 on this feature, I get push-related errors around twice a day and retrying usually fixes it:

buildx failed with: ERROR: failed to solve: failed to push europe-west3-docker.pkg.dev/my-project/my-project/my-project:main: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://europe-west3-docker.pkg.dev/v2/token?scope=repository%3Amy-project%2Fmy-project%my-project%3Apull%2Cpush&service=europe-west3-docker.pkg.dev: 401 Unauthorized

@richaarora01

+1 on this feature, I get push-related errors around twice a day and retrying usually fixes it:

buildx failed with: ERROR: failed to solve: failed to push europe-west3-docker.pkg.dev/my-project/my-project/my-project:main: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://europe-west3-docker.pkg.dev/v2/token?scope=repository%3Amy-project%2Fmy-project%my-project%3Apull%2Cpush&service=europe-west3-docker.pkg.dev: 401 Unauthorized

@DhanshreeA

Is this feature under development, or is it still being considered? 👀
