Timeout and restart #72

Open
MOZGIII opened this issue Sep 1, 2022 · 6 comments · May be fixed by #75

MOZGIII commented Sep 1, 2022

The GitHub cache server sometimes crashes, and sometimes it gets extremely slow, serving the cache (which is about 2 GB for us) at less than 1 Mbps.

In either case the step eventually fails (with an error in the first case, and with an overall job timeout in the second).

It would be nice to extend the logic of this action with:

  • timeout on cache download / upload: throw an error if the operation takes longer than a configured amount of time
  • retry on error: an in-action retry on networking errors or on the timeout above, so that the rest of the work in the job is not lost; retries could be capped at a configured number of attempts (a rough sketch follows below)
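
Roughly, something like this minimal TypeScript sketch; the withTimeout/withRetries helpers, the example paths/key, and the timeout/attempt values are only illustrative, not part of the action today:

```ts
import * as cache from "@actions/cache";
import * as core from "@actions/core";

// Reject if `promise` does not settle within `ms`. Note: this only stops us
// from waiting; it does not cancel the underlying transfer.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Retry `fn` up to `attempts` times, applying the timeout to every attempt.
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts: number,
  timeoutMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withTimeout(fn(), timeoutMs);
    } catch (error) {
      lastError = error;
      core.warning(`cache attempt ${attempt}/${attempts} failed: ${error}`);
    }
  }
  throw lastError;
}

// Hypothetical usage; in the action, the paths, key, timeout and attempt
// count would come from its configuration.
async function restoreWithRetries(): Promise<string | undefined> {
  return withRetries(
    () => cache.restoreCache(["target/", "~/.cargo/registry/"], "example-cache-key"),
    3,
    10 * 60 * 1000, // 10 minutes per attempt
  );
}
```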

Thanks for the great action btw, it does wonders for us already!

Swatinem (Owner) commented Sep 4, 2022

I would say that any logic like this would probably need to be developed upstream, as part of the GitHub cache action / library.
I’m also unsure whether a retry would help in these cases: if something between the runner and the artifacts server is broken, I doubt a retry would fix it.

MOZGIII (Author) commented Sep 4, 2022

In practice, retrying the job manually actually works. These issues can have a variety of causes, and a good portion of them are solved by a retry (an HTTP retry will likely land on a different load balancer, might connect to a healthier upstream, etc.).


I've also considered having these features implemented upstream, but in the end, what would the implementation on their end amount to, a loop and a racing timer (much like the sketch above)? And the configuration for both would still have to be provided by the user, so it would have to pass through this action's config anyway. It might make sense to just implement it here; it would also certainly be quicker.

Ultimately, I figured it would be easier to address this here first, in case you're OK with the implementation living here, at least initially. It could later be moved out into an external reusable package, for instance.

My main rationale for starting here instead of upstream is that I figured it would be quicker to land.

MOZGIII linked a pull request Sep 4, 2022 that will close this issue
MOZGIII (Author) commented Sep 4, 2022

Figured I'd go ahead and try it; I ended up with #75.

The downsides of this implementation are:

  • We can't cancel in-flight save/restore operations, which matters for the timeouts.
    However, there are some timeouts implemented at https://github.com/actions/toolkit/blob/main/packages/cache/src/options.ts now, so we could expose those options.
    We may (or may not) keep the overall timeout as a failsafe (for when the download is not stuck but is just very slow). That would leak the outgoing connection, but it might not be that bad given that it's "stuck" anyway (see the sketch after this list).

  • The logic to tell whether an error is retriable is left out, as it would likely take some fault injection or statistics collection to determine which errors are retriable and how to detect them. I've treated all errors as retriable for now.
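
For illustration, a minimal sketch of that failsafe variant; it assumes the toolkit's DownloadOptions from the options.ts linked above (the exact field set may differ) and a hypothetical restoreWithFailsafe wrapper. Note how the race only stops us from waiting: the underlying download keeps running, which is the leaked connection mentioned above.

```ts
import * as cache from "@actions/cache";

// Overall failsafe: stop waiting after `overallTimeoutMs`, even if the
// download is not stuck but just very slow. Promise.race only stops us from
// awaiting the result; the in-flight transfer keeps going in the background.
async function restoreWithFailsafe(
  paths: string[],
  key: string,
  overallTimeoutMs: number,
): Promise<string | undefined> {
  const restore = cache.restoreCache(paths, key, [], {
    // Per-request timeout exposed by the toolkit's DownloadOptions
    // (packages/cache/src/options.ts); treat the exact field as an assumption.
    timeoutInMs: 60_000,
  });
  const failsafe = new Promise<never>((_, reject) =>
    setTimeout(
      () => reject(new Error("overall cache restore timeout")),
      overallTimeoutMs,
    ).unref(),
  );
  return Promise.race([restore, failsafe]);
}
```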


I also noticed that my editor reformats the source a bit via prettier (it does so automatically on save). I've omitted those changes from the PR, but I suggest adopting prettier in this project for code formatting consistency.

Swatinem (Owner) commented Sep 5, 2022

> I also noticed that my editor reformats the source a bit via prettier (it does so automatically on save). I've omitted those changes from the PR, but I suggest adopting prettier in this project for code formatting consistency.

I have that as well, so I’m surprised it uses different formatting?

thomaseizinger commented

I think I am running into the same problem. Here is an example:

[Screenshot from 2023-01-24 15-27-50]

Link here: https://github.com/libp2p/rust-libp2p/actions/runs/3992478158/jobs/6848975863

The restore job in the screenshot has been running for ~45 minutes.

tisonkun commented Nov 1, 2023

Ditto - https://github.com/streamnative/pulsar-rs/actions/runs/6709247198/job/18231865369

In my case the cache should be optional (best effort), so I'd like a timeout that gives up on restoring the cache and just starts building.
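
Something like this minimal sketch could do that, treating a timeout as a cache miss so the build can proceed without the cache (the restoreBestEffort helper and its values are hypothetical):

```ts
import * as cache from "@actions/cache";
import * as core from "@actions/core";

// Best-effort restore: if restoring exceeds `timeoutMs`, behave as if it were
// a cache miss so the build can start without the cache.
async function restoreBestEffort(
  paths: string[],
  key: string,
  timeoutMs: number,
): Promise<string | undefined> {
  const giveUp = new Promise<undefined>((resolve) =>
    setTimeout(() => resolve(undefined), timeoutMs).unref(),
  );
  const restoredKey = await Promise.race([cache.restoreCache(paths, key), giveUp]);
  if (restoredKey === undefined) {
    core.info("cache restore timed out or missed; building without the cache");
  }
  return restoredKey;
}
```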
