
Evaluate how to cache packages to reduce flakiness #25

Open
3 tasks
thomas-moulard opened this issue Nov 27, 2019 · 21 comments
Labels
enhancement New feature or request

Comments

@thomas-moulard
Member

thomas-moulard commented Nov 27, 2019

Description

As a GH action user, my pipelines infrequently fail due to infrastructure connectivity issues.

We should evaluate whether everything that currently requires internet connectivity can be cached (see the pip sketch after this list), including:

  • pip
  • apt
  • choco
  • ccache (?)
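For the pip case, here is a minimal sketch using actions/cache; the cache path, action versions, and hashFiles key are assumptions to adapt per project:

```yaml
# Hedged sketch: persist pip's download cache across runs with actions/cache.
steps:
  - uses: actions/checkout@v2
  - uses: actions/cache@v1
    with:
      path: ~/.cache/pip            # pip's default cache dir on Linux
      key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
      restore-keys: |
        ${{ runner.os }}-pip-
  - run: pip install -r requirements.txt
```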

Test Plan

  • FIXME

Documentation Plan

  • README.md update

Release Plan

  • Release the impacted packages again, and update the workflows as needed.

Acceptance Criteria

  • Code has been implemented, reviewed, and merged.
  • Test plan has been completed.
  • Release plan has been completed.

Once all of the above items are checked, this story can be moved to done.

thomas-moulard added the enhancement label on Nov 27, 2019
@j-rivero

During my tests I was hit by several problems related to the stability of this GitHub action: apt mirror syncs, rosdep network problems, etc. I agree with the approach of caching packages as much as we can. Another measure could be to implement a retry mechanism, to ride out problems that disappear within seconds (see the sketch at the end of this comment).

I think the use of Docker proposed in ros-tooling/action-ros-ci#67 could also fall under this same category of reducing flakiness. I'm available to help with implementations. I'm just not sure which priorities we have for the different solutions.
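A minimal sketch of such a retry step, assuming plain apt-get inside a workflow run step; the command, retry count, and delay are placeholders:

```yaml
# Hedged sketch: retry a flaky network call a few times before failing the step.
- name: apt-get update with retries
  run: |
    ok=0
    for attempt in 1 2 3; do
      if sudo apt-get update; then ok=1; break; fi
      echo "apt-get update failed (attempt ${attempt}), retrying in 10s..."
      sleep 10
    done
    [ "${ok}" -eq 1 ]
```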

@piraka9011

I think the use of Docker proposed in ros-tooling/action-ros-ci#67 could also fall under this same category of reducing flakiness. I'm available to help with implementations.

We have an open issue in our backlog to tackle this: ros-security/aws-roadmap#173. We are always open to reviewing contributions!
One way to go about this is to specify a Dockerfile which the action uses (see the GH Actions docs).
It could be as simple as including a Dockerfile that is built from the stable ubuntu-18.04 repo and referencing it in the action.yml file for action-ros-ci.
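A hedged sketch of what that could look like, following the GH Actions container-action docs; nothing here is the actual action-ros-ci layout:

```yaml
# action.yml -- hedged sketch of a Docker container action.
# The referenced Dockerfile would start FROM ubuntu:18.04 and pre-install
# the apt/pip dependencies the action currently fetches at run time.
name: 'action-ros-ci (containerized sketch)'
description: 'Build and test ROS 2 packages inside a prebuilt container'
runs:
  using: 'docker'
  image: 'Dockerfile'
```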

@Arnatious

It could be as simple as including a Dockerfile that is built from the stable ubuntu-18.04 repo and referencing it in the action.yml file for action-ros-ci.

Once it's stable, or even now since we're essentially using nightly builds anyway, we should use the official ROS Docker image: https://hub.docker.com/r/osrf/ros2/
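For reference, a hedged sketch of running a job inside that image via the job-level container: key; the tag and the build step are assumptions:

```yaml
# Hedged sketch: run every step of the job inside the OSRF image instead of
# directly on the hosted Ubuntu runner.
jobs:
  build:
    runs-on: ubuntu-latest
    container: osrf/ros2:nightly   # tag is an assumption; pick whatever tag fits
    steps:
      - uses: actions/checkout@v2
      - run: colcon build           # illustrative build step
```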

@emersonknapp
Contributor

we should use the official ROS Docker image

I'm not sure this is possible today, since in practice we're using this action to target multiple OSes (Ubuntu, Debian) and architectures (x64, arm32, arm64), which those images don't provide (as far as I know). Perhaps it could special-case those images for Ubuntu, though.

On a different note, regarding the "nightly" part: I wonder whether, during times of heavy development, the up-to-23-hours-out-of-date snapshot you'd be working against would introduce confusing issues for users.

@piraka9011

Linking this since it may be relevant: @actions/tool-cache.

@rotu
Contributor

rotu commented Mar 31, 2020

Here's what I've come up with so far: RoverRobotics-forks/rmw_cyclonedds@b6d7de8
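The general shape (a hedged sketch of the same idea, not the commit itself) is to save and restore the package managers' own cache folders and let them invalidate individual entries, e.g. for Homebrew; the path and key scheme are assumptions:

```yaml
# Hedged sketch: persist Homebrew's download cache between runs.
- name: Cache Homebrew
  uses: actions/cache@v1
  with:
    path: ~/Library/Caches/Homebrew
    key: ${{ runner.os }}-homebrew-${{ github.sha }}
    restore-keys: |
      ${{ runner.os }}-homebrew-
```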

@thomas-moulard
Member Author

Unfortunately, this will be very flaky, because the cache is never invalidated unless the key changes. In other words, it will keep reusing whatever was cached once, forever, even if the APT/PIP/... content changes. You'd need a way to reliably hash the APT content, which will be... challenging to write.

I've been playing with a simpler alternative: running the CI on a prebuilt Docker image: https://github.com/ros-tooling/setup-ros-docker

This removes the need for APT/PIP/... on Linux, at least. It's not ready for prime time, but the PoC works.

@rotu
Contributor

rotu commented Mar 31, 2020

Why is that different from the local cache folder? My intent is that it reloads the cache, then apt/pip/brew/etc. does its own cache invalidation and updates the cache folder as it normally would. At the end of the build, the modified cache gets saved off again.

It does seem there are some permission bugs to sort out though.

Edit: the bug I'm seeing seems to be actions/cache#144
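If the permission problem turns out to be the post-job save choking on root-owned files, one hedged workaround (the path is a placeholder) is to hand ownership back to the runner user before the job ends:

```yaml
# Hedged workaround sketch: make the cached path user-writable so the
# post-job "Post Cache" step can tar it up.
- name: Fix cache dir ownership
  if: always()   # run even if earlier steps failed; post steps run afterwards
  run: sudo chown -R "$(whoami)" /var/cache/apt/archives
```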

@thomas-moulard
Member Author

thomas-moulard commented Mar 31, 2020

In this case you're paying the cost of the cache mechanism on top of the previously existing mechanism, so it'd be interesting to actually run a benchmark and see whether we shave off any time by doing things like this.

@rotu
Contributor

rotu commented Mar 31, 2020

@thomas-moulard From a performance standpoint, saving and loading the entire cache folders did not pay off. Surprisingly to me, the runtime gains on downloading packages were virtually nil, while the runtime losses from caching were significant.

cache: https://github.com/RoverRobotics-forks/rmw_cyclonedds/actions/runs/67620206
no cache: https://github.com/RoverRobotics-forks/rmw_cyclonedds/actions/runs/67621120

@thomas-moulard
Member Author

Not sure I'm following here: the cached version takes about 1 h, while the one w/o cache seems significantly faster, ~30 min?

@rotu
Contributor

rotu commented Mar 31, 2020

Yes. But that's the "workflow execution time", which is not really a meaningful performance number - it's longer than the longest build but shorter than the total build time. When not caching, the Mac builds alone totaled more than the entire workflow time.

Dig into each job, specifically how long the "Acquire ROS dependencies" (i.e. setup-ros) step and the Cache X / Post Cache X steps take.

@thomas-moulard
Member Author

thomas-moulard commented Mar 31, 2020

Sorry, I still don't see it. I took two OS X runs, and they still don't compare favorably:

Seems like we shave off 1 minute by caching, but the Homebrew cache step adds 3min14s?

edit: maybe I am not associating the runs correctly?

Let's look at the sum of all OS X builds to rule out this potential issue:
w/ caching: 15min + 11min + 11min = 37min
w/o caching: 8min + 8min + 16min = 32min

It's still shorter w/o caching. Maybe the builds just take a varying amount of time? Did you maybe have better luck in a previous run?

@rotu
Contributor

rotu commented Mar 31, 2020

My takeaway is that caching in the way I did it isn't worth it. I don't know what you mean by "I still don't see it" but I don't think it's worth looking into much.


In your first link, restoring the cache took 3m34s. If you expand the step Cache Homebrew:

Cache Size: ~3131 MB (3283395838 B)
/usr/bin/tar -xz -f /Users/runner/runners/2.165.2/work/_temp/c3c95ea6-053a-4b6d-b2bc-8222aff56fa1/cache.tgz -C /Users/runner/Library/Caches/Homebrew
Cache restored from key: macOS-latest-homebrew-67613993

And then at the end (Post Cache Homebrew) it didn't save off the cache again, so that took no time.

Runtime variability can be high! For example, take these two runs, both without caching and both starting brew install within 15 seconds of each other.

Here, ros setup took 4m34s = 274s with brew install taking 229s (computed by looking at the timestamps)
https://github.com/RoverRobotics-forks/rmw_cyclonedds/runs/549690337?check_suite_focus=true

Here, ros setup took 11m54s = 714s with brew install taking 633s
https://github.com/RoverRobotics-forks/rmw_cyclonedds/runs/549690345?check_suite_focus=true

@thomas-moulard
Member Author

I see, thanks for clarifying!

@rotu
Contributor

rotu commented Apr 1, 2020

Performance aside, it seems all the flakiness I'm seeing comes from azure.archive.ubuntu.com/ubuntu. That URL fails for both apt-get update and apt-get upgrade. Adding another mirror may be a good idea: https://github.com/ros2/rmw_cyclonedds/runs/551468958?check_suite_focus=true
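One hedged way to try that on the hosted runners; the sed expression assumes the default sources.list layout, and archive.ubuntu.com is just an example replacement:

```yaml
# Hedged sketch: point apt at the main Ubuntu archive instead of the flaky
# azure mirror before doing anything network-heavy.
- name: Switch apt mirror
  run: |
    sudo sed -i 's|http://azure.archive.ubuntu.com/ubuntu|http://archive.ubuntu.com/ubuntu|g' /etc/apt/sources.list
    sudo apt-get update
```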

@thomas-moulard
Member Author

Generally, we're moving away from running those builds on bare metal. Running them in Docker containers on GH Actions would mitigate those issues, as the Debian/Canonical infrastructure is not as flaky. It also prevents updates of the host image from breaking the builds.

@rotu
Contributor

rotu commented Apr 1, 2020

Running them in Docker containers on GH Actions would mitigate those issues

Okay. Are these Docker containers ready to use today, or do you mean hypothetically? It would be nice to have them mentioned in the README.

@thomas-moulard
Member Author

It's WIP, but I have a PoC here; it's not really ready for prime time yet:
https://github.com/ros-tooling/setup-ros-docker

Also, OSRF has official Docker images; we've seen other users use them successfully with the action.
