Zack Galbreath edited this page Apr 7, 2023 · 2 revisions

Attendees

  • Alec Scott
  • Dan LaManna
  • Jacob Nesbitt
  • John Parent
  • Luke Peyralans
  • Ryan Krattiger
  • Scott Wittenburg
  • Tammy Grimmett
  • Todd Gamblin
  • Zack Galbreath

GitLab CI dashboards and reliability

  • We've been working on a pie chart to show which packages we spend the most time building.
  • We've put together a preliminary chart distinguishing the number of PR vs. develop jobs running at any given time.
  • We're considering migrating some of our underlying metrics data from OpenSearch to a cloned & extended copy of GitLab's postgres database. This would be kept in sync using the AWS Database Migration Service.
  • OpenSearch exhausted its shard limits for a few days this week. We are working to reingest the data we missed during this time.
  • We've developed a preliminary proof-of-concept for getting the EC2 instance type for a running builder pod. This is the first step towards a new "cost per job" metric.
  • We've begun updating spackbot to post data to OpenSearch. This will allow us to track how many jobs are triggered by comments like "@spackbot run pipeline" or "@spackbot rebuild everything".
  • We verified that our updated job pruning strategy is working as intended.
  • Luke demonstrated a new dashboard he developed that allows us to see how much time is spent on retried GitLab CI jobs. We will keep an eye on this to get a sense of how much cost savings we can expect to achieve by eliminating unnecessary retries.
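The "cost per job" metric mentioned above could work roughly as follows. On Kubernetes, a builder pod's node exposes its EC2 instance type via the standard node label `node.kubernetes.io/instance-type`; combining that with on-demand pricing and the job's duration gives a dollar estimate per job. This is a minimal sketch only: the price table values and function names are illustrative, not Spack's actual implementation.

```python
# Hypothetical "cost per job" calculation. The instance type would come from
# the builder pod's node label "node.kubernetes.io/instance-type"; the price
# table below uses illustrative on-demand values, not authoritative pricing.

HOURLY_PRICE_USD = {
    "m5.2xlarge": 0.384,  # example us-east-1 on-demand rate
    "c5.4xlarge": 0.680,  # example us-east-1 on-demand rate
}

def cost_per_job(instance_type: str, job_duration_seconds: float) -> float:
    """Estimate a job's cost from its node's instance type and runtime."""
    hourly = HOURLY_PRICE_USD[instance_type]
    return hourly * job_duration_seconds / 3600.0

# A 30-minute job on an m5.2xlarge would cost roughly $0.19
print(round(cost_per_job("m5.2xlarge", 1800), 3))
```

Summed per pipeline or per package, an estimate like this would feed naturally into the existing OpenSearch dashboards alongside the retry-cost data.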

Improving cache.spack.io

We are looking to update this service to make it more useful & less confusing. Specific improvements should include:

  • Usage instructions
  • Show file size & hash for each package
  • Allow users to browse by stack (e.g. browse E4S binaries from the 0.19 release)
  • Add links from packages.spack.io to cache.spack.io when applicable
  • Rename the "View Packages" link to something like "View Info" or "View Details"

AWS cost reduction

  • It looks like we might have unnecessary cross-AZ traffic that is contributing to our EC2-Other and S3 costs. Mike & Zack to investigate further and attempt a fix.

ParallelCluster testing

  • Our goal is to set up a small pool of runners (and a corresponding stack) using pcluster AMIs
  • We are waiting for feedback from AWS on what specific AMIs to use
  • This effort will probably also require us to upgrade gitlab.spack.io
  • We should also rethink how runner configuration is stored in the spack-infra repo. It currently requires a lot of copying & pasting of YAML to create new types of runners.
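One way to cut down the YAML duplication described above is to keep a single base runner definition and generate each variant by layering small overrides on top. The sketch below illustrates the idea in Python; the field names and image are hypothetical and do not reflect the actual spack-infra schema.

```python
# Illustrative approach to reducing copy & paste across runner definitions:
# derive each runner type from one base config plus targeted overrides.
# All field names here are hypothetical, not the real spack-infra layout.
import copy

BASE_RUNNER = {
    "image": "ghcr.io/spack/ubuntu22.04:latest",  # hypothetical image
    "tags": ["spack", "aws"],
    "resources": {"cpu": 4, "memory": "8Gi"},
}

def make_runner(**overrides):
    """Return a new runner config: BASE_RUNNER with per-key overrides applied."""
    runner = copy.deepcopy(BASE_RUNNER)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(runner.get(key), dict):
            runner[key].update(value)  # merge nested sections like "resources"
        else:
            runner[key] = value
    return runner

# A large-memory runner differs from the base in one section only:
large = make_runner(resources={"cpu": 16, "memory": "64Gi"})
print(large["resources"]["cpu"])        # 16
print(BASE_RUNNER["resources"]["cpu"])  # base stays untouched: 4
```

The same layering could be done directly in YAML with anchors/merge keys or with a templating tool; the point is that each new runner type becomes a few lines of diff rather than a full copied block.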

Priorities

  • Update the GitLab CI Failures by Error Taxonomy dashboard to be a stacked area chart per ref (develop vs. each PR branch). This change will make it easier for us to triage.
  • Continue updating cache.spack.io as described above.
  • Keep working on "costs per job" metric.
  • Keep working to reduce our AWS bill.