Zack Galbreath edited this page Apr 7, 2023 · 2 revisions

Attendees

  • Alec Scott
  • Dan LaManna
  • Jacob Nesbitt
  • John Parent
  • Luke Peyralans
  • Ryan Krattiger
  • Scott Wittenburg
  • Tammy Grimmett
  • Todd Gamblin
  • Zack Galbreath

GitLab CI dashboards and reliability

  • We've been working on a pie chart to show which packages we spend the most time building.
  • We've put together a preliminary chart distinguishing the number of PR vs. develop jobs running at any given time.
  • We're considering migrating some of our underlying metrics data from OpenSearch to a cloned & extended copy of GitLab's postgres database. This would be kept in sync using the AWS Database Migration Service.
  • OpenSearch exhausted its shard limits for a few days this week. We are working to reingest the data we missed during this time.
  • We've developed a preliminary proof-of-concept for getting the EC2 instance type for a running builder pod. This is the first step towards a new "cost per job" metric.
  • We've begun updating spackbot to post data to OpenSearch. This will allow us to track how many jobs are triggered by comments like "@spackbot run pipeline" or "@spackbot rebuild everything".
  • We verified that our updated job pruning strategy is working as intended.
  • Luke demonstrated a new dashboard he developed that allows us to see how much time is spent on retried GitLab CI jobs. We will keep an eye on this to get a sense of how much cost savings we can expect to achieve by eliminating unnecessary retries.
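The "cost per job" metric mentioned above could work roughly as follows. On Kubernetes, a builder pod's node exposes its EC2 instance type via the standard node label `node.kubernetes.io/instance-type`; combining that with on-demand pricing and the job's duration gives a dollar estimate per job. This is a minimal sketch only: the price table values and function names are illustrative, not Spack's actual implementation.

```python
# Hypothetical "cost per job" calculation. The instance type would come from
# the builder pod's node label "node.kubernetes.io/instance-type"; the price
# table below uses illustrative on-demand values, not authoritative pricing.

HOURLY_PRICE_USD = {
    "m5.2xlarge": 0.384,  # example us-east-1 on-demand rate
    "c5.4xlarge": 0.680,  # example us-east-1 on-demand rate
}

def cost_per_job(instance_type: str, job_duration_seconds: float) -> float:
    """Estimate a job's cost from its node's instance type and runtime."""
    hourly = HOURLY_PRICE_USD[instance_type]
    return hourly * job_duration_seconds / 3600.0

# A 30-minute job on an m5.2xlarge would cost roughly $0.19
print(round(cost_per_job("m5.2xlarge", 1800), 3))
```

Summed per pipeline or per package, an estimate like this would feed naturally into the existing OpenSearch dashboards alongside the retry-cost data.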

Improving cache.spack.io

We are looking to update this service to make it more useful & less confusing. Specific improvements should include:

  • Usage instructions
  • Show file size & hash for each package
  • Allow users to browse by stack (e.g. browse E4S binaries from the 0.19 release)
  • Add links from packages.spack.io to cache.spack.io when applicable
  • Rename the "View Packages" link to something like "View Info" or "View Details"

AWS cost reduction

  • It looks like we might have unnecessary cross-AZ traffic that is contributing to our EC2-Other and S3 costs. Mike & Zack to investigate further and attempt a fix.

ParallelCluster testing

  • Our goal is to set up a small pool of runners (and a corresponding stack) using pcluster AMIs
  • We are waiting for feedback from AWS on what specific AMIs to use
  • This effort will probably also require us to upgrade gitlab.spack.io
  • We should also rethink how runner configuration is stored in the spack-infra repo. It currently requires a lot of copying & pasting of YAML to create new types of runners.
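One way to cut down the YAML duplication described above is to keep a single base runner definition and generate each variant by layering small overrides on top. The sketch below illustrates the idea in Python; the field names and image are hypothetical and do not reflect the actual spack-infra schema.

```python
# Illustrative approach to reducing copy & paste across runner definitions:
# derive each runner type from one base config plus targeted overrides.
# All field names here are hypothetical, not the real spack-infra layout.
import copy

BASE_RUNNER = {
    "image": "ghcr.io/spack/ubuntu22.04:latest",  # hypothetical image
    "tags": ["spack", "aws"],
    "resources": {"cpu": 4, "memory": "8Gi"},
}

def make_runner(**overrides):
    """Return a new runner config: BASE_RUNNER with per-key overrides applied."""
    runner = copy.deepcopy(BASE_RUNNER)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(runner.get(key), dict):
            runner[key].update(value)  # merge nested sections like "resources"
        else:
            runner[key] = value
    return runner

# A large-memory runner differs from the base in one section only:
large = make_runner(resources={"cpu": 16, "memory": "64Gi"})
print(large["resources"]["cpu"])        # 16
print(BASE_RUNNER["resources"]["cpu"])  # base stays untouched: 4
```

The same layering could be done directly in YAML with anchors/merge keys or with a templating tool; the point is that each new runner type becomes a few lines of diff rather than a full copied block.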

Priorities

  • Update the GitLab CI Failures by Error Taxonomy dashboard to be a stacked area chart per ref (develop vs. each PR branch). This change will make it easier for us to triage.
  • Continue updating cache.spack.io as described above.
  • Keep working on "costs per job" metric.
  • Keep working to reduce our AWS bill.