Zack Galbreath edited this page Mar 31, 2023 · 1 revision

Attendees

  • Aashish Chaudhary
  • Alec Scott
  • Jacob Nesbitt
  • John Parent
  • Luke Peyralans
  • Massimiliano Culpo
  • Mike VanDenburgh
  • Ryan Krattiger
  • Scott Wittenburg
  • Todd Gamblin
  • Zack Galbreath

GitLab CI dashboards & reliability

  • The general sentiment is that things are getting better. Our error rates are going down!
  • You can now click through from the Grafana error taxonomy -> matching OpenSearch records -> GitLab CI job output.
  • The User Impact dashboard vanished. We are working on restoring it & making sure this doesn't happen again by storing our dashboard definitions in RDS.

AWS cost reduction

  • Mike achieved nontrivial savings by finding & terminating some unused EC2 instances
  • Otherwise, investigation continues into our NatGateway-Bytes and S3 transfer costs.

Node utilization metrics

  • The utilization panel on the Pipeline Overview dashboard is a good start.
  • For the "pod count per CI ref" panel, we should sum up all the PR builds so we can more easily compare PR builds vs. develop builds.
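
The PR-vs-develop comparison above amounts to collapsing per-ref pod counts into a few groups before plotting. A minimal Python sketch follows. It assumes, hypothetically, that PR refs can be recognized by a "pr" prefix and that the input is a plain mapping of ref name to pod count; the real ref naming and data source may differ.

```python
from collections import Counter


def ref_group(ref: str) -> str:
    """Map a CI ref to a coarse group.

    Assumption (hypothetical): PR refs start with "pr"; adjust to the
    actual naming scheme used by the CI system.
    """
    if ref == "develop":
        return "develop"
    if ref.startswith("pr"):
        return "PR"
    return "other"


def pods_by_group(pod_counts: dict[str, int]) -> dict[str, int]:
    """Sum pod counts per group so PR builds appear as a single series."""
    totals: Counter[str] = Counter()
    for ref, count in pod_counts.items():
        totals[ref_group(ref)] += count
    return dict(totals)
```

With this grouping, all PR refs contribute to one "PR" total that can be placed next to the "develop" total on the same panel.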

Costs per job

  • Once we have this metric we'll be able to generate useful aggregate metrics like cost per package, cost per stack, cost per PR, etc.
  • We need to know the EC2 instance type for jobs performed by our cloud runners. We should be able to use an approach similar to spack-infra PR #427 to get this info.
  • Once we have the instance type, we can get its spot price using aws ec2 describe-spot-price-history
  • We should also weight the cost of the job by the fraction of the system resources it uses (parsed from "CPURequest" and "MemoryRequest" in the job log output).
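
The weighting described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the agreed implementation: the class and field names are hypothetical, the spot price is assumed to have been looked up already (for example with aws ec2 describe-spot-price-history), and the choice to charge a job for the larger of its CPU and memory fractions is one possible weighting among several.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    """Resources a job asked for (hypothetical structure).

    The values would come from "CPURequest" and "MemoryRequest" in the
    job log; parsing is omitted because the log format is not shown here.
    """
    cpu_request: float          # cores
    memory_request_gib: float   # GiB
    duration_seconds: float


@dataclass
class InstanceSpec:
    """Capacity and hourly spot price of the EC2 instance type (hypothetical)."""
    vcpus: float
    memory_gib: float
    spot_price_per_hour: float  # USD


def job_cost(job: JobUsage, instance: InstanceSpec) -> float:
    """Estimate the cost of one job in USD.

    Assumption: the job is charged for the larger of its CPU and memory
    fractions of the instance, capped at the whole instance.
    """
    cpu_fraction = job.cpu_request / instance.vcpus
    mem_fraction = job.memory_request_gib / instance.memory_gib
    fraction = min(1.0, max(cpu_fraction, mem_fraction))
    hours = job.duration_seconds / 3600.0
    return instance.spot_price_per_hour * hours * fraction
```

Per-job estimates of this kind could then be summed by package, stack, or PR to produce the aggregate metrics mentioned above.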

Top Priorities

  • Restore "User Impact" dashboard and make sure user-defined dashboards persist in RDS.
  • Update the front page of grafana.spack.io to list our most important dashboards.
  • Deploy a testing pool of runners on AWS ParallelCluster.
  • Revamp cache.spack.io.
  • Keep working to reduce our AWS costs. Experiment with removing our fallback to on-demand instances for our GitLab CI runners.
  • Work on implementing the cost-per-job metric.
  • Continue work on error categorization and reduction. Partition dashboards to distinguish between failures on develop vs. failures in PRs.