CI: 2023 03 31
Zack Galbreath edited this page Mar 31, 2023
- Aashish Chaudhary
- Alec Scott
- Jacob Nesbitt
- John Parent
- Luke Peyralans
- Massimiliano Culpo
- Mike VanDenburgh
- Ryan Krattiger
- Scott Wittenburg
- Todd Gamblin
- Zack Galbreath
- The general sentiment is that things are getting better. Our error rates are going down!
- You can now click through from the Grafana error taxonomy to matching OpenSearch records to the corresponding GitLab CI job output
- The User Impact dashboard vanished. We are working on restoring it & making sure this doesn't happen again by storing our dashboard definitions in RDS.
- Mike achieved nontrivial savings by finding & terminating some unused EC2 instances
- Otherwise, investigation continues into our NatGateway-Bytes and S3 data transfer costs
- The utilization panel on the Pipeline Overview dashboard is a good start.
- For the "pod count per CI ref" panel, we should sum up all the PR builds so we can more easily compare PR builds vs. develop builds.
- Once we have this metric we'll be able to generate useful aggregate metrics like cost per package, cost per stack, cost per PR, etc.
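The roll-up described above could be sketched as follows. This is a minimal illustration, not the actual dashboard query; the ref-naming convention (PR pipelines on refs beginning with `pr`, the develop pipeline on the `develop` ref) and the sample counts are assumptions for the example.

```python
# Sketch: collapse per-ref pod counts into "pr" vs "develop" buckets so
# all PR builds can be summed and compared against develop builds.
# Ref names and counts below are hypothetical placeholders.

def summarize_pod_counts(pod_counts_by_ref):
    """Sum pod counts into 'pr', 'develop', and 'other' buckets.

    Assumes PR pipelines run on refs that start with 'pr', while the
    develop pipeline runs on the 'develop' ref.
    """
    totals = {"pr": 0, "develop": 0, "other": 0}
    for ref, count in pod_counts_by_ref.items():
        if ref == "develop":
            totals["develop"] += count
        elif ref.startswith("pr"):
            totals["pr"] += count
        else:
            totals["other"] += count
    return totals

counts = {
    "pr36001_fix-openmpi": 12,   # hypothetical PR ref
    "pr36002_add-package": 7,    # hypothetical PR ref
    "develop": 40,
}
print(summarize_pod_counts(counts))  # {'pr': 19, 'develop': 40, 'other': 0}
```

With the PR refs summed into a single series, a panel can plot two comparable lines instead of one line per PR.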
- We need to know the EC2 instance type for jobs performed by our cloud runners. We should be able to use an approach similar to spack-infra PR #427 to get this info.
- Once we have the instance type, we can get its spot price using `aws ec2 describe-spot-price-history`
- We should also weight the cost of the job by the fraction of the system resources it is using (parsed from "CPURequest" and "MemoryRequest" in the job log output).
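A minimal sketch of the weighting just described, assuming the job log contains lines like `CPURequest: 4` and `MemoryRequest: 16G` (a hypothetical log format), and that the node's capacity and spot price are supplied by the caller; in practice the spot price would come from `aws ec2 describe-spot-price-history`:

```python
import re

def parse_requests(log_text):
    """Extract CPURequest / MemoryRequest values from job log output.

    Assumes lines like 'CPURequest: 4' and 'MemoryRequest: 16G'
    (hypothetical format; adjust the patterns to the real logs).
    """
    cpu = re.search(r"CPURequest:\s*([\d.]+)", log_text)
    mem = re.search(r"MemoryRequest:\s*([\d.]+)G", log_text)
    return float(cpu.group(1)), float(mem.group(1))

def weighted_job_cost(log_text, duration_hours, spot_price_per_hour,
                      node_cpus, node_mem_gb):
    """Charge the job for the fraction of the node it requested.

    Uses the larger of the CPU and memory fractions, since the
    scarcer resource is what limits how many jobs pack onto a node.
    """
    cpu_req, mem_gb_req = parse_requests(log_text)
    fraction = max(cpu_req / node_cpus, mem_gb_req / node_mem_gb)
    return spot_price_per_hour * duration_hours * fraction

log = "CPURequest: 4\nMemoryRequest: 16G\n"
# On a 16-vCPU / 64 GB node at $0.50/hr for a 2-hour job:
# fraction = max(4/16, 16/64) = 0.25 -> cost = 0.50 * 2 * 0.25
print(weighted_job_cost(log, 2.0, 0.50, 16, 64))  # 0.25
```

Taking the max of the two fractions is one reasonable choice; summing or averaging them would be alternatives depending on how we want to attribute shared-node costs.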
- Restore "User Impact" dashboard and make sure user-defined dashboards persist in RDS.
- Update the frontpage of grafana.spack.io to list our most important dashboards
- Deploy a testing pool of runners on AWS ParallelCluster
- Revamp cache.spack.io
- Keep working to reduce our AWS costs. Experiment with removing our fallback to on-demand instances for our GitLab CI runners.
- Work on implementing the cost-per-job metric.
- Continue work on error categorization and reduction. Partition dashboards to distinguish between failures on develop vs. failures in PRs.