CI: 2023 03 31
Zack Galbreath edited this page Mar 31, 2023
- Aashish Chaudhary
- Alec Scott
- Jacob Nesbitt
- John Parent
- Luke Peyralans
- Massimiliano Culpo
- Mike VanDenburgh
- Ryan Krattiger
- Scott Wittenburg
- Todd Gamblin
- Zack Galbreath
- The general sentiment is that things are getting better. Our error rates are going down!
- You can now click through from the Grafana error taxonomy to matching OpenSearch records to the corresponding GitLab CI job output
- The User Impact dashboard vanished. We are working on restoring it & making sure this doesn't happen again by storing our dashboard definitions in RDS.
- Mike achieved nontrivial savings by finding & terminating some unused EC2 instances
- Otherwise, investigation continues into our NatGateway-Bytes and S3 data transfer costs
- The utilization panel on the Pipeline Overview dashboard is a good start.
- For the "pod count per CI ref" panel, we should sum up all the PR builds so we can more easily compare PR builds vs. develop builds.
- Once we have this metric we'll be able to generate useful aggregate metrics like cost per package, cost per stack, cost per PR, etc.
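The roll-up described above could be sketched as follows. This is a minimal illustration, not the actual dashboard query; the ref-naming convention (PR pipelines on refs beginning with `pr`, the develop pipeline on the `develop` ref) and the sample counts are assumptions for the example.

```python
# Sketch: collapse per-ref pod counts into "pr" vs "develop" buckets so
# all PR builds can be summed and compared against develop builds.
# Ref names and counts below are hypothetical placeholders.

def summarize_pod_counts(pod_counts_by_ref):
    """Sum pod counts into 'pr', 'develop', and 'other' buckets.

    Assumes PR pipelines run on refs that start with 'pr', while the
    develop pipeline runs on the 'develop' ref.
    """
    totals = {"pr": 0, "develop": 0, "other": 0}
    for ref, count in pod_counts_by_ref.items():
        if ref == "develop":
            totals["develop"] += count
        elif ref.startswith("pr"):
            totals["pr"] += count
        else:
            totals["other"] += count
    return totals

counts = {
    "pr36001_fix-openmpi": 12,   # hypothetical PR ref
    "pr36002_add-package": 7,    # hypothetical PR ref
    "develop": 40,
}
print(summarize_pod_counts(counts))  # {'pr': 19, 'develop': 40, 'other': 0}
```

With the PR refs summed into a single series, a panel can plot two comparable lines instead of one line per PR.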
- We need to know the EC2 instance type for jobs performed by our cloud runners. We should be able to use an approach similar to spack-infra PR #427 to get this info.
- Once we have the instance type, we can get its spot price using `aws ec2 describe-spot-price-history`
- We should also weight the cost of the job by the fraction of the system resources it is using (parsed from "CPURequest" and "MemoryRequest" in the job log output).
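A minimal sketch of the weighting just described, assuming the job log contains lines like `CPURequest: 4` and `MemoryRequest: 16G` (a hypothetical log format), and that the node's capacity and spot price are supplied by the caller; in practice the spot price would come from `aws ec2 describe-spot-price-history`:

```python
import re

def parse_requests(log_text):
    """Extract CPURequest / MemoryRequest values from job log output.

    Assumes lines like 'CPURequest: 4' and 'MemoryRequest: 16G'
    (hypothetical format; adjust the patterns to the real logs).
    """
    cpu = re.search(r"CPURequest:\s*([\d.]+)", log_text)
    mem = re.search(r"MemoryRequest:\s*([\d.]+)G", log_text)
    return float(cpu.group(1)), float(mem.group(1))

def weighted_job_cost(log_text, duration_hours, spot_price_per_hour,
                      node_cpus, node_mem_gb):
    """Charge the job for the fraction of the node it requested.

    Uses the larger of the CPU and memory fractions, since the
    scarcer resource is what limits how many jobs pack onto a node.
    """
    cpu_req, mem_gb_req = parse_requests(log_text)
    fraction = max(cpu_req / node_cpus, mem_gb_req / node_mem_gb)
    return spot_price_per_hour * duration_hours * fraction

log = "CPURequest: 4\nMemoryRequest: 16G\n"
# On a 16-vCPU / 64 GB node at $0.50/hr for a 2-hour job:
# fraction = max(4/16, 16/64) = 0.25 -> cost = 0.50 * 2 * 0.25
print(weighted_job_cost(log, 2.0, 0.50, 16, 64))  # 0.25
```

Taking the max of the two fractions is one reasonable choice; summing or averaging them would be alternatives depending on how we want to attribute shared-node costs.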
- Restore "User Impact" dashboard and make sure user-defined dashboards persist in RDS.
- Update the frontpage of grafana.spack.io to list our most important dashboards
- Deploy a testing pool of runners on AWS ParallelCluster
- Revamp cache.spack.io
- Keep working to reduce our AWS costs. Experiment with removing our fallback to on-demand instances for our GitLab CI runners.
- Work on implementing the cost-per-job metric.
- Continue work on error categorization and reduction. Partition dashboards to distinguish between failures on develop vs. failures in PRs.