Attendees

  • Aashish Chaudhary
  • Alec Scott
  • Bill Hoffman
  • Jacob Nesbitt
  • John Parent
  • Massimiliano Culpo
  • Mike VanDenburgh
  • Ryan Krattiger
  • Tammy Grimmett
  • Todd Gamblin
  • Zack Galbreath

Dashboards for GitLab CI

  • Suggestions for improvements to the Pipelines Dashboard
    • show a "success percentage" time series
    • It would be great if we could show the "top error type of the day" in a tooltip
  • Suggestions for improvements to the Jobs Dashboard
    • Change pie chart to show system vs. Spack errors (report success percentage separately elsewhere)
    • Try to split up the spack_error category with more regexes (see the sketch after this list)
    • It would be very useful if we could "drill down" to a table of links to GitLab jobs for each error category
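
As a rough illustration of what splitting the generic spack_error category with additional regexes could look like, here is a minimal Python sketch. The sub-category names and patterns below are hypothetical placeholders, not the classifier actually used by our dashboards:

    import re

    # Hypothetical sub-categories for the spack_error bucket. The patterns are
    # illustrative guesses at common Spack failure signatures, not the regexes
    # used by the real error-classification job.
    ERROR_PATTERNS = [
        ("spack_concretization_error", re.compile(r"concretization failed|unsatisfiable", re.I)),
        ("spack_fetch_error", re.compile(r"FetchError|all fetchers failed", re.I)),
        ("spack_build_error", re.compile(r"ProcessError:.*returned non-zero exit", re.I)),
        ("spack_install_error", re.compile(r"InstallError", re.I)),
    ]

    def classify_job_log(log_text: str) -> str:
        """Return the first matching sub-category, falling back to spack_error."""
        for category, pattern in ERROR_PATTERNS:
            if pattern.search(log_text):
                return category
        return "spack_error"

The first match wins, so more specific failure signatures should be listed before broader ones; logs that match nothing stay in the catch-all spack_error bucket.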

AWS costs

  • Top suspicious offenders are:
    • EC2-other::NatGateway-bytes
    • EC2-other::DataTransfer-Regional-Bytes
    • S3::DataTransfer-Out-Bytes
  • Logging is set up to gather more information about these data transfers. We plan to analyze these logs next week and come up with a more concrete plan to reduce data transfer costs.
  • Ask Evan for help if the causes of, or solutions to, these costs are not immediately obvious.
  • Look more into the unready node issue and create a mitigation (cron job) if a better solution cannot be found (see the sketch after this list).
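
A minimal sketch of what such a cron-job mitigation could look like, assuming a Python script run periodically in the cluster via the official kubernetes client. The 30-minute grace period and the choice to delete (rather than cordon) stuck nodes are assumptions for illustration, not agreed-upon details:

    from datetime import datetime, timezone, timedelta
    from kubernetes import client, config

    # How long a node may sit in NotReady before we consider it stuck (assumed value).
    UNREADY_GRACE = timedelta(minutes=30)

    def find_stuck_nodes(v1: client.CoreV1Api) -> list[str]:
        """Return names of nodes whose Ready condition has been false for too long."""
        stuck = []
        for node in v1.list_node().items:
            ready = next((c for c in (node.status.conditions or []) if c.type == "Ready"), None)
            if ready is None or ready.status == "True":
                continue
            if datetime.now(timezone.utc) - ready.last_transition_time > UNREADY_GRACE:
                stuck.append(node.metadata.name)
        return stuck

    if __name__ == "__main__":
        config.load_incluster_config()  # use load_kube_config() when testing from a laptop
        v1 = client.CoreV1Api()
        for name in find_stuck_nodes(v1):
            print(f"Removing stuck NotReady node: {name}")
            v1.delete_node(name=name)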

Fixing high-impact system issues

  • This week we disabled Karpenter's node consolidation for GitLab CI pods.
  • We also found a bad node in the cluster and manually terminated it.
  • Together these two changes reduced our total error rate by ~50%!

Top priorities

  • Continue to hunt down and eliminate system errors in GitLab CI pipelines
  • Define more regexes to split up the spack_error category
  • Look into linking our Grafana dashboards together, especially if the spack_error subcategories become a separate page
  • Create another dashboard to track our most expensive packages
  • Draft a site reliability policy. The goal of this document is to make it easier for more folks to help out.
    • What steps to take when we receive an alert about an increased rate of errors
    • What thresholds we should use to define such error alerts