CI: 2023 03 24
Zack Galbreath edited this page Mar 24, 2023 · 2 revisions
- Aashish Chaudhary
- Alec Scott
- Bill Hoffman
- Jacob Nesbitt
- John Parent
- Massimiliano Culpo
- Mike VanDenburgh
- Ryan Krattiger
- Tammy Grimmett
- Todd Gamblin
- Zack Galbreath
- Suggestions for improvements to the Pipelines Dashboard
  - Show a "success percentage" time series
  - It would be great if we could show the "top error type of the day" in a tooltip
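A minimal sketch of how the two Pipelines Dashboard suggestions above could be computed from raw records. The record shape `(day, status, error_type)` and the function name `daily_summary` are assumptions made for illustration; they are not taken from the existing dashboard queries.

```python
from collections import Counter, defaultdict


def daily_summary(pipelines):
    """Summarize pipelines per day.

    `pipelines` is an iterable of (day, status, error_type) tuples.
    This record shape is hypothetical; adapt it to the real data source.
    """
    totals = Counter()             # day -> number of pipelines
    successes = Counter()          # day -> number of successful pipelines
    errors = defaultdict(Counter)  # day -> error_type -> count

    for day, status, error_type in pipelines:
        totals[day] += 1
        if status == "success":
            successes[day] += 1
        elif error_type is not None:
            errors[day][error_type] += 1

    summary = {}
    for day in sorted(totals):
        top = errors[day].most_common(1)
        summary[day] = {
            # One point of the "success percentage" time series.
            "success_pct": 100.0 * successes[day] / totals[day],
            # Candidate tooltip text: "top error type of the day".
            "top_error": top[0][0] if top else None,
        }
    return summary
```

Each day yields one time-series point plus the most frequent error type, which could feed a tooltip.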
- Suggestions for improvements to the Jobs Dashboard
  - Change pie chart to show system vs. Spack errors (report success percentage separately elsewhere)
  - Try to split up the `spack_error` category with more regexes
  - It would be very useful if we could "drill down" to a table of links to GitLab jobs for each error category
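One possible way to split `spack_error` with regexes and build the "drill down" table is sketched below. The sub-category names and patterns are placeholders, not the patterns used by our error classifier, and the `(job_url, log_text)` input shape is an assumption; real patterns would come from reading actual job logs.

```python
import re
from collections import defaultdict

# Hypothetical sub-categories. The first matching pattern wins,
# so more specific patterns should be listed first.
SPACK_ERROR_PATTERNS = [
    ("fetch_error", re.compile(r"FetchError", re.IGNORECASE)),
    ("concretization_error", re.compile(r"concretiz", re.IGNORECASE)),
    ("build_error", re.compile(r"ProcessError|make: \*\*\*")),
]


def classify(log_text):
    """Return the first matching sub-category, or the generic fallback."""
    for name, pattern in SPACK_ERROR_PATTERNS:
        if pattern.search(log_text):
            return name
    return "spack_error"


def drill_down(jobs):
    """Group job URLs by error category.

    `jobs` is an iterable of (job_url, log_text) pairs (assumed shape).
    Returns a dict mapping category -> list of job URLs.
    """
    table = defaultdict(list)
    for url, log_text in jobs:
        table[classify(log_text)].append(url)
    return dict(table)
```

Anything that matches no pattern stays in the generic `spack_error` bucket, so the size of that bucket shows how much is still unclassified.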
- Top suspicious offenders are:
  - EC2-other::NatGateway-bytes
  - EC2-other::DataTransfer-Regional-Bytes
  - S3::DataTransfer-Out-Bytes
- Logging is set up to gather more information about these data transfers. We plan to analyze these logs next week and come up with a more concrete plan to reduce data transfer costs.
- Ask Evan for help if the causes or solutions to these costs are not immediately obvious.
- Look more into the unready node issue and create a mitigation (cron job) if a better solution cannot be found.
- This week we disabled Karpenter's node consolidation for GitLab CI pods.
- We also found a bad node in the cluster and manually terminated it.
- Together these two changes reduced our total error rate by ~50%!
- Continue to hunt down and eliminate system errors in GitLab CI pipelines
- Define more regexes to split up the `spack_error` category
- Look into linking our Grafana dashboards together, especially if `spack_error` subcategories become a separate page
- Create another dashboard to track our most expensive packages
- Draft a site reliability policy. The goal of this document is to make it easier for more folks to help out.
  - What steps should we take when we receive an alert about an increased rate of errors?
  - What thresholds should we use to define such error alerting?
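To make the threshold question concrete, here is one possible framing: alert when the recent error rate exceeds a multiple of a longer-term baseline. The function name, the `factor` and `min_jobs` parameters, and their default values are all assumptions for discussion, not agreed-upon policy.

```python
def should_alert(recent_errors, recent_total, baseline_rate,
                 factor=2.0, min_jobs=50):
    """Decide whether the recent error rate warrants an alert.

    recent_errors / recent_total: job counts in the recent window.
    baseline_rate: long-term error rate as a fraction (e.g. 0.10).
    factor, min_jobs: hypothetical tuning knobs, still to be chosen.
    """
    # Too few jobs to say anything meaningful; avoid noisy alerts.
    if recent_total < min_jobs:
        return False
    recent_rate = recent_errors / recent_total
    return recent_rate > factor * baseline_rate
```

With a 10% baseline and the default factor of 2, 30 errors in 100 jobs would trigger an alert and 15 in 100 would not.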