Attendees

  • Aashish Chaudhary
  • Alec Scott
  • Bill Hoffman
  • Jacob Nesbitt
  • John Parent
  • Massimiliano Culpo
  • Mike VanDenburgh
  • Ryan Krattiger
  • Tammy Grimmett
  • Todd Gamblin
  • Zack Galbreath

Dashboards for GitLab CI

  • Suggestions for improvements to the Pipelines Dashboard
    • show a "success percentage" time series
    • It would be great if we could show the "top error type of the day" in a tooltip
  • Suggestions for improvements to the Jobs Dashboard
    • Change pie chart to show system vs. Spack errors (report success percentage separately elsewhere)
    • Try to split up the spack_error category with more regexes (see the sketch after this list)
    • It would be very useful if we could "drill down" to a table of links to GitLab jobs for each error category
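
As a rough illustration of what splitting the generic spack_error category with additional regexes could look like, here is a minimal Python sketch. The sub-category names and patterns below are hypothetical placeholders, not the classifier actually used by our dashboards:

    import re

    # Hypothetical sub-categories for the spack_error bucket. The patterns are
    # illustrative guesses at common Spack failure signatures, not the regexes
    # used by the real error-classification job.
    ERROR_PATTERNS = [
        ("spack_concretization_error", re.compile(r"concretization failed|unsatisfiable", re.I)),
        ("spack_fetch_error", re.compile(r"FetchError|all fetchers failed", re.I)),
        ("spack_build_error", re.compile(r"ProcessError:.*returned non-zero exit", re.I)),
        ("spack_install_error", re.compile(r"InstallError", re.I)),
    ]

    def classify_job_log(log_text: str) -> str:
        """Return the first matching sub-category, falling back to spack_error."""
        for category, pattern in ERROR_PATTERNS:
            if pattern.search(log_text):
                return category
        return "spack_error"

The first match wins, so more specific failure signatures should be listed before broader ones; logs that match nothing stay in the catch-all spack_error bucket.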

AWS costs

  • Top suspicious offenders are:
    • EC2-other::NatGateway-bytes
    • EC2-other::DataTransfer-Regional-Bytes
    • S3::DataTransfer-Out-Bytes
  • Logging is set up to gather more information about these data transfers. We plan to analyze these logs next week and come up with a more concrete plan to reduce data transfer costs.
  • Ask Evan for help if the causes of, or solutions to, these costs are not immediately obvious.
  • Look more into the unready node issue and create a mitigation (cron job) if a better solution cannot be found (see the sketch after this list).
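
A minimal sketch of what such a cron-job mitigation could look like, assuming a Python script run periodically in the cluster via the official kubernetes client. The 30-minute grace period and the choice to delete (rather than cordon) stuck nodes are assumptions for illustration, not agreed-upon details:

    from datetime import datetime, timezone, timedelta
    from kubernetes import client, config

    # How long a node may sit in NotReady before we consider it stuck (assumed value).
    UNREADY_GRACE = timedelta(minutes=30)

    def find_stuck_nodes(v1: client.CoreV1Api) -> list[str]:
        """Return names of nodes whose Ready condition has been false for too long."""
        stuck = []
        for node in v1.list_node().items:
            ready = next((c for c in (node.status.conditions or []) if c.type == "Ready"), None)
            if ready is None or ready.status == "True":
                continue
            if datetime.now(timezone.utc) - ready.last_transition_time > UNREADY_GRACE:
                stuck.append(node.metadata.name)
        return stuck

    if __name__ == "__main__":
        config.load_incluster_config()  # use load_kube_config() when testing from a laptop
        v1 = client.CoreV1Api()
        for name in find_stuck_nodes(v1):
            print(f"Removing stuck NotReady node: {name}")
            v1.delete_node(name=name)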

Fixing high-impact system issues

  • This week we disabled Karpenter's node consolidation for GitLab CI pods.
  • We also found a bad node in the cluster and manually terminated it.
  • Together these two changes reduced our total error rate by ~50%!

Top priorities

  • Continue to hunt down and eliminate system errors in GitLab CI pipelines
  • Define more regexes to split up the spack_error category
  • Look into linking our Grafana dashboards together, especially if the spack_error subcategories become a separate page
  • Create another dashboard to track our most expensive packages
  • Draft a site reliability policy. The goal of this document is to make it easier for more folks to help out.
    • What steps to take when we receive an alert about an increased rate of errors
    • What thresholds we should use to define such error alerts