Prometheus alerts for when etcd-druid's snapshot compaction jobs fail above a certain rate #9739

Draft
wants to merge 3 commits into
base: master

renormalize
Member

@renormalize renormalize commented May 13, 2024

How to categorize this PR?

/area control-plane
/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR enables alerts at the seed level when etcd-druid's snapshot compaction jobs fail at a rate above a threshold (10% is the value currently agreed upon by @gardener/etcd-druid-maintainers). The PR is in draft so that reviewers can test these changes locally; I will open it for review once the reviewers are satisfied with their local testing.

These alerts act as a health check for the seed cluster: a large number of snapshot compaction jobs failing simultaneously would suggest one of the following:

  • Connectivity issues to the remote object storage.
  • Network issues at the cloud provider, which would raise alerts on all seeds on that cloud provider.
  • Backup corruption, which these alerts help detect early.

This PR proposes the following changes:

  • Federate etcd-druid metrics from the Cache Prometheus to the Aggregate Prometheus.
  • Raise alerts based on the etcddruid_compaction_jobs_total metric when more than 10% of the jobs deployed in the last 3 hours have failed (that is, carry the succeeded="false" label).
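
The 10% condition above boils down to simple arithmetic, which can be sketched as follows. This is only an illustration and not the code in this PR: the function names are hypothetical, and the inputs are assumed to be the increase of etcddruid_compaction_jobs_total over the last 3 hours, split by the succeeded label (the succeeded="true" value is an assumption; only succeeded="false" is named in this description).

```go
package main

import "fmt"

// failureThreshold is the 10% value mentioned in the PR description.
const failureThreshold = 0.10

// compactionFailureRatio returns the share of failed compaction jobs in a
// window. The inputs are hypothetical: the number of jobs counted with
// succeeded="false" and with succeeded="true" (assumed label value) over
// the last 3 hours.
func compactionFailureRatio(failed, succeeded float64) float64 {
	total := failed + succeeded
	if total == 0 {
		// No jobs ran in the window, so there is nothing to alert on.
		return 0
	}
	return failed / total
}

// shouldAlert reports whether the failure ratio is strictly above the threshold.
func shouldAlert(failed, succeeded float64) bool {
	return compactionFailureRatio(failed, succeeded) > failureThreshold
}

func main() {
	fmt.Println(shouldAlert(2, 48)) // 2 of 50 jobs failed (4%)
	fmt.Println(shouldAlert(6, 44)) // 6 of 50 jobs failed (12%)
}
```

Returning 0 for an empty window is a design choice of this sketch so that a seed without any compaction jobs never alerts; the actual rule may behave differently.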

Which issue(s) this PR fixes:
Fixes gardener/etcd-druid#603

Special notes for your reviewer:

The last commit in the draft contains changes I made specifically to test this feature in a local gardener setup. It includes an etcd-druid image that labels all snapshot compaction jobs with succeeded="false" to simulate failed jobs.

The sources for that image are on this branch of my fork of etcd-druid. You can build the etcd-druid image locally from it, or directly use the image I built, which is hosted on Docker Hub and referenced in imagevector/images.yaml in the final commit.

The directory where compacted snapshots would be found:

➜  gardener git:(compaction-alerts) ✗ tree dev/local-backupbuckets
dev/local-backupbuckets
└── XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
    └── shoot--local--local--XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        └── etcd-main
            └── v2
                ├── Full-00000000-00000001-1715322606.gz
                ├── Full-00000000-00001714-1715322909.gz
                ├── Full-00000000-00002642-1715323208.gz
                ├── Full-00000000-00003505-1715323509.gz
                ├── Full-00000000-00004335-1715323809.gz
                ├── Full-00000000-00005161-1715324109.gz
                ├── Incr-00000002-00001714-1715322906.gz
                ├── Incr-00001715-00002642-1715323207.gz
                ├── Incr-00002643-00003505-1715323507.gz
                ├── Incr-00003506-00004335-1715323807.gz
                └── Incr-00004336-00005161-1715324107.gz

5 directories, 11 files
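
To make the listing above easier to read, here is a small sketch that splits a snapshot file name into its parts. The field interpretation (kind, start revision, last revision, creation time as a Unix timestamp) is an assumption inferred from the names in the listing, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// snapshot holds the fields assumed to be encoded in a snapshot file name.
type snapshot struct {
	Kind      string // "Full" or "Incr"
	StartRev  int64  // assumed: first etcd revision covered
	LastRev   int64  // assumed: last etcd revision covered
	CreatedAt time.Time
}

// parseSnapshotName parses names such as "Incr-00000002-00001714-1715322906.gz".
func parseSnapshotName(name string) (snapshot, error) {
	parts := strings.Split(strings.TrimSuffix(name, ".gz"), "-")
	if len(parts) != 4 {
		return snapshot{}, fmt.Errorf("unexpected snapshot name %q", name)
	}
	if parts[0] != "Full" && parts[0] != "Incr" {
		return snapshot{}, fmt.Errorf("unknown snapshot kind %q", parts[0])
	}
	start, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return snapshot{}, err
	}
	last, err := strconv.ParseInt(parts[2], 10, 64)
	if err != nil {
		return snapshot{}, err
	}
	ts, err := strconv.ParseInt(parts[3], 10, 64)
	if err != nil {
		return snapshot{}, err
	}
	return snapshot{Kind: parts[0], StartRev: start, LastRev: last, CreatedAt: time.Unix(ts, 0).UTC()}, nil
}

func main() {
	s, err := parseSnapshotName("Incr-00000002-00001714-1715322906.gz")
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Kind, s.StartRev, s.LastRev, s.CreatedAt.Format(time.RFC3339))
}
```

Read this way, each Full file in the listing ends at the same revision as the Incr file written a few seconds before it.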

After the initial review and suggestions, I will remove the final commit in this branch.

Release note:

Alerts are now raised when snapshot compaction jobs in a seed fail at a rate greater than 10%.

…prometheus

* The aggregate prometheus now scrapes etcd-druid's snapshot compaction
  job metrics, which are federated from the cache prometheus.

* Changes are made in `CentralScrapeConfigs()` for the aggregate prometheus.
  Federated metrics are scraped through the job
  `{job="etcd-druid",namespace="garden"}`, which selects the metrics
  whose job name is "etcd-druid" in the cache prometheus.

* Adapted unit tests for `CentralScrapeConfigs()`.
…on jobs in a seed crosses a threshold

* Prometheus rules are set to raise alerts if the number of etcd snapshot
  compaction jobs that have failed in the seed during a 3 hour window in
  the immediate past crosses a threshold.

* The alerts are based on etcd-druid metrics that are federated from the
  cache prometheus to the aggregate prometheus.

* Changes are made in `CentralPrometheusRules()` for the aggregate prometheus.
  If more than 10% of the etcd-druid compaction jobs deployed in the last
  3 hours have the `succeeded="false"` label, then alerts are raised.

* Adapted unit tests for `CentralPrometheusRules()`.
* Changed the etcd-druid image to one that causes compaction jobs to
  always have the `succeeded="false"` label.

* Changed `etcdConfig` for the local gardener setup.

* Changed `etcdConfig` in gardener charts.
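
The alerting condition described in the commit messages above (failed jobs as a share of all jobs over a 3 hour window, compared against 10%) could be written as a PromQL expression along the following lines. This is a sketch under assumptions and not the rule added to `CentralPrometheusRules()`: the helper name is hypothetical, and the use of `sum(increase(...))` is one plausible way to express the ratio.

```go
package main

import "fmt"

// failureRatioExpr returns a PromQL expression that is true when the share of
// series increases carrying succeeded="false" exceeds the given threshold
// within the given window. The shape of the expression is an assumption.
func failureRatioExpr(metric, window string, threshold float64) string {
	failed := fmt.Sprintf(`sum(increase(%s{succeeded="false"}[%s]))`, metric, window)
	total := fmt.Sprintf(`sum(increase(%s[%s]))`, metric, window)
	return fmt.Sprintf("%s / %s > %g", failed, total, threshold)
}

func main() {
	fmt.Println(failureRatioExpr("etcddruid_compaction_jobs_total", "3h", 0.1))
}
```

Keeping the metric name, window, and threshold as parameters makes it easy to try other values while testing locally.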
Contributor

gardener-prow bot commented May 13, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/control-plane Control plane related area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension labels May 13, 2024
Contributor

gardener-prow bot commented May 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign timuthy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 13, 2024
Development

Successfully merging this pull request may close these issues.

[Feature] Alerts for the compaction job metrics
1 participant