Prometheus alerts for when `etcd-druid`'s snapshot compaction jobs fail above a certain rate #9739

renormalize · 2024-05-13T05:59:00Z

How to categorize this PR?

/area control-plane
/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR enables seed-level alerts when etcd-druid's snapshot compaction jobs fail above a certain rate (10% is the value currently agreed upon by @gardener/etcd-druid-maintainers). The PR is a draft so that reviewers can test the changes locally; I will open it for review once they are satisfied with their local testing.

These alerts serve as a health check for the seed cluster: a large number of snapshot compaction jobs failing simultaneously would suggest one of the following, and the alerts help detect it early:

Connectivity issues to the remote object storage.
Network issues for the cloud provider, leading to alerts on all seeds on that cloud provider.
Backup corruption.

This PR proposes the following changes:

Federate etcd-druid metrics from the Cache Prometheus to the Aggregate Prometheus.
Raise alerts based on the etcddruid_compaction_jobs_total metric when more than 10% of the jobs deployed in the last 3 hours have failed (succeeded="false" label).
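
As a rough illustration of the second change, the alert condition could be expressed as a Prometheus rule along the following lines. This is a minimal sketch and not the rule added in this PR: the group name, alert name, severity and annotation text are placeholders, and it assumes etcddruid_compaction_jobs_total is a counter whose succeeded label separates failed from successful jobs.

```yaml
groups:
- name: etcd-druid-compaction.rules            # placeholder group name
  rules:
  - alert: EtcdSnapshotCompactionJobsFailing   # placeholder alert name
    # Share of compaction jobs counted in the last 3 hours with succeeded="false".
    expr: |
      sum(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h]))
        /
      sum(increase(etcddruid_compaction_jobs_total[3h]))
        > 0.1
    labels:
      severity: warning                        # assumed severity
    annotations:
      summary: "More than 10% of snapshot compaction jobs failed in the last 3 hours."
```

If no compaction jobs were counted in the window, the expression returns no sample, so the alert stays silent rather than firing on an empty seed.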

Which issue(s) this PR fixes:
Fixes gardener/etcd-druid#603

Special notes for your reviewer:

The last commit in the draft contains changes I made specifically to test this feature in a local Gardener setup. It includes an etcd-druid image that labels all snapshot compaction jobs with succeeded="false" to simulate failed jobs.

The sources for that image are on a branch of my fork of etcd-druid, which you can use to build the image locally yourself. Alternatively, you can directly use the image I built, which is hosted on Docker Hub and referenced in imagevector/images.yaml in the final commit.

The directory where compacted snapshots would be found:

➜  gardener git:(compaction-alerts) ✗ tree dev/local-backupbuckets
dev/local-backupbuckets
└── XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
    └── shoot--local--local--XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        └── etcd-main
            └── v2
                ├── Full-00000000-00000001-1715322606.gz
                ├── Full-00000000-00001714-1715322909.gz
                ├── Full-00000000-00002642-1715323208.gz
                ├── Full-00000000-00003505-1715323509.gz
                ├── Full-00000000-00004335-1715323809.gz
                ├── Full-00000000-00005161-1715324109.gz
                ├── Incr-00000002-00001714-1715322906.gz
                ├── Incr-00001715-00002642-1715323207.gz
                ├── Incr-00002643-00003505-1715323507.gz
                ├── Incr-00003506-00004335-1715323807.gz
                └── Incr-00004336-00005161-1715324107.gz

5 directories, 11 files

After the initial review and suggestions, I will remove the final commit in this branch.

Release note:

Alerts are now raised when more than 10% of etcd-druid's snapshot compaction jobs in a seed fail.

…prometheus
* The aggregate prometheus now scrapes etcd-druid's snapshot compaction job metrics, which are federated from the cache prometheus.
* Changes are made in `CentralScrapeConfigs()` for the aggregate prometheus. Federated metrics are scraped through the job `{job="etcd-druid",namespace="garden"}`, which selects the metrics whose job name is "etcd-druid" in the cache prometheus.
* Adapted unit tests for `CentralScrapeConfigs()`.
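
For readers unfamiliar with Prometheus federation, a scrape job of the kind described in this commit generally has the following shape. The sketch uses plain Prometheus configuration rather than the Go code touched by `CentralScrapeConfigs()`; the job name and the target address are placeholders, and passing the selector as the `match[]` parameter is an assumption about how the change is wired.

```yaml
scrape_configs:
- job_name: etcd-druid-federation        # placeholder job name
  honor_labels: true                     # keep the labels set by the cache prometheus
  metrics_path: /federate                # Prometheus federation endpoint
  params:
    'match[]':
    - '{job="etcd-druid",namespace="garden"}'
  static_configs:
  - targets:
    - prometheus-cache.example.svc:80    # placeholder address of the cache prometheus
```

Only series matching the `match[]` selector are exposed by the `/federate` endpoint, which keeps the aggregate prometheus from pulling in unrelated metrics.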

…on jobs in a seed crosses a threshold
* Prometheus rules are set to raise alerts if the share of etcd snapshot compaction jobs that failed in the seed during the preceding 3 hour window crosses a threshold.
* The alerts are based on etcd-druid metrics that are federated from the cache prometheus to the aggregate prometheus.
* Changes are made in `CentralPrometheusRules()` for the aggregate prometheus. If more than 10% of the etcd-druid compaction jobs deployed in the last 3 hours have the `succeeded="false"` label, alerts are raised.
* Adapted unit tests for `CentralPrometheusRules()`.

* Changed the etcd-druid image to one that causes compaction jobs to always have the `succeeded="false"` label.
* Changed `etcdConfig` for the local gardener setup.
* Changed `etcdConfig` in gardener charts.

gardener-prow · 2024-05-13T05:59:03Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

gardener-prow · 2024-05-13T05:59:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign timuthy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

renormalize added 3 commits May 13, 2024 11:06

Changes needed for local testing

e5491e0


gardener-prow bot requested review from ialidzhikov and ScheererJ May 13, 2024 05:59

gardener-prow bot added the cla: yes and size/M labels May 13, 2024
