Prometheus alerts for when etcd-druid's snapshot compaction jobs fail above a certain rate
#9739
+62
−19
How to categorize this PR?
/area control-plane
/area monitoring
/kind enhancement
What this PR does / why we need it:
This PR enables alerts at the seed level when etcd-druid's snapshot compaction jobs fail above a certain rate (10% is the value currently agreed upon by @gardener/etcd-druid-maintainers). The PR is in draft so that reviewers can test these changes locally; I will open it for review once the reviewers are satisfied with their local testing.
These alerts are a health check for the seed cluster in the sense that a large number of snapshot compaction jobs failing simultaneously would suggest:
This PR proposes the following changes:
etcd-druid metrics from the Cache Prometheus to the Aggregate Prometheus.
etcddruid_compaction_jobs_total metric when more than 10% of the jobs deployed in the last 3 hours have failed (succeeded="false" label).
Which issue(s) this PR fixes:
Fixes gardener/etcd-druid#603
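For reviewers who want to reason about the alert condition described under the proposed changes (more than 10% of compaction jobs in the last 3 hours carrying succeeded="false"), here is a minimal sketch of what such a Prometheus alerting rule could look like. It is not the rule shipped in this PR: the group name, alert name, severity label, and annotation text are hypothetical, and the expression assumes etcddruid_compaction_jobs_total is a counter, as its _total suffix suggests.

```yaml
groups:
  - name: etcd-druid-compaction.rules          # hypothetical group name
    rules:
      - alert: EtcdDruidCompactionJobsFailing  # hypothetical alert name
        # Ratio of failed compaction jobs to all compaction jobs over the
        # last 3 hours; the alert fires when the ratio exceeds 10%.
        # Assumes the metric is a counter, so increase() is applicable.
        expr: |
          sum(increase(etcddruid_compaction_jobs_total{succeeded="false"}[3h]))
            /
          sum(increase(etcddruid_compaction_jobs_total[3h]))
            > 0.1
        labels:
          severity: warning                    # assumed severity
        annotations:
          summary: More than 10% of etcd-druid snapshot compaction jobs failed in the last 3 hours.
```

If no compaction jobs ran in the window, the expression yields no result (or NaN), so the alert stays silent rather than firing on an empty denominator.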
Special notes for your reviewer:
The last commit in the draft contains changes I made specifically to test this feature in a local gardener setup. It includes an image for etcd-druid which labels all etcd-druid snapshot compaction jobs with the succeeded="false" label, to simulate failed jobs.
The sources for that can be found on this branch of my fork of etcd-druid, which you can use to build the etcd-druid image locally yourself; alternatively, you can directly use the image I've built, which is hosted on Docker Hub, as can be seen in imagevector/images.yaml in the final commit.
The directory where compacted snapshots would be found:
After the initial review and suggestions, I will remove the final commit in this branch.
Release note: