[GEP-19] Migrate monitoring stack to `prometheus-operator` #9065

rfranzke · 2024-01-23T13:37:19Z

How to categorize this issue?

/area dev-productivity monitoring
/kind enhancement

What would you like to be added:
The monitoring stack should be migrated from the current custom-built Helm charts to the prometheus-operator as proposed in GEP-19.

Why is this needed:
GEP-19 was accepted and merged a long while ago, so we should complete its implementation. In addition, the garden cluster (managed via gardener-operator, ref #7016) does not have a monitoring stack yet. The migration also improves development productivity by cleaning up technical debt and improving the code.

Tasks:

- Deployment of prometheus-operator
  - Golang component package: [GEP-19] Introduce prometheus-operator in garden and seed clusters #9067
  - Deployment to garden cluster via gardener-operator: [GEP-19] Introduce prometheus-operator in garden and seed clusters #9067
  - Deployment to seed clusters via gardenlet: [GEP-19] Introduce prometheus-operator in garden and seed clusters #9067
- Prometheis/Alertmanager responsible for seed cluster
- Prometheis/Alertmanager responsible for garden cluster
- Prometheus/Alertmanager responsible for shoot clusters
  - [GEP-19] Migrate shoot Alertmanager deployment and configuration #9257
  - [GEP-19] Migrate shoot Prometheus deployment #9695
  - Adapt shoot Prometheus configuration (scrape configs, rules, ...) for components deployed by gardenlet
    - [GEP-19] Adapt monitoring configuration for shoot cluster system components #9737
    - [GEP-19] Adapt monitoring configuration for shoot control plane components #9848
- Adapt how extensions provide their observability configuration
- Miscellaneous
  - [GEP-19] Extend health checks of gardener-resource-manager for new Prometheus and Alertmanager resources #9163
  - ~~Consider deployment of admission webhook server~~ (abandoned for now due to other, more important topics)
  - [GEP-19] Add sidecar to Plutono for fetching dashboard ConfigMaps dynamically #9624

General notes for the migration (taken from #6319):

- Add temporary migration code for the persistent volume. This ensures that no data is lost.
  1. Find the "old" PVC and its PV and set persistentVolumeReclaimPolicy=Retain.
  2. Delete the "old" PVC.
  3. Create a Prometheus object with a volumeClaimTemplate that references the PV with volumeName=<existing-pv>.
  4. Migrate the data using an init container.
  5. Remove the migration code after 1-2 releases.
- Add all existing Prometheus configuration to an additionalScrapeConfig. This allows us to switch to the prometheus-operator without creating PodMonitors and ServiceMonitors for each component, and instead to do that migration step by step.
- Add all extension Prometheus configuration to the same additionalScrapeConfig. This gives extensions time to migrate as well.
- Replace existing rules with PrometheusRules.
- Once all of these steps are completed, most of the configuration in the additionalScrapeConfig can be migrated to PodMonitors and ServiceMonitors.
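
To illustrate why the reclaim policy has to be changed before the old PVC is deleted, steps 1 to 3 of the persistent volume migration can be sketched against a small in-memory model. This is only a sketch under stated assumptions, not the migration code from the Gardener repository: `FakeCluster`, `PersistentVolume`, `migrate_claim`, and the object names `pv-1` and `prometheus-db-prometheus-0` are hypothetical, and the returned dictionary only mirrors the `volumeClaimTemplate` / `volumeName` fields mentioned in step 3.

```python
from dataclasses import dataclass, field


@dataclass
class PersistentVolume:
    name: str
    reclaim_policy: str = "Delete"


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the API server."""
    volumes: dict = field(default_factory=dict)  # PV name -> PersistentVolume
    claims: dict = field(default_factory=dict)   # PVC name -> bound PV name

    def delete_claim(self, claim_name: str) -> None:
        pv_name = self.claims.pop(claim_name)
        # Mimic reclaim behaviour: only a "Retain" volume outlives its claim.
        if self.volumes[pv_name].reclaim_policy != "Retain":
            del self.volumes[pv_name]


def migrate_claim(cluster: FakeCluster, old_claim: str) -> dict:
    """Run steps 1-3 and return a storage stanza for the new Prometheus object."""
    pv = cluster.volumes[cluster.claims[old_claim]]
    pv.reclaim_policy = "Retain"     # step 1: protect the data
    cluster.delete_claim(old_claim)  # step 2: drop the old PVC
    # Step 3: the new claim template pins the existing volume by name.
    return {"storage": {"volumeClaimTemplate": {"spec": {"volumeName": pv.name}}}}


cluster = FakeCluster(
    volumes={"pv-1": PersistentVolume("pv-1")},
    claims={"prometheus-db-prometheus-0": "pv-1"},
)
spec = migrate_claim(cluster, "prometheus-db-prometheus-0")
```

If step 1 were skipped in this model, `delete_claim` would remove the volume together with the claim, which is the data loss the temporary migration code is meant to prevent. Steps 4 and 5 (the init container and the later cleanup) are outside the scope of this sketch.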


rfranzke · 2024-05-29T06:16:32Z

All tasks have been completed.
/close

gardener-prow · 2024-05-29T06:16:36Z

@rfranzke: Closing this issue.


rfranzke self-assigned this Jan 23, 2024

gardener-prow bot added the labels area/dev-productivity (Developer productivity related: how to improve development), area/monitoring (Monitoring related, including availability monitoring and alerting), and kind/enhancement (Enhancement, improvement, extension) Jan 23, 2024

rfranzke pinned this issue Jan 23, 2024

rfranzke mentioned this issue Jan 23, 2024

[GEP-19] Introduce prometheus-operator in garden and seed clusters #9067

Merged

This was referenced Mar 14, 2024

[GEP-19] Remove the additional matcher from the Alertmanager config #9384

Merged

[GEP-19] Fix the match expression in the alertmanager configuration #9387

Merged

rfranzke mentioned this issue Apr 5, 2024

[GEP-19] Integrate Prometheus and blackbox-exporter deployments into Garden controller #9543

Merged

This was referenced Apr 15, 2024

[GEP-19] Allow public network access for Garden Prometheus #9587

Merged

[GEP-19] Integrate long-term Prometheus deployment into Garden controller #9606

Merged

rfranzke mentioned this issue Apr 19, 2024

[GEP-19] Add sidecar to Plutono for fetching dashboard ConfigMaps dynamically #9624

Merged

rfranzke mentioned this issue May 11, 2024

[GEP-19] Adapt monitoring configuration for shoot cluster system components #9737

Merged

rfranzke mentioned this issue May 22, 2024

PVC migration: Remove .spec.claimRef only after new PVC got created #9817

Merged

rfranzke added the labels kind/epic (Large multi-story topic) and area/ipcei (IPCEI: Important Project of Common European Interest) May 23, 2024

rfranzke mentioned this issue May 24, 2024

[GEP-19] Adapt monitoring configuration for shoot control plane components #9848

Merged

gardener-prow bot closed this as completed May 29, 2024

rfranzke unpinned this issue May 29, 2024
