Skip to content

Latest commit

 

History

History
1686 lines (1152 loc) · 51.7 KB

CHANGELOG.md

File metadata and controls

1686 lines (1152 loc) · 51.7 KB

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Added

  • Add service priority as a tag in opsgenie alerts.

Fixed

  • Upgrade go-kit/kit to fix CVE-2022-24450 and CVE-2022-29946.
  • Upgrade getsentry/sentry-go to fix CVE-2021-23772, CVE-2021-42576, CVE-2020-26892, and CVE-2021-3127.

4.3.0 - 2022-08-02

Fixed

  • Fix psp names for prometheus and alertmanager.

4.2.0 - 2022-07-28

Changed

  • Set node-exporter namespace to kube-system for CAPI MCs and all WC, and to monitoring for vintage MCs.
  • Set cert-exporter namespace to kube-system for CAPI MCs and all WC, and to monitoring for vintage MCs.

Fixed

  • Added pod_name as a label to distinguish between multiple etcd pods when running in-cluster (e.g. CAPI).

Added

  • Push to gcp-app-collection.

Changed

  • Bump alpine from 3.16.0 to 3.16.1

4.1.0 - 2022-07-20

Changed

  • Upgrade operatorkit from v7.0.1 to v7.1.0.

Added

  • errors_total metric for each controller (comes with operatorkit upgrade).

Fixed

  • Cleanup of RemoteWrite Status (configuredPrometheuses, syncedSecrets) in case a cluster gets deleted.

4.0.1 - 2022-07-14

Fixed

  • Fix creation of new prometheus instance once a cluster has been created

4.0.0 - 2022-07-13

Added

  • Implement remotewrite CR logic, in order to configure Prometheus remotewrite config.
  • Add HTTP_PROXY in remotewrite config
  • Add unit tests for remotewrite resource
  • Add Secrets field in the RemoteWrite CR
  • Implement sync RemoteWrite Secrets logic
  • Adding RemoteWrite.status field to ensure cleanup
  • Add psp and service account for prometheus and alertmanager

Changed

  • Rename vcd to cloud-director

Fixed

  • Fix API server discovery.

Removed

  • Remove duplicate scrape config targets.

Fixed

  • Fix API server discovery.
  • Add patch verb for remoteWrite resources.

3.8.0 - 2022-06-30

Added

  • Add Secrets field in the RemoteWrite CR

3.7.0 - 2022-06-20

This release was created on release-v3.5.x branch to fix release 3.6.0 see PR#992

Changed

  • Change remote write name to grafana-cloud.

3.6.0 - 2022-06-08

Added

  • Add remotewrite controller.
  • Deployment of remoteWrite CRD in Helm chart
  • Ignore remotewrite field when updating prometheus CR.
  • Add PodMonitor support for workload cluster Prometheus.

Fixed

  • dependencies updates
  • fix build by ignoring CVEs we can't fix for the moment
  • Upgrade docker image from Alpine 3.15.1 to Alpine 3.16.0

Added

  • remoteWrite CustomResourceDefinition

3.5.0 - 2022-05-17

Added

  • Add Cluster Service Priority label.
  • Add customer and organization label to metrics.
  • Add VCD provider.

3.4.3 - 2022-05-10

Fixed

  • Add 5mn initial delay before performing readiness checks.

3.4.2 - 2022-05-09

Fixed

  • Use 'ip' node label as target to scrape etcd on MCs.

3.4.1 - 2022-05-05

Fixed

  • Fix CAPI cluster detection for legacy Management Clusters.

3.4.0 - 2022-05-04

Added

  • Add PodMonitor support on management clusters.

3.3.0 - 2022-05-04

Changed

  • Add nodepool label to kube-state-metrics metrics.
  • Improve CAPI cluster detection.

3.2.0 - 2022-04-13

Changed

  • Change how MC managed with CAPI are reconciled in PMO (using the cluster CR instead of the Kubernetes Service)

Fixed

  • Fix etcd service discovery for CAPI clusters.

Removed

  • Remove skip resource

3.1.0 - 2022-04-08

Added

  • Add support for etcd-certificates on OpenStack.
  • Add context to generic resources.

Fixed

  • Add skip resource, to fix MC duplicated handling.

3.0.0 - 2022-03-28

Added

  • Add alertmanager ingress.
  • Configure alertmanager and wire prometheus to both legacy and new alertmanagers.

Changed

  • Remove deprecated matcher types from alertmanager config.
  • Changed scrape_interval to 180s and scrape_timeout to 60s for azure-collector.

Removed

  • Remove old teams from alertmanager template.
  • Remove code to manage legacy alertmanager.

2.4.0 - 2022-03-16

Changed

  • Migrate to rbac/v1 from rbac/v1beta1.
  • Change additional scraping config to keep cadvisor metrics for kong.* named namespaces

Fixed

  • Do not trail right whitespaces in config.

2.3.0 - 2022-03-04

Changed

  • Support ingress v1 by default.
  • Scrape node-exporter trough apiserver proxy.

Fixed

  • Old references to Firecracker and Celestial replaced with Phoenix

2.2.1 - 2022-02-24

Fixed

  • Fix failing aggregation:prometheus:memory_percentage due to duplicated series from node exporter.

2.2.0 - 2022-01-20

Changed

  • Allow overriding the scraping protocol

Fixed

  • Set ingress class name in ingress spec instead of annotation to prepare supporting ingress v1.

2.1.1 - 2022-01-12

Fixed

  • Prevent panic when encountering a different user in the CAPI kubeconfig.

2.1.0 - 2022-01-10

Added

  • Added support for OpenStack provider

2.0.0 - 2022-01-03

Changed

  • Disable cluster-api controller on KVM installations.
  • Disable legacy controller on AWS and Azure installations.
  • Upgrade to Go 1.17
  • Upgrade github.com/giantswarm/microkit v0.2.2 to v1.0.0
  • Upgrade github.com/giantswarm/versionbundle v0.2.0 to v1.0.0
  • Upgrade github.com/giantswarm/microendpoint v0.2.0 to v1.0.0
  • Upgrade github.com/giantswarm/microerror v0.3.0 to v0.4.0
  • Upgrade github.com/giantswarm/micrologger v0.5.0 to v0.6.0
  • Upgrade github.com/spf13/viper v1.9.0 to v1.10.0
  • Upgrade github.com/giantswarm/k8sclient v5.12.0 to v7.0.1
  • Upgrade k8s.io/api v0.19.4 to v0.21.4
  • Upgrade k8s.io/apiextensions-apiserver v0.19.4 to v0.21.4
  • Upgrade sigs.k8s.io/controller-runtime v0.6.4 to v0.8.3
  • Upgrade k8s.io/client-go v0.19.4 to v0.21.4
  • Upgrade github.com/giantswarm/operatorkit v4.3.1 to v7.0.0
  • Upgrade sigs.k8s.io/cluster-api v0.3.19 to v0.4.5
  • Upgrade sigs.k8s.io/controller-runtime v0.8.3 to v0.9.7
  • Upgrade github.com/prometheus-operator v0.50.0 to v0.52.1

Removed

  • Remove k8sclient.G8sClient
  • Remove versionbundle.Changelog
  • Remove github.com/giantswarm/cluster-api v0.3.13-gs

1.53.0 - 2021-12-17

Changed

  • Renamed cancel_if_has_no_workers inhibition to cancel_if_cluster_has_no_workers to make it explicit it's about clusters and not node pools.

1.52.1 - 2021-12-14

Fixed

  • Fix relabeling for __meta_kubernetes_service_annotation_giantswarm_io_monitoring_app_label

1.52.0 - 2021-12-13

Added

  • Add new inhibition for clusters without workers.
  • Add relabeling for __meta_kubernetes_service_annotation_giantswarm_io_monitoring_app_label

Changed

  • Upgrade alertmanager to v0.23.0
  • Upgrade prometheus-operator v0.49.0 to v0.50.0

Fixed

  • Avoid defaulting of role label (containing the role of the k8s node). If data is missing we can't reliably default it.

1.51.2 - 2021-10-28

Fixed

  • Fix finding certificates in organization namespaces.

Removed

  • Remove cloud limit alerts from customer channel.

1.51.1 - 2021-09-10

Fixed

  • Re-introduce v1alpha2 scheme.

1.51.0 - 2021-09-09

Changed

  • Drop v1alpha2 scheme.
  • Reconcile v1alpha3 cluster.

Fixed

  • Do not create the legacy controller on new installations.

1.50.0 - 2021-08-16

Changed

  • Upgrade prometheus-operator to v0.49.0

Fixed

  • Fix an issue where prometheus config is empty, due to missing serviceMonitorSelector.

1.49.0 - 2021-08-11

Added

  • Add additionalScrapeConfigs flag which accepts a string which will be appended to the management cluster scrape config template for installation specific configuration.

1.48.0 - 2021-08-09

Added

  • Add receiver and route for #noise-falco Slack channel.

1.47.0 - 2021-08-05

Changed

  • Add the service label in the alert templates for the ServiceLevelBurnRateTooHigh alert.
  • Update Prometheus to 2.28.1.
  • Allow the use of Prometheus Operator Service Monitor for management clusters.

1.46.0 - 2021-07-14

Changed

  • Use giantswarm/config to generate managed configuration.

1.45.0 - 2021-06-28

Changed

  • Use Grafana Cloud remote-write URL from config instead of hardcoding it, to allow overriding the URL in installations which can't access Grafana Cloud directly.

1.44.2 - 2021-06-24

1.44.1 - 2021-06-24

1.44.0 - 2021-06-23

Removed

1.43.0 - 2021-06-22

Changed

  • Removed ServiceLevelBurnRateTicket alert.

1.42.0 - 2021-06-22

Changed

  • Removed NodeExporterDown alert and use SLO framework to monitor node-exporters.
  • Change ServiceLevelBurnRateTooHigh and ServiceLevelBurnRateTooHighTicket to opt-out for services.

1.41.2 - 2021-06-22

Fixed

  • Fix typo in AzureClusterCreationFailed and AzureClusterUpgradeFailed

1.41.1 - 2021-06-22

Added

  • Add term to not count api-server errors for clusters in transitioning state.
  • Business-hours alert for azure clusters not updating in time.

Changed

  • Increase ManagementClusterWebhookDurationExceedsTimeout duration from 5m to 15m.

Fixed

  • Fix CoreDNSMaxHPAReplicasReached alert to not fire in case max and min are equal.
  • Business-hours alert for azure clusters not creating in time.

Removed

  • Remove AlertManager ingress to avoid conflicts with the existing one, until the new AlertManager is ready to replace the one from g8s-prometheus

1.41.0 - 2021-06-17

Added

  • Add AppPendingUpdate alert.
  • Add scrapeconfig for falco-exporter on management clusters.
  • Add Alertmanager managed by Prometheus Operator.
  • Add Alertmanager ingress.
  • Add WorkloadClusterDeploymentNotSatisfiedLudacris to monitor metrics-server in workload clusters.
  • Add CoreDNSMaxHPAReplicasReached business hours alert for when CoreDNS has been scaled to its maximum for too long.

Changed

  • Lower Prometheus disk space alert from 10% to 5%.
  • Change severity of ChartOperatorDown alert to notify.
  • Merge all provider certificate.management-cluster.rules into one prometheus rule.

Fixed

  • Fix service name in ingress.

1.40.0 - 2021-06-14

Changed

  • Lower kubelet SLO from 99.9% to 99%.

1.39.0 - 2021-06-11

Added

  • Add ServiceLevelBurnRateTicket alert.
  • Add the prometheus log level option
  • Add high and low burn rates as recording rules.

Changed

  • Move managed apps SLO alerts to the service-level format.
  • Set HighNumberOfAllocatedSockets to notify not page
  • Extract kubelet and api-server SLO targets to their own recording rules.
  • Extract kubelet and api-server alerting thresholds to their own recording rules.
  • Change ServiceLevelBurnRateTooHigh to use new created values.

Fixed

  • Fixed the way VPA maxAllowed parameter for memory is calculated so that we avoid going over node memory capacity with the memory limit (maxAllowed is used for request and limit is that multiplied by 1.2).

1.38.0 - 2021-05-28

Changed

  • Increased alert duration of PrometheusCantCommunicateWithKubernetesAPI.
  • Refactor resources to namespace monitoring and alerting code.
  • Add cluster-autoscaler to WorkloadClusterContainerIsRestartingTooFrequentlyFirecracker

Removed

  • Remove tlscleanup and volumeresizehack resources as they are not needed anymore.

1.37.0 - 2021-05-26

Added

  • Add HTTP proxy support to Prometheus Remote Write.

1.36.0 - 2021-05-25

Added

  • Added alert HighNumberOfAllocatedSockets for High number of allocated sockets
  • Added alert HighNumberOfOrphanedSockets for High number of orphaned sockets
  • Added alert HighNumberOfTimeWaitSockets for High number of time wait sockets
  • Added alert AWSWorkloadClusterNodeTooManyAutoTermination for terminate unhealthy feature.
  • Preserve and merge global HTTP client config when generating heartbeat receivers in AlertManager config; this allows it to be used in environments where internet access is only allowed through a proxy.

Changed

  • Include cluster-api-core-unique-webhook into DeploymentNotSatisfiedFirecracker and DeploymentNotSatisfiedChinaFirecracker.
  • Increased duration for PrometheusPersistentVolumeSpaceTooLow alert
  • Increased duration for WorkloadClusterEtcdDBSizeTooLarge alert.
  • Increased duration for WorkloadClusterEtcdHasNoLeader alert.
  • Silence OperatorkitErrorRateTooHighCelestial and OperatorkitCRNotDeletedCelestial outside working hours.
  • Update Prometheus to 2.27.1
  • Add atlas, and installation tag onto Heartbeats.

Fixed

  • Fix PrometheusFailsToCommunicateWithRemoteStorageAPI alert not firing on china clusters.

1.35.0 - 2021-05-12

Added

  • Add alert alertmanager-dashboard not satisfied.

1.34.1 - 2021-05-10

Fixed

  • inhibit KubeStateMetricsDown and KubeStateMetricsMissing

1.34.0 - 2021-05-06

Changed

  • Lower the severity to notify for managed app's error budget alerts

Fixed

  • Fix ManagedApp alert
  • Fix InhibitionKubeStateMetricsDown not firing long enough

1.33.0 - 2021-04-27

Changed

  • Raise prometheus cpu limit to 150%.

Removed

  • Remove PodLimitAlmostReachedAWS and EBSVolumeMountErrors alerts as they were not used.

1.32.1 - 2021-04-22

Fixed

  • Adjust container restarting too often firecracker.

1.32.0 - 2021-04-19

Added

  • Add alert for kube-state-metrics missing.
  • Tune remote write configuration to avoid loss of data.

Changed

  • Only fire KubeStateMetricsDown if kube-state-metrics is down.

1.31.0 - 2021-04-16

Added

  • Page firecracker for failed cluster transitions.
  • Page Firecracker in working hours for restarting containers.
  • Add recording rules for kube-mixins
  • MatchingNumberOfPrometheusAndCluster now has a runbook, link added to alert.

Changed

1.30.0 - 2021-04-12

Removed

  • Remove Gatekeeper alerts and targets.

1.29.1 - 2021-04-09

Fixed

  • Fix inhibition for MatchingNumberOfPrometheusAndCluster alert by matching it with source from Management Cluster instead of the cluster the alert is firing for.

1.29.0 - 2021-04-09

Added

  • Add PrometheusCantCommunicateWithRemoteStorageAPI to alert when Prometheus fails to send samples to Cortex.
  • Add workload type and name labels for ManagedAppBasicError* alerts
  • Add alert for master node in HA setup down for too long.
  • Add aggregation for docker actions.

Fixed

  • Fix prometheus storage alert

Removed

  • Removed unnecessary whitespace in additional scrape configs.

1.28.0 - 2021-04-01

Added

  • Add support to calculate maximum CPU.
  • Include cadvisor metrics from the pod in draughtsman namespace.
  • Add PrometheusPersistentVolumeSpaceTooLow alert for prometheus storage going over 90 percent.

Changed

  • Split ManagementClusterCertificateWillExpireInLessThanTwoWeeks alert per provider.
  • Increased duration time for flapping WorkloadClusterWebhookDurationExceedsTimeout alert

Fixed

  • Changed prometheus volume space alert ownership to atlas:
    • PersistentVolumeSpaceTooLow -> PrometheusPersistentVolumeSpaceTooLow

Removed

  • Do not monitor docker for CAPI clusters

Removed

  • Remove promxy resource.

1.27.4 - 2021-03-26

  • Add recording rules for dex activity, creating the metrics
    • aggregation:dex_requests_status_ok
    • aggregation:dex_requests_status_4xx
    • aggregation:dex_requests_status_5xx

1.27.3 - 2021-03-25

  • Fix prometheus/common secret token in imported code.

1.27.2 - 2021-03-25

Fixed

  • Fix alertmanager secretToken in imported alertmanager code.

1.27.1 - 2021-03-25

Fixed

  • Remove follow_redirects from alertmanager config
    • Update prometheus/alertmanger@v0.21.0
    • Update prometheus/common@v0.17.0

1.27.0 - 2021-03-24

Changed

  • Update architect to 2.4.2

Removed

  • Removed memory-intensive notify only systemd alerts.

1.26.0 - 2021-03-24

Changed

  • Push to shared-app-collection
  • Rename EtcdWorkloadClusterDown to WorkloadClusterEtcdDown
  • Increased memory limits by 1.2 factor

Fixed

  • Support vmware for WorkloadClusterEtcdDown
  • Add vmware to the list of valid providers

1.25.2 - 2021-03-23

Fixed

  • Disable follow redirect for alertmanager

1.25.1 - 2021-03-22

Fixed

  • Set prometheus minimum CPU to 100m

1.25.0 - 2021-03-22

Added

  • Add support for monitoring vmware clusters
  • Add support to get the API Server URL for both legacy and CAPI clusters

Changed

  • Upgrade ingress version to networking.k8s.io/v1beta1

Fix

  • Fix typo in MatchingNumberOfPrometheusAndCluster alert
  • Fix scrapeconfig to use secured ports for kubernetes control plane components for CAPI clusters
  • Fix scrapeconfig to proxy all calls through the API Server for CAPI clusters

1.24.8 - 2021-03-18

Fix

  • Avoid alerting for MatchingNumberOfPrometheusAndCluster when a cluster is being deleted.

1.24.7 - 2021-03-18

Added

  • Add support to copy CAPI cluster's certificates
  • Add aggregation aggregation:giantswarm:api_auth_giantswarm_successful_attempts_total.

1.24.6 - 2021-03-02

Fixed

  • Fix equality check on the VPA CR to prevent it being overriden and losing it's status information on every prometheus-meta-operator deployment.
  • Inhibit MatchingNumberOfPrometheusAndCluster when kube-state-metrics is down to prevent bogus pages when kube_pod_container_status_running metric isn't available

1.24.5 - 2021-03-02

Added

  • Set the prometheus UI Web page title.
  • Add 'app' label to metrics pushed from app-exporter to cortex

1.24.4 - 2021-02-26

Changed

  • Avoid alerting for ETCD backups outside business hours.

1.24.3 - 2021-02-24

Changed

  • Use resident_memory when calculating docker memory usage.

1.24.2 - 2021-02-24

Added

  • Add 'catalog' label to metrics pushed from app-exporter to cortex

1.24.1 - 2021-02-23

Fixed

  • Fixed syntax error in expressions of ManagementClusterPodPending* alerts

1.24.0 - 2021-02-23

Added

  • Add Alert for missing prometheus for a workload cluster
  • Add ManagementClusterPodStuckFirecracker and WorkloadClusterPodStuckFirecracker alerts for Firecracker.
  • Add ManagementClusterPodStuckCelestial alert for Celestial.
  • Send samples per second to cortex

Changed

  • Move Cluster Autoscaller app installation/upgrade related alerts to team Batman.

1.23.1 - 2021-02-22

Added

  • Add TestClusterTooOld for testing installations
  • Added Mayu as a scrape target as well as puma's pods

Changed

  • Apply prometheus rule group (which includes
  • Discover ETCD targets through the LoadBalancer using the giantswarm.io/etcd-domain annotation

Fixed

  • Remove PersistentVolumeSpaceTooLow from Workload Clusters.

1.23.0 - 2021-02-17

Added

  • Add the sig-customer alerts:
    • WorkloadClusterCertificateWillExpireInLessThanAMonth
    • WorkloadClusterCertificateWillExpireMetricMissing
  • Add the ludacris alerts:
    • CadvisorDown
    • CalicoRestartRateTooHigh
    • CertOperatorVaultTokenAlmostExpiredMissing
    • CertOperatorVaultTokenAlmostExpired
    • ClusterServiceVaultTokenAlmostExpiredMissing
    • ClusterServiceVaultTokenAlmostExpired
    • CollidingOperatorsLudacris
    • CoreDNSCPUUsageTooHigh
    • CoreDNSDeploymentNotSatisfied
    • CoreDNSLatencyTooHigh
    • DeploymentNotSatisfiedLudacris and assign it to rocket DeploymentNotSatisfiedRocket
    • DockerMemoryUsageTooHigh for both Ludacris and Biscuit
    • DockerVolumeSpaceTooLow for both Ludacris and Biscuit
    • EtcdVolumeSpaceTooLow for both Ludacris and Biscuit
    • JobFailed renamed to ManagementClusterJobFailed
    • KubeConfigMapCreatedMetricMissing
    • KubeDaemonSetCreatedMetricMissing
    • KubeDeploymentCreatedMetricMissing
    • KubeEndpointCreatedMetricMissing
    • KubeNamespaceCreatedMetricMissing
    • KubeNodeCreatedMetricMissing
    • KubePodCreatedMetricMissing
    • KubeReplicaSetCreatedMetricMissing
    • KubeSecretCreatedMetricMissing
    • KubeServiceCreatedMetricMissing
    • KubeStateMetricsDown
    • KubeletConditionBad
    • KubeletDockerOperationsErrorsTooHigh
    • KubeletDockerOperationsLatencyTooHigh
    • KubeletPLEGLatencyTooHigh
    • KubeletVolumeSpaceTooLow for both Ludacris and Biscuit
    • LogVolumeSpaceTooLow for both Ludacris and Biscuit
    • MachineAllocatedFileDescriptorsTooHigh
    • MachineEntropyTooLow
    • MachineLoadTooHigh and moved it to biscuit
    • MachineMemoryUsageTooHigh and moved it to biscuit
    • ManagementClusterAPIServerAdmissionWebhookErrors
    • ManagementClusterAPIServerLatencyTooHigh
    • ManagementClusterContainerIsRestartingTooFrequently
    • ManagementClusterCriticalSystemdUnitFailed
    • ManagementClusterDaemonSetNotSatisfiedLudacris
    • ManagementClusterDaemonSetNotSatisfiedLudacris
    • ManagementClusterDisabledSystemdUnitActive
    • ManagementClusterHighNumberSystemdUnits
    • ManagementClusterNetExporterCPUUsageTooHigh
    • ManagementClusterSystemdUnitFailed
    • ManagementClusterWebhookDurationExceedsTimeout
    • Network95thPercentileLatencyTooHigh
    • NetworkCheckErrorRateTooHigh
    • NodeConnTrackAlmostExhausted
    • NodeExporterCollectorFailed
    • NodeExporterDeviceError
    • NodeExporterDown
    • NodeExporterMissing
    • NodeHasConstantOOMKills
    • NodeStateFlappingUnderLoad
    • OperatorNotReconcilingLudacris
    • OperatorkitErrorRateTooHighLudacris
    • PersistentVolumeSpaceTooLow for both Ludacris and Biscuit
    • ReleaseNotReady
    • RootVolumeSpaceTooLow for both Ludacris and Biscuit
    • SYNRetransmissionRateTooHigh
    • ServiceLevelBurnRateTooHigh
    • WorkloadClusterAPIServerAdmissionWebhookErrors
    • WorkloadClusterAPIServerLatencyTooHigh
    • WorkloadClusterCriticalSystemdUnitFailed
    • WorkloadClusterDaemonSetNotSatisfiedLudacris
    • WorkloadClusterDisabledSystemdUnitActive
    • WorkloadClusterHighNumberSystemdUnits
    • WorkloadClusterNetExporterCPUUsageTooHigh
    • WorkloadClusterSystemdUnitFailed
    • WorkloadClusterWebhookDurationExceedsTimeout

Changed

  • Migrate and rename EBSVolumeMountErrors to ManagementClusterEBSVolumeMountErrors and WorkloadClusterEBSVolumeMountErrors

Removed

  • Removing legacy finalizers resource used to remove old custom resource finalizers

1.22.0 - 2021-02-16

Changed

  • Improved inhibition alert InhibitionClusterStatusUpdating to inhibit alerts 10 minutes after the update has finished to avoid unecessery pages.

1.21.0 - 2021-02-16

Changed

  • Split ManagementClusterAppFailed per team

Added

  • Add the solution engineer alerts:
    • AzureQuotaUsageApproachingLimit
    • NATGatewaysPerVPCApproachingLimit
    • ServiceUsageApproachingLimit

1.20.0 - 2021-02-16

Added

  • Add the rocket alerts:
    • BackendServerUP
    • ClockOutOfSyncKVM
    • CollidingOperatorsRocket
    • DNSCheckErrorRateTooHighKVM
    • DNSErrorRateTooHighKVM
    • EtcdWorkloadClusterDownKVM
    • IngressExporterDown
    • KVMManagementClusterDeploymentScaledDownToZero
    • KVMNetworkErrorRateTooHigh
    • ManagementClusterCriticalPodMetricMissingKVM
    • ManagementClusterCriticalPodNotRunningKVM
    • ManagementClusterMasterNodeMissingRocket
    • ManagementClusterPodLimitAlmostReachedKVM
    • ManagementClusterPodPendingFor15Min
    • MayuSystemdUnitIsNotRunning
    • NetworkInterfaceLeftoverWithoutCluster
    • OnpremManagementClusterMissingNodes
    • OperatorNotReconcilingRocket
    • OperatorkitCRNotDeletedRocket
    • OperatorkitErrorRateTooHighRocket
    • WorkloadClusterCriticalPodMetricMissingKVM
    • WorkloadClusterCriticalPodNotRunningKVM
    • WorkloadClusterEndpointIPDown
    • WorkloadClusterEtcdCommitDurationTooHighKVM
    • WorkloadClusterEtcdDBSizeTooLargeKVM
    • WorkloadClusterEtcdHasNoLeaderKVM
    • WorkloadClusterEtcdNumberOfLeaderChangesTooHighKVM
    • WorkloadClusterMasterNodeMissingRocket
    • WorkloadClusterPodLimitAlmostReachedKVM
  • Added the firecracker rules to PMO:
    • AWSClusterCreationFailed
    • AWSClusterUpdateFailed
    • AWSManagementClusterDeploymentScaledDownToZero
    • AWSManagementClusterMissingNodes
    • AWSNetworkErrorRateTooHigh
    • ClockOutOfSyncAWS
    • CloudFormationStackFailed
    • CloudFormationStackRollback
    • ClusterAutoscalerAppFailedAWS
    • ClusterAutoscalerAppNotInstalledAWS
    • ClusterAutoscalerAppPendingInstallAWS
    • ClusterAutoscalerAppPendingUpgradeAWS
    • CollidingOperatorsFirecracker
    • ContainerIsRestartingTooFrequentlyFirecracker
    • CredentialdCantReachKubernetes
    • DNSCheckErrorRateTooHighAWS
    • DNSErrorRateTooHighAWS
    • DefaultCredentialsMissing
    • DeploymentNotSatisfiedChinaFirecracker
    • DeploymentNotSatisfiedFirecracker
    • ELBHostsOutOfService
    • EtcdWorkloadClusterDownAWS
    • FluentdMemoryHighUtilization
    • JobHasNotBeenScheduledForTooLong
    • KiamMetadataFindRoleErrors
    • ManagementClusterDaemonSetNotSatisfiedChinaFirecracker
    • ManagementClusterDaemonSetNotSatisfiedFirecracker
    • OperatorNotReconcilingFirecracker
    • OperatorkitCRNotDeletedFirecracker
    • OperatorkitErrorRateTooHighFirecracker
    • TooManyCredentialsForOrganization
    • TrustedAdvisorErroring
    • WorkloadClusterCriticalPodNotRunningAWS
    • WorkloadClusterCriticalPodMetricMissingAWS
    • WorkloadClusterDaemonSetNotSatisfiedFirecracker
    • WorkloadClusterEtcdCommitDurationTooHighAWS
    • WorkloadClusterEtcdDBSizeTooLargeAWS
    • WorkloadClusterEtcdHasNoLeaderAWS
    • WorkloadClusterEtcdNumberOfLeaderChangesTooHighAWS
    • WorkloadClusterMasterNodeMissingFirecracker
    • WorkloadClusterPodLimitAlmostReachedAWS
  • Splitting NodeIsUnschedulable per team
  • Split ContainerIsRestartingTooFrequentlyFirecracker into WorkloadClusterContainerIsRestartingTooFrequentlyFirecracker and ManagementClusterContainerIsRestartingTooFrequentlyFirecracker
  • Add the following biscuit alerts to split alerts between workload and management cluster:
    • ManagementClusterCriticalPodNotRunning
    • ManagementClusterCriticalPodMetricMissing
    • ManagementClusterPodLimitAlmostReached

Changed

  • Move AzureManagementClusterMissingNodes and AWSManagementClusterMissingNodes to team biscuit ManagementClusterMissingNodes
  • Move ManagementClusterPodStuckAzure and ManagementClusterPodStuckAWS to team biscuit ManagementClusterPodPendingFor15Min
  • Renamed the following alerts:
    • AzureClusterAutoscalerIsRestartingFrequently -> WorkloadClusterAutoscalerIsRestartingFrequentlyAzure
    • CriticalPodNotRunningAzure -> WorkloadClusterCriticalPodNotRunningAzure
    • CriticalPodMetricMissingAzure -> WorkloadClusterCriticalPodMetricMissingAzure
    • MasterNodeMissingCelestial -> WorkloadClusterMasterNodeMissingCelestial
    • NodeUnexpectedTaintNodeWithImpairedVolumes -> WorkloadClusterNodeUnexpectedTaintNodeWithImpairedVolumes
    • PodLimitAlmostReachedAzure -> WorkloadClusterPodLimitAlmostReachedAzure

Fixed

  • Do not page biscuit for a failing prometheus

1.19.2 - 2021-02-12

Fixed

  • Fix incorrect prometheus memory usage recording rules after we migrated to the new monitoring setup

Changed

  • Use azure-collector instead of azure-operator in AzureClusterCreationFailed alert

Removed

  • Removing service monitor resource used to clean up unused service monitor CR

1.19.1 - 2021-02-10

Fixed

  • Fix empty prometheus rules in helm template issues for aws and kvm installations

1.19.0 - 2021-02-10

Added

  • Added the celestial rules to PMO:
    • AzureClusterAutoscalerIsRestartingFrequently
    • AzureClusterCreationFailed
    • AzureDeploymentIsRunningForTooLong
    • AzureDeploymentStatusFailed
    • AzureManagementClusterDeploymentScaledDownToZero
    • AzureManagementClusterMissingNodes
    • AzureNetworkErrorRateTooHigh
    • AzureServicePrincipalExpirationDateUnknown
    • AzureServicePrincipalExpiresInOneMonth
    • AzureServicePrincipalExpiresInOneWeek
    • AzureVMSSRateLimit30MinutesAlmostReached
    • AzureVMSSRateLimit30MinutesReached
    • AzureVMSSRateLimit3MinutesAlmostReached
    • AzureVMSSRateLimit3MinutesReached
    • ClockOutOfSyncAzure
    • ClusterAutoscalerAppFailedAzure
    • ClusterAutoscalerAppNotInstalledAzure
    • ClusterAutoscalerAppPendingInstallAzure
    • ClusterAutoscalerAppPendingUpgradeAzure
    • ClusterWithNoResourceGroup
    • CollidingOperatorsCelestial
    • CriticalPodMetricMissingAzure
    • CriticalPodNotRunningAzure
    • DNSCheckErrorRateTooHighAzure
    • DNSErrorRateTooHighAzure
    • DeploymentNotSatisfiedCelestial
    • EtcdWorkloadClusterDownAzure
    • LatestETCDBackup1DayOld
    • LatestETCDBackup2DaysOld
    • ManagementClusterNotBackedUp24h
    • MasterNodeMissingCelestial
    • OperatorNotReconcilingCelestial
    • OperatorkitCRNotDeletedCelestial
    • OperatorkitErrorRateTooHighCelestial
    • PodLimitAlmostReachedAzure
    • ManagementClusterPodStuckAzure (renamed from PodStuckAzure)
    • ReadsRateLimitAlmostReached
    • VPNConnectionProvisioningStateBad
    • VPNConnectionStatusBad
    • WorkloadClusterEtcdCommitDurationTooHighAzure
    • WorkloadClusterEtcdDBSizeTooLargeAzure
    • WorkloadClusterEtcdHasNoLeaderAzure
    • WorkloadClusterEtcdNumberOfLeaderChangesTooHighAzure
    • WritesRateLimitAlmostReached
    • ETCDBackupJobFailedOrStuck (renamed from BackupJobFailedOrStuck)
  • Added node role label to kubelet metrics as it's needed by MasterNodeMissingCelestial alert

Removed

  • Removed axolotl from Chinese rules as the installation has been decommissioned

1.18.0 - 2021-02-08

Removed

  • Added the batman alerts to PMO:
    • AppExporterDown
    • AppOperatorNotReady
    • AppWithoutTeamLabel
    • CertManagerPodHighMemoryUsage
    • CertificateSecretWillExpireInLessThanTwoWeeks
    • ChartOperatorDown
    • ChartOrphanConfigMap
    • ChartOrphanSecret
    • CollidingOperatorsBatman
    • CordonedAppExpired
    • DeploymentNotSatisfiedBatman
    • DeploymentNotSatisfiedChinaBatman
    • ElasticsearchClusterHealthStatusRed
    • ElasticsearchClusterHealthStatusYellow
    • ElasticsearchDataVolumeSpaceTooLow
    • ElasticsearchHeapUsageWarning
    • ElasticsearchPendingTasksTooHigh
    • ExternalDNSCantAccessRegistry
    • ExternalDNSCantAccessSource
    • HelmHistorySecretCountTooHigh
    • IngressControllerDeploymentNotSatisfied
    • IngressControllerMemoryUsageTooHigh
    • IngressControllerReplicaSetNumberTooHigh
    • IngressControllerSSLCertificateWillExpireSoon
    • IngressControllerServiceHasNoEndpoints
    • ManagedAppBasicErrorBudgetBurnRateAboveSafeLevel
    • ManagedAppBasicErrorBudgetBurnRateInLast10mTooHigh
    • ManagedAppBasicErrorBudgetEstimationWarning
    • ManagedLoggingElasticsearchClusterDown
    • ManagedLoggingElasticsearchDataNodesNotSatisfied
    • ManagementClusterAppFailed
    • OperatorNotReconcilingBatman
    • OperatorkitErrorRateTooHighBatman
    • RepeatedHelmOperation
    • TillerHistoryConfigMapCountTooHigh
    • TillerRunningPods
    • TillerUnreachable
    • WorkloadClusterAppFailed
    • WorkloadClusterDeploymentNotSatisfied
    • WorkloadClusterDeploymentScaledDownToZero
    • WorkloadClusterManagedDeploymentNotSatisfied

1.17.2 - 2021-02-04

Changed

  • (internal) Rely on Ingress for OAuth2 proxy to configure TLS for Prometheus domain, as it also configures management of the certificates, instead of creating copies which could break access in case they became out of date.

Fixed

  • Fix incorrect prometheus memory usage recording rule

1.17.1 - 2021-02-02

Fixed

  • Fixed incorrect label in GatekeeperDown alert.

1.17.0 - 2021-02-02

Added

  • Added the NoHealthyJumphost alert
  • Added the biscuit alerts to PMO:
    • AppCollectionDeploymentFailed
    • CalicoNodeMemoryHighUtilization
    • CrsyncDeploymentNotSatisfied
    • CrsyncTooManyTagsMissing
    • DeploymentNotSatisfiedBiscuit
    • DeploymentNotSatisfiedChinaBiscuit
    • DraughtsmanRateLimitAlmostReached
    • EtcdDown
    • GatekeeperDown
    • GatekeeperWebhookMissing
    • KeyPairStorageAlmostFull
    • ManagementClusterHasLessThanThreeNodes
    • ManagementClusterCriticalSystemdUnitFailed
    • ManagementClusterDisabledSystemdUnitActive
    • ManagementClusterEtcdCommitDurationTooHigh
    • ManagementClusterEtcdDBSizeTooLarge
    • ManagementClusterEtcdHasNoLeader
    • ManagementClusterEtcdNumberOfLeaderChangesTooHigh
    • ManagementClusterHighNumberSystemdUnits
    • ManagementClusterPodPending
    • ManagementClusterSystemdUnitFailed
    • VaultIsDown
    • VaultIsSealed

Changed

  • Renamed control plane and tenant cluster respectively to management cluster and workload cluster. Renamed some alerts:
    • ControlPlaneCertificateWillExpireInLessThanTwoWeeks > ManagementClusterCertificateWillExpireInLessThanTwoWeeks
    • ControlPlaneDaemonSetNotSatisfiedAtlas > ManagementClusterDaemonSetNotSatisfiedAtlas
    • ControlPlaneDaemonSetNotSatisfiedChinaAtlas > ManagementClusterDaemonSetNotSatisfiedChinaAtlas
    • PrometheusCantCommunicateWithTenantAPI > PrometheusCantCommunicateWithKubernetesAPI
  • Rename ETCDDown alert to ManagementClusterEtcdDown
  • Enable alerts only on the corresponding providers

Fixed

  • Fix missing app label on kube-apiserver target
  • Fix missing app label on nginx-ingress-controller target

1.16.1 - 2021-01-28

Fixed

  • Fix recording rules to apply them to all prometheuses

1.16.0 - 2021-01-28

Changed

  • Reenable Remote Write to Cortex

Added

  • Trigger final heartbeat before deleting the cluster to clean up opened heartbeat alerts

Removed

  • Remove webhook from AlertManagerNotificationsFailing alert.

1.15.0 - 2021-01-22

Changed

  • Use giantswarm/prometheus image

Fixed

  • Fix recording rules creation
  • Fix prometheus container image tag to not use latest
  • Fix prometheus minimal memory in VPA

1.14.0 - 2021-01-13

Added

  • Add inhibition rules.
  • Set Prometheus pod max memory usage (via vpa) to 90% of lowest node allocatable memory
  • Prometheus monitors itself

Changed

  • Ignore missing unhealthy prometheus instances in promxy to avoid it from crash looping
  • Added the biscuit alerts to PMO:
    • ControlPlaneCertificateWillExpireInLessThanTwoWeeks
  • Add topologySpreadConstraint to evenly spread prometheus pods
  • Ignore Slack in AlertManagerNotificationsFailing alert.
  • Set heartbeat alert to up for 10mn

Removed

  • Removed g8s-prometheus target
  • Removed alert resource

1.13.0 - 2021-01-05

Added

  • Add priority class prometheus and use it for all managed Prometheus pods in order to allow scheduler to evict other pods with lower priority to make space for Prometheus

1.12.0 - 2020-12-02

Changed

  • Change PrometheusCantCommunicateWithTenantAPI to ignore promxy
  • Set prometheus default resources to 100m of CPU and 1Gi of memory
  • Reduced number of metrics ingested from nginx-ingress-controller in order to reduce memory requirements of Prometheus.

1.11.0 - 2020-12-01

Added

  • Create VerticalPodAutoscaler resource for each Prometheus configuring the VPA to manage Prometheus pod requests and limits to allow dynamic scaling but prevent scheduling and OOM issues.

Changed

  • Change prometheus affinity from "Prefer" to "Required".

1.10.3 - 2020-11-25

Fixed

  • Fix initial heartbeat ping so that it only triggers on creation.

1.10.2 - 2020-11-25

Fixed

  • Set prometheus cpu requests and limits to 0.25 CPU.

1.10.1 - 2020-11-24

Fixed

  • Set prometheus cpu requests and limits to 1 CPU.
  • Set prometheus memory requests and limits to 5Gi.

1.10.0 - 2020-11-20

Added

  • Add team atlas alerts (in helm chart).

Changed

  • Set heartbeat client log level to fatal to avoid polluting our logs.
  • Set prometheus to select rules from monitoring namespace.

Removed

  • Set alert resource to delete PrometheusRules in cluster namespace.

Fixed

  • Fix prometheus targets.
  • Fix duplicated scrapping of nginx-ingress-controller.

1.9.0 - 2020-11-11

Added

  • Add support for Remote Write to Cortex
  • Added recording rules
  • Add node affinity to prefer not scheduling on master nodes
  • Added pipeline tag to Hearbeat alert to be able to see if it affects a stable or testing installation at first glance

Changed

  • Increase memory request from 100Mi to 5Gi

Fixed

  • Fix kube-state-metrics scraping port on Control Planes.
  • Fixed creating of alerts, it was failing due to a typo in template path

1.8.0 - 2020-10-21

Added

  • Add pod, container, node and node role labels
  • Allow ignoring clusters using the giantswarm.io/monitoring: false label on cluster CRs
  • Add monitoring of control plane bastions
  • Add heartbeat alert to prometheus
  • Create heartbeat in opsgenie
  • Route heartbeat alerts to corresponding opsgenie heartbeat

1.7.0 - 2020-10-14

Added

  • Add alertmanager config

Fixed

  • Fix a bug where promxy configmap keep growing and lead to OutOfMemory issues.
  • Fix an issue where prometheus fails to be created due to resource order.

1.6.0 - 2020-10-12

Changed

  • Set retention size to 90Gi and duration to 2w
  • Increased storage to 100Gi

1.5.1 - 2020-10-07

Fixed

  • Fix promxy config marshaling
  • Fix promxy config not being updated

1.5.0 - 2020-10-07

Added

  • Support for managing Promxy configuration

Removed

  • Old namespace deleter resource

1.4.0 - 2020-09-25

Added

  • Add oauth ingress
  • Add tls certificate for ingress
  • Add ingress for individual prometheuses

1.3.0 - 2020-09-24

Added

  • Scraping of tenant cluster prometheus
  • Scraping of control plane prometheus
  • Add installation label
  • Add labelling schema alert

Changed

  • Set honor labels to true
  • Change control plane namespace to reflect the installation name instead of 'kubernetes'

1.2.0 - 2020-09-03

Added

  • Add monitoring label
  • Add etcd target for control planes
  • Add vault target
  • Add gatekeeper target
  • Add managed-app target
  • Add cert-operator target
  • Add bridge-operator target
  • Add flannel-operator target
  • Add ingress-exporter target
  • Add coreDNS target
  • Add azure-collector target

Removed

  • frontend, ingress, and service resources.

Fixed

  • prevented data loss in Cluster resources by always using the correct version of the type as configured in CRDs storage version (#101)
  • avoids trying to read dependant objects from the cluster when processing deletion, as they may be gone already and errors here were disrupting cleanup and preventing the finalizer from being removed (#115)

1.1.0 - 2020-08-27

Added

  • Scraping of the control plane operators
    • aws-operator
    • azure-operator
    • kvm-operator
    • app-operator
    • chart-operator
    • cluster-operator
    • etcd-backup-operator
    • node-operator
    • release-operator
    • organization-operator
    • prometheus-meta-operator
    • rbac-operator
    • draughtsman
  • Scraping of the monitoring targets
    • app-exporter
    • cert-exporter
    • vault-exporter
    • node-exporter
    • net-exporter
    • kube-state-metrics
    • alertmanager
    • grafana
    • prometheus
    • prometheus-config-controller
    • fluentbit
  • Scraping of the control plane apis
    • tokend
    • companyd
    • userd
    • api
    • kubernetesd
    • credentiald
    • cluster-service
  • New control-plane controller, reconciling kubernetes api service (#92)

1.0.1 - 2020-08-25

Changed

  • Rename controller name and finalizers

1.0.0 - 2020-08-20

Added

  • Scraping of kube-proxy (#88)
  • Scraping of kube-scheduler (#87)
  • Scraping of kube-controller-manager (#85)
  • Scraping of etcd (#81)
  • Scraping of kubelet (#82)
  • Scraping of legacy docker, calico-node, cluster-autoscaler, aws-node and cadvisor (#78)

Changed

  • Moved prometheus storage from emptyDir to a persistentVolumeClaim
  • Remove tenant cluster prometheus limits
  • Updated backward incompatible Kubernetes dependencies to v1.18.5.

0.3.2 - 2020-07-24

Changed

  • Set TC prometheus memory limit to 1Gi (#73)

0.3.1 - 2020-07-17

Changed

  • Set TC prometheus memory limit to 200Mi

0.3.0 - 2020-07-15

Changed

  • Scale prometheus-meta-operator replicas back to one.

Added

  • Set prometheus request/limits (cpu: 100m, memory: 100Mi)

0.2.1 - 2020-07-01

Fixed

  • Fixed release process

0.2.0 - 2020-06-29

Added

  • Add service monitor for nginx-ingress-controller
  • Reconcile CAPI (Cluster) and legacy cluster CRs (AWSConfig, AzureConfig, KVMConfig)

Changed

  • Reduced prometheus server replicas to one (#45)
  • Reduced default prometheus-meta-operator replicas to zero as having both this and previous (g8s-prometheus) solutions on at the same time is overloading some control planes

Removed

  • Removed cortex frontend as it's an optimisation that's not currently needed
  • Removed service and ingress resources as they are no longer needed (they were used for the cortex frontend)

Fixed

  • Fix an error during alert update: metadata.resourceVersion: Invalid value

0.1.1 - 2020-05-27

Added

  • Change chart namespace from giantswarm to monitoring

0.1.0 - 2020-05-27

Added

  • First release.