[2.4.2] [Bug] Monitoring is not deployed correctly on new cluster creation. #26440

Closed
mitchellmaler opened this issue Apr 2, 2020 · 14 comments

mitchellmaler commented Apr 2, 2020

What kind of request is this (question/bug/enhancement/feature request):
Bug

Steps to reproduce (fewest steps possible):
Create a new cluster using the Terraform provider (might not be relevant) with monitoring enabled.
Result:
The monitoring apps do not deploy correctly. Looking in the cluster's System Apps, both apps show this error:

Failed to install app cluster-monitoring. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Other details that may be helpful:
Here are the Rancher logs. It might be that the apps are created before the CRDs are installed by the agent? If I kick them off again by forcing an upgrade they install fine, so it seems like a timing issue. The monitoring-operator app does seem to get kicked off again automatically and deploys correctly; I'm not sure whether cluster-monitoring will.

rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] Create app /
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] Create app /
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] clusterHandler: calling sync to create network policies for cluster c-4b2p4
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] cluster [c-4b2p4] worker-upgrade: updating node [m-qj99n] with node-version 1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] cluster [c-4b2p4] worker-upgrade: sending node-version for node [] version 1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] clusterHandler: calling sync to create network policies for cluster c-4b2p4
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:55 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:55 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:55 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:05 [INFO] Creating token for user u-4k3od3dcdk
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:05 [INFO] Installing chart using helm version: rancher-helm
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 Starting Tiller v2.16.3-rancher1 (tls=false)
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 GRPC listening on :49793
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 Probes listening on :45634
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 Storage driver is ConfigMap
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 Max history per release is 10
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:06 getting history for release cluster-monitoring
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:06 getting release history for "cluster-monitoring"
rancher-687cf6dc6b-2m2pq rancher Release "cluster-monitoring" does not exist. Installing it now.
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:07 preparing install for cluster-monitoring
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:07 getting release history for "cluster-monitoring"
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:07 failed install prepare step: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:07 [ERROR] AppController p-bz9st/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:13 [INFO] Installing chart using helm version: rancher-helm
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 Starting Tiller v2.16.3-rancher1 (tls=false)
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 GRPC listening on :56380
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 Probes listening on :52310
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 Storage driver is ConfigMap
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 Max history per release is 10
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:14 getting history for release monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:14 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher Release "monitoring-operator" does not exist. Installing it now.
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:14 preparing install for monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:14 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:15 failed install prepare step: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:15 [ERROR] AppController p-bz9st/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:25 [INFO] [etcd-backup] Cluster [c-4b2p4] has no backups, creating first backup
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:25 [INFO] [etcd-backup] Cluster [c-4b2p4] new backup is created: c-4b2p4-rl-4zvns
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:25 [INFO] Installing chart using helm version: rancher-helm
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 Starting Tiller v2.16.3-rancher1 (tls=false)
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 GRPC listening on :34531
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 Probes listening on :54346
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 Storage driver is ConfigMap
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 Max history per release is 10
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:27 getting history for release monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:27 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher Release "monitoring-operator" does not exist. Installing it now.
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:27 preparing install for monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:27 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:28 failed install prepare step: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:28 [ERROR] AppController p-bz9st/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:33813
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] Starting saving snapshot on etcd hosts
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] [dialer] Setup tunnel for host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] [dialer] Setup tunnel for host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] [dialer] Setup tunnel for host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] [etcd] Running snapshot save once on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] Image [rancher/rke-tools:v0.1.56] exists on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] Starting container [etcd-snapshot-once] on host [10.183.44.10], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:31 [INFO] [etcd] Successfully started [etcd-snapshot-once] container on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:31 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:31 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:32 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:32 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] Removing container [etcd-snapshot-once] on host [10.183.44.10], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] [etcd] Running snapshot save once on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] Image [rancher/rke-tools:v0.1.56] exists on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] Starting container [etcd-snapshot-once] on host [10.183.44.6], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:34 [INFO] [etcd] Successfully started [etcd-snapshot-once] container on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:34 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:34 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:35 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:35 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] Removing container [etcd-snapshot-once] on host [10.183.44.6], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] [etcd] Running snapshot save once on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] Image [rancher/rke-tools:v0.1.56] exists on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] Starting container [etcd-snapshot-once] on host [10.183.44.9], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:37 [INFO] [etcd] Successfully started [etcd-snapshot-once] container on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:37 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:37 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:38 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:38 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:38 [INFO] Installing chart using helm version: rancher-helm
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:39 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:39 [INFO] Removing container [etcd-snapshot-once] on host [10.183.44.9], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:39 [INFO] Finished saving/uploading snapshot [c-4b2p4-rl-4zvns_2020-04-02T04:34:25Z] on all etcd hosts
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:39 [INFO] kontainerdriver rancherkubernetesengine stopped
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 Starting Tiller v2.16.3-rancher1 (tls=false)
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 GRPC listening on :51699
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 Probes listening on :39375
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 Storage driver is ConfigMap
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 Max history per release is 10
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:40 getting history for release monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:40 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher Release "monitoring-operator" does not exist. Installing it now.
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:40 preparing install for monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:40 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:40 failed install prepare step: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:40 [ERROR] AppController p-bz9st/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:00 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:13 [ERROR] failed on subscribe prometheus: NotFound 404: the server could not find the requested resource (get prometheuses.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:13 [ERROR] failed on subscribe prometheusRule: NotFound 404: the server could not find the requested resource (get prometheusrules.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:13 [ERROR] failed on subscribe alertmanager: NotFound 404: the server could not find the requested resource (get alertmanagers.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:13 [ERROR] failed on subscribe serviceMonitor: NotFound 404: the server could not find the requested resource (get servicemonitors.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:25 [ERROR] failed on subscribe prometheus: NotFound 404: the server could not find the requested resource (get prometheuses.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:25 [ERROR] failed on subscribe serviceMonitor: NotFound 404: the server could not find the requested resource (get servicemonitors.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:25 [ERROR] failed on subscribe prometheusRule: NotFound 404: the server could not find the requested resource (get prometheusrules.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:25 [ERROR] failed on subscribe alertmanager: NotFound 404: the server could not find the requested resource (get alertmanagers.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:43 [ERROR] failed on subscribe serviceMonitor: NotFound 404: the server could not find the requested resource (get servicemonitors.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:43 [ERROR] failed on subscribe alertmanager: NotFound 404: the server could not find the requested resource (get alertmanagers.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:43 [ERROR] failed on subscribe prometheusRule: NotFound 404: the server could not find the requested resource (get prometheusrules.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:43 [ERROR] failed on subscribe prometheus: NotFound 404: the server could not find the requested resource (get prometheuses.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher W0402 04:37:40.683003       6 reflector.go:326] github.com/rancher/steve/pkg/clustercache/controller.go:187: watch of *summary.SummarizedObject ended with: unexpected object: &{{{Status v1} {      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []}} }

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.4.2
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): vSphere
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM
  • Kubernetes version (use kubectl version): 1.17

nickvth commented Apr 2, 2020

same here

@stefanvangastel

Having the same issues


jiaqiluo commented Apr 2, 2020

The bug is reproduced in a v2.4.2 single install when adding a cluster with cluster monitoring enabled by editing the cluster as a YAML file.

Rancher logs

2020/04/02 16:56:07 [ERROR] AppController p-hf8zj/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

2020/04/02 16:56:07 [ERROR] AppController p-hf8zj/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

[screenshots attached]

Workaround:
The workaround is to either force-upgrade the apps or to re-enable cluster monitoring.

More info:
Cluster monitoring deploys successfully if it is enabled after the cluster is active, rather than as part of provisioning the cluster. For reference, enabling it at provisioning time means the cluster config (edited as YAML) carries the flag shown in the sketch below.
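
A minimal sketch of that cluster YAML; every other field of the real config is omitted:

# Cluster config edited as YAML (minimal sketch; all other fields omitted).
# This flag is what deploys the monitoring-operator and cluster-monitoring
# system apps while the cluster is still provisioning.
enable_cluster_monitoring: true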

@jiaqiluo jiaqiluo added this to the v2.4.3 milestone Apr 2, 2020
@jiaqiluo jiaqiluo added the area/monitoring and kind/bug labels Apr 2, 2020
@jiaqiluo jiaqiluo removed this from the v2.4.3 milestone Apr 2, 2020

soumyalj commented Apr 2, 2020

The issue is also reproduced when creating a cluster using a cluster template with monitoring enabled. [While creating the cluster template, enable monitoring by editing the YAML and setting enable_cluster_monitoring: true.] Monitoring fails to come up with the below error in Apps:

Failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

[screenshot attached]

@hugodopradofernandes

This seems to be an issue with metrics-server that impacts Helm:

helm/helm#6361
Deleting the metrics-server APIService works around the Helm issue, but then Prometheus won't work because metrics-server is left without its API (I tried it; don't do that).

So we need to solve the issue of the metrics-server APIService reporting False (FailedDiscoveryCheck):

> kubectl get apiservice
NAME                                   SERVICE                      AVAILABLE                      AGE
v1.                                    Local                        True                           29m
v1.admissionregistration.k8s.io        Local                        True                           29m
v1.apiextensions.k8s.io                Local                        True                           29m
v1.apps                                Local                        True                           29m
v1.authentication.k8s.io               Local                        True                           29m
v1.authorization.k8s.io                Local                        True                           29m
v1.autoscaling                         Local                        True                           29m
v1.batch                               Local                        True                           29m
v1.coordination.k8s.io                 Local                        True                           29m
v1.crd.projectcalico.org               Local                        True                           28m
v1.monitoring.coreos.com               Local                        True                           25m
v1.networking.k8s.io                   Local                        True                           29m
v1.rbac.authorization.k8s.io           Local                        True                           29m
v1.scheduling.k8s.io                   Local                        True                           29m
v1.storage.k8s.io                      Local                        True                           29m
v1beta1.admissionregistration.k8s.io   Local                        True                           29m
v1beta1.apiextensions.k8s.io           Local                        True                           29m
v1beta1.authentication.k8s.io          Local                        True                           29m
v1beta1.authorization.k8s.io           Local                        True                           29m
v1beta1.batch                          Local                        True                           29m
v1beta1.certificates.k8s.io            Local                        True                           29m
v1beta1.coordination.k8s.io            Local                        True                           29m
v1beta1.discovery.k8s.io               Local                        True                           29m
v1beta1.events.k8s.io                  Local                        True                           29m
v1beta1.extensions                     Local                        True                           29m
v1beta1.metrics.k8s.io                 kube-system/metrics-server   False (FailedDiscoveryCheck)   28m
v1beta1.networking.k8s.io              Local                        True                           29m
v1beta1.node.k8s.io                    Local                        True                           29m
v1beta1.policy                         Local                        True                           29m
v1beta1.rbac.authorization.k8s.io      Local                        True                           29m
v1beta1.scheduling.k8s.io              Local                        True                           29m
v1beta1.storage.k8s.io                 Local                        True                           29m
v2beta1.autoscaling                    Local                        True                           29m
v2beta2.autoscaling                    Local                        True                           29m
v3.cluster.cattle.io                   Local                        True                           27m


mrajashree commented Apr 8, 2020

The Helm bug exists in 2.16 through 2.16.3, and the fix for the upstream Helm bug landed in Helm 2.16.5.
k8s client-go returns an error when an API service is registered but unimplemented, but since the discovery client continues building the API object, it is still correctly populated with all valid APIs, as per the upstream PR.
The monitoring-operator app already has logic in Rancher to force-deploy it if its workloads don't exist, which is why monitoring-operator recovered but cluster-monitoring did not; the Rancher PR will add the same force-redeploy logic to cluster-monitoring.
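
For context, here is a minimal Go sketch (not the actual Helm or Rancher code) of the tolerance pattern the upstream fix relies on: client-go reports a partial discovery failure for the broken aggregated API but still returns every group and resource it could discover, so the caller can log the failure and continue instead of aborting the install.

package main

import (
    "fmt"
    "log"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client config from the local kubeconfig (sketch assumption).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    groups, resources, err := dc.ServerGroupsAndResources()
    if err != nil {
        if discovery.IsGroupDiscoveryFailedError(err) {
            // One or more aggregated APIs (e.g. metrics.k8s.io/v1beta1) failed
            // discovery; groups/resources are still populated with everything
            // that did respond, so keep going rather than failing the install.
            log.Printf("warning: partial discovery failure: %v", err)
        } else {
            log.Fatal(err)
        }
    }
    fmt.Printf("discovered %d API groups and %d resource lists\n", len(groups), len(resources))
}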


jiaqiluo commented Apr 17, 2020

The bug fix is validated in v2.4-head 63a490f and master-head 00ff159

Steps:

  • run Rancher single install
  • add a cluster with cluster monitoring enabled by editing the cluster as YAML file

Results:

  • the apps are deployed successfully
  • In v2.4-head it takes quite a long time (10-15 mins) for metrics to show up in the UI after the apps are active. While waiting, the cluster page shows Monitoring API is not ready

Update:
In another attempt on the same v2.4-head setup, it took about 3 min 30 s for metrics to show up after the apps were active.


mrajashree commented Apr 17, 2020

@jiaqiluo how much time does it take for metrics to show up if monitoring is enabled after the cluster is active? Is the time similar for a 2.3 setup if you enable monitoring during cluster create?

@jiaqiluo

> @jiaqiluo how much time does it take for metrics to show up if monitoring is enabled after the cluster is active? Is the time similar for a 2.3 setup if you enable monitoring during cluster create?

@mrajashree it usually takes about 3 to 5 minutes for the apps to become active, then another 3 to 5 minutes for the metrics to show up.


soumyalj commented Apr 17, 2020

Validated the fix on master-head (00ff159) and v2.4-head (cf5ab1d).
Steps:

  1. Create a cluster using a cluster template with monitoring enabled.
    The monitoring app was deployed successfully and metrics were also displayed for nodes and workloads.

In both master-head and v2.4-head it took around 4 minutes for the monitoring app to be active and around 5 minutes for the metrics to show up after the app was active.

@jiaqiluo

@mrajashree

Here is the difference I noticed:

When cluster monitoring is enabled in the provisioning configuration, the UI keeps showing the Monitoring API is not ready message after the apps are active, and it takes another 3 to 5 minutes for the message to disappear and the metrics section to show up.

Whereas if cluster monitoring is enabled after the cluster is active, the Monitoring API is not ready message disappears as soon as the apps are active, and the metrics section shows up.

@mrajashree

@jiaqiluo If this is the same behavior you see on a 2.3 setup, then it's fine.


mrajashree commented Apr 17, 2020

Why is this reopened? Monitoring does work. If the delay is not the same for a 2.3 setup, we can open a new issue, but the monitoring app no longer fails to deploy.

@jiaqiluo

This bug is confirmed to be fixed.

Different behavior is observed in v2.3.6, so a new issue was opened to track it: #26692
