[2.4.2] [Bug] Monitoring is not deployed correctly on new cluster creation. #26440

Closed
mitchellmaler opened this issue Apr 2, 2020 · 14 comments

mitchellmaler commented Apr 2, 2020

What kind of request is this (question/bug/enhancement/feature request):
Bug

Steps to reproduce (fewest steps possible):
Create a new cluster using the Terraform provider (might not be relevant) with monitoring enabled.
Result:
The monitoring apps do not deploy correctly. Looking in the cluster's System Apps, both apps show this error:

Failed to install app cluster-monitoring. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Other details that may be helpful:
Here are the Rancher logs. It might be that the apps are created before the CRDs are installed by the agent? If I kick them off again by forcing an upgrade they install fine, so it seems like a timing issue. The monitoring-operator app does seem to get kicked off again automatically and deploys correctly; I'm not sure whether cluster-monitoring will.

rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] Create app /
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] Create app /
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] clusterHandler: calling sync to create network policies for cluster c-4b2p4
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] cluster [c-4b2p4] worker-upgrade: updating node [m-qj99n] with node-version 1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] cluster [c-4b2p4] worker-upgrade: sending node-version for node [] version 1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [INFO] clusterHandler: calling sync to create network policies for cluster c-4b2p4
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:54 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:55 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:55 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:33:55 [ERROR] ClusterController c-4b2p4 [cluster-monitoring-handler] failed with : failed to get cattle-prometheus/prometheus-operated endpoints: endpoints "cattle-prometheus/prometheus-operated" not found
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:05 [INFO] Creating token for user u-4k3od3dcdk
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:05 [INFO] Installing chart using helm version: rancher-helm
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 Starting Tiller v2.16.3-rancher1 (tls=false)
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 GRPC listening on :49793
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 Probes listening on :45634
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 Storage driver is ConfigMap
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:06 Max history per release is 10
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:06 getting history for release cluster-monitoring
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:06 getting release history for "cluster-monitoring"
rancher-687cf6dc6b-2m2pq rancher Release "cluster-monitoring" does not exist. Installing it now.
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:07 preparing install for cluster-monitoring
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:07 getting release history for "cluster-monitoring"
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:07 failed install prepare step: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:07 [ERROR] AppController p-bz9st/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:13 [INFO] Installing chart using helm version: rancher-helm
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 Starting Tiller v2.16.3-rancher1 (tls=false)
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 GRPC listening on :56380
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 Probes listening on :52310
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 Storage driver is ConfigMap
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:13 Max history per release is 10
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:14 getting history for release monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:14 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher Release "monitoring-operator" does not exist. Installing it now.
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:14 preparing install for monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:14 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:15 failed install prepare step: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:15 [ERROR] AppController p-bz9st/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:25 [INFO] [etcd-backup] Cluster [c-4b2p4] has no backups, creating first backup
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:25 [INFO] [etcd-backup] Cluster [c-4b2p4] new backup is created: c-4b2p4-rl-4zvns
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:25 [INFO] Installing chart using helm version: rancher-helm
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 Starting Tiller v2.16.3-rancher1 (tls=false)
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 GRPC listening on :34531
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 Probes listening on :54346
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 Storage driver is ConfigMap
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:26 Max history per release is 10
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:27 getting history for release monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:27 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher Release "monitoring-operator" does not exist. Installing it now.
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:27 preparing install for monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:27 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:28 failed install prepare step: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:28 [ERROR] AppController p-bz9st/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:33813
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] Starting saving snapshot on etcd hosts
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] [dialer] Setup tunnel for host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] [dialer] Setup tunnel for host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] [dialer] Setup tunnel for host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] [etcd] Running snapshot save once on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] Image [rancher/rke-tools:v0.1.56] exists on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:30 [INFO] Starting container [etcd-snapshot-once] on host [10.183.44.10], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:31 [INFO] [etcd] Successfully started [etcd-snapshot-once] container on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:31 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:31 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:32 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:32 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.10]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] Removing container [etcd-snapshot-once] on host [10.183.44.10], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] [etcd] Running snapshot save once on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] Image [rancher/rke-tools:v0.1.56] exists on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:33 [INFO] Starting container [etcd-snapshot-once] on host [10.183.44.6], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:34 [INFO] [etcd] Successfully started [etcd-snapshot-once] container on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:34 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:34 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:35 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:35 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.6]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] Removing container [etcd-snapshot-once] on host [10.183.44.6], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] [etcd] Running snapshot save once on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] Image [rancher/rke-tools:v0.1.56] exists on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:36 [INFO] Starting container [etcd-snapshot-once] on host [10.183.44.9], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:37 [INFO] [etcd] Successfully started [etcd-snapshot-once] container on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:37 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:37 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:38 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:38 [INFO] Container [etcd-snapshot-once] is still running on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:38 [INFO] Installing chart using helm version: rancher-helm
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:39 [INFO] Waiting for [etcd-snapshot-once] container to exit on host [10.183.44.9]
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:39 [INFO] Removing container [etcd-snapshot-once] on host [10.183.44.9], try #1
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:39 [INFO] Finished saving/uploading snapshot [c-4b2p4-rl-4zvns_2020-04-02T04:34:25Z] on all etcd hosts
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:39 [INFO] kontainerdriver rancherkubernetesengine stopped
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 Starting Tiller v2.16.3-rancher1 (tls=false)
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 GRPC listening on :51699
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 Probes listening on :39375
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 Storage driver is ConfigMap
rancher-687cf6dc6b-2m2pq rancher [main] 2020/04/02 04:34:39 Max history per release is 10
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:40 getting history for release monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:40 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher Release "monitoring-operator" does not exist. Installing it now.
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:40 preparing install for monitoring-operator
rancher-687cf6dc6b-2m2pq rancher [storage] 2020/04/02 04:34:40 getting release history for "monitoring-operator"
rancher-687cf6dc6b-2m2pq rancher [tiller] 2020/04/02 04:34:40 failed install prepare step: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:34:40 [ERROR] AppController p-bz9st/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
rancher-687cf6dc6b-2m2pq rancher
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:00 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:13 [ERROR] failed on subscribe prometheus: NotFound 404: the server could not find the requested resource (get prometheuses.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:13 [ERROR] failed on subscribe prometheusRule: NotFound 404: the server could not find the requested resource (get prometheusrules.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:13 [ERROR] failed on subscribe alertmanager: NotFound 404: the server could not find the requested resource (get alertmanagers.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:13 [ERROR] failed on subscribe serviceMonitor: NotFound 404: the server could not find the requested resource (get servicemonitors.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:25 [ERROR] failed on subscribe prometheus: NotFound 404: the server could not find the requested resource (get prometheuses.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:25 [ERROR] failed on subscribe serviceMonitor: NotFound 404: the server could not find the requested resource (get servicemonitors.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:25 [ERROR] failed on subscribe prometheusRule: NotFound 404: the server could not find the requested resource (get prometheusrules.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:25 [ERROR] failed on subscribe alertmanager: NotFound 404: the server could not find the requested resource (get alertmanagers.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:43 [ERROR] failed on subscribe serviceMonitor: NotFound 404: the server could not find the requested resource (get servicemonitors.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:43 [ERROR] failed on subscribe alertmanager: NotFound 404: the server could not find the requested resource (get alertmanagers.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:43 [ERROR] failed on subscribe prometheusRule: NotFound 404: the server could not find the requested resource (get prometheusrules.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher 2020/04/02 04:35:43 [ERROR] failed on subscribe prometheus: NotFound 404: the server could not find the requested resource (get prometheuses.meta.k8s.io)
rancher-687cf6dc6b-2m2pq rancher W0402 04:37:40.683003       6 reflector.go:326] github.com/rancher/steve/pkg/clustercache/controller.go:187: watch of *summary.SummarizedObject ended with: unexpected object: &{{{Status v1} {      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []}} }

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.4.2
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): vSphere
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM
  • Kubernetes version (use kubectl version): 1.17

nickvth commented Apr 2, 2020

same here

@stefanvangastel

Having the same issues


jiaqiluo commented Apr 2, 2020

The bug is reproduced in a v2.4.2 single install when adding a cluster with cluster monitoring enabled by editing the cluster as a YAML file.

Rancher logs

2020/04/02 16:56:07 [ERROR] AppController p-hf8zj/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

2020/04/02 16:56:07 [ERROR] AppController p-hf8zj/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

[screenshots attached]

Workaround:
The workaround is to either force-upgrade the apps or to re-enable cluster monitoring.

More info:
Cluster monitoring deploys successfully if it is enabled after the cluster is active, rather than as part of provisioning the cluster. For reference, enabling it at provisioning time means the cluster config (edited as YAML) carries the flag shown in the sketch below.
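
A minimal sketch of that cluster YAML; every other field of the real config is omitted:

# Cluster config edited as YAML (minimal sketch; all other fields omitted).
# This flag is what deploys the monitoring-operator and cluster-monitoring
# system apps while the cluster is still provisioning.
enable_cluster_monitoring: true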

@jiaqiluo jiaqiluo added this to the v2.4.3 milestone Apr 2, 2020
@jiaqiluo jiaqiluo added the area/monitoring and kind/bug labels Apr 2, 2020
@jiaqiluo jiaqiluo removed this from the v2.4.3 milestone Apr 2, 2020

soumyalj commented Apr 2, 2020

The issue is also reproduced when creating a cluster using a cluster template with monitoring enabled. [While creating the cluster template, enable monitoring by editing the YAML and setting enable_cluster_monitoring: true.] Monitoring fails to come up with the below error in Apps:

Failed to install app monitoring-operator. Error: Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

[screenshot attached]

@hugodopradofernandes

This seems to be an issue with metrics-server that impacts Helm:

helm/helm#6361
Deleting the metrics-server APIService works around the Helm issue, but then Prometheus won't work because metrics-server is left without its API (I tried it; don't do that).

So we need to solve the issue of the metrics-server APIService reporting False (FailedDiscoveryCheck):

> kubectl get apiservice
NAME                                   SERVICE                      AVAILABLE                      AGE
v1.                                    Local                        True                           29m
v1.admissionregistration.k8s.io        Local                        True                           29m
v1.apiextensions.k8s.io                Local                        True                           29m
v1.apps                                Local                        True                           29m
v1.authentication.k8s.io               Local                        True                           29m
v1.authorization.k8s.io                Local                        True                           29m
v1.autoscaling                         Local                        True                           29m
v1.batch                               Local                        True                           29m
v1.coordination.k8s.io                 Local                        True                           29m
v1.crd.projectcalico.org               Local                        True                           28m
v1.monitoring.coreos.com               Local                        True                           25m
v1.networking.k8s.io                   Local                        True                           29m
v1.rbac.authorization.k8s.io           Local                        True                           29m
v1.scheduling.k8s.io                   Local                        True                           29m
v1.storage.k8s.io                      Local                        True                           29m
v1beta1.admissionregistration.k8s.io   Local                        True                           29m
v1beta1.apiextensions.k8s.io           Local                        True                           29m
v1beta1.authentication.k8s.io          Local                        True                           29m
v1beta1.authorization.k8s.io           Local                        True                           29m
v1beta1.batch                          Local                        True                           29m
v1beta1.certificates.k8s.io            Local                        True                           29m
v1beta1.coordination.k8s.io            Local                        True                           29m
v1beta1.discovery.k8s.io               Local                        True                           29m
v1beta1.events.k8s.io                  Local                        True                           29m
v1beta1.extensions                     Local                        True                           29m
v1beta1.metrics.k8s.io                 kube-system/metrics-server   False (FailedDiscoveryCheck)   28m
v1beta1.networking.k8s.io              Local                        True                           29m
v1beta1.node.k8s.io                    Local                        True                           29m
v1beta1.policy                         Local                        True                           29m
v1beta1.rbac.authorization.k8s.io      Local                        True                           29m
v1beta1.scheduling.k8s.io              Local                        True                           29m
v1beta1.storage.k8s.io                 Local                        True                           29m
v2beta1.autoscaling                    Local                        True                           29m
v2beta2.autoscaling                    Local                        True                           29m
v3.cluster.cattle.io                   Local                        True                           27m


mrajashree commented Apr 8, 2020

The Helm bug exists in 2.16 through 2.16.3, and the fix for the upstream Helm bug landed in Helm 2.16.5.
k8s client-go returns an error when an API service is registered but unimplemented, but since the discovery client continues building the API object, it is still correctly populated with all valid APIs, as per the upstream PR.
The monitoring-operator app already has logic in Rancher to force-deploy it if its workloads don't exist, which is why monitoring-operator recovered but cluster-monitoring did not; the Rancher PR will add the same force-redeploy logic to cluster-monitoring.
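
For context, here is a minimal Go sketch (not the actual Helm or Rancher code) of the tolerance pattern the upstream fix relies on: client-go reports a partial discovery failure for the broken aggregated API but still returns every group and resource it could discover, so the caller can log the failure and continue instead of aborting the install.

package main

import (
    "fmt"
    "log"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client config from the local kubeconfig (sketch assumption).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    groups, resources, err := dc.ServerGroupsAndResources()
    if err != nil {
        if discovery.IsGroupDiscoveryFailedError(err) {
            // One or more aggregated APIs (e.g. metrics.k8s.io/v1beta1) failed
            // discovery; groups/resources are still populated with everything
            // that did respond, so keep going rather than failing the install.
            log.Printf("warning: partial discovery failure: %v", err)
        } else {
            log.Fatal(err)
        }
    }
    fmt.Printf("discovered %d API groups and %d resource lists\n", len(groups), len(resources))
}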


jiaqiluo commented Apr 17, 2020

The bug fix is validated in v2.4-head 63a490f and master-head 00ff159

Steps:

  • run Rancher single install
  • add a cluster with cluster monitoring enabled by editing the cluster as YAML file

Results:

  • the apps are deployed successfully
  • In v2.4-head it takes quite a long time (10-15 mins) for metrics to show up in the UI after the apps are active. While waiting, the cluster page shows Monitoring API is not ready

Update:
In another attempt on the same v2.4-head setup, it took about 3 min 30 s for metrics to show up after the apps were active.


mrajashree commented Apr 17, 2020

@jiaqiluo how much time does it take for metrics to show up if monitoring is enabled after the cluster is active? Is the time similar for a 2.3 setup if you enable monitoring during cluster create?

@jiaqiluo

> @jiaqiluo how much time does it take for metrics to show up if monitoring is enabled after the cluster is active? Is the time similar for a 2.3 setup if you enable monitoring during cluster create?

@mrajashree it usually takes about 3 to 5 minutes for the apps to become active, then another 3 to 5 minutes for the metrics to show up.


soumyalj commented Apr 17, 2020

Validated the fix on master-head (00ff159) and v2.4-head (cf5ab1d).
Steps:

  1. Create a cluster using a cluster template with monitoring enabled.
    The monitoring app was deployed successfully and metrics were also displayed for nodes and workloads.

In both master-head and v2.4-head it took around 4 minutes for the monitoring app to be active and around 5 minutes for the metrics to show up after the app was active.

@jiaqiluo

@mrajashree

Here is the difference I noticed:

When cluster monitoring is enabled in the provisioning configuration, the UI keeps showing the Monitoring API is not ready message after the apps are active, and it takes another 3 to 5 minutes for the message to disappear and the metrics section to show up.

Whereas if cluster monitoring is enabled after the cluster is active, the Monitoring API is not ready message disappears as soon as the apps are active, and the metrics section shows up.

@mrajashree

@jiaqiluo If this is the same behavior you see on a 2.3 setup, then it's fine.


mrajashree commented Apr 17, 2020

Why is this reopened? Monitoring does work. If the delay is not the same for a 2.3 setup, we can open a new issue, but the monitoring app no longer fails to deploy.

@jiaqiluo

This bug is confirmed to be fixed.

Different behavior is observed in v2.3.6, so a new issue was opened to track it: #26692
