Metrics-server can't scrape nodes after enable_cri_dockerd is set to true #2709
Well, it's not fully working, though; this is what I get after rebooting all three nodes:
So something is still not right, although it's clear that v0.4.1 is submitting the metrics to kubectl, as its log contains errors for nodes 1 and 3.
Maybe it's related to the dockershim update? It was enabled when running rke up with the new images. But I still don't get why two nodes are not reporting metrics while one does. |
Please use If it is related to the dockershim update, you can disable it and see if that solves the issue. |
Yes, it was working fine with v0.4.1 before the upgrade. However, after the upgrade a new v0.5.0 pod was also running besides the old one, while just one Deployment (with v0.5.0) was present. This time, I made a backup of the running v0.4.1 ReplicaSet as YAML and deleted it, in case it conflicted with the new version. The new one still didn't succeed, though. Then I put back the v0.4.1 ReplicaSet from the backup YAML file, and it succeeded in scraping the metrics. Then I rebooted all nodes, after which the top command returns just one node's info, not all. Firewalld is disabled, as it should be. Thanks for the idea, now I did these steps:
So I think the issue is around the dockershim option. |
Just out of curiosity, I enabled the dockershim option again and ran rke up; the previously fine v0.5.0 then became unhealthy: "Warning Unhealthy 5s (x3 over 25s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500", and top returns "Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)". Disabled dockershim again, ran rke up, and v0.5.0 works fine; top returns the metrics again. |
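The toggle described in this comment can be scripted. A minimal sketch, assuming a local stand-in copy of cluster.yml at a hypothetical /tmp path; the rke and kubectl invocations are left as comments since they need a live cluster:

```shell
# Flip the dockershim option in a (stand-in) cluster.yml copy.
# The /tmp file is only an illustration; you would edit the real cluster.yml.
printf 'enable_cri_dockerd: true\n' > /tmp/cluster.yml
sed -i 's/^enable_cri_dockerd: true$/enable_cri_dockerd: false/' /tmp/cluster.yml
cat /tmp/cluster.yml              # prints: enable_cri_dockerd: false
# rke up --config cluster.yml     # apply the change to the cluster
# kubectl top nodes               # check whether node metrics come back
```

Flipping it back on is the same sed with the two values swapped.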
@immanuelfodor I ran the same scenario and although I did see the errors in the metrics-server, they went away after ~2 minutes. I also did not encounter the duplicate metrics-server deployment/pods with different images. If there is any more info you can provide on the setup and steps to reproduce, that would help. |
I think the duplicated metrics-server with two different images appeared just because the new metrics-server ReplicaSet/pod didn't become Healthy, so the previous ReplicaSet/pod wasn't removed until the new one started up fine. The old ReplicaSet/pod only disappeared automatically after I disabled dockershim, and so the new one finally became Healthy. I don't think we need to investigate this further. I re-enabled dockershim, did an rke up, waited for more than 2 minutes, and metrics-server is still unhealthy.
Here is the full cluster YAML, maybe it helps:

cluster.yml

nodes:
- address: 192.168.1.19
port: "22"
internal_address: ""
role:
- controlplane
- worker
- etcd
hostname_override: node1
user: centos
docker_socket: /var/run/docker.sock
ssh_key: ""
ssh_key_path: ~/.ssh/id_ed25519
ssh_cert: ""
ssh_cert_path: ""
labels: {}
taints: []
- address: 192.168.1.20
port: "22"
internal_address: ""
role:
- controlplane
- worker
- etcd
hostname_override: node2
user: centos
docker_socket: /var/run/docker.sock
ssh_key: ""
ssh_key_path: ~/.ssh/id_ed25519
ssh_cert: ""
ssh_cert_path: ""
labels: {}
taints: []
- address: 192.168.1.21
port: "22"
internal_address: ""
role:
- controlplane
- worker
- etcd
hostname_override: node3
user: centos
docker_socket: /var/run/docker.sock
ssh_key: ""
ssh_key_path: ~/.ssh/id_ed25519
ssh_cert: ""
ssh_cert_path: ""
labels: {}
taints: []
services:
etcd:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
win_extra_args: {}
win_extra_binds: []
win_extra_env: []
external_urls: []
ca_cert: ""
cert: ""
key: ""
path: ""
uid: 1000
gid: 1000
snapshot: true
retention: 48h
creation: 6h
backup_config:
interval_hours: 12
retention: 6
kube-api:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
win_extra_args: {}
win_extra_binds: []
win_extra_env: []
service_cluster_ip_range: 10.43.0.0/16
service_node_port_range: ""
pod_security_policy: false
always_pull_images: false
secrets_encryption_config:
enabled: true
audit_log:
enabled: true
admission_configuration: null
event_rate_limit: null
kube-controller:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
win_extra_args: {}
win_extra_binds: []
win_extra_env: []
cluster_cidr: 10.42.0.0/16
service_cluster_ip_range: 10.43.0.0/16
scheduler:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
win_extra_args: {}
win_extra_binds: []
win_extra_env: []
kubelet:
image: ""
extra_args:
max-pods: 150
# @see: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#general-guidelines
enforce-node-allocatable: "pods"
system-reserved: "cpu=300m,memory=5Mi,ephemeral-storage=1Gi"
kube-reserved: "cpu=200m,memory=5Mi,ephemeral-storage=1Gi"
eviction-hard: "memory.available<1Gi,nodefs.available<10%"
extra_binds:
# added ISCSI paths due to OpenEBS cStor requirements
# @see: https://docs.openebs.io/docs/next/prerequisites.html#rancher
- "/etc/iscsi:/etc/iscsi"
- "/sbin/iscsiadm:/sbin/iscsiadm"
- "/var/lib/iscsi:/var/lib/iscsi"
- "/lib/modules"
- "/var/openebs/local:/var/openebs/local"
- "/usr/lib64/libcrypto.so.1.1:/usr/lib/libcrypto.so.1.1"
- "/usr/lib64/libopeniscsiusr.so.0.2.0:/usr/lib/libopeniscsiusr.so.0.2.0"
extra_env: []
win_extra_args: {}
win_extra_binds: []
win_extra_env: []
cluster_domain: cluster.local
infra_container_image: ""
cluster_dns_server: 10.43.0.10
fail_swap_on: false
generate_serving_certificate: false
kubeproxy:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
win_extra_args: {}
win_extra_binds: []
win_extra_env: []
network:
plugin: canal
options:
# workaround to get hostnetworked pods DNS resolution working on nodes that don't have a CoreDNS replica running
# do the rke up then reboot all nodes to apply
# @see: https://github.com/rancher/k3s/issues/1827#issuecomment-636362097
# @see: https://github.com/coreos/flannel/issues/1243#issuecomment-589542796
# @see: https://rancher.com/docs/rke/latest/en/config-options/add-ons/network-plugins/
canal_flannel_backend_type: host-gw
mtu: 0
node_selector: {}
update_strategy: null
tolerations: []
authentication:
strategy: x509
sans:
# floating virtual IP with kube-karp, @see: https://github.com/immanuelfodor/kube-karp
- "192.168.1.10"
webhook: null
addons: ""
addons_include:
- dashboard/recommended.yml
- dashboard/dashboard-adminuser.yml
system_images:
etcd: rancher/mirrored-coreos-etcd:v3.4.16-rancher1
alpine: rancher/rke-tools:v0.1.78
nginx_proxy: rancher/rke-tools:v0.1.78
cert_downloader: rancher/rke-tools:v0.1.78
kubernetes_services_sidecar: rancher/rke-tools:v0.1.78
kubedns: rancher/mirrored-k8s-dns-kube-dns:1.17.4
dnsmasq: rancher/mirrored-k8s-dns-dnsmasq-nanny:1.17.4
kubedns_sidecar: rancher/mirrored-k8s-dns-sidecar:1.17.4
kubedns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.3
coredns: rancher/mirrored-coredns-coredns:1.8.4
coredns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.3
nodelocal: rancher/mirrored-k8s-dns-node-cache:1.18.0
kubernetes: rancher/hyperkube:v1.21.5-rancher1
flannel: rancher/mirrored-coreos-flannel:v0.14.0
flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
calico_node: rancher/mirrored-calico-node:v3.19.2
calico_cni: rancher/mirrored-calico-cni:v3.19.2
calico_controllers: rancher/mirrored-calico-kube-controllers:v3.19.2
calico_ctl: rancher/mirrored-calico-ctl:v3.19.2
calico_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.19.2
canal_node: rancher/mirrored-calico-node:v3.19.2
canal_cni: rancher/mirrored-calico-cni:v3.19.2
canal_controllers: rancher/mirrored-calico-kube-controllers:v3.19.2
canal_flannel: rancher/mirrored-coreos-flannel:v0.14.0
canal_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.19.2
weave_node: weaveworks/weave-kube:2.8.1
weave_cni: weaveworks/weave-npc:2.8.1
pod_infra_container: rancher/mirrored-pause:3.4.1
ingress: rancher/nginx-ingress-controller:nginx-0.48.1-rancher1
ingress_backend: rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1
ingress_webhook: rancher/mirrored-jettech-kube-webhook-certgen:v1.5.1
metrics_server: rancher/mirrored-metrics-server:v0.5.0
windows_pod_infra_container: rancher/kubelet-pause:v0.1.6
aci_cni_deploy_container: noiro/cnideploy:5.1.1.0.1ae238a
aci_host_container: noiro/aci-containers-host:5.1.1.0.1ae238a
aci_opflex_container: noiro/opflex:5.1.1.0.1ae238a
aci_mcast_container: noiro/opflex:5.1.1.0.1ae238a
aci_ovs_container: noiro/openvswitch:5.1.1.0.1ae238a
aci_controller_container: noiro/aci-containers-controller:5.1.1.0.1ae238a
aci_gbp_server_container: noiro/gbp-server:5.1.1.0.1ae238a
aci_opflex_server_container: noiro/opflex-server:5.1.1.0.1ae238a
ssh_key_path: ~/.ssh/id_ed25519
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
mode: rbac
options: {}
ignore_docker_version: null
# default null, can be true, @see: https://github.com/rancher/rancher/issues/31943
# my issue, @see: https://github.com/rancher/rke/issues/2709
enable_cri_dockerd: true
kubernetes_version: ""
private_registries: []
ingress:
provider: nginx
options:
use-forwarded-headers: "true"
proxy-body-size: "80M"
use-http2: "true"
node_selector: {}
extra_args: {}
dns_policy: ""
extra_envs: []
extra_volumes: []
extra_volume_mounts: []
update_strategy: null
# @see: https://github.com/rancher/rke/issues/1876
# @see: https://github.com/rancher/rke/commit/5a63de09bc21267955461372aa2969cdff6e5b2c
http_port: 0
https_port: 0
# default "", can be hostNetwork, hostPort, @see: https://rancher.com/docs/rke/latest/en/config-options/add-ons/ingress-controllers/#configuring-network-options
# @see: https://github.com/rancher/rke/issues/2702#issuecomment-928950593
network_mode: ""
tolerations: []
default_backend: null
default_http_backend_priority_class_name: ""
nginx_ingress_controller_priority_class_name: ""
cluster_name: "rke"
cloud_provider:
name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
address: ""
port: ""
user: ""
ssh_key: ""
ssh_key_path: ""
ssh_cert: ""
ssh_cert_path: ""
# default false, @see: https://github.com/rancher/rke/issues/2525
ignore_proxy_env_vars: false
monitoring:
provider: ""
options: {}
node_selector: {}
update_strategy: null
replicas: null
tolerations: []
metrics_server_priority_class_name: ""
restore:
restore: false
snapshot_name: ""
# automatic 'rke encrypt rotate-key', @see: https://github.com/rancher/rancher/issues/27735
rotate_encryption_key: false
dns:
provider: coredns
upstreamnameservers:
- 192.168.1.33
- 192.168.1.34

I've also got these errors firing in Rancher over the last half an hour after re-enabling dockershim:
|
Please run metrics-server with the highest log verbosity and see what is causing the issue. Is your host firewall enabled or disabled? With or without custom rules? |
Firewalld is disabled on all 3 nodes, and all of them are on the same subnet, so it shouldn't cause any problems; everything is fine when dockershim is not enabled. Redeployed metrics-server with:

monitoring:
  options:
    # increase metrics server verbosity, @see: https://github.com/rancher/rke/issues/2709#issuecomment-931630835
    # for other params, @see: docker run --rm k8s.gcr.io/metrics-server/metrics-server:v0.5.0 --help
    v: "10"

Since the new one can't become healthy, two pods/ReplicaSets are running (this confirms the previous assumption). Here are the verbose logs from the new one:

metrics-server verbose logs
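To dig through such verbose logs offline, one option is to save them to a file (e.g. with kubectl logs) and filter for scrape failures. A minimal sketch over a stand-in log file; the log lines, pod name, and paths below are made up for illustration, not actual output from this cluster:

```shell
# Write a stand-in log file; in practice this would come from something like
# `kubectl -n kube-system logs <metrics-server-pod> > /tmp/ms.log`.
printf '%s\n' \
  'scraper.go:140] "Failed to scrape node" node="node2"' \
  'round_trippers.go:454] GET https://node1:10250/stats/summary 200 OK' \
  'scraper.go:140] "Failed to scrape node" node="node3"' > /tmp/ms.log
grep -c 'Failed to scrape' /tmp/ms.log   # prints: 2
```

Grepping per-node this way makes it easy to see which kubelets the scraper can and cannot reach.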
|
I'm affected by this as well, after upgrading RKE from v1.2.8 to v1.3.2 with enable_cri_dockerd=true.
|
This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions. |
I didn't have time to test it with latest RKE release, please don't close, it's possibly still relevant |
The linked issue I created (that was closed in favor of this) has pretty concise details about how to test and what the issue truly is. |
Tried to enable dockershim with the latest rke v1.3.7, which now includes k8s v1.22.6 and metrics-server v0.5.1. I get the same errors in the metrics-server logs as in #2709 (comment) and #2709 (comment). Besides, kubectl top now returns |
This also impacts rancher-desktop |
This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions. |
Unstale, I need to test this with v1.22.9 |
Upstream just fixed this a couple weeks ago FYI: Mirantis/cri-dockerd#15 |
Got the same issue with v1.22.9.
|
This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions. |
Any updates on this? Just tested with RKE 1.3.14 and Kubernetes v1.23.7-rancher1-1; the issue still persists. |
I can confirm we are still seeing this issue; see rancher/rancher#38816. We initially thought it was a problem with the OEL kernel, but it's not related to that. |
RKE version: v1.3.1
Docker version: (docker version, docker info preferred)
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Proxmox
cluster.yml file:
Maybe relevant lines:
Steps to Reproduce:
Update RKE v1.2.11 -> v1.3.1 (k8s v1.20.9 -> v1.21.5, metrics-server v0.4.1->v0.5.0)
Results:
After updating RKE, another metrics-server instance is created which is unable to start and can't scrape the nodes, while the previous one is still present and works fine with the previous version:
Diffing the good old and the bad new ReplicaSets, the only relevant change seems to be the port change. Maybe a digit 4 is missing?
I also did a Rancher app update from the latest v2.5 patch release (can't remember which version it was) to v2.6.0 with a monitoring update 14.5.100 -> 100.0.0+up16.6.0, but RKE is not provisioned from Rancher; I only use Rancher to easily deploy the monitoring stack. The Rancher update happened before the RKE update, and metrics-server was untouched by it. It only started failing with v0.5.0 after the RKE update. v0.4.1 works fine even after manually deleting its ReplicaSet and restoring it from a YAML file backup.
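The ReplicaSet diff described above can be reproduced locally against two saved dumps. A minimal sketch with stand-in files; the 443/4443 port values and file names are assumptions for illustration, since the thread does not show the actual numbers:

```shell
# Stand-ins for the old (v0.4.1) and new (v0.5.0) ReplicaSet dumps;
# in practice: kubectl -n kube-system get rs <name> -o yaml > <file>
printf 'containerPort: 443\n'  > /tmp/rs-old.yaml
printf 'containerPort: 4443\n' > /tmp/rs-new.yaml
diff /tmp/rs-old.yaml /tmp/rs-new.yaml || true   # the port line is the only difference
```

Diffing full dumps this way also surfaces image tags, probe ports, and args changed by an upgrade in one pass.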