
enable_cri_dockerd: true causes dockerd to use all CPU resources available. #38018

Closed

Raboo opened this issue Jun 16, 2022 · 13 comments

Raboo commented Jun 16, 2022

Rancher Server Setup

  • Rancher version: 2.6.5
  • Installation option (Docker install/Helm Chart): helm, rke1

Information about the Cluster

  • Kubernetes version: v1.22.9
  • Cluster Type (Local/Downstream): Custom

Information about underlying OS

  • OS: Flatcar Linux
  • Release: Beta Channel / 3227.1.0
  • Kernel: 5.15.43
  • Docker engine: 20.10.14
  • using legacy cgroups (cgroups v1)

Describe the bug
I set the cluster configuration option rancher_kubernetes_engine_config.enable_cri_dockerd: true, and dockerd started consuming all available CPU, causing high load and eventually cascading failures across the entire cluster.

Seems related to rancher/rke#2938 and rancher/rke#2709.

To Reproduce
Enable rancher_kubernetes_engine_config.enable_cri_dockerd on Rancher v2.6.5 or earlier on k8s v1.21, v1.22 or v1.23.
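For reference, the option as set in this report looks roughly like the fragment below (a minimal sketch of the cluster config; the exact nesting may differ depending on whether it is edited via cluster.yml or the Rancher UI):

```yaml
# The setting from this report: switches the CRI shim to cri-dockerd
rancher_kubernetes_engine_config:
  enable_cri_dockerd: true
```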

Result

Expected Result
dockerd should not be using 4800% CPU and driving the system load above 80 on a 48-core node.

Additional context
Perhaps there should be a warning in the documentation about using enable_cri_dockerd: true, since the poor performance has been known for a while. At least until the root cause is fixed, as this is a cluster-breaking bug.

@Raboo Raboo closed this as completed Jun 20, 2022
@Raboo Raboo reopened this Jun 20, 2022

Raboo commented Aug 18, 2022

This will be fixed in Rancher v2.6.7, according to rancher/rke#2938.

@Raboo Raboo closed this as completed Aug 18, 2022

cite commented Sep 8, 2022

@Raboo Are you still seeing these issues with Rancher 2.6.7/2.6.8?


Raboo commented Sep 8, 2022

@cite I have not upgraded and have not enabled cri_dockerd.
But more users have confirmed that there is indeed still a problem: rancher/rke#2709

@Raboo Raboo reopened this Sep 8, 2022

cite commented Sep 8, 2022

I took a look at Mirantis/cri-dockerd#38. It seems this only fixes the problem of metrics not being returned fast enough, by delegating stats collection to goroutines. So the metrics aren't collected any faster or in a more clever way, just in parallel. Then, when metrics are collected and you run a lot of pods, it makes sense that dockerd would need several tens of cores.

So instead of the kubelet being slow, we now consume a lot more CPU in dockerd.
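The parallel-collection pattern described above can be sketched as follows. This is a minimal illustration of fanning per-container stats calls out to goroutines, not cri-dockerd's actual code; the struct, function names, and the simulated delay are all invented for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// containerStats stands in for the per-container stats a CRI shim
// assembles from the Docker API.
type containerStats struct {
	id       string
	cpuNanos uint64
}

// collectOne simulates one expensive per-container stats call; in the
// real setup, the Docker API round trip is what burns CPU in dockerd.
func collectOne(id string) containerStats {
	time.Sleep(10 * time.Millisecond) // placeholder for the dockerd round trip
	return containerStats{id: id, cpuNanos: 1}
}

// collectAll fans the calls out to goroutines: the caller gets its
// answer sooner, but the total work done by the daemon is unchanged,
// it just happens concurrently across many cores.
func collectAll(ids []string) []containerStats {
	out := make([]containerStats, len(ids))
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i int, id string) {
			defer wg.Done()
			out[i] = collectOne(id) // each goroutine writes its own slot
		}(i, id)
	}
	wg.Wait()
	return out
}

func main() {
	ids := []string{"c1", "c2", "c3", "c4"}
	stats := collectAll(ids)
	fmt.Println(len(stats)) // prints 4
}
```

With hundreds of pods per node, this fan-out means many concurrent API calls hitting dockerd at once, which matches the observation that latency improves while aggregate CPU use does not.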


Raboo commented Sep 8, 2022

That sucks.
Well, my situation was that my entire cluster went haywire: high CPU on dockerd and an unresponsive kubelet.
So I reverted everything. Currently there is no battle-tested way of migrating a Rancher-provisioned RKE1 cluster to RKE2, which sucks.


nickvth commented Oct 17, 2022

With Rancher 2.6.8, dockerd is still consuming more CPU than on nodes where cri_dockerd is disabled.

github-actions (Contributor) commented

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

phoenix-bjoern commented
This issue is still unresolved. Maybe continue in #38816, which seems to be a duplicate.


Raboo commented Dec 21, 2022

Yes, it seems to be the same issue. Unfortunately, this has been a known problem for a very long time without any resolution in sight.


Raboo commented Feb 20, 2023

Seems this will be resolved in the next release; details: #38816

@Raboo Raboo closed this as completed Feb 20, 2023
kinarashah (Member) commented

Just to clarify, the linked issue fixes CPU usage for >=1.24 clusters. For clusters <1.24, please disable cri-dockerd and you will see the same improvement. If you see a <1.24 cluster with enable_cri_dockerd: false where dockerd is still consuming all CPU, please reopen this issue or open a new one with full system information.
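In config terms, the suggested workaround for pre-1.24 clusters is the fragment below (shown with the nesting used earlier in this issue; adjust to your config layout):

```yaml
# Workaround for Kubernetes <1.24: leave cri-dockerd disabled
# until the cluster is upgraded to >=1.24.
rancher_kubernetes_engine_config:
  enable_cri_dockerd: false
```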

immanuelfodor commented

Can confirm the CPU issue is fixed: upgraded from v1.23.16-rancher2-1 to v1.24.13-rancher2-1 with enable_cri_dockerd: true, and CPU usage is normal.

6 participants