
enable_cri_dockerd: true causes dockerd to use all CPU resources available. #38018

Closed

Raboo opened this issue Jun 16, 2022 · 13 comments

Raboo commented Jun 16, 2022

Rancher Server Setup

  • Rancher version: 2.6.5
  • Installation option (Docker install/Helm Chart): helm, rke1

Information about the Cluster

  • Kubernetes version: v1.22.9
  • Cluster Type (Local/Downstream): Custom

Information about underlying OS

  • OS: Flatcar Linux
  • Release: Beta Channel / 3227.1.0
  • Kernel: 5.15.43
  • Docker engine: 20.10.14
  • using legacy cgroups (cgroups v1)

Describe the bug
I set the cluster configuration option rancher_kubernetes_engine_config.enable_cri_dockerd: true, and dockerd started consuming all available CPU, causing high load and eventually cascading failures across the entire cluster.

Seems related to rancher/rke#2938 and rancher/rke#2709.

To Reproduce
Enable rancher_kubernetes_engine_config.enable_cri_dockerd on Rancher v2.6.5 or earlier on k8s v1.21, v1.22 or v1.23.
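For reference, the option as set in this report looks roughly like the fragment below (a minimal sketch of the cluster config; the exact nesting may differ depending on whether it is edited via cluster.yml or the Rancher UI):

```yaml
# The setting from this report: switches the CRI shim to cri-dockerd
rancher_kubernetes_engine_config:
  enable_cri_dockerd: true
```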

Result

Expected Result
dockerd should not be using 4800% CPU and driving the system load above 80 on a 48-core node.

Additional context
Perhaps there should be a warning in the documentation about using enable_cri_dockerd: true, since the poor performance has been known for a while. At least until the root cause is fixed, as this is a cluster-breaking bug.

@Raboo Raboo closed this as completed Jun 20, 2022
@Raboo Raboo reopened this Jun 20, 2022

Raboo commented Aug 18, 2022

This will be fixed in Rancher v2.6.7, according to rancher/rke#2938.

@Raboo Raboo closed this as completed Aug 18, 2022

cite commented Sep 8, 2022

@Raboo Are you still seeing these issues with Rancher 2.6.7/2.6.8?


Raboo commented Sep 8, 2022

@cite I have not upgraded and have not enabled cri_dockerd.
But more users have confirmed that there is indeed still a problem: rancher/rke#2709

@Raboo Raboo reopened this Sep 8, 2022

cite commented Sep 8, 2022

I took a look at Mirantis/cri-dockerd#38. It seems this only fixes the problem of metrics not being returned fast enough, by delegating stats collection to goroutines. So the metrics aren't collected any faster or in a more clever way, just in parallel. Then, when metrics are collected and you run a lot of pods, it makes sense that dockerd would need several tens of cores.

So instead of the kubelet being slow, we now consume a lot more CPU in dockerd.
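The parallel-collection pattern described above can be sketched as follows. This is a minimal illustration of fanning per-container stats calls out to goroutines, not cri-dockerd's actual code; the struct, function names, and the simulated delay are all invented for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// containerStats stands in for the per-container stats a CRI shim
// assembles from the Docker API.
type containerStats struct {
	id       string
	cpuNanos uint64
}

// collectOne simulates one expensive per-container stats call; in the
// real setup, the Docker API round trip is what burns CPU in dockerd.
func collectOne(id string) containerStats {
	time.Sleep(10 * time.Millisecond) // placeholder for the dockerd round trip
	return containerStats{id: id, cpuNanos: 1}
}

// collectAll fans the calls out to goroutines: the caller gets its
// answer sooner, but the total work done by the daemon is unchanged,
// it just happens concurrently across many cores.
func collectAll(ids []string) []containerStats {
	out := make([]containerStats, len(ids))
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i int, id string) {
			defer wg.Done()
			out[i] = collectOne(id) // each goroutine writes its own slot
		}(i, id)
	}
	wg.Wait()
	return out
}

func main() {
	ids := []string{"c1", "c2", "c3", "c4"}
	stats := collectAll(ids)
	fmt.Println(len(stats)) // prints 4
}
```

With hundreds of pods per node, this fan-out means many concurrent API calls hitting dockerd at once, which matches the observation that latency improves while aggregate CPU use does not.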


Raboo commented Sep 8, 2022

That sucks.
Well, my situation was that my entire cluster went haywire: high CPU on dockerd and an unresponsive kubelet.
So I reverted everything. Currently there is no battle-tested way of migrating a Rancher-provisioned RKE1 cluster to RKE2, which sucks.


nickvth commented Oct 17, 2022

With Rancher 2.6.8, dockerd is still consuming more CPU than on nodes where cri_dockerd is disabled.

github-actions (Contributor) commented

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

phoenix-bjoern commented
This issue is still unresolved. Maybe continue in #38816, which seems to be a duplicate.


Raboo commented Dec 21, 2022

Yes, it seems to be the same issue. Unfortunately, this has been a known problem for a very long time without any resolution in sight.


Raboo commented Feb 20, 2023

Seems this will be resolved in the next release; details: #38816

@Raboo Raboo closed this as completed Feb 20, 2023
kinarashah (Member) commented

Just to clarify, the linked issue fixes CPU usage for >=1.24 clusters. For clusters <1.24, please disable cri-dockerd and you will see the same improvement. If you see a <1.24 cluster with enable_cri_dockerd: false where dockerd is still consuming all CPU, please reopen this issue or open a new one with full system information.
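In config terms, the suggested workaround for pre-1.24 clusters is the fragment below (shown with the nesting used earlier in this issue; adjust to your config layout):

```yaml
# Workaround for Kubernetes <1.24: leave cri-dockerd disabled
# until the cluster is upgraded to >=1.24.
rancher_kubernetes_engine_config:
  enable_cri_dockerd: false
```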

immanuelfodor commented

Can confirm the CPU issue is fixed: upgraded from v1.23.16-rancher2-1 to v1.24.13-rancher2-1 with enable_cri_dockerd: true, and CPU usage is normal.

6 participants