enable_cri_dockerd: true causes dockerd to use all CPU resources available. #38018
Comments
This will be fixed in Rancher v2.6.7 according to rancher/rke#2938
@Raboo Are you still seeing these issues with Rancher 2.6.7/2.6.8?
@cite I have not upgraded and not enabled cri_dockerd.
I took a look at Mirantis/cri-dockerd#38 - it seems this only fixes the problem of metrics not being returned fast enough, by delegating stats collection to a goroutine. So the metrics aren't collected any faster or in a more clever way, but just in parallel. Then, when metrics are collected, and you run a lot of pods, it makes sense that So instead of
That sucks.
With Rancher 2.6.8, dockerd is still consuming more CPU than on other nodes where cri_dockerd is disabled.
This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
This issue is still unresolved. Maybe continue in #38816, which seems to be a duplicate.
Yes, it seems to be the same issue. Unfortunately this has been a known problem for a very long time without any resolution in sight.
It seems this will be resolved in the next release; details in #38816.
Just to clarify, the linked issue fixes CPU usage for >=1.24 clusters. For clusters <1.24, please disable cri-dockerd and you should see the same improvement. In case you see <1.24 clusters with
Can confirm the CPU issue is fixed: upgraded v1.23.16-rancher2-1 -> v1.24.13-rancher2-1 with
Rancher Server Setup
Information about the Cluster
Information about underlying OS
Describe the bug
I set the cluster configuration rancher_kubernetes_engine_config.enable_cri_dockerd: true and dockerd started using all available CPU, causing high load and eventually cascading errors across the entire cluster. Seems related to rancher/rke#2938 and rancher/rke#2709.
To Reproduce
Enable rancher_kubernetes_engine_config.enable_cri_dockerd on Rancher v2.6.5 or earlier on k8s v1.21, v1.22 or v1.23.
Result
Expected Result
dockerd should not be using 4800% CPU and causing a system load of over 80 on a 48 core node.
Additional context
Perhaps there should be a warning in the documentation about using enable_cri_dockerd: true, since it's been known for a while that the performance is bad. At least until the root problem is fixed, since this is a cluster-breaking bug.