e2e flake: OCI runtime create failed: runc create failed: unable to start container process: unable to apply cgroup configuration: failed to write #109182
Comments
@liggitt: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
kubelet log:
containerd log:
I'm inclined to think this is a container-related issue:
/triage accepted
/cc @kolyshkin @rphillips
Was the runc (or containerd) binary updated on these jobs?
Support for the RDMA controller was indeed added in runc 1.1 (in opencontainers/runc#2883). The error message says that the write failed. That write happens right after the mkdir, which (apparently) succeeded.
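For illustration, here is a minimal Go sketch of those two steps: create the cgroup directory, then write the PID into cgroup.procs (the write that fails in this flake). This is not runc's actual code; the helper name and the example rdma path are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// addPidToCgroup mirrors the two steps discussed above:
// 1. mkdir the cgroup directory (the part that succeeded in the flake),
// 2. write the PID into cgroup.procs (the write that failed).
func addPidToCgroup(cgroupPath string, pid int) error {
	if err := os.MkdirAll(cgroupPath, 0o755); err != nil {
		return fmt.Errorf("mkdir %s: %w", cgroupPath, err)
	}
	procs := filepath.Join(cgroupPath, "cgroup.procs")
	if err := os.WriteFile(procs, []byte(strconv.Itoa(pid)), 0o644); err != nil {
		return fmt.Errorf("write %s: %w", procs, err)
	}
	return nil
}

func main() {
	// Hypothetical pod cgroup path under the v1 rdma controller; requires root.
	if err := addPidToCgroup("/sys/fs/cgroup/rdma/kubepods/pod-example", os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```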
The kubelet log also shows (this is the earliest mention of rdma):
Note that support for the hybrid unified hierarchy also first appeared in runc 1.1.
The removal failure happens because there are allegedly processes left in the rdma and unified cgroups, which prevents the cgroup removal. I can't figure out why this can ever happen (kubelet does not know anything about rdma or unified, but that should not break things). My preliminary theory is that the inability to write a pid to rdma is caused by too many rdma cgroups. In any case, we should figure out why rdma and unified are not empty upon removal. Following the source code, kubelet kills all the processes in these cgroups before trying to remove them, so I am puzzled.
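To make the "why are rdma and unified not empty" question concrete, the check boils down to walking the pod cgroup and reading cgroup.procs. Below is a hedged diagnostic sketch in Go — not kubelet code, and the pod paths are hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// listLeftoverPids walks one controller's pod cgroup (including child
// directories) and returns every PID still listed in cgroup.procs.
func listLeftoverPids(root string) ([]string, error) {
	var pids []string
	err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			data, rerr := os.ReadFile(filepath.Join(path, "cgroup.procs"))
			if rerr != nil {
				return nil // skip cgroups we cannot read
			}
			pids = append(pids, strings.Fields(string(data))...)
		}
		return nil
	})
	return pids, err
}

func main() {
	// Hypothetical pod cgroup paths for the controllers mentioned above.
	for _, root := range []string{
		"/sys/fs/cgroup/rdma/kubelet/kubepods/burstable/pod-example",
		"/sys/fs/cgroup/unified/kubelet/kubepods/burstable/pod-example",
	} {
		pids, err := listLeftoverPids(root)
		fmt.Printf("%s: pids=%v err=%v\n", root, pids, err)
	}
}
```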
Any possibility of processes stuck in 'D' state?
So, I added some debug in #109298 to see what is going on. Here is an excerpt from the kubelet log:

Apr 05 01:54:38 kind-worker kubelet[261]: time="2022-04-05T01:54:38Z" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/unified/kubelet/kubepods/burstable/pod1883213d8fec799ee2b7bf9f2185a5c7/5b078c521eefb476090c430cac51c128fabcd9094ebcaf0fa225d2b366c13c39: device or resource busy"

All this means that
Adding more debug to #109298... My next two suspects are KIND and the kernel. As for KIND, I looked at the sources of the script that prepares cgroups and found nothing wrong.
@mrunalp Looks like it's not that: cgroup.procs shows no entries (nor do any of the subdirectories) -- see the previous comment.
@liggitt 👋 I'm the release 1.24 bug triage shadow. With the test freeze cutoff tomorrow, do you think this issue will still be included in the current release?
Until the issue is understood, it should remain in the milestone.
/assign @mrunalp
@mrunalp let's catch up on why rdma is even an available controller on this host. rdma isn't in this allowed list: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cgroup_manager_linux.go#L260
@derekwaynecarr This "allowed list" is merely a way to specify controllers that must be present (I guess its naming is slightly misleading). In other words, the code you refer to ensures that the memory, cpu, etc. paths are present. It has nothing to do with rdma or unified. runc creates cgroups for all supported controllers/subsystems (and adds containers to all of them). What's unclear is why these cgroups can't be removed during destroy. I am still looking at it in #109298 (feeling under the weather today, so it is taking longer).
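To illustrate the distinction, a presence check of that kind only verifies that a fixed set of required controller hierarchies exists; it does not restrict which controllers the runtime joins. Here is a hedged Go sketch — the controller list below only approximates the kubelet's, and the paths assume cgroup v1 mounted at /sys/fs/cgroup:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// requiredControllers approximates an "allowed list" of controllers whose
// hierarchies must be present; rdma and unified are intentionally absent.
var requiredControllers = []string{"cpu", "cpuacct", "cpuset", "memory"}

// verifyRequired only checks that each required hierarchy exists; it says
// nothing about which controllers a container is actually joined to.
func verifyRequired(cgroupRoot string) error {
	for _, c := range requiredControllers {
		p := filepath.Join(cgroupRoot, c)
		if _, err := os.Stat(p); err != nil {
			return fmt.Errorf("required cgroup controller %q not mounted at %s: %w", c, p, err)
		}
	}
	return nil
}

func main() {
	if err := verifyRequired("/sys/fs/cgroup"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("all required controllers present")
}
```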
One thing worth trying may be to see whether we still hit the issue if we don't join the rdma controller.
@kolyshkin understood. rdma being an enabled cgroup controller on a target host for kubelet execution is what was new to me, so I was wondering if there was a change to the test operating system configuration beyond runc just adding awareness.
The RDMA cgroup controller requires a kernel config parameter to be set. It is obviously set in Ubuntu kernels. In Fedora 35 kernels:
On the CentOS Stream 9 kernel, it is set:
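As an aside, one quick way to check at runtime whether a kernel exposes the rdma cgroup controller at all is to look for it in /proc/cgroups. This is a hedged sketch, not something taken from the thread:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// rdmaControllerEnabled scans /proc/cgroups for an "rdma" row; the row is
// only present when the kernel was built with the RDMA cgroup controller.
func rdmaControllerEnabled() (bool, error) {
	f, err := os.Open("/proc/cgroups")
	if err != nil {
		return false, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) > 0 && fields[0] == "rdma" {
			return true, nil
		}
	}
	return false, s.Err()
}

func main() {
	ok, err := rdmaControllerEnabled()
	fmt.Printf("rdma cgroup controller enabled: %v (err: %v)\n", ok, err)
}
```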
Not joining RDMA in the vendored libcontainer did not help. It might help in the runc binary (which I haven't tried). Looking into the underlying cause.
I'm not sure this is eliminated: =>
but CI should be using kind @ HEAD and kubernetes-sigs/kind#2709 merged two days ago |
reopening per #109182 (comment) to make sure this is resolved |
Looking at some of the issues around k8s/runc, I came across this issue where runc 1.1.0 didn't properly scope some cgroup objects. kubernetes/kubernetes#109182 Signed-off-by: Shane Jarych <sjarych@mirantis.com>
Should that have actually closed this issue? I'm not seeing how the linked commit in Ben's fork modified kind bringup.
Oh no, that's that GitHub "feature": I merely synced my fork to upstream, but the commit message contains "fixes". Not sure why CI didn't block this with the invalid-commit label.
Ah, yes, those are always a pain.
Potentially related: ... We should probably update the docker-in-docker in Kubernetes CI; it's going to have an outdated docker install and it's generally not well done. I've been meaning to clean that up ... https://github.com/kubernetes/test-infra/blob/master/images/krte/Dockerfile is still based on Debian Buster ...
The tests that execute commands on pods seem to be affected by this.
KIND's CI image is now on docker 20.10.15 / runc v1.1.1-0-g52de29d. Tentatively, after this change we don't see any more logs about rdma cgroups; I've spot-checked a few. (The CI dind is still naively done and I'm not sure what the underlying hosts are running currently; need to get back to that ...)
https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=unable%20to%20apply%20cgroup%20configuration&xjob=1-2 is indeed empty. https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=unable%20to%20apply%20cgroup%20configuration on all jobs has a few, but those are:
The csi-driver-hostpath jobs are due to the kubekins-e2e image not having the updated docker (and possibly not an updated kind). CAPI is probably the same thing. These remaining flakes are rare and are not affecting CI for this repo.
@BenTheElder: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Looks like we just got a spike of a new run failure message in master: https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=unable%20to%20apply%20cgroup%20configuration&xjob=1-2
Seen in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109178/pull-kubernetes-conformance-kind-ga-only-parallel/1509397620936675328
/milestone v1.24
/sig node