
Containerd IP leakage #5768

Closed
Random-Liu opened this issue Jul 20, 2021 · 28 comments
Labels
area/cri Container Runtime Interface (CRI) kind/bug

Comments

@Random-Liu
Member

Description

We see a problem in production where containerd may leak IPs on the node.

Steps to reproduce the issue:

  • When pod network setup is quite slow, RunPodSandbox may time out or fail;
  • Once RunPodSandbox fails, it tries to tear down the pod network in a defer;
  • However, because CNI is slow, the teardown also fails;
  • At this point, the pod sandbox is gone, but the network is not properly torn down.

Proposed solution
We should probably change how RunPodSandbox works.

It should:

  1. Create the sandbox container first;
  2. Setup network for the sandbox container;
  3. Create the sandbox container task.

In this way, when there is any issue in RunPodSandbox, we can still try to clean up in a defer. However, if any cleanup step fails, the sandbox container on disk can still represent the sandbox, and kubelet will eventually guarantee it is properly cleaned up.
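
Below is a minimal, self-contained Go sketch of that ordering. The helpers (createSandboxContainer, setupPodNetwork, createSandboxTask, teardownPodNetwork, removeSandboxContainer) are hypothetical stand-ins for the CRI plugin internals, not containerd's actual API; the point is only the ordering and the best-effort cleanup in the defer.

package main

import (
	"context"
	"errors"
	"log"
)

// Hypothetical stand-ins for the CRI plugin internals; the names, signatures
// and canned errors are illustrative only, not containerd's real API.
func createSandboxContainer(ctx context.Context, id string) error { return nil }
func setupPodNetwork(ctx context.Context, id string) error        { return nil }
func createSandboxTask(ctx context.Context, id string) error      { return errors.New("task creation timed out") }
func teardownPodNetwork(ctx context.Context, id string) error     { return errors.New("CNI DEL timed out") }
func removeSandboxContainer(ctx context.Context, id string) error { return nil }

// runPodSandbox sketches the proposed ordering: sandbox container record
// first, then network, then task, with best-effort cleanup in a defer.
func runPodSandbox(ctx context.Context, id string) (retErr error) {
	// 1. Create the sandbox container first, so its record is on disk.
	if err := createSandboxContainer(ctx, id); err != nil {
		return err
	}

	networkUp := false
	defer func() {
		if retErr == nil {
			return
		}
		if networkUp {
			if err := teardownPodNetwork(ctx, id); err != nil {
				// Teardown failed: keep the sandbox container record so kubelet
				// keeps seeing the sandbox and retries StopPodSandbox later,
				// instead of the IP silently leaking.
				log.Printf("leaving sandbox %s on disk for later cleanup: %v", id, err)
				return
			}
		}
		_ = removeSandboxContainer(ctx, id)
	}()

	// 2. Set up the network for the sandbox container.
	if err := setupPodNetwork(ctx, id); err != nil {
		return err
	}
	networkUp = true

	// 3. Create the sandbox container task.
	return createSandboxTask(ctx, id)
}

func main() {
	if err := runPodSandbox(context.Background(), "example-sandbox"); err != nil {
		log.Printf("RunPodSandbox failed: %v", err)
	}
}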

@Random-Liu
Member Author

Random-Liu commented Jul 20, 2021

@qiutongs volunteered to help fix this issue. :)

Thanks!

@AkihiroSuda AkihiroSuda added the area/cri Container Runtime Interface (CRI) label Jul 20, 2021
@mikebrow
Member

Related: we were going to need to refactor this a bit anyway to do some sort of pinning model for PID reuse (#5630).

@SergeyKanzhelev
Contributor

is this one related? #5438

@Random-Liu
Member Author

Random-Liu commented Jul 22, 2021

is this one related? #5438

Yeah, that can fix the context timeout issue, but it can't fix the case where network teardown simply returns an error.

To make containerd work reliably in error cases, we should keep the sandbox around until it is properly cleaned up.

However, if the context timeout thing can solve most known problems, this becomes relatively lower priority. :)

@yylt
Contributor

yylt commented Jul 27, 2021

There are other situations, like containerd/go-cni#60.

Most setups use more than one CNI plugin, for example:

{
  "name":"cni0",
  "cniVersion":"0.3.1",
  "plugins":[
    {
      "type":"flannel",
      "delegate":{
        "forceAddress":true,
        "hairpinMode": true,
        "isDefaultGateway":true
      }
    },
    {
      "type":"portmap",
      "capabilities":{
        "portMappings":true
      }
    }
  ]
}

The CNI remove order runs from the last plugin to the first. If a plugin fails, the plugins before it are not executed, and the first plugin is usually the IPAM (IP address allocation) type; a short sketch of this follows the diagram below.

Y: Executed
X: Not Executed

CNI CREATE:
  +----------+     +----------+     +----------+
  |  plugin  | --> |  plugin  | --> |  plugin  |
  +----------+     +----------+     +----------+
       Y                Y                Y

CNI REMOVE:
  +----------+     +----------+     +----------+
  |  plugin  | <-- |  plugin  | <-- |  plugin  |
  +----------+     +----------+     +----------+
       X                X                Y
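
A minimal Go sketch of that reverse-order teardown follows; the execPluginDel helper and its canned failure are made up for illustration, and this is not go-cni's actual code.

package main

import (
	"errors"
	"fmt"
)

// execPluginDel stands in for invoking one chained plugin's DEL command;
// the plugin names and the failure are made up for illustration.
func execPluginDel(name string) error {
	if name == "portmap" {
		return errors.New("plugin binary not found")
	}
	fmt.Printf("%s DEL succeeded\n", name)
	return nil
}

// teardownChain mimics how a chained CNI config is torn down: plugins run in
// reverse order and the loop stops at the first error, so when a later plugin
// fails, the first plugin (usually the one holding the IPAM allocation) never
// runs its DEL and the IP is never released.
func teardownChain(plugins []string) error {
	for i := len(plugins) - 1; i >= 0; i-- {
		if err := execPluginDel(plugins[i]); err != nil {
			return fmt.Errorf("failed to destroy network: %w", err)
		}
	}
	return nil
}

func main() {
	// Same shape as the config above: flannel (with delegated IPAM) then portmap.
	if err := teardownChain([]string{"flannel", "portmap"}); err != nil {
		fmt.Println(err) // portmap fails first, flannel's DEL is skipped -> leaked IP
	}
}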

@qiutongs
Contributor

qiutongs commented Aug 4, 2021

/assign

@skmatti

skmatti commented Aug 11, 2021

We are also seeing network teardown errors with IP leaks:

Failed to destroy network for sandbox \"92bac2b1b1e49c0f9b2884ae51f855ea1cc4ae598e252c8e41655fe6ec1c695c\"" error="netplugin failed with no error message: signal: killed"

There are a lot more of these errors than leaked IPs, and I was not sure if this error message leads to IP leaks. Is there a way to know whether these errors are related to IP leaks?

@sharkymcdongles

sharkymcdongles commented Oct 9, 2021

I created a one-liner to work around this on GKE, for anyone interested. It takes about 3 minutes or so to fully correct things after running, depending on the number of pods. I set it to only check one namespace, but you could probably make it check more.

for node in `kubectl get pods -o wide --namespace NAMESPACE | grep Creating | awk '{print $7}' | sort -u`; do gcloud beta compute ssh --zone "ZONE" --project "PROJECT" $node --command "$(cat ~/script.sh)"; done

script.sh contents for cilium:

for hash in $(sudo find /var/lib/cni/networks/gke-pod-network -iregex '/var/lib/cni/networks/gke-pod-network/[0-9].*' -exec head -n1 {} \;); do if [ -z $(sudo ctr -n k8s.io c ls | grep $hash | awk '{print $1}') ]; then sudo grep -ilr $hash /var/lib/cni/networks/gke-pod-network; fi; done | sudo xargs rm

sudo systemctl restart kubelet containerd;

On calico the script changes slightly:

for hash in $(sudo find /var/lib/cni/networks/k8s-pod-network -iregex '/var/lib/cni/networks/k8s-pod-network/[0-9].*' -exec head -n1 {} \;); do if [ -z $(sudo ctr -n k8s.io c ls | grep $hash | awk '{print $1}') ]; then sudo grep -ilr $hash /var/lib/cni/networks/k8s-pod-network; fi; done | sudo xargs rm

sudo systemctl restart kubelet containerd;

The fix is supposedly in 1.4.7 containerd, but COS with that version won't debut for another month or so. This should help until then.

Explanation: the script reads the container hashes recorded in the files under /var/lib/cni/networks; if it finds a matching running container, it doesn't touch that file. It then removes all files whose hashes are not mapped to running containers and restarts kubelet and containerd. You could delete the files for running hashes as well, since they would be recreated, but it's much better to do it this way: the IPs won't need to be reassigned and the running pods keep working as expected.

@rueian

rueian commented Nov 30, 2021

for hash in $(sudo find /var/lib/cni/networks/gke-pod-network -iregex '/var/lib/cni/networks/gke-pod-network/[0-9].*' -exec head -n1 {} \;); do if [ -z $(sudo ctr -n k8s.io c ls | grep $hash | awk '{print $1}') ]; then sudo grep -ilr $hash /var/lib/cni/networks/gke-pod-network; fi; done | sudo xargs rm

sudo systemctl restart kubelet containerd;

We use containerd 1.4.8 on GKE with anetd (Cilium), but the problem is not fixed yet.

@dmcgowan dmcgowan modified the milestones: 1.6, 1.7 Dec 9, 2021
@anfernee

Any plan to fix this issue? We still see this problem in GKE.

@qiutongs
Contributor

qiutongs commented Apr 4, 2022

We are also seeing network teardown errors with IP leaks:

Failed to destroy network for sandbox \"92bac2b1b1e49c0f9b2884ae51f855ea1cc4ae598e252c8e41655fe6ec1c695c\"" error="netplugin failed with no error message: signal: killed"

There are a lot more of these errors than leaked IPs, and I was not sure if this error message leads to IP leaks. Is there a way to know whether these errors are related to IP leaks?

Yeah, this likely leads to IP leaks. A previous issue about the timeout case of destroying the network was fixed in #5438, but other error cases were not covered.

@qiutongs
Contributor

qiutongs commented Apr 4, 2022

Any plan to fix this issue? We still see this problem in GKE.

I am prioritizing it now. Will update my PR asap.

@qiutongs
Contributor

qiutongs commented Apr 5, 2022

The idea behind fixing this bug is to have kubelet see the failed-to-destroy-network sandbox; kubelet will then call StopPodSandbox, which should clean up the leaked IP.

Here is how things work on the kubelet side:

  1. PLEG (the Pod Lifecycle Event Generator) puts the pod status in the cache.
  2. The pod worker reads the pod status from the cache.
  3. Kubelet's syncPod kills the changed sandbox.

@qiutongs
Contributor

qiutongs commented Apr 5, 2022

Therefore, we want two things here (see the sketch below):

  1. CRI PodSandboxStatus returns the failed-to-destroy-network sandbox
  2. CRI StopPodSandbox can clean things up properly
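
A rough Go sketch of that interaction, with the sandbox type and both functions as made-up stand-ins for the real CRI API and kubelet code:

package main

import (
	"errors"
	"fmt"
)

// Made-up sandbox representation; not the real CRI (k8s.io/cri-api) types.
type sandbox struct {
	id       string
	notReady bool
}

// stopPodSandbox stands in for the CRI StopPodSandbox call, which re-runs
// network teardown for a sandbox whose previous teardown failed.
func stopPodSandbox(s sandbox) error {
	fmt.Printf("re-running network teardown for sandbox %s\n", s.id)
	return errors.New("CNI DEL still failing") // pretend it fails again
}

// syncPods sketches the kubelet-side loop: as long as PodSandboxStatus keeps
// reporting the sandbox (point 1), kubelet keeps calling StopPodSandbox
// (point 2) on every sync until cleanup succeeds, so the IP is eventually
// released instead of being leaked.
func syncPods(statuses []sandbox) {
	for _, s := range statuses {
		if !s.notReady {
			continue
		}
		if err := stopPodSandbox(s); err != nil {
			fmt.Printf("will retry sandbox %s on the next sync: %v\n", s.id, err)
		}
	}
}

func main() {
	syncPods([]sandbox{{id: "example-sandbox", notReady: true}})
}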

@xvzf

xvzf commented Apr 12, 2022

Any ETA for this?

@qiutongs
Contributor

qiutongs commented May 3, 2022

Still working on it in #5904. I hope to merge it in the next 2 weeks.

@aojea
Contributor

aojea commented May 3, 2022

The CNI remove order runs from the last plugin to the first. If a plugin fails, the plugins before it are not executed, and the first plugin is usually the IPAM (IP address allocation) type.

Are you saying the CNI plugin fails on a DEL command? That doesn't seem to match the spec: https://github.com/containernetworking/cni/blob/main/SPEC.md#del-remove-container-from-network-or-un-apply-modifications

Plugins should generally complete a DEL action without error even if some resources are missing
Plugins MUST accept multiple DEL calls for the same (CNI_CONTAINERID, CNI_IFNAME) pair, and return success if the interface in question, or any modifications added, are missing.

@aojea
Contributor

aojea commented May 3, 2022

The delete should not return an error. @mikebrow, might this be related to the CNI bug we fixed recently?
containerd/go-cni#98

@qiutongs
Contributor

qiutongs commented May 7, 2022

@aojea An example that illustrates the problem: the binary for the last plugin is missing. There will be a "binary not found" error in both network setup and network teardown.

@aojea
Contributor

aojea commented May 9, 2022

@aojea An example that illustrates the problem: the binary for the last plugin is missing. There will be a "binary not found" error in both network setup and network teardown.

@squeed Is this something the CNI spec leaves to the implementations, or is this a scenario that the spec considers?

@squeed
Contributor

squeed commented May 10, 2022

@squeed Is this something the CNI spec leaves to the implementations, or is this a scenario that the spec considers?

@aojea Good question. This is not something the CNI spec covers: what to do with a configuration where ADD succeeded but now DEL will fail?

As a general rule, the preference is for plugins to effect a delete if at all possible. But this presents us with an interesting quandary: what do we do if a particular plugin does not execute? It seems like there are three cases we need to consider:

  1. The configuration changes between ADD and DEL such that something is now broken. Not sure how to handle this case.
  2. A plugin binary goes away
  3. A plugin is successfully called, but deletion times out.

The reality is, I'm not sure we can write any CNI spec language that is safe in all cases here. There are two basic approaches:

  • Ignore all failures, proceeding at all costs
  • Require plugins to succeed

Unfortunately, if we choose to ignore failures, we're just as likely to suffer resource leaks. Thus, I'm not sure if there is a one-size-fits-all option here.

A few choices, which I can bring up to the CNI maintainers:

  1. Some sort of way for plugins to be skipped on delete, or skipped if it is known the container is going away, so that only plugins that need to clean up external resources are executed.
  2. A GC method (already proposed: Enhanced plugin / network lifecycle (INIT / DEINIT / GC), containernetworking/cni#822) that would allow the runtime to say "these are all the containers, please purge all others".

@qiutongs
Contributor

Waiting for #7069 to be merged so that we can have more E2E tests.

@qiutongs
Contributor

#5904 is merged. @samuelkarp and I will work on backporting.

@sparr

sparr commented Oct 20, 2022

#5904 being merged should fix this for 1.7.
Can we remove that milestone, with just the backport remaining?

@yumingqiao

We hit this issue in our production on v1.5.5; thanks for fixing it, @qiutongs. When will the fix be backported to v1.5?

@estesp
Member

estesp commented Oct 28, 2022

#7464 is the backport for release/1.5 and will be merged and available in a future 1.5.x release

@awx-fuyuanchu

awx-fuyuanchu commented Jan 4, 2023

Same issue here.
We use GKE v1.22 and encountered IP leakage on one node; the containerd version is 1.5.13.

The pod creation was stuck and kubelet reported Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6edb3f3677c4df9a07e47cb205d3cab5fe6a810d80d38b8c50d8a6eff1bed9c8": failed to allocate for range 0: no IP addresses available in range set: 10.69.32.1-10.69.32.126

I checked the node and there were only 45 running pods on it. However, I got 135 pods in the Ready state when running crictl pods | grep -v NotReady | wc -l on the node, so most pods were not cleaned up by kubelet.

Take the pod id 90487414c798e as an example:
Log of kubelet

Dec 31 13:03:56 node-pool-757cbbe8-wpl7 kubelet[2345]: I1231 13:03:56.675987    2345 kubelet.go:2142] "SyncLoop (PLEG): event for pod" pod="infra/runner-spmuwnyf-project-2128-concurrent-5tff9n" event=&{ID:bc86dd91-e83a-48c1-87b9-e74109988187 Type:ContainerStarted Data:90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618}
Dec 31 13:10:51 node-pool-757cbbe8-wpl7 kubelet[2345]: E1231 13:10:51.254019    2345 remote_runtime.go:144] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618"
Dec 31 13:10:51 node-pool-757cbbe8-wpl7 kubelet[2345]: E1231 13:10:51.254099    2345 kuberuntime_manager.go:993] "Failed to stop sandbox" podSandboxID={Type:containerd ID:90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618}
Dec 31 13:14:53 node-pool-757cbbe8-wpl7 kubelet[2345]: E1231 13:14:53.058122    2345 remote_runtime.go:144] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to stop container \"0a4b225c30d3a478ec66af76aabd4c9a76dba9b00c875ab4b93b7cf6c0a5b8a9\": failed to kill container \"0a4b225c30d3a478ec66af76aabd4c9a76dba9b00c875ab4b93b7cf6c0a5b8a9\": context deadline exceeded: unknown" podSandboxID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618"
Dec 31 13:14:53 node-pool-757cbbe8-wpl7 kubelet[2345]: E1231 13:14:53.058182    2345 kuberuntime_manager.go:993] "Failed to stop sandbox" podSandboxID={Type:containerd ID:90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618}

Log of containerd

Dec 31 13:03:33 node-pool-757cbbe8-wpl7 containerd[2223]: 2022-12-31 13:03:32.813 [INFO][188155] plugin.go 324: Calico CNI found existing endpoint: &{{WorkloadEndpoint projectcalico.org/v3} {node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0 runner-spmuwnyf-project-2128-concurrent-5 infra  bc86dd91-e83a-48c1-87b9-e74109988187 1691742427 0 2022-12-31 13:03:32 +0000 UTC <nil> <nil> map[pod:runner-spmuwnyf-project-2128-concurrent-5 projectcalico.org/namespace:infra projectcalico.org/orchestrator:k8s projectcalico.org/serviceaccount:gitlab-runner] map[] [] []  []} {k8s  node-pool-757cbbe8-wpl7  runner-spmuwnyf-project-2128-concurrent-5tff9n eth0 gitlab-runner [] []   [kns.infra ksa.infra.gitlab-runner] calic63534fe44b  [] []}} ContainerID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618" Namespace="infra" Pod="runner-spmuwnyf-project-2128-concurrent-5tff9n" WorkloadEndpoint="node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-"
Dec 31 13:03:33 node-pool-757cbbe8-wpl7 containerd[2223]: 2022-12-31 13:03:32.813 [INFO][188155] k8s.go 74: Extracted identifiers for CmdAddK8s ContainerID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618" Namespace="infra" Pod="runner-spmuwnyf-project-2128-concurrent-5tff9n" WorkloadEndpoint="node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0"
Dec 31 13:03:33 node-pool-757cbbe8-wpl7 containerd[2223]: 2022-12-31 13:03:32.819 [INFO][188155] utils.go 345: Calico CNI passing podCidr to host-local IPAM: 10.69.32.0/25 ContainerID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618" Namespace="infra" Pod="runner-spmuwnyf-project-2128-concurrent-5tff9n" WorkloadEndpoint="node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0"
Dec 31 13:03:33node-pool-757cbbe8-wpl7 containerd[2223]: 2022-12-31 13:03:32.968 [INFO][188155] k8s.go 383: Populated endpoint ContainerID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618" Namespace="infra" Pod="runner-spmuwnyf-project-2128-concurrent-5tff9n" WorkloadEndpoint="node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0", GenerateName:"runner-spmuwnyf-project-2128-concurrent-5", Namespace:"infra", SelfLink:"", UID:"bc86dd91-e83a-48c1-87b9-e74109988187", ResourceVersion:"1691742427", Generation:0, CreationTimestamp:time.Date(2022, time.December, 31, 13, 3, 32, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"pod":"runner-spmuwnyf-project-2128-concurrent-5", "projectcalico.org/namespace":"infra", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"gitlab-runner"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"node-pool-757cbbe8-wpl7", ContainerID:"", Pod:"runner-spmuwnyf-project-2128-concurrent-5tff9n", Endpoint:"eth0", ServiceAccountName:"gitlab-runner", IPNetworks:[]string{"10.69.32.93/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.infra", "ksa.infra.gitlab-runner"}, InterfaceName:"calic63534fe44b", MAC:"", Ports:[]v3.WorkloadEndpointPort(nil), AllowSpoofedSourcePrefixes:[]string(nil)}}
Dec 31 13:03:33 node-pool-757cbbe8-wpl7 containerd[2223]: 2022-12-31 13:03:32.969 [INFO][188155] k8s.go 384: Calico CNI using IPs: [10.69.32.93/32] ContainerID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618" Namespace="infra" Pod="runner-spmuwnyf-project-2128-concurrent-5tff9n" WorkloadEndpoint="node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0"
Dec 31 13:03:33 node-pool-757cbbe8-wpl7 containerd[2223]: 2022-12-31 13:03:32.969 [INFO][188155] dataplane_linux.go 68: Setting the host side veth name to calic63534fe44b ContainerID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618" Namespace="infra" Pod="runner-spmuwnyf-project-2128-concurrent-5tff9n" WorkloadEndpoint="gke--platform--preprod--demo--2--node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0"
Dec 31 13:03:33 node-pool-757cbbe8-wpl7 containerd[2223]: 2022-12-31 13:03:32.984 [INFO][188155] dataplane_linux.go 453: Disabling IPv4 forwarding ContainerID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618" Namespace="infra" Pod="runner-spmuwnyf-project-2128-concurrent-5tff9n" WorkloadEndpoint="gke--platform--preprod--demo--2--node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0"
Dec 31 13:03:33 node-pool-757cbbe8-wpl7 containerd[2223]: 2022-12-31 13:03:32.991 [INFO][188155] k8s.go 411: Added Mac, interface name, and active container ID to endpoint ContainerID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618" Namespace="infra" Pod="runner-spmuwnyf-project-2128-concurrent-5tff9n" WorkloadEndpoint="gke--platform--preprod--demo--2--node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"gke--platform--preprod--demo--2--node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0", GenerateName:"runner-spmuwnyf-project-2128-concurrent-5", Namespace:"infra", SelfLink:"", UID:"bc86dd91-e83a-48c1-87b9-e74109988187", ResourceVersion:"1691742427", Generation:0, CreationTimestamp:time.Date(2022, time.December, 31, 13, 3, 32, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"pod":"runner-spmuwnyf-project-2128-concurrent-5", "projectcalico.org/namespace":"infra", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"gitlab-runner"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"node-pool-757cbbe8-wpl7", ContainerID:"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618", Pod:"runner-spmuwnyf-project-2128-concurrent-5tff9n", Endpoint:"eth0", ServiceAccountName:"gitlab-runner", IPNetworks:[]string{"10.69.32.93/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.infra", "ksa.infra.gitlab-runner"}, InterfaceName:"calic63534fe44b", MAC:"a6:77:2a:27:ca:cf", Ports:[]v3.WorkloadEndpointPort(nil), AllowSpoofedSourcePrefixes:[]string(nil)}}
Dec 31 13:03:33 node-pool-757cbbe8-wpl7 containerd[2223]: 2022-12-31 13:03:33.005 [INFO][188155] k8s.go 489: Wrote updated endpoint to datastore ContainerID="90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618" Namespace="infra" Pod="runner-spmuwnyf-project-2128-concurrent-5tff9n" WorkloadEndpoint="gke--platform--preprod--demo--2--node--pool--757cbbe8--wpl7-k8s-runner--spmuwnyf--project--2128--concurrent--5tff9n-eth0"
Dec 31 13:03:56 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:03:56.128084912Z" level=info msg="starting signal loop" namespace=k8s.io path=/run/containerd/io.containerd.runtime.v2.task/k8s.io/90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618 pid=188837
Dec 31 13:03:56 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:03:56.428131703Z" level=error msg="ContainerStatus for \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\": not found"
Dec 31 13:03:56 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:03:56.428572797Z" level=error msg="PodSandboxStatus for \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find sandbox: not found"
Dec 31 13:03:56 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:03:56.448918396Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:runner-spmuwnyf-project-2128-concurrent-5tff9n,Uid:bc86dd91-e83a-48c1-87b9-e74109988187,Namespace:infra,Attempt:0,} returns sandbox id \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\""
Dec 31 13:03:56 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:03:56.456030582Z" level=info msg="CreateContainer within sandbox \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\" for container &ContainerMetadata{Name:build,Attempt:0,}"
Dec 31 13:04:15 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:04:15.161913567Z" level=info msg="CreateContainer within sandbox \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\" for &ContainerMetadata{Name:build,Attempt:0,} returns container id \"0a4b225c30d3a478ec66af76aabd4c9a76dba9b00c875ab4b93b7cf6c0a5b8a9\""
Dec 31 13:04:17 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:04:17.027796597Z" level=info msg="CreateContainer within sandbox \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\" for container &ContainerMetadata{Name:helper,Attempt:0,}"
Dec 31 13:04:49 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:04:49.001426550Z" level=info msg="CreateContainer within sandbox \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\" for &ContainerMetadata{Name:helper,Attempt:0,} returns container id \"237a34714da1a06482d80dc729710dd083942b404ee906c444ff0f5412651c66\""
Dec 31 13:08:51 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:08:51.253575903Z" level=info msg="StopPodSandbox for \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\""
Dec 31 13:10:51 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:10:51.253747407Z" level=error msg="StopPodSandbox for \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\" failed" error="failed to stop container \"237a34714da1a06482d80dc729710dd083942b404ee906c444ff0f5412651c66\": failed to kill container \"237a34714da1a06482d80dc729710dd083942b404ee906c444ff0f5412651c66\": context deadline exceeded: unknown"
Dec 31 13:12:53 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:12:53.057644692Z" level=info msg="StopPodSandbox for \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\""
Dec 31 13:14:53 node-pool-757cbbe8-wpl7 containerd[2223]: time="2022-12-31T13:14:53.057747235Z" level=error msg="StopPodSandbox for \"90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618\" failed" error="failed to stop container \"0a4b225c30d3a478ec66af76aabd4c9a76dba9b00c875ab4b93b7cf6c0a5b8a9\": failed to kill container \"0a4b225c30d3a478ec66af76aabd4c9a76dba9b00c875ab4b93b7cf6c0a5b8a9\": context deadline exceeded: unknown"

The containerd-shim-runc-v2 process is still running:

ps -eaf | grep 90487414c798e
root      188837       1  0  2022 ?        00:01:02 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618 -address /run/containerd/containerd.sock

 ps -eaf | grep 188837
root      188837       1  0  2022 ?        00:01:02 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 90487414c798e765f396edaf1d7dd2845d3a735a62ea424bfac5155eb9868618 -address /run/containerd/containerd.sock
65535     188858  188837  0  2022 ?        00:00:00 /pause
root     3930576 3818155  0 08:20 pts/0    00:00:00 grep --colour=auto 188837

@samuelkarp
Member

Closing since this has been fixed in main
