readiness prober timeout do not run as expected #123931

Closed
Kyle596 opened this issue Mar 14, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@Kyle596

Kyle596 commented Mar 14, 2024

What happened?

  1. We are seeing an issue where the readiness probe timeout does not behave as expected: in practice the probe runs every (2 minutes + the timeout setting).

  2. The official Kubernetes doc below defines timeoutSeconds as: "Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1." After reading this, I assumed that if the probe command ran longer than the timeout period, the kubelet would stop that probe and start the next one, but that is not what actually happens. The next probe only starts after the previous probe completes, or after the previous probe has lasted (2 minutes + the timeout setting).

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes

  3. I also checked the code, and found that it defines timeout = timeout setting + default timeout (2 minutes) as the timeout, to leave some time for the runtime to do cleanup, unless the timeout setting is 0s. I think the 2 minutes is there for the containerd runtime to do the necessary work, but without this detail the doc description easily confuses users.
    // ExecSync executes a command in the container, and returns the stdout output.

    [screenshot]

What did you expect to happen?

Could you please modify the default timeout in the code, or add a description to the official doc?

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a ConfigMap like the one below. Its script runs two commands: 1. create a file named with the current time; 2. sleep 300s.
apiVersion: v1
data:
  container-readiness-probe-handler.sh: |-
    #!/bin/sh
    touch /tmp/$(date +'%Y-%m-%d-%H-%M-%S')
    sleep 300
kind: ConfigMap
metadata:
  name: tesh
  namespace: default
  2. Create a pod with the nginx image and mount the above ConfigMap as a volume at the /script path. Set timeoutSeconds to 8s.
 containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx-test
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - /script/container-readiness-probe-handler.sh
      failureThreshold: 7
      initialDelaySeconds: 6
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 8
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
    volumeMounts:
    - mountPath: /script
      name: volume-1710223576413
  volumes:
  - configMap:
      defaultMode: 493
      name: tesh
    name: volume-1710223576413
  3. After the pod is running, you will find that a new file is created in the /tmp path every (2min + 8s) = 128s.
    [screenshot]

  4. strace the CRI calls down to containerd and you will find that the probe action is triggered every (2min + 8s).
    [screenshot]

  5. If timeout setting < probe command running time < 2 minutes, the next readiness probe runs after the previous probe command completes, not after the timeout period.

Anything else we need to know?

No response

Kubernetes version

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3-aliyun.1

Cloud provider

Alibaba Cloud

OS version

# On Linux:
$ cat /etc/os-release
NAME="Alibaba Cloud Linux (Aliyun Linux)"
VERSION="2.1903 LTS (Hunting Beagle)"
ID="alinux"
ID_LIKE="rhel fedora centos anolis"
VERSION_ID="2.1903"
PRETTY_NAME="Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)"
ANSI_COLOR="0;31"
HOME_URL="https://www.aliyun.com/"

$ uname -a
Linux iZj6c9edi9gpcd5qesfwxvZ 4.19.91-25.6.al7.x86_64 #1 SMP Thu Feb 10 19:15:17 CST 2022 x86_64 x86_64 x86_64 GNU/Linux


### Install tools



### Container runtime (CRI) and version (if applicable)



### Related plugins (CNI, CSI, ...) and versions (if applicable)

@Kyle596 Kyle596 added the kind/bug Categorizes issue or PR as related to a bug. label Mar 14, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 14, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 14, 2024
@tallclair
Member

The timeout code you linked to is the timeout on the CRI call to the runtime, not the actual timeout on the probe exec call. That should be the timeout specified in the ExecSync request here:

func (r *remoteRuntimeService) execSyncV1(ctx context.Context, containerID string, cmd []string, timeout time.Duration) (stdout []byte, stderr []byte, err error) {
	timeoutSeconds := int64(timeout.Seconds())
	req := &runtimeapi.ExecSyncRequest{
		ContainerId: containerID,
		Cmd:         cmd,
		Timeout:     timeoutSeconds,
	}

If the longer timeout is being hit, that likely indicates a problem with the runtime, e.g. it timed out creating the exec process, or making the request. What version of containerd are you running?

I wasn't able to reproduce this with a quick check:

apiVersion: v1
kind: Pod
metadata:
  name:   probe-timeout
spec:
  volumes:
    - name: probe-log
      emptyDir: {}
  containers:
    - name:  alpine
      image: alpine:latest
      args: [sh, -c, 'while true; do sleep 100000; done']
      readinessProbe:      
        failureThreshold: 3
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
        exec:
          command: ['sh', '-c', 'date >> /probe/readiness-log.txt && sleep 300']
      volumeMounts:
        - mountPath: /probe
          name: probe-log

Output

$ kubectl exec -t probe-timeout -- tail -f /probe/readiness-log.txt
Thu Mar 14 21:00:19 UTC 2024
Thu Mar 14 21:00:29 UTC 2024
Thu Mar 14 21:00:34 UTC 2024
Thu Mar 14 21:00:39 UTC 2024
Thu Mar 14 21:00:49 UTC 2024
Thu Mar 14 21:00:59 UTC 2024
Thu Mar 14 21:01:09 UTC 2024
Thu Mar 14 21:01:19 UTC 2024
Thu Mar 14 21:01:29 UTC 2024
Thu Mar 14 21:01:39 UTC 2024
Thu Mar 14 21:01:49 UTC 2024
Thu Mar 14 21:01:54 UTC 2024
Thu Mar 14 21:01:59 UTC 2024
Thu Mar 14 21:02:09 UTC 2024

Note that the interval is sometimes 5s, and sometimes 10s. The 10s interval comes from the probe period, but the 5s is probably from retries on probe error. I'm not sure why it's inconsistent.

In either case, this sounds like it's probably a containerd issue, not a Kubernetes issue.

@Kyle596
Author

Kyle596 commented Mar 15, 2024

@tallclair
containerd containerd.io 1.6.20 405c12edca9fce99095c9827fca59ce0d6da128e

I think this depends on the image. If you use nginx instead of alpine, you can reproduce the issue. I also ran strace for the busybox and nginx images, and found that the sleep 300 is not triggered in the busybox or alpine image, so the readiness probe is triggered every timeout period.

  1. From the strace, you can see that when the CRI executes the command, it only runs the date command without the following sleep command, so the probe occurs every timeout period; there is only the touch command process, without a lingering probe process.
    [screenshot]
    [screenshot]

  2. But when using the nginx image, sleep 300 runs after the date command; from pstree you can check that the pid is triggered.
    [screenshot]
    [screenshot]

@tallclair
Member

Ah, interesting. I swapped the alpine image for nginx, and sure enough it replicated:

apiVersion: v1
kind: Pod
metadata:
  name:   probe-timeout
spec:
  volumes:
    - name: probe-log
      emptyDir: {}
  containers:
    - name:  test
      image: nginx
      args: [sh, -c, 'while true; do sleep 100000; done']
      readinessProbe:
        timeoutSeconds: 5
        exec:
          command: ['sh', '-c', 'date >> /probe/readiness-log.txt && sleep 45']
      volumeMounts:
        - mountPath: /probe
          name: probe-log

Output:

$ kubectl exec -t probe-timeout -- tail -f /probe/readiness-log.txt
Tue Mar 19 23:35:13 UTC 2024
Tue Mar 19 23:35:58 UTC 2024
Tue Mar 19 23:36:43 UTC 2024
Tue Mar 19 23:37:28 UTC 2024
Tue Mar 19 23:38:13 UTC 2024
Tue Mar 19 23:38:58 UTC 2024
Tue Mar 19 23:39:43 UTC 2024
Tue Mar 19 23:40:28 UTC 2024

The probe takes the full 45 seconds, rather than the desired 5.

This looks like a containerd issue, not a Kubelet issue. I found containerd/containerd#9568 reporting the same issue.

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Mar 27, 2024
@rtheis

rtheis commented Apr 3, 2024

We are seeing the same problem and have been able to recreate it (#123931 (comment)) on the following version combinations:

  • Kubernetes version 1.25 with containerd version 1.6.27
  • Kubernetes version 1.29 with containerd version 1.7.13
  • Kubernetes version 1.29 with containerd version 1.7.14

@rtheis

rtheis commented Apr 3, 2024

We also tested Kubernetes with CRI-O, and that combination works as expected.

@haircommander
Contributor

Seems like a containerd issue, so I think we can close this and track it over in containerd/containerd#9568

/close

@k8s-ci-robot
Contributor

@haircommander: Closing this issue.

In response to this:

Seems like a containerd issue, so I think we can close this and track it over in containerd/containerd#9568

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rtheis

rtheis commented Apr 19, 2024
