readiness prober timeout do not run as expected #123931

Closed
Kyle596 opened this issue Mar 14, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@Kyle596

Kyle596 commented Mar 14, 2024

What happened?

  1. We are seeing an issue where the readiness probe timeout does not behave as expected: in practice the probe runs every (2 minutes + the timeout setting).

  2. The official Kubernetes doc below defines timeoutSeconds as: "Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1." After reading this, I assumed that if the probe command ran longer than the timeout period, the kubelet would stop that probe and start the next one, but that is not what actually happens. The next probe only starts after the previous probe completes, or after the previous probe has lasted (2 minutes + the timeout setting).

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes

  3. I also checked the code, and found that it defines timeout = timeout setting + default timeout (2 minutes) as the timeout, to leave some time for the runtime to do cleanup, unless the timeout setting is 0s. I think the 2 minutes is there for the containerd runtime to do the necessary work, but without this detail the doc description easily confuses users.
    // ExecSync executes a command in the container, and returns the stdout output.

    [screenshot]

What did you expect to happen?

Could you please modify the default timeout in the code, or add a description to the official doc?

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a ConfigMap like the one below. Its script runs two commands: 1. create a file named with the current time; 2. sleep 300s.
apiVersion: v1
data:
  container-readiness-probe-handler.sh: |-
    #!/bin/sh
    touch /tmp/$(date +'%Y-%m-%d-%H-%M-%S')
    sleep 300
kind: ConfigMap
metadata:
  name: tesh
  namespace: default
  2. Create a pod with the nginx image and mount the above ConfigMap as a volume at the /script path. Set timeoutSeconds to 8s.
 containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx-test
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - /script/container-readiness-probe-handler.sh
      failureThreshold: 7
      initialDelaySeconds: 6
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 8
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
    volumeMounts:
    - mountPath: /script
      name: volume-1710223576413
  volumes:
  - configMap:
      defaultMode: 493
      name: tesh
    name: volume-1710223576413
  3. After the pod is running, you will find that a new file is created in the /tmp path every (2min + 8s) = 128s.
    [screenshot]

  4. strace the CRI calls down to containerd and you will find that the probe action is triggered every (2min + 8s).
    [screenshot]

  5. If timeout setting < probe command running time < 2 minutes, the next readiness probe runs after the previous probe command completes, not after the timeout period.

Anything else we need to know?

No response

Kubernetes version

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3-aliyun.1

Cloud provider

Alibaba Cloud

OS version

# On Linux:
$ cat /etc/os-release
NAME="Alibaba Cloud Linux (Aliyun Linux)"
VERSION="2.1903 LTS (Hunting Beagle)"
ID="alinux"
ID_LIKE="rhel fedora centos anolis"
VERSION_ID="2.1903"
PRETTY_NAME="Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)"
ANSI_COLOR="0;31"
HOME_URL="https://www.aliyun.com/"

$ uname -a
Linux iZj6c9edi9gpcd5qesfwxvZ 4.19.91-25.6.al7.x86_64 #1 SMP Thu Feb 10 19:15:17 CST 2022 x86_64 x86_64 x86_64 GNU/Linux


### Install tools



### Container runtime (CRI) and version (if applicable)



### Related plugins (CNI, CSI, ...) and versions (if applicable)

@Kyle596 Kyle596 added the kind/bug Categorizes issue or PR as related to a bug. label Mar 14, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 14, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 14, 2024
@tallclair
Member

The timeout code you linked to is the timeout on the CRI call to the runtime, not the actual timeout on the probe exec call. That should be the timeout specified in the ExecSync request here:

func (r *remoteRuntimeService) execSyncV1(ctx context.Context, containerID string, cmd []string, timeout time.Duration) (stdout []byte, stderr []byte, err error) {
	timeoutSeconds := int64(timeout.Seconds())
	req := &runtimeapi.ExecSyncRequest{
		ContainerId: containerID,
		Cmd:         cmd,
		Timeout:     timeoutSeconds,
	}

If the longer timeout is being hit, that likely indicates a problem with the runtime, e.g. it timed out creating the exec process, or making the request. What version of containerd are you running?

I wasn't able to reproduce this with a quick check:

apiVersion: v1
kind: Pod
metadata:
  name:   probe-timeout
spec:
  volumes:
    - name: probe-log
      emptyDir: {}
  containers:
    - name:  alpine
      image: alpine:latest
      args: [sh, -c, 'while true; do sleep 100000; done']
      readinessProbe:      
        failureThreshold: 3
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
        exec:
          command: ['sh', '-c', 'date >> /probe/readiness-log.txt && sleep 300']
      volumeMounts:
        - mountPath: /probe
          name: probe-log

Output

$ kubectl exec -t probe-timeout -- tail -f /probe/readiness-log.txt
Thu Mar 14 21:00:19 UTC 2024
Thu Mar 14 21:00:29 UTC 2024
Thu Mar 14 21:00:34 UTC 2024
Thu Mar 14 21:00:39 UTC 2024
Thu Mar 14 21:00:49 UTC 2024
Thu Mar 14 21:00:59 UTC 2024
Thu Mar 14 21:01:09 UTC 2024
Thu Mar 14 21:01:19 UTC 2024
Thu Mar 14 21:01:29 UTC 2024
Thu Mar 14 21:01:39 UTC 2024
Thu Mar 14 21:01:49 UTC 2024
Thu Mar 14 21:01:54 UTC 2024
Thu Mar 14 21:01:59 UTC 2024
Thu Mar 14 21:02:09 UTC 2024

Note that the interval is sometimes 5s, and sometimes 10s. The 10s interval comes from the probe period, but the 5s is probably from retries on probe error. I'm not sure why it's inconsistent.

In either case, this sounds like it's probably a containerd issue, not a Kubernetes issue.

@Kyle596
Author

Kyle596 commented Mar 15, 2024

@tallclair
containerd containerd.io 1.6.20 405c12edca9fce99095c9827fca59ce0d6da128e

I think this depends on the image. If you use nginx instead of alpine, you can reproduce the issue. I also ran strace for the busybox and nginx images, and found that the sleep 300 is not triggered in the busybox or alpine image, so the readiness probe is triggered every timeout period.

  1. From the strace, you can see that when the CRI executes the command, it only runs the date command without the following sleep command, so the probe occurs every timeout period; there is only the touch command process, without a lingering probe process.
    [screenshot]
    [screenshot]

  2. But when using the nginx image, sleep 300 runs after the date command; from pstree you can check that the pid is triggered.
    [screenshot]
    [screenshot]

@tallclair
Member

Ah, interesting. I swapped the alpine image for nginx, and sure enough it replicated:

apiVersion: v1
kind: Pod
metadata:
  name:   probe-timeout
spec:
  volumes:
    - name: probe-log
      emptyDir: {}
  containers:
    - name:  test
      image: nginx
      args: [sh, -c, 'while true; do sleep 100000; done']
      readinessProbe:
        timeoutSeconds: 5
        exec:
          command: ['sh', '-c', 'date >> /probe/readiness-log.txt && sleep 45']
      volumeMounts:
        - mountPath: /probe
          name: probe-log

Output:

$ kubectl exec -t probe-timeout -- tail -f /probe/readiness-log.txt
Tue Mar 19 23:35:13 UTC 2024
Tue Mar 19 23:35:58 UTC 2024
Tue Mar 19 23:36:43 UTC 2024
Tue Mar 19 23:37:28 UTC 2024
Tue Mar 19 23:38:13 UTC 2024
Tue Mar 19 23:38:58 UTC 2024
Tue Mar 19 23:39:43 UTC 2024
Tue Mar 19 23:40:28 UTC 2024

The probe takes the full 45 seconds, rather than the desired 5.

This looks like a containerd issue, not a Kubelet issue. I found containerd/containerd#9568 reporting the same issue.

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Mar 27, 2024
@rtheis

rtheis commented Apr 3, 2024

We are seeing the same problem and have been able to recreate it (#123931 (comment)) on the following version combinations:

  • Kubernetes version 1.25 with containerd version 1.6.27
  • Kubernetes version 1.29 with containerd version 1.7.13
  • Kubernetes version 1.29 with containerd version 1.7.14

@rtheis

rtheis commented Apr 3, 2024

We also tested Kubernetes with CRI-O, and that combination works as expected.

@haircommander
Contributor

Seems like a containerd issue, so I think we can close this and track it over in containerd/containerd#9568

/close

@k8s-ci-robot
Contributor

@haircommander: Closing this issue.

In response to this:

Seems like a containerd issue, so I think we can close this and track it over in containerd/containerd#9568

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rtheis

rtheis commented Apr 19, 2024
