ExecSync did not return according to the timeout set in the request #10094

rtheis · 2024-04-19T13:36:40Z

Description

While both kubernetes/kubernetes#123931 and #9568 have been closed, the problem still exists.

Steps to reproduce the issue

See kubernetes/kubernetes#123931 (comment) for the recreate steps.

Describe the results you received and expected

Exec timeout is honored.

What version of containerd are you using?

1.7.15

Any other relevant information

No response

Show configuration if it is related to CRI plugin.

No response

fuweid · 2024-04-19T15:34:25Z

Sorry for missing the comment in closed #9568. @tallclair @rtheis

I would like to explain more detail about shim IO handler first.

The following chart is about setns process's stdout dataflow.
When setns-process exits, we don't want to have any log loss issue so we will wait for IO closed.
The pipe+w writer should be closed after setns-process exited.
After pipe writer is closed, since there is only one writer, the pipe reader will return EOF in shim side.
And then shim closes the fifo+w and CRI plugin will get the EOF from fifo+r.

+--------------------------------+      +-----------shim----------+   +--------CRI--------+
|                                |      |                         |   |                   |
| setns-process/stdout (pipe+w)----------->(pipe+r)---->(fifo+w)-------->(fifo+r)-->FILE  |
|                                |      |                         |   |                   |
+--------------------------------+      +-------------------------+   +-------------------+

However, setns-process can have child processes or grandchild processes.
When ExecSync timeouts, CRI only kills setns-process.
Since we don't have sub-cgroup to track all the processes created by the setns-process, we can't force-stop these processes. So, these processes will hold pipe+w file descriptor. Based on current design in shim, shim will wait for EOF from pipe+r.

In 'sh', '-c', 'date >> /probe/readiness-log.txt && sleep 45' case, we kill the sh but sleep 45 is still holding the writer's file descriptor. That's why it takes so long.

Currently, I don't have better solution to cleanup all the processes created by setns-process without introducing sub-cgroup for each exec call. So, I introduces a drain_exec_sync_io_timeout in #7832.

By default, there is no timeout to drain IO. If user wants to fastfail after timeout, we can set it to 10ms.

Side note: The setns-process and init container should take responsibility to cleanup the orphaned processes #10002 (comment)

rtheis · 2024-04-25T14:32:51Z

@fuweid thank you. That works as you've noted. I'm concerned with changing the value since you all left the default value to effectively disable the timeout. I'd like a reasonable timeout but don't want data loss either. What is your recommendation?

fuweid · 2024-04-25T16:06:31Z

ping @dmcgowan @mikebrow

I think that the default value could be 3-5 seconds.

In #3361, we use 2 seconds to drain IO. 3-5 seconds looks safe

rtheis added the kind/bug label Apr 19, 2024

This was referenced Apr 19, 2024

ExecSync did not return according to the timeout set in the request #9568

Closed

readiness prober timeout do not run as expected kubernetes/kubernetes#123931

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ExecSync did not return according to the timeout set in the request #10094

ExecSync did not return according to the timeout set in the request #10094

rtheis commented Apr 19, 2024

fuweid commented Apr 19, 2024

rtheis commented Apr 25, 2024 •

edited

fuweid commented Apr 25, 2024

ExecSync did not return according to the timeout set in the request #10094

ExecSync did not return according to the timeout set in the request #10094

Comments

rtheis commented Apr 19, 2024

Description

Steps to reproduce the issue

Describe the results you received and expected

What version of containerd are you using?

Any other relevant information

Show configuration if it is related to CRI plugin.

fuweid commented Apr 19, 2024

rtheis commented Apr 25, 2024 • edited

fuweid commented Apr 25, 2024

rtheis commented Apr 25, 2024 •

edited