ExecSync did not return according to the timeout set in the request #9568
Comments
@fuweid PTAL
Hi @Wabct, did you backport that patch, and did you check whether the exec process is a zombie? Or you can try the v1.7 release to verify it again; currently, that patch hasn't been backported into v1.6 yet.
@fuweid, I backported that patch into v1.6.14 and found it did not work. Then I tried containerd v1.7.11-8-g6d0a3b66c and got the same result. The child process of ExecSync didn't become a zombie. After the ExecSync process exits, its child is re-parented to the main process of the container, which is pid 1 in the pid namespace. It will continue executing until exit. Can we return the timeout error when this happens?
@Wabct hmm, I already added a test case to cover that. Before draining IO, containerd should kill the exec process first. Even if there are children re-parented to pid 1, containerd-shim-runc-v2 will close the IO pipe after 2 seconds, and containerd will wait to drain IO until a timeout (CRI has a setting for this). So which shim are you using? The exec process doesn't have a sub-cgroup, so it's unlikely we can clean up all processes created by the exec process. I was thinking that we could introduce a sub-cgroup for exec, but it might cause performance issues if we create a sub-cgroup for every exec process. You know, exec probes go crazy.
@fuweid I tried the following case: I checked the pipe and found that the shim still holds the pipe of the child process.
@Wabct Would you please use the following command to run that test with your image? I can't reproduce it with your steps.
$ # apply the following diff in your local containerd repo
$ ENABLE_CRI_SANDBOXES=1 FOCUS="TestContainerDrainExecIOAfterExit" CONTAINERD_RUNTIME="runc.v2" BUILDTAGS="no_aufs no_btrfs" make cri-integration
=== RUN TestContainerDrainExecIOAfterExit
container_exec_test.go:35: Create a sandbox
main_test.go:707: Image "docker.io/library/ubuntu:22.04" already exists, not pulling.
container_exec_test.go:45: Create a container
container_exec_test.go:59: Start the container
container_exec_test.go:65: Exec in container
E1226 03:25:54.014965 93153 remote_runtime.go:412] ExecSync e672dae15c720c51dd3d5fc5efbe3db251b8934094cd4ecb71d188dfbf36cc28 '/bin/bash -c sleep 20&' from runtime service failed: rpc error: code = Unknown desc = failed to exec in container: failed to drain exec process "d5b66ebad3e5863d3c43943d9f6af7cb1bf67c23c9491f1f466385acbb5db561" io: failed to drain exec process "d5b66ebad3e5863d3c43943d9f6af7cb1bf67c23c9491f1f466385acbb5db561" io in 1ns because io is still held by other processes
--- PASS: TestContainerDrainExecIOAfterExit (3.51s)
PASS
diff --git a/integration/container_exec_test.go b/integration/container_exec_test.go
index be2635f68..8b052aa83 100644
--- a/integration/container_exec_test.go
+++ b/integration/container_exec_test.go
@@ -21,10 +21,9 @@ import (
"testing"
"time"
- "github.com/containerd/containerd/integration/images"
-
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
+ criruntime "k8s.io/cri-api/pkg/apis/runtime/v1"
)
func TestContainerDrainExecIOAfterExit(t *testing.T) {
@@ -37,7 +36,7 @@ func TestContainerDrainExecIOAfterExit(t *testing.T) {
sb, sbConfig := PodSandboxConfigWithCleanup(t, "sandbox", "container-exec-drain-io-after-exit")
var (
- testImage = images.Get(images.BusyBox)
+ testImage = "docker.io/library/ubuntu:22.04"
containerName = "test-container-exec"
)
@@ -47,7 +46,8 @@ func TestContainerDrainExecIOAfterExit(t *testing.T) {
cnConfig := ContainerConfig(
containerName,
testImage,
- WithCommand("sh", "-c", "sleep 365d"),
+ WithCommand("bash", "-c", "sleep 365d"),
+ WithPidNamespace(criruntime.NamespaceMode_POD),
)
cn, err := runtimeService.CreateContainer(sb, cnConfig, sbConfig)
@@ -63,10 +63,6 @@ func TestContainerDrainExecIOAfterExit(t *testing.T) {
}()
t.Log("Exec in container")
- _, _, err = runtimeService.ExecSync(cn, []string{"sh", "-c", "sleep 365d &"}, 5*time.Second)
+ _, _, err = runtimeService.ExecSync(cn, []string{"/bin/bash", "-c", "sleep 20&"}, 5*time.Second)
require.ErrorContains(t, err, "failed to drain exec process")
-
- t.Log("Exec in container")
- _, _, err = runtimeService.ExecSync(cn, []string{"sh", "-c", "sleep 2s &"}, 10*time.Second)
- require.NoError(t, err, "should drain IO in time")
}
diff --git a/script/test/utils.sh b/script/test/utils.sh
index 8eedc040e..ce3b1681b 100755
--- a/script/test/utils.sh
+++ b/script/test/utils.sh
@@ -51,7 +51,7 @@ if [ -z "${CONTAINERD_CONFIG_FILE}" ]; then
version=2
[plugins."io.containerd.grpc.v1.cri"]
- drain_exec_sync_io_timeout = "10s"
+ drain_exec_sync_io_timeout = "1ns"
EOF
if command -v sestatus >/dev/null 2>&1; then
I found the problem. The community shim links the child process through copyIO: the shim and containerd have separate pipe links, so the shim can close its pipe when the exec process exits. Thanks for your reply. @fuweid
I don't quite understand the resolution in #9568 (comment), but we just got a duplicate report of this in Kubernetes: kubernetes/kubernetes#123931 Can you clarify what the solution is? Has this been fixed, and there's just a patch we need to pick up? |
Description
Create a Pod and configure a blocking exec probe. After setting the timeout, I found that the health check does not time out as expected.
Steps to reproduce the issue
The `runtimeRequestTimeout` of kubelet uses the default value of `2m`.
Test with a blocking exec command.
Pod.yaml
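The Pod.yaml attachment was not captured in this thread. A hypothetical manifest matching the description (a liveness exec probe that blocks for 300s with an 8s timeout; the image, names, and periods are illustrative, not the reporter's actual file) could look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: exec-timeout-repro
spec:
  containers:
  - name: app
    image: ubuntu:22.04
    command: ["sleep", "365d"]
    livenessProbe:
      exec:
        # blocks for 300s, far longer than timeoutSeconds
        command: ["/bin/sh", "-c", "sleep 300"]
      timeoutSeconds: 8
      periodSeconds: 10
```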
Describe the results you received and expected
I found that the requested time interval is `2m8s` instead of `8s`. The `2m8s` interval is caused by the following code in kubelet.
`ExecSync` blocks until the cmd execution is completed and exits. In this use case, `ExecSync` will block for 300s.
I did some debugging and found that `ExecSync` blocks in `<-attachDone` of the `execInternal` function, waiting for IO to exit.
I found a related issue #7832 and tried it, but found that it didn't solve the problem.
I set `drain_exec_sync_io_timeout` to 1ns, and it successfully entered `drainExecSyncIO` as expected. When execution reaches `execProcess.Delete`, the function still blocks in `p.io.Wait()`, which is still waiting for the IO to be released.

What version of containerd are you using?
containerd github.com/containerd/containerd v1.6.14
Any other relevant information
Kubernetes v1.25.6
Show configuration if it is related to CRI plugin.
No response