dockerd doesn't kill healthcheck processes after timeout #43737

Closed
seleznev opened this issue Jun 22, 2022 · 4 comments · Fixed by #43739
Labels
kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. status/claimed version/20.10

Comments

@seleznev

Description

dockerd already has logic to gracefully stop health check processes after a timeout (/daemon/exec.go#L277-L291).

However, that logic appears to be completely broken, because the daemon.containerd.SignalProcess() call (/daemon/exec.go#L279) is passed a context that has already been canceled. SignalProcess() just returns a "context canceled" error and sends no signal.
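To illustrate the failure mode described above, here is a minimal Go sketch. It is not the dockerd code: signalProcess is a hypothetical stand-in for a context-aware call such as daemon.containerd.SignalProcess(), and signalAfterCancel is an invented helper. It assumes only the standard behavior that such calls refuse to do work once their context is done. Using a fresh context for the signal is shown as one possible remedy, not necessarily what the actual fix does.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// signalProcess is a hypothetical stand-in for a context-aware RPC.
// Like most such calls, it does nothing once its context is done.
func signalProcess(ctx context.Context, sig string) error {
	if err := ctx.Err(); err != nil {
		return err // the signal is never delivered
	}
	fmt.Println("delivered", sig)
	return nil
}

// signalAfterCancel mimics a health check whose context has been canceled
// on timeout, then tries to signal the process in two different ways.
func signalAfterCancel() (reused, fresh error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // stands in for the health check timeout firing

	// Problematic pattern: reuse the context that was just canceled.
	reused = signalProcess(ctx, "TERM")

	// One possible remedy (an assumption, not the actual patch):
	// signal with a context that is not tied to the canceled probe.
	fresh = signalProcess(context.Background(), "TERM")
	return reused, fresh
}

func main() {
	reused, fresh := signalAfterCancel()
	fmt.Println("reused context:", reused)
	fmt.Println("is context.Canceled:", errors.Is(reused, context.Canceled))
	fmt.Println("fresh context:", fresh)
}
```

With the reused context the call fails before anything is delivered, which matches the "context canceled" error reported here.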

Steps to reproduce the issue:

  1. Create Dockerfile:
    FROM ubuntu:22.04
    
    HEALTHCHECK --interval=5s --timeout=5s \
       CMD ["sleep", "3600"]
    
    CMD ["sleep", "infinity"]
    
  2. Build image:
    docker build --tag=healthcheck-test .
    
  3. Start container:
    docker run -d --rm --name=healthcheck-test healthcheck-test
    
  4. Wait for a few health check intervals:
    sleep 30
    
  5. Check processes in the container:
    docker exec healthcheck-test ps axuf
    
  • Cleanup:
    docker rm --force healthcheck-test # remove container
    docker rmi healthcheck-test # remove image
    

Describe the results you received:

More than one sleep 3600 process:

$ docker build --tag=healthcheck-test .
Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM ubuntu:22.04
 ---> 27941809078c
Step 2/3 : HEALTHCHECK --interval=5s --timeout=5s    CMD ["sleep", "3600"]
 ---> Running in 248a9dcfaa6f
Removing intermediate container 248a9dcfaa6f
 ---> 16d09d0a1b09
Step 3/3 : CMD ["sleep", "infinity"]
 ---> Running in 55ef832b3170
Removing intermediate container 55ef832b3170
 ---> 7e8b71425a0a
Successfully built 7e8b71425a0a
Successfully tagged healthcheck-test:latest
$ docker run -d --rm --name=healthcheck-test healthcheck-test
41e8e2eb21d0bdd485e647c6ec1273474b19ba616d284d48d53ea607edd96841
$ sleep 30
$ docker exec healthcheck-test ps axuf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          25  0.0  0.0   7060  1664 ?        Rs   12:01   0:00 ps axuf
root          19  0.2  0.0   2788  1036 ?        Ss   12:01   0:00 sleep 3600
root          13  0.0  0.0   2788  1036 ?        Ss   12:00   0:00 sleep 3600
root           7  0.0  0.0   2788  1056 ?        Ss   12:00   0:00 sleep 3600
root           1  0.0  0.0   2788  1108 ?        Ss   12:00   0:00 sleep infinity

Describe the results you expected:

Zero or one sleep 3600 process:

$ docker exec healthcheck-test ps axuf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          25  0.0  0.0   7060  1664 ?        Rs   12:01   0:00 ps axuf
root          19  0.2  0.0   2788  1036 ?        Ss   12:01   0:00 sleep 3600
root           1  0.0  0.0   2788  1108 ?        Ss   12:00   0:00 sleep infinity

Additional information you deem important (e.g. issue happens only occasionally):

N/A

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:02:57 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:01:03 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.6
  GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc:
  Version:          1.1.2
  GitCommit:        v1.1.2-0-ga916309
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 10
  Running: 1
  Paused: 0
  Stopped: 9
 Images: 119
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc version: v1.1.2-0-ga916309
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.13.0-51-generic
 Operating System: Ubuntu 20.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.34GiB
 Name: uk-ubnt-61
 ID: BXVR:65LG:FDYH:IX7Q:U2LT:LQI6:P5B5:ZFEG:EIMS:WPWM:D3ND:RN4H
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 90
  Goroutines: 74
  System Time: 2022-06-22T15:03:06.064800737+03:00
  EventsListeners: 0
 Username: 2gis
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  docker-hub.2gis.ru:5444
  127.0.0.0/8
 Registry Mirrors:
  https://docker-registry-proxy.2gis.io/
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

N/A

@ndeloof
Contributor

ndeloof commented Jun 23, 2022

Thanks for reporting this issue with a detailed diagnostic. I'll investigate today.

@seleznev
Author

Hello, @ndeloof! Many thanks for your help! I tried the patch (cherry-picked onto v20.10.17, actually) and it works well for TERM. But it looks like the KILL part is still broken (sorry I didn't test earlier 😞).

I think the // TERM signal worked branch is taken immediately because attachErr is not empty (because of "context canceled" 🥲). But I'm not sure.
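The intended TERM-then-KILL escalation can be sketched as follows. This is a simplified illustration, not the dockerd implementation: fakeProcess, stopGracefully, and runScenario are hypothetical names, no real signals are sent, and the 20 ms grace period is arbitrary. The point of the sketch is that the decision to escalate depends only on whether the process has actually exited, so an unrelated error (such as "context canceled") cannot be mistaken for a successful TERM.

```go
package main

import (
	"fmt"
	"time"
)

// fakeProcess is a hypothetical in-memory stand-in for a health check
// process; no real signals are sent.
type fakeProcess struct {
	ignoresTERM bool
	exited      chan struct{} // closed once the process has "exited"
	received    []string      // signals delivered so far, in order
}

func newFakeProcess(ignoresTERM bool) *fakeProcess {
	return &fakeProcess{ignoresTERM: ignoresTERM, exited: make(chan struct{})}
}

// signal records the signal and "exits" the process when appropriate.
func (p *fakeProcess) signal(sig string) {
	p.received = append(p.received, sig)
	if sig == "KILL" || (sig == "TERM" && !p.ignoresTERM) {
		close(p.exited)
	}
}

// stopGracefully sends TERM, waits up to grace for the process to exit,
// and escalates to KILL if it has not exited by then.
func stopGracefully(p *fakeProcess, grace time.Duration) []string {
	p.signal("TERM")
	select {
	case <-p.exited:
		// TERM was enough.
	case <-time.After(grace):
		p.signal("KILL")
		<-p.exited
	}
	return p.received
}

// runScenario tries one case with a short, arbitrary grace period.
func runScenario(ignoresTERM bool) []string {
	return stopGracefully(newFakeProcess(ignoresTERM), 20*time.Millisecond)
}

func main() {
	fmt.Println("honors TERM: ", runScenario(false))
	fmt.Println("ignores TERM:", runScenario(true))
}
```

A process that ignores TERM, like the no-sigterm program in the steps that follow, should receive both signals in this model.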


Steps to reproduce:

I couldn't find a simple test app that ignores TERM, so:

  1. Create no-sigterm.c with content:
    #include <signal.h>
    #include <unistd.h>
    
    int main(void) {
        signal(SIGTERM, SIG_IGN);
    
        sleep(3600); // 1 hour
    
        return 0;
    }
    
  2. Build it:
    gcc -o no-sigterm no-sigterm.c
    
  3. Create Dockerfile (I increased the interval to cover all timeouts, so only one healthcheck process should be alive at a time):
    FROM ubuntu:22.04
    
    COPY /no-sigterm /usr/local/bin/
    
    HEALTHCHECK --interval=30s --timeout=5s \
       CMD ["no-sigterm"]
    
    CMD ["sleep", "infinity"]
    
  4. Build the image and run the container:
    docker build --tag=healthcheck-test .
    docker run -d --rm --name=healthcheck-test healthcheck-test
    
  5. Wait for a few health check intervals:
    sleep 120
    
  6. Check processes in the container:
    docker exec healthcheck-test ps axuf
    

Describe the results you received:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          31  0.0  0.0   7060  1440 ?        Rs   14:22   0:00 ps axuf
root          25  0.0  0.0   2640   852 ?        Ss   14:21   0:00 no-sigterm
root          13  0.0  0.0   2640   844 ?        Ss   14:21   0:00 no-sigterm
root           7  0.0  0.0   2640   856 ?        Ss   14:20   0:00 no-sigterm
root           1  0.0  0.0   2788   972 ?        Ss   14:20   0:00 sleep infinity

Describe the results you expected:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          31  0.0  0.0   7060  1440 ?        Rs   14:22   0:00 ps axuf
root          25  0.0  0.0   2640   852 ?        Ss   14:21   0:00 no-sigterm
root           1  0.0  0.0   2788   972 ?        Ss   14:20   0:00 sleep infinity

@rrauenza

@thaJeztah In which versions can we assume this has been fixed? 20.10 and 22.06? And 23 and 24? I want to blacklist versions in our environment where the commands are not timed out (or require people to wrap the commands in /usr/bin/timeout).

@thaJeztah
Member

thaJeztah commented Jul 11, 2023

#44018 is on the 20.10.18 milestone, and #43994 is part of 23.0.0 and up (22.06 was never released; the version scheme changed to 23.0.0 instead)

Older versions (19.03) are EOL, so won't have that fix.
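For anyone who wants to encode the answer above as a version check, here is a rough Go sketch. The function name hasHealthcheckTimeoutFix is hypothetical, and the rule is derived only from this reply (20.10.18 and later within 20.10, or 23.0.0 and up). It assumes plain "major.minor.patch" version strings and does not account for distribution backports.

```go
package main

import "fmt"

// hasHealthcheckTimeoutFix reports whether a Docker Engine version is
// expected to include the fix, based only on the versions named in this
// thread. Hypothetical helper; distro backports are not considered.
func hasHealthcheckTimeoutFix(version string) bool {
	var major, minor, patch int
	n, _ := fmt.Sscanf(version, "%d.%d.%d", &major, &minor, &patch)
	if n != 3 {
		return false // unparseable: conservatively treat as not fixed
	}
	switch {
	case major >= 23:
		return true
	case major == 20 && minor == 10:
		return patch >= 18
	default:
		return false // e.g. 19.03, which is EOL
	}
}

func main() {
	for _, v := range []string{"19.03.15", "20.10.17", "20.10.18", "23.0.0"} {
		fmt.Printf("%-9s fixed=%v\n", v, hasHealthcheckTimeoutFix(v))
	}
}
```

Treating unparseable versions as not fixed is a deliberately conservative choice for a blacklist.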
