Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[add version/20.10] A container has been force exited but docker ps reports the container Up #43482

Closed
pekuz opened this issue Apr 13, 2022 · 9 comments

Comments

@pekuz
Copy link

pekuz commented Apr 13, 2022

Description

Randomly, we are observing that docker cannot inspect some containers or it cannot restart them, or docker stats hungs. It is a long lasting issue happening with low frequency per docker instance but now I have identified one specific situation when it can happen.

If docker stop CONTAINER struggles to stop container and it uses the force then docker can end in an inconsistent state (measured by subsequent docker ps).

Typical workaround to resolve it is to restart the affected docker instance.

Steps to reproduce the issue:

  1. docker stop 86f164e7c63eed894e226b39b1e55e2715429a04049f0df4724c45d1189fb87f
  2. journal reported "Container 86f164e7c63eed894e226b39b1e55e2715429a04049f0df4724c45d1189fb87f failed to exit within 61 seconds of signal 15 - using the force"
  3. docker ps

Describe the results you received:

$ docker ps
CONTAINER ID   IMAGE                                                                                                  COMMAND                  CREATED        STATUS       PORTS                                                                      NAMES
b5ed684f16a5   docker-eurofins-shared.packages.eurofins.local/com.eurofins.observability.metricmon:1.2                "/metricbeat/metricb…"   4 days ago     Up 4 days                                                                               observability-metricsmon-eesb
0fa892288812   docker-eurofins-shared.packages.eurofins.local/com.eurofins.observability.logmon:1.2                   "/usr/share/filebeat…"   4 weeks ago    Up 4 weeks                                                                              observability-logmon-eesb
283ae2cadd99   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-consul-synchronizer:3.1.6          "sh -c 'java ${JAVA_…"   7 weeks ago    Up 7 weeks   0.0.0.0:30216->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-consul-synchronizer-01
7a2baed6269f   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-message-transfer-solution:3.0.14   "java -Xss512k -XX:N…"   2 months ago   Up 7 weeks   0.0.0.0:30016->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-mts-01
2c1071b643e2   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-monitoring-smoke-tests:3.0.4       "java -jar eesb-moni…"   2 months ago   Up 7 weeks   0.0.0.0:32216->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-monitoring-smoke-tests-01
86f164e7c63e   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-monitoring-agent-keep:3.0.2        "sh -c 'java ${JAVA_…"   2 months ago   Up 6 weeks   0.0.0.0:32016->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-monitoring-agent-keep-01
ea7161d545c2   docker.packages.eurofins.local/consul:1.9.10  

it's an unedited output, the important line is:

86f164e7c63e   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-monitoring-agent-keep:3.0.2        "sh -c 'java ${JAVA_…"   2 months ago   Up 6 weeks   0.0.0.0:32016->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-monitoring-agent-keep-01

Describe the results you expected:

$ docker ps
CONTAINER ID   IMAGE                                                                                                  COMMAND                  CREATED        STATUS       PORTS                                                                      NAMES
b5ed684f16a5   docker-eurofins-shared.packages.eurofins.local/com.eurofins.observability.metricmon:1.2                "/metricbeat/metricb…"   4 days ago     Up 4 days                                                                               observability-metricsmon-eesb
0fa892288812   docker-eurofins-shared.packages.eurofins.local/com.eurofins.observability.logmon:1.2                   "/usr/share/filebeat…"   4 weeks ago    Up 4 weeks                                                                              observability-logmon-eesb
283ae2cadd99   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-consul-synchronizer:3.1.6          "sh -c 'java ${JAVA_…"   7 weeks ago    Up 7 weeks   0.0.0.0:30216->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-consul-synchronizer-01
7a2baed6269f   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-message-transfer-solution:3.0.14   "java -Xss512k -XX:N…"   2 months ago   Up 7 weeks   0.0.0.0:30016->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-mts-01
2c1071b643e2   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-monitoring-smoke-tests:3.0.4       "java -jar eesb-moni…"   2 months ago   Up 7 weeks   0.0.0.0:32216->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-monitoring-smoke-tests-01
ea7161d545c2   docker.packages.eurofins.local/consul:1.9.10  

i.e. 86f164e7c63e is not listed because it is not Up. The container is exited, despite forcefully.

Additional information you deem important (e.g. issue happens only occasionally):

It sounds similar to #33369, here however using a stock docker and I might be closer why it is perceived random:

Container 86f164e7c63eed894e226b39b1e55e2715429a04049f0df4724c45d1189fb87f failed to exit within 61 seconds of signal 15 - using the force

i.e. because the need for using force is random. And when force is used the issue can occur.

The docker/for-linux#1354 shares the symptoms, there it's believed that Docker fluentd log driver can be configured so that container stopping must be forced.

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.5
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:33:55 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          19.03.9
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       9d988398e7
  Built:            Fri May 15 00:24:05 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 7
  Running: 7
  Paused: 0
  Stopped: 0
 Images: 61
 Server Version: 19.03.9
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.21.1.el7.x86_64
 Operating System: RHEL
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.51GiB
 Name: EU50MQVQ020.area1.eurofins.local
 ID: DZ3K:UWMJ:ZKPY:LFDX:AQQN:O2US:3O4Y:5FDV:J7AO:DXWP:ZTAI:PBN2
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

VM

@pekuz
Copy link
Author

pekuz commented Apr 13, 2022

Again a docker journal reads:

"Container 7a2baed6269f66fcc1315614120e8a8ca07fd9f15165688d6b608fed16eabc47 failed to exit within 61 seconds of signal 15 - using the force"

so let's testify this bug report, if it survives time test of 5 hours:

$ docker ps
CONTAINER ID   IMAGE                                                                                                  COMMAND                  CREATED        STATUS       PORTS                                                                      NAMES
b5ed684f16a5   docker-eurofins-shared.packages.eurofins.local/com.eurofins.observability.metricmon:1.2                "/metricbeat/metricb…"   6 days ago     Up 6 days                                                                               observability-metricsmon-eesb
0fa892288812   docker-eurofins-shared.packages.eurofins.local/com.eurofins.observability.logmon:1.2                   "/usr/share/filebeat…"   5 weeks ago    Up 5 weeks                                                                              observability-logmon-eesb
283ae2cadd99   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-consul-synchronizer:3.1.6          "sh -c 'java ${JAVA_…"   7 weeks ago    Up 7 weeks   0.0.0.0:30216->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-consul-synchronizer-01
7a2baed6269f   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-message-transfer-solution:3.0.14   "java -Xss512k -XX:N…"   2 months ago   Up 7 weeks   0.0.0.0:30016->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-mts-01
2c1071b643e2   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-monitoring-smoke-tests:3.0.4       "java -jar eesb-moni…"   2 months ago   Up 7 weeks   0.0.0.0:32216->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-monitoring-smoke-tests-01
86f164e7c63e   docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-monitoring-agent-keep:3.0.2        "sh -c 'java ${JAVA_…"   2 months ago   Up 6 weeks   0.0.0.0:32016->8080/tcp                                                    uat-itaag108-corporatehmb06-eesb-monitoring-agent-keep-01
ea7161d545c2   docker.packages.eurofins.local/consul:1.9.10                                                           "docker-entrypoint.s…"   4 months ago   Up 7 weeks   8300-8302/tcp, 8301-8302/udp, 8600/tcp, 8600/udp, 0.0.0.0:8646->8500/tcp   uat-itaag108-corporatehmb06-eesb-consul-01
$ docker inspect 7a2baed6269f
^C
$

The new observations match the bug report.

@thaJeztah
Copy link
Member

Not sure if this specific issue has been fixed since, but I see you're running a version of docker that is no longer maintained (and EOL versions of containerd and runc).

If you have a test environment to try on, are you still able to reproduce this issue on a current version of docker (20.10.x) and containerd (1.5.x)?

@pekuz
Copy link
Author

pekuz commented Apr 14, 2022

Hi Sebastiaan,

it another test environment we have a docker at version:

Client: Docker Engine - Community
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        370c289
 Built:             Fri Apr  9 22:45:33 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr  9 22:43:57 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

It comes with containerd 1.4.4. If reproduced on that docker version would it be enough, please?

@pekuz
Copy link
Author

pekuz commented Apr 14, 2022

The docker was restarted and then docker behaves as expected even for the force-exited containers. So this issue is reproducible only if docker is already in a "tired" state. What transitions docker to the "tired" state is so far unknown.

@pekuz
Copy link
Author

pekuz commented Apr 28, 2022

On a host with:

$ docker version
Client: Docker Engine - Community
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        370c289
 Built:             Fri Apr  9 22:45:33 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr  9 22:43:57 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

SETUP NOTE: the fluentd port (defaults to --log-opt fluentd-address=localhost:24224) is not intentionally accepting connections.

the repro steps were:

$ docker create --name zombie --log-driver=fluentd --log-opt fluentd-async=true alpine sh -c 'for i in $(seq 1000000); do echo $i; sleep 0.001; done'

$ docker start zombie
$ sleep 5
$ docker stop zombie &

$ sleep 20 # 10s is default container shutdown grace period

$ docker ps

CONTAINER ID   STATUS           NAMES
891bd7a1a4e2   Up 28 seconds    zombie

The zombie container was Up, it survived the force stop after the graceful shutdown period.

The docker journal had expected info message:

time="2022-04-28T09:34:21.713671493+02:00" level=info msg="Container 891bd7a1a4e2a4c0691466d27785f162edd958600d35136f63e798f665bb88cb failed to exit within 10 seconds of signal 15 - using the force"

My expectation is that the force stop covers the docker log driver too.

@pekuz pekuz changed the title A container has been force exited but docker ps reports the container Up [add version/20.10] A container has been force exited but docker ps reports the container Up Apr 28, 2022
pekuz referenced this issue Apr 28, 2022
The flag ForceStopAsyncSend was added to fluent logger lib in v1.5.0 (at
this time named AsyncStop) to tell fluentd to abort sending logs
asynchronously as soon as possible, when its Close() method is called.
However this flag was broken because of the way the lib was handling it
(basically, the lib could be stucked in retry-connect loop without
checking this flag).

Since fluent logger lib v1.7.0, calling Close() (when ForceStopAsyncSend
is true) will really stop all ongoing send/connect procedure,
wherever it's stucked.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
(cherry picked from commit bd61629)
Signed-off-by: Wesley <wppttt@amazon.com>
@pekuz
Copy link
Author

pekuz commented Apr 28, 2022

f9df098 is an interesting commit.

Following it I eventually ended at https://github.com/moby/moby/releases/tag/v20.10.13 which reads:

Updates the fluentd log driver to prevent a potential daemon crash, and prevent
containers from hanging when using the fluentd-async-connect=true and the
remote server is unreachable #43147.

@thaJeztah
Copy link
Member

@pekuz are you still seeing this issue after upgrading?

@pekuz
Copy link
Author

pekuz commented Nov 17, 2022

At the time, the upgrade was prevented by unavailability of newer Docker versions in RHEL repos. Meanwhile our R&D focus shifted elsewhere. I cannot verify in reasonable time frame.

@thaJeztah
Copy link
Member

Thanks @pekuz

I'll go ahead and close this ticket for now, but we can reopen if the issue still persists and if you have more details 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants