docker stats hangs #39523

Open
invokablegmbh opened this issue Jul 15, 2019 · 16 comments · May be fixed by #47748

Comments

@invokablegmbh

invokablegmbh commented Jul 15, 2019

Description

Docker works fine for a few weeks, then at some point it stops responding properly. Some commands work, others do not:
e.g. "docker stats" hangs, while "docker ps -a" works.

I searched through the various issues here that describe different hangs, but found nothing that solves my issue.

Restarting the docker daemon fixes the issue for a few days...

Steps to reproduce the issue:

  1. We just use Docker normally, starting and stopping containers.

Describe the results you received:
Docker hangs for whatever reason.

Describe the results you expected:
Docker should work as expected.

Additional information you deem important (e.g. issue happens only occasionally):
In other threads there was the suggestion to send a SIGUSR1 signal to dockerd and post the result. I have also run strace on the stuck "docker stats" command.
It can be found here: https://git.jar.media/snippets/8
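For reference, this is roughly how those diagnostics can be gathered (a sketch; dockerd logs the exact location of the stack-dump file it writes, typically under /var/run/docker/):

kill -SIGUSR1 $(pidof dockerd)                        # dockerd writes a goroutine-stacks-<timestamp>.log file
strace -f -o strace_of_docker_stats.txt docker stats  # trace the stuck CLI call and save the output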

Output of docker version:

Docker version 18.09.7, build 2d0083d

Output of docker info:

Containers: 58
 Running: 58
 Paused: 0
 Stopped: 0
Images: 776
Server Version: 18.09.7
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-9-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 48
Total Memory: 125.8GiB
Name: h19
ID: OJOK:YAOX:OTVB:ALUM:SZUE:ZKJD:T26K:OPEF:XP6O:33GQ:2GRT:KC36
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):
No additional info.

@thaJeztah
Member

The strace likely won't have many details (as it would only trace the CLI, which is likely not where the issue is).

I attached the files here for easier finding:
goroutine-stacks-2019-07-15T073325_0200.log.zip
strace_of_docker_stats.txt

@thaJeztah
Member

@cpuguy83 @kolyshkin ptal

@thaJeztah
Member

@jarmediagmbh is there anything useful in the daemon logs? Could you describe a bit what kind of load / containers you're running? (long-lived / short-lived containers? do you use health-checks? do containers produce a lot of logs? do you use bind-mounts?)

@invokablegmbh
Author

is there anything useful in the daemon logs?

  • No. Before opening the issue here we tried many different things. Of course we also analyzed several log files (syslog, daemon log, dmesg, etc.). We did not find anything suspicious, just the normal "a new network interface appeared/disappeared" messages.

Could you describe a bit what kind of load / containers you're running? (long-lived / short-lived containers?
We have about 60 containers running. Each container is deleted and recreated every morning (we have a very special use case.. ;) ). The containers provide development environments for web development, so the load should not be that high.

do you use health-checks?
no.

do container produce a lot of logs?
no.

do you use bind-mounts?
Yes. There are several mounts between a container and the host system.

@thaJeztah
Member

The containers provide development environments for web development, so the load should not be that high.

All containers are running "interactively" (i.e. with a shell session attached)?

@invokablegmbh
Author

Wow. I have found a suspicious thing: besides our dev containers we are using zalenium (https://github.com/zalando/zalenium) on the same docker host. It automatically creates containers with Google Chrome running in them, so you can remotely and programmatically control the Chrome instances, e.g. for automated website testing.

After I did a "docker-compose down" on the zalenium directories (zalenium consists of a control container and several Chrome instance containers), docker stats worked again. However, I am sure that the last time we had this issue, we removed all containers from the host and docker stats still did not work.

@invokablegmbh
Author

The containers provide development environments for web development, so the load should not be that high.

All containers are running "interactively" (i.e. with a shell session attached)?

No, we ssh into the containers to work on them. There is no "docker exec -it [CONTAINER] /bin/bash", if that is what you meant.

@thaJeztah
Member

No, we ssh into the containers to work on them. There is no "docker exec -it [CONTAINER] /bin/bash", if that is what you meant.

Thanks; yes that's what I meant (docker exec or docker run -it) 👍

@invokablegmbh
Author

In the past we had a lot of problems using docker exec / docker run, because the commands sometimes did not terminate. Thus, even our control scripts use ssh into the containers instead of "docker exec".

@achimnol

achimnol commented May 6, 2021

When starting and stopping a few tens (~30) of containers simultaneously and continuously, with periodic docker stats API invocations, the stats API hangs at a random time and often deadlocks my container termination routines, which call the stats API to sync the final stats to an internal database.
For testing, each container runs just a simple "sleep" command for a few seconds, so there is no computational overhead from the containers themselves, though we are controlling the containers via our own "job scheduler" implementation.
Maybe there is a race condition inside the stats API implementation when a container terminates.

FYI: I have an alternative stats implementation which reads and parses sysfs directly, and using it instead of the docker stats API resolves the deadlock issue completely under heavy-load tests. So I'm pretty sure that the problem resides somewhere inside the docker daemon....
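As an illustration only (not the implementation referenced above), reading per-container stats straight from the cgroup v1 filesystem might look roughly like this; the exact paths depend on the cgroup driver and version (the setup reported here uses the cgroupfs driver, where containers live under a docker/<full-container-id> cgroup):

# $CID is the full 64-character container ID (placeholder)
cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes    # current memory usage in bytes
cat /sys/fs/cgroup/cpuacct/docker/$CID/cpuacct.usage           # cumulative CPU time, in nanoseconds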

@zatlodan

zatlodan commented Apr 23, 2024

We have the same issue, where docker stats suddenly stops responding.

Issue description

  • 3 virtual instances all have the same issue (occurred at different times).
  • Running ~30 containers.
  • Starting and stopping a few times a day.
  • The containers might start on a different instance each time.
  • Running without issues for at least a week.

Docker stats suddenly stops responding on all containers without an apparent cause.

This never resolves (has to be terminated):

docker stats --no-stream dd7e99646619

This will error with "timeout waiting for stats":

docker stats dd7e99646619

Docker info & version

All instances are the same:

docker-info.txt
docker-version.txt

Logs

These are the docker daemon logs from around the time the stats endpoint stopped responding.
They are actually the only logs present within ± a few hours of the incident.

Instance A

Issue occurred around 10:00 and 13:00.

journalctl-logs.txt

Instance B

Issue occurred around 22:00.

journalctl-logs.txt

Instance C

Issue occurred around 10:00.

journalctl-logs.txt

Metrics

I have all the Prometheus "node" metrics from the instances.
There is no obviously apparent cause or correlation in the metric data.

I have observed:

  • Drop in disk IOPS when the issue arises, on two of the three instances.
  • Some network interfaces start receiving traffic and some stop receiving traffic.
  • Drop in memory usage.

What have you tried

I have tried sending the stats request directly to the docker socket in various combinations like so:

curl --unix-socket /var/run/docker.sock http:/v1.24/containers/452bb2e0564d/stats?stream=0
curl --unix-socket /var/run/docker.sock http:/v1.44/containers/452bb2e0564d/stats?stream=0
curl --unix-socket /var/run/docker.sock http:/v1.44/containers/452bb2e0564d/stats?stream=1

This request never resolves.
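One way to keep such a probe from hanging indefinitely is to bound it with a client-side timeout, e.g. (the 10-second limit is arbitrary):

curl --max-time 10 --unix-socket /var/run/docker.sock "http://localhost/v1.44/containers/452bb2e0564d/stats?stream=0"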

What fixes the problem

Restarting the docker daemon fixes the issue.

Important

I currently have three distinct instances with the problematic docker at hand and have not yet restarted the daemon.
I can keep it that way for a few days.
So if there is a quick response, I can debug this further directly on the problematic instances.

@cpuguy83
Member

On the stuck instances, please grab a goroutine stack dump.

You can get this from the endpoint "/debug/pprof/goroutine?debug=2".

Also please provide the output of "docker info" and "docker version", assuming those work (they shouldn't generally get wedged).
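For example, assuming the daemon listens on the default unix socket (the pprof endpoints may require the daemon to be running with debug enabled; sending SIGUSR1 to dockerd is an alternative that writes the stacks to a file):

curl --unix-socket /var/run/docker.sock "http://localhost/debug/pprof/goroutine?debug=2" > goroutine-stacks.txt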

@zatlodan

(attached the requested goroutine stack dumps)
@cpuguy83
Member

@zatlodan Thanks, I can see there are a couple of places this is getting wedged trying to close the (CloudWatch) log driver because the log buffer is full. Additionally, the same thing is happening in the local log cache (used to allow docker logs to work with CloudWatch). It's trying to send the log to CloudWatch but is blocked (because of the above problem), and that is preventing the log driver from being closed.

This causes the container state to remain locked indefinitely, which causes other issues, namely preventing an image removal from completing: some container is using the image and the removal is trying to take the container lock to see if the container is running... I think this is also another bug, we shouldn't care if the container is running or not here... but there are likely historical reasons for why this is the way it is.
I'll try and take a look at this as well.
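A rough way to spot this pattern in a goroutine dump (file name below is a placeholder) is to look for goroutines parked in the awslogs driver code and to count how many goroutines are blocked waiting on locks:

grep -B 2 -A 15 awslogs goroutine-stacks.txt | less    # goroutines stuck in the CloudWatch (awslogs) log driver
grep -c semacquire goroutine-stacks.txt                # rough count of goroutines blocked on mutexes/semaphores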

cpuguy83 linked a pull request Apr 23, 2024 that will close this issue
@zatlodan

zatlodan commented Apr 24, 2024

Thanks for the quick investigation 👍

I can see that some of the containers indeed can't send logs due to non-existing CloudWatch log groups.
This is of course caused by negligence on our part 😄

But on the impacted instances there are only two containers out of the total ~24 that currently have this issue,
yet the stats endpoint is unresponsive for all containers on that instance.
It is really weird behavior that one or two containers with full log buffers can cause such an issue.

Although I can see that the problematic image (with a non-existent log group) was deployed to each affected instance around the time the issue arose, indicating a potential correlation.

I will try to fix the logging issues and keep you updated.

@cpuguy83
Member

cpuguy83 commented May 2, 2024

@zatlodan In all your stack dumps I can see lots of stats requests that are not blocked on anything except new stats data (normal), and then some that are blocked waiting for the container lock to be freed (due to log writing being blocked).
