Docker Daemon Hangs under load #13885
Comments
Would you happen to have a reduced testcase to show this? Perhaps a small Dockerfile for what's in the container, and a bash script that does the work of starting/stopping/... the containers?
The container is on Docker Hub: kinvey/blrunner#v0.3.8. We're using the remote API with the following calls: create, start (container.start), remove.
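The create/start/remove cycle described here can be sketched as a shell function. This is an illustrative reconstruction, not the reporter's actual script: the function name and the use of the docker CLI (rather than the raw remote API) are assumptions; only the image tag comes from the thread.

```shell
#!/bin/sh
# Hypothetical sketch of one churn cycle: create, start, wait for exit,
# then force-remove a short-lived container. DOCKER defaults to the docker
# CLI but can be overridden for testing.
DOCKER=${DOCKER:-docker}

churn_once() {
  cid=$($DOCKER create kinvey/blrunner:v0.3.8) || return 1
  $DOCKER start "$cid" >/dev/null || return 1
  $DOCKER wait "$cid" >/dev/null
  $DOCKER rm -f "$cid" >/dev/null
}
```

Running eight of these loops concurrently would approximate the reported load of 8 containers per instance.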
Hmm, are you seeing excessive resource usage?
Not particularly in regards to excessive resource usage... But the early symptom is docker completely hanging while other processes hum along happily. Important to note: we only have 8 containers running at any one time on any one instance.
Captured some stats where docker is no longer responsive. lsof | wc -l shows 1025. However, an error appears several times. Sample output of top:
top - 00:16:53 up 12:22, 2 users, load average: 2.01, 2.05, 2.05
24971 kinvey 20 0 992008 71892 10796 S 1.3 1.9 9:11.93 node
@mjsalinger The setup you're using is unsupported: you're running Ubuntu 14.04 with a custom kernel. Where does that 3.18.0-031800 kernel come from? Did you notice that this kernel build is outdated? The kernel you're using was built in December last year. I'm sorry, but there's nothing to be debugged here. This issue might actually be a kernel bug related to overlay, or some other already-fixed kernel bug which is no longer an issue in the latest 3.18 kernel. I'm going to close this issue. Please try again with an up-to-date 3.18 or newer kernel and check whether you still run into the problem. Please keep in mind that there are multiple issues open against overlay, and that you'll probably experience problems with overlay even after updating to the latest kernel and the latest Docker version.
@unclejack @cpuguy83 @LK4D4 Please reopen this issue. The configuration we're using was specifically recommended by the Docker team, and arrived at through experimentation. We've tried newer kernels (3.19+), and they have a kernel panic bug of some kind that we were running into - so the advice was to go with a pre-December 3.18, because a known kernel bug introduced after that caused the panic we were hitting, and to my knowledge it has not yet been fixed. As for OverlayFS, it was also presented to me as the ideal filesystem for Docker after we experienced numerous performance problems with AUFS. If this isn't a supported configuration, can someone help me find a performant, stable configuration that will work for this use case? We've been pushing to get this stable for several months.
@mjsalinger Can you provide the inode usage for the volume that overlay is running on?
Thanks for reopening. If the answer is a different kernel, that's fine - I just want to get to a stable scenario.
df -i /var/lib/docker
Filesystem Inodes IUsed IFree IUse% Mounted on
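For reference, inode headroom can be checked with a small helper; overlay duplicates files between layers, so inode exhaustion can appear while `df -h` still looks healthy. The helper name is illustrative; the path is the one discussed in the thread.

```shell
#!/bin/sh
# inode_pct prints the IUse% column (without the % sign) for the filesystem
# backing the given path, so overlay inode exhaustion can be spotted early.
# -P forces POSIX output so the columns never wrap onto a second line.
inode_pct() {
  df -Pi "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}
# Example (path from the thread): inode_pct /var/lib/docker
```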
Overlay still has a bunch of problems: https://github.com/docker/docker/issues?q=is%3Aopen+is%3Aissue+label%3A%2Fsystem%2Foverlay I wouldn't use overlay in production. Others have commented on this issue tracker that they're using AUFS in production and that it's been stable for them. Kernel 3.18 is unsupported on Ubuntu 14.04; Canonical doesn't provide support for it.
AUFS in production is not performant at all, and has been anything but stable for us - we would routinely run into I/O bottlenecks, freezes, etc. Switching to overlay resolved all of the above issues; we only have this one issue remaining. See also: http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/ It seems that overlay is being presented as the driver of choice by the community in general. Fine if it's not stable yet, but neither is AUFS, and there has to be some way to get the performance and stability I need with Docker. I'm all for trying new things, but in our previous configurations (AUFS on Ubuntu 12.04 and AUFS on Ubuntu 14.04) we could get neither stability nor performance. At least with overlay we get good performance and better stability - we just need to resolve this one problem.
@mjsalinger I'd recommend using Ubuntu 14.04 with the latest kernel 3.13 packages. I'm using that myself and I haven't run into any of those problems at all.
@unclejack Tried that, and ran into the issues specified above under heavy, high-volume usage (creating/destroying lots of containers); AUFS was incredibly non-performant. So that's not an option.
@mjsalinger Are you using upstart to start docker? What does /etc/init/docker.conf look like?
Yes, using upstart. /etc/init/docker.conf
We are also now seeing the below when running any Docker command on some instances, for example...
FATA[0000] Get http:///var/run/docker.sock/v1.18/containers/json?all=1: dial unix /var/run/docker.sock: resource temporarily unavailable. Are you trying to connect to a TLS-enabled daemon without TLS?
@mjsalinger That's just a poor error message. In most cases it means the daemon crashed.
@mjsalinger What do the docker daemon logs say during this?
Frozen, no new entries coming in. Here are the last entries in the log:
@cpuguy83 Was the log helpful at all?
@mjsalinger Makes me think there's a deadlock somewhere, since there's nothing else indicating an issue.
@cpuguy83 That would make sense given the symptoms. Is there anything I can do to help trace this issue further and find where it comes from?
Maybe we can get an strace to see that it's actually hanging on a lock.
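Two probes are commonly combined for this kind of diagnosis: a goroutine stack dump from the daemon and an strace of its threads. The helper below is only a sketch; the pid lookup shown in the comments is an assumption about how the daemon was started.

```shell
#!/bin/sh
# dump_stacks sends SIGUSR1 to the given pid; the Docker daemon responds by
# writing a full goroutine stack dump to its log, which shows which
# goroutine holds or waits on a mutex.
dump_stacks() {
  kill -USR1 "$1"
}

# Against a live daemon (hypothetical invocations):
#   dump_stacks "$(pidof docker)"
#   strace -f -p "$(pidof docker)" -e trace=futex   # threads stuck in futex()
```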
Ok, will work to see if we can get that. We wanted to try 1.7 first; we did, but did not notice any improvement.
@cpuguy83 From one of the affected hosts:
@cpuguy83 Any ideas?
Seeing the following in 1.7, with containers not being killed/started. This seems to be a precursor to the problem. (Note: we didn't see these errors in 1.6, but did see a volume of dead containers start to build up, even though a command was issued to kill/remove them.)
@mblaschke I looked over your traces (https://gist.github.com/tonistiigi/0fb0abfb068a89975c072b68e6ed07ce for a better view). I can't find anything suspicious in there, though. All long-running goroutines are from open io copies; that is normal if there are running containers or execs, as these goroutines don't hold any locks. From the traces I would expect that other commands would still work.
The warnings you have in the logs should be fixed with https://github.com/docker/containerd/pull/351 . They should only cause log spam and not be anything serious. Because debug logging is not enabled, I can't see whether any suspicious commands were sent to the daemon. There don't seem to be any meaningful logs for minutes before you took the stacktrace.
The same code works with Docker 1.10.3, but not after Docker 1.11.x. The serverspec tests fail randomly with timeouts.
@mblaschke I had a look at the trace too; it really looks like the exec is just not finishing its IO. What exec program are you executing? Does it fork new processes within its own session id?
We are executing (docker exec). We are seeing the hanging problem on a regular basis within our production cluster.
@mlaventure Tests are here (but they need some environment settings for execution): you could try it with our code base:
@GameScripting I'm getting confused now 😅. Are you referring to the same context that @mblaschke is running from? If not, which docker version are you using? And to answer your question: no, it's unlikely that ip addr would do anything like this. Which image are you using? What is the exact docker exec command being used?
Sorry, it was not my intention to make this more confusing. The main issue in resolving this bug is that no one has yet been able to come up with stable, reproducible steps to trigger the hang. It seems @mblaschke found something, so he is able to reliably trigger the bug.
I use CoreOS stable (1185.3.0) with Docker 1.11.2, but I've found a workaround until you find a solution: I use ctr directly. The next docker exec example:
can be translated to ctr:
So you can translate scripts to ctr temporarily as a workaround.
@mblaschke The commands you posted did fail for me, but they don't look like docker failures: https://gist.github.com/tonistiigi/86badf5a41dff3fe53bd68d8e83e4ec4 Could you enable debug logs? The master build also stores more debug data about daemon internals, and allows tracing containerd as well with SIGUSR1 signals, since we are tracking a stuck process.
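On the Ubuntu/upstart setups discussed earlier in this thread, debug logging is typically switched on via /etc/default/docker; treat this as a sketch, since the exact packaging and file location vary by distro and version.

```shell
# /etc/default/docker (Ubuntu packaging) -- add -D to the daemon options to
# turn on debug logging, then restart the daemon (e.g. sudo restart docker).
DOCKER_OPTS="-D"
```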
@tonistiigi Sorry :(
@mblaschke Still doesn't look like a docker issue. It is hanging on
And this HipHopVM doesn't support
@tonistiigi I've searched the build logs and found one test failure (a random issue, not in the hhvm tests) when using 1.12.3. We will continue to stress Docker and try to find the issue.
@coolljt0725 I just came back to work to find Docker hung again.
As soon as that completed, the other two sessions unblocked. I was checking disk space because that had been a problem once before, so it was the first place I looked. The two semaphores may have been the two calls I had, but there were many hung
An example of hung output after I did the
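A cheap way to detect this hung state automatically is to wrap status commands in a deadline. The function below is an illustrative sketch (the name and threshold are assumptions), usable as, e.g., check_hung 10 docker ps:

```shell
#!/bin/sh
# check_hung runs the given command under a deadline and reports a hang if
# it fails to finish in time. GNU timeout exits with 124 when the deadline
# expires, which lands in the failure branch along with ordinary errors.
check_hung() {
  secs=$1
  shift
  if ! timeout "$secs" "$@" >/dev/null 2>&1; then
    echo "command did not finish within ${secs}s: $*" >&2
    return 1
  fi
}
```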
Docker 1.11.2 hanging, stack trace: https://gist.github.com/Calpicow/871621ba807d6eb9b18b91e8c2eb4eef
The trace from @Calpicow seems to be stuck on devicemapper, but it doesn't look like the udev_wait case. @coolljt0725 @rhvgoyal
@Calpicow Do you have some
I have a huge SIGUSR1 dump below as well (it didn't fit here). https://gist.github.com/ahmetalpbalkan/34bf40c02a78e319eaf5710acb15cf9a
It looks like I have a ton (like 700) of these goroutines:
@ahmetalpbalkan You look to be blocked waiting on a netlink socket to return.
@cpuguy83 Yeah, I saw coreos/bugs#254, which looked similar to my case; however, I don't see the "waiting" messages in the kernel logs that the person there and you mentioned. It looks like 1.12.5 has not even hit the CoreOS alpha stream yet. Is there a kernel/docker version I can downgrade to and have it working?
@ahmetalpbalkan Yay, for another kernel bug.
Is it known what exactly the bug is? Was the kernel bug reported upstream? Or is there a kernel version where this bug has been fixed?
@GameScripting The issue will have to be reported to whatever distro this was produced on, and as you can see, we have more than one issue causing the same effect here as well.
Here's another one with Docker v1.12.3. Relevant syslog:
@Calpicow Thanks, yours looks like devicemapper has stalled.
Can you open a separate issue with all the details?
Let me close this ticket for now, as it looks like it went stale.
Docker Daemon hangs under heavy load. The scenario is starting/stopping/killing/removing many containers per second - high utilization. Containers expose one port and are run detached, without logging. A container receives an incoming TCP connection, does some work, sends a response, and then exits. An outside process cleans up by killing/removing the container and starting a new one.
I cannot get docker info from an actual hung instance, as once it is hung I can't get docker to run without a reboot. The info below is from one of the instances that has had the problem, after a reboot.
We also have instances where something locks up completely and the instance cannot even be ssh'd into. This usually happens some time after the docker lockup occurs.
Docker Info
Containers: 8
Images: 65
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Kernel Version: 3.18.0-031800-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 2
Total Memory: 3.675 GiB
Name:
ID: FAEG:2BHA:XBTO:CNKH:3RCA:GV3Z:UWIB:76QS:6JAG:SVCE:67LH:KZBP
WARNING: No swap limit support
Docker Version
Client version: 1.6.0
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 4749651
OS/Arch (client): linux/amd64
Server version: 1.6.0
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 4749651
OS/Arch (server): linux/amd64
uname -a
Linux 3.18.0-031800-generic #201412071935 SMP Mon Dec 8 00:36:34 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 14972
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 14972
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited