
Complete system lock-up Ubuntu 14.04 #21179

Closed
rbjorklin opened this issue Mar 14, 2016 · 13 comments

@rbjorklin

This might be a duplicate of #10355, but I was asked by @thaJeztah in this comment to open a separate issue. My original report can be seen here.

Running Ubuntu 14.04.4, fully patched, with Docker 1.10.2, we had 6 out of 7 virtual machines (VMware) freeze completely at almost the same time in our dev environment, sometime between 15.00 and 16.00 CET on 2016-03-11. It happened again for 2 machines roughly 3 hours later. The console provided by VMware was completely unresponsive. We have hundreds of VMs running and I've never observed this behavior before. I'd like to blame Docker, but currently I have no hard proof.

Output of uname -a:

Linux slave6-mesos 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Output of docker version:

Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:37:01 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:37:01 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 58
 Running: 1
 Paused: 0
 Stopped: 57
Images: 29
Server Version: 1.10.2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 401
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: weave null host bridge weavemesh
Kernel Version: 3.13.0-79-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.67 GiB
Name: slave6-mesos
ID: 7OZ4:ZTNC:WVEM:DCLQ:TTSL:ZHCI:QMKB:7GUX:HNEN:R3NW:KUPS:3IWF
WARNING: No swap limit support
Cluster store: zk://master1-mesos.dte.loc:2181,master2-mesos.dte.loc:2181,master3-mesos.dte.loc:2181/dockernet
Cluster advertise: 10.34.20.158:2375

Additional environment details:
We are running Marathon on top of Mesos, so containers are started by the Mesos slave. All containers run the official tomcat image with a bash script as ENTRYPOINT that traps SIGTERM to handle signals nicely. Inside the container we also run the zabbix-agent to poll JMX values and report back. Pretty much all logging is sent out of the container to Logstash via GELF; Tomcat uses this to get its logs out.
The Mesos slaves are virtual machines in VMware. We were using Marathon 0.13.0, Mesos 0.27.1, and Docker 1.10.2 when this issue occurred, but have since upgraded to Mesos 0.27.2 and Docker 1.10.3.
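The SIGTERM-trapping ENTRYPOINT described above could look roughly like the following sketch. The author's actual script is not shown in this issue, so everything here is an assumption about its shape:

```shell
#!/bin/bash
# Hypothetical ENTRYPOINT sketch: trap SIGTERM and forward it to the main
# process so the container shuts down cleanly when Docker stops it.

term_handler() {
  # Forward the signal to the workload and wait for it to exit
  kill -TERM "$child" 2>/dev/null
  wait "$child"
  exit 0
}

trap term_handler TERM

# Default to a short sleep when no command is given, so this sketch is
# runnable standalone
[ "$#" -gt 0 ] || set -- sleep 0

# Launch the real workload (e.g. catalina.sh in the tomcat image) in the
# background and wait; wait returns early when the trapped signal arrives
"$@" &
child=$!
wait "$child"
```

Docker sends SIGTERM on `docker stop`; without the trap, a shell ENTRYPOINT running as PID 1 would not forward the signal, and the container would only die on the follow-up SIGKILL.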

Additional information you deem important (e.g. issue happens only occasionally):
Have seen this message logged a few times:

aufs au_opts_parse:1155:docker[1094]: unknown option dirperm1
@rbjorklin
Author

The last few lines in syslog before system lock-up:

Mar 11 18:57:00 slave6-mesos mesos-slave[1427]: I0311 18:57:00.349362  1523 slave.cpp:4304] Current disk usage 50.37%. Max allowed age: 2.774337338481030days
Mar 11 18:57:12 slave6-mesos kernel: [10436.826488] docker0: port 4(vethf3178d2) entered forwarding state
Mar 11 18:57:23 slave6-mesos mesos-slave[1427]: I0311 18:57:23.620837  1520 slave.cpp:1890] Asked to kill task featuredockerifyproject_mock.c89713c9-e7b1-11e5-932b-0050569710e1 of framework e5f6bf78-ff84-4d02-9c85-3f27e4c9f0c0-0000
Mar 11 18:57:24 slave6-mesos kernel: [10448.655194] docker0: port 6(vethceeddf5) entered disabled state
Mar 11 18:57:24 slave6-mesos kernel: [10448.888775] docker0: port 6(vethceeddf5) entered disabled state
Mar 11 18:57:24 slave6-mesos kernel: [10448.892371] device vethceeddf5 left promiscuous mode
Mar 11 18:57:24 slave6-mesos kernel: [10448.892402] docker0: port 6(vethceeddf5) entered disabled state
Mar 11 18:57:24 slave6-mesos mesos-slave[1427]: I0311 18:57:24.429033  1524 slave.cpp:3001] Handling status update TASK_KILLED (UUID: 52259042-997b-4768-8927-6e68a2df17cd) for task featuredockerifyproject_mock.c89713c9-e7b1-11e5-932b-0050569710e1 of framew
Mar 14 07:34:29 slave6-mesos rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1156" x-info="http://www.rsyslog.com"] start

Unfortunately the log messages in /var/log/upstart/docker.log don't contain timestamps, so I can't provide any accurate output from the docker daemon.

@HackToday
Contributor

From https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes#Kernel
it seems Ubuntu 14.04.4 ships a newer 4.2 kernel. Not sure your 3.13 kernel is right here?

Is that really Ubuntu 14.04.4? Thanks

@rbjorklin
Author

@HackToday Ubuntu 14.04.4 ships with 3.13.0, but it is indeed possible to use one of the backported kernels from later non-LTS releases. Is this what the Docker team recommends?

I think this is what you intended to link: https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes#Updated_Packages

Reading here: "Those running virtual or cloud images should not need this newer hardware enablement stack and thus it is recommended they remain on the original Trusty stack."

@HackToday
Contributor

No @rbjorklin, I just don't have Ubuntu 14.04.4 on hand. Maybe I can try later to check whether the kernel matches what you've given. Thanks

@sfussenegger

My Ubuntu 15.10 machine always came to a complete halt when starting 12 containers with docker-compose, most of which were Java processes. It turned out that my machine simply ran out of memory, which I learned when Eclipse crashed (not running in a container, obviously):

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007a2f00000, 326107136, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 326107136 bytes for committing reserved memory.

My machine has 16G of RAM and another 8G of swap, so this came as a bit of a surprise. The containers had no restrictions set on CPU or memory, and neither did the Java processes. I've now added a generous limit of 512m (Docker) and 256m (Java -Xmx) to all Java containers, which solved the issue.

My gut feeling tells me that Java uses more memory inside an unbounded container than it is supposed to.
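For illustration, the limits described above could be expressed in a docker-compose file like this hypothetical fragment (service and image names are made up, and `JAVA_OPTS` is assumed to be honored by the image's start script, as the official tomcat image does via catalina.sh):

```yaml
# Hypothetical sketch: cap the container at 512m and the JVM heap at 256m
version: "2"
services:
  app:
    image: my-container:latest
    mem_limit: 512m
    environment:
      JAVA_OPTS: "-Xmx256m"
```

Keeping a gap between `mem_limit` and `-Xmx` leaves headroom for the JVM's non-heap memory (metaspace, thread stacks, native buffers), which is not covered by `-Xmx`.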

Some more output:

$ uname -a
Linux koothrappali 4.2.0-34-generic #39-Ubuntu SMP Thu Mar 10 22:13:01 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ docker info 
Containers: 16
 Running: 15
 Paused: 0
 Stopped: 1
Images: 13
Server Version: 1.10.3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 163
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 4.2.0-34-generic
Operating System: Ubuntu 15.10
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.61 GiB
Name: koothrappali
ID: F7CA:3W4R:6NCW:N4LI:HQSM:TEKV:J4ZI:FO5M:2LDC:6SXT:WE4W:RZKV
WARNING: No swap limit support

$ docker run --entrypoint java --rm my-container:latest -version
openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)

@rbjorklin
Author

The problem hasn't reoccurred since we upgraded to Docker 1.10.3. I'll wait a few more days, but this may well have been resolved.

@thaJeztah
Member

Thanks @rbjorklin, keep us posted

@thaJeztah
Member

@rbjorklin wondering if it's resolved after upgrading to docker 1.10.3; are you still seeing this issue, or can we mark this resolved?

@rbjorklin
Author

@thaJeztah sorry completely forgot about this!

@thaJeztah
Member

No problem, glad it's resolved!

@wangyumi

wangyumi commented Jun 24, 2016

Unfortunately, I hit the same issue on Ubuntu 14.04.4 with Docker 1.10.3:

#uname -a
Linux k8s-014 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

#docker info
Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 104
Server Version: 1.10.3
Storage Driver: aufs
 Root Dir: /data/docker/aufs
 Backing Filesystem: extfs
 Dirs: 609
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-24-generic
Operating System: Ubuntu 14.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.777 GiB
Name: k8s-014
ID: 2YKG:UJAY:DCRJ:ZX52:W4LT:X2LB:ULEA:QISH:4WOS:CE65:YKVC:TXYN
WARNING: No swap limit support

kernel log:
Jun 24 15:10:32 k8s-014 kernel: [4944764.996117] net eth0: Too many slots
Jun 24 15:10:42 k8s-014 kernel: [4944775.115344] net eth0: Too many slots
Jun 24 16:22:43 k8s-014 kernel: [4949096.370037] device veth997ac00 entered promiscuous mode
Jun 24 16:22:43 k8s-014 kernel: [4949096.370616] IPv6: ADDRCONF(NETDEV_UP): veth997ac00: link is not ready
Jun 24 16:22:44 k8s-014 kernel: [4949096.584893] IPv6: ADDRCONF(NETDEV_CHANGE): veth997ac00: link becomes ready
Jun 24 16:22:44 k8s-014 kernel: [4949096.585005] docker0: port 1(veth997ac00) entered forwarding state
Jun 24 16:22:44 k8s-014 kernel: [4949096.585010] docker0: port 1(veth997ac00) entered forwarding state
Jun 24 16:22:44 k8s-014 kernel: [4949096.834365] net eth0: Too many slots
Jun 24 16:22:44 k8s-014 kernel: [4949096.935488] net eth0: Too many slots
Jun 24 16:22:44 k8s-014 kernel: [4949096.989313] net eth0: Too many slots
Jun 24 16:22:44 k8s-014 kernel: [4949097.461336] net eth0: Too many slots
Jun 24 16:22:44 k8s-014 kernel: [4949097.488808] net eth0: Too many slots
Jun 24 16:22:45 k8s-014 kernel: [4949097.960160] net eth0: Too many slots
Jun 24 16:22:45 k8s-014 kernel: [4949097.998322] net eth0: Too many slots
Jun 24 16:22:45 k8s-014 kernel: [4949098.253653] net eth0: Too many slots
Jun 24 16:22:45 k8s-014 kernel: [4949098.315925] net eth0: Too many slots
Jun 24 16:22:59 k8s-014 kernel: [4949111.612093] docker0: port 1(veth997ac00) entered forwarding state
Jun 24 16:40:34 k8s-014 kernel: [4950166.820769] docker0: port 1(veth997ac00) entered disabled state
Jun 24 16:40:34 k8s-014 kernel: [4950166.896411] docker0: port 1(veth997ac00) entered disabled state
Jun 24 16:40:34 k8s-014 kernel: [4950166.896799] device veth997ac00 left promiscuous mode
Jun 24 16:40:34 k8s-014 kernel: [4950166.896810] docker0: port 1(veth997ac00) entered disabled state
Jun 24 16:55:02 k8s-014 kernel: [4951034.780565] net_ratelimit: 1 callbacks suppressed
Jun 24 16:55:02 k8s-014 kernel: [4951034.780571] net eth0: Too many slots
Jun 24 17:56:27 k8s-014 kernel: [4954719.586787] net eth0: Too many slots
Jun 24 17:56:27 k8s-014 kernel: [4954720.321527] net eth0: Too many slots
Jun 24 17:56:28 k8s-014 kernel: [4954720.587148] net eth0: Too many slots
Jun 24 18:09:55 k8s-014 kernel: [4955528.129943] device veth37a1c0c entered promiscuous mode
Jun 24 18:09:55 k8s-014 kernel: [4955528.130724] IPv6: ADDRCONF(NETDEV_UP): veth37a1c0c: link is not ready
Jun 24 18:09:55 k8s-014 kernel: [4955528.332977] IPv6: ADDRCONF(NETDEV_CHANGE): veth37a1c0c: link becomes ready
Jun 24 18:09:55 k8s-014 kernel: [4955528.333147] docker0: port 1(veth37a1c0c) entered forwarding state
Jun 24 18:09:55 k8s-014 kernel: [4955528.333154] docker0: port 1(veth37a1c0c) entered forwarding state
Jun 24 18:09:56 k8s-014 kernel: [4955528.808605] net eth0: Too many slots
Jun 24 18:09:56 k8s-014 kernel: [4955528.824141] IPv6: eth0: IPv6 duplicate address fe80::42:acff:fe11:3002 detected!
Jun 24 18:10:10 k8s-014 kernel: [4955543.388082] docker0: port 1(veth37a1c0c) entered forwarding state
Jun 24 18:11:29 k8s-014 kernel: [4955622.006515] net eth0: Too many slots
Jun 24 18:11:30 k8s-014 kernel: [4955622.604626] net eth0: Too many slots
Jun 24 19:11:29 k8s-014 kernel: [4959222.015271] device veth63cde08 entered promiscuous mode
Jun 24 19:11:29 k8s-014 kernel: [4959222.015697] IPv6: ADDRCONF(NETDEV_UP): veth63cde08: link is not ready
Jun 24 19:11:29 k8s-014 kernel: [4959222.212706] IPv6: ADDRCONF(NETDEV_CHANGE): veth63cde08: link becomes ready
Jun 24 19:11:29 k8s-014 kernel: [4959222.212809] docker0: port 2(veth63cde08) entered forwarding state
Jun 24 19:11:29 k8s-014 kernel: [4959222.212813] docker0: port 2(veth63cde08) entered forwarding state
Jun 24 19:11:29 k8s-014 kernel: [4959222.553715] net eth0: Too many slots
Jun 24 19:11:29 k8s-014 kernel: [4959222.558058] net eth0: Too many slots
Jun 24 19:11:30 k8s-014 kernel: [4959222.676158] IPv6: eth0: IPv6 duplicate address fe80::42:acff:fe11:3003 detected!
Jun 24 19:11:30 k8s-014 kernel: [4959222.981529] net eth0: Too many slots
Jun 24 19:11:44 k8s-014 kernel: [4959237.244079] docker0: port 2(veth63cde08) entered forwarding state

@rbjorklin
Author

@wangyumi I don't recall all the details, but I think you need to update your kernel. There were some fairly important AUFS fixes in 3.13.0-79, and it looks like you're running an old 3.13.0-24 kernel.
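A quick way to check for the suggested upgrade; this is a hedged sketch assuming stock Ubuntu trusty packages:

```shell
# Read-only check: which 3.13 kernel is running, and which kernel image
# packages are installed (anything older than 3.13.0-79 is worth upgrading)
uname -r
dpkg -l 2>/dev/null | grep linux-image || true

# To pull the latest 3.13.0-NN point release from the trusty archive,
# then reboot into it:
#   sudo apt-get update && sudo apt-get install linux-image-generic
#   sudo reboot
```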

@wangyumi

Okay, I will try. Thanks!
