
Complete system lock-up Ubuntu 14.04 #21179

Closed
rbjorklin opened this issue Mar 14, 2016 · 13 comments

@rbjorklin

This might be a duplicate of #10355, but I was asked by @thaJeztah in this comment to open a separate issue. My original report can be seen here.

Running Ubuntu 14.04.4, fully patched, with Docker 1.10.2, we had 6 out of 7 virtual machines (VMware) freeze completely at almost the same time in our dev environment, sometime between 15.00 and 16.00 CET on 2016-03-11. It happened again for 2 machines roughly 3 hours later. The console provided by VMware was completely unresponsive. We have hundreds of VMs running and I've never observed this behavior before. I'd like to blame Docker, but currently I have no hard proof.

Output of uname -a:

Linux slave6-mesos 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Output of docker version:

Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:37:01 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:37:01 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 58
 Running: 1
 Paused: 0
 Stopped: 57
Images: 29
Server Version: 1.10.2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 401
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: weave null host bridge weavemesh
Kernel Version: 3.13.0-79-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.67 GiB
Name: slave6-mesos
ID: 7OZ4:ZTNC:WVEM:DCLQ:TTSL:ZHCI:QMKB:7GUX:HNEN:R3NW:KUPS:3IWF
WARNING: No swap limit support
Cluster store: zk://master1-mesos.dte.loc:2181,master2-mesos.dte.loc:2181,master3-mesos.dte.loc:2181/dockernet
Cluster advertise: 10.34.20.158:2375

Additional environment details:
We are running Marathon on top of Mesos, so containers are started by the Mesos slave. All containers run the official tomcat image with a bash script as ENTRYPOINT that traps SIGTERM to handle signals nicely. Inside the container we also run the zabbix-agent to poll JMX values and report back. Pretty much all logging is sent out of the container to Logstash via GELF; Tomcat uses this to get its logs out.
The Mesos slaves are virtual machines in VMware. We were using Marathon 0.13.0, Mesos 0.27.1, and Docker 1.10.2 when this issue occurred, but have since upgraded to Mesos 0.27.2 and Docker 1.10.3.
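The SIGTERM-trapping ENTRYPOINT described above could look roughly like the following sketch. The author's actual script is not shown in this issue, so everything here is an assumption about its shape:

```shell
#!/bin/bash
# Hypothetical ENTRYPOINT sketch: trap SIGTERM and forward it to the main
# process so the container shuts down cleanly when Docker stops it.

term_handler() {
  # Forward the signal to the workload and wait for it to exit
  kill -TERM "$child" 2>/dev/null
  wait "$child"
  exit 0
}

trap term_handler TERM

# Default to a short sleep when no command is given, so this sketch is
# runnable standalone
[ "$#" -gt 0 ] || set -- sleep 0

# Launch the real workload (e.g. catalina.sh in the tomcat image) in the
# background and wait; wait returns early when the trapped signal arrives
"$@" &
child=$!
wait "$child"
```

Docker sends SIGTERM on `docker stop`; without the trap, a shell ENTRYPOINT running as PID 1 would not forward the signal, and the container would only die on the follow-up SIGKILL.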

Additional information you deem important (e.g. issue happens only occasionally):
Have seen this message logged a few times:

aufs au_opts_parse:1155:docker[1094]: unknown option dirperm1
@rbjorklin
Author

The last few lines in syslog before system lock-up:

Mar 11 18:57:00 slave6-mesos mesos-slave[1427]: I0311 18:57:00.349362  1523 slave.cpp:4304] Current disk usage 50.37%. Max allowed age: 2.774337338481030days
Mar 11 18:57:12 slave6-mesos kernel: [10436.826488] docker0: port 4(vethf3178d2) entered forwarding state
Mar 11 18:57:23 slave6-mesos mesos-slave[1427]: I0311 18:57:23.620837  1520 slave.cpp:1890] Asked to kill task featuredockerifyproject_mock.c89713c9-e7b1-11e5-932b-0050569710e1 of framework e5f6bf78-ff84-4d02-9c85-3f27e4c9f0c0-0000
Mar 11 18:57:24 slave6-mesos kernel: [10448.655194] docker0: port 6(vethceeddf5) entered disabled state
Mar 11 18:57:24 slave6-mesos kernel: [10448.888775] docker0: port 6(vethceeddf5) entered disabled state
Mar 11 18:57:24 slave6-mesos kernel: [10448.892371] device vethceeddf5 left promiscuous mode
Mar 11 18:57:24 slave6-mesos kernel: [10448.892402] docker0: port 6(vethceeddf5) entered disabled state
Mar 11 18:57:24 slave6-mesos mesos-slave[1427]: I0311 18:57:24.429033  1524 slave.cpp:3001] Handling status update TASK_KILLED (UUID: 52259042-997b-4768-8927-6e68a2df17cd) for task featuredockerifyproject_mock.c89713c9-e7b1-11e5-932b-0050569710e1 of framew
Mar 14 07:34:29 slave6-mesos rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1156" x-info="http://www.rsyslog.com"] start

Unfortunately the log messages in /var/log/upstart/docker.log don't contain timestamps, so I can't provide any accurate output from the docker daemon.

@HackToday
Contributor

From https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes#Kernel
it seems Ubuntu 14.04.4 ships a newer 4.2 kernel. Not sure your 3.13 kernel is right here?

Is that really Ubuntu 14.04.4? Thanks

@rbjorklin
Author

@HackToday Ubuntu 14.04.4 ships with 3.13.0, but it is indeed possible to use one of the backported kernels from later non-LTS releases. Is this what the Docker team recommends?

I think this is what you intended to link: https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes#Updated_Packages

Reading here: "Those running virtual or cloud images should not need this newer hardware enablement stack and thus it is recommended they remain on the original Trusty stack."

@HackToday
Contributor

No @rbjorklin, I just don't have Ubuntu 14.04.4 on hand. Maybe I can try later to check whether the kernel matches what you've given. Thanks

@sfussenegger

My Ubuntu 15.10 machine always came to a complete halt when starting 12 containers with docker-compose, most of which were Java processes. It turned out that my machine simply ran out of memory, which I learned when Eclipse crashed (not running in a container, obviously):

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007a2f00000, 326107136, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 326107136 bytes for committing reserved memory.

My machine has 16G of RAM and another 8G of swap, so this came as a bit of a surprise. The containers had no restrictions set on CPU or memory, and neither did the Java processes. I've now added a generous limit of 512m (Docker) and 256m (Java -Xmx) to all Java containers, which solved the issue.

My gut feeling tells me that Java uses more memory inside an unbounded container than it is supposed to.
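For illustration, the limits described above could be expressed in a docker-compose file like this hypothetical fragment (service and image names are made up, and `JAVA_OPTS` is assumed to be honored by the image's start script, as the official tomcat image does via catalina.sh):

```yaml
# Hypothetical sketch: cap the container at 512m and the JVM heap at 256m
version: "2"
services:
  app:
    image: my-container:latest
    mem_limit: 512m
    environment:
      JAVA_OPTS: "-Xmx256m"
```

Keeping a gap between `mem_limit` and `-Xmx` leaves headroom for the JVM's non-heap memory (metaspace, thread stacks, native buffers), which is not covered by `-Xmx`.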

Some more output:

$ uname -a
Linux koothrappali 4.2.0-34-generic #39-Ubuntu SMP Thu Mar 10 22:13:01 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ docker info 
Containers: 16
 Running: 15
 Paused: 0
 Stopped: 1
Images: 13
Server Version: 1.10.3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 163
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 4.2.0-34-generic
Operating System: Ubuntu 15.10
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.61 GiB
Name: koothrappali
ID: F7CA:3W4R:6NCW:N4LI:HQSM:TEKV:J4ZI:FO5M:2LDC:6SXT:WE4W:RZKV
WARNING: No swap limit support

$ docker run --entrypoint java --rm my-container:latest -version
openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)

@rbjorklin
Author

The problem hasn't reoccurred since we upgraded to Docker 1.10.3. I'll wait a few more days, but this may well have been resolved.

@thaJeztah
Member

Thanks @rbjorklin, keep us posted

@thaJeztah
Member

@rbjorklin wondering if it's resolved after upgrading to docker 1.10.3; are you still seeing this issue, or can we mark this resolved?

@rbjorklin
Author

@thaJeztah sorry completely forgot about this!

@thaJeztah
Member

No problem, glad it's resolved!

@wangyumi

wangyumi commented Jun 24, 2016

Unfortunately, I hit the same issue on Ubuntu 14.04.4 with Docker 1.10.3:

#uname -a
Linux k8s-014 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

#docker info
Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 104
Server Version: 1.10.3
Storage Driver: aufs
 Root Dir: /data/docker/aufs
 Backing Filesystem: extfs
 Dirs: 609
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-24-generic
Operating System: Ubuntu 14.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.777 GiB
Name: k8s-014
ID: 2YKG:UJAY:DCRJ:ZX52:W4LT:X2LB:ULEA:QISH:4WOS:CE65:YKVC:TXYN
WARNING: No swap limit support

kernel log:
Jun 24 15:10:32 k8s-014 kernel: [4944764.996117] net eth0: Too many slots
Jun 24 15:10:42 k8s-014 kernel: [4944775.115344] net eth0: Too many slots
Jun 24 16:22:43 k8s-014 kernel: [4949096.370037] device veth997ac00 entered promiscuous mode
Jun 24 16:22:43 k8s-014 kernel: [4949096.370616] IPv6: ADDRCONF(NETDEV_UP): veth997ac00: link is not ready
Jun 24 16:22:44 k8s-014 kernel: [4949096.584893] IPv6: ADDRCONF(NETDEV_CHANGE): veth997ac00: link becomes ready
Jun 24 16:22:44 k8s-014 kernel: [4949096.585005] docker0: port 1(veth997ac00) entered forwarding state
Jun 24 16:22:44 k8s-014 kernel: [4949096.585010] docker0: port 1(veth997ac00) entered forwarding state
Jun 24 16:22:44 k8s-014 kernel: [4949096.834365] net eth0: Too many slots
Jun 24 16:22:44 k8s-014 kernel: [4949096.935488] net eth0: Too many slots
Jun 24 16:22:44 k8s-014 kernel: [4949096.989313] net eth0: Too many slots
Jun 24 16:22:44 k8s-014 kernel: [4949097.461336] net eth0: Too many slots
Jun 24 16:22:44 k8s-014 kernel: [4949097.488808] net eth0: Too many slots
Jun 24 16:22:45 k8s-014 kernel: [4949097.960160] net eth0: Too many slots
Jun 24 16:22:45 k8s-014 kernel: [4949097.998322] net eth0: Too many slots
Jun 24 16:22:45 k8s-014 kernel: [4949098.253653] net eth0: Too many slots
Jun 24 16:22:45 k8s-014 kernel: [4949098.315925] net eth0: Too many slots
Jun 24 16:22:59 k8s-014 kernel: [4949111.612093] docker0: port 1(veth997ac00) entered forwarding state
Jun 24 16:40:34 k8s-014 kernel: [4950166.820769] docker0: port 1(veth997ac00) entered disabled state
Jun 24 16:40:34 k8s-014 kernel: [4950166.896411] docker0: port 1(veth997ac00) entered disabled state
Jun 24 16:40:34 k8s-014 kernel: [4950166.896799] device veth997ac00 left promiscuous mode
Jun 24 16:40:34 k8s-014 kernel: [4950166.896810] docker0: port 1(veth997ac00) entered disabled state
Jun 24 16:55:02 k8s-014 kernel: [4951034.780565] net_ratelimit: 1 callbacks suppressed
Jun 24 16:55:02 k8s-014 kernel: [4951034.780571] net eth0: Too many slots
Jun 24 17:56:27 k8s-014 kernel: [4954719.586787] net eth0: Too many slots
Jun 24 17:56:27 k8s-014 kernel: [4954720.321527] net eth0: Too many slots
Jun 24 17:56:28 k8s-014 kernel: [4954720.587148] net eth0: Too many slots
Jun 24 18:09:55 k8s-014 kernel: [4955528.129943] device veth37a1c0c entered promiscuous mode
Jun 24 18:09:55 k8s-014 kernel: [4955528.130724] IPv6: ADDRCONF(NETDEV_UP): veth37a1c0c: link is not ready
Jun 24 18:09:55 k8s-014 kernel: [4955528.332977] IPv6: ADDRCONF(NETDEV_CHANGE): veth37a1c0c: link becomes ready
Jun 24 18:09:55 k8s-014 kernel: [4955528.333147] docker0: port 1(veth37a1c0c) entered forwarding state
Jun 24 18:09:55 k8s-014 kernel: [4955528.333154] docker0: port 1(veth37a1c0c) entered forwarding state
Jun 24 18:09:56 k8s-014 kernel: [4955528.808605] net eth0: Too many slots
Jun 24 18:09:56 k8s-014 kernel: [4955528.824141] IPv6: eth0: IPv6 duplicate address fe80::42:acff:fe11:3002 detected!
Jun 24 18:10:10 k8s-014 kernel: [4955543.388082] docker0: port 1(veth37a1c0c) entered forwarding state
Jun 24 18:11:29 k8s-014 kernel: [4955622.006515] net eth0: Too many slots
Jun 24 18:11:30 k8s-014 kernel: [4955622.604626] net eth0: Too many slots
Jun 24 19:11:29 k8s-014 kernel: [4959222.015271] device veth63cde08 entered promiscuous mode
Jun 24 19:11:29 k8s-014 kernel: [4959222.015697] IPv6: ADDRCONF(NETDEV_UP): veth63cde08: link is not ready
Jun 24 19:11:29 k8s-014 kernel: [4959222.212706] IPv6: ADDRCONF(NETDEV_CHANGE): veth63cde08: link becomes ready
Jun 24 19:11:29 k8s-014 kernel: [4959222.212809] docker0: port 2(veth63cde08) entered forwarding state
Jun 24 19:11:29 k8s-014 kernel: [4959222.212813] docker0: port 2(veth63cde08) entered forwarding state
Jun 24 19:11:29 k8s-014 kernel: [4959222.553715] net eth0: Too many slots
Jun 24 19:11:29 k8s-014 kernel: [4959222.558058] net eth0: Too many slots
Jun 24 19:11:30 k8s-014 kernel: [4959222.676158] IPv6: eth0: IPv6 duplicate address fe80::42:acff:fe11:3003 detected!
Jun 24 19:11:30 k8s-014 kernel: [4959222.981529] net eth0: Too many slots
Jun 24 19:11:44 k8s-014 kernel: [4959237.244079] docker0: port 2(veth63cde08) entered forwarding state

@rbjorklin
Author

@wangyumi I don't recall all the details, but I think you need to update your kernel. There were some fairly important AUFS fixes in 3.13.0-79, and it looks like you're running an old 3.13.0-24 kernel.
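A quick way to check for the suggested upgrade; this is a hedged sketch assuming stock Ubuntu trusty packages:

```shell
# Read-only check: which 3.13 kernel is running, and which kernel image
# packages are installed (anything older than 3.13.0-79 is worth upgrading)
uname -r
dpkg -l 2>/dev/null | grep linux-image || true

# To pull the latest 3.13.0-NN point release from the trusty archive,
# then reboot into it:
#   sudo apt-get update && sudo apt-get install linux-image-generic
#   sudo reboot
```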

@wangyumi

Okay, I will try. Thanks!
