Kernel Panic when container restart (docker-compose restart) on centos6.6 #14181

Closed
lostsnow opened this issue Jun 25, 2015 · 25 comments


@lostsnow

docker info

# docker info
Containers: 10
Images: 183
Storage Driver: devicemapper
 Pool Name: docker-253:2-12321001-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: extfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 4.128 GB
 Data Space Total: 107.4 GB
 Data Space Available: 103.2 GB
 Metadata Space Used: 8.471 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.139 GB
 Udev Sync Supported: true
 Data loop file: /opt/docker/docker/devicemapper/devicemapper/data
 Metadata loop file: /opt/docker/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.89-RHEL6 (2014-09-01)
Execution Driver: native-0.2
Kernel Version: 2.6.32-504.23.4.el6.x86_64
Operating System: <unknown>
CPUs: 16
Total Memory: 47.12 GiB
Name: serv-11-1-171
ID: LBVP:5RFY:ZHGG:HUDB:Y2TZ:TXSF:WCTO:UFLV:R2GX:IB4X:UB33:UD7Q

docker version

# docker version
Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 7c8fca2/1.6.2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 7c8fca2/1.6.2
OS/Arch (server): linux/amd64

vmcore-dmesg.txt: http://pastebin.com/Zs6QWLEB

<4>general protection fault: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/virtual/dmi/id/sys_vendor
<4>CPU 5 
<4>Modules linked in: tun veth ipt_addrtype dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc ipv6 ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables microcode ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support bnx2 cdc_ether usbnet mii serio_raw i2c_i801 i2c_core lpc_ich mfd_core sg ioatdma dca i7core_edac edac_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 3485, comm: docker Not tainted 2.6.32-504.23.4.el6.x86_64 #1 IBM System x3650 M3 -[7945Q4I]-/69Y5698     
<4>RIP: 0010:[<ffffffff8129ecb0>]  [<ffffffff8129ecb0>] list_del+0x10/0xa0
<4>RSP: 0018:ffff880c5b723dc8  EFLAGS: 00010092
<4>RAX: dead000000200200 RBX: ffff880c74de7558 RCX: 0000000000000010
<4>RDX: 0000000000000002 RSI: 0000000000000003 RDI: ffff880c74de7558
<4>RBP: ffff880c5b723dd8 R08: 0000000000000010 R09: 0000000000000000
<4>R10: 0000000000000000 R11: 0000000000000246 R12: ffff880c74de7540
<4>R13: ffff880c5b6cc918 R14: 0000000000000010 R15: 0000000000000000
<4>FS:  00007fce817fb700(0000) GS:ffff880695420000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>CR2: 00007fce817fad98 CR3: 0000000675116000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process docker (pid: 3485, threadinfo ffff880c5b722000, task ffff880c73868ab0)
<4>Stack:
<4> 00000067df538aad ffff880c74de7580 ffff880c5b723e08 ffffffff810cdc32
<4><d> 0000000100000000 ffff880c6f299738 0000000000000000 ffff880c6f299750
<4><d> ffff880c5b723e58 ffffffff81057839 ffff880c5b723f58 0000000300000001
<4>Call Trace:
<4> [<ffffffff810cdc32>] cgroup_event_wake+0x42/0x70
<4> [<ffffffff81057839>] __wake_up_common+0x59/0x90
<4> [<ffffffff8105bd68>] __wake_up+0x48/0x70
<4> [<ffffffff811daf8d>] eventfd_release+0x2d/0x40
<4> [<ffffffff8118fa45>] __fput+0xf5/0x210
<4> [<ffffffff8118fb85>] fput+0x25/0x30
<4> [<ffffffff8118addd>] filp_close+0x5d/0x90
<4> [<ffffffff8118aeb5>] sys_close+0xa5/0x100
<4> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
<4>Code: 01 01 01 01 01 48 0f af c2 48 c1 e8 38 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 <4c> 8b 00 4c 39 c7 75 39 48 8b 03 4c 8b 40 08 4c 39 c3 75 4c 48 
<1>RIP  [<ffffffff8129ecb0>] list_del+0x10/0xa0
<4> RSP <ffff880c5b723dc8>
@GordonTheTurtle

Hi!

Please read this important information about creating issues.

If you are reporting a new issue, make sure that we do not have any duplicates already open. You can ensure this by searching the issue list for this repository. If there is a duplicate, please close your issue and add a comment to the existing issue instead.

If you suspect your issue is a bug, please edit your issue description to include the BUG REPORT INFORMATION shown below. If you fail to provide this information within 7 days, we cannot debug your issue and will close it. We will, however, reopen it if you later provide the information.

This is an automated, informational response.

Thank you.

For more information about reporting issues, see https://github.com/docker/docker/blob/master/CONTRIBUTING.md#reporting-other-issues


BUG REPORT INFORMATION

Use the commands below to provide key information from your environment:

docker version:
docker info:
uname -a:

Provide additional environment details (AWS, VirtualBox, physical, etc.):

List the steps to reproduce the issue:
1.
2.
3.

Describe the results you received:

Describe the results you expected:

Provide additional info you think is important:

----------END REPORT ---------

#ENEEDMOREINFO

@cyphar
Contributor

cyphar commented Jun 25, 2015

I don't think we support kernels as old as 2.6.32-504.23.4.el6.x86_64. IIRC there were a bunch of issues with cgroups and namespaces that caused kernel panics in the old days. If you want, you can try to compile a more modern Linux kernel and see if the problem persists (and if you can give us a reproducible test case) [don't actually switch kernels on production, that's a bad idea].

In either case, this is a Linux kernel bug, caused either by some CentOS patch or by an upstream bug. Pop an email to the CentOS guys first to see if the problem is on their end; if it isn't, send an email to the stable kernel maintainers to see if someone can take a look at it.

They'll probably want a disassembly of the relevant kernel code (a general protection fault is caused by invalid memory accesses that violate protection policies of the CPU), as well as some more hardware-specific information.
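
For anyone preparing such a report from the vmcore-dmesg.txt posted at the top of this issue, here is a minimal sketch of pulling out the parts a kernel maintainer usually looks at first. The function names `oops_call_trace` and `is_list_poison` are hypothetical helpers, not part of any tool. The two hex values are, to my understanding, the x86_64 list-poison markers that list_del() writes into a removed entry; RAX in the trace above holds the second one, which would be consistent with the same entry being removed twice.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Print the symbols listed under "Call Trace:" in a saved dmesg/oops file.
oops_call_trace() {
    awk '
        /Call Trace:/ { in_trace = 1; next }
        # Trace entries look like: <4> [<ffffffff810cdc32>] symbol+0x42/0x70
        in_trace && /\[<[0-9a-f]+>\]/ { print $NF; next }
        # The first line that is not a trace entry ends the trace.
        in_trace { in_trace = 0 }
    ' "$@"
}

# Succeed if a register value (hex, no 0x prefix) is one of the two
# list-poison markers assumed here for x86_64 kernels.
is_list_poison() {
    case $1 in
        dead000000100100|dead000000200200) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (assuming the oops was saved as vmcore-dmesg.txt):
#   oops_call_trace vmcore-dmesg.txt
#   is_list_poison dead000000200200 && echo "RAX is a list-poison value"
```

Run against the trace above, the first helper should list the path from cgroup_event_wake down to system_call_fastpath, which is the chain worth quoting in a kernel bug report.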

@unclejack
Contributor

@cyphar Please don't recommend custom kernels on CentOS 6 and RHEL 6. They're not in any way supported.

This might indeed be a kernel issue and needs to be investigated. Marking as a kernel and CentOS issue.

@cyphar
Contributor

cyphar commented Jun 25, 2015

@unclejack The reason I was asking him to try with a more modern kernel is to see if this bug is present in modern kernels (or if it was fixed and has yet to be backported, or if it is a bug in a CentOS patch). I wouldn't dream of running custom kernels in production (especially on CentOS).

@visualphoenix

Agreed with @unclejack - I'm sure you were trying to be helpful, but the docker team has only ever attempted to support official RHEL/CentOS kernels.

2.6.32-504.23.4.el6.x86_64 is an official security update from RH: https://rhn.redhat.com/errata/RHSA-2015-1081.html

See: https://github.com/docker/docker/blob/release/docs/installation/centos.md and https://github.com/docker/docker/blob/release/docs/installation/rhel.md

Thanks @unclejack for marking this for investigation.

@lostsnow
Author

I downgraded the kernel to 2.6.32-504.16.2.el6.x86_64, and it seems to work fine.

@cyphar
Contributor

cyphar commented Jun 26, 2015

@lostsnow Okay, so I've taken a look at the changelog. Here is the list of patches applied during that period:

* Tue Jun 09 2015 Johnny Hughes <johnny@centos.org> [2.6.32-504.23.4.el6]
  - Roll in CentOS Branding
* Fri May 29 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.23.4.el6]
  - [crypto] drbg: fix maximum value checks on 32 bit systems (Herbert Xu) [1225950 1219907]
  - [crypto] drbg: remove configuration of fixed values (Herbert Xu) [1225950 1219907]
* Tue May 19 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.23.3.el6]
  - [netdrv] bonding: fix locking in enslave failure path (Nikolay Aleksandrov) [1222483 1221856]
  - [netdrv] bonding: primary_slave & curr_active_slave are not cleaned on enslave failure (Nikolay Aleksandrov) [1222483 1221856]
  - [netdrv] bonding: vlans don't get deleted on enslave failure (Nikolay Aleksandrov) [1222483 1221856]
  - [netdrv] bonding: mc addresses don't get deleted on enslave failure (Nikolay Aleksandrov) [1222483 1221856]
  - [netdrv] bonding: IFF_BONDING is not stripped on enslave failure (Nikolay Aleksandrov) [1222483 1221856]
  - [netdrv] bonding: fix error handling if slave is busy v2 (Nikolay Aleksandrov) [1222483 1221856]
* Thu May 07 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.23.2.el6]
  - [fs] pipe: fix pipe corruption and iovec overrun on partial copy (Seth Jennings) [1202860 1185166] {CVE-2015-1805}
* Thu May 07 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.23.1.el6]
  - [x86] crypto: sha256_ssse3 - fix stack corruption with SSSE3 and AVX implementations (Herbert Xu) [1218681 1201490]
  - [scsi] storvsc: ring buffer failures may result in I/O freeze (Vitaly Kuznetsov) [1215754 1171676]
  - [scsi] storvsc: get rid of overly verbose warning messages (Vitaly Kuznetsov) [1215753 1167967]
  - [scsi] storvsc: NULL pointer dereference fix (Vitaly Kuznetsov) [1215753 1167967]
  - [netdrv] ixgbe: fix detection of SFP+ capable interfaces (John Greene) [1213664 1150343]
  - [x86] crypto: aesni - fix memory usage in GCM decryption (Kurt Stutsman) [1213329 1213330] {CVE-2015-3331}
* Mon Apr 20 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.22.1.el6]
  - [kernel] hrtimer: Prevent hrtimer_enqueue_reprogram race (Prarit Bhargava) [1211940 1136958]
  - [kernel] hrtimer: Preserve timer state in remove_hrtimer() (Prarit Bhargava) [1211940 1136958]
  - [crypto] testmgr: fix RNG return code enforcement (Herbert Xu) [1212695 1208804]
  - [net] netfilter: xtables: make use of caller family rather than target family (Florian Westphal) [1212057 1210697]
  - [net] dynticks: avoid flow_cache_flush() interrupting every core (Marcelo Leitner) [1210595 1191559]
  - [tools] perf: Fix race in build_id_cache__add_s() (Milos Vyletel) [1210593 1204102]
  - [infiniband] ipath+qib: fix dma settings (Doug Ledford) [1208621 1171803]
  - [fs] dcache: return -ESTALE not -EBUSY on distributed fs race (J. Bruce Fields) [1207815 1061994]
  - [net] neigh: Keep neighbour cache entries if number of them is small enough (Jiri Pirko) [1207352 1199856]
  - [x86] crypto: sha256_ssse3 - also test for BMI2 (Herbert Xu) [1204736 1201560]
  - [scsi] qla2xxx: fix race in handling rport deletion during recovery causes panic (Chad Dupuis) [1203544 1102902]
  - [redhat] configs: Enable SSSE3 acceleration by default (Herbert Xu) [1201668 1036216]
  - [crypto] sha512: Create module providing optimized SHA512 routines using SSSE3, AVX or AVX2 instructions (Herbert Xu) [1201668 1036216]
  - [crypto] sha512: Optimized SHA512 x86_64 assembly routine using AVX2 RORX instruction (Herbert Xu) [1201668 1036216]
  - [crypto] sha512: Optimized SHA512 x86_64 assembly routine using AVX instructions (Herbert Xu) [1201668 1036216]
  - [crypto] sha512: Optimized SHA512 x86_64 assembly routine using Supplemental SSE3 instructions (Herbert Xu) [1201668 1036216]
  - [crypto] sha512: Expose generic sha512 routine to be callable from other modules (Herbert Xu) [1201668 1036216]
  - [crypto] sha256: Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions (Herbert Xu) [1201668 1036216]
  - [crypto] sha256: Optimized sha256 x86_64 routine using AVX2's RORX instructions (Herbert Xu) [1201668 1036216]
  - [crypto] sha256: Optimized sha256 x86_64 assembly routine with AVX instructions (Herbert Xu) [1201668 1036216]
  - [crypto] sha256: Optimized sha256 x86_64 assembly routine using Supplemental SSE3 instructions (Herbert Xu) [1201668 1036216]
  - [crypto] sha256: Expose SHA256 generic routine to be callable externally (Herbert Xu) [1201668 1036216]
  - [crypto] rng: RNGs must return 0 in success case (Herbert Xu) [1201669 1199230]
  - [fs] isofs: infinite loop in CE record entries (Jacob Tanenbaum) [1175243 1175245] {CVE-2014-9420}
  - [x86] vdso: ASLR bruteforce possible for vdso library (Jacob Tanenbaum) [1184896 1184897] {CVE-2014-9585}
  - [kernel] time: ntp: Correct TAI offset during leap second (Prarit Bhargava) [1201674 1199134]
  - [scsi] lpfc: correct device removal deadlock after link bounce (Rob Evers) [1211910 1194793]
  - [scsi] lpfc: Linux lpfc driver doesn't re-establish the link after a cable pull on LPe12002 (Rob Evers) [1211910 1194793]
  - [x86] switch_to(): Load TLS descriptors before switching DS and ES (Denys Vlasenko) [1177353 1177354] {CVE-2014-9419}
  - [net] vlan: Don't propagate flag changes on down interfaces (Jiri Pirko) [1173501 1135347]
  - [net] bridge: register vlan group for br ports (Jiri Pirko) [1173501 1135347]
  - [netdrv] tg3: Use new VLAN code (Jiri Pirko) [1173501 1135347]
  - [netdrv] be2net: move to new vlan model (Jiri Pirko) [1173501 1135347]
  - [net] vlan: mask vlan prio bits (Jiri Pirko) [1173501 1135347]
  - [net] vlan: don't deliver frames for unknown vlans to protocols (Jiri Pirko) [1173501 1135347]
  - [net] vlan: allow nested vlan_do_receive() (Jiri Pirko) [1173501 1135347]
  - [net] allow vlan traffic to be received under bond (Jiri Pirko) [1173501 1135347]
  - [net] vlan: goto another_round instead of calling __netif_receive_skb (Jiri Pirko) [1173501 1135347]
  - [net] bonding: fix bond_arp_rcv setting and arp validate desync state (Jiri Pirko) [1173501 1135347]
  - [net] bonding: remove packet cloning in recv_probe() (Jiri Pirko) [1173501 1135347]
  - [net] bonding: Fix LACPDU rx_dropped commit (Jiri Pirko) [1173501 1135347]
  - [net] bonding: don't increase rx_dropped after processing LACPDUs (Jiri Pirko) [1173501 1135347]
  - [net] bonding: use local function pointer of bond->recv_probe in bond_handle_frame (Jiri Pirko) [1173501 1135347]
  - [net] bonding: move processing of recv handlers into handle_frame() (Jiri Pirko) [1173501 1135347]
  - [netdrv] revert "bonding: fix bond_arp_rcv setting and arp validate desync state" (Jiri Pirko) [1173501 1135347]
  - [netdrv] revert "bonding: check for vlan device in bond_3ad_lacpdu_recv()" (Jiri Pirko) [1173501 1135347]
  - [net] vlan: Always untag vlan-tagged traffic on input (Jiri Pirko) [1173501 1135347]
  - [net] Make skb->skb_iif always track skb->dev (Jiri Pirko) [1173501 1135347]
  - [net] vlan: fix a potential memory leak (Jiri Pirko) [1173501 1135347]
  - [net] vlan: fix mac_len recomputation in vlan_untag() (Jiri Pirko) [1173501 1135347]
  - [net] vlan: reset headers on accel emulation path (Jiri Pirko) [1173501 1135347]
  - [net] vlan: Fix the ingress VLAN_FLAG_REORDER_HDR check (Jiri Pirko) [1173501 1135347]
  - [net] vlan: make non-hw-accel rx path similar to hw-accel (Jiri Pirko) [1173501 1135347]
  - [net] allow handlers to be processed for orig_dev (Jiri Pirko) [1173501 1135347]
  - [net] bonding: get netdev_rx_handler_unregister out of locks (Jiri Pirko) [1173501 1135347]
  - [net] bonding: fix rx_handler locking (Jiri Pirko) [1173501 1135347]
  - [net] introduce rx_handler results and logic around that (Jiri Pirko) [1173501 1135347]
  - [net] bonding: register slave pointer for rx_handler (Jiri Pirko) [1173501 1135347]
  - [net] bonding: COW before overwriting the destination MAC address (Jiri Pirko) [1173501 1135347]
  - [net] bonding: convert bonding to use rx_handler (Jiri Pirko) [1173501 1135347]
  - [net] openvswitch: use rx_handler_data pointer to store vport pointer (Jiri Pirko) [1173501 1135347]
  - [net] add a synchronize_net() in netdev_rx_handler_unregister() (Jiri Pirko) [1173501 1135347]
  - [net] add rx_handler data pointer (Jiri Pirko) [1173501 1135347]
  - [net] replace hooks in __netif_receive_skb (Jiri Pirko) [1173501 1135347]
  - [net] fix conflict between null_or_orig and null_or_bond (Jiri Pirko) [1173501 1135347]
  - [net] remove the unnecessary dance around skb_bond_should_drop (Jiri Pirko) [1173501 1135347]
  - [net] revert "bonding: fix receiving of dups due vlan hwaccel" (Jiri Pirko) [1173501 1135347]
  - [net] uninline skb_bond_should_drop() (Jiri Pirko) [1173501 1135347]
  - [net] bridge: Set vlan_features to allow offloads on vlans (Jiri Pirko) [1173501 1135347]
  - [net] bridge: convert br_features_recompute() to ndo_fix_features (Jiri Pirko) [1173501 1135347]
  - [net] revert "bridge: explictly tag vlan-accelerated frames destined to the host" (Jiri Pirko) [1173501 1135347]
  - [net] revert "fix vlan gro path" (Jiri Pirko) [1173501 1135347]
  - [net] revert "bridge: do not learn from exact matches" (Jiri Pirko) [1173501 1135347]
  - [net] revert "bridge gets duplicate packets when using vlan over bonding" (Jiri Pirko) [1173501 1135347]
  - [net] llc: remove noisy WARN from llc_mac_hdr_init (Jiri Pirko) [1173501 1135347]
  - [net] bridge: stp: ensure mac header is set (Jiri Pirko) [1173501 1135347]
  - [net] vlan: remove reduntant check in ndo_fix_features callback (Jiri Pirko) [1173501 1135347]
  - [net] vlan: enable soft features regardless of underlying device (Jiri Pirko) [1173501 1135347]
  - [net] vlan: don't call ndo_vlan_rx_register on hardware that doesn't have vlan support (Jiri Pirko) [1173501 1135347]
  - [net] vlan: Fix vlan_features propagation (Jiri Pirko) [1173501 1135347]
  - [net] vlan: convert VLAN devices to use ndo_fix_features() (Jiri Pirko) [1173501 1135347]
  - [net] revert "vlan: Avoid broken offload configuration when reorder_hdr is disabled" (Jiri Pirko) [1173501 1135347]
  - [net] vlan: vlan device is lockless do not transfer real_num_<tx|rx>_queues (Jiri Pirko) [1173501 1135347]
  - [net] vlan: consolidate 8021q tagging (Jiri Pirko) [1173501 1135347]
  - [net] propagate NETIF_F_HIGHDMA to vlans (Jiri Pirko) [1173501 1135347]
  - [net] Fix a memmove bug in dev_gro_receive() (Jiri Pirko) [1173501 1135347]
  - [net] vlan: remove check for headroom in vlan_dev_create (Jiri Pirko) [1173501 1135347]
  - [net] vlan: set hard_header_len when VLAN offload features are toggled (Jiri Pirko) [1173501 1135347]
  - [net] vlan: Calling vlan_hwaccel_do_receive() is always valid (Jiri Pirko) [1173501 1135347]
  - [net] vlan: Centralize handling of hardware acceleration (Jiri Pirko) [1173501 1135347]
  - [net] vlan: finish removing vlan_find_dev from public header (Jiri Pirko) [1173501 1135347]
  - [net] vlan: make vlan_find_dev private (Jiri Pirko) [1173501 1135347]
  - [net] vlan: Avoid hash table lookup to find group (Jiri Pirko) [1173501 1135347]
  - [net] revert "vlan: Add helper functions to manage vlans on bonds and slaves" (Jiri Pirko) [1173501 1135347]
  - [net] revert "bonding: assign slaves their own vlan_groups" (Jiri Pirko) [1173501 1135347]
  - [net] revert "bonding: fix regression on vlan module removal" (Jiri Pirko) [1173501 1135347]
  - [net] revert "bonding: Always add vid to new slave group" (Jiri Pirko) [1173501 1135347]
  - [net] revert "bonding: Fix up refcounting issues with bond/vlan config" (Jiri Pirko) [1173501 1135347]
  - [net] revert "8021q/vlan: filter device events on bonds" (Jiri Pirko) [1173501 1135347]
  - [net] vlan: Use vlan_dev_real_dev in vlan_hwaccel_do_receive (Jiri Pirko) [1173501 1135347]
  - [net] gro: __napi_gro_receive() optimizations (Jiri Pirko) [1173501 1135347]
  - [net] vlan: Rename VLAN_GROUP_ARRAY_LEN to VLAN_N_VID (Jiri Pirko) [1173501 1135347]
  - [net] vlan: make vlan_hwaccel_do_receive() return void (Jiri Pirko) [1173501 1135347]
  - [net] vlan: init_vlan should not copy slave or master flags (Jiri Pirko) [1173501 1135347]
  - [net] vlan: updates vlan real_num_tx_queues (Jiri Pirko) [1173501 1135347]
  - [net] vlan: adds vlan_dev_select_queue (Jiri Pirko) [1173501 1135347]
  - [net] llc: use dev_hard_header (Jiri Pirko) [1173501 1135347]
  - [net] vlan: support "loose binding" to the underlying network device (Jiri Pirko) [1173501 1135347]
  - [net] revert "net: don't set VLAN_TAG_PRESENT for VLAN 0 frames" (Jiri Pirko) [1173501 1135347]
  - [net] bridge: Add support for TX vlan offload (Jiri Pirko) [1173562 1146391]
  - [net] revert "bridge: Set vlan_features to allow offloads on vlans" (Vlad Yasevich) [1144442 1121991]
* Tue Apr 14 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.21.1.el6]
  - [netdrv] ixgbe: Fix memory leak in ixgbe_free_q_vector, missing rcu (John Greene) [1210901 1150343]
  - [netdrv] ixgbe: Fix tx_packets and tx_bytes stats not updating (John Greene) [1210901 1150343]
  - [netdrv] qlcnic: Fix update of ethtool stats (Chad Dupuis) [1210902 1148019]
* Fri Apr 10 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.20.1.el6]
  - [fs] exec: do not abuse ->cred_guard_mutex in threadgroup_lock() (Petr Oros) [1208620 1169225]
  - [kernel] cgroup: always lock threadgroup during migration (Petr Oros) [1208620 1169225]
  - [kernel] threadgroup: extend threadgroup_lock() to cover exit and exec (Petr Oros) [1208620 1169225]
  - [kernel] threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem (Petr Oros) [1208620 1169225]
* Thu Mar 26 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.19.1.el6]
  - [mm] memcg: fix crash in re-entrant cgroup_clear_css_refs() (Johannes Weiner) [1204626 1168185]
* Thu Mar 19 2015 Frantisek Hrbata <fhrbata@redhat.com> [2.6.32-504.18.1.el6]
  - [fs] cifs: Use key_invalidate instead of the rh_key_invalidate() (Sachin Prabhu) [1203366 885899]
  - [fs] KEYS: Add invalidation support (Sachin Prabhu) [1203366 885899]
  - [infiniband] core: Prevent integer overflow in ib_umem_get address arithmetic (Doug Ledford) [1181173 1179327] {CVE-2014-8159}
* Wed Mar 11 2015 Frantisek Hrbata <fhrbata@redhat.com> [2.6.32-504.17.1.el6]
  - [x86] fpu: shift clear_used_math() from save_i387_xstate() to handle_signal() (Oleg Nesterov) [1199900 1196262]
  - [x86] fpu: change save_i387_xstate() to rely on unlazy_fpu() (Oleg Nesterov) [1199900 1196262]
* Mon Mar 09 2015 Frantisek Hrbata <fhrbata@redhat.com> [2.6.32-504.16.1.el6]

Is it possible for you to see if you can reproduce on 2.6.32-504.22.*.el6.x86_64? That might help narrow down which kernel patch caused this bug.

@pmyjavec

Hey all,

I can also confirm that running the latest kernel (kernel.x86_64 0:2.6.32-504.23.4.el6) on CentOS 6 introduces this behaviour. Booting older kernels resolves the problem instantly.

@cyphar
Contributor

cyphar commented Jun 26, 2015

@pmyjavec What is the version of the "older kernel" you booted from?

@pmyjavec

Hello @cyphar,

The working version is 2.6.32-504.16.2.el6.x86_64. Sorry, I had it wrong the first time I posted this.

@lostsnow
Author

@cyphar I cannot find a kernel RPM for 2.6.32-504.22.* (http://mirror.centos.org/centos/6/centosplus/x86_64/Packages/). Can someone provide it?

@cyphar
Contributor

cyphar commented Jun 26, 2015

Sorry, that means that they didn't actually release it (weird). Anyway, so that means we'll have to git bisect with the entire changeset. I'll take a look at this. Can this be reproduced on a stock CentOS 6.6 install?

@jophofste

My install was:


kernel: 
2.6.32-504.23.4.el6.x86_64
docker version:
Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 7c8fca2/1.6.2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 7c8fca2/1.6.2
OS/Arch (server): linux/amd64

It generates panics.

First I removed:
rpm -e kernel-2.6.32-504.23.4.el6.x86_64
rpm -e kernel-firmware-2.6.32-504.23.4.el6.noarch

And I installed:
rpm -ivh kernel-2.6.32-504.8.1.el6.centos.plus.x86_64.rpm
rpm -ivh kernel-firmware-2.6.32-504.8.1.el6.centos.plus.noarch.rpm

The 2.6.32-504.16.2 version did not fix it, but 2.6.32-504.8.1 does.

@cyphar
Contributor

cyphar commented Jun 26, 2015

@jophofste Can you check if kernel-2.6.32-504.12.2.el6.centos.plus.x86_64.rpm fixes it? I'd prefer to do as much pre-compiled bisecting before we have to manually bisect individual patches (given that we don't have the source tree as a git repo).
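
To make the "pre-compiled bisecting" idea concrete, here is a rough sketch of narrowing a list of available kernel builds down to the ones worth testing between a known-good and a known-bad release. The function name `builds_between` is made up for illustration, and it assumes GNU `sort -V` is available, since plain lexical sorting would place 504.16 before 504.8.

```shell
#!/bin/sh
# Sketch only: builds_between is a hypothetical helper.
# Reads candidate kernel versions on stdin (one per line) and prints those
# that sort strictly between the known-good and known-bad versions.
builds_between() {
    good=$1
    bad=$2
    # Add both bounds to the list so they always appear in the sorted
    # output, then print only the lines that fall between them.
    { printf '%s\n%s\n' "$good" "$bad"; cat; } | sort -V -u |
        awk -v good="$good" -v bad="$bad" '
            $0 == bad  { inside = 0 }
            inside     { print }
            $0 == good { inside = 1 }
        '
}

# Usage, with versions mentioned in this thread:
#   printf '%s\n' 2.6.32-504.12.2.el6 2.6.32-504.16.2.el6 |
#       builds_between 2.6.32-504.8.1.el6 2.6.32-504.23.4.el6
```

Testing the middle entry of the output first halves the remaining range each time, which is the same idea git bisect applies to commits.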

@hrunting

@cyphar I've been running the EPEL provided docker-io-1.5.0-1 RPMs with kernel-2.6.32-504.16.2 for weeks without any issues. Upon upgrading to kernel-2.6.32-504.23.4, I immediately started seeing panics on container shutdown. I think the problem is between 16.2 and 23.4.

I'd install the latest version of docker from the docker-provided RPMs, but they have been built poorly and expect 32-bit packages to be installed that conflict with 64-bit packages. There's another issue opened for this.

@cyphar
Contributor

cyphar commented Jun 30, 2015

@hrunting Since we're seeing a kernel panic, I don't expect an updated Docker version to fix this problem (even if it does, this looks like a kernel bug proper to me). I'm still trying to figure out a nice way of bisecting the CentOS kernel tree. I'll get back to you on this.

@smerrill
Contributor

smerrill commented Jul 1, 2015

I’d like to start this thread with a heartfelt thanks to everyone in the Docker and Red Hat communities who have worked to bring this awesome project to EL6 and maintain it there.

Here’s the result of my research on this issue.

The most recent RHEL 6.6 kernel version (kernel-2.6.32-504.23.4.el6) has a regression in its handling of cgroups which will often cause kernel panics when used with applications like Apache Mesos, cgroup_monitor, and Docker. In my experience, starting the Docker daemon will nearly immediately panic the system. See https://bugs.centos.org/view.php?id=7538 for an example bug about a different application that uses the cgroups subsystem. The fix was committed to the mainline kernel in 2013, and Red Hat has also reportedly confirmed that the fix will be in the RHEL 6.7 mainline kernel version kernel-2.6.32-564.el6 when that comes out.

At this point if you would like to use Docker on EL6, you’ve got 2 major options:

  • Hold your machines back to kernel-2.6.32-504.16.2.el6 and wait for RHEL 6.7 to come out. You will miss out on fixes to 5 CVEs by doing so, which is not ideal.
  • If you are on CentOS and willing to run a patched CentOS kernel, CentOS Plus kernel-2.6.32-504.23.4.el6 will likely fix your problem. It has included the LKML patch to fix this issue for several versions now and it is based on the most recent RHEL kernel release, so those CVE fixes are included. You will hopefully be able to get back on the mainline kernel once kernel-2.6.32-564.el6 is released.

My results are not guaranteed to be conclusive for your workload, but I’ve had very good results running Docker 1.6.2 using the CentOS Plus kernel in a test environment.
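
As a rough illustration of the two options above, the sketch below checks whether a host is actually running a CentOS Plus kernel build. The helper name `is_centosplus_kernel` is hypothetical, and the yum lines in the comments assume the stock CentOS 6 repository id `centosplus`; treat them as a starting point rather than a tested procedure.

```shell
#!/bin/sh
# Sketch only: is_centosplus_kernel is a hypothetical helper.
# Succeeds if the given kernel release string (default: the running
# kernel) looks like a CentOS Plus build, e.g.
# 2.6.32-504.23.4.el6.centos.plus.x86_64.
is_centosplus_kernel() {
    release=${1:-$(uname -r)}
    case $release in
        *.centos.plus.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Option 1 (hold the kernel back), assuming a default yum setup:
#   add "exclude=kernel*" to the [main] section of /etc/yum.conf
# Option 2 (CentOS Plus kernel), assuming the repo id is "centosplus":
#   yum --enablerepo=centosplus install kernel
#
# After rebooting:
#   is_centosplus_kernel && echo "running a CentOS Plus kernel"
```

Checking the release string after the reboot matters because installing a kernel package does not by itself guarantee that it is the one the boot loader picks.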

As part of my research on the CentOS Plus kernel I compared its spec file to the mainline CentOS kernel, and here’s the list of patches applied. They appear to primarily be backported race fixes.

+# centos addition
+Source100: kernel-2.6.32-i686.config
+Source101: kernel-2.6.32-x86_64.config
+
+# without the following, direct rpmbuild works but mock build does not. -ay
+Patch30002: centos-linux-2.6-bonding-fix-802.3ad.patch
+Patch30005: centos-linux-2.6-jfs-bug5453.patch
+Patch30006: centos-linux-2.6-tomoyo-fix-race-on-updating-profile-comment-bug5378.patch
+Patch30007: centos-linux-2.6-tomoyo-use-UMH_WAIT_PROC-constant-bug5588.patch
+Patch30010: centos-linux-2.6-hid-non-LogiTech-remote-bug5780.patch
+Patch30018: centos-linux-2.6-sysfs-fix-printk-warnings-bug6157.patch
+Patch30026: centos-linux-2.6-fs-tmpfs-add-xattrs-support-bug4586.18700.patch
+Patch30027: centos-linux-2.6-fix-fadvise-for-tmpfs-bug6938.patch
+Patch30031: centos-linux-2.6-dm9601-bug7270.patch
+#Patch30033: centos-linux-2.6-hrtimer-fix-race-bug7051.patch
+Patch30034: centos-linux-2.6-fix-cgroup-close-race-bug7538.patch
+Patch30035: centos-linux-2.6-perf-bench-numa-warnings-bug7882.patch
+# end of centos addition

@cyphar
Contributor

cyphar commented Jul 1, 2015

@smerrill Thanks for doing the bisect, you're a much stronger man than I.

@smerrill
Contributor

smerrill commented Jul 1, 2015

Thankfully, no bisect was needed. When I found that other CentOS bug about a cgroups panic, it pointed me to the patch to fix it (and the reported inclusion of the fix in RHEL 6.7).

Also interestingly, it looks like this bug has actually been around for a few RHEL kernel releases, but Docker seems to trigger it instantly with the newest kernel release.

@lostsnow
Author

lostsnow commented Jul 2, 2015

Good! 2.6.32-504.23.4.el6.centos.plus.x86_64 fixes this problem.

# uname -a
Linux serv-11-1-171 2.6.32-504.23.4.el6.centos.plus.x86_64 #1 SMP Wed Jun 10 13:09:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

thanks @smerrill

@tuesdaythefifth

Says here 6.7 has been released. Can anyone confirm the problem is fixed?

https://access.redhat.com/articles/3078#RHEL6
RHEL 6 Update 7 2015-07-22 2015-07-22 RHEA-2015:1423 2.6.32-573

@jophofste

@cyphar The version you proposed fixes it. It will do as a workaround until we upgrade to CentOS 6.7.

@AllYourBase

Today we upgraded two lower environment VMs to CentOS 6.7 kernel 2.6.32-573.3.1.el6 and that seems to have solved the kernel panics we were getting with docker restart on CentOS 6.6 kernel 2.6.32-504.30.3.el6.

I have been running a loop to stress the system:

while true; do date; docker restart container; sleep 10; done

It's been looping for over 20 minutes without panics. On CentOS 6.6 this loop would cause kernel panics after a few cycles. Fingers crossed that this may actually be solved.


UPDATE: I stopped that loop test after running it continuously for almost 4 hrs. No kernel panics.
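
For anyone who wants to repeat this test with a fixed number of cycles rather than an endless loop, a bounded variant might look like the sketch below. The function name `restart_stress` and its defaults are assumptions made for illustration; only `docker restart` comes from the loop above. Note that a kernel panic takes the whole machine down, so a panic shows up as the run never finishing, not as an error message.

```shell
#!/bin/sh
# Sketch only: restart_stress is a hypothetical helper.
# Restart a container a fixed number of times, pausing between cycles,
# and stop early if docker itself reports a failure.
restart_stress() {
    container=$1
    iterations=${2:-10}   # number of restart cycles
    delay=${3:-10}        # seconds to sleep between cycles
    i=1
    while [ "$i" -le "$iterations" ]; do
        date
        if ! docker restart "$container"; then
            echo "restart $i of $container failed" >&2
            return 1
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "completed $iterations restarts of $container"
}

# Usage: roughly 20 minutes of restarts at 10-second intervals
#   restart_stress container 120 10
```

Printing the date on every cycle, as in the original loop, leaves a timestamp of the last successful restart on the console if the host does go down.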

@thaJeztah
Member

Got confirmation in another issue that 2.6.32-573.1.1.el6 resolved the panics: #14033 (comment)

@unclejack
Contributor

This issue is the same as #15057. Please keep your systems fully updated to get fixes such as this one.

Please keep in mind that CentOS 6 and RHEL 6 are unsupported with Docker. An upgrade to CentOS 7 is recommended.
