
file system corruption with devicemapper/aufs #7229

Closed
stevenschlansker opened this issue Jul 25, 2014 · 30 comments

@stevenschlansker

I am filing this issue to provide a single place to document what I see as an extremely severe problem: when using Docker, your container filesystems and/or files sometimes become seriously corrupted.

There are at least two mailing list threads pointing to the problem:
https://groups.google.com/forum/#!topic/docker-dev/Xzwm5GRYCLo
https://groups.google.com/forum/#!topic/docker-user/35Z0-g8sObY

And one previously filed issue (now closed) #6368

The problem manifests itself in a few different ways. It is unclear to me if these are all related.

Known symptoms:

  • Corrupt files (docker run myimage md5sum /some/file returns different results on different hosts; see the sketch after this list)
  • Corrupt filesystems, example messages:
    • EXT4-fs error (device dm-2): ext4_lookup:1437: inode #132310: comm docker: deleted inode referenced: 134547
    • [271394.160211] EXT4-fs warning (device dm-10): ext4_end_bio:317: I/O error writing to inode 402137 (offset 0 size 0 starting block 169271)
    • [271394.160214] Buffer I/O error on device dm-10, logical block 169271
    • htree_dirblock_to_tree: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
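
A minimal sketch of the first check, assuming SSH access to the hosts being compared; host1..host3, myimage, and /some/file are placeholders:

#!/bin/sh
# Compare the checksum of the same file, from the same image, across hosts.
for h in host1 host2 host3; do
  printf '%s: ' "$h"
  ssh "$h" docker run myimage md5sum /some/file
done
# Every line should print the same checksum; a mismatch is the corruption
# symptom described in the first bullet above.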

So far this has only been reported against the devicemapper and aufs backends, not btrfs.
It is possible that this is a kernel bug, but since it is so disastrous to Docker workloads, I believe it deserves an issue here regardless of the root cause.

I have verified that this happens against Ubuntu 14.04 3.13.0-24-generic with Docker 1.0.0.

@tve

tve commented Jul 25, 2014

Thanks; creating a ticket was on my to-do list, but I haven't had the time to put together a reproducible, non-private set-up. BTW, switching to btrfs was the fix for me. Have you tried that, or have you seen issues with btrfs too?

@stevenschlansker
Author

I haven't tried btrfs. We are already drowning in experimental code, and trying to convince everyone to add a beta-quality filesystem on top of that may be a stretch. We also run in AWS, which means that provisioning btrfs adds some operational concerns (right now filesystems are automagically created as ext4, and there is no natural time to change the filesystem type of ephemeral storage without blocking the boot).

@unclejack
Contributor

The problems you're running into aren't representative of Docker in general; otherwise there would be dozens of issues like this one, and this issue would have hundreds of comments.

Please make sure you're running the latest version of the kernel packages provided by your Linux distribution's repositories. They usually provide fixes for severe bugs, and keeping your kernel packages up to date is a good idea.

File system corruption and problems with devicemapper are usually a kernel problem.

If you can reproduce this on a newly installed instance, could you please provide the steps to do so?

@unclejack unclejack changed the title Docker has severe issues with corrupt filesystems and files file system corruption with devicemapper/aufs Jul 29, 2014
@stevenschlansker
Author

@unclejack: I agree it is possible that this is a kernel bug; however, given that Docker relies heavily on these storage backends in a way that few (if any?) other applications do, it is representative of Docker. Doubly so because these are the default backends, and switching away (to btrfs) is not something most users can be expected to do.

Unfortunately, I have no deterministic way to reproduce the issue. The problem only happens sometimes after hundreds of container start/stop cycles, and our specific containers have proprietary code / data that I cannot share.
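
For what it's worth, a minimal non-proprietary stress loop along those lines, assuming a generic image such as busybox; the cycle count is arbitrary:

#!/bin/sh
# Churn short-lived containers and watch the kernel log for ext4 errors.
for i in $(seq 1 500); do
  docker run busybox true || echo "cycle $i: docker run failed"
  if dmesg | grep -qE 'EXT4-fs (error|warning)'; then
    echo "ext4 error in kernel log after $i cycles"
    break
  fi
done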

@tve could you share more about the setup that you ran into this issue on?

@stevenschlansker
Author

I have reproduced this problem again on 3.13.0-32-generic on docker 1.1.2 with the devicemapper backend:

[468349.814410] EXT4-fs error (device dm-3): ext4_lookup:1437: inode #531686: comm docker: deleted inode referenced: 532078
[505178.902469] EXT4-fs error (device dm-3): ext4_find_dest_de:1648: inode #525570: block 2105545: comm docker: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
[507776.149988] EXT4-fs (dm-7): ext4_check_descriptors: Checksum for group 0 failed (8702!=0)
[507776.150000] EXT4-fs (dm-7): group descriptors corrupted!

@vbatts
Contributor

vbatts commented Aug 15, 2014

@stevenschlansker And this is all on AWS? And you said you see this with AUFS too?

@stevenschlansker
Author

@vbatts AWS, yes. After seeing how easily Docker reverts to devicemapper instead of AUFS (if the kernel module fails to load for any reason, it silently falls back), I'm not as convinced that it affects AUFS as well, but it definitely affects devicemapper.
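
A quick way to check for that silent fallback (a sketch; the exact docker info labels may vary by version):

# Confirm which storage backend the daemon actually selected.
docker info | grep 'Storage Driver'
# Check whether the aufs kernel module is loaded at all.
lsmod | grep aufs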

@vbatts
Contributor

vbatts commented Aug 15, 2014

I'm seeing a conversation around a commit for Linux 3.9 that addresses a similar issue: https://groups.google.com/forum/#!msg/linux.kernel/MCs6mF51om8/9KiGcQTEjWYJ

I am inclined to think this is an ext4-related issue, which could affect aufs too and would explain why btrfs seems exempt.

But I am puzzled as to why AWS in particular seems to trigger the issue for you.

@stevenschlansker
Author

I can't promise that it's AWS-related; that is the only environment in which I run Docker under heavy use. Maybe @tve, who also reported this issue, could describe his environment more?

@tve

tve commented Aug 16, 2014

Also AWS. Please refer to https://groups.google.com/forum/#!topic/docker-dev/Xzwm5GRYCLo for additional details. I'm happy to provide more. I will try to repro next week (at least it's on my to-do list) without proprietary software.

@unclejack
Contributor

@vbatts I've managed to reproduce this with kernel 3.15 outside of AWS. I need to try to reproduce this on bare metal.

@judemight

I'm not sure if this will help or not, but I am having a similar issue; the dmesg output is posted at https://groups.google.com/forum/#!topic/docker-user/mcGLOta9ric

I had a similar problem before; that time I removed all my containers, uninstalled Docker, and reinstalled. Is there any way to recover from this issue without losing my containers?

@phildougherty

We are having similar issues on AWS, Ubuntu 14.04 x64, with kernel 3.13.0-24-generic. We're using an ext4 filesystem. This only seems to happen after we launch 40-50 containers on an instance.

@kingyueyang

I'm also having this issue, on CentOS 6.5 (2.6.32-431.el6.x86_64) with Docker 1.1.2, using ext4 too.
When docker pull registry timed out, I tried to rmi all the images and retry the pull:
2014/09/11 11:08:30 Error pulling image (latest) from registry, Driver devicemapper failed to create image rootfs 8c3e8f86ecaf21f9eec0aaf505472ac9f4588f0ed96779e981c80e003d47a6b0: device 8c3e8f86ecaf21f9eec0aaf505472ac9f4588f0ed96779e981c80e003d47a6b0 already exists

Then I ran rm -rf /var/lib/docker/devicemapper/:
rm: cannot remove `devicemapper/mnt/163338ae91b205330bc8930d94ca5c8e134effbaeb004b10eec864db4af61066/rootfs/usr/share/man/man3/CPU_ALLOC_SIZE.3.gz': Input/output error
rm: cannot remove `devicemapper/mnt/163338ae91b205330bc8930d94ca5c8e134effbaeb004b10eec864db4af61066/rootfs/usr/share/man/man3/DH_new.3ssl.gz': Input/output error
...

# ll devicemapper/mnt/163338ae91b205330bc8930d94ca5c8e134effbaeb004b10eec864db4af61066/rootfs/usr/share/man/man3
l????????? ? ? ? ? ? rawmemchr.3.gz
-????????? ? ? ? ? ? rc4.3ssl.gz
-????????? ? ? ? ? ? rcmd.3.gz
-????????? ? ? ? ? ? re_comp.3.gz
l????????? ? ? ? ? ? re_exec.3.gz
-????????? ? ? ? ? ? readdir.3.gz
l????????? ? ? ? ? ? readdir_r.3.gz

@vbatts
Contributor

vbatts commented Sep 23, 2014

@kingyueyang You're not on AWS, are you? Because an input/output error like that usually indicates a hardware failure of the disk, not a corrupted filesystem journal.

@stevenschlansker stevenschlansker changed the title file system corruption with devicemapper/aufs file system corruption with devicemapper Sep 23, 2014
@stevenschlansker
Author

@vbatts While in general I agree with you, the sheer number of people reporting I/O errors with very similar symptoms, specifically with Docker and devicemapper, makes me believe that this is not a (virtual or otherwise) hardware problem. The problems only ever seem to affect devicemapper volumes and never the host volume -- you'd expect to run into I/O errors on a 'real' filesystem before noticing them in a devicemapper volume.

@vbatts
Contributor

vbatts commented Sep 23, 2014

@stevenschlansker I'm not disagreeing that there are I/O errors being reported, but we're going to have to draw some distinctions; otherwise two separate issues may get grouped together and make this noisier to resolve.
The last person who chimed in had a different-looking input/output error. That is all I'm saying.

@kingyueyang

@vbatts You are right. My Docker runs on a physical server.

@flyaruu

flyaruu commented Sep 25, 2014

I might be seeing this problem too, on Fedora 20 on Rackspace with an ext4 filesystem. I've tried it on several different (Rackspace) servers, but they all show the same behavior, so I don't think it is hardware related.

My servers tend to survive for a few hours at most, then quite suddenly crash, sometimes with file corruption. The logs don't show anything wrong; the node just dies quite suddenly (state is 'shutoff' in the Rackspace console).

It runs 4 reasonably heavy Java-based Docker containers, and a lot of very short-lived ephemeral containers that do health checks (using docker run -rm).

I've tried upgrading Fedora (3.15.3-200.fc20 to 3.16.2-201.fc20) but that didn't seem to make any difference.

Now I've recreated this setup on CoreOS Beta (with btrfs), and it seems stable.

@Hount

Hount commented Oct 15, 2014

We're having exactly the same issue. Docker can't remove old unused containers and images due to errors in the underlying filesystem, and I can't remove the /var/lib/docker/ folder due to EXT4-fs errors. The host OS /var partition fills to 100% usage, and because Docker runs as root, it really does reach 100% (the root-reserved blocks offer no protection).

Only solution I've found is:

  1. /etc/init.d/docker stop && yum -y erase docker
  2. reboot
  3. rm -rf /var/lib/docker # will work after reboot
  4. touch /forcefsck && reboot

Without the forced fsck, the host OS does not recognize the recently freed space.
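
A scripted sketch of the same recovery, split into two stages at the reboots; the commands and paths are taken from the steps above:

#!/bin/sh
# recover.sh -- run "recover.sh 1", let the machine reboot,
# then run "recover.sh 2" and let it reboot again.
case "$1" in
  1)
    /etc/init.d/docker stop
    yum -y erase docker
    reboot
    ;;
  2)
    rm -rf /var/lib/docker   # works after the reboot
    touch /forcefsck         # force fsck on next boot so freed space is recognized
    reboot
    ;;
esac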

On a fully functional system
AWS Paravirtualized
CentOS release 6.5 (Final)
2.6.32-431.29.2.el6.x86_64
Docker version 1.1.2, build d84a070/1.1.2
Xen version: 3.4.3.amazon (preserve-AD)

[root@hostname ~]# du -sh /var/lib/docker/
du: cannot access `/var/lib/docker/devicemapper/mnt/[long hash]/rootfs/[folders]': Input/output error

Last line repeats n times.

@unclejack
Contributor

Some old kernels have various bugs and problems which might cause such corruption.

The following kernels aren't supported in any way and they should be avoided:
3.8 or older, 3.9, 3.11, 3.13 on non-Ubuntu distributions, 3.15
Custom built kernels which match the versions above aren't supported either.

The latest version of the kernel provided by the distribution should be used and the system should be kept up to date.

If you're on Ubuntu, please make sure that you're running kernel 3.13.0-37 or newer.
If you're on RHEL 6.x/CentOS 6/another RHEL6 derivative, make sure you're running kernel 2.6.32-504 or newer.

When reporting problems, please make sure to provide full output of the following commands: docker info, docker version and uname -a. Please make sure to specify where you've run into this problem (AWS EC2, bare metal, virtual machine on VirtualBox, etc).
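
A sketch that collects exactly that information into one file to attach to a report:

#!/bin/sh
# Gather the requested diagnostics; docker-report.txt is an arbitrary name.
{
  echo '== uname -a ==';       uname -a
  echo '== docker version =='; docker version
  echo '== docker info ==';    docker info
} > docker-report.txt 2>&1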

@gdm85
Contributor

gdm85 commented Nov 6, 2014

If it can be of any help, I have been using devicemapper (thin provisioning, XFS) on a Xen HVM, with bare metal (RAID5) for about 2 weeks and can see zero I/O errors in logs.

I am using kernel 3.13.0-39-generic (current Ubuntu 14.04); I would definitely be interested and concerned if devicemapper had some sort of instability, but generally for these issues I would look first at hardware and then, as @unclejack said, at the kernel.

@vbatts
Contributor

vbatts commented Nov 6, 2014

There was a recent fix for a bug that could cause this kind of corruption; it has gone into the upstream kernel, and most distributions have backported it at this point. The minimum versions @unclejack mentioned include the fix.

@stevenschlansker
Author

Hi @vbatts, can you link the kernel issue here so that users can verify whether their kernel includes the fix?

@vbatts
Contributor

vbatts commented Nov 6, 2014

@rthomas
Contributor

rthomas commented Dec 4, 2014

I believe this is the same issue as #4036

@unclejack
Contributor

@rthomas It's not the same issue. That issue wasn't caused by this problem and it might still be an issue with devicemapper.

@vbatts vbatts changed the title file system corruption with devicemapper file system corruption with devicemapper/aufs Dec 6, 2014
@unclejack
Contributor

devicemapper issues review session with @vbatts

This issue was caused by a kernel bug which is now fixed. Currently supported kernels on CentOS, Debian, RHEL and Ubuntu shouldn't exhibit this problem any more.

We'll make sure to include a note about this particular commit in the documentation around kernel requirements so that everyone is aware of this particular problem.

I'll close this now. Please feel free to comment if you run into something similar and you can already confirm that the fix is included in the kernel you're running.

@aurorafox1

I ran into a similar problem: the logs reported similar errors, and then I found that the available space had run out:
Data Space Used: 100.52 GB
Data Space Total: 100.52 GB
Data Space Available: 0 M

Here is my solution; I hope it helps:

  1. Back up the Docker runtime directory.
  2. Add the following option to the OPTIONS parameter in the /etc/sysconfig/docker configuration file: --storage-opt=dm.loopdatasize=500G (size it according to the actual disk capacity; see the sketch below).
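
A sketch of what step 2 looks like in the file, assuming the CentOS/RHEL layout; the 500G value comes from the comment above and should match your disk:

# /etc/sysconfig/docker
# Grow the devicemapper loopback data file; restart the docker service after editing.
OPTIONS="--storage-opt=dm.loopdatasize=500G"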

@ghost

ghost commented Jul 2, 2015

@unclejack, do you know in which version of the kernel this was fixed?

root@daztladm01:~# uname -a | awk '!($2="")'
Linux  3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@daztladm01:~# cat /etc/lsb-release | grep DESC
DISTRIB_DESCRIPTION="Ubuntu 14.04 LTS"
