file system corruption with devicemapper/aufs #7229
Comments
Thanks, creating a ticket was on my to-do list but I haven't had the time to finesse a reproducible non-private set-up. BTW, switching to btrfs was the ticket for me in terms of resolution. Have you tried that or have you seen issues with btrfs too?
I haven't tried btrfs. We are already drowning in experimental code, and trying to convince everyone to now add a beta-quality filesystem may be a stretch. We also run in AWS, which means that provisioning btrfs also adds some operational concerns (right now filesystems are automagically created as ext4, and there is no natural time to change the filesystem type of ephemeral storage without blocking the boot).
The problems you're running into aren't representative of Docker; otherwise there would be dozens of issues like this one, and this issue would have hundreds of comments. Please make sure you're running the latest version of the kernel packages provided by your Linux distribution's repositories. They usually provide fixes for severe bugs, and keeping your kernel packages up to date is a good idea. File system corruption and problems with devicemapper are usually a kernel problem. If you can reproduce this on a newly installed instance, can you please provide some steps to reproduce it?
@unclejack: I agree it is possible that this is a kernel bug; however, given that Docker relies heavily on these storage backends in a way that few (if any?) other applications do, it is representative of Docker. Doubly so because these are the default backends, and switching away (to btrfs) is not something most users can be expected to do. Unfortunately, I have no deterministic way to reproduce the issue. The problem only happens sometimes, after hundreds of container start/stop cycles, and our specific containers have proprietary code / data that I cannot share. @tve, could you share more about the setup that you ran into this issue on?
I have reproduced this problem again on
@stevenschlansker and this is all on AWS? And you said you see this with AUFS too?
@vbatts AWS, yes. After seeing how easy it is for Docker to revert to devicemapper instead of AUFS (if the kernel module fails to load for any reason, it will silently fall back) I'm not as convinced that it affects AUFS as well, but it definitely affects devicemapper.
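Since the fallback is silent, it's worth confirming which backend a host is actually using before attributing errors to it. A minimal sketch: `docker info` is the real command, but the `extract_driver` helper name is mine, purely illustrative.

```shell
# extract_driver: pull the storage backend name out of `docker info` output.
extract_driver() {
  grep -i 'storage driver' | sed 's/.*:[[:space:]]*//'
}

# On a live host you would run:
#   docker info | extract_driver    # e.g. prints "devicemapper" or "aufs"
```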
I'm seeing a conversation around a commit for Linux 3.9 that addresses a similar issue: https://groups.google.com/forum/#!msg/linux.kernel/MCs6mF51om8/9KiGcQTEjWYJ I am inclined to think this is an ext4-related issue, which could affect aufs too, and which would explain why btrfs seems exempt from this issue. But why AWS seems to trigger the issue for you is what puzzles me.
I can't promise that it's AWS-related. That is the only environment in which I am running Docker under heavy use. Maybe @tve, who also reported this issue, could describe his environment more?
Also AWS. Please refer to https://groups.google.com/forum/#!topic/docker-dev/Xzwm5GRYCLo for additional details. I'm happy to provide more. I will try to repro next week (at least it's on my to-do list) without proprietary software.
@vbatts I've managed to reproduce this with kernel 3.15 outside of AWS. I need to try to reproduce this on bare metal.
I am not sure if this will help or not, but I am having a similar issue, posted with dmesg output at https://groups.google.com/forum/#!topic/docker-user/mcGLOta9ric I had a similar problem before; that time I removed all my containers, uninstalled Docker, and reinstalled. Is there any possible way to recover from this issue without losing my containers?
We are having similar issues on AWS, Ubuntu 14.04 x64, with kernel 3.13.0-24-generic. We're using an ext4 fs. This only seems to happen after we launch 40-50 containers on an instance.
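One way to hunt for this after a batch of container launches is to grep the kernel log for the ext4 and buffer I/O messages quoted in this thread. A sketch: the `fs_errors` helper name is mine, and the repro loop assumes a `busybox` image is available.

```shell
# fs_errors: filter kernel log lines for the EXT4-fs / buffer I/O messages
# reported in this issue.
fs_errors() {
  grep -iE 'ext4-fs (error|warning)|buffer i/o error'
}

# Hypothetical reproduction on a live host: cycle many short-lived
# containers, then check the kernel log.
#   for i in $(seq 1 50); do docker run --rm busybox true; done
#   dmesg | fs_errors
```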
I'm also having this issue, on CentOS 6.5 (kernel 2.6.32-431.el6.x86_64) with Docker 1.1.2, using ext4 too. I will rm -rf /var/lib/docker/devicemapper/
#ll devicemapper/mnt/163338ae91b205330bc8930d94ca5c8e134effbaeb004b10eec864db4af61066/rootfs/usr/share/man/man3
@kingyueyang you're not on AWS, are you? Because an input/output error like that usually indicates hardware failure of the disk, not that the journal of the filesystem is corrupted.
@vbatts while in general I agree with you, the sheer number of people reporting specifically I/O errors with very similar symptoms with docker and devicemapper makes me believe that this is not a (virtual or otherwise) hardware problem. The problems only ever seem to affect devicemapper volumes and never the host volume -- you'd expect to run into I/O errors affecting a 'real' filesystem before noticing it in a devicemapper volume.
@stevenschlansker I'm not disagreeing that there are I/O errors being reported, but we're going to have to draw some distinctions; otherwise two issues may get grouped together and make things noisier to try to resolve.
@vbatts You are right. My Docker runs on a physical server.
I might be seeing this problem too, on Fedora 20 on Rackspace with an ext4 filesystem. I've tried it on several different (Rackspace) servers, but they seem to show the same behavior, so I don't think it is hardware-related. My servers tend to survive for a few hours at most, then quite suddenly crash, sometimes with file corruption. The logs don't show anything wrong; the node seems to die quite suddenly (state is 'shutoff' in the Rackspace console). It runs 4 reasonably heavy Java-based docker containers, and a lot of very short-lived ephemeral containers that do health checks (using docker run -rm). I've tried upgrading Fedora (3.15.3-200.fc20 to 3.16.2-201.fc20) but that didn't seem to make any difference. Now I've recreated this setup on CoreOS Beta (with btrfs), and it seems stable.
We're having exactly the same issue. Docker can't remove old unused containers and images due to errors in the underlying filesystem, and I can't remove the /var/lib/docker/ folder due to EXT4-FS errors. The host OS /var partition gets filled to 100% usage, and because docker is run by root, it gets 100% full (no use for root-reserved blocks). The only solution I've found is:
Without forcefsck, the host OS does not recognize the recently freed space. On a fully functional system: [root@hostname ~]# du -sh /var/lib/docker/ (last line repeats n times).
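To confirm whether space actually comes back after cleanup, comparing a before/after measurement of Docker's state directory helps. A small sketch: the `dir_usage` helper name is mine, with the path taken from the report above.

```shell
# dir_usage: print the human-readable size of a directory
# (defaults to Docker's state directory).
dir_usage() {
  du -sh "${1:-/var/lib/docker}" 2>/dev/null | awk '{print $1}'
}

# e.g.  dir_usage              # size of /var/lib/docker
#       dir_usage /var/log    # size of any other directory
```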
Some old kernels have various bugs and problems which might cause such corruption. The following kernels aren't supported in any way and should be avoided: The latest version of the kernel provided by the distribution should be used, and the system should be kept up to date. If you're on Ubuntu, please make sure that you're running kernel 3.13.0-37 or newer. When reporting problems, please make sure to provide the full output of the following commands:
If it can be of any help, I have been using devicemapper (thin provisioning, XFS) on a Xen HVM, on bare metal (RAID5), for about 2 weeks and can see zero I/O errors in the logs. I am using kernel 3.13.0-39-generic (current Ubuntu 14.04). I would definitely be interested and concerned if devmapper had some sort of instability, but generally for these issues I would look at hardware first and then, as @unclejack said, at the kernel.
There was a recent fix that has gone into the upstream kernel, and most
Hi @vbatts, can you link the kernel issue here, so that users can verify whether their kernel includes the fix?
thread: http://thread.gmane.org/gmane.linux.kernel/1720729
I believe this is the same issue as #4036
@rthomas It's not the same issue. That issue wasn't caused by this problem, and it might still be an issue with devicemapper.
devicemapper issues review session with @vbatts: This issue was caused by a kernel bug which is now fixed. Currently supported kernels on CentOS, Debian, RHEL and Ubuntu shouldn't exhibit this problem any more. We'll make sure to include a note about this particular commit in the documentation around kernel requirements so that everyone is aware of this particular problem. I'll close this now. Please feel free to comment if you run into something similar and you can confirm that the fix is included in the kernel you're running.
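For anyone double-checking their host, a quick way to compare the running kernel against a minimum version is to use version sort. A sketch: the `kernel_at_least` helper is mine and relies on GNU `sort -V`; the 3.13.0-37 threshold for Ubuntu comes from earlier in this thread.

```shell
# kernel_at_least: succeed if version $1 is >= version $2.
# Relies on GNU sort's -V (version sort) option.
kernel_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# On a live host you would run:
#   kernel_at_least "$(uname -r)" 3.13.0-37 && echo "fix likely included"
```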
I'm also seeing a similar problem: the logs report similar errors, and then I found that the free space had run out. Below is my solution; I hope it helps you:
@unclejack, do you know in which version of the kernel this was fixed?
I am filing this issue to try to get a single place to document what I see as an extremely severe issue: when using Docker, sometimes your container filesystems and/or files become seriously corrupted.
There are at least two mailing list threads pointing to the problem:
https://groups.google.com/forum/#!topic/docker-dev/Xzwm5GRYCLo
https://groups.google.com/forum/#!topic/docker-user/35Z0-g8sObY
And one previously filed issue (now closed): #6368
The problem manifests itself in a few different ways. It is unclear to me if these are all related.
Known symptoms:
- Files in container filesystems silently change contents (e.g. docker run myimage md5sum /some/file returns different results on different hosts)
- EXT4-fs error (device dm-2): ext4_lookup:1437: inode #132310: comm docker: deleted inode referenced: 134547
- [271394.160211] EXT4-fs warning (device dm-10): ext4_end_bio:317: I/O error writing to inode 402137 (offset 0 size 0 starting block 169271)
  [271394.160214] Buffer I/O error on device dm-10, logical block 169271
- htree_dirblock_to_tree: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
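The first symptom can be checked directly: the same file should hash identically on every host, and differing digests indicate corruption. A sketch: the `hash_file` helper name is mine, and `myimage` / `/some/file` are placeholders from the report above.

```shell
# hash_file: print just the md5 digest of a file.
hash_file() {
  md5sum "$1" | awk '{print $1}'
}

# Across hosts you would compare the output of:
#   docker run --rm myimage md5sum /some/file
```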
So far this has only been reported against the devicemapper and aufs backends, not btrfs.
It is possible that this is a kernel bug but since it is so disastrous to Docker workloads I believe it deserves a bug here regardless of the root cause.
I have verified that this happens against Ubuntu 14.04 (kernel 3.13.0-24-generic) with Docker 1.0.0.