Remove container fails on 1.11 due to root filesystem busy when any container mounts host /var/run - regression #21969

Closed
dhiltgen opened this issue Apr 12, 2016 · 24 comments
Labels
kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. priority/P2 Normal priority: default priority applied.

@dhiltgen
Contributor

Something changed between commits c48439a...dd51e85 during 1.11 development: the daemon now fails to remove containers in some circumstances. I haven't yet figured out exactly what is unique about our use case that triggers the failure. Here's what I do know:

  • We've seen it fail primarily on AWS.
  • It fails with aufs and devicemapper (tested with debian AMIs and centos AMIs)
  • I have repro'd on debian running under KVM (but not with boot2docker+machine)
  • Our use-case is a bit complicated and involves a container mounting the docker.sock and spawning a second container with various additional volume/host mounts, which in turn attempts to stop/remove containers that share some of the same volume mounts. It seems some portion of this is required to trigger this failure mode, as running docker stop and docker rm by hand works without failure.

I've been attempting a git bisect on the docker/docker tree to find the exact commit that broke it, but the containerd integration was going through churn during this timeframe, so many commits don't yield a testable setup for me.

Examples from the client's perspective:

DEBU[0000] daemon reported: Error response from daemon: Driver aufs failed to remove root filesystem 25693d520e87d334fedbfd8f1bc31748be35cae690ab3e1f8fad0c79a5ca3946: rename /var/lib/docker/aufs/diff/7aa440b05939346200bec909079a4e280303dfebd6898c665ed119b537f7c3ed /var/lib/docker/aufs/diff/7aa440b05939346200bec909079a4e280303dfebd6898c665ed119b537f7c3ed-removing: device or resource busy 

What you see in the daemon log:

Apr 12 23:28:13 dh-manual-test1 docker[29379]: time="2016-04-12T23:28:13.135946618Z" level=error msg="Error removing mounted layer 25693d520e87d334fedbfd8f1bc31748be35cae690ab3e1f8fad0c79a5ca3946: rename /var/lib/docker/aufs/diff/7aa440b05939346200bec909079a4e280303dfebd6898c665ed119b537f7c3ed /var/lib/docker/aufs/diff/7aa440b05939346200bec909079a4e280303dfebd6898c665ed119b537f7c3ed-removing: device or resource busy"
Apr 12 23:28:13 dh-manual-test1 docker[29379]: time="2016-04-12T23:28:13.136021604Z" level=error msg="Handler for DELETE /containers/25693d520e87d334fedbfd8f1bc31748be35cae690ab3e1f8fad0c79a5ca3946 returned error: Driver aufs failed to remove root filesystem 25693d520e87d334fedbfd8f1bc31748be35cae690ab3e1f8fad0c79a5ca3946: rename /var/lib/docker/aufs/diff/7aa440b05939346200bec909079a4e280303dfebd6898c665ed119b537f7c3ed /var/lib/docker/aufs/diff/7aa440b05939346200bec909079a4e280303dfebd6898c665ed119b537f7c3ed-removing: device or resource busy"

I'll continue my investigation and update this issue as I uncover more details.
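
For anyone triaging the same failure in their own logs, here is a minimal sketch of how the error above could be picked apart. It assumes the message keeps exactly the shape shown in this report for aufs; the wording for other drivers such as devicemapper is not shown here and may differ. The names `BUSY_RE`, `BusyRemoval` and `parse_busy_removal` are hypothetical helpers, not part of Docker.

```python
import re
from typing import NamedTuple, Optional

# Pattern modelled on the aufs error quoted in this issue. Other storage
# drivers may word the failure differently (assumption, not verified).
BUSY_RE = re.compile(
    r"Driver (?P<driver>\w+) failed to remove root filesystem "
    r"(?P<container>[0-9a-f]{64}): rename (?P<src>\S+) (?P<dst>\S+): "
    r"device or resource busy"
)


class BusyRemoval(NamedTuple):
    driver: str       # storage driver named in the message, e.g. "aufs"
    container: str    # full 64-character container ID
    layer_dir: str    # directory the daemon tried to rename


def parse_busy_removal(line: str) -> Optional[BusyRemoval]:
    """Return the parsed failure, or None if the line is a different message."""
    match = BUSY_RE.search(line)
    if match is None:
        return None
    return BusyRemoval(
        match.group("driver"), match.group("container"), match.group("src")
    )
```

Because the pattern is searched rather than anchored, the same function should work on both the client-side `DEBU` line and the daemon's `level=error` lines, as long as the inner message is unchanged.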

@tiborvass tiborvass added this to the 1.11.0 milestone Apr 13, 2016
@vikstrous
Contributor

I think your link is backwards: c48439a...dd51e85

@dhiltgen
Contributor Author

Oops, the git CLI is fine with it reversed, but GitHub is not. Fixed.

@thaJeztah
Member

@vikstrous is this the same as #21704?

@vikstrous
Contributor

@thaJeztah I can't repro #21704 any more, so that makes me suspect that this is not the same. If you guys manage to track this one down, it might shed some light on whether or not they are the same.

@thaJeztah thaJeztah modified the milestones: 1.11.1, 1.11.0 Apr 14, 2016
@thaJeztah thaJeztah added the kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. label Apr 14, 2016
@dhiltgen
Contributor Author

It appears this issue correlates with older kernels. The common theme across all the systems where I've reproduced it is an older kernel; the systems where it doesn't happen run much newer ones.

Failures seen on:

Linux dockerexp 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 (2015-09-19) x86_64 GNU/Linux
Linux dh-manual-test2 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1 (2015-05-24) x86_64 GNU/Linux
Linux dh-manual-test7 3.10.0-229.14.1.el7.x86_64 #1 SMP Tue Sep 15 15:05:51 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

@alexmavr

I'm not sure if this is related, but based on our observations this error appears in the same set of kernels that exhibit this behavior:

docker inspect $(docker run -d -m 15MB --privileged busybox tail -f) | grep \"Memory\" 
WARNING: Your kernel does not support memory limit capabilities. Limitation discarded.
           "Memory": 0,

@pdevine

pdevine commented Apr 15, 2016

I just upgraded my kernel and am seeing this bug constantly.

Linux clone3 3.16.0-70-generic #90~14.04.1-Ubuntu SMP Wed Apr 6 22:56:34 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

@ugurarpaci

Same here

Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux

@justincormack
Contributor

@pdevine @ugurarpaci do you have any way to reproduce this? Is your use case similar to @dhiltgen's, with additional volume mounts? Are you also using aufs (or devicemapper)?

@thaJeztah thaJeztah added the priority/P2 Normal priority: default priority applied. label Apr 18, 2016
@pdevine

pdevine commented Apr 18, 2016

I think @dhiltgen may have the most reproducible work around. It's very intermittent for me.

@dhiltgen dhiltgen changed the title Remove container fails on 1.11 due to root filesystem busy - regression Remove container fails on 1.11 due to root filesystem busy when any container mounts host /var/run - regression Apr 19, 2016
@dhiltgen
Contributor Author

It appears that the scenario I'm hitting is related to /var/run being mounted in any container on these systems. On these older kernels, if any container mounts /var/run from the host, then other containers can't be removed.
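
To check whether any container on a host is in this situation, one could scan `docker inspect` output for bind mounts whose source is `/var/run` or one of its parent directories. The sketch below is only an illustration: it assumes the `Mounts` array with a `Source` field as reported by `docker inspect`, and the function names are hypothetical. A bind of a single file such as `/var/run/docker.sock` is deliberately not flagged, since the observation above is about the directory itself.

```python
import posixpath
from typing import Any, Dict, List


def exposes_host_path(container: Dict[str, Any], host_path: str = "/var/run") -> bool:
    """True if any mount source is host_path itself or one of its ancestors."""
    target = posixpath.normpath(host_path)
    for mount in container.get("Mounts") or []:
        source = posixpath.normpath(mount.get("Source", ""))
        if not source.startswith("/"):
            continue  # skip empty or relative sources
        # source is an ancestor of (or equal to) target when their common
        # path is source itself, e.g. "/var" for "/var/run".
        if posixpath.commonpath([source, target]) == source:
            return True
    return False


def containers_exposing(
    inspect_output: List[Dict[str, Any]], host_path: str = "/var/run"
) -> List[str]:
    """Names of the inspected containers that bind host_path or a parent of it."""
    return [
        c.get("Name", "")
        for c in inspect_output
        if exposes_host_path(c, host_path)
    ]
```

The input would be the parsed JSON from something like `docker inspect $(docker ps -q)`, loaded with `json.loads`.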

@ugurarpaci

ugurarpaci commented Apr 19, 2016

@justincormack the first occurrence of the problem could be related to mounting /var/run, but that is the convention we have been using for months, so I ignored it. I thought it could instead be related to the kernel version, so I migrated the containers to a new VM with an updated kernel (Debian 3.16.7).

I tried to reproduce the problem again and managed to. Here is the case:
When I map /var/run/docker.sock as :ro (read-only) into a container and then rm that container, there seem to be inconsistencies about which instances are running. I had to docker stop and docker rm all the images and restart the daemon running on the VM. After that the state was reset and everything worked.

@anusha-ragunathan
Contributor

Found something related to this issue that got fixed in 3.19 kernels, which could explain why the issue doesn't happen on newer kernels.
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.19.y&id=8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
@ugurarpaci, would you be able to test with a 3.19 kernel?
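
A rough way to sort hosts into "before 3.19" and "3.19 or later" is to compare the leading major.minor of the `uname -r` string. This is a sketch only: the 3.19 threshold is taken from this comment and not independently verified, distribution kernels may carry backported fixes that a version comparison cannot see, and `FIX_KERNEL`, `kernel_version` and `predates_fix` are hypothetical names.

```python
import re
from typing import Tuple

# Threshold suggested in this thread; treat it as an assumption.
FIX_KERNEL = (3, 19)


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '3.16.0-4-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def predates_fix(release: str) -> bool:
    """True if the kernel is older than the version named in this thread."""
    return kernel_version(release) < FIX_KERNEL
```

On a live host the release string can be obtained with `platform.release()` from the standard library, which returns the same value as `uname -r`.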

@ugurarpaci

@anusha-ragunathan For sure, I will try that. In my experience this looks more like a metadata problem. I have tried version 1.11.0 on several kernel versions, from 3.2 up to 3.16 (the latest scenario). The problem occurs randomly: the daemon complains about aufs layers (which, interestingly, do not exist on the filesystem), and that blocks rm and therefore rmi operations. After a restart of the docker daemon, everything works again. I will try to collect more data about my scenario anyway =)

@thaJeztah
Member

@ugurarpaci failures on kernel 3.2 are expected; docker does not run on kernels older than 3.10

@anusha-ragunathan
Contributor

A simple experiment confirms that the 3.19 kernel handles file removal more robustly, which is what fixes the reported issue.

On Debian 8 (which ships with 3.16.0-4-amd64 by default)
$ docker run -d --name test busybox top
$ docker run -d --name test2 -v /var/run:/var/run busybox top
$ docker kill test && docker rm test # rm FAILS with EBUSY

On Ubuntu 15.04 (which ships with 3.19.0-58-generic by default)
$ docker run -d --name test busybox top
$ docker run -d --name test2 -v /var/run:/var/run busybox top
$ docker kill test && docker rm test # rm SUCCEEDS
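
The same four steps can be wrapped in a small script so the experiment is easy to repeat across hosts. This is a sketch that simply shells out to the docker CLI with the commands shown above; `repro_commands` and `rm_succeeds` are hypothetical helper names, and the container names and image are the ones from the experiment.

```python
import subprocess
from typing import List


def repro_commands(
    image: str = "busybox",
    victim: str = "test",
    holder: str = "test2",
    host_dir: str = "/var/run",
) -> List[List[str]]:
    """Build the argv lists for the experiment: two runs, a kill, then the rm."""
    bind = "%s:%s" % (host_dir, host_dir)
    return [
        ["docker", "run", "-d", "--name", victim, image, "top"],
        ["docker", "run", "-d", "--name", holder, "-v", bind, image, "top"],
        ["docker", "kill", victim],
        ["docker", "rm", victim],
    ]


def rm_succeeds(commands: List[List[str]]) -> bool:
    """Run the setup steps, then report whether the final command exited 0."""
    *setup, final = commands
    for argv in setup:
        # Setup must work, otherwise the result of rm tells us nothing.
        subprocess.run(argv, check=True)
    return subprocess.run(final).returncode == 0
```

Calling `rm_succeeds(repro_commands())` on a host with a running daemon would be expected, going by the results above, to return `False` on the Debian 8 kernel and `True` on the Ubuntu 15.04 kernel. The `test2` container is left running afterwards and has to be removed by hand.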

mlaventure added a commit to mlaventure/docker that referenced this issue Apr 22, 2016
This avoid an extra bind mount within /var/run/docker/libcontainerd

This should resolve situations where a container having the host
/var/run bound prevents other containers from being cleanly removed
(e.g. moby#21969).

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
@icecrime icecrime added this to the 1.11.2 milestone Apr 22, 2016
@icecrime icecrime removed this from the 1.11.1 milestone Apr 22, 2016
@pheuter

pheuter commented Apr 22, 2016

Seeing a similar issue, but with a container that did not have any mounts. It had a restart policy of on-failure and a syslog driver. The problem occurred when stopping the container with docker stop and then failing to rm it. Restarting the docker engine fixed the problem and the container was removed.

@cpuguy83
Member

@pheuter That sounds like a different issue.
Can you open a new issue with all the details?

Thanks!

@pheuter

pheuter commented Apr 22, 2016

@cpuguy83 gotcha, will do!

@thaJeztah
Member

@mlaventure should this be resolved by #22256?

@cpuguy83
Member

Yep. Closing since this is resolved now by not mounting the container's rootfs into /var/run
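
To see whether a given daemon still has mounts beneath /var/run, one option is to read the host's mount table from `/proc/self/mountinfo`, where the fifth whitespace-separated field of each line is the mount point. The sketch below is an assumption-laden illustration: `mount_points_under` is a hypothetical helper, it does not decode the octal escapes (such as `\040` for a space) that mountinfo uses, and on hosts where /var/run is a symlink to /run the kernel reports the resolved path, so the prefix may need to be `/run/docker` instead.

```python
from typing import List


def mount_points_under(mountinfo_text: str, prefix: str) -> List[str]:
    """Return mount points equal to prefix or located beneath it."""
    prefix = prefix.rstrip("/")
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a well-formed mountinfo line
        mount_point = fields[4]
        if mount_point == prefix or mount_point.startswith(prefix + "/"):
            found.append(mount_point)
    return found
```

A typical call would be `mount_points_under(open("/proc/self/mountinfo").read(), "/var/run/docker")`, run in the host's mount namespace.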

mlaventure added a commit to mlaventure/docker that referenced this issue Apr 25, 2016
This avoid an extra bind mount within /var/run/docker/libcontainerd

This should resolve situations where a container having the host
/var/run bound prevents other containers from being cleanly removed
(e.g. moby#21969).

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
(cherry picked from commit 3135874)
@mikesimons

@thaJeztah @cpuguy83 Will nested bind mounts still be an issue for paths other than /var/run on kernels < 3.19? I anticipate it will but just want to be clear.

@anusha-ragunathan
Contributor

/var/lib will also be an issue.

@mikesimons

@anusha-ragunathan Thanks for the clarification
