
live-restore in combination with docker compose is broken with docker-ce version 20.10.19 and newer #44422

Closed
fichte opened this issue Nov 7, 2022 · 17 comments · Fixed by #45754
Labels
area/volumes, kind/bug, status/0-triage, version/20.10

Comments


fichte commented Nov 7, 2022

Description

Hello,

We rely heavily on the live-restore feature in our production environments and are running into automated deployment problems in our CI. In our environment, all operating system packages are updated automatically. When the Docker engine gets updated and someone then restarts a container in a running Compose stack, a subsequent docker compose down -v --remove-orphans no longer works.

Update:
This happens specifically with named volumes that use the local driver with a bind-mounted directory.

docker-compose.yml example

version: '2.4'
services:
  app:
    image: appimage
    networks:
      app-network:
        aliases:
          - app
    restart: "unless-stopped"
    volumes:
      - log-app:/log
     
volumes:
  log-app:
    driver: local
    driver_opts:
      type: none
      device: /srv/docker/stack/prod/log/app
      o: bind

Operating System: Debian GNU/Linux 11.5 (Bullseye)

Reproduce

root@host:/srv/docker/stack/prod# su - dockerdeploy
dockerdeploy@host:~# cd /srv/docker/stack/prod
dockerdeploy@host:/srv/docker/stack/prod# docker compose up -d
dockerdeploy@host:/srv/docker/stack/prod# exit
root@host:/srv/docker/stack/prod# systemctl restart docker.service
root@host:/srv/docker/stack/prod# su - dockerdeploy
dockerdeploy@host:~# cd /srv/docker/stack/prod
dockerdeploy@host:/srv/docker/stack/prod$ docker compose restart app
[+] Running 1/1
 ⠿ Container app  Started
dockerdeploy@host:/srv/docker/stack/prod$ docker compose down -v --remove-orphans
[+] Running 21/37
.....
Error response from daemon: remove prod_log-app: volume has active mounts

Expected behavior

docker compose down -v --remove-orphans should reliably tear down the stack

docker version

Client: Docker Engine - Community
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:02:28 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:00:19 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)

Server:
 Containers: 34
  Running: 34
  Paused: 0
  Stopped: 0
 Images: 28
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.0-19-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 60
 Total Memory: 157.2GiB
 Name: prod
 ID: XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true

Additional Info

Docker Compose version v2.12.2

daemon.json

{
  "experimental": true,
  "fixed-cidr-v6": "fd00:dead:beef:c0::/80",
  "ipv6": true,
  "ip6tables": true,
  "live-restore": true,
  "mtu": 1400
}
fichte added the kind/bug and status/0-triage labels on Nov 7, 2022

cpuguy83 commented Nov 9, 2022

This is most likely related to this change 66ddb7f (#44238)

Poking around, I don't really see anything wrong in the code.
Are you sure something else doesn't have the volume mounted?

It seems like you were relying on a live-restore bug that has since been fixed, one that allowed volumes to be removed even though they were still being referenced.


fichte commented Nov 10, 2022

As I mentioned in my detailed report, version 20.10.18 was fine. I have updated the report with a sample compose file; with it you can easily reproduce this behaviour. As a workaround, we no longer use named volumes with bind-mounted directories.


bsousaa commented Feb 23, 2023

Can you still reproduce on 23.0?

@solidgoldbomb

I can confirm that this is still an issue on 24.0.2. Reproduced with this setup.

Versions

$ docker version
Client: Docker Engine - Community
 Version:           24.0.2
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        cb74dfc
 Built:             Thu May 25 21:52:22 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       659604f
  Built:            Thu May 25 21:52:22 2023
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.6.20
  GitCommit:        2806fc1057397dbaeefbea0e4e17bddfbd388f38
 runc:
  Version:          1.1.5
  GitCommit:        v1.1.5-0-gf19387a
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Docker compose version

$ docker compose version
Docker Compose version v2.18.1

OS

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
Release:	20.04
Codename:	focal

Compose file

$ cat docker-compose.yaml 
services:
  app:
    image: ubuntu:focal
    network_mode: none
    command: "sleep infinity"
    volumes:
      - trouble:/trouble

volumes:
  trouble:
    driver_opts:
      type: none
      device: ${PWD}
      o: bind

Start up the stack, auto-creating the volume

$ docker compose up -d 
[+] Building 0.0s (0/0)                                                         
[+] Running 2/2
 ✔ Volume "moby-44422_trouble"  Created                                    0.0s 
 ✔ Container moby-44422-app-1   Started                                    0.4s 

$ docker compose ps
NAME                IMAGE               COMMAND             SERVICE             CREATED             STATUS              PORTS
moby-44422-app-1    ubuntu:focal        "sleep infinity"    app                 12 seconds ago      Up 11 seconds       

Restart the Docker daemon. Note that this bug occurs across any daemon restart; updating the Docker engine is just one possible trigger.

$ sudo systemctl restart docker

Bring down the container stack, leaving the volumes so we can see their state before trying to delete them. Note that running docker compose down -v here also triggers the same error described in the original bug report.

$ docker compose down
[+] Running 1/1
 ✔ Container moby-44422-app-1  Removed                                    10.2s 

Current status of the volume

$ docker volume inspect moby-44422_trouble 
[
    {
        "CreatedAt": "2023-06-13T22:38:59Z",
        "Driver": "local",
        "Labels": {
            "com.docker.compose.project": "moby-44422",
            "com.docker.compose.version": "2.18.1",
            "com.docker.compose.volume": "trouble"
        },
        "Mountpoint": "/var/lib/docker/volumes/moby-44422_trouble/_data",
        "Name": "moby-44422_trouble",
        "Options": {
            "device": "/home/XXXXXXXXX/moby-44422",
            "o": "bind",
            "type": "none"
        },
        "Scope": "local"
    }
]

Now try to delete the volume

$ docker volume rm moby-44422_trouble 
Error response from daemon: remove moby-44422_trouble: volume has active mounts

This failure should not happen. It appears to be a bug in the Docker daemon's internal reference counting.

Restart the Docker daemon again to try to clear the reference-counting error.

$ sudo systemctl restart docker

Now notice that we can delete the volume with no error.

$ docker volume rm moby-44422_trouble 
moby-44422_trouble

@cpuguy83

I am unable to reproduce the issue, with or without live-restore.

@solidgoldbomb

This is 100% reproducible for me. I've hit this issue across many versions of Docker and on many similarly configured Ubuntu systems. It bites me every time the Docker daemon is restarted under my running stack and I then try to bring the stack down with docker compose down -v.

I also notice that I have similar settings to @fichte in my /etc/docker/daemon.json. Here's mine:

$ cat /etc/docker/daemon.json
{
		"experimental": true,
		"ip6tables": true,
		"live-restore": true,
		"ipv6": true,
		"fixed-cidr-v6": "fd00::/80"
}

@cpuguy83, did you run your test with the same config? Maybe experimental or the various IPv6 options somehow make a difference to the outcome for volumes.

@solidgoldbomb

I just ran a few tests while reducing my Docker daemon config file. I can 100% reproduce this with only the following in my config file:

$ cat /etc/docker/daemon.json
{
		"live-restore": true
}

@cpuguy83

I am running a dockerd from the dev branch.
I'll try this tomorrow on a released version.

@cpuguy83

Also, it occurs to me now that this error is coming from the (local) volume driver, which refcounts mounts, and not from the volume service, which refcounts the objects referencing a volume.
Originally I thought the error came from the latter, but that is incorrect.
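
A rough sketch of those two layers, with made-up names just to illustrate where the "volume has active mounts" check lives (this is not moby's real code):

package main

import (
	"errors"
	"fmt"
)

// Illustrative types only, not the daemon's actual ones.
// volumeService refcounts the objects (containers) referencing a volume;
// localDriver refcounts how many times the volume is actually mounted.
type volumeService struct{ containerRefs int }
type localDriver struct{ activeMounts int }

func removeVolume(svc *volumeService, drv *localDriver) error {
	if svc.containerRefs > 0 {
		return fmt.Errorf("volume is still referenced by %d container(s)", svc.containerRefs)
	}
	if drv.activeMounts > 0 {
		// This is the layer the error in this issue comes from.
		return errors.New("volume has active mounts")
	}
	return nil
}

func main() {
	// No container references the volume any more, but the driver's
	// mount count is out of sync, so removal still fails.
	fmt.Println(removeVolume(&volumeService{}, &localDriver{activeMounts: 1}))
}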

@cpuguy83

Looking through the code a bit, if I had to guess this is the cause:

moby/daemon/start.go

Lines 253 to 257 in a978888

if container.BaseFS != "" {
    if err := container.UnmountVolumes(daemon.LogVolumeEvent); err != nil {
        logrus.Warnf("%s cleanup: Failed to umount volumes: %v", container.ID, err)
    }
}

Or, more specifically, that under live-restore BaseFS is not set.

@cpuguy83

And sure enough, I can reproduce this with the graphdriver backend; it does not reproduce with the containerd snapshotters.

@cpuguy83

Even worse, it is calling container.UnmountVolumes, so somewhere the count is simply getting out of sync.

@cpuguy83

So this is happening because the active mount count is not initialized for live-restored containers.
When unmount is called, it decrements the count, but the count is a uint64, so decrementing it past zero wraps around to a huge value.
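
A minimal, self-contained sketch of that failure mode (illustrative names only, not the daemon's actual types):

package main

import "fmt"

// mountRef mimics a per-volume active-mount counter kept as a uint64.
type mountRef struct {
	active uint64
}

func (m *mountRef) unmount() {
	// For a live-restored container the count was never re-initialized,
	// so it is still 0 here; decrementing wraps around to the maximum value.
	m.active--
}

func main() {
	v := &mountRef{}           // count never incremented after the daemon restart
	v.unmount()                // stopping/removing the container decrements anyway
	fmt.Println(v.active)      // 18446744073709551615
	fmt.Println(v.active == 0) // false -> looks like "volume has active mounts"
}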

@cpuguy83

It also seems something on master has changed since I last built it, and it is now broken with the containerd snapshotters as well.
I'm not sure what that would be exactly.

@solidgoldbomb

Thank you very much for digging into this issue.

@cpuguy83

I have a working patch for this. It needs tests, and I'd like to bisect to find exactly which commit caused the issue, so it will take a bit of time to get this in, but I at least see what's happening.

@cpuguy83

Draft PR is here: #45754
