Skip to content
This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

Docker API freezing, journal errors: "failed to create endpoint $container on network bridge: failed to find host side interface veth$id: Link not found" #1731

Closed
ahmetb opened this issue Dec 27, 2016 · 2 comments

Comments

@ahmetb
Copy link

ahmetb commented Dec 27, 2016

Issue Report

Bug

CoreOS Version

NAME=CoreOS
ID=coreos
VERSION=1185.5.0
VERSION_ID=1185.5.0
BUILD_ID=2016-12-07-0937
PRETTY_NAME="CoreOS 1185.5.0 (MoreOS)"

Environment

DigitalOcean VM.

Expected Behavior

I have about 15 systemd timers (and their service units) that all all start the same container with different parameters. It runs a Go program, which would run for a few seconds and exit. I start the containers with docker run --rm in the .service unit file.

I'm expecting to use systemd timers to be able to run cron-like jobs on the docker engine. Actual behavior below.

Reproduction Steps (i.e. my setup)

It's a bit hard for me to give a working repro but I'll go over the use case. (I can provide access to a live repro machine if needed). I have:

  • a .service unit
  • a .timer unit calling the .service every N minutes
  • the service starts a docker container with --rm

Example .service:

$ cat pd-lastfm.service
[Unit]
Description=Scheduled personal dashboard task for 'lastfm'

[Service]
Type=oneshot
ExecStart=/usr/bin/docker run --rm -v /home/core/personal-dashboard/config.toml:/etc/personal-dashboard/config.toml -v /home/core/personal-dashboard/google-credentials.json:/etc/personal-dashboard/google-credentials.json  -e GOOGLE_APPLICATION_CREDENTIALS=/etc/personal-dashboard/google-credentials.json ahmet/personal-dashboard lastfm

Example .timer:

$ cat pd-lastfm.timer
[Unit]
Description=Timer for pd-lastfm.service

[Timer]
OnCalendar=*:0/20
AccuracySec=5s
Persistent=true

[Install]
WantedBy=timers.target

Both the timer and the service are registered/enabled in systemd. It works fine, until it stops working fine after a few runs and freezes the docker API.

Actual Behavior

  1. docker ps (and other commands calling the API) stops responding. It just freezes. When I prepend strace it gets frozen on the docker socket indefinitely:
...
epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4270968064, u64=140569960631552}}) = 0
getsockname(5, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(5, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc820028908, FUTEX_WAKE, 1)      = 1
futex(0x2272d00, FUTEX_WAIT, 0, NULL
  1. docker.service has logs like this in the journal:
Dec 25 05:39:36 core dockerd[14688]: time="2016-12-25T05:39:36.835572926Z" level=info msg="New containerd process, pid: 14711\n"
Dec 25 05:39:38 core dockerd[14688]: time="2016-12-25T05:39:38.052757245Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Dec 25 05:39:38 core dockerd[14688]: time="2016-12-25T05:39:38.069980146Z" level=info msg="Firewalld running: false"
Dec 25 05:39:38 core dockerd[14688]: time="2016-12-25T05:39:38.204778397Z" level=info msg="Removing stale sandbox 86293020c451cbbd4e8e71058e12b0e279ac887b685b1b3ace1284a4b38eedb2 (2a680ff1c4d28d411f0ac2a26868bb2c87b65bf896ec8c48253e680df684c764)"
Dec 25 05:39:38 core dockerd[14688]: time="2016-12-25T05:39:38.241944720Z" level=info msg="Removing stale sandbox b10b0413485a2482788f0177e0590ed6301458c4e35bc5b19dc39047bd7e92ee (5331e2c7f17aab38462602fcdae45c65efc7898856388d62387dded40d13e552)"
Dec 25 05:39:38 core dockerd[14688]: time="2016-12-25T05:39:38.278733320Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Dec 25 05:39:38 core dockerd[14688]: time="2016-12-25T05:39:38.338560342Z" level=info msg="Loading containers: start."
Dec 25 05:39:50 core dockerd[14688]: .....................time="2016-12-25T05:39:50.431136966Z" level=warning msg="error locating sandbox id 86293020c451cbbd4e8e71058e12b0e279ac887b685b1b3ace1284a4b38eedb2: sandbox 86293020c451cbbd4e8e71058e12b0e279ac887b685b1b3ace1284a4b38eedb2 not found"
Dec 25 05:39:50 core dockerd[14688]: time="2016-12-25T05:39:50.433652624Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/2a680ff1c4d28d411f0ac2a26868bb2c87b65bf896ec8c48253e680df684c764/shm: invalid argument"
Dec 25 05:39:50 core dockerd[14688]: time="2016-12-25T05:39:50.435367769Z" level=warning msg="error locating sandbox id b10b0413485a2482788f0177e0590ed6301458c4e35bc5b19dc39047bd7e92ee: sandbox b10b0413485a2482788f0177e0590ed6301458c4e35bc5b19dc39047bd7e92ee not found"
Dec 25 05:39:50 core dockerd[14688]: time="2016-12-25T05:39:50.436346915Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/5331e2c7f17aab38462602fcdae45c65efc7898856388d62387dded40d13e552/shm: invalid argument"
Dec 25 05:39:50 core dockerd[14688]: time="2016-12-25T05:39:50.972693231Z" level=error msg="Failed to start container 5331e2c7f17aab38462602fcdae45c65efc7898856388d62387dded40d13e552: rpc error: code = 6 desc = \"mkdir /run/containerd/5331e2c7f17aab38462602fcdae45c65efc7898856388d62387dded40d13e552: file exists\""
Dec 25 05:39:51 core dockerd[14688]: time="2016-12-25T05:39:51.078254376Z" level=error msg="Failed to start container 2a680ff1c4d28d411f0ac2a26868bb2c87b65bf896ec8c48253e680df684c764: rpc error: code = 6 desc = \"mkdir /run/containerd/2a680ff1c4d28d411f0ac2a26868bb2c87b65bf896ec8c48253e680df684c764: file exists\""
Dec 25 05:39:51 core dockerd[14688]: time="2016-12-25T05:39:51.926941377Z" level=info msg="Loading containers: done."
Dec 25 05:39:51 core dockerd[14688]: time="2016-12-25T05:39:51.928236521Z" level=info msg="Daemon has completed initialization"
Dec 25 05:39:51 core dockerd[14688]: time="2016-12-25T05:39:51.929131351Z" level=info msg="Docker daemon" commit=bac3bae graphdriver=overlay version=1.11.2
Dec 25 05:39:52 core dockerd[14688]: time="2016-12-25T05:39:52.097980019Z" level=info msg="API listen on /var/run/docker.sock"
Dec 25 06:00:04 core dockerd[14688]: time="2016-12-25T06:00:04.304826683Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/be1652bc1c7e77137db6b1ad2b85174d2fb72c1d0b4363778860b82984dbee10/shm: no such file or directory"
Dec 25 06:00:04 core dockerd[14688]: time="2016-12-25T06:00:04.306358422Z" level=error msg="Handler for POST /v1.23/containers/be1652bc1c7e77137db6b1ad2b85174d2fb72c1d0b4363778860b82984dbee10/start returned error: failed to create endpoint mad_turing on network bridge: failed to find host side interface veth6b16d1b: Link not found"
Dec 25 14:30:00 core dockerd[14688]: time="2016-12-25T14:30:00.929993644Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/a44cd5ec2a6f766e508431392b9d4b0709a9eae3a71a44720192e330687a3e74/shm: no such file or directory"
Dec 25 14:30:00 core dockerd[14688]: time="2016-12-25T14:30:00.931014252Z" level=error msg="Handler for POST /v1.23/containers/a44cd5ec2a6f766e508431392b9d4b0709a9eae3a71a44720192e330687a3e74/start returned error: failed to create endpoint tiny_blackwell on network bridge: failed to find host side interface vethad8591e: Link not found"
Dec 26 03:00:01 core dockerd[14688]: time="2016-12-26T03:00:01.190455722Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/726ab059bd227a63d06cb7e2c64b1e50f9999a6db0f5d61dbb29295426ebea72/shm: no such file or directory"
Dec 26 03:00:01 core dockerd[14688]: time="2016-12-26T03:00:01.193196721Z" level=error msg="Handler for POST /v1.23/containers/726ab059bd227a63d06cb7e2c64b1e50f9999a6db0f5d61dbb29295426ebea72/start returned error: failed to create endpoint dreamy_kalam on network bridge: failed to find host side interface vethf862218: Link not found"
Dec 26 06:00:00 core dockerd[14688]: time="2016-12-26T06:00:00.851085977Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/8dd7878a15cbcd6df2911fd540feae6be45de224585d7c7ddf1c1992317b0b75/shm: no such file or directory"
Dec 26 06:00:00 core dockerd[14688]: time="2016-12-26T06:00:00.852241613Z" level=error msg="Handler for POST /v1.23/containers/8dd7878a15cbcd6df2911fd540feae6be45de224585d7c7ddf1c1992317b0b75/start returned error: failed to create endpoint hopeful_mccarthy on network bridge: failed to find host side interface vetha9988bb: Link not found"
Dec 26 18:20:01 core dockerd[14688]: time="2016-12-26T18:20:01Z" level=error msg="containerd: deleting container" error="wait: no child processes"
Dec 27 02:00:02 core dockerd[14688]: time="2016-12-27T02:00:02Z" level=error msg="containerd: deleting container" error="wait: no child processes"
  1. I have about 13 of these timers running docker run --rm ..., some of them are working fine, however, some of them are just getting stuck past their due (possibly correlated with docker API freezing, when one syptom happens, the other one is often there, too). And this happens very often, like multiple times a day. See the "ago" values in "LEFT" column below:
$ systemctl list-timers --all
NEXT                         LEFT          LAST                         PASSED        UNIT                         ACTIVATES
Sun 2016-12-25 14:00:00 UTC  1 day 14h ago Sun 2016-12-25 14:00:04 UTC  1 day 14h ago pd-reporting.timer           pd-reporting.service
Sun 2016-12-25 15:15:00 UTC  1 day 13h ago Sun 2016-12-25 15:15:00 UTC  1 day 13h ago pd-jawbone-caffeine.timer    pd-jawbone-caffeine.service
Mon 2016-12-26 07:59:00 UTC  20h ago       Mon 2016-12-26 07:59:00 UTC  20h ago       pd-twitter-tweets.timer      pd-twitter-tweets.service
Mon 2016-12-26 08:30:00 UTC  19h ago       Mon 2016-12-26 08:30:00 UTC  19h ago       pd-jawbone-heartrate.timer   pd-jawbone-heartrate.service
Mon 2016-12-26 13:45:00 UTC  14h ago       Mon 2016-12-26 13:45:00 UTC  14h ago       pd-jawbone-steps.timer       pd-jawbone-steps.service
Mon 2016-12-26 14:59:00 UTC  13h ago       Mon 2016-12-26 14:59:00 UTC  13h ago       pd-strava.timer              pd-strava.service
Mon 2016-12-26 19:40:00 UTC  8h ago        Mon 2016-12-26 19:40:00 UTC  8h ago        pd-lastfm.timer              pd-lastfm.service
Tue 2016-12-27 04:30:00 UTC  7min left     Tue 2016-12-27 04:15:01 UTC  7min ago      pd-jawbone-sleeps.timer      pd-jawbone-sleeps.service
Tue 2016-12-27 04:40:00 UTC  17min left    Tue 2016-12-27 04:20:00 UTC  2min 11s ago  pd-wakatime.timer            pd-wakatime.service
Tue 2016-12-27 04:59:00 UTC  36min left    Tue 2016-12-27 03:59:00 UTC  23min ago     pd-foursquare.timer          pd-foursquare.service
Tue 2016-12-27 04:59:00 UTC  36min left    Tue 2016-12-27 03:59:00 UTC  23min ago     pd-github.timer              pd-github.service
Tue 2016-12-27 04:59:00 UTC  36min left    Tue 2016-12-27 03:59:00 UTC  23min ago     pd-twitter-followers.timer   pd-twitter-followers.service
Tue 2016-12-27 12:52:33 UTC  8h left       Tue 2016-12-27 00:52:33 UTC  3h 29min ago  rkt-gc.timer                 rkt-gc.service
Wed 2016-12-28 00:00:00 UTC  19h left      Tue 2016-12-27 00:00:00 UTC  4h 22min ago  logrotate.timer              logrotate.service
Wed 2016-12-28 01:07:23 UTC  20h left      Tue 2016-12-27 01:07:23 UTC  3h 14min ago  systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
n/a                          n/a           n/a                          n/a           update-engine-stub.timer     update-engine-stub.service
  1. I do ps aux | grep docker, I see the list of commands run by my frozen timers:
core@core ~ $ ps aux | grep docker
root      3193  0.0  4.8 157868 24724 ?        Ssl  Dec26   0:00 /usr/bin/docker run --rm -v /home/core/personal-dashboard/config.toml:/etc/personal-dashboard/config.toml -v /home/core/personal-dashboard/google-credentials.json:/etc/personal-dashboard/google-credentials.json -e GOOGLE_APPLICATION_CREDENTIALS=/etc/personal-dashboard/google-credentials.json ahmet/personal-dashboard twitter tweets
root      3951  0.0  4.9 157868 24864 ?        Ssl  Dec25   0:00 /usr/bin/docker run --rm -v /home/core/personal-dashboard/config.toml:/etc/personal-dashboard/config.toml -v /home/core/personal-dashboard/google-credentials.json:/etc/personal-dashboard/google-credentials.json -e GOOGLE_APPLICATION_CREDENTIALS=/etc/personal-dashboard/google-credentials.json ahmet/personal-dashboard reporting
root      4230  0.0  4.8 190652 24464 ?        Ssl  Dec26   0:00 /usr/bin/docker run --rm -v /home/core/personal-dashboard/config.toml:/etc/personal-dashboard/config.toml -v /home/core/personal-dashboard/google-credentials.json:/etc/personal-dashboard/google-credentials.json -e GOOGLE_APPLICATION_CREDENTIALS=/etc/personal-dashboard/google-credentials.json ahmet/personal-dashboard jawbone heartrate
core      4355  0.0  0.2   6756  1020 pts/0    S+   04:24   0:00 grep --colour=auto docker
root      6368  0.0  4.9 157868 24844 ?        Ssl  Dec25   0:00 /usr/bin/docker run --rm -v /home/core/personal-dashboard/config.toml:/etc/personal-dashboard/config.toml -v /home/core/personal-dashboard/google-credentials.json:/etc/personal-dashboard/google-credentials.json -e GOOGLE_APPLICATION_CREDENTIALS=/etc/personal-dashboard/google-credentials.json ahmet/personal-dashboard jawbone caffeine
root     14688  0.0  9.7 788284 48992 ?        Ssl  Dec25   2:35 docker daemon --host=fd:// --selinux-enabled
root     14711  0.0  1.4 427724  7360 ?        Ssl  Dec25   0:29 containerd -l /var/run/docker/libcontainerd/docker-containerd.sock --runtime runc --start-timeout 2m
root     14813  0.0  0.7 198952  3544 ?        Sl   Dec25   0:01 containerd-shim eb52ce5c4c3cb62530f7086df9f06278da75ed2f2d37c15dfaefd0803caad556 /var/run/docker/libcontainerd/eb52ce5c4c3cb62530f7086df9f06278da75ed2f2d37c15dfaefd0803caad556 runc
root     14819  0.0  4.5 157868 23128 ?        Sl   Dec25   0:00 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.17.0.3 -container-port 80
root     14844  0.0  0.6 198952  3532 ?        Sl   Dec25   0:01 containerd-shim 5efa19013c30136a1d2f3a71b1e31907e7f166af2f1cdf09978db60d33220b5e /var/run/docker/libcontainerd/5efa19013c30136a1d2f3a71b1e31907e7f166af2f1cdf09978db60d33220b5e runc
root     17366  0.0  4.9 190652 24748 ?        Ssl  Dec26   0:00 /usr/bin/docker run --rm -v /home/core/personal-dashboard/config.toml:/etc/personal-dashboard/config.toml -v /home/core/personal-dashboard/google-credentials.json:/etc/personal-dashboard/google-credentials.json -e GOOGLE_APPLICATION_CREDENTIALS=/etc/personal-dashboard/google-credentials.json ahmet/personal-dashboard jawbone steps
root     19853  0.0  4.8 157868 24716 ?        Ssl  Dec26   0:00 /usr/bin/docker run --rm -v /home/core/personal-dashboard/config.toml:/etc/personal-dashboard/config.toml -v /home/core/personal-dashboard/google-credentials.json:/etc/personal-dashboard/google-credentials.json -e GOOGLE_APPLICATION_CREDENTIALS=/etc/personal-dashboard/google-credentials.json ahmet/personal-dashboard strava
root     26166  0.0  4.9 157868 24804 ?        Ssl  Dec26   0:00 /usr/bin/docker run --rm -v /home/core/personal-dashboard/config.toml:/etc/personal-dashboard/config.toml -v /home/core/personal-dashboard/google-credentials.json:/etc/personal-dashboard/google-credentials.json -e GOOGLE_APPLICATION_CREDENTIALS=/etc/personal-dashboard/google-credentials.json ahmet/personal-dashboard lastfm
  1. htop also shows the frozen /usr/bin/docker (client) processes. Also, my docker program's entrypoint does not appear in the top output, so I'm guessing my app is not responsible for the issue.

image

  1. journal logs of a .service whose /usr/bin/docker entrypoint has the last log line "Starting" from systemd (see when the timer ran it at 19:40 below):
$ journalctl -fu pd-lastfm.service
-- Logs begin at Thu 2016-11-03 16:24:07 UTC. --
Dec 26 19:00:01 core docker[25558]: task=lastfm git=v1.0-33-g08bf33d time=2016-12-26T19:00:01Z msg="parsed response" songs=0
Dec 26 19:00:01 core docker[25558]: task=lastfm git=v1.0-33-g08bf33d time=2016-12-26T19:00:01Z msg="saved measurement"
Dec 26 19:00:02 core systemd[1]: Started Scheduled personal dashboard task for 'lastfm'.
Dec 26 19:20:00 core systemd[1]: Starting Scheduled personal dashboard task for 'lastfm'...
Dec 26 19:20:01 core docker[25907]: task=lastfm git=v1.0-33-g08bf33d time=2016-12-26T19:20:01Z msg=starting
Dec 26 19:20:01 core docker[25907]: task=lastfm git=v1.0-33-g08bf33d time=2016-12-26T19:20:01Z msg="querying last.fm API"
Dec 26 19:20:01 core docker[25907]: task=lastfm git=v1.0-33-g08bf33d time=2016-12-26T19:20:01Z msg="parsed response" songs=2
Dec 26 19:20:01 core docker[25907]: task=lastfm git=v1.0-33-g08bf33d time=2016-12-26T19:20:01Z msg="saved measurement"
Dec 26 19:20:01 core systemd[1]: Started Scheduled personal dashboard task for 'lastfm'.
Dec 26 19:40:00 core systemd[1]: Starting Scheduled personal dashboard task for 'lastfm'...
  1. machine resource stats seem normal, no indication of high memory/load, plus other timers keep working, however certain ones are stuck. so the systemd timer can't kill/reschedule the job.
@ahmetb
Copy link
Author

ahmetb commented Jan 2, 2017

I think it comes down to a docker issue, I reported it here: moby/moby#13885 (comment)

I see a ton of netlink-related goroutines in docker daemon dump:

goroutine 1149 [chan send, 1070 minutes]:
github.com/vishvananda/netlink.LinkSubscribe.func2(0xc821159d40, 0xc820fa4c60)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:898 +0x2de
created by github.com/vishvananda/netlink.LinkSubscribe
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:901 +0x107

goroutine 442 [chan send, 1129 minutes]:
github.com/vishvananda/netlink.LinkSubscribe.func2(0xc8211eb380, 0xc821095920)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:898 +0x2de
created by github.com/vishvananda/netlink.LinkSubscribe
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:901 +0x107

I think that's the issue.

@ahmetb
Copy link
Author

ahmetb commented Jan 9, 2017

It looks like 1235.4.0 stable fixed this.

@ahmetb ahmetb closed this as completed Jan 9, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant