Swarm is having occasional network connection problems between nodes. #32195

Closed
darklow opened this issue Mar 29, 2017 · 250 comments

Comments

@darklow

darklow commented Mar 29, 2017

A few times a day I have connection issues between nodes, and clients see an occasional "Bad request" error. My swarm setup (AWS) has the following services: nginx (global) and web (replicated=2), plus a separate overlay network. In nginx.conf I use proxy_pass http://web:5000 to route requests to the web service. Both services are running and marked as healthy, and they haven't been restarted while these errors occur. The manager is a separate node (30sec-manager1).

A few times a day, for a few requests, I receive errors saying that nginx couldn't connect to the upstream, and the 10.0.0.6 IP address is always mentioned:

Here are the related nginx and docker logs. Both web replicas run on the 30sec-worker3 and 30sec-worker4 nodes.

Nginx log:
----------
2017/03/29 07:13:18 [error] 7#7: *44944 connect() failed (113: Host is unreachable) while connecting to upstream, client: 104.154.58.95, server: 30seconds.com, request: "GET / HTTP/1.1", upstream: "http://10.0.0.6:5000/", host: "30seconds.com"

Around same time from docker logs (journalctl -u docker.service)

on node 30sec-manager1:
---------------------------
Mar 29 07:12:50 30sec-manager1 docker[30365]: time="2017-03-29T07:12:50.736935344Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54.659229055Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:10 30sec-manager1 docker[30365]: time="2017-03-29T07:13:10.302960985Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:11 30sec-manager1 docker[30365]: time="2017-03-29T07:13:11.055187819Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:17 30sec-manager1 docker[30365]: time="2017-03-29T07:13:17Z" level=info msg="Firewalld running: false"

on node 30sec-worker3:
-------------------------
Mar 29 07:12:50 30sec-worker3 docker[30362]: time="2017-03-29T07:12:50.613402284Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:12:55 30sec-worker3 docker[30362]: time="2017-03-29T07:12:55.614174704Z" level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:09 30sec-worker3 docker[30362]: time="2017-03-29T07:13:09.613368306Z" level=info msg="memberlist: Suspect 30sec-worker4-4ca6b1dcaa42 has failed, no acks received"
Mar 29 07:13:10 30sec-worker3 docker[30362]: time="2017-03-29T07:13:10.613972658Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:13:11 30sec-worker3 docker[30362]: time="2017-03-29T07:13:11.042788976Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:14 30sec-worker3 docker[30362]: time="2017-03-29T07:13:14.613951134Z" level=info msg="memberlist: Marking 30sec-worker4-4ca6b1dcaa42 as failed, suspect timeout reached"
Mar 29 07:13:25 30sec-worker3 docker[30362]: time="2017-03-29T07:13:25.615128313Z" level=error msg="Bulk sync to node 30sec-manager1-b1cbc10665cc timed out"

on node 30sec-worker4:
-------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"

syslog on 30sec-worker4:
--------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.048975] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.100691] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.130069] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.155859] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.180461] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.205707] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.230326] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.255597] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"]
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"]

I checked other cases when nginx can't reach the upstream, and these three lines always appear in the docker logs around those times:

level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker3-054c94d39b58)"

Searching other issues, I found that these have similar errors, so they may be related:
#28843
#25325

Is there anything I should check or debug further to spot the problem, or is it a bug?
Thank you.

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 18
 Running: 3
 Paused: 0
 Stopped: 15
Images: 16
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 83
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: ck99cyhgydt8y1zn8ik2xmcdv
 Is Manager: true
 ClusterID: in0q54eh74ljazrprt0vza3wj
 Managers: 1
 Nodes: 5
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.31.31.146
 Manager Addresses:
  172.31.31.146:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 990.6 MiB
Name: 30sec-manager1
ID: 5IIF:RONB:Y27Q:5MKX:ENEE:HZWM:XYBV:O6KN:BKL6:AEUK:2VKB:MO5P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 provider=amazonec2
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
Amazon AWS (Manager - t2.micro, rest of nodes - t2.small)

docker-compose.yml (There are more services and nodes in setup, but I posted only involved ones)

version: "3"

services:

  nginx:
    image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/nginx:latest
    ports:
      - 80:80
      - 81:81
    networks:
      - thirtysec
    depends_on:
      - web
    deploy:
      mode: global
      update_config:
        delay: 2s
        monitor: 2s

  web:
    image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/os:latest
    command: sh -c "python manage.py collectstatic --noinput && daphne thirtysec.asgi:channel_layer -b 0.0.0.0 -p 5000"
    ports:
      - 5000:5000
    networks:
      - thirtysec
    deploy:
      mode: replicated
      replicas: 2
      labels: [APP=THIRTYSEC]
      update_config:
        delay: 15s
        monitor: 15s
      placement:
        constraints: [node.labels.aws_type == t2.small]

    healthcheck:
      test: goss -g deploy/swarm/checks/web-goss.yaml validate
      interval: 2s
      timeout: 3s
      retries: 15

networks:
    thirtysec:

web-goss.yaml

port:
  tcp:5000:
    listening: true
    ip:
    - 0.0.0.0
@chris-fung

I think this is an old problem since the release of Swarm Mode.

@chris-fung

Run sudo service docker restart on the host whose service container can't be pinged from the good ones; problem solved. It may stay good for a while, until you create new services or update existing ones.

@darklow
Author

darklow commented Mar 30, 2017

@chris-fung It comes back on its own; it happens occasionally, for 1-20 requests in a row. The problem is these small interruptions, when clients actually see a Bad request error in production. I can't catch it or know when it happens. I just receive an error from the Papertrail nginx logs saying the upstream was unreachable, but the rest of the time everything works.

@darklow
Author

darklow commented Mar 30, 2017

Yesterday, after posting this issue, I noticed there is a new version and upgraded all nodes from 17.03.0-ce to 17.03.1-ce. So far I haven't received any errors or downtime, but a full 24 hours haven't passed yet and I haven't updated/rotated any services. I will post an update on how it goes with this version.

@Fank

Fank commented Mar 30, 2017

We have the same problem in our swarm too. I updated the swarm to 17.03.1-ce and will check whether it happens again.

@chris-fung

I also updated the swarm to 17.03.1-ce and still hit the same problem just now. After restarting the docker engine on the problem host everything goes back to normal, but it will happen again.

@thaJeztah
Member

Some questions;

  • What options did you use when creating the swarm? Did you provide --advertise-addr and/or --listen-addr?
  • Are all nodes located in the same region? If not, do "worker-3" and "worker-4" have something in common (e.g., same datacenter as the manager, or not)?

These messages:

memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached
memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received

indicate that there's a problem with communication between the nodes; this could be related to not specifying --advertise-addr when initialising the swarm.
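
For reference, a sketch of how that could be set explicitly (using the manager address from the docker info output above; adjust to an address all nodes can actually reach, and the worker token is a placeholder):

# on the manager: advertise the address the other nodes should use
docker swarm init --advertise-addr 172.31.31.146

# on each worker: join via that advertised address
docker swarm join --token <worker-token> 172.31.31.146:2377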

A few times a day, for a few requests, I receive errors saying that nginx couldn't connect to the upstream, and the 10.0.0.6 IP address is always mentioned:

Can you share how the "upstream" in your nginx server is configured?

  • is it connecting to the "web" service by name, or is the Virtual IP hard-coded?
  • is it connecting to the service, or individual containers (tasks) of the service?
  • the free nginx version by default only resolves IP addresses once; did you specify a "resolver"? (You can use 127.0.0.11 for the embedded DNS server; see the sketch below.)
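
A minimal sketch of that, assuming the embedded DNS server and the "web" service name from above; using a variable in proxy_pass makes nginx re-resolve the name at request time instead of only at startup:

# inside the server block
resolver 127.0.0.11 valid=10s;

location / {
    set $web_upstream http://web:5000;
    proxy_pass $web_upstream;
}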

@darklow
Author

darklow commented Mar 30, 2017

@thaJeztah I tried both specifying and not specifying --advertise-addr, and when calling swarm join on the workers, both times --listen-addr was identical: the same value I see in inet addr when calling ifconfig eth0, usually something like 172.31.19.226. But to answer for the particular case when these errors appeared, at that time I had specified --advertise-addr manually.

Both worker nodes are in the same datacenter, same availability zone, same VPC and subnet, and were created identically with docker-machine create.

For nginx I use the web service name and the following config:

server {
        listen 81 proxy_protocol default_server;
        ....
        real_ip_header   proxy_protocol;
        set_real_ip_from 10.0.0.0/8;
        set_real_ip_from 172.16.0.0/12;
        set_real_ip_from 192.168.0.0/16;

        location / {
            proxy_pass http://30sec_web:5000;
            proxy_redirect off;

            proxy_http_version 1.1;
            proxy_set_header Upgrade            $http_upgrade;
            proxy_set_header Connection         "upgrade";
            # Behind ELB 'https' scheme var is lost
            proxy_set_header X-Forwarded-Proto  'https';
            proxy_set_header Host               $http_host;
            proxy_set_header X-Real-IP          $proxy_protocol_addr;
            proxy_set_header X-Forwarded-For    $proxy_protocol_addr;
            proxy_set_header X-Forwarded-Port   $server_port;
            proxy_set_header X-Request-Start    $msec;
        }
}

Which approach is officially correct: service name or IP address? I couldn't find it in the docs, and I even tried using http://tasks.30sec_web, but the results were even worse, so I switched back.

I wasn't aware that the free version only resolves IP addresses once, and I am not using a resolver. Hmm, that may help, although I wonder: since this is not a permanent issue and it resolves itself within a few seconds, can it be related to the resolver at all?

@Fank

Fank commented Mar 30, 2017

@thaJeztah

  • Created the swarm with docker swarm init
  • We run VMware virtualization and all nodes are located on the same host. (For testing purposes we have one worker node in another datacenter.)

I use a golang application which communicates with a redis service, and when the "Marking ... as failed, suspect timeout reached" message appears, I see the following in the golang app:
https://gist.github.com/Fank/6ab36ac8ad64563fbbdd0aa1d14375d5#file-golang-log

The weird thing is that it works sometimes.

@thaJeztah
Member

@darklow @Fank thanks for the additional information

Which approach is officially correct: service name or IP address? I couldn't find it in the docs, and I even tried using http://tasks.30sec_web, but the results were even worse, so I switched back.

Resolving by service-name is correct

I wasn't aware that the free version only resolves IP addresses once, and I am not using a resolver. Hmm, that may help, although I wonder: since this is not a permanent issue and it resolves itself within a few seconds, can it be related to the resolver at all?

When using the service name, that shouldn't make a difference here; the VIP of the service should not change (but IP-addresses of individual tasks can, which could be a problem if you used that 😄)

I'm gonna have to defer this to the networking and SwarmKit people, who may be able to ask more targeted questions 😄

ping @dongluochen @aboch PTAL

@chris-fung

chris-fung commented Mar 30, 2017

@thaJeztah

  • Created the swarm with docker swarm init and then with --advertise-addr.
  • Running Hyper-V Ubuntu on different Windows 10 physical machines, 6 nodes in total.
  • Created services with both VIP and DNSRR endpoint modes.
  • All services are on the same overlay network "mynet".

For example, I deploy services in the following format (service name, replicas, endpoint mode, published ports, image); a docker service create sketch follows the list:
1, nginxlb 2 vip ["8000":"8000","8001":"8001","8443":"8443"] (private registry)/nginx:alphine
2, kong 3 dnsrr kong:0.10.1
3, konga 1 dnsrr pantsel/konga:latest
4, user_service 3 dnsrr (private registry)/user_service:latest
5, other services on dnsrr mode....
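
As a concrete sketch, entry 2 above would roughly correspond to a command like this (flags assumed from the list; not my exact command line):

docker service create \
  --name kong \
  --replicas 3 \
  --endpoint-mode dnsrr \
  --network mynet \
  kong:0.10.1
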
and here's my nginx.conf

upstream kong-8000 {
        least_conn;
        server kong:8000 max_fails=3 fail_timeout=60 weight=1;
    }
    server {
        server_name _;
        listen 8000;

        location / {
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
            proxy_pass http://kong-8000;
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504 http_404;
            proxy_next_upstream_tries 2;
        }
    }

    upstream kong-8001 {
        least_conn;
        server kong:8001 max_fails=3 fail_timeout=60 weight=1;
    }
    server {
        server_name _;
        listen 8001;

        location / {
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
            proxy_pass http://kong-8001;
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504 http_404;
            proxy_next_upstream_tries 2;
        }
    }

    upstream kong-8443 {
        least_conn;
        server kong:8443 max_fails=3 fail_timeout=60 weight=1;
    }
    server {
        server_name _;
        listen 8443;

        location / {
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
            proxy_pass http://kong-8443;
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504 http_404;
            proxy_next_upstream_tries 2;
        }
    }

Whenever the service connection problem happens, I ping other services (or run nslookup) from inside a container of the nginxlb service, over and over again: pinging by service name or directly by VIP (dnsrr services have different IPs depending on their replicas). Eventually I realize that I can't ping any VIP (or service name) belonging to some node. I also tried pinging from inside a service container on that node; of course, service containers on that node can ping each other. Restarting the node solves the problem.
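
For anyone reproducing this, a rough sketch of the checks I mean (the service name "kong" is just an example from my setup):

# from inside a container attached to the overlay network
nslookup kong 127.0.0.11          # service name -> VIP
nslookup tasks.kong 127.0.0.11    # service name -> individual task IPs
ping -c 3 kong

# from a manager: check which VIPs the service was assigned
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' kong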

@vovimayhem

I've been noticing this too, since about two weeks ago :(

@Nossnevs

Nossnevs commented Apr 5, 2017

We have the same problem. We have 4 nodes that lose connection to each other, and it is not a network problem. It seems to happen randomly. After a couple of minutes the connections come back and the swarm heals itself, ending with all 4 nodes working.

Docker 17.03

@Fank

Fank commented Apr 7, 2017

I did the following for each node:

  • demoted the node
  • removed the node from the swarm
  • stopped docker
  • deleted /var/lib/docker
  • deleted the /etc/systemd/system/docker.service.d overrides
  • deleted everything in /etc/docker
  • initialised a new swarm or joined the new swarm

And it looks like it worked.

My old configuration was a systemd override with the parameters --bip=172.18.0.1/8 --storage-driver=overlay2 for nodes 1-3 and no extra configuration for nodes 4-5.
The new configuration is an /etc/docker/daemon.json with:

{
	"storage-driver": "overlay2"
}

@sudo-bmitch

sudo-bmitch commented Apr 10, 2017

I've been seeing the same issue. Some (possibly all) overlay IPs stop responding; DNS still resolves the IP, but connections to a port on the IP hang indefinitely. Restarting just the docker daemon sometimes solves the issue, but today we needed a full reboot to recover. Services are running inside swarm mode, the networks created are "attachable", and sometimes the target IP is a standalone container running outside of swarm mode. If it's helpful, I also have daemon-data and goroutine-stacks dumps that were generated during this issue.
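
(In case it helps anyone else capture the same data: if I remember correctly, dockerd writes a goroutine stack dump when it receives SIGUSR1 and logs the location of the generated file. A rough sketch:)

# ask dockerd to dump its goroutine stacks
sudo kill -SIGUSR1 $(pidof dockerd)
# then look in the daemon log for the path of the dump it wrote
sudo journalctl -u docker.service --since "1 min ago" | grep -i stack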

Docker version is 17.03.1-ce (similar issue was seen with 1.13.1)
Host OS is RHEL 7.3 with kernel 3.10.

Looking through my logs after restarting just dockerd, on host2 I see:

Apr 10 09:01:06 ***host2 dockerd: time="2017-04-10T09:01:06.257178027-06:00" level=debug msg="Neighbor entry already present for IP 10.1.10.67, mac 02:42:0a:00:01:03"
Apr 10 09:01:06 ***host2 dockerd: time="2017-04-10T09:01:06.264903885-06:00" level=debug msg="Neighbor entry already present for IP 10.0.1.3, mac 02:42:0a:00:01:03"
Apr 10 09:01:06 ***host2 dockerd: time="2017-04-10T09:01:06.264921849-06:00" level=debug msg="Neighbor entry already present for IP 10.1.10.67, mac 02:42:0a:00:01:03"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.562636938-06:00" level=debug msg="***host2.***.***-e2a59500b1df: Initiating bulk sync with nodes [***host1.***.***-0edb2b0f8955]"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.562678204-06:00" level=debug msg="***host2.***.***-e2a59500b1df: Initiating unsolicited bulk sync for networks [qd3qzz21s7jm398aq4paq5ke2 qww7tii19z86fug7vg8nsa4a2 dgruwla6gowe8jk8yzvcwcacz vdp1cqfnz5lnz071rt6p2kvgl sn93tbqpqsygur817q15mxd6b ydebbtvtjd3l5va7qvwjvi7ti] with node ***host1.***.***-0edb2b0f8955"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.670120679-06:00" level=debug msg="memberlist: TCP connection from=10.1.10.67:41288"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.677351582-06:00" level=debug msg="checkEncryption(dgruwla, 10.1.10.67, 4100, false)"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.677429689-06:00" level=debug msg="List of nodes: map[10.1.10.67:10.1.10.67]"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.677447405-06:00" level=debug msg="Programming encryption for vxlan 4100 between <nil> and 10.1.10.67"
Apr 10 09:01:07 ***host2 dockerd: time="2017-04-10T09:01:07.677490120-06:00" level=debug msg="/usr/sbin/iptables, [--wait -t mangle -C OUTPUT -p udp --dport 4789 -m u32 --u32 0>>22&0x3C@12&0xFFFFFF00=1049600 -j MARK --set-mark 13681891]"

On host1, I'm seeing:

Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.143150823-06:00" level=debug msg="Creating service for vip 10.0.2.6 fwMark 729 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth6242: link becomes ready
Apr 10 09:01:07 ***host1 kernel: br0: port 2(veth6242) entered forwarding state
Apr 10 09:01:07 ***host1 kernel: br0: port 2(veth6242) entered forwarding state
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.151639028-06:00" level=debug msg="Miss notification, l2 mac 02:42:0a:00:02:07"
Apr 10 09:01:07 ***host1 NetworkManager[1160]: <info>  [1491836467.1609] device (vethd28a2ce): driver 'veth' does not support carrier detection.
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.191785808-06:00" level=debug msg="Creating service for vip 10.0.2.4 fwMark 727 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.235685943-06:00" level=debug msg="Creating service for vip 10.0.2.2 fwMark 728 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.280173727-06:00" level=debug msg="Creating service for vip 10.0.2.8 fwMark 725 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.325045640-06:00" level=debug msg="Creating service for vip 10.0.2.5 fwMark 726 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.391084213-06:00" level=debug msg="Creating service for vip 10.0.1.11 fwMark 307 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth24804: link becomes ready
Apr 10 09:01:07 ***host1 kernel: br0: port 16(veth24804) entered forwarding state
Apr 10 09:01:07 ***host1 kernel: br0: port 16(veth24804) entered forwarding state
Apr 10 09:01:07 ***host1 NetworkManager[1160]: <info>  [1491836467.3996] device (vethd1009fd): driver 'veth' does not support carrier detection.
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.399638183-06:00" level=debug msg="Miss notification, l2 mac 02:42:0a:00:01:03"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.444175587-06:00" level=debug msg="Creating service for vip 10.0.1.36 fwMark 308 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.498370688-06:00" level=debug msg="Creating service for vip 10.0.1.14 fwMark 310 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.543398402-06:00" level=debug msg="Creating service for vip 10.0.1.32 fwMark 337 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.587459954-06:00" level=debug msg="memberlist: TCP connection from=10.1.10.68:54094"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.593656923-06:00" level=debug msg="Creating service for vip 10.0.1.23 fwMark 306 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.641793602-06:00" level=debug msg="Creating service for vip 10.0.1.20 fwMark 309 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.663756416-06:00" level=debug msg="***host1.***.***-0edb2b0f8955: Initiating  bulk sync for networks [qd3qzz21s7jm398aq4paq5ke2 qww7tii19z86fug7vg8nsa4a2 dgruwla6gowe8jk8yzvcwcacz vdp1cqfnz5lnz071rt6p2kvgl sn93tbqpqsygur817q15mxd6b ydebbtvtjd3l5va7qvwjvi7ti] with node ***host2.***.***-e2a59500b1df"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.693889010-06:00" level=debug msg="Creating service for vip 10.0.1.5 fwMark 311 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.739212582-06:00" level=debug msg="Creating service for vip 10.0.1.31 fwMark 312 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.783444768-06:00" level=debug msg="Creating service for vip 10.0.1.27 fwMark 305 ingressPorts []*libnetwork.PortConfig(nil) in sbox 0953d75 (d850b7a)"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07-06:00" level=info msg="Firewalld running: false"
Apr 10 09:01:07 ***host1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth4d8da54: link becomes ready
Apr 10 09:01:07 ***host1 kernel: docker_gwbridge: port 13(veth4d8da54) entered forwarding state
Apr 10 09:01:07 ***host1 kernel: docker_gwbridge: port 13(veth4d8da54) entered forwarding state
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.851440265-06:00" level=debug msg="sandbox set key processing took 1.920494571s for container d850b7ae4eb86f5d85ca3fab123cfd68c4e77dcf0cf7ce7d38ffd062937f5f2b"
Apr 10 09:01:07 ***host1 NetworkManager[1160]: <info>  [1491836467.8920] device (vethe9991e7): driver 'veth' does not support carrier detection.
Apr 10 09:01:07 ***host1 NetworkManager[1160]: <info>  [1491836467.8958] device (veth4d8da54): link connected
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.895955172-06:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-container\", Id:\"d850b7ae4eb86f5d85ca3fab123cfd68c4e77dcf0cf7ce7d38ffd062937f5f2b\", Status:0x0, Pid:\"\", Timestamp:(*timestamp.Timestamp)(0xc42b6e3d50)}"
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.896865954-06:00" level=debug msg=OpenMonitorChannel
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.897184732-06:00" level=debug msg="waiting on events" module="node/agent/taskmanager" task.id=syg4ozkkvbli3gu4arbew4xnd
Apr 10 09:01:07 ***host1 dockerd: time="2017-04-10T09:01:07.897219511-06:00" level=debug msg="libcontainerd: event unhandled: type:\"start-container\" id:\"d850b7ae4eb86f5d85ca3fab123cfd68c4e77dcf0cf7ce7d38ffd062937f5f2b\" timestamp:<seconds:1491836467 nanos:895640198 > "

@vlindhol

Seconding this: I couldn't connect to any services in the swarm (connection refused in the browser) on my dev machine until I restarted Docker. I didn't even stop the swarm or the services, so they came back when I started Docker again, and now I can connect. I'm running Docker for Mac, 17.03.1-ce-mac5.

@veryeasily

veryeasily commented Apr 21, 2017

Does anybody have a good workaround for this? I have been seeing a very intense version of this, with 50 or so errors a day, on a project that is not in production yet. I'd like to give swarm mode a try! But this may be the issue that keeps me from doing it.

Right around a timeout error, my journald logs for the swarm host look like:

Apr 21 03:46:59 cms1 dockerd[12847]: time="2017-04-21T03:46:59.725213177-04:00" level=debug msg="memberlist: TCP connection from=192.168.179.224:56542"
Apr 21 03:46:59 cms1 dockerd[12847]: time="2017-04-21T03:46:59.726369613-04:00" level=debug msg="cms1-cc27fac68311: Initiating  bulk sync for networks [sqsqgt1w1tx6sdkl1ffkwd4gu k4rs7wj8j9l
Apr 21 03:47:03 cms1 dockerd[12847]: time="2017-04-21T03:47:03.523876017-04:00" level=debug msg="cms1-cc27fac68311: Initiating bulk sync with nodes [api1-af079e26080c]"
Apr 21 03:47:03 cms1 dockerd[12847]: time="2017-04-21T03:47:03.524471218-04:00" level=debug msg="cms1-cc27fac68311: Initiating unsolicited bulk sync for networks [k4rs7wj8j9l7iphvvzgxf1tvg
Apr 21 03:47:03 cms1 dockerd[12847]: time="2017-04-21T03:47:03.533055208-04:00" level=debug msg="memberlist: TCP connection from=192.168.215.132:58804"
Apr 21 03:47:03 cms1 dockerd[12847]: time="2017-04-21T03:47:03.534115983-04:00" level=debug msg="cms1-cc27fac68311: Bulk sync to node api1-af079e26080c took 3.891603ms"
Apr 21 03:47:08 cms1 dockerd[12847]: time="2017-04-21T03:47:08.325381012-04:00" level=debug msg="memberlist: Initiating push/pull sync with: 192.168.179.224:7946"
Apr 21 03:47:14 cms1 dockerd[12847]: time="2017-04-21T03:47:14.543454903-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:15 cms1 dockerd[12847]: time="2017-04-21T03:47:15.541288812-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:16 cms1 dockerd[12847]: time="2017-04-21T03:47:16.541270876-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:17 cms1 dockerd[12847]: time="2017-04-21T03:47:17.541311754-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:18 cms1 dockerd[12847]: time="2017-04-21T03:47:18.541204905-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:19 cms1 dockerd[12847]: time="2017-04-21T03:47:19.542193460-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:20 cms1 dockerd[12847]: time="2017-04-21T03:47:20.541292566-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:21 cms1 dockerd[12847]: time="2017-04-21T03:47:21.545220110-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:22 cms1 dockerd[12847]: time="2017-04-21T03:47:22.545227792-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:23 cms1 dockerd[12847]: time="2017-04-21T03:47:23.545229680-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:24 cms1 dockerd[12847]: time="2017-04-21T03:47:24.549134318-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:25 cms1 dockerd[12847]: time="2017-04-21T03:47:25.489986967-04:00" level=debug msg="memberlist: TCP connection from=192.168.179.224:56548"
Apr 21 03:47:25 cms1 dockerd[12847]: time="2017-04-21T03:47:25.545209182-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:26 cms1 dockerd[12847]: time="2017-04-21T03:47:26.545232844-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:27 cms1 dockerd[12847]: time="2017-04-21T03:47:27.545211926-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:28 cms1 dockerd[12847]: time="2017-04-21T03:47:28.545274141-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:29 cms1 dockerd[12847]: time="2017-04-21T03:47:29.545082355-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:29 cms1 dockerd[12847]: time="2017-04-21T03:47:29.745951147-04:00" level=debug msg="Name To resolve: cmsdb."
Apr 21 03:47:29 cms1 dockerd[12847]: time="2017-04-21T03:47:29.746504653-04:00" level=debug msg="Lookup for cmsdb.: IP [10.0.0.5]"
Apr 21 03:47:29 cms1 dockerd[12847]: time="2017-04-21T03:47:29.746988872-04:00" level=debug msg="Name To resolve: cmsdb."
Apr 21 03:47:29 cms1 dockerd[12847]: time="2017-04-21T03:47:29.747267332-04:00" level=debug msg="Lookup name cmsdb. present without IPv6 address"
Apr 21 03:47:30 cms1 dockerd[12847]: time="2017-04-21T03:47:30.545170634-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:31 cms1 dockerd[12847]: time="2017-04-21T03:47:31.545120803-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:32 cms1 dockerd[12847]: time="2017-04-21T03:47:32.545203829-04:00" level=debug msg="miss notification for dest IP, 10.0.0.5"
Apr 21 03:47:33 cms1 dockerd[12847]: time="2017-04-21T03:47:33.523741853-04:00" level=debug msg="cms1-cc27fac68311: Initiating bulk sync with nodes [cms2-f57bb311959b]"
Apr 21 03:47:33 cms1 dockerd[12847]: time="2017-04-21T03:47:33.524200451-04:00" level=debug msg="cms1-cc27fac68311: Initiating unsolicited bulk sync for networks [k4rs7wj8j9l7iphvvzgxf1tvg
Apr 21 03:47:33 cms1 dockerd[12847]: time="2017-04-21T03:47:33.529436213-04:00" level=debug msg="memberlist: TCP connection from=192.168.206.113:57736"
Apr 21 03:47:33 cms1 dockerd[12847]: time="2017-04-21T03:47:33.530864193-04:00" level=debug msg="cms1-cc27fac68311: Bulk sync to node cms2-f57bb311959b took 4.530115ms"

Docker info

Containers: 6
 Running: 4
 Paused: 0
 Stopped: 2
Images: 42
Server Version: 17.04.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 118
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: og2yk0fadxw7k6a4o2m7aid9k
 Is Manager: true
 ClusterID: 1898wcbrzxlv9z0z10eolq2cn
 Managers: 1
 Nodes: 6
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 192.168.216.172
 Manager Addresses:
  192.168.216.172:2377
Runtimes: runc
Default Runtime: runc
Init Binary:
containerd version: 422e31ce907fd9c3833a38d7b8fdd023e5a76e73
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-72-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.796GiB
Name: cms1
ID: W3OM:ABYC:HRAY:XUG5:XIMW:LEHN:WI3R:47PI:5VBD:72MF:6B3V:4KE4
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 75
 Goroutines: 223
 System Time: 2017-04-21T03:49:59.823812455-04:00
 EventsListeners: 2
Username: nonintuitive
Registry: https://index.docker.io/v1/
Labels:
 org.whitney.app=cms
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

and docker version

Client:
 Version:      17.04.0-ce
 API version:  1.28
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 18:07:42 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.04.0-ce
 API version:  1.28 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 18:07:42 2017
 OS/Arch:      linux/amd64
 Experimental: false

@vovimayhem

I keep seeing this, now on a Docker for Azure CE swarm I deployed from store.docker.com a couple of weeks ago.

@muhammadwaheed

@Fank
I did the following steps for each swarm node

@thevkozlovsky

thevkozlovsky commented Apr 25, 2017

Hi guys.

I had the same problem. The reason was the virtual IP address (VIP), which is enabled by default. I turned off the VIP everywhere and the problems with the network connection were resolved.

To disable the virtual IP address, start your services with the option --endpoint-mode dnsrr. For example:

$ docker service create \
  --replicas 3 \
  --name my-dnsrr-service \
  --network my-network \
  --endpoint-mode dnsrr \
  nginx

If you can't use this option when starting your services (for example, you start them with docker-stack.yml), then you can access your services by the name tasks.my-service (not my-service). In this case the virtual IP address will not be used.
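
(For what it's worth, if I'm not mistaken newer compose file formats, 3.3 and later, also expose this in the stack file itself. A minimal sketch, with service and network names as placeholders:)

version: "3.3"
services:
  my-service:
    image: nginx
    deploy:
      endpoint_mode: dnsrr
      replicas: 3
    networks:
      - my-network

networks:
  my-network:
    driver: overlay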

From the documentation:

$ nslookup my-service

Server:    127.0.0.11
Address 1: 127.0.0.11

Name:      my-service
Address 1: 10.0.9.2 ip-10-0-9-2.us-west-2.compute.internal

10.0.9.2 is the virtual IP address.

and:

$ nslookup tasks.my-service

Server:    127.0.0.11
Address 1: 127.0.0.11

Name:      tasks.my-service
Address 1: 10.0.9.4 my-web.2.6b3q2qbjveo4zauc6xig7au10.my-network
Address 2: 10.0.9.3 my-web.1.63s86gf6a0ms34mvboniev7bs.my-network
Address 3: 10.0.9.5 my-web.3.66u2hcrz0miqpc8h0y0f3v7aw.my-network

10.0.9.4, 10.0.9.3 and 10.0.9.5 are the "real" IP addresses of the containers running your service.

For more information read this https://docs.docker.com/engine/swarm/networking/#use-dns-round-robin-for-a-service

@thaJeztah
Member

ping @aboch @sanimej PTAL

@sulphur

sulphur commented May 9, 2017

Any news on that issue? I'm hitting the same problem. Even going through tasks.service is not working anymore :( I had to re-initialize my whole cluster ... ;(

@Fank

Fank commented May 9, 2017

@muhammadwaheed @thaJeztah I debugged a lot and tried some different environments for my network-loss issues between swarm nodes.

  1. VMware VM, Ubuntu 16.04 LTS, on slow storage (hugely overloaded storage system): sometimes causes network loss.
  2. VMware VM, Ubuntu 16.04 LTS, on a slow host (lots of VMs on the same host, huge RAM usage and 40% avg CPU): sometimes causes "no acks received" and other network-related problems.
  3. Physical host, HPE BL460c, Ubuntu 16.04 LTS, with slow storage (using shared storage): works fine (no network issues), but slow service re-/starts etc.
  4. Physical host, HPE BL460c, Ubuntu 16.04 LTS, with its own storage (integrated or network attached, same result): works fine and fast.

We tried multiple VMware hosts/clusters and hardware, and also tested docker versions between 17.03 and 17.05. It looks like a virtualization environment causes some weird issues with docker; we switched to a physical environment and it works fine without any issues.

We have been using solution 4 since Monday and have received very positive feedback about the performance and stability of the applications running on it.

@Shoeler

Shoeler commented May 9, 2017

Correlation doesn't imply causation, but I wonder if it has more to do with VMware's vSwitching. We use VMware + vSwitches and have the same problems you describe. I expect it to be more closely related to that.

@cpjolly

cpjolly commented Nov 9, 2017

Docker version 17.10.0-ce, build f4ffd25 is still showing the same swarm instability. I could upload the log files if you really want, but the messages are the same as previously reported.

  • I have a 4-host swarm (3 managers, one worker) on Digital Ocean.
  • Each host is running the latest Debian 9.2 and the latest version of Docker.
  • The swarm has 19 containers, which are static & PHP websites & a MariaDB database cluster.
  • Every few days one of the hosts loses connection to the swarm for a few minutes.
  • It drops out of the swarm and kills all its containers.
  • The swarm re-starts the missing containers on the remaining 3 hosts.
  • A few minutes later, the 4th host can see the swarm again and reconnects.
  • End result: 3 hosts running all 19 containers, 1 host running zero containers.
  • The cycle repeats itself every few days.
  • The host that loses connection varies.

@DBLaci

DBLaci commented Nov 9, 2017

I tested with 17.11-rc3 (Docker version 17.11.0-ce-rc3, build 5b4af4f) and I can reproduce the issue.

Two services in a common network. I open a console on the first and try to access the second one (running with 2 replicas); sometimes it works and sometimes it doesn't (connection refused). Killing (and auto-restarting) the failing one solves the problem (for some time). From the failing service's perspective everything is fine, so this is a network problem. The (common) network is a manually created overlay network.

@svscorp

svscorp commented Nov 9, 2017

@DBLaci Thanks for posting the test results.

I will hold off on updating to 17.11, then.

@lipingxue

@fcrisciani Thanks for your comment. Could you or someone else look into issue #35249? Is it a known limitation of Swarm or a bug? Is it specific to VMware environments or not? Thanks!

@fcrisciani
Contributor

@DBLaci do you have any repro steps to share? Connection refused is the result of an RST packet, so the ICMP is coming back.

@fcrisciani
Contributor

@lipingxue I gave you some steps to follow to get the debugging started; please continue the thread there.

@fcrisciani
Contributor

@cpjolly can you verify that you are not in any of these situations: #32195 (comment)?
If you can share the logs and some more information, that would be useful.

@lipingxue

@fcrisciani Thanks. I will try the steps you gave.

@DBLaci

DBLaci commented Nov 10, 2017

@fcrisciani There is no exact way to reproduce it, because the problem appears somewhat randomly or after a while (possibly when idle):

  • I create overlay networks manually (there are 10-20 of them!)
  • I deploy stacks that contain 1 or more services and communicate over the manually created overlay networks. Most of the services have 2 replicas.
  • After the deploy everything works fine (except for the problems introduced in 17.10 and 17.11-rc: "stop one task in docker swarm mode will lead to task restart again and again" #35417)
  • After a while (maybe the next day) the problem occurs as I described above. The service is running but cannot be accessed from another service (or at least from some of them), and I get connection refused until I restart it manually. The healthcheck and everything else show the service itself is working.

Docker 17.11-rc3 (at the moment), Ubuntu 16.04 (default sysctl now) on AWS. CPU/memory is not a bottleneck; the swarm nodes (5) are on one subnet, with 3 of them being managers (3/5).

@cpjolly

cpjolly commented Nov 10, 2017

@fcrisciani,

I don't think any of the issues mentioned apply. These are Digital Ocean 2GByte 2CPU nodes and have plenty of free resources. The websites are not heavily loaded.

Here is the section from the logfile from a host when it dropped out of the swarm. I've removed the IP addresses.

Nov  6 23:28:45 d-o-host7 dockerd[722]: time="2017-11-06T23:28:45.941268749Z" level=info msg="Node join event for bb76507e2dc7/***aaa.bbb.ccc.ddd***"
Nov  6 23:28:57 d-o-host7 dockerd[722]: time="2017-11-06T23:28:57.723605347Z" level=info msg="NetworkDB stats d-o-host7(6e4824341349) - netID:wowg7fjnddbtpz0q9bx48r2jf leaving:false netPeers:4 entries:17 Queue qLen:0 netMsg/s:0"
Nov  6 23:28:57 d-o-host7 dockerd[722]: time="2017-11-06T23:28:57.724237444Z" level=info msg="NetworkDB stats d-o-host7(6e4824341349) - netID:f2uav87x5y8bnq4ames7uq962 leaving:false netPeers:4 entries:45 Queue qLen:0 netMsg/s:0"
Nov  6 23:29:15 d-o-host7 dockerd[722]: time="2017-11-06T23:29:15.944516390Z" level=info msg="Node join event for e9965448a620/***aaa.bbb.ccc.ddd***"
Nov  6 23:29:21 d-o-host7 dockerd[722]: time="2017-11-06T23:29:21.055909590Z" level=info msg="Node join event for bb76507e2dc7/***aaa.bbb.ccc.ddd***"
Nov  6 23:29:45 d-o-host7 dockerd[722]: time="2017-11-06T23:29:45.947775720Z" level=info msg="Node join event for e9965448a620/***aaa.bbb.ccc.ddd***"
Nov  6 23:30:11 d-o-host7 dockerd[722]: time="2017-11-06T23:30:11.876339601Z" level=error msg="agent: session failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9
Nov  6 23:30:15 d-o-host7 dockerd[722]: time="2017-11-06T23:30:15.950635615Z" level=info msg="Node join event for e9965448a620/***aaa.bbb.ccc.ddd***"
Nov  6 23:30:16 d-o-host7 dockerd[722]: time="2017-11-06T23:30:16.890923475Z" level=error msg="agent: session failed" error="session initiation timed out" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9
Nov  6 23:30:16 d-o-host7 dockerd[722]: time="2017-11-06T23:30:16.996404646Z" level=error msg="wait not found for raft request id 5f4ec4d98072" raft_id=65d32c457c187f9c
Nov  6 23:30:19 d-o-host7 dockerd[722]: time="2017-11-06T23:30:19.611111035Z" level=error msg="wait not found for raft request id 5f4ec4d98073" raft_id=65d32c457c187f9c
Nov  6 23:30:21 d-o-host7 dockerd[722]: time="2017-11-06T23:30:21.324538008Z" level=info msg="Node join event for df2b22cd5cbf/***aaa.bbb.ccc.ddd***"
Nov  6 23:30:21 d-o-host7 dockerd[722]: time="2017-11-06T23:30:21.991436364Z" level=error msg="agent: session failed" error="session initiation timed out" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9
Nov  6 23:30:25 d-o-host7 dockerd[722]: time="2017-11-06T23:30:25.506045192Z" level=error msg="Error in responding to bulk sync from node ***aaa.bbb.ccc.ddd***: failed to send a TCP message during bulk sync: dial tcp ***aaa.bbb.ccc.ddd***:7946: i/o timeout"
Nov  6 23:30:27 d-o-host7 dockerd[722]: time="2017-11-06T23:30:27.309034164Z" level=error msg="agent: session failed" error="session initiation timed out" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9
Nov  6 23:30:33 d-o-host7 dockerd[722]: time="2017-11-06T23:30:33.246762399Z" level=error msg="agent: session failed" error="session initiation timed out" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9
Nov  6 23:30:37 d-o-host7 dockerd[722]: time="2017-11-06T23:30:37.846276479Z" level=warning msg="memberlist: Failed fallback ping: write tcp ***aaa.bbb.ccc.ddd***:37134->***aaa.bbb.ccc.ddd***:7946: i/o timeout"
Nov  6 23:30:38 d-o-host7 dockerd[722]: time="2017-11-06T23:30:38.591673139Z" level=error msg="agent: session failed" error="session initiation timed out" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9
Nov  6 23:30:39 d-o-host7 dockerd[722]: time="2017-11-06T23:30:39.094446886Z" level=warning msg="memberlist: Failed fallback ping: write tcp ***aaa.bbb.ccc.ddd***:37130->***aaa.bbb.ccc.ddd***:7946: i/o timeout"
Nov  6 23:30:45 d-o-host7 dockerd[722]: time="2017-11-06T23:30:45.028017879Z" level=error msg="agent: session failed" error="session initiation timed out" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9
Nov  6 23:30:45 d-o-host7 dockerd[722]: time="2017-11-06T23:30:45.954611279Z" level=info msg="Node join event for bb76507e2dc7/***aaa.bbb.ccc.ddd***"
Nov  6 23:30:48 d-o-host7 dockerd[722]: time="2017-11-06T23:30:48.846692736Z" level=info msg="Node join event for e9965448a620/***aaa.bbb.ccc.ddd***"
Nov  6 23:30:56 d-o-host7 dockerd[722]: time="2017-11-06T23:30:56.345654780Z" level=error msg="agent: session failed" error="session initiation timed out" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9
Nov  6 23:31:01 d-o-host7 dockerd[722]: time="2017-11-06T23:31:01.637129587Z" level=error msg="wait not found for raft request id 5f4ec4d98074" raft_id=65d32c457c187f9c
Nov  6 23:31:15 d-o-host7 dockerd[722]: time="2017-11-06T23:31:15.957222161Z" level=info msg="Node join event for e9965448a620/***aaa.bbb.ccc.ddd***"
Nov  6 23:31:21 d-o-host7 dockerd[722]: time="2017-11-06T23:31:21.078212864Z" level=info msg="Node join event for bb76507e2dc7/***aaa.bbb.ccc.ddd***"

@fcrisciani
Contributor

@cpjolly there are several different components that are complaining about network connectivity:
swarmkit:

Nov  6 23:30:11 d-o-host7 dockerd[722]: time="2017-11-06T23:30:11.876339601Z" level=error msg="agent: session failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9
[...]
Nov  6 23:30:56 d-o-host7 dockerd[722]: time="2017-11-06T23:30:56.345654780Z" level=error msg="agent: session failed" error="session initiation timed out" module=node/agent node.id=s04yzgmx1lp7lr0b82u4kkru9

memberlist:

Nov  6 23:30:25 d-o-host7 dockerd[722]: time="2017-11-06T23:30:25.506045192Z" level=error msg="Error in responding to bulk sync from node ***aaa.bbb.ccc.ddd***: failed to send a TCP message during bulk sync: dial tcp ***aaa.bbb.ccc.ddd***:7946: i/o timeout"
[...]
Nov  6 23:30:39 d-o-host7 dockerd[722]: time="2017-11-06T23:30:39.094446886Z" level=warning msg="memberlist: Failed fallback ping: write tcp ***aaa.bbb.ccc.ddd***:37130->***aaa.bbb.ccc.ddd***:7946: i/o timeout"

It looks like the TCP connections of the control plane are being dropped and there is trouble re-establishing them.
If you have network monitoring for the physical network, you can check around that timestamp to see whether there was a glitch.

@fcrisciani
Contributor

@DBLaci once you hit the connection-refused condition, can you open a shell in the source and destination containers and try to ping each other using the container IPs directly, not the VIP?
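
(Something along these lines; container and service names are placeholders:)

# find the task containers of the service on each node
docker ps --filter "name=<service-name>"

# get the destination container's IP on the overlay network
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <dst-container>

# from the source container, ping that task IP directly, bypassing the VIP
docker exec -it <src-container> ping -c 3 <dst-container-ip>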

@omerh

omerh commented Nov 22, 2017

@DBLaci I had similar issues on 17.09.
Upgrading to 17.10.0-ce solved the issue we had.
It was related to the overlay fix for transient IP reuse: docker/libnetwork#1935

I didn't create a lot of networks manually; most of them were created using stack deploy.

You should check ping as @fcrisciani mentioned, to try to isolate the problem you have.

@gnasr

gnasr commented Dec 23, 2017

Same issue here with Docker version 17.11.0-ce, build 1caf76c.
Containers randomly cannot access other services or external resources like RDS instances; if I restart the service it starts to work again.

@jmarcos-cano

jmarcos-cano commented Jan 11, 2018

I had a similar issue with:

  • docker engine versions: 17.09.0, 17.09.1 (tried with both versions installed across the 3 nodes, one version at a time)
  • OS: CentOS 7
  • kernel: 3.10.0-693.11.6.el7.x86_64
  • 3 nodes (all 3 of them managers)
  • services configured with 2 replicas

Everything started fine for a few minutes; then 1 of the 2 replicas stopped working, and all outbound traffic (on the _default network as well) and ingress traffic got stuck (timed out), both from the container and from the (failing) host itself. After a while we did a tcpdump and realized we had run into this problem.

Running sudo sysctl -w net.ipv4.tcp_timestamps=0; sudo sysctl -w net.ipv4.tcp_window_scaling=0 on the hosts actually solved our problems.
(Beware: the change is temporary; to make it permanent I had to modify /etc/sysctl.conf.)

So I would recommend listing your TCP sysctl settings with sudo sysctl -p | grep tcp and checking whether those two are enabled.

UPDATE
my final setup was like this:

  • tcp_tw_recycle: 0
  • tcp_timestamps: 0
  • tcp_window_scaling: 1
  • Docker upgraded to 17.12

It has been running for more than 7 days now across 16 swarm clusters of 3+ servers each, with no problems at all; all network glitches are gone. Hope you guys find it useful.
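
(A sketch of how those values can be persisted; the standard sysctl file is assumed, adjust the path/layout to your distro:)

# append the final values to /etc/sysctl.conf and reload
cat <<'EOF' | sudo tee -a /etc/sysctl.conf
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_window_scaling = 1
EOF
sudo sysctl -p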

@prologic
Contributor

FWIW this seems to be the default on RancherOS (which is what I run everywhere):

dm1:~$ sudo sysctl -a 2>/dev/null | egrep '(net.ipv4.tcp_timestamps|net.ipv4.tcp_window_scaling)'
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1

/cc @SvenDowideit

@thaJeztah thaJeztah added this to backlog in maintainers-session Jan 11, 2018
@thaJeztah thaJeztah removed this from backlog in maintainers-session Jan 11, 2018
@dmandalidis
Contributor

Hi,

We're experiencing the same issue with 17.06.2-ce. It just happened again, and we captured the stack traces of both the host running an nginx and the host hosting the target service.

The logs were captured while I was running curl from inside nginx to the target service. The latter uses dnsrr, and both hosts have the above-mentioned tcp_* values disabled. The whole scenario ended up with a connection timeout. Note that the total number of replicas of other services running in the same network at that point was less than 20.

https://gist.github.com/dmandalidis/0ac8cf6cbd87d5dbb6b2b7b10d4374c7#file-proxy-log
https://gist.github.com/dmandalidis/0ac8cf6cbd87d5dbb6b2b7b10d4374c7#file-target-service-log

Hope it may help with troubleshooting.

@fcrisciani
Contributor

@prologic it's been the default on Linux since 2.6.8 :)

@prologic
Contributor

@fcrisciani Oh :D I'm just trying to work out what net.ipv4.tcp_timestamps and net.ipv4.tcp_window_scaling have to do with the seemingly random networking issues I see in my setup after a few weeks of runtime :)

@thaJeztah thaJeztah removed this from the 17.06.0 milestone Jan 25, 2018
@fcrisciani
Contributor

Several improvements in the networking area were merged in the latest set of releases.
Considering that this issue is already collecting several different signatures, it would be good to close it here and have new problems opened as new issues.

@cpuguy83
Member

cpuguy83 commented Feb 7, 2018

I agree.

Thanks @fcrisciani
Thanks everyone!

@cpuguy83 cpuguy83 closed this as completed Feb 7, 2018
@thaJeztah
Member

I'm locking the conversation on this issue to prevent it from collecting new reports, which are easily overlooked on closed issues. Please open a new issue instead (you can add a link to this issue in your report if you think there's a relation).

@moby moby locked and limited conversation to collaborators Feb 7, 2018