Swarm VIP stops working on a node #25693

Closed
vasily-kirichenko opened this issue Aug 14, 2016 · 18 comments

@vasily-kirichenko

Sun 2016-08-14 12:20:27.922132 [s=59b3a4289d084fee96545359c6b36612;i=35b8;b=97a4b05da1804751b13893c761086900;m=26af13d83a;t=53a04a0577ed4;x=eacbf56b46f6e112]
    PRIORITY=6
    _UID=0
    _GID=0
    _SYSTEMD_SLICE=system.slice
    _BOOT_ID=97a4b05da1804751b13893c761086900
    _MACHINE_ID=908ef89f83214d31be40cddb6cbcdd2f
    _HOSTNAME=xxxx
    SYSLOG_FACILITY=3
    _CAP_EFFECTIVE=1fffffffff
    _TRANSPORT=stdout
    _SELINUX_CONTEXT=system_u:system_r:init_t:s0
    SYSLOG_IDENTIFIER=dockerd
    _COMM=dockerd
    _EXE=/usr/bin/dockerd
    _CMDLINE=/usr/bin/dockerd --insecure-registry 1.1.1.1:5000
    _SYSTEMD_CGROUP=/system.slice/docker.service
    _SYSTEMD_UNIT=docker.service
    _PID=1942
    MESSAGE=time="2016-08-14T12:20:27.919724090+03:00" level=error msg="could not resolve peer \"10.255.0.4\": could not resolve peer: serf instance not initialized"

Sun 2016-08-14 12:20:17.839139 [s=59b3a4289d084fee96545359c6b36612;i=35b2;b=97a4b05da1804751b13893c761086900;m=26ae79fd89;t=53a049fbda423;x=68856f06d2d0158d]
    PRIORITY=6
    _UID=0
    _GID=0
    _SYSTEMD_SLICE=system.slice
    _BOOT_ID=97a4b05da1804751b13893c761086900
    _MACHINE_ID=908ef89f83214d31be40cddb6cbcdd2f
    _HOSTNAME=xxxx
    SYSLOG_FACILITY=3
    _CAP_EFFECTIVE=1fffffffff
    _TRANSPORT=stdout
    _SELINUX_CONTEXT=system_u:system_r:init_t:s0
    SYSLOG_IDENTIFIER=dockerd
    _COMM=dockerd
    _EXE=/usr/bin/dockerd
    _CMDLINE=/usr/bin/dockerd --insecure-registry 1.1.1.1:5000
    _SYSTEMD_CGROUP=/system.slice/docker.service
    _SYSTEMD_UNIT=docker.service
    _PID=1942
    MESSAGE=time="2016-08-14T12:20:17+03:00" level=info msg="Firewalld running: false"
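
For reference, a quicker way to pull just the dockerd error lines out of journald than pasting full records; a sketch assuming systemd, as the fields above suggest, with the time window adjusted as needed:

# journalctl -u docker.service --since "2016-08-14 12:00" --no-pager | grep 'level=error'
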
@thaJeztah
Member

Could you provide some more information, as requested in the issue template, and if possible, steps to reproduce?

  • output of docker version
  • output of docker info
  • where are the nodes running? (Physical, AWS, Azure, etc.), are they on the same physical network, same datacenter?

Without this, it'll be hard to tell whether there's a bug here, or to resolve it.

@vasily-kirichenko
Author

Sorry for the lack of info.

# docker version
Client:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:
 OS/Arch:      linux/amd64
# docker info
Containers: 15
 Running: 9
 Paused: 0
 Stopped: 6
Images: 20
Server Version: 1.12.0
Storage Driver: devicemapper
 Pool Name: docker-253:0-135417402-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 4.919 GB
 Data Space Total: 107.4 GB
 Data Space Available: 46.16 GB
 Metadata Space Used: 9.712 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.138 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host overlay null
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 188.6 GiB
Name: <host name here>
ID: IOWP:WM7R:5QRM:UL3W:PEHQ:HYKN:A35I:J3HC:OOYX:HTQO:DTBA:5DB5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
No Proxy: <list of local IPs>
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 <local IP here>:5000
 127.0.0.0/8

I'm trying to run a Swarm on four identical physical servers; they are on the same network, in the same DC. Maybe I could run some diagnostics?

@mavenugo
Contributor

@vasily-kirichenko We fixed a bunch of issues in master. Could you please try 1.12.1-rc1 (https://github.com/docker/docker/releases/tag/v1.12.1-rc1) and let us know how it goes?

@vasily-kirichenko
Author

If I try to join the swarm, I get the following error:

docker swarm join --token SWMTKN-1-xxx <a manager node IP>
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.

@vasily-kirichenko
Author

# docker swarm join --token SWMTKN-1-4l3xx4fuj5lo6ji7cq32stu8e2xvxu9z0mta0cffdfs4958dk4-9cq1tkfmkilgykonoi7ncn7oh 10.70.16.194:2377
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
# telnet 10.70.16.194 2377
Trying 10.70.16.194...
Connected to 10.70.16.194.
Escape character is '^]'.
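
The manager port responds, so it may be worth checking the other ports swarm mode needs between every pair of nodes. A rough sketch, assuming a netcat build that supports -z/-u (the UDP checks are only indicative):

# nc -zv  10.70.16.194 2377    # cluster management (tcp)
# nc -zv  10.70.16.194 7946    # node-to-node gossip (tcp)
# nc -zvu 10.70.16.194 7946    # node-to-node gossip (udp)
# nc -zvu 10.70.16.194 4789    # VXLAN overlay data path (udp)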

@vasily-kirichenko
Author

@mavenugo I installed 1.12.1-rc1 on all four of my nodes. Three of them formed a Swarm, but when I try to join the remaining node, I get this error:

Error response from daemon: x509: certificate has expired or is not yet valid

@justincormack
Contributor

Is the clock set correctly on all the nodes?

@vasily-kirichenko
Author

@justincormack ooooh. It's not. Great point. Will fix it and see if it helps. Thanks!
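
A quick way to compare the clocks, sketched here assuming ssh access to the nodes (h1..h4 are the hostnames that appear later in this thread):

# for h in h1 h2 h3 h4; do ssh root@$h date -u; done    # wall clocks should agree to within a few seconds
# timedatectl                                           # run on each node; check the "NTP synchronized" line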

@vasily-kirichenko
Author

Done. Swarm is up and running:

# docker node ls
ID                           HOSTNAME            STATUS  AVAILABILITY  MANAGER STATUS
0tnyzxy3qdspskgg3n68uh5xf    h1  Ready   Active
76syhda6vln2cqhhzme0rbv2t    h3  Ready   Active        Reachable
89ds8r4z4y9clhzifigy2mv3e    h4  Ready   Active        Leader
by7rxn8q1ziqjfyfxjt2vx5cv *  h2  Ready   Active        Reachable

The service, with a single replica, is running as well:

# docker service ls
ID            NAME       REPLICAS  IMAGE                             COMMAND
dj6w3990y2vw  finch      1/1       x:5000/finch1:1.0

# docker service ps finch | grep Running
03df9um4qv7u8z13iz9vnwwoa  finch.9      x:5000/finch1:1.0  h4  Running        Running 9 minutes ago

Trying to access the service via each node:

# curl http://<h4 IP>:33030/person/kot
{"payload":{"name":"kot","age":41},"server":"331e17234d87/10.255.0.18","appId":"a68c46cc-5604-4cd2-bb55-7f4a938b53b0"}

# curl http://<h3 IP>:33030/person/kot
{"payload":{"name":"kot","age":41},"server":"331e17234d87/10.255.0.18","appId":"a68c46cc-5604-4cd2-bb55-7f4a938b53b0"}

# curl http://<h1 or h2 IP>:33030/person/kot

<HTML><HEAD>
<TITLE>Network Error</TITLE>
</HEAD>
<BODY>
<FONT face="Helvetica">
<big><strong></strong></big><BR>
</FONT>
<blockquote>
<TABLE border=0 cellPadding=1 width="80%">
<TR><TD>
<FONT face="Helvetica">
<big>Network Error (tcp_error)</big>
<BR>
<BR>
</FONT>
</TD></TR>
<TR><TD>
<FONT face="Helvetica">
A communication error occurred: "Connection refused"
</FONT>
</TD></TR>
<TR><TD>
<FONT face="Helvetica">
The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.
</FONT>
</TD></TR>
<TR><TD>
<FONT face="Helvetica" SIZE=2>
<BR>
For assistance, contact your network support team.
</FONT>
</TD></TR>
</TABLE>
</blockquote>
</FONT>
</BODY></HTML>

So I can successfully access the service via nodes h3 and h4, but not via h1 and h2, even though Docker shows that all the nodes are OK.
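
A couple of things that may be worth checking on the nodes that refuse the connection (a sketch; ingress is the default name of the routing-mesh network that swarm mode creates):

# docker network ls | grep -E 'ingress|docker_gwbridge'    # both should exist on every node with the routing mesh active
# docker network inspect ingress                           # look for the 10.255.0.0/16 subnet seen in the responses above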

@vasily-kirichenko
Author

# telnet <h1 IP> 33030
Trying <h1 IP>...
telnet: connect to address <h1 IP>: Connection refused

# telnet <h4 IP> 33030
Trying <h4 IP>...
Connected to <h4 IP>.
Escape character is '^]'.

firewalld is stopped and disabled on all the machines. What should I check next?
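
A few things that can be compared between a working node (h3/h4) and a failing one (h1/h2); a rough sketch:

# ss -tlnp | grep 33030                    # is anything listening on the published port?
# ss -ulnp | grep -E '7946|4789'           # gossip / VXLAN ports used by the overlay
# iptables -t nat -nL | grep -i ingress    # routing-mesh DNAT rules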

@vasily-kirichenko
Author

ifconfig shows about 10 interfaces like this:

veth92560f3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::a416:e7ff:fef4:23b0  prefixlen 64  scopeid 0x20<link>
        ether a6:16:e7:f4:23:b0  txqueuelen 0  (Ethernet)
        RX packets 100091  bytes 26809840 (25.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 100175  bytes 17414130 (16.6 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth9f4c875: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::f47a:b9ff:fef6:beb0  prefixlen 64  scopeid 0x20<link>
        ether f6:7a:b9:f6:be:b0  txqueuelen 0  (Ethernet)
        RX packets 8  bytes 648 (648.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 64  bytes 5092 (4.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vetha49f9ba: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::685e:1fff:fef2:842d  prefixlen 64  scopeid 0x20<link>
        ether 6a:5e:1f:f2:84:2d  txqueuelen 0  (Ethernet)
        RX packets 8  bytes 648 (648.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 66  bytes 5272 (5.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Is it normal?
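
That looks normal: each veth is the host-side end of a container's network attachment (containers behind the routing mesh also get a leg on docker_gwbridge), so the count should roughly track the number of running containers. A quick sanity check, as a sketch:

# ip -o link | grep -c veth     # host-side veth interfaces
# docker ps -q | wc -l          # running containers on this node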

@mrjana
Contributor

mrjana commented Aug 15, 2016

@vasily-kirichenko You seem to be getting an HTTP body back when you curl the hosts where the request is not working. Where is that coming from?

@vasily-kirichenko
Author

@mrjana I believe it's from the corporate proxy server. I added all four node IPs to the no_proxy environment variable, though.

@vasily-kirichenko
Author

The service VIP port is not open on the h1 and h2 machines:

$ netstat -a | grep 33030

However, it's open on h3 and h4:

$ netstat -a | grep 33030
tcp6       0      0 [::]:33030              [::]:*                  LISTEN

Current service state:

$ docker service ps finch | grep Running
03df9um4qv7u8z13iz9vnwwoa  finch.9      xxx:5000/finch1:1.0  h4  Running        Running 17 hours ago

$ docker service inspect --pretty finch
ID:             dj6w3990y2vwhxmo9sw47142n
Name:           finch
Mode:           Replicated
 Replicas:      1
Placement:
UpdateConfig:
 Parallelism:   1
 On failure:    pause
ContainerSpec:
 Image:         xxx:5000/finch1:1.0
Resources:
Ports:
 Protocol = tcp
 TargetPort = 29002
 PublishedPort = 33030

$ docker service inspect finch
[
    {
        "ID": "dj6w3990y2vwhxmo9sw47142n",
        "Version": {
            "Index": 516
        },
        "CreatedAt": "2016-08-14T09:16:01.708904457Z",
        "UpdatedAt": "2016-08-14T15:01:05.468601527Z",
        "Spec": {
            "Name": "finch",
            "TaskTemplate": {
                "ContainerSpec": {
                    "Image": "xxx:5000/finch1:1.0"
                },
                "Resources": {
                    "Limits": {},
                    "Reservations": {}
                },
                "RestartPolicy": {
                    "Condition": "any",
                    "MaxAttempts": 0
                },
                "Placement": {}
            },
            "Mode": {
                "Replicated": {
                    "Replicas": 1
                }
            },
            "UpdateConfig": {
                "Parallelism": 1,
                "FailureAction": "pause"
            },
            "EndpointSpec": {
                "Mode": "vip",
                "Ports": [
                    {
                        "Protocol": "tcp",
                        "TargetPort": 29002,
                        "PublishedPort": 33030
                    }
                ]
            }
        },
        "Endpoint": {
            "Spec": {
                "Mode": "vip",
                "Ports": [
                    {
                        "Protocol": "tcp",
                        "TargetPort": 29002,
                        "PublishedPort": 33030
                    }
                ]
            },
            "Ports": [
                {
                    "Protocol": "tcp",
                    "TargetPort": 29002,
                    "PublishedPort": 33030
                }
            ],
            "VirtualIPs": [
                {
                    "NetworkID": "8beim5na3heghjgl7co3ecioz",
                    "Addr": "10.255.0.8/16"
                }
            ]
        },
        "UpdateStatus": {
            "StartedAt": "0001-01-01T00:00:00Z",
            "CompletedAt": "0001-01-01T00:00:00Z"
        }
    }
]
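
To pull just the published ports and the VIP out of that JSON, a format template along these lines should work (a sketch; the field names come from the output above):

# docker service inspect -f '{{json .Endpoint.Ports}}' finch
# docker service inspect -f '{{json .Endpoint.VirtualIPs}}' finch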

@vasily-kirichenko
Author

vasily-kirichenko commented Aug 15, 2016

If I increase the number of containers so that a container is running on h1, then port 33030 is open on that node. If I decrease the number of containers so that the container running on h1 shuts down, the port is immediately closed.

However, it does not work for node h2: even if several containers are running on it, port 33030 is not open.
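
On a node where the published port is not reachable, the hidden ingress sandbox may be worth a look. A heavily hedged sketch (ingress_sbox is the namespace name Docker 1.12 uses for the routing mesh, and nsenter needs to be installed):

# ls /var/run/docker/netns/                                                  # the routing-mesh sandbox should appear as ingress_sbox
# nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -nL    # published ports should have MARK rules here feeding IPVS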

@vasily-kirichenko
Author

vasily-kirichenko commented Aug 15, 2016

It turns out the DOCKER-INGRESS iptables chain does not exist on nodes h1 and h2 (the problematic ones).
On the other two nodes the chain does exist:

iptables -t nat -L -n
...
Chain DOCKER-INGRESS (2 references)
target     prot opt source               destination
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:33030 to:172.18.0.2:33030
RETURN     all  --  0.0.0.0/0            0.0.0.0/0

Any ideas?
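
To compare that quickly across all four nodes (assuming ssh access), something like:

# for h in h1 h2 h3 h4; do echo "== $h =="; ssh root@$h 'iptables -t nat -nL DOCKER-INGRESS 2>&1 | head -5'; done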

@vasily-kirichenko
Author

OK, I destroyed the swarm and recreated it from scratch, which seems to have helped: DOCKER-INGRESS appears on all the nodes and the service is available via any of them.
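
For reference, tearing down and recreating a 1.12 swarm from scratch looks roughly like this (a sketch; the manager address is a placeholder):

# docker swarm leave --force                          # on every node; --force is required on managers
# docker swarm init --advertise-addr <manager IP>     # on the node that becomes the first manager
# docker swarm join-token worker                      # prints the join command to run on the remaining nodes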

@vasily-kirichenko
Author

Everything works OK. I think the problem was caused by unsynchronized clocks (about 2 hours of divergence). Closing.
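
To keep the clocks from drifting apart again, it is probably worth enabling an NTP client on each node. A sketch for CentOS 7, where chrony is the stock client:

# systemctl enable chronyd && systemctl start chronyd
# timedatectl set-ntp true     # keep NTP synchronization enabled via systemd
# chronyc tracking             # shows the current offset from the NTP sources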
