New swarm overlay network issue #25266
Comments
This seems to be precisely what I experience too since I started testing swarm mode from rc4. I have not yet been able to get a swarm working on our cluster. I think it must be a duplicate of #24234, but this description may be more to the point.
@Noodle05 This works just fine when I tried your example. Can you please make sure the appropriate TCP/UDP ports are opened across the cluster, so that the overlay networking works fine?
@mavenugo I'm sure it's not a firewall issue; I even tried turning the firewall off. Also, if it were a firewall issue, they shouldn't be able to ping each other at all. The container is reachable, just on a different IP. If you look at the output of "ip addr show" inside the container, the two IPs it gets have different netmasks, 10.255.0.7/16 and 10.255.0.6/32. 10.255.0.7 is reachable; 10.255.0.6 is unreachable. @mavenugo May I know how you tried to reproduce it, so I can give it a try? My environment: three Ubuntu 16.04 hosts (latest packages) with Docker 1.12.0 (release version). I do have ufw, but ports 2377/tcp, 4789, and 7946 are open. I also tried stopping ufw; same thing.
Is this the same issue as #23855? If so, it's probably by design? Let me try to go to the port directly; I will report back.
@Noodle05 That's correct. I missed the case of ping being able to reach the individual container IP; this is not a firewall issue. BTW, the 2 IPs that you see in the container are not causing this issue; the second was added to satisfy another requirement. The issue here seems to be more of IPVS not load-balancing correctly. I am using latest docker master (equivalent to 1.12.0) on Ubuntu 14.04. Can you share your daemon logs with debug enabled? I have to assume that there is some issue with the IPVS configuration.
Quick update.
I set up multiple overlay networks. For example, the application service has three connections: to backend (for the database), to ldap-backend (for the LDAP server), and to frontend (for the proxy). @mavenugo If you can provide a link to a document on how to get ipvsadm, it would be greatly helpful. Thanks
@Noodle05 https://www.server-world.info/en/note?os=Ubuntu_14.04&p=lvs shows 'apt-get install ipvsadm'; on CentOS 7 it is 'yum install ipvsadm'.
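Since IPVS mis-load-balancing is suspected above, here is a sketch of how to inspect the swarm load balancer's IPVS tables on a node. This assumes Docker 1.12 on Linux; the `ingress_sbox` namespace name is how the ingress sandbox appears on 1.12-era hosts, so treat the exact path as an assumption for your setup:

```shell
# Install the IPVS admin tool (Ubuntu; on CentOS 7: yum install ipvsadm)
sudo apt-get install -y ipvsadm

# The swarm load balancer lives in hidden network namespaces, not in the
# host namespace, so list them and run ipvsadm inside the ingress sandbox:
ls /var/run/docker/netns/
sudo nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -L -n
```

If the virtual service entries or their real-server backends are missing or stale here, that would point at the load-balancing layer rather than the firewall.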
I am seeing a similar issue with an overlay network and services added to it: docker network create -d overlay --subnet 10.10.0.0/16 redis_net. The redis port of 6379 is reachable from the flask container only when using the redis container IP. When trying to reach it via the VIP or the VIP's DNS record, it returns "no route to host".
Should the exposed port of the redis container be available over the VIP? It seems like it should. All hosts in the swarm are running 1.12.1.
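To tell whether the VIP itself or only DNS is misbehaving, a sketch of the checks worth running (service and network names taken from the comment above; `tasks.<service>` is the round-robin DNS name swarm mode publishes alongside the VIP):

```shell
# VIP(s) assigned to the redis service, per attached network:
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' redis

# From inside the flask container, compare DNS answers:
nslookup redis        # should answer with the service VIP
nslookup tasks.redis  # should answer with the individual task IPs

# Then probe the port against both a task IP and the VIP
# (substitute the addresses returned above):
nc -zv <task-ip> 6379
nc -zv <vip> 6379
```

If the task IP connects but the VIP does not, the problem sits in the IPVS/VIP layer rather than in DNS or the containers themselves.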
Similar: after internal DNS failures resolving service names, I restarted the Docker service and DNS worked again. Docker 1.12.1.
I have restarted the service before and it didn't help. Also, to clarify: DNS resolves to the VIP just fine; DNS resolution isn't broken. It's the routing of the service's port via the VIP that is not working. Even bypassing DNS and using the VIP address directly gives the same "no route to host".
… and again. $ docker exec -ti 3b13024ff420 su - $USER -c 'ssh hg@$CONTAINERNAME'
ssh: Could not resolve hostname $CONTAINERNAME: Name or service not known All three hosts are at ten days' uptime. The target container has been up for six days. The other containers have been up for 47 hours. They could resolve and reach the target back then. Now two of the three can't. @mavenugo can you make any sense of the following? I ran it on an affected host. The target container has 10.255.0.4 on
Meanwhile, after restarting the Docker daemon: the service container on the other affected host can resolve the target container, but is getting the wrong IP address for it. Restarting Docker again doesn't help. Restarting Docker on the host with the log output above… ok, now it's getting the wrong address, too. Frustrating, but consistent. The daemons on all three hosts are binding to the same network for listening. I'll try adding […]
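The comment above trails off while discussing the daemons' listen addresses. One thing worth ruling out in that area (my assumption, not something this thread confirms) is ambiguous address selection on multi-homed hosts; in Docker 1.12 the swarm addresses can be pinned explicitly when (re)creating the swarm:

```shell
# Example addresses; replace with each host's private IP.
# On the manager:
docker swarm init --advertise-addr 10.0.0.1 --listen-addr 10.0.0.1:2377

# On each worker:
docker swarm join --advertise-addr 10.0.0.2 --listen-addr 10.0.0.2:2377 \
  --token <worker-token> 10.0.0.1:2377
```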
[UPDATE: struck out jibe.]
Same issue here: Docker 1.12.1, with 1 manager and 2 nodes running Ubuntu 16.04. Sometimes services can connect and resolve each other's hostnames, sometimes not. Restarting Docker or the server sometimes works.
Also same here. Ubuntu 16.04, Docker 1.12.1, 1 manager and 2 workers on DigitalOcean, communicating via the private network. The overlay network is encrypted. It works only sometimes, and even when it works I still see a few requests time out between services (like 1 of 30 or so).
Edit: Also, when I first created the swarm I did not have the firewall set up correctly for overlay networking between the nodes. I fixed it, but the swarm could not heal itself. I then recreated the swarm and saw overlay networking being replicated in the swarm, but to no avail; the issues persist.
Edit 2: When using proxy_pass to a URL, this is what happens in Nginx:
Same thing here. I have two swarm clusters running on Google Compute Engine with Docker 1.12.1 and an overlay network. The operating system is Ubuntu 16.04 Xenial. Each cluster runs one master and two workers. When I started my services right after the clusters' creation, everything was running fine. After 5 days, my first cluster started to complain that some services were not able to communicate with each other. After 11 days, the same issue hit my second cluster. Removing and recreating the services doesn't fix the communication issue. Restarting the Docker service on each node doesn't resolve it either (at least, not every time). My only service publishing ports outside of the cluster is an Nginx; the others are mostly Java web applications and MongoDB.
@bargenson Curious... are these nodes' times in sync with each other?
ping @bargenson ^^
@cpuguy83 Yes, they are.
need testing.
@cpuguy83 I trust a 5ms delta is acceptable. What's our next step for data gathering? I'll put in effort to help nail this down if the maintainers point me in the right direction.
Forgot to mention: the reason I'm back is that it has, of course, happened again… this time shortly after a reboot of all three nodes. Restarting the affected daemon resolved the incident but not, of course, the issue.
ping @mrjana PTAL!
@garthk If it is a test setup, would you mind testing with 1.12.2-rc3, here: https://github.com/docker/docker/releases/tag/v1.12.2-rc3 ? It fixes a number of potential issues.
@mrjana I've upgraded to 1.12.2 on all three nodes. I haven't seen anything break in the last hour. We'll have to leave the ticket open a couple of weeks given the unpredictable nature of the problem and the lack of any diagnostic steps to nail it down, e.g. "X happens after Y, so watch for another Y and see if X recurs".
@garthk Thanks for upgrading and reporting back. I am fine with keeping the issue open for some time. And yes, we could have gone and debugged the issue step by step, but given that a lot of people hit the same kinds of issues, which are fixed in 1.12.2, it would be unscalable to debug them all on an individual basis, especially since the problem happened at a very low level. For sure this means we probably need better metrics exposed by Docker to scale troubleshooting, and we will look into that more closely in the future.
Looks like the issue is back. Also running CentOS 7.2 on 1.12.2, and Docker services cannot communicate with each other on the same overlay network.
Oops. Could you give the upstream branch a try?
Those of you updating to Docker 1.12.2: is that the client only, or the client and the various daemons? I'm finding that when I update the daemons, my services are unable to start. Would love to figure out a solution to this problem; it's killing me that my services can't talk to each other when they are on different nodes.
I am facing this issue as well, with Docker version 1.12.3, build 34a2ead (via CoreOS 1235.0.0). I created a 3-node CoreOS cluster using their Alpha CloudFormation template, then edited the EC2 security group to allow all inbound traffic to all ports (TCP+UDP) from that security group. Then I pretty much followed the steps in this MongoDB replica set tutorial to create the swarm, network, volumes, and services. I now have my 3 services running, one on each node. When I […] I'll be happy to provide any diagnostic logs/info you might need.
@bourquep The CoreOS fork of Docker contains various modifications, which can play a role here, so it would be worth reporting that to them. Also note that if your setup is using an encrypted overlay network […]. When you ping, are you pinging a container or the service? Pinging the VIP across hosts may not work, but you should be able to connect (e.g. […]).
@thaJeztah I have all ports opened between the hosts. I was trying to ping a service from within a container. After rebooting all 3 nodes, I was able to initiate the MongoDB replica set and have the 3 […] I'll let the CoreOS team know about this.
@bourquep When you open 'all ports', you are referring to the TCP layer. But for an encrypted overlay network, protocol 50 of the IP layer (ESP) needs to be open too.
I have this issue reproduced on Ubuntu 16.04 with Docker 1.12.4. I have only one encrypted overlay network, and services are unreachable. DNS seems to work, but the IPs can't be pinged and the services are unavailable on their respective ports.
@explicitcall Could you open a new issue with details? Perhaps it's a different issue, and this thread is getting lengthy; it may be good to open a "fresh" issue.
@explicitcall Do you have the service reachability issue only if the network is encrypted? Can you try with the service connected to an unencrypted network?
@sanimej Great suggestion: I can't reproduce it with an unencrypted network. I will keep an eye on it for the next few days, in light of previous reports about the issue being sporadic.
@explicitcall As suggested by @michaelkrog (also check #26523), please make sure IP protocol 50 packets (ESP) can be freely exchanged across your hosts. Make sure your firewall allows them.
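For reference, a sketch of the openings swarm overlay networking needs between hosts, expressed as iptables rules (adapt to ufw, firewalld, or cloud security groups; the ESP rule only matters for networks created with `--opt encrypted`):

```shell
iptables -A INPUT -p tcp --dport 2377 -j ACCEPT  # cluster management traffic
iptables -A INPUT -p tcp --dport 7946 -j ACCEPT  # node gossip (TCP)
iptables -A INPUT -p udp --dport 7946 -j ACCEPT  # node gossip (UDP)
iptables -A INPUT -p udp --dport 4789 -j ACCEPT  # VXLAN overlay data plane
iptables -A INPUT -p esp -j ACCEPT               # IP protocol 50 (ESP), for encrypted overlays
```

Note that ESP is not a TCP or UDP port at all, which is why "opening all ports" in a security group can still leave encrypted overlay traffic blocked.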
@aboch Does this work differently with encryption enabled/disabled? I have this working without encryption, without any other changes to firewall settings or anything else.
It does.
@thaJeztah I agree that this is becoming a catch-all issue, and I think we should consider closing and locking it and encouraging folks to open new issues.
@mavenugo Agreed. For anyone arriving here: have a look at the discussion above to see if there's a solution mentioned for your situation. If you're still having issues, please open a new bug report with as much detail as possible (following the bug report template: https://raw.githubusercontent.com/docker/docker/master/.github/ISSUE_TEMPLATE.md), so that we can look into your issue. There are many possible causes for overlay networking not working; many are related to configuration, but if you think it's not a configuration issue but a bug, open a new issue.
I'm testing Docker 1.12 swarm mode and having some issues with the overlay network.
The environment is a three-node swarm (one manager, two workers), all running on Ubuntu 16.04 with Docker 1.12.
Create two alpine SSH services:
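The command block appears to have been lost from the report here; a sketch of what it presumably looked like (the image name and published ports are my guesses, any image running sshd would do):

```shell
docker service create --name ssh1 --publish 2201:22 sickp/alpine-sshd
docker service create --name ssh2 --publish 2202:22 sickp/alpine-sshd
```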
I found out which node each is running on and went to that node (ssh1's, for example); executing "ip addr show" inside the container shows it got two IP addresses from the ingress network.
And docker inspect shows the IP address is 10.255.0.7.
But internal DNS resolves ssh1 to 10.255.0.6.
And from ssh2 I can ping 10.255.0.7 but cannot reach 10.255.0.6. Since DNS resolves ssh1 to 10.255.0.6, ssh2 cannot access ssh1.
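The checks described above, as commands (container names/IDs are placeholders; the IPs and the `ingress` network key are the ones reported in this issue, so adjust to your setup):

```shell
# Inside the ssh1 task: shows both 10.255.0.7/16 and 10.255.0.6/32
docker exec <ssh1-container> ip addr show

# What docker knows about the task's ingress address (10.255.0.7 here):
docker inspect --format '{{.NetworkSettings.Networks.ingress.IPAddress}}' <ssh1-container>

# From the ssh2 task: DNS answers with 10.255.0.6, which is unreachable
docker exec <ssh2-container> nslookup ssh1
docker exec <ssh2-container> ping -c1 10.255.0.7   # reachable per the report
docker exec <ssh2-container> ping -c1 10.255.0.6   # unreachable per the report
```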
I tried creating another overlay network and got the same result.
Anybody know what I did wrong?