"host not found" for service in same overlay network #26523
Comments
I have the same issue, but sometimes not on all nodes. For example:
On the first host (cds-stage-ms1) the task failed because it couldn't resolve the name of another service.
@michaelkrog @nyanloutre the DNS seems to be resolving properly (to the Virtual-IP). But, the …
Thanks @mavenugo. I discovered the issue because the proxy service, based on nginx, had the same result: every 2nd request resulted in 'no route to host' output in the log. I had to remove the 2 workers to get my "swarm" working, and that resolved the issues I was having. I will try to reproduce in a new environment.
@michaelkrog okay, got it. There were a bunch of fixes that went in after 1.12.1; maybe this is fixed by one of them. If you can try a Docker daemon from master (https://master.dockerproject.org/) and confirm, that will help.
… and another #25266?
For me it's been working properly since 1.12.1-RC1.
@michaelkrog we released 1.12.2, which contains a lot of fixes in this area, and this issue may be resolved; could you give 1.12.2 a try and see if it's resolved for you?
So I finally managed to recreate my setup and upgrade it to 1.12.2 – and it works! 👍 Awesome guys!
Thanks @michaelkrog!
But then again… after 8 hours, problems started to occur again. I had scaled my proxy service to 3 instances and my previsto-site to 3 instances as well. Suddenly some requests fail when requested by a proxy instance on the master node:
Rescaling the proxy service (3 -> 1 -> 3) fixes it for now, but I fear it will occur again soon.
And now, after approx. 15 hours of setting up my cluster, the proxy service is not able to connect to any instance of my previsto-site service anymore – no matter what node the instances reside on. Scaling the services up/down no longer fixes the issue. The only working solution is to remove all worker nodes again and have only one node.
ping @mrjana |
One thing I haven't mentioned is that I have been using an encrypted overlay network all along, following this procedure.
I have now created a new unencrypted network, set up the services there, added 2 worker nodes again, and have scaled the services to 3 instances each again. It has run flawlessly for 30 minutes now. I'll report any issues I might hit here.
@michaelkrog could you try running the check-config script to see if anything is missing? https://github.com/docker/docker/blob/master/contrib/check-config.sh
Sure, @thaJeztah. This is the output from all 3 instances:
ping @mrjana any thoughts? ^^
After bootstrapping, running docker network create --driver overlay --opt encrypted foobar and then having the containers each curl one another typically results in none of them being able to communicate over the network. I have seen it work occasionally, but typically all three containers cannot reach each other. If I drop the encrypted flag it works fine; unfortunately, I need the encryption for my use case. This is possibly related:
I'm doing this in AWS, on a standard Ubuntu 14.10 AMI, with Docker 1.12.2 on all nodes.
@coryleeio could it be related to #27425?
ping @aboch
@thaJeztah possibly, if the documented ports are indeed incorrect. I am only opening … However, I do have it working at present with encrypted networking enabled, and only those ports open between the instances. I just bootstrapped the swarm a few times and kept redeploying the containers until it worked.
@coryleeio note that it's not port 50, but protocol 50 (ESP): https://github.com/docker/docker.github.io/pull/230/files. I'm not too familiar with AWS's settings, but I hope it helps.
Opened up protocol 50 (it's under "Custom Protocol" in AWS). For each test I stopped the docker service, deleted the docker data directory, restarted the docker service, did docker swarm init and docker swarm join, then created my containers, each with node constraints so they wouldn't move around. Nodes were re-used and were not rebooted between runs. Here are the commands used: docker network create --driver overlay --opt encrypted foobar. Then I'd exec in each container and do:

Test #1 (protocol 50 enabled)
Test #2 (protocol 50 enabled)
Test #3 (protocol 50 disabled)

After test 3 I turned off protocol 50 traffic to see if I could prove that enabling protocol 50 helped something. Curling a, b, and c worked after disabling protocol 50, so enabling/disabling it didn't seem to have an effect on the traffic. My guess is the traffic isn't encrypted, but I've not verified.

Test #4 (protocol 50 disabled)
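For reference, that reset-and-retest sequence as shell commands (a sketch; the join token, manager address, hostname constraint, and the alpine image are placeholders, and removing /var/lib/docker wipes all local Docker state):

```
# on every node: full reset of the local Docker state
sudo service docker stop
sudo rm -rf /var/lib/docker        # destroys images, containers, swarm state
sudo service docker start

# re-form the swarm
docker swarm init                                            # on the manager
docker swarm join --token <worker-token> <manager-ip>:2377   # on each worker

# recreate the encrypted overlay and a pinned service per node
docker network create --driver overlay --opt encrypted foobar
docker service create --name a --network foobar \
  --constraint 'node.hostname == <node-a>' alpine sleep 1d
```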
On my most recent run, Test #5:
On B: can't resolve anything, including itself.
On C: can resolve C, cannot resolve anything else.
What gets me is that it works just fine some of the time. I can't quite pin down whether it's a timing issue or what.
To verify traffic is encrypted, while doing your reachability testing, check whether the following command, run on the docker hosts, is intercepting encrypted packets:
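(The command itself was stripped from the copy above; presumably it was a tcpdump filtering for ESP, where eth0 is an assumed interface name:)

```
# capture ESP (IP protocol 50) packets; encrypted overlay traffic shows up here
sudo tcpdump -n -i eth0 esp
```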
Also, when the failure happens because of address resolution, can you manually check whether a ping from the container to the other containers' IP addresses also fails?
Thanks @aboch, I checked and was able to validate that the encryption is working, apparently even when I disable it in the security group. Not sure why that is, but I'm glad it's encrypted. I was working through an example with it failing, and after a minute or two it started working; is there possibly a delay on it coming up?

As for checking the IP addresses, I'll spin my cluster up and down a few times in the morning and see if I can get it to stop working again. In the meantime, I want to make sure I am looking at the correct thing, because I'm seeing something strange. If I do docker inspect B on the B node … On the D node, I exec /bin/sh on container D and try to reach B. curl 10.0.0.5 -v results in:
curl b -v
This is strange, as it does not match the IP address that I would expect for B. curl 10.0.0.4 -v
Am I looking at the right IP address? That one doesn't match the one on the other node, but both seem to work. I have two services in my cluster, and the IP address of B shows as 10.0.0.9. So I'm a bit confused about that, as 3 IP addresses seem to be working but only two containers exist.

10.0.0.9 - - [24/Oct/2016:22:33:30 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.38.0" "-"

I will try to reproduce the fail state in the morning, and will confirm whether I can address the container by its IP address at that time.
When you curl the service name (…), you are reaching the service's virtual IP. If your service had more than one replica, the resulting connections will be load-balanced across the replicas, via ipvs.
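One way to see both addresses for yourself (a sketch; `b` is the service name from the example above, and the second command must run on the node hosting the container):

```
# the virtual IP the service name resolves to
docker service inspect -f '{{range .Endpoint.VirtualIPs}}{{.Addr}} {{end}}' b

# the real IP of an individual task's container on its overlay network(s)
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <container-id>
```

That would also explain why three addresses respond while only two containers exist: two task IPs plus a service VIP live in the same subnet.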
BTW, I replicated your setup on AWS with an Ubuntu 14.04 AMI (kernel 3.13) and so far so good; things work fine over the encrypted network. Not sure about your security group configuration, but I can confirm ESP packets need to be allowed in order for containers to communicate over IPsec. I will check tomorrow, after a key rotation has happened, to see if that is the reason for the issue.
What @coryleeio is experiencing is exactly what I am seeing too. Since switching to a non-encrypted network I am seeing no issues on Ubuntu 16.04 / Docker 1.12.2 / Digital Ocean / private networking. It's been smooth for 3 days now. In my case it could perhaps be the key rotation that caused problems. As mentioned earlier, after upgrading to 1.12.2 it worked flawlessly for 8 hours before issues appeared. Then suddenly it was a complete mess, even after rescaling services.
@michaelkrog I do not have a DO account to try this out, so if I don't hit the issue in my AWS setup, we can debug on yours. Please run these commands on each docker host before creating the services: …

Once at least one task is deployed on more than one docker host on the encrypted network, you should count … With … please make a copy of … You will know when the rotation happens if you monitor the xfrm activity with … After the rotation, take a copy of … again. Thanks!

EDIT: Rotation happens every 12 hours.
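(The command snippets were stripped from the copy above; judging by the `ip xfrm state` / `ip xfrm policy` output posted further down, they were presumably along these lines, a reconstruction rather than the original:)

```
# snapshot the IPsec SAs and policies before the rotation
sudo ip xfrm state  > xfrm-state-before.txt
sudo ip xfrm policy > xfrm-policy-before.txt

# watch xfrm activity live; a key rotation shows up as new SAs being installed
sudo ip xfrm monitor
```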
My 3 nodes (one manager, 2 workers) went through a key rotation. So far, so good. I have scaled the services up and down and verified each task can connect to the services and can ping the other tasks' IPs. I will keep monitoring to see if the issue arises after subsequent key rotations.

In the meanwhile, I would like to make sure the issues you guys are encountering are indeed related to the encryption stuff and that your infra has the required policies to allow the ESP traffic across all nodes. This basic check should do it: … Then repeat 12 hours later (or whenever the rotation has happened, if you can monitor that).

Note: it should not make a difference, but I am not publishing ports when I create the services (as in @coryleeio's case). @michaelkrog if possible, can you post the complete command you use to create your two services? Thanks!
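(The check itself was elided; presumably it was a cross-node reachability test from inside the tasks, something like the following, where the container name and task IP are placeholders:)

```
# from a task on one node, ping the other service by name...
docker exec <container> ping -c 3 previsto-site

# ...and ping a task on another node directly by its overlay IP
docker exec <container> ping -c 3 <remote-task-ip>
```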
My security group (sg-xxxxxx) inbound rules look like the following; outbound is open:

Custom TCP Rule
Custom TCP Rule
Custom UDP Rule
Custom TCP Rule
Custom UDP Rule
Custom Protocol

I added the protocol 50 rule yesterday. It does seem to be really stable today; perhaps rebuilding everything with ESP open was all I needed. I'm going to spin up a bunch of services, a router, and a bunch of databases all running on different encrypted networks, point some health checks at them, and leave it overnight just to confirm, but I'm feeling a lot better about it now with ESP enabled. It doesn't quite explain how I was able to get connections before making that change, but since I can validate the encryption I'm not too fussed about it (thanks for that) =]
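(The ports of each redacted rule aren't recoverable from the copy, but the set swarm mode needs between nodes is documented, and those six rules presumably map onto it:)

```
# TCP 2377        cluster management traffic (to managers)
# TCP + UDP 7946  node-to-node container network discovery (gossip)
# UDP 4789        overlay network data path (VXLAN)
# IP protocol 50  ESP, required for overlays created with --opt encrypted
```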
@coryleeio that's good to hear; keep us posted on how it goes.
@aboch I am away from my office till tomorrow, but I will definitely look into it then.
@thaJeztah I managed to produce a weird state with the networks that can occur when you spin the cluster up and down a lot, and I'd be curious if @michaelkrog is perhaps doing something similar…

On manager: …
On workers: …
On manager: …
On all nodes: …

Testing here with exec: all containers can curl all containers, as expected. The network was created as it was needed on all the machines; everything is peachy.

On all nodes: …
On all nodes: …

Still looks good… though why is the network still there?

On manager: …
On workers: …
On manager: …
On workers: …

The IDs still match, but the manager network got downgraded to local scope. Now I run my example on the newly created swarm, but I get a "network already exists", of course. So I remove the foobar network.

On manager: …

The foobar network still exists on the worker, and is a swarm overlay network; the local-scope version was deleted from the manager, but the deletion did not propagate because it was local scope, of course.

On manager I run my example again:

On manager: …

Note the IDs are different, but our containers launch happily and connect to the different networks named foobar (docker ps will show each container running happily on each node, but they won't be able to communicate, since they are on different networks that have the same name, scope, and driver, but different IDs).
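A quick way to spot this state is to compare the network ID on every node; identical names with different IDs means the split described above (a sketch using the thread's `foobar` network):

```
# run on each node; the Id must match across the swarm
docker network inspect -f '{{.Id}} {{.Name}} {{.Scope}}' foobar
```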
Thanks @coryleeio for the extra info. I suggest we wait for @michaelkrog to report his findings, to see if we can rule out the encrypted network for his issue as well. Also, if he has not performed the swarm join/leave sequences you've done, I'd suggest you report your new problem in a separate issue. That way we keep the focus on the originally reported problem.
@aboch Yeah, that makes sense. In regards to my previous posts, in case anyone is following along: changing my security group didn't seem to have an effect, since the protocol 50 traffic was already going through in my AWS configuration, for whatever reason. I created a new ticket; tl;dr: if your network IDs don't match on your various nodes when you find that they can't communicate, you might check out #27796.
So, 1) because my …

First test shows that all requests go through on the unencrypted network, whereas only some go through on the encrypted network. To make sure my networks were not in a weird state, I checked the networks on each node, and they are all identical:
After this I removed all 3 services to start over:
I then ran the commands you requested:
I then created all 3 services again:
I then made a few requests to each of the proxies. First to the proxy on the unencrypted network (published on port 80):
Every request goes through. Then I requested the proxy on the encrypted network (published on port 443):
First 2 requests timed out because the proxy service was unable to reach the previsto-site service. The 3rd request came through.

ip xfrm state

engine 1

engine 2

engine 3

ip xfrm policy

engine 1

engine 2

engine 3
I also tried pinging IPs from one of the tasks on the encrypted network that had returned HTTP 504.
Thank you @michaelkrog for providing the extra information. The IPsec tunnels are properly installed on all nodes. From the … but clearly those encrypted packets did not make it to their destination. In order to see how many encrypted packets were received on each host, I need to see the output of …

But, based on what we have now, my guess is that something is blocking ESP packets from being received by the engine1 host. I don't know much about Digital Ocean, but I think you are in control of defining which traffic can freely be exchanged across your droplets, like the security groups in AWS. Can you double-check that, and make sure that IP protocol 50 packets can be received/sent by all hosts?

As a runtime check, what you could do is run a …
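(The elided command is presumably the statistics variant of the state dump, an assumption based on the sentence above:)

```
# per-SA statistics; the packet counters show how much ESP traffic
# each host has actually sent and received on every tunnel
sudo ip -s xfrm state
```

The trailing "run a …" was presumably the same tcpdump-for-ESP check suggested earlier in the thread.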
Oh my! Entering another segment of my ignorance: IPsec :) So when I set up my environment (back in the 1.12 RC days) I followed the Docker Swarm tutorial, but info about ESP was not included back then. I had my firewall set up like this:
I know nothing about IPsec and how it works, but I guessed that these rules must be blocking the ESP packets you mentioned. So I disabled the firewall on all nodes and, voilà, it works. Every request goes through on both networks. I did not disable the firewall before because, according to the status I could retrieve from Docker, everything seemed to be in order. For an ignorant developer type (like me) it is hard to see why the encrypted network does not work, as the info available via the Docker CLI does not show any errors. I redefined my firewall rules to this:
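(The actual rule list was lost from the copy; assuming ufw, which these Ubuntu 16.04 droplets would typically use, the equivalent would be roughly:)

```
# swarm control and data plane
sudo ufw allow 2377/tcp     # cluster management
sudo ufw allow 7946         # gossip (tcp and udp)
sudo ufw allow 4789/udp     # vxlan overlay traffic

# ESP is a protocol, not a port, so it needs a raw rule
# in /etc/ufw/before.rules, before the COMMIT line:
#   -A ufw-before-input -p esp -j ACCEPT
sudo ufw reload
```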
And then it works with the firewall enabled! 👍
Awesome @michaelkrog, glad we resolved this one.
I know, sorry about that. I realized that was missing only when #27425 was opened. @afrazkhan took care of fixing the documentation in docker/docs#230.
👍 |
Description
I am trying to set up Docker in Swarm mode with 1 manager and 2 workers. They run in Digital Ocean's cloud, and the nodes communicate via private networking.
I am consistently having issues when 2 services connected to the same overlay network try to communicate. Sometimes the resolved IP does not hit some instances of a service, and at other times the host name for a service is not resolvable at all.
Steps to reproduce the issue:
Describe the results you received:
The swarm is apparently working correctly:
Also, my services seem to be running just fine:
But if I ping previsto-site from within proxy, I get this:
However, if I scale the previsto-site service down to 0 and back up to 1, then I can resolve the host name again.
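(As commands, that workaround would be, using the service name from this report:)

```
# force the task to be rescheduled; name resolution then works again for a while
docker service scale previsto-site=0
docker service scale previsto-site=1
```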
Describe the results you expected:
I would expect the DNS resolving to work consistently.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
Digital Ocean, Ubuntu 16.04