Wrong behaviour of the DNS resolver within Swarm Mode overlay network #30134
@YouriT What exactly are you doing? Killing the container process from the host, or something else? If the service had replicas, were you killing one of those? If you don't scale it to 0, do you still see some inconsistency, or does it seem to happen only when you scale it down to 0? I can try to recreate the issue with the same sequence.
@sanimej to be fair and honest with you, I have some difficulty finding an exact pattern to reproduce this. For example, for one of the services (shards) nslookup gives me 3 IPs, then 2, then 1. For all of them I still get IP addresses, which shouldn't be the case. I'm going to try to find an exact way to reproduce the bug, but that's pretty hard. My feeling is that it's a bug in the overlay network for some unknown reason. If you could point me to some log commands to get the contents of this cache or something like that, I might be able to give more details. The "crash the container" was either scale=0 or memory exhaustion, which resulted in the container being killed with zero or non-zero exit codes. The services here didn't have any replicas apart from
About not scaling it to 0:
As you can see, after scaling back to 1 for each of those, I'm getting new IPs in round-robin, which is completely wrong :/ I'm willing to provide more information, but that's really hard with this bug.
Btw, restarting docker doesn't change the result. Edit: restarting docker on the host running the service fixed the lookup.
I am experiencing this issue on a swarm hosting multiple service stacks. Occasionally, after removing stacks, containers crashing, or services being scaled down and back up, DNS resolution inside a container for another service returns additional, incorrect results. This completely hoses our setup when it happens to the service hosting our reverse proxy, as requests are proxied to incorrect addresses. Our swarm is running 1.13.1. Each service has certain containers that connect to a "public" overlay network, which is also what our proxy service is connected to. It's within this overlay network that I see this error occurring. What I typically see is that a service is running at an IP address, say, 10.0.0.3, and then gets moved (after being scaled or redeployed) to another IP address, like 10.0.0.12. However, DNS lookup on this service (
ping @sanimej
I see a similar issue. I'm running 4 nodes (1 manager and 3 workers).
UPD: It's not really hard to reproduce. Try to deploy using the following docker-compose file: https://gist.github.com/velimir0xff/28da8e16e01475b2a95f9ac74c069aa0
I have the same problem. After removing services, removing the network, and recreating the network with a new subnet and mask, I still see the old IP address in
I can confirm the same issue with Swarm, Docker version 17.03.1-ce, build c6d412e, and overlay networking.
ping @sanimej @fcrisciani
We are aware of this issue; I'm actively working on a patch.
I seem to have a similar/same issue. When restoring a database (on 3 mongo containers in a replica set over 3 managers, for what it matters) the host/manager becomes unavailable. (Docker AWS with 3 t2.medium managers, no workers.) While the restore is in progress I can barely ssh into the manager. The problem persists with 17.06-rc4. What I've noticed is that the problem only seems to happen when I deploy a second, identical stack (obviously under a different name) and run a mongorestore on the second stack. Initially I thought it would be some kind of conflict between the two stacks, but my understanding is that they are completely isolated. Is that correct? Possibly related to #32841
@activeperception heavy load and AWS t2 instances are not a good combination.
I am facing a bug that could be related.
We're facing an issue which sounds very much like what others have described here. We're running multiple stacks on the same swarm, and it appears that DNS entries get mixed up / go stale. As others have also mentioned, reproducing this issue in a predictable manner can be challenging. We've got a swarm with 5 nodes. One of the stacks has two webservers and two databases: shop_drupalfront (alias: drupalfront / 2 replicas) We suddenly saw that drupalfront would resolve drupaldb to the IP of apidb. Scaling drupalfront down to 0 and then up to 1 would resolve the issue after a few attempts. We have also seen this issue in other stacks. A few observations:
@fcrisciani you mentioned that you were working on a patch for this issue. Are you making progress on that patch?
@sbrattla I think some patches went into Docker 17.06.x; which version are you running?
@thaJeztah we are running 17.06.02-ce. The issues we're seeing could certainly be something different from what's described here, but what this thread describes aligns pretty well with what we're seeing. Is there anything more I can do to identify what's going wrong?
@thaJeztah and @fcrisciani I see that progress has been made on moby/libnetwork#1934 which, judging from the description, "smells" a bit like what could be our issue. Basically, already "taken" IPs are being handed out, resulting in multiple services (load balancers) with the same IP. I see that this patch is "stuck" as it needs review. If this is the patch, how long should we expect before it goes into the CE edition? The issue we're seeing is causing a lot of noise in our development and production environments. Services get their IPs mixed up on a daily/hourly basis, and the only remedy so far is to either downscale services to 0 and then up again (which works sometimes) or reboot entire Docker hosts. I'm editing this post again as we're seeing this issue more and more. We're seeing duplicate IPs for load balancers representing services that run on the same port. That is, it's always load balancers on the same port within the same network that get mixed up. This seems to be the rule.
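For reference, the downscale/upscale stopgap is just the following (using shop_drupalfront from our stack as the example; as noted, it only works sometimes):

docker service scale shop_drupalfront=0
docker service scale shop_drupalfront=1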
Seeing the same issue with the latest edge Docker:
@thaJeztah or @fcrisciani I haven't heard from you in about 3 weeks. What's your take on this issue?
@sbrattla the patch got merged to master; I think this fix will come with the next 17.10 RC2. It would be great to have feedback based on that image.
Anyone still seeing this problem on docker 17.10 or above?
@thaJeztah we have unfortunately not been able to try out 17.10 RC2, so can't really say. Any chance this will be merged into the regular release if you don't receive any negative feedback?
@sachnk to the best of my knowledge, 18.06 GA is scheduled for next week; it will mainly depend on whether the testing phase passes with no hiccups.
Looks like the fix is in 18.06.0-rc1; https://github.com/docker/docker-ce/blob/v18.06.0-ce-rc1/components/engine/vendor/github.com/docker/libnetwork/endpoint.go#L754-L758 18.06.0-rc1 is available for testing in the "test" channel on https://download.docker.com, or use the install script: https://test.docker.com
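A minimal sketch of installing the RC from the test channel, assuming the script follows the same convenience-script pattern as get.docker.com:

curl -fsSL https://test.docker.com -o test-docker.sh
sudo sh test-docker.sh
docker version    # should now report 18.06.0-ce-rc1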
I believe I too am experiencing this issue with 18.06.0-ce. Completely removed and reinstalled twice today (purge, EDIT: I don't quite understand how it all works, but I'm guessing this is a bug in libnetwork ResolveName or ResolveService, given that IP resolution works correctly.

Docker Version

remcampb@remcampb-dev:~/tdm$ docker version
Client:
Version: 18.06.0-ce
API version: 1.38
Go version: go1.10.3
Git commit: 0ffa825
Built: Wed Jul 18 19:11:02 2018
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.0-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: 0ffa825
Built: Wed Jul 18 19:09:05 2018
OS/Arch: linux/amd64
Experimental: false

Docker Info

remcampb@remcampb-dev:~/tdm$ docker info
Containers: 5
Running: 5
Paused: 0
Stopped: 0
Images: 28
Server Version: 18.06.0-ce
Storage Driver: overlay
Backing Filesystem: extfs
Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: zmem59xovfqgkfozol7ud93qg
Is Manager: true
ClusterID: 173q07k838c8itt757d226fle
Managers: 1
Nodes: 1
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 172.31.100.102
Manager Addresses:
172.31.100.102:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: d64c661f1d51c48782c9cec8fda7604785f93587
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-131-generic
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 94.16GiB
Name: remcampb-dev
ID: ETAB:QAG4:SJ35:N4A2:23E6:GPAJ:ITMK:E75A:VJVA:PE7N:BI42:ZXOZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
HTTP Proxy: http://xxx.com:8080
HTTPS Proxy: http://xxx.com:8080
No Proxy: localhost,127.0.0.1,.xxx.com
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

Docker Network

remcampb@remcampb-dev:~/tdm$ docker network inspect tdm_backend
[
{
"Name": "tdm_backend",
"Id": "55f0y5igh52g0qiuy9bi1i2uc",
"Created": "2018-07-31T20:39:04.42637736-07:00",
"Scope": "swarm",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "10.0.0.0/24",
"Gateway": "10.0.0.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"6293b0a79860886dbc4cf79b80ef7dae53401e0ee66dd55eee511a2afda4f688": {
"Name": "tdm_dbms.1.1ixkfmb5x7uqpmw9j1q4k6kaz",
"EndpointID": "0fddef7dc59df0ebf855bab1511956add5d0de8dd32301a5a5c389d1da3ef9cd",
"MacAddress": "02:42:0a:00:00:0c",
"IPv4Address": "10.0.0.12/24",
"IPv6Address": ""
},
"6cd5177b03e596f672d2cf4091c98e6787814cb2fbac3b6cad27a2c01d9e30ae": {
"Name": "tdm_goaccess.zmem59xovfqgkfozol7ud93qg.pf0yuyp09814ywh845k90j7qr",
"EndpointID": "1393efe89962660a3427fd18a4218c14c70dbdfeeddda14dfd9330636ccd1bc0",
"MacAddress": "02:42:0a:00:00:0a",
"IPv4Address": "10.0.0.10/24",
"IPv6Address": ""
},
"94d1d5b65ba4a33a95d91155ac5098655181ea3115c13df79b2910711ad6208b": {
"Name": "tdm_web.zmem59xovfqgkfozol7ud93qg.feq785imcpath3wtttkcvreqa",
"EndpointID": "c1050494902f6e97ac9e2395ed4ff180e7f2e14820d34ffe931980341ddec6cf",
"MacAddress": "02:42:0a:00:00:06",
"IPv4Address": "10.0.0.6/24",
"IPv6Address": ""
},
"d48204783b3d133d615f7868fee8199be4f101121d78a2798bbd4f1fc0652efd": {
"Name": "tdm_nginx.zmem59xovfqgkfozol7ud93qg.9ci0yiagx2sref8a3vbnltwh1",
"EndpointID": "48ba7958dac835bf691cb5cb4c29d9f37a7861b902c8592d07bd0c97dec3f183",
"MacAddress": "02:42:0a:00:00:08",
"IPv4Address": "10.0.0.8/24",
"IPv6Address": ""
},
"e79b4cdc1d3439c8133dcc761c977821969753376648179176f20704895fde4a": {
"Name": "tdm_etl.zmem59xovfqgkfozol7ud93qg.qrnqt9qciv3s3tdd4ww10sy8f",
"EndpointID": "946fae848fd9a846c72b0c955b3a847c027c5d1aee8e32012e190c9ffbd82764",
"MacAddress": "02:42:0a:00:00:04",
"IPv4Address": "10.0.0.4/24",
"IPv6Address": ""
},
"lb-tdm_backend": {
"Name": "tdm_backend-endpoint",
"EndpointID": "3975cd22848a18888a7ba9bf56b3f7b4159a9a72f8bd0b9814060220f2228dca",
"MacAddress": "02:42:0a:00:00:02",
"IPv4Address": "10.0.0.2/24",
"IPv6Address": ""
}
},
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4097"
},
"Labels": {
"com.docker.stack.namespace": "tdm"
},
"Peers": [
{
"Name": "96a36b765eb9",
"IP": "172.31.100.102"
}
]
}
]

Ping PoC

From the output below it's clear that the resolver resolves the hostname incorrectly, while resolving the IP works correctly.

remcampb@remcampb-dev:~/tdm$ docker exec -it d48204783b3d sh
/ # nslookup web
nslookup: can't resolve '(null)': Name does not resolve
Name: web
Address 1: 10.0.0.5
/ # ping web
PING web (10.0.0.5): 56 data bytes
^C
--- web ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
/ # ping 10.0.0.6
PING 10.0.0.6 (10.0.0.6): 56 data bytes
64 bytes from 10.0.0.6: seq=0 ttl=64 time=0.232 ms
64 bytes from 10.0.0.6: seq=1 ttl=64 time=0.074 ms
64 bytes from 10.0.0.6: seq=2 ttl=64 time=0.147 ms
^C
--- 10.0.0.6 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.074/0.151/0.232 ms
/ # arp
tdm_web.zmem59xovfqgkfozol7ud93qg.feq785imcpath3wtttkcvreqa.tdm_backend (10.0.0.6) at 02:42:0a:00:00:06 [ether] on eth1
? (172.18.0.1) at 02:42:b7:41:89:ad [ether] on eth2
? (10.0.0.5) at <incomplete> on eth1
/ # nslookup 10.0.0.6
nslookup: can't resolve '(null)': Name does not resolve
Name: 10.0.0.6
Address 1: 10.0.0.6 tdm_web.zmem59xovfqgkfozol7ud93qg.rzk5pefm0svdcrfzu57ttut69.tdm_backend
/ # nslookup 10.0.0.5
nslookup: can't resolve '(null)': Name does not resolve
Name: 10.0.0.5
Address 1: 10.0.0.5
/ # nslookup web
nslookup: can't resolve '(null)': Name does not resolve
Name: web
Address 1: 10.0.0.5

EDIT: Including debug logs. They show the name resolving to 10.0.0.5 per the resolver instead of 10.0.0.6 per the network.

Debug Logs

level=debug msg="Name To resolve: web."
level=debug msg="[resolver] lookup name web. present without IPv6 address"
level=debug msg="Name To resolve: web."
level=debug msg="[resolver] lookup for web.: IP [10.0.0.5]"
level=debug msg="Name To resolve: dbms."
level=debug msg="[resolver] lookup for dbms.: IP [10.0.0.11]"
level=debug msg="Name To resolve: web."
level=debug msg="IP To resolve 5.0.0.10"
level=debug msg="[resolver] query 5.0.0.10.in-addr.arpa. (PTR) from 172.18.0.6:46919, forwarding to udp:10.200.96.87"
level=debug msg="[resolver] external DNS udp:10.200.96.87 did not return any PTR records for \"5.0.0.10.in-addr.arpa.\""
level=debug msg="IP To resolve 6.0.0.10"
level=debug msg="[resolver] lookup for IP 6.0.0.10: name 51a94afdbe5a.tdm_backend"

libnetwork Diagnostic Tool

remcampb@remcampb-dev:~/tdm$ curl localhost:50015/help
OK
/getentry
/deleteentry
/help
/join
/clusterpeers
/updateentry
/networkstats
/
/gettable
/joinnetwork
/stackdump
/createentry
/ready
/leavenetwork
/networkpeers
remcampb@remcampb-dev:~/tdm$ curl localhost:50015/gettable?tname=endpoint_table\&nid=55f0y5igh52g0qiuy9bi1i2uc
OK
total entries: 5
0) k:`1960d8419b8830e5c5dea09edf91cb75cc47e5c42d069aefc7fb9313c5da0059` -> v:`Cjt0ZG1fd2ViLnptZW01OXhvdmZxZ2tmb3pvbDd1ZDkzcWcucnprNXBlZm0wc3ZkY3JmenU1N3R0dXQ2ORIHdGRtX3dlYhoZOWNjajdvYmlubm1tdGRiZXpleDFqcDdzdCIIMTAuMC4wLjUqCDEwLjAuMC42OgN3ZWJCDDUxYTk0YWZkYmU1YQ==` owner:`cc6ebbfd6253`
1) k:`2c833942674928e7de1e419a4a477758501c378ddd47fbcd506df194802eadb1` -> v:`Cj10ZG1fbmdpbnguem1lbTU5eG92ZnFna2Zvem9sN3VkOTNxZy50Y2s5MHVoZjlsMzM0ZmxteXNveWRxcmZwEgl0ZG1fbmdpbngaGXdmamQ3MDl6Z3R4MTFhcGpnZXFhZjIwY2UiCDEwLjAuMC43KggxMC4wLjAuODoFbmdpbnhCDDNkMWQ3MTVmNzEyMA==` owner:`cc6ebbfd6253`
2) k:`4c396a2eb253072c9c5a419f9fe8eb57ee6d370ad5c241dec013b871f005ad88` -> v:`CkB0ZG1fZ29hY2Nlc3Muem1lbTU5eG92ZnFna2Zvem9sN3VkOTNxZy53N2F4NmdnaHRhanQwcDN6ZHFvczJ2Y3dwEgx0ZG1fZ29hY2Nlc3MaGWx1bHdkN3dzOWc3ZmF5d2lldjdyNXVxZDciCDEwLjAuMC45KgkxMC4wLjAuMTI6CGdvYWNjZXNzQgxlN2VlMzgyNDdhYjc=` owner:`cc6ebbfd6253`
3) k:`b320043e5faf3f68e1594d915f85327a9b30dc9fb6db2a14928693f6f2f59c4b` -> v:`Cjt0ZG1fZXRsLnptZW01OXhvdmZxZ2tmb3pvbDd1ZDkzcWcucnV3dTRpZnI3cDhncnk5Z3cyM2JrMWJpNxIHdGRtX2V0bBoZamhweTlvejFia3E4aHV2MjNmdzBxcjZtbyIIMTAuMC4wLjMqCDEwLjAuMC40OgNldGxCDDgwMTcxYmVkNzdkYg==` owner:`cc6ebbfd6253`
4) k:`fcf1021badb5343704ffff86f26a115d20aad2738513b65c72941d59da5adf1e` -> v:`CiR0ZG1fZGJtcy4xLmFpbXJ3ZGRsMzM0N3d2aDNnZGo0MTczMjYSCHRkbV9kYm1zGhl3dXNnMHh6N3l6aDJrdDM2dXp3MTgzbnlhIgkxMC4wLjAuMTEqCTEwLjAuMC4xMDoEZGJtc0IMY2E1YjNlZmUwN2Jj` owner:`cc6ebbfd6253`
remcampb@remcampb-dev:~/tdm$ curl localhost:50015/gettable?tname=overlay_peer_table\&nid=55f0y5igh52g0qiuy9bi1i2uc
OK
total entries: 6
0) k:`1960d8419b8830e5c5dea09edf91cb75cc47e5c42d069aefc7fb9313c5da0059` -> v:`CgsxMC4wLjAuNi8yNBIRMDI6NDI6MGE6MDA6MDA6MDYaDjE3Mi4zMS4xMDAuMTAy` owner:`cc6ebbfd6253`
1) k:`2c833942674928e7de1e419a4a477758501c378ddd47fbcd506df194802eadb1` -> v:`CgsxMC4wLjAuOC8yNBIRMDI6NDI6MGE6MDA6MDA6MDgaDjE3Mi4zMS4xMDAuMTAy` owner:`cc6ebbfd6253`
2) k:`4c396a2eb253072c9c5a419f9fe8eb57ee6d370ad5c241dec013b871f005ad88` -> v:`CgwxMC4wLjAuMTIvMjQSETAyOjQyOjBhOjAwOjAwOjBjGg4xNzIuMzEuMTAwLjEwMg==` owner:`cc6ebbfd6253`
3) k:`9c48e23bfe3e80b32482b03da32d41f49e975834a6f09c61bb75581f19867df7` -> v:`CgsxMC4wLjAuMi8yNBIRMDI6NDI6MGE6MDA6MDA6MDIaDjE3Mi4zMS4xMDAuMTAy` owner:`cc6ebbfd6253`
4) k:`b320043e5faf3f68e1594d915f85327a9b30dc9fb6db2a14928693f6f2f59c4b` -> v:`CgsxMC4wLjAuNC8yNBIRMDI6NDI6MGE6MDA6MDA6MDQaDjE3Mi4zMS4xMDAuMTAy` owner:`cc6ebbfd6253`
5) k:`fcf1021badb5343704ffff86f26a115d20aad2738513b65c72941d59da5adf1e` -> v:`CgwxMC4wLjAuMTAvMjQSETAyOjQyOjBhOjAwOjAwOjBhGg4xNzIuMzEuMTAwLjEwMg==` owner:`cc6ebbfd6253`

Can update with stackdump if desired.

Host Interfaces
Swarm / docker-compose.yml

remcampb@remcampb-dev:~/tdm$ docker swarm init --advertise-addr eth1

Can post the compose file if desired. I was debugging a separate issue where I couldn't reach any of the published ports from a host interface. Any way to debug the internals of the embedded resolver?
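One way to see what the embedded resolver is doing is to enable daemon debug logging; the per-lookup resolver messages then show up in the daemon logs. A sketch (careful: this overwrites any existing daemon.json):

echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo kill -HUP "$(pidof dockerd)"                   # dockerd re-reads its config on SIGHUP
journalctl -u docker.service -f | grep -i resolver  # follow resolver lookups (systemd hosts)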
@remingtonc
@fcrisciani Thanks for taking a look. The resolver is resolving the hostnames to invalid IPs, e.g. container
@remingtonc isn't it the VIP?
If you want the task IP list, you need to resolve tasks.<service_name>, so it will be nslookup tasks.web
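For example, from inside one of the containers above (addresses as in the earlier report):

/ # nslookup web          # returns the service VIP (10.0.0.5 here)
/ # nslookup tasks.web    # returns the task/container IPs (10.0.0.6 here)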
@fcrisciani That works. I was unfamiliar with the VIP. Do you have any recommendations for troubleshooting VIP connectivity issues between containers? Running effectively a stock Docker installation with EDIT: Or should I, internally to the stack, use
@remingtonc the advantage of using the VIP instead of the container IP is that with the VIP you don't care how many instances of the service are running behind it; it can be 1 or 10, but your application will reach the service using the same IP. You can also choose to use DNS round-robin mode instead of the VIP; then every time you resolve the service name you will get the list of containers, ordered in round-robin fashion. This means you have to be aware that if the container you are talking to goes down, you will need to do another DNS resolution and be sure that the previous results did not get cached. From a debugging point of view, the tools to use are:
I had the same issue resolving a container IP from other containers on the same overlay network.
Issue still present for me.
Hi list, can anyone confirm whether the problem persists in 18.06.1-ce?
I can confirm this problem still exists on

When I connect to the host via

But when I connect using

Edit: So after re-reading this thread I realise it is giving me a VIP (virtual IP) which load-balances between all the containers. However, using the VIP (ie
@markwylde tasks.<service_name> returns the list of IPs of the backing containers behind the specific service, while resolution of the service name returns the VIP; the load balancer then takes care of redirecting traffic towards an active container for that service.
May not be the cause, but make sure that whatever is running in your container is listening on all interfaces (0.0.0.0), not just localhost.
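A quick way to see the difference, as a sketch with made-up service names on an existing overlay network:

# reachable through the service VIP: the server binds all interfaces
docker service create --name web-good --network public \
  python:3-alpine python -m http.server 8000 --bind 0.0.0.0

# NOT reachable through the VIP: the server binds loopback only
docker service create --name web-bad --network public \
  python:3-alpine python -m http.server 8000 --bind 127.0.0.1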
Sorry that I have to add a comment to this thread, because the problem is not solved yet. I use Docker 18.09.0 in a swarm with 4 manager nodes on Ubuntu 16.04 on DigitalOcean droplets. In most cases the swarm behaves fine. Sometimes, after several docker service rm <service_name> and docker stack deploy ... commands, the internal DNS answers with two service IPs for one service. The services are connected to one overlay network. One of the IPs belongs to an old, no-longer-existing service instance; the other is the correct IP of the current healthy service. The services are always available as VIP services. The reverse proxy, in an nginx container, tries to access both; one fails, one succeeds. To resolve this situation, I found no way other than removing the overlay network to reset the internal DNS. I tried docker service update --force, docker service rm ..., docker stack rm, docker service scale service=0, and so on. Perhaps there is a race condition when updating the internal entries for service endpoints. A docker network reinit-dns command would be a solution for resolving these issues. Thanks
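The reset sequence that worked for me looks roughly like this (stack and network names are placeholders):

docker stack rm mystack                          # detach all services from the network
docker network rm my_overlay                     # removes the network and its DNS state
docker network create -d overlay my_overlay
docker stack deploy -c docker-compose.yml mystack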
The problem is still not solved in
@leshik can you share how you are able to reproduce this on a single-node swarm? I just tried with the guide in #30134 (comment) but it looked to be working just fine.
@olljanat sure, here it is. First,

Stack compose file:

version: '3.7'
services:
  first:
    image: alpine
    command: sleep 3600
    init: true
  second:
    image: alpine
    command: sleep 3600
    init: true

Then,

[
{
"Name": "test_default",
"Id": "quisbu0cbrgz9acy9j808hrxu",
"Created": "2018-12-18T10:00:10.846072788+07:00",
"Scope": "swarm",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "10.0.5.0/24",
"Gateway": "10.0.5.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"39c200af9048f20e2ca50a5e59e2534329134bfd4b270c7b6dcd95119116e86b": {
"Name": "test_first.1.qbwe3xsyuiaas50reme2ubtbr",
"EndpointID": "3968618c4ac249f82fab75d1c6e68aa8893f3d9cf8279c95f18df924d0271b69",
"MacAddress": "02:42:0a:00:05:03",
"IPv4Address": "10.0.5.3/24",
"IPv6Address": ""
},
"ccb93d48e2a7e8a6d4c0d0c60872b30f5c4cff43ef97aba4a5aec7118779e612": {
"Name": "test_second.1.rermha7mo5sipex53qm6o0s9o",
"EndpointID": "d37de1616bc297f9c9b047eeed66f75048f5d9ea2347bcf738fb92eb3623080e",
"MacAddress": "02:42:0a:00:05:06",
"IPv4Address": "10.0.5.6/24",
"IPv6Address": ""
},
"lb-test_default": {
"Name": "test_default-endpoint",
"EndpointID": "90df88a051eaf6557bf22883a6659044826d288f41559728ba5b13d6b5efed36",
"MacAddress": "02:42:0a:00:05:04",
"IPv4Address": "10.0.5.4/24",
"IPv6Address": ""
}
},
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4102"
},
"Labels": {
"com.docker.stack.namespace": "test"
},
"Peers": [
{
"Name": "eb787c8b22e8",
"IP": "192.168.1.23"
}
]
}
]

Note the container addresses should be
So far, so good. Now let's ping by name and nslookup each other. On
The other way round,
What? Why ping to
[
{
"Name": "test_default",
"Id": "quisbu0cbrgz9acy9j808hrxu",
"Created": "2018-12-18T10:00:10.846072788+07:00",
"Scope": "swarm",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "10.0.5.0/24",
"Gateway": "10.0.5.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"15875ea4771e581cb63e27f80d97ab52d19d43f967f354e97550af8864d9e798": {
"Name": "test_second.1.uzbttwyswm8t0ajt1z9ucmw4l",
"EndpointID": "42927351b0b94a1acab1c567a959b5d45b36118031bf12571446bed3d7330820",
"MacAddress": "02:42:0a:00:05:07",
"IPv4Address": "10.0.5.7/24",
"IPv6Address": ""
},
"3f904b9f904cba634e697a8db04938c77b1d0bf897a86ab2c020515c228fe9f3": {
"Name": "test_first.1.ys7ul4vpkcpfjulcx5l11phtl",
"EndpointID": "921745cb0d0b5f4e3687792b8fffa14ba088da1ebea1194f76818751bb2b0bde",
"MacAddress": "02:42:0a:00:05:08",
"IPv4Address": "10.0.5.8/24",
"IPv6Address": ""
},
"lb-test_default": {
"Name": "test_default-endpoint",
"EndpointID": "90df88a051eaf6557bf22883a6659044826d288f41559728ba5b13d6b5efed36",
"MacAddress": "02:42:0a:00:05:04",
"IPv4Address": "10.0.5.4/24",
"IPv6Address": ""
}
},
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4102"
},
"Labels": {
"com.docker.stack.namespace": "test"
},
"Peers": [
{
"Name": "eb787c8b22e8",
"IP": "192.168.1.23"
}
]
}
]

Addresses have changed, that's good. What about DNS?
The other way round,
Nope, same thing. What kind of magic is happening here? NAT? Not a good thing; this breaks everything that checks the once-resolved IP address at the application level.
@leshik swarm services are designed to work with n replicas of containers. That's why, by default, swarm creates a load-balancer IP for each service. You can see it with the command

Anyway, if you want to use container addresses instead, then just modify your stack to look like this:
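A sketch of such a modification, assuming the suggestion was endpoint_mode: dnsrr (which makes DNS return container IPs instead of a VIP):

version: '3.7'
services:
  first:
    image: alpine
    command: sleep 3600
    init: true
    deploy:
      endpoint_mode: dnsrr
  second:
    image: alpine
    command: sleep 3600
    init: true
    deploy:
      endpoint_mode: dnsrr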
This is btw documented at https://docs.docker.com/engine/swarm/ingress/#configure-an-external-load-balancer PS. This has nothing to do with the original issue, so please create a new one if you don't get it working with this guidance.
Wow, thanks @olljanat, that helped a lot.
@leshik sounds like a good idea. However, I'm not fully sure which document you are referring to. Can you create a new issue about it at https://github.com/docker/docker.github.io with links to the documents?
I have tried to recreate the problem. Bear in mind that I also had this breaking a few weeks ago. However, using the same version of docker, I can't seem to recreate it. I have tried with both Ubuntu 16 and 18; maybe it was a quirk of the operating system. Using the steps below, everything works exactly as expected. It still bugs me that I can't recreate the bug. I've had to stop using the load balancer for all my services and start using It is possible that I wasn't creating my network with the scope set to swarm. Would that make a difference?
Instead of:
DNS Resolver in Swarm Mode

Setup environment
Create a new network
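A sketch of such a command, with the scope set explicitly (for -d overlay on a manager the scope defaults to swarm, so an explicit --scope is normally redundant):

docker network create --driver overlay --scope swarm --attachable my-network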
Manual service creation works
Stack deploy didn't work, but now does
aaa/docker-compose.yml:
bbb/docker-compose.yml:
This has nothing to do with the original issue, which was about Swarm DNS working incorrectly in cases where a service/container crashed or was removed (and that has already been fixed in 18.09), so if you still see this on the latest version, please create a new issue about it.
We are facing this issue as well. I think that, for some reason, if the health checks fail intermittently, swarm de-registers the service entirely.
Interdependency can definitely be difficult. Generally, containers/services should be designed/configured to be resilient against failures of the services they depend on; those may not (yet) be available (for example, when deploying your stack), but they should also take into account that those services may be (temporarily) unavailable during their whole lifecycle; a network connection can fail, a database may be in maintenance, etc. Having (e.g.) a retry loop to reconnect to those services, as sketched below, could help in such situations.
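As a sketch, a small entrypoint wrapper that retries until a dependency's DNS name resolves before starting the real process (the service name drupaldb is hypothetical, borrowed from an earlier comment):

#!/bin/sh
# wait until the dependency is resolvable on the overlay network
until nslookup drupaldb >/dev/null 2>&1; do
  echo "waiting for drupaldb..."
  sleep 2
done
exec "$@"    # hand off to the container's real command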
Having the same issue with 18.09.5.
@jabteles please create a new issue and fill in all the requested details. This one has already been closed.
I'm trying to set up a MongoDB shard with 2 shards, each shard being a replica set (size=2). I have one mongos router and one replica set (size=2) of config DBs.
I was getting plenty of errors about chunk migration, and after digging I figured out that the target host was sometimes alive and sometimes not. But the containers were not crashed, which was strange.
After digging deeper I figured out that the IP addresses obtained through resolution were not right.
Please note that each service is running in dnsrr mode.

Steps to reproduce the issue (see the consolidated sketch further below):

1. Hard to get exactly the behaviour
2. docker network create -d overlay public
3. nslookup the service you created earlier

Describe the results you received:
Describe the results you expected:
Additional information you deem important (e.g. issue happens only occasionally):
Random issue. If I restart the machine then it works again. Seems the cache is poisoned or something.
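A consolidated sketch of those steps (service name and image are placeholders; dnsrr mode as noted above):

docker network create -d overlay public
docker service create --name shard --network public --endpoint-mode dnsrr mongo
docker service scale shard=0     # simulate the task going away
docker service scale shard=1
# from another container attached to "public":
nslookup shard                   # stale IPs may still show up here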
Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
4 machines running on Ubuntu within OpenStack:
- 1 manager (DRAIN, LEADER)
- 2 for mongod shards and replicas
- 1 worker which has mongos