Containers on overlay network cannot reach other containers #35807
This seems to be related to ARP: if I query the neighbors in the container, I get failed and incomplete statuses.
I unfortunately didn't save the tcpdump for ARP requests I had running at the time, but I remember that I was only seeing requests going to individual container addresses rather than broadcasting to the whole subnet. My understanding of ARP and skill with tcpdump aren't great, though, so maybe that's how it's supposed to work. I also confirmed that it still affects docker 17.12.0 and consul 1.0.2. The fact that it is at the ARP level makes me think this is solidly in docker, though.
I recreated it, and here's some tcpdump:
The 10.50.50.23 address is another container on the same host. Containers on some hosts are connectable, but containers on others are not. And some arp output:
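The failed/incomplete neighbor statuses mentioned above can be spotted by filtering the neighbor table. A minimal sketch, using a simulated sample of `ip neigh show` output (the addresses are hypothetical; on a real host you would run `ip neigh show` inside the affected container's network namespace and pipe it into the same filter):

```shell
# Simulated `ip neigh show` output like that described in the report.
sample_output='10.50.50.23 dev eth0 lladdr 02:42:0a:32:32:17 REACHABLE
10.50.50.41 dev eth0 FAILED
10.50.50.57 dev eth0 INCOMPLETE'

# Keep only entries that never resolved to a MAC address:
# FAILED and INCOMPLETE are the states reported in this issue.
echo "$sample_output" | awk '$NF == "FAILED" || $NF == "INCOMPLETE" {print $1}'
```

Entries in the FAILED or INCOMPLETE state correspond to peers for which ARP resolution over the overlay never completed, which matches the unreachable-container symptom.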
I've confirmed that this state seems to happen more frequently when launching many containers at once.
ping @ddebroy @fcrisciani PTAL
We are seeing the same issue with large cluster deployments. It's especially bad with clusters of 10 nodes or more. Any advice?
@lin-zhao @alexhexabeam are you still seeing this issue on the new stable releases?
@fcrisciani Yes, I was able to reproduce this with the latest stable 18.03.0-ce.
The three reachable addresses are the host itself and two other containers on the same host.
Dear all and @fcrisciani, I also reproduced the issue with:
All containers of an overlay network on one host.
Workaround: do the inverse ping (if container C1 -> C2 doesn't work, ping C2 -> C1; the ping then works in all directions).
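The inverse-ping workaround can be scripted roughly as follows. This is a sketch with hypothetical container names (c1, c2); the idea is that pinging in the opposite direction repopulates the neighbor entries on both sides:

```shell
# Resolve each container's overlay IP (hypothetical names c1/c2).
C1_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' c1)
C2_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' c2)

# C1 -> C2 fails, so ping the other way first...
docker exec c2 ping -c 1 "$C1_IP"
# ...after which the original direction reportedly works again.
docker exec c1 ping -c 1 "$C2_IP"
```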
These logs seem to be related, because I couldn't ping this IP: 192.168.101.36
More info: this is related to the restart policy of the container being set to restart in case of error. The endpoint keeps changing, hence the warning logs saying Swarm already sees the IP.
@antoinetran in your workaround, what do you mean by reverse ping? Do you mean actually pinging C1 -> C2 inside the container?
@lin-zhao yes. Pinging C2 from container C1 sometimes fixes the reverse ping from C2 to C1. In this case, I think I managed to understand what happened. All of this is normal if we consider that the container always restarts, so Swarm finds it weird to see the same IP/MAC again. Maybe this warning should be removed in this case.
I'm suffering from this issue. I use 18.03.1-ce. It seems to happen more frequently when containers in the overlay network operate on the same node.
I'm also suffering from this issue. Has anything been updated or fixed? @fcrisciani Docker version 18.06.1-ce, build e68fc7a
I also hit this from time to time, with no exact way to reproduce. We have 14 bare-metal hosts with 18.06 installed and one overlay created on them. We're not using swarm; we use the overlay manually.

Description

I am starting a container on that overlay on host 10.2.1.67, with the IP auto-assigned by docker. Yet it can't ping any other container on the same overlay, nor can any other container ping it. On ALL syslogs on the other hosts, it says
I believe cardinality:2 means a previous container with the same Virtual IP 10.9.228.33 wasn't deleted correctly when the last host 10.2.0.64 shut it down, so when I am creating a new container on the new host 10.2.1.67, it senses a duplicated entry.
Actually, if I stop the docker daemon and remove …

Suggestion

Maybe the old host didn't broadcast a peerDelete for some reason? The log says … is it possible to make the new container announcement forceUpdate:true, such that it would work even if peerDelete were missed for any reason?
@sam0737 I encountered the same issue. As a workaround, you can use static IPs for containers. It worked for me. Hope it will be fixed by the docker team soon.
Here's a workaround.
(Edit: contents added.) I found some cases where this workaround doesn't work.
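The static-IP workaround mentioned above would look roughly like this. The subnet, network name, addresses, and image are placeholders; the point is that a fixed, known IP survives container restarts instead of being re-allocated:

```shell
# Create an overlay with an explicit subnet so addresses can be pinned.
docker network create -d overlay --attachable --subnet 10.9.0.0/16 mynet

# Pin the container to a specific address within that subnet.
docker run -d --name web --network mynet --ip 10.9.0.10 nginx
```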
I can confirm that I've also encountered something that at least seems very similar, but on 18.09. It's a swarm running on more than 10 bare-metal hosts, and the symptom was an unreachable container. I also did see a few
I've got the same problem, running a 12-node cluster with engine versions ranging from 18.09.1 to 18.09.6. Some hosts seem to have the problem more often than others, but this could be false.

I start a service with replica=6 and in 2 of 3 cases (doing a force-update to "retry" at the moment) at least one of the 6 containers returns connection refused (on most tries) on the service's shared/virtual IP. A ping from inside the unreachable container, to the container that is trying to reach it, resolves the problem, but only if I ping the actual local IP that I see on eth0 or in docker inspect of the container; it does not work if I ping it by service name. When the container that did not work before the ping pings the container that is trying to reach it, the problem is resolved most of the time, but not always; I couldn't find any pattern till now.

The host's syslog shows something like this when starting the service, sometimes
FYI, I have converted my cluster (25+ hosts) from what they now call "Overlay (Legacy)", which uses a user-supplied KV service (I picked Consul), to the "Overlay Network (Attachable)" that comes with Swarm (which has a built-in KV store). i.e. I am not using the autoscaling stuff of Swarm, just its overlay layer: https://docs.docker.com/network/overlay/. Since then, everything has been running well so far.
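The migration described above has roughly this shape. Addresses, tokens, and names below are placeholders; the essential change is moving from a consul-backed overlay to a Swarm-scoped attachable overlay, which uses Swarm's internal state store instead of an external KV service:

```shell
# On the first host: initialize Swarm (replaces the external KV store).
docker swarm init --advertise-addr 10.2.1.67

# On each other host: join with the token printed by `swarm init`.
docker swarm join --token <worker-token> 10.2.1.67:2377

# Create an attachable overlay so plain `docker run` containers
# (not just services) can join it.
docker network create -d overlay --attachable my-overlay
docker run -d --network my-overlay --name app myimage
```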
@sam0737 So you mean you used the swarm classic image https://hub.docker.com/_/swarm ? We used the overlay from this mode and found the overlay to not work well at all.
I currently have the same problem. I am running three nodes with an overlay network without swarm, using consul. Suddenly the network connection between two containers on different hosts (A->C) is no longer possible; other containers which run on A and C can still communicate. ip neigh shows the correct MAC address. journalctl -u docker.service also shows some warnings, but the affected IP is not in them.
Docker version 19.03.12
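When debugging a case like this, it can help to compare the container's neighbor table with the overlay's VXLAN forwarding entries. A sketch, with assumptions: the netns path and the `1-…` overlay namespace naming are typical defaults, but list the directory first to find yours:

```shell
# Overlay network namespaces are usually kept here; find the one
# whose name starts with "1-" followed by the network ID prefix.
ls /var/run/docker/netns/

# ARP view: which peer IPs resolved, which are FAILED/INCOMPLETE.
nsenter --net=/var/run/docker/netns/1-xxxxxxxxxx ip neigh show

# VXLAN view: MAC-to-remote-host mappings for the overlay.
nsenter --net=/var/run/docker/netns/1-xxxxxxxxxx bridge fdb show
```

A peer whose MAC is missing from the fdb output while its IP shows FAILED in the neighbor table points at the state-propagation problem discussed in this thread, rather than an ordinary routing issue.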
Description
Overlay network randomly has specific machines unreachable. Containers on these nodes are unable to reach containers on other nodes, and vice versa. This happens much more frequently on larger clusters, but we've seen it on smaller ones as well.
We are using consul for our kv store.
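For context, a consul-backed (non-Swarm) overlay is wired up through the daemon's cluster-store flags. A minimal sketch; the consul address, interface name, and subnet are placeholders:

```shell
# Point the daemon at the external KV store and tell it which
# interface/port to advertise to peers (legacy overlay mode).
dockerd \
  --cluster-store=consul://127.0.0.1:8500 \
  --cluster-advertise=eth0:2376

# Then the overlay itself is created as usual.
docker network create -d overlay --subnet 10.50.50.0/24 my-overlay
```

The same settings can equivalently live in daemon.json as `cluster-store` and `cluster-advertise` keys.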
Steps to reproduce the issue:
Describe the results you received:
Containers on some nodes are unable to connect to containers on other nodes. They can connect to containers on the same node.
Journal logs:
Describe the results you expected:
Containers on all hosts should be able to connect to each other.
Additional information you deem important (e.g. issue happens only occasionally):
Intermittent/randomly occurs. Much more frequent on larger clusters.
Output of docker version:

Tested with 17.05, 17.09.1, and 17.11.0. I don't have a 17.05 up currently to get the docker version output.

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
AWS and physical, running Centos 7. Tested with consul 0.7.2 and 1.0.1.
Restarting the docker daemon on unconnectable hosts sometimes fixes it for that host.
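The daemon-restart workaround, as commands (assuming a systemd host; container name and peer IP are placeholders):

```shell
# Restart the daemon on the unconnectable host...
systemctl restart docker

# ...then re-test connectivity from an affected container toward
# a peer on another host.
docker exec some-container ping -c 3 10.50.50.23
```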
daemon.json
consul config file:
Possibly related to #32841 , #32195 , or #33721 , but I'm not using swarm, so I figured it warranted its own ticket.