Swarm is having occasional network connection problems between nodes. #32195
Comments
I think this is an old problem, present since the release of Swarm Mode.
Run "sudo service docker restart" on the host with service container that can't be ping from the good ones, problem solved. Maybe good for a while, until creating new or updating services. |
@chris-fung It comes back on its own; it happens occasionally, for 1-20 requests in a row. The problem is these small interruptions, when clients actually see a "Bad request" error in production. I can't catch it or know when it happens. I just receive an error from the Papertrail nginx logs saying the upstream was unreachable, but the rest of the time everything works.
Yesterday, after posting the issue, I noticed there is a new version and upgraded all nodes from
We have the same problem in our swarm too. I updated the swarm to
I also updated the swarm to 17.03.1-ce and still hit the same problem just now. After restarting the Docker engine on the problem host everything goes back to normal, but it will happen again.
Some questions:
These messages indicate that there's a problem in communication between the nodes; this could be related to not specifying
Can you share how the "upstream" in your nginx server is configured?
@thaJeztah I tried both, specifying and not specifying. Both worker nodes are in the same datacenter, same availability zone, same VPS type and subnet, and were created identically with
For nginx I use the web service name and the following config:
Which way is officially correct, the service name or the IP address? I couldn't find it in the docs, and I even tried using
I wasn't aware that the free version of nginx only resolves IP addresses once, and I am not using a resolver. Hmm, that may help, although I wonder: since this is not a permanent issue and it eventually resolves itself in a few seconds, can it be related to the resolver at all?
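For reference, a minimal sketch of the resolver approach being discussed, assuming Docker's embedded DNS at 127.0.0.11 and the web:5000 upstream from this thread; open-source nginx only re-resolves a proxy_pass hostname when it is passed through a variable:

# Docker's embedded DNS server inside containers
resolver 127.0.0.11 valid=10s;

server {
    listen 80;
    location / {
        # putting the target in a variable forces a fresh DNS lookup
        # per request instead of the once-at-startup resolution
        set $upstream_web http://web:5000;
        proxy_pass $upstream_web;
    }
}

Whether this helps here is unclear, since the service VIP itself should be stable; it mainly rules out stale DNS as the cause.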
I use a Go application which communicates with a Redis service, and when the "Marking ... as failed, suspect timeout reached" message appears, I see the following in the Go app:
The weird thing is that it works sometimes.
@darklow @Fank thanks for the additional information
Resolving by service name is correct.
When using the service name, that shouldn't make a difference here; the VIP of the service should not change (but the IP addresses of individual tasks can, which could be a problem if you used those 😄). I'm going to have to defer this to the networking and SwarmKit people, who may be able to ask more targeted questions 😄 ping @dongluochen @aboch PTAL
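A quick way to see the VIP/task-IP distinction from inside any container attached to the overlay network (a sketch; web is the service name used in this thread, and tasks.<service> is Docker's DNS name for the individual task IPs):

# the service name resolves to the stable VIP
nslookup web

# tasks.<service> resolves to the per-task IPs,
# which change as tasks are rescheduled
nslookup tasks.web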
For example, I deploy services in the following format: service name, replicas, endpoint-mode, published port, image:
upstream kong-8000 {
least_conn;
server kong:8000 max_fails=3 fail_timeout=60 weight=1;
}
server {
server_name _;
listen 8000;
location / {
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
proxy_pass http://kong-8000;
proxy_next_upstream error timeout http_500 http_502 http_503 http_504 http_404;
proxy_next_upstream_tries 2;
}
}
upstream kong-8001 {
least_conn;
server kong:8001 max_fails=3 fail_timeout=60 weight=1;
}
server {
server_name _;
listen 8001;
location / {
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
proxy_pass http://kong-8001;
proxy_next_upstream error timeout http_500 http_502 http_503 http_504 http_404;
proxy_next_upstream_tries 2;
}
}
upstream kong-8443 {
least_conn;
server kong:8443 max_fails=3 fail_timeout=60 weight=1;
}
server {
server_name _;
listen 8443;
location / {
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
proxy_pass http://kong-8443;
proxy_next_upstream error timeout http_500 http_502 http_503 http_504 http_404;
proxy_next_upstream_tries 2;
}
}
Whenever the service connection problem happens, I ping the other services (or nslookup them) inside a container of the nginxlb service (pinging the service name or the VIP directly; dnsrr services have different IPs depending on their replicas), over and over again. Finally I realized that I can't ping any VIP (or service name) on one particular node. I also tried pinging from inside a service container on that node; of course, the service containers on that node can ping each other. And restarting the node solves the problem.
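A sketch of that debugging loop with stock commands, assuming the service names from this comment (nginxlb, kong) and an image that ships ping/nslookup; the container name is made up:

# find an nginxlb task container on this node
docker ps --filter name=nginxlb

# open a shell inside it (hypothetical container name)
docker exec -it nginxlb.1.abc123 sh

# from inside: resolve and ping the service
nslookup kong
ping -c 3 kong

# for a dnsrr service, list the individual task IPs instead
nslookup tasks.kong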
I've been noticing this too for the past two weeks :(
We have the same problem. We have 4 nodes that lose their connections to each other, and it is not a network problem. It appears to happen randomly. After a couple of minutes the connections come back and the swarm heals itself, ending with all 4 nodes working. Docker 17.03.
I did the following for each node:
And it looks like it worked. My old configuration was a systemd override with the following parameter:
{
  "storage-driver": "overlay2"
}
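The exact steps were lost above; as a sketch under the usual systemd/Docker layout, moving the flag out of the systemd override and into /etc/docker/daemon.json would look roughly like this:

# the daemon config file now carries the setting
cat /etc/docker/daemon.json
{
  "storage-driver": "overlay2"
}

# after removing the conflicting systemd override:
sudo systemctl daemon-reload
sudo systemctl restart docker

# confirm which storage driver is actually in use
docker info --format '{{.Driver}}'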
I've been seeing the same issue. Some (possibly all) overlay IPs stop responding; DNS still resolves the IP, but connections to a port on the IP hang indefinitely. Restarting just the Docker daemon sometimes solves the issue, but today we needed to do a full reboot to recover. Services are running inside swarm mode, the networks created are "attachable", and sometimes the target IP is a standalone container running outside of swarm mode. If it's helpful, I also have daemon-data and goroutine-stacks dumps that were generated during this issue. The Docker version is 17.03.1-ce (a similar issue was seen with 1.13.1). Looking through my logs after restarting just dockerd, on host2 I see:
On host1, I'm seeing:
Seconding this: I couldn't connect to any services in the swarm (connection refused in the browser) on my dev machine until I restarted Docker. I didn't even stop the swarm or the services, so they came back when I started Docker again, and then I could connect. I'm running Docker for Mac, 17.03.1-ce-mac5.
Does anybody have a good workaround for this? I have been seeing a very intense version of it, with 50 or so errors a day, on a project that is not in production yet. I'd like to give swarm mode a try! But this may be the issue that keeps me from doing so. Right around a timeout error, my journald logs for the swarm host look like:
Docker info
and docker version
I keep seeing this, now on a Docker for Azure CE swarm I deployed from store.docker.com a couple of weeks ago.
@Fank
Hi guys. I had the same problem. The cause was the virtual IP address (VIP), which is enabled by default. I turned off the VIP everywhere, and the network connection problems were resolved. To disable the virtual IP address, you must start your services with the following option:
If you can't use this option to start your services (for example, you start them with
From the documentation:
and:
For more information, read https://docs.docker.com/engine/swarm/networking/#use-dns-round-robin-for-a-service
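The option itself was lost above; given the linked documentation, it is presumably --endpoint-mode dnsrr, roughly like this (service, network, and image names are made up):

# create the service without a VIP, using DNS round-robin
docker service create \
  --name web \
  --network mynet \
  --endpoint-mode dnsrr \
  myorg/myimage:latest

If I recall the docs correctly, dnsrr services cannot publish ports through the routing mesh, which is why the documentation pairs this mode with an external load balancer or host-mode ports.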
Any news on this issue? I'm hitting the same problem, even going through
@muhammadwaheed @thaJeztah I debugged a lot and tried some different environments for my network-loss issues between swarm nodes.
We tried multiple VMware hosts/clusters and hardware, and also tested Docker versions between 17.03 and 17.05. It looks like the virtualization environment causes some weird issues with Docker; we switched to a physical environment and it works fine without any issues. We have been using the 4th solution since Monday, and we have received very positive feedback about the performance and stability of the applications running on it.
Correlation not implying causation, I wonder if it has more to do with VMware's vSwitching. We use VMware + vSwitches and have the same problems you describe. I expect it to be more closely related to that.
Docker version 17.10.0-ce, build f4ffd25 is still showing the same swarm instability. I could upload the log files if you really want, but the messages are the same as previously reported.
I tested with 17.11-rc3 (Docker version 17.11.0-ce-rc3, build 5b4af4f) and I can reproduce the issue. Two services share a common network. I start a console on the first and try to access the second one (running with 2 replicas); sometimes it works and sometimes it doesn't (connection refused). When I kill (auto-restart) the failing one, that solves the problem (for some time). From the failing service's perspective everything is fine, so this is a network problem. The (common) network is a manually created overlay network.
@DBLaci Thanks for posting the test results. I will hold off on updating to 17.11, then.
@fcrisciani Thanks for your comment. Could you or someone else look into issue #35249? Is it a known limitation of Swarm or a bug? Is it specific to VMware environments or not? Thanks!
@DBLaci do you have any repro steps to share? Connection refused is the result of an RST packet, so the ICMP is coming back.
@lipingxue I gave you some steps to follow to get the debugging started; please continue the thread there.
@cpjolly can you verify that you are not in any of these situations: #32195 (comment)
@fcrisciani Thanks. I will try the steps you gave.
@fcrisciani There is no exact way to reproduce it, because the problem appears somewhat randomly, or after some time (possibly idle):
- Docker 17.11-rc3 (at the moment)
- Ubuntu 16.04 (default sysctl now) on AWS; CPU/memory is not a bottleneck
- the swarm nodes (5) are on one subnet, with 3 manager nodes (3/5)
I don't think any of the issues mentioned apply. These are DigitalOcean 2 GB, 2-CPU nodes with plenty of free resources. The websites are not heavily loaded. Here is the section of the logfile from a host when it dropped out of the swarm; I've removed the IP addresses.
@cpjolly there are several different components complaining about network connectivity:
memberlist:
It looks like the TCP connections of the control plane are being dropped and there is trouble re-establishing them.
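When memberlist and the control plane are the suspects, a basic first check is whether the documented swarm ports are reachable between nodes; a netcat sketch (NODE_IP is a placeholder; 2377/tcp is cluster management, 7946/tcp+udp is the gossip used by memberlist, 4789/udp is the VXLAN data plane):

# TCP reachability
nc -zv NODE_IP 2377
nc -zv NODE_IP 7946

# UDP checks; note a UDP "success" only means no ICMP error came back
nc -zvu NODE_IP 7946
nc -zvu NODE_IP 4789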
@DBLaci once you hit the connection-refused condition, can you open a shell in a source and a destination container and try to ping each other using the container IPs directly, not the VIP?
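A sketch of that check with stock commands (the container names and the network name mynet are placeholders):

# get the destination container's IP on the shared overlay network
docker inspect \
  -f '{{(index .NetworkSettings.Networks "mynet").IPAddress}}' dst_container

# from the source container, ping that IP directly, bypassing the VIP
docker exec -it src_container ping -c 3 10.0.1.23   # IP from the step above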
@DBLaci I had similar issues on 17.09. I didn't create a lot of networks manually; most of them were created using stack deploy. You need to check ping, as @fcrisciani mentioned, to try to isolate the problem you have.
Same issue here: Docker version 17.11.0-ce, build 1caf76c.
I had a similar issue with
Everything started fine for a few minutes; then 1 of the 2 replicas stopped working, and all of the outbound traffic (_default network traffic as well) and ingress traffic got stuck (timed out) from the container and also from the (failing) host itself. After a while we did a tcpdump and realized that we had encountered this problem. Ran
So I would recommend listing your TCP sysctl settings by doing a
UPDATE
It has been running for more than 7 days now, in 16 swarm clusters of 3+ servers each, with no problems at all; all the network glitches are gone. Hope you guys find it useful.
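The exact settings were elided above, but later comments in this thread refer to disabling tcp_* values, most plausibly the TIME-WAIT ones; a sketch under that assumption (note that tcp_tw_recycle was removed entirely in Linux 4.12):

# list the TIME-WAIT related TCP settings
sysctl -a 2>/dev/null | grep tcp_tw

# disable them at runtime (assumed values; verify against your workload)
sudo sysctl -w net.ipv4.tcp_tw_recycle=0
sudo sysctl -w net.ipv4.tcp_tw_reuse=0

# persist across reboots
printf 'net.ipv4.tcp_tw_recycle = 0\nnet.ipv4.tcp_tw_reuse = 0\n' \
  | sudo tee /etc/sysctl.d/99-tcp-tw.conf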
FWIW this seems to be the default on Rancher OS (which is what I run everywhere):
/cc @SvenDowideit
Hi, we're experiencing the same issue with 17.06.2-ce. It just happened again, and we captured the stack traces of both the host which is running an nginx and the host which hosts the target service. The logs were captured while I was running curl from inside nginx to the target service. The latter uses dnsrr, and both hosts have the above-mentioned tcp_* values disabled. The whole scenario ended with a connection timeout. Note that the total number of replicas of other services running in the same network at that point was less than 20. https://gist.github.com/dmandalidis/0ac8cf6cbd87d5dbb6b2b7b10d4374c7#file-proxy-log Hope it helps with troubleshooting.
@prologic it has been the default on Linux since 2.6.8 :)
@fcrisciani Oh :D I'm just trying to work out what
Several improvements in the networking area got merged in the latest set of releases.
I agree. Thanks @fcrisciani
I'm locking the conversation on this issue to prevent it from collecting new reports, which are easily overlooked on closed issues: please open a new issue instead (you can add a link to this issue in your report if you think there's a relation).
A few times a day I am having connection issues between nodes, and clients are seeing an occasional "Bad request" error. My swarm setup (AWS) has the following services: nginx (global) and web (replicated=2), plus a separate overlay network. In nginx.conf I am using proxy_pass http://web:5000 to route requests to the web service. Both services are running and marked as healthy, and they haven't been restarted while these errors occurred. The manager is a separate node (30sec-manager1).
A few times a day, for a few requests, I receive errors saying that nginx couldn't connect to the upstream, and I always see the 10.0.0.6 IP address mentioned. Here are the related nginx and docker logs. Both web replicas run on the 30sec-worker3 and 30sec-worker4 nodes.
I checked the other cases when nginx can't find the upstream, and I always find these 3 lines appearing most often at those times in the docker logs:
By searching other issues I found that these have similar errors, so they may be related:
#28843
#25325
Is there anything I should check or debug further to spot the problem, or is it a bug?
Thank you.
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):
Amazon AWS (Manager - t2.micro, rest of nodes - t2.small)
docker-compose.yml (there are more services and nodes in the setup, but I posted only the ones involved)
web-goss.yaml