docker service update messes up VIP tables and "tasks" DNS entries #26772
Comments
@xiaods almost, but not quite. #26480 says the names aren't available on different hosts; that's not what's happening with this issue, since the names are available on all hosts. #25394 says that the round-robin isn't routing to tasks on different hosts; I'm not checking that in this issue. Neither of those issues does an update, so I think this issue is different from both. One thing to note: I just tried updating my test servers to 1.13-dev, and after doing a few dozen updates, the problem does not occur! I see a small list of IPs in
I'm going to try testing a little more to see if I can make this happen again in 1.13-dev, but if not I'll close this bug.
@evanp I also came across this annoying VIP-related issue, so I'll wait for your testing results.
ping @mrjana PTAL
@evanp Thanks for taking the effort to test these with docker/docker master code. Many more fixes were added there, so please try to reproduce this problem there and let us know.
@mrjana We upgraded our 90-node cluster this morning and saw a great improvement. However, later in the day after a few updates, we're again seeing this error. I'm going to see if I can get some more detail.
Also, as of right now the only way I see to repair this situation once it has arisen is to burn the cluster. It might be possible to just remove all the services, remove the network, and then re-add the network and all the services. It would be nice if you could use

Probably the most frustrating part of this situation is that the data is available and correct in requests like

I'm going to retry the test scenario outlined above with 1.13-dev and see if I can replicate it and possibly get some debug information from logs when it occurs.
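The less drastic repair idea above (remove the services, remove the network, re-add both) can be sketched as a command sequence. This is an untested sketch; the service names and image are placeholders, and the overlay network name `testnet` is taken from the report further down.

```shell
# Untested sketch: tear down and re-create the services and the overlay
# network, instead of burning the whole cluster. Names are examples only.
docker service rm web11 web12 web13

# The network can only be removed once no service is attached to it.
docker network rm testnet
docker network create --driver overlay testnet

# Re-create the services on the fresh network (image/flags are placeholders).
docker service create --name web11 --network testnet --replicas 3 my-web-image
```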
So, I spent some time this morning trying to replicate the error, and I couldn't do it in a test environment as described above. My next step is to set the debugging flag on a Docker node in our production environment and then do an update on a service on that node. If it causes the same problem, we'll at least have the logs necessary to explain it.
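For reference, one common way to turn on the daemon debugging flag mentioned above (this assumes a systemd-managed daemon; writing `/etc/docker/daemon.json` like this would clobber any existing settings, so merge by hand if the file already exists):

```shell
# Untested sketch: enable debug logging on one node, then tail the logs
# while running the service update that triggers the problem.
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json

# dockerd re-reads its configuration (including the debug flag) on SIGHUP.
sudo kill -SIGHUP "$(pidof dockerd)"

# Watch the daemon logs while reproducing:
sudo journalctl -u docker.service -f
```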
@evanp waiting for your update.
@evanp Thanks for the detailed information! We're actually in the process of building

In a few hours you should be able to try that version (and the official 1.12.2 should be out in a matter of weeks).
Don't know if it helps, but I am seeing this problem in a situation where
@MichaelW-SD what do you mean by "fail"? How does the update fail when there are errors "in the image"?
In my test environment, I found the same error when updating a service. In my case I re-created the service a few times, and `dig` output some different VIPs. Is this fixed in Docker 1.13?
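As a quick sanity check for the `dig` symptom described in this thread: the number of A records behind `tasks.<service>` should match the service's replica count. A small hypothetical helper that just counts the lines of `dig +short` output captured as text:

```python
def count_task_ips(dig_short_output: str) -> int:
    """Count IPs in the output of `dig +short tasks.<service>`."""
    return sum(1 for line in dig_short_output.splitlines() if line.strip())

# Healthy scale=3 service: exactly three task IPs (addresses invented).
healthy = "10.0.0.3\n10.0.0.4\n10.0.0.5\n"
print(count_task_ips(healthy))  # 3

# The bug in this issue: stale IPs accumulate after repeated updates.
stale = healthy + "10.0.0.9\n10.0.0.12\n"
print(count_task_ips(stale))    # 5, far too many for a scale=3 service
```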
@clhlc looks like 1.12.2-rc1 fixed the problem in my particular case.
If others are able to test whether 1.12.2-rc1 fixes this, that would be great (note, of course, that it's an RC, so we generally don't recommend testing it on critical / production systems): https://github.com/docker/docker/releases/tag/v1.12.2-rc1
@evanp In my case I had a faulty image, so containers based on that image would not start. I will test this with 1.12.2-rc1.
1.12.2-rc1 fixed this for me.
@evanp was this fixed for you as well on 1.12.2-rc?
So, we still saw this error with 1.13-dev as of this morning. We've been unable to get any purchase on the bug, and so we're regretfully moving to another clustering tool. I'm happy to help out with this bug if there's anything further I can do, but we no longer have a production cluster running Docker 1.12.x in swarm mode. Also, feel free to close this bug if there aren't others seeing the same problem.
/cc @mrjana
@evanp When you tried 1.13-dev from this morning, can you tell me what failures you had? Did you have incorrect
This issue smells like it's within cooee of #25266. My temperature is certainly within cooee of whomever on @evanp's team declared Swarm unfit for production. I'll try pounding
@garthk The instrumentation is really in the daemon error logs, and every issue that was fixed in 1.12.2, as you can see, is based on such instrumentation. That is why I am asking for daemon logs. Do you have any from problem nodes, so that we can confirm or deny whether this is already fixed in 1.12.2?
Multiple `docker service update` calls make the VIP tables for the overlay network incorrect, and mess up the DNS lookups for `tasks.<service name>`.

Description

I noticed connectivity problems between services in my cluster. By launching a terminal and using `curl` and `dig` to examine the service names and "tasks" round-robin names, I realized that the map of IP addresses was incorrect.

Steps to reproduce the issue:
1. `testnet`. I used `docker-machine` with the `digitalocean` driver.
2. `web11` and `tasks.web11` (and 12 and 13) DNS entries with `dig`, and check the output from `curl`.
3. `service update` calls per service. I did 19 updates, just changing the `LINE` environment variable.
4. `curl` and `dig` to review the `web11` and `tasks.web11` DNS entries and the Web output.

Describe the results you received:
Lookup on `web11` remained correct, but `tasks.web11` has far too many IP addresses for a scale=3 service. `curl` sporadically failed to connect, or connected to Web servers for different services.

Describe the results you expected:
At scale=3, a lookup on `tasks.web11` should return 3 IP addresses. And the `curl` results (using the `web11` name, which points to the VIP) should only return HTML from the server 11 service task containers.

Additional information you deem important (e.g. issue happens only occasionally):
The `/proc/net/ip_vs` output is attached. I think this situation can arise with the `ingress` overlay network, too.

proc-net-ip_vs.txt
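A sketch of how an attached `/proc/net/ip_vs` table can be checked programmatically. It assumes the standard IPVS proc format, where addresses and ports are hex-encoded; the sample data below is invented for illustration, not taken from the attachment:

```python
import socket
import struct

def parse_ip_vs(text):
    """Parse /proc/net/ip_vs text into {virtual service: [real servers]}."""
    table = {}
    current = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] in ("TCP", "UDP"):
            # Virtual service line: hex address and port, e.g. 0A00008E:0050
            addr, port = parts[1].split(":")
            vip = socket.inet_ntoa(struct.pack("!I", int(addr, 16)))
            current = (parts[0], vip, int(port, 16))
            table[current] = []
        elif parts[0] == "FWM":
            # Swarm VIPs can also show up as firewall-mark (FWM) services.
            current = ("FWM", parts[1], None)
            table[current] = []
        elif parts[0] == "->" and current is not None:
            # Real server (backend) line under the current virtual service.
            addr, port = parts[1].split(":")
            rip = socket.inet_ntoa(struct.pack("!I", int(addr, 16)))
            table[current].append((rip, int(port, 16)))
    return table

sample = """IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP  0A00008E:0050 rr
  -> 0A000003:0050      Masq    1      0          0
  -> 0A000004:0050      Masq    1      0          0
"""
backends = parse_ip_vs(sample)[("TCP", "10.0.0.142", 80)]
print(len(backends))  # 2 real servers behind this VIP
```

Counting the real-server entries per VIP this way makes it easy to spot a virtual service that has accumulated more backends than the service's replica count.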
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.):
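The reproduction steps above can be sketched as commands. This is untested; the image name (`my-web-image`) and the exact service layout are my guesses at the reporter's setup, not taken from the issue:

```shell
# Overlay network and three replicated services, as in the report.
docker network create --driver overlay testnet
for n in 11 12 13; do
  docker service create --name "web$n" --network testnet --replicas 3 \
    --env LINE=1 my-web-image   # image is a placeholder
done

# Baseline: each tasks.webNN name should resolve to exactly 3 IPs.
# (Run inside a container attached to testnet.)
dig +short tasks.web11

# 19 repeated updates; only the LINE environment variable changes.
for i in $(seq 2 20); do
  docker service update --env-add "LINE=$i" web11
done

# Re-check: with the bug, tasks.web11 now returns far more than 3 IPs.
dig +short tasks.web11
```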