Tasks are killed unexpectedly #32218
Comments
I noticed that in your …
Exactly, it is another issue that happens sometimes; I commented on it here: #31377. Could you help me with the problem of tasks being killed unexpectedly when I disconnect a server, simulating an unexpected failure? Thank you very much in advance.
I remember we had a similar problem long ago that was reproducible on DigitalOcean by "powering off" VMs: #25017 (cc @thaJeztah). The issue there was that agents could take a long time to realize the old leader was down, because this way of shutting off the leader didn't sever the TCP connections, and the agents had to rely on a timeout to notice that the old leader was unavailable. Meanwhile, the new leader would only wait so long for the agents to connect to it, and would eventually mark those nodes as down.

We seemed to have fixed this, but it's possible there's a regression, or that some variant of the problem still exists. You mentioned that the problem is 100% reproducible, which I wouldn't expect if the underlying cause is agents not connecting to the new leader soon enough. But it would still be useful to check the …

Are you running the same version of Docker on all nodes? I'm also looking into the issue with unexpected tasks in the "pending" state.
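For anyone debugging a similar failover, one quick way to watch the agents' view converge is a sketch like the one below; it assumes a surviving manager is still reachable, and the 2-second interval is arbitrary:

```bash
# From a surviving manager, watch node status while the old leader is off.
# Nodes should flip from Ready to Down as the heartbeat timeout expires,
# and the MANAGER STATUS column shows which manager won the new election.
watch -n 2 docker node ls
```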
Yes, I create the machines of my cluster with docker-machine, and all nodes have the same version. I have reproduced it again today; the output before switching off the swarm leader is:
Switching off 'manager1':
After 10~15 seconds, 'node1' and 'node2' are 'Ready' again (but some tasks were killed):
In fact, I have created a new cluster to perform the checks you suggested, and it is happening with 17.03.1-ce as well:
Thanks for the response. I spent some time trying to reproduce this, and I found an issue that could cause containers to wrongly be restarted (moby/swarmkit#2091), but unfortunately I think it's different from what you're seeing; I could only trigger it by pausing manager daemons, not by shutting them down completely. It occurred to me that enabling debug logging on all nodes might turn up more useful information.
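For reference, a minimal sketch of one way to enable daemon debug logging; it assumes the daemon reads /etc/docker/daemon.json and can receive signals, and the key should be merged into any existing config rather than overwriting the file:

```bash
# Add "debug": true to the daemon configuration:
#   /etc/docker/daemon.json -> { "debug": true }
# dockerd re-reads daemon.json on SIGHUP, so this particular option does
# not require a full daemon (and container) restart to take effect.
sudo kill -HUP "$(pidof dockerd)"
```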
I have created a new cluster and I have enabled debug logging on all nodes.
After shutting down 'manager1', some containers on 'node2' received SIGTERM.
'test-manager1' was switched off at 10:30:50, so I have attached the logs from that point onward. I can always reproduce it; I hope those logs can help you.
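For completeness, one way such logs are commonly collected on a systemd-based host; the unit name and timestamp here are assumptions based on the report:

```bash
# Grab the Docker daemon logs on node2 starting at the switch-off time.
journalctl -u docker.service --since "10:30:50" > node2-docker-debug.log
```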
Thanks. These logs are really helpful. Do you have any services with dynamically published ports (for example, a published port with no host port specified, so one is assigned automatically)? If you inspect the service before and after the leader switchover, do you see any changes in the published ports?

If this is not #29247, I suspect it's a similar allocator issue. The allocator making a change to the service or the task when it initializes could cause the task to be replaced. cc @yongtang
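One way to answer that question is to snapshot the service's endpoint configuration on either side of the switchover and diff it (`my_service` is a placeholder name):

```bash
# Snapshot the endpoint (including published ports) before the failover...
docker service inspect --format '{{json .Endpoint}}' my_service > before.json
# ...power off the leader, wait for a new leader to be elected, then:
docker service inspect --format '{{json .Endpoint}}' my_service > after.json
# Any difference here would point at the allocator rewriting the service.
diff before.json after.json
```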
Adding area/testing: this should have been caught by automated leader re-election tests (/cc @dhiltgen @dongluochen)
I have reproduced it again, and I attach here the output of docker inspect before and after shutting down the leader.
Hi @le-ortega: thanks so much for your help and patience so far. I've been looking into this, but I've been a bit swamped today. I'm hoping to get back to it early next week.
I think I found the problem! Opened moby/swarmkit#2113
@thaJeztah @vieux: the fix is in swarmkit master and is being backported for 17.05. Should it also be backported to 17.03.x?
The fix was merged in #32576
Closing this one; we're looking at doing a patch release for 17.03, and including the fix there as well.
Description
I have created a cluster with 5 nodes (3 managers and 2 workers). I am testing the behavior of the cluster when a node is disconnected. Once I had created my cluster, I checked that the leader is the 'manager1' server:
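For context, a sketch of how such a cluster might be brought up and the leader confirmed; the node names follow the report, and the docker-machine driver settings are assumptions:

```bash
# Create five droplets (hypothetical names matching the report; assumes
# a DigitalOcean access token is configured for docker-machine).
for n in manager1 manager2 manager3 node1 node2; do
  docker-machine create --driver digitalocean "$n"
done

# Initialize the swarm on manager1, then join the other nodes using the
# tokens printed by `docker swarm join-token manager` / `... worker`.
eval "$(docker-machine env manager1)"
docker swarm init --advertise-addr "$(docker-machine ip manager1)"

# On any manager: the node whose MANAGER STATUS column reads "Leader"
# is the current leader (here, manager1).
docker node ls
```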
I turned off the 'manager1' server and detected that some containers of some services on other nodes are killed and started again. I would expect only the tasks that were running on 'manager1' to be rescheduled on other node(s).

Here, we can see that a task was killed on 'node2' and started again:
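A sketch of how such a restart typically shows up in the task history (`my_service` is a placeholder name):

```bash
# The old task on node2 appears in a terminal state (Shutdown/Failed)
# with a fresh task started to replace it, even though node2 never went down.
docker service ps my_service
```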
Steps to reproduce the issue:

1. Create a swarm cluster with 3 managers and 2 workers.
2. Deploy services with tasks spread across the nodes.
3. Power off the leader node ('manager1').
4. Watch the tasks on the surviving nodes.
Describe the results you received:
Some containers of some services on other nodes were killed and started again.
Describe the results you expected:
Tasks running on 'manager1' would be rescheduled on other node(s).
Additional information you deem important (e.g. issue happens only occasionally):
The issue happens every time the leader node is disconnected.
Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
This cluster runs on DigitalOcean.
Additional information:
I add some logs from another time I reproduced it (the leader was 'manager3'):
Manager1 logs:
Manager2 logs:
Manager3 logs (after turning it back on; container names have been anonymized):
I use the 'Switch off' button of the DigitalOcean droplet. If you click it, a note appears saying: "you power off your Droplet...".

Attaching the "syslog" part from this machine from when I switched it off: