[1.12] Killing leader makes all containers end up on a single node #25017
Comments
/cc @abronan @aluzzardi
Just ran this again; put the daemons in debug mode and collected the logs.
Interesting bit is that for a while after the old leader went down, the new leader wasn't aware that it was "up" (when viewed from
swarm-test-01.log.txt
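For reference, this is roughly how the daemons can be put in debug mode (a sketch assuming a systemd-managed engine; the exact configuration of the test droplets isn't shown in this thread):

```
# Turn on daemon debug logging and restart the engine.
# Note: this overwrites /etc/docker/daemon.json; merge by hand if one already exists.
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker

# Follow the daemon logs while reproducing the issue.
sudo journalctl -u docker -f
```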
/cc @LK4D4
Worth noting that apparently the DigitalOcean control panel does a "clean" shutdown, and not a direct power off.
It looks like you shut
So I think the issue is either:
or
@aaronlehmann there are indeed messages:
@LK4D4 @aaronlehmann Is this a release blocker?
@aaronlehmann I believe that when there's a leader re-election, we actually wait much more than the typical
@LK4D4 by the way, how long would it take for a node to be kicked out in the usual case versus during a re-election/restart?
@thaJeztah Fixed in moby/swarmkit#1238. Could you please confirm it has been fixed (once we update vendoring)? Thanks!
Have done a lot of testing with @LK4D4, and although we have some pointers, there's no full solution yet; moving this to 1.12.1
@thaJeztah: We found many overlapping problems that contributed to this result. Most of them are already fixed in swarmkit thanks to @LK4D4's work. While we're not completely done here, it might be a good exercise to give this setup another try after vendoring the version of swarmkit on its bump_v1.12.1 branch. I expect this should show improvement.
Yes, @LK4D4 and I have done a lot of testing on those nodes; I'll keep them alive so that we can have a setup to test. Awesome work!
@thaJeztah everything is merged. |
Thanks @LK4D4! |
@thaJeztah IIRC there has to be at least one issue with priority
@ralphtheninja it wasn't reproducible in all environments (I'm keeping a number of test droplets alive on DigitalOcean, because on those particular instances it exposed this issue). We'll certainly consider this, but we're also investigating some other issues that we'd like to get to the bottom of before deciding on a 1.12.1 release.
@thaJeztah: All of the commits in LK4D4's list were vendored by #25159. We may be able to close this if you can confirm this issue is fixed. For the 1.12.1 patch release, we're maintaining a special branch of Swarmkit that also has all of these commits.
@aaronlehmann should we wait for moby/swarmkit#1299 ?
Sure. |
I added that PR to @LK4D4's comment. |
Added yet another PR... |
@aaronlehmann @thaJeztah I've tested with the last two PRs and was not able to reproduce. Nodes recover very fast, like 10-15s maximum.
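For what it's worth, a rough way to time that recovery window is a simple polling loop (just a sketch, not something from the thread):

```
# Print a timestamped node listing once per second, so the window between the
# leader going away and all nodes reporting "Ready" again can be measured.
while true; do
  date +%T
  docker node ls
  echo
  sleep 1
done
```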
That's great news, @LK4D4! I'm not near a computer now, but I'll give it a try tonight.
@thaJeztah @LK4D4: Have you confirmed that the fixes so far solve the problem? Let's close this ticket if that's the case, since swarmkit was vendored on Tuesday.
@aaronlehmann last results were in the screen recording I shared, and looked stable; I think we can close this.
Opening as a new issue, per #24941 (comment)
Something I just ran into, and can reproduce reliably (the steps are condensed into a command sketch below):

- Set up a swarm (docker swarm init), then create a service and scale it to 16.
- On one of the manager nodes (swarm-test-02), watch docker node ls, and on all nodes, watch docker ps.
- Kill the leader node: from the DigitalOcean control panel, destroy the leader node; meanwhile, on the nodes, watch what happens.
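Condensed into commands, the repro looks roughly like this (the service name and image are placeholders, and the join steps for the other nodes are omitted):

```
# On the first manager: initialise the swarm.
docker swarm init
# (join the remaining managers/workers with the tokens printed by "swarm init")

# Create a service and scale it to 16 replicas.
docker service create --name web nginx   # "web" and "nginx" are placeholders
docker service scale web=16

# On swarm-test-02: watch the node list.
watch -n1 docker node ls

# On every node: watch the locally running containers.
watch -n1 docker ps

# Then destroy the leader node from the DigitalOcean control panel and
# observe the two watches.
```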
Just after killing the leader, an rpc deadline error is presented; then the node status goes through the following stages:

1. Initial state (before killing the leader)
2. Status "unknown" for all nodes
3. Status "down" for all nodes
4. Status "down" for the manager that did not become leader
5. Status "ready"
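To follow those status transitions without eyeballing the table, something like this works (node names as used above; just a sketch):

```
# Poll each node's reported state once per second.
watch -n1 '
for n in swarm-test-01 swarm-test-02 swarm-test-03 swarm-test-04; do
  docker node inspect --format "{{ .Description.Hostname }}: {{ .Status.State }}" "$n"
done'
```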
However, at stage 5, all containers ended up on a single node:
swarm-test-02:
swarm-test-03:
swarm-test-04:
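To confirm where the replicas actually landed, the task list of the service shows the node each task was scheduled on (service name again a placeholder):

```
# List every task of the service together with the node it was scheduled on.
docker service ps web

# Or, on each host, count the locally running containers.
docker ps -q | wc -l
```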