[1.12] Killing leader makes all containers end up on a single node #25017

Closed
thaJeztah opened this issue Jul 25, 2016 · 26 comments
Labels
area/swarm, kind/bug, priority/P1, version/1.12

@thaJeztah
Member

thaJeztah commented Jul 25, 2016

Opening as a new issue, per #24941 (comment)

Note: I just re-ran this on current master (Docker version 1.12.0-dev, build 42f2205), and ended up with the same result.

Something I just ran into, and can reproduce reliably:

  1. Create 4 droplets on DigitalOcean, and install 1.12-RC4
  2. First node creates swarm (docker swarm init)
  3. Nodes 2 and 3 join the swarm as managers
  4. Node 4 joins the swarm as worker

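For reference, the bootstrap is roughly the following (a sketch; the join tokens and manager IP are placeholders, printed by docker swarm join-token):

# on node 1: initialize the swarm (pass --advertise-addr if the droplet has multiple interfaces)
docker swarm init

# on node 1: print the join commands, including the tokens
docker swarm join-token manager
docker swarm join-token worker

# on nodes 2 and 3: join as managers; on node 4: join as a worker
docker swarm join --token <MANAGER-TOKEN> <NODE1-IP>:2377
docker swarm join --token <WORKER-TOKEN> <NODE1-IP>:2377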
Then, create a service with 16 replicas:

docker service create --name web --replicas=16 -p 80:80 nginx:alpine

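To confirm the service reached the desired scale, docker service ls on a manager shows the replica count (the REPLICAS column should read 16/16 once all tasks are running):

# on any manager node
docker service ls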
On one of the manager nodes (swarm-test-02), watch docker node ls:

ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Ready   Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Ready   Active        Leader
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Ready   Active        Reachable
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Ready   Active

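(The watch loops here are just the standard watch utility, for example:)

# refresh the node list every 2 seconds on a manager
watch -n 2 docker node ls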
And on all nodes, watch docker ps:

CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS               NAMES
72ea95d66177        nginx:alpine        "nginx -g 'daemon off"   About a minute ago   Up About a minute   80/tcp, 443/tcp     web.2.7h8zuthv89jr2ao4c1x4r9qce
dd58f8757a06        nginx:alpine        "nginx -g 'daemon off"   About a minute ago   Up About a minute   80/tcp, 443/tcp     web.10.8bolssicpwc25op18dleducvg
59966a937870        nginx:alpine        "nginx -g 'daemon off"   About a minute ago   Up About a minute   80/tcp, 443/tcp     web.11.0shxh3ccvzpwzvjvphjrzefkh
49040cf8f0a6        nginx:alpine        "nginx -g 'daemon off"   About a minute ago   Up About a minute   80/tcp, 443/tcp     web.9.18f9rtrhyamjdmxnmd9hmokj0

Kill the leader node

From the DigitalOcean control panel, destroy the leader node; meanwhile, on the remaining nodes, watch what happens.

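(If you are reproducing this without the control panel, a clean shutdown from an SSH session on the leader is roughly equivalent; note the comment further down that the control panel also performs a clean shutdown rather than a hard power-off.)

# on the leader node
sudo systemctl poweroff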
1. Initial state (before killing leader)

ID                           HOSTNAME            MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
1msvix0hu4vz7czqcwvzi3poq *  ubuntu-2gb-ams3-01  Accepted    Ready   Active        Reachable
264qo2mcvkuagirx68iluzi88    ubuntu-2gb-ams3-02  Accepted    Ready   Active        Reachable
4z3arwkhuwvvqiqwm9rxqg165    ubuntu-2gb-ams3-01  Accepted    Ready   Active        Leader
en76v2a9mlf12sjwdtw3hyb58    ubuntu-2gb-ams3-01  Accepted    Ready   Active

2. Status "unknown" for all nodes

Just after killing the leader, an rpc deadline error is shown; then the node status goes through the following stages:

ID                           HOSTNAME       MEMBERSHIP  STATUS   AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Unknown  Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Unknown  Active        Unreachable
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Unknown  Active        Leader
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Unknown  Active

3. Status "down" for all nodes

ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Down    Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Down    Active        Unreachable
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Down    Active        Leader
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Down    Active

4. Status "down" for the manager that did not become leader

ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Down    Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Down    Active        Unreachable
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Ready   Active        Leader
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Ready   Active

5. Status "ready"

ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Ready   Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Down    Active        Unreachable
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Ready   Active        Leader
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Ready   Active

However, at stage 5, all containers ended up on a single node:

swarm-test-02:

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

swarm-test-03:

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

swarm-test-04:

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
06b59fd80536        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.1.7599uowuwx8kln46jzzlzrsej
5e751507a919        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.8.6wcxeir8kw4misv5hm6tbt1gb
9f871df4c0e6        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.9.6348tx8lisem60tge1n9h105p
34d922f5627b        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.10.18x3x83ay20moo4wfk6uqssuk
fe00d09420fe        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 5 minutes        80/tcp, 443/tcp     web.15.7ajzwkx125w0ilbv09kluxsh5
1a365c02c807        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.2.9g06i0ji6oby1a681hrnr7fuv
780f69e79646        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.12.0952zl8ejgp7glfr991xmiqsu
93002b1dcfe7        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.14.62cwuiykimiz6dlmiognotmcn
c96a48711179        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.3.96zsnc2bfgd9dj71bicgds0w1
7634285e896a        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.6.bqwqso18dqrkr80jc51aumonq
3ed44fe7e021        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.11.8s2va2tvmpw187s7v8ce6h3r2
23a083a91b6b        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.13.709wueza0j9631eibdm7ildu3
6d6871c931ce        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.4.1o0ay75v1y8amya2hnhnfb8y0
2a4cb84bd3ee        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.7.7ijasltqhbalctuahsly3n2cy
18bcc8cb66a1        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 5 minutes        80/tcp, 443/tcp     web.16.b7z9i8lo97fzxmtvwd06ufpxp
67fa07227b58        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.5.6ln0iyt24qxnic2nu0rg3ttxx
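
A quick way to tally how many web tasks each node is running, using standard docker ps filtering (run on each host):

# count the containers belonging to the web service on this node
docker ps --filter name=web. --format '{{.Names}}' | wc -l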
@thaJeztah thaJeztah added kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. version/1.12 area/swarm labels Jul 25, 2016
@thaJeztah thaJeztah added this to the 1.12.0 milestone Jul 25, 2016
@thaJeztah
Member Author

/cc @abronan @aluzzardi

@thaJeztah
Member Author

Just ran this again; put the daemons in debug mode, and collected the logs

  • swarm-test-01 = original leader (the node I switch off)
  • swarm-test-02 and swarm-test-03 are managers
  • swarm-test-04 is a worker

The interesting bit is that, for a while after the old leader went down, the new leader wasn't aware that swarm-test-02 was "up" (this output is from swarm-test-02 itself):

ID                           HOSTNAME       STATUS  AVAILABILITY  MANAGER STATUS
3x3p4ts6oik5z4835jdmhjo7r    swarm-test-01  Down    Active        Unreachable
629t5eojcj0ll4lu7iw8210ga *  swarm-test-02  Down    Active        Reachable
9wgwnjg63gv68ysnmhrc90a5t    swarm-test-04  Ready   Active
cf922kv1bodg04ciivhqngx88    swarm-test-03  Ready   Active        Leader

swarm-test-01.log.txt
swarm-test-02.log.txt
swarm-test-03.log.txt
swarm-test-04.log.txt
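
For reference, debug mode was enabled roughly like this (a sketch; assuming the standard systemd setup and the default /etc/docker/daemon.json location):

# turn on daemon debug logging, then restart the engine
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker

# collect the daemon logs from journald, one file per node
journalctl -u docker > swarm-test-01.log.txt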

@aluzzardi
Member

/cc @LK4D4

@thaJeztah
Member Author

Worth noting that apparently the DigitalOcean control panel does a "clean" shutdown, not a direct power-off:

Jul 25 14:23:46 swarm-test-01 dockerd[2926]: time="2016-07-25T14:23:46.444502919-04:00" level=debug msg="Clean shutdown succeeded"
Jul 25 14:23:46 swarm-test-01 dockerd[2926]: time="2016-07-25T14:23:46.445714157-04:00" level=info msg="stopping containerd after receiving terminated"
Jul 25 14:23:46 swarm-test-01 dockerd[2926]: time="2016-07-25T14:23:46.445519702-04:00" level=debug msg="containerd connection state change: SHUTDOWN"
Jul 25 14:23:47 swarm-test-01 systemd[1]: Stopped Docker Application Container Engine.
-- Reboot --
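
(To compare against an abrupt failure rather than a clean shutdown, one could kill the daemon on the leader without giving it a chance to deregister; a sketch:)

# hard-kill dockerd on the leader instead of shutting down cleanly
sudo systemctl kill --signal=SIGKILL docker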

@aaronlehmann
Contributor

It looks like you shut swarm-test-01 down around 14:23:45. The agents notice right away:

time="2016-07-25T14:23:45.078944184-04:00" level=error msg="agent: session failed" error="rpc error: code = 13 desc = transport is closing" module=agent
Jul 25 14:23:45 swarm-test-02 dockerd[24195]: time="2016-07-25T14:23:45.079207739-04:00" level=debug msg="agent: rebuild session" module=agent

swarm-test-04 reconnects successfully at 14:24:06:

time="2016-07-25T14:24:06.894433740-04:00" level=debug msg="(*session).start" module=agent
time="2016-07-25T14:24:08.143185422-04:00" level=debug msg="(*session).heartbeat" module=agent session.id=1deljljgv95s41xyss3tredyk
Jul 25 14:24:08 swarm-test-04 dockerd[1331]: time="2016-07-25T14:24:08.143183751-04:00" level=debug msg="(*session).listen" module=agent session.id=1deljljgv95s41xyss3tredyk
Jul 25 14:24:08 swarm-test-04 dockerd[1331]: time="2016-07-25T14:24:08.143258843-04:00" level=debug msg="(*session).watch" module=agent session.id=1deljljgv95s41xyss3tredyk

So I think the issue is either:

  • The agent takes too long to successfully reconnect after the leader goes down.

or

  • The grace period for deciding a node is down (15s, I think) is too short for the time the agent takes to reconnect.

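(For context, the down-detection window is a small multiple of the dispatcher heartbeat period, which is tunable; a sketch, assuming the --dispatcher-heartbeat flag on docker swarm update:)

# lengthen the dispatcher heartbeat period (default 5s), and with it
# the window before a silent node is marked down
docker swarm update --dispatcher-heartbeat 15s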
@LK4D4
Contributor

LK4D4 commented Jul 25, 2016

@aaronlehmann there are indeed messages:

level=debug msg="heartbeat expiration for unknown node" method="(*Dispatcher).markNodesUnknown" module=dispatcher node=3x3p4ts6oik5z4835jdmhjo7r

@aluzzardi
Member

@LK4D4 @aaronlehmann Is this a release blocker?

@aluzzardi
Member

aluzzardi commented Jul 26, 2016

The grace period for deciding a node is down (15s, I think) is too short for the time the agent takes to reconnect.

@aaronlehmann I believe that when there's a leader re-election, we actually wait much more than the typical 15 seconds. @LK4D4, is that correct?

@LK4D4 by the way, how long would it take for a node to be kicked out in the usual case versus during a re-election/restart?

@aluzzardi
Member

@thaJeztah Fixed in moby/swarmkit#1238

Could you please confirm it has been fixed (once we update vendoring)? Thanks!

@thaJeztah
Member Author

Have done a lot of testing with @LK4D4, and although we have some pointers, there's no full solution yet; moving this to 1.12.1

@thaJeztah thaJeztah modified the milestones: 1.12.1, 1.12.0 Jul 28, 2016
@thaJeztah thaJeztah added the priority/P1 Important: P1 issues are a top priority and a must-have for the next release. label Jul 28, 2016
@aaronlehmann
Contributor

@thaJeztah: We found many overlapping problems that contributed to this result.

Most of them are already fixed in swarmkit thanks to @LK4D4's work. While we're not completely done here, it might be a good exercise to give this setup another try after vendoring the version of swarmkit on its bump_v1.12.1 branch. I expect this should show improvement.

@thaJeztah
Member Author

Yes, @LK4D4 and I have done a lot of testing on those nodes; I'll keep them alive so that we have a setup to test on. Awesome work!

@LK4D4
Contributor

LK4D4 commented Aug 1, 2016

@thaJeztah everything is merged.

@icecrime
Contributor

icecrime commented Aug 1, 2016

Thanks @LK4D4!

@ralphtheninja
Contributor

@thaJeztah IIRC there has to be at least one issue with priority P0 to trigger a 1.12.1 release. Have you considered bumping this from P1 to P0?

@thaJeztah
Member Author

@ralphtheninja it wasn't reproducible in all environments (I'm keeping a number of test droplets alive on DigitalOcean, because those particular instances exposed this issue). We'll certainly consider it, but we're also investigating some other issues that we'd like to get to the bottom of before deciding on a 1.12.1 release.

@aaronlehmann
Contributor

@thaJeztah: All of the commits in LK4D4's list were vendored by #25159. We may be able to close this if you can confirm this issue is fixed.

For the 1.12.1 patch release, we're maintaining a special branch of Swarmkit that also has all of these commits.

@thaJeztah
Member Author

@aaronlehmann should we wait for moby/swarmkit#1299 ?

@aaronlehmann
Contributor

Sure.

@aaronlehmann
Contributor

I added that PR to @LK4D4's comment.

@aaronlehmann
Contributor

Added yet another PR...

@LK4D4
Contributor

LK4D4 commented Aug 5, 2016

@aaronlehmann @thaJeztah I've tested with the last two PRs and was not able to reproduce. Nodes recover very quickly, 10-15s at most.

@thaJeztah
Member Author

That's great news, @LK4D4. I'm not near a computer right now, but I'll give it a try tonight.

@aaronlehmann
Contributor

@thaJeztah @LK4D4: Have you confirmed that the fixes so far solve the problem? Let's close this ticket if that's the case, since swarmkit was vendored on Tuesday.

@thaJeztah
Member Author

@aaronlehmann the last results were in the screen recording I shared, and they looked stable; I think we can close this.
