[1.12] Killing leader makes all containers end up on a single node #25017

Closed
thaJeztah opened this issue Jul 25, 2016 · 26 comments
Labels
area/swarm, kind/bug, priority/P1, version/1.12

@thaJeztah
Member

thaJeztah commented Jul 25, 2016

Opening as a new issue, per #24941 (comment)

Note: I just re-ran this on current master (Docker version 1.12.0-dev, build 42f2205), and ended up with the same result.

Something I just ran into, and can reproduce reliably:

  1. Create 4 droplets on DigitalOcean, and install 1.12-RC4
  2. First node creates swarm (docker swarm init)
  3. Nodes 2 and 3 join the swarm as managers
  4. Node 4 joins the swarm as worker

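For reference, the bootstrap is roughly the following (a sketch; the join tokens and manager IP are placeholders, printed by docker swarm join-token):

# on node 1: initialize the swarm (pass --advertise-addr if the droplet has multiple interfaces)
docker swarm init

# on node 1: print the join commands, including the tokens
docker swarm join-token manager
docker swarm join-token worker

# on nodes 2 and 3: join as managers; on node 4: join as a worker
docker swarm join --token <MANAGER-TOKEN> <NODE1-IP>:2377
docker swarm join --token <WORKER-TOKEN> <NODE1-IP>:2377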
Then, create a service with 16 replicas:

docker service create --name web --replicas=16 -p 80:80 nginx:alpine

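To confirm the service reached the desired scale, docker service ls on a manager shows the replica count (the REPLICAS column should read 16/16 once all tasks are running):

# on any manager node
docker service ls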
On one of the manager nodes (swarm-test-02), watch docker node ls:

ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Ready   Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Ready   Active        Leader
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Ready   Active        Reachable
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Ready   Active

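(The watch loops here are just the standard watch utility, for example:)

# refresh the node list every 2 seconds on a manager
watch -n 2 docker node ls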
And on all nodes, watch docker ps:

CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS               NAMES
72ea95d66177        nginx:alpine        "nginx -g 'daemon off"   About a minute ago   Up About a minute   80/tcp, 443/tcp     web.2.7h8zuthv89jr2ao4c1x4r9qce
dd58f8757a06        nginx:alpine        "nginx -g 'daemon off"   About a minute ago   Up About a minute   80/tcp, 443/tcp     web.10.8bolssicpwc25op18dleducvg
59966a937870        nginx:alpine        "nginx -g 'daemon off"   About a minute ago   Up About a minute   80/tcp, 443/tcp     web.11.0shxh3ccvzpwzvjvphjrzefkh
49040cf8f0a6        nginx:alpine        "nginx -g 'daemon off"   About a minute ago   Up About a minute   80/tcp, 443/tcp     web.9.18f9rtrhyamjdmxnmd9hmokj0

Kill the leader node

From the DigitalOcean control panel, destroy the leader node; meanwhile, on the remaining nodes, watch what happens.

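(If you are reproducing this without the control panel, a clean shutdown from an SSH session on the leader is roughly equivalent; note the comment further down that the control panel also performs a clean shutdown rather than a hard power-off.)

# on the leader node
sudo systemctl poweroff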
1. Initial state (before killing leader)

ID                           HOSTNAME            MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
1msvix0hu4vz7czqcwvzi3poq *  ubuntu-2gb-ams3-01  Accepted    Ready   Active        Reachable
264qo2mcvkuagirx68iluzi88    ubuntu-2gb-ams3-02  Accepted    Ready   Active        Reachable
4z3arwkhuwvvqiqwm9rxqg165    ubuntu-2gb-ams3-01  Accepted    Ready   Active        Leader
en76v2a9mlf12sjwdtw3hyb58    ubuntu-2gb-ams3-01  Accepted    Ready   Active

2. Status "unknown" for all nodes

Just after killing the leader, an rpc deadline error is shown; then the node status goes through the following stages:

ID                           HOSTNAME       MEMBERSHIP  STATUS   AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Unknown  Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Unknown  Active        Unreachable
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Unknown  Active        Leader
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Unknown  Active

3. Status "down" for all nodes

ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Down    Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Down    Active        Unreachable
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Down    Active        Leader
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Down    Active

4. Status "down" for the manager that did not become leader

ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Down    Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Down    Active        Unreachable
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Ready   Active        Leader
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Ready   Active

5. Status "ready"

ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
0j5yrijzbkgwl3bmjzp1aag1f    swarm-test-03  Accepted    Ready   Active        Reachable
36oy4vdvxxmnuw72pfli5dv2i    swarm-test-01  Accepted    Down    Active        Unreachable
37argq2z05d5719olaiqpauvh *  swarm-test-02  Accepted    Ready   Active        Leader
bpcy1dxk7v6md85qg3m7jn8il    swarm-test-04  Accepted    Ready   Active

However, at stage 5, all containers ended up on a single node:

swarm-test-02:

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

swarm-test-03:

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

swarm-test-04:

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
06b59fd80536        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.1.7599uowuwx8kln46jzzlzrsej
5e751507a919        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.8.6wcxeir8kw4misv5hm6tbt1gb
9f871df4c0e6        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.9.6348tx8lisem60tge1n9h105p
34d922f5627b        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.10.18x3x83ay20moo4wfk6uqssuk
fe00d09420fe        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 5 minutes        80/tcp, 443/tcp     web.15.7ajzwkx125w0ilbv09kluxsh5
1a365c02c807        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.2.9g06i0ji6oby1a681hrnr7fuv
780f69e79646        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.12.0952zl8ejgp7glfr991xmiqsu
93002b1dcfe7        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.14.62cwuiykimiz6dlmiognotmcn
c96a48711179        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.3.96zsnc2bfgd9dj71bicgds0w1
7634285e896a        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.6.bqwqso18dqrkr80jc51aumonq
3ed44fe7e021        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.11.8s2va2tvmpw187s7v8ce6h3r2
23a083a91b6b        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.13.709wueza0j9631eibdm7ildu3
6d6871c931ce        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.4.1o0ay75v1y8amya2hnhnfb8y0
2a4cb84bd3ee        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.7.7ijasltqhbalctuahsly3n2cy
18bcc8cb66a1        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 5 minutes        80/tcp, 443/tcp     web.16.b7z9i8lo97fzxmtvwd06ufpxp
67fa07227b58        nginx:alpine        "nginx -g 'daemon off"   5 minutes ago       Up 4 minutes        80/tcp, 443/tcp     web.5.6ln0iyt24qxnic2nu0rg3ttxx
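
A quick way to tally how many web tasks each node is running, using standard docker ps filtering (run on each host):

# count the containers belonging to the web service on this node
docker ps --filter name=web. --format '{{.Names}}' | wc -l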
@thaJeztah thaJeztah added kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. version/1.12 area/swarm labels Jul 25, 2016
@thaJeztah thaJeztah added this to the 1.12.0 milestone Jul 25, 2016
@thaJeztah
Member Author

/cc @abronan @aluzzardi

@thaJeztah
Member Author

Just ran this again; put the daemons in debug mode, and collected the logs

  • swarm-test-01 = original leader (the node I switch off)
  • swarm-test-02 and swarm-test-03 are managers
  • swarm-test-04 is a worker

The interesting bit is that, for a while after the old leader went down, the new leader wasn't aware that swarm-test-02 was "up" (this output is from swarm-test-02 itself):

ID                           HOSTNAME       STATUS  AVAILABILITY  MANAGER STATUS
3x3p4ts6oik5z4835jdmhjo7r    swarm-test-01  Down    Active        Unreachable
629t5eojcj0ll4lu7iw8210ga *  swarm-test-02  Down    Active        Reachable
9wgwnjg63gv68ysnmhrc90a5t    swarm-test-04  Ready   Active
cf922kv1bodg04ciivhqngx88    swarm-test-03  Ready   Active        Leader

swarm-test-01.log.txt
swarm-test-02.log.txt
swarm-test-03.log.txt
swarm-test-04.log.txt
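
For reference, debug mode was enabled roughly like this (a sketch; assuming the standard systemd setup and the default /etc/docker/daemon.json location):

# turn on daemon debug logging, then restart the engine
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker

# collect the daemon logs from journald, one file per node
journalctl -u docker > swarm-test-01.log.txt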

@aluzzardi
Member

/cc @LK4D4

@thaJeztah
Member Author

Worth noting that apparently the DigitalOcean control panel does a "clean" shutdown, not a direct power-off:

Jul 25 14:23:46 swarm-test-01 dockerd[2926]: time="2016-07-25T14:23:46.444502919-04:00" level=debug msg="Clean shutdown succeeded"
Jul 25 14:23:46 swarm-test-01 dockerd[2926]: time="2016-07-25T14:23:46.445714157-04:00" level=info msg="stopping containerd after receiving terminated"
Jul 25 14:23:46 swarm-test-01 dockerd[2926]: time="2016-07-25T14:23:46.445519702-04:00" level=debug msg="containerd connection state change: SHUTDOWN"
Jul 25 14:23:47 swarm-test-01 systemd[1]: Stopped Docker Application Container Engine.
-- Reboot --
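
(To compare against an abrupt failure rather than a clean shutdown, one could kill the daemon on the leader without giving it a chance to deregister; a sketch:)

# hard-kill dockerd on the leader instead of shutting down cleanly
sudo systemctl kill --signal=SIGKILL docker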

@aaronlehmann
Contributor

It looks like you shut swarm-test-01 down around 14:23:45. The agents notice right away:

time="2016-07-25T14:23:45.078944184-04:00" level=error msg="agent: session failed" error="rpc error: code = 13 desc = transport is closing" module=agent
Jul 25 14:23:45 swarm-test-02 dockerd[24195]: time="2016-07-25T14:23:45.079207739-04:00" level=debug msg="agent: rebuild session" module=agent

swarm-test-04 reconnects successfully at 14:24:06:

time="2016-07-25T14:24:06.894433740-04:00" level=debug msg="(*session).start" module=agent
time="2016-07-25T14:24:08.143185422-04:00" level=debug msg="(*session).heartbeat" module=agent session.id=1deljljgv95s41xyss3tredyk
Jul 25 14:24:08 swarm-test-04 dockerd[1331]: time="2016-07-25T14:24:08.143183751-04:00" level=debug msg="(*session).listen" module=agent session.id=1deljljgv95s41xyss3tredyk
Jul 25 14:24:08 swarm-test-04 dockerd[1331]: time="2016-07-25T14:24:08.143258843-04:00" level=debug msg="(*session).watch" module=agent session.id=1deljljgv95s41xyss3tredyk

So I think the issue is either:

  • The agent takes too long to successfully reconnect after the leader goes down.

or

  • The grace period for deciding a node is down (15s, I think) is too short for the time the agent takes to reconnect.

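(For context, the down-detection window is a small multiple of the dispatcher heartbeat period, which is tunable; a sketch, assuming the --dispatcher-heartbeat flag on docker swarm update:)

# lengthen the dispatcher heartbeat period (default 5s), and with it
# the window before a silent node is marked down
docker swarm update --dispatcher-heartbeat 15s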
@LK4D4
Contributor

LK4D4 commented Jul 25, 2016

@aaronlehmann there are indeed messages:

level=debug msg="heartbeat expiration for unknown node" method="(*Dispatcher).markNodesUnknown" module=dispatcher node=3x3p4ts6oik5z4835jdmhjo7r

@aluzzardi
Member

@LK4D4 @aaronlehmann Is this a release blocker?

@aluzzardi
Member

aluzzardi commented Jul 26, 2016

The grace period for deciding a node is down (15s, I think) is too short for the time the agent takes to reconnect.

@aaronlehmann I believe that when there's a leader re-election, we actually wait much more than the typical 15 seconds. @LK4D4, is that correct?

@LK4D4 by the way, how long would it take for a node to be kicked out in the usual case versus during a re-election/restart?

@aluzzardi
Member

@thaJeztah Fixed in moby/swarmkit#1238

Could you please confirm it has been fixed (once we update vendoring)? Thanks!

@thaJeztah
Member Author

Have done a lot of testing with @LK4D4, and although we have some pointers, there's no full solution yet; moving this to 1.12.1

@thaJeztah thaJeztah modified the milestones: 1.12.1, 1.12.0 Jul 28, 2016
@thaJeztah thaJeztah added the priority/P1 Important: P1 issues are a top priority and a must-have for the next release. label Jul 28, 2016
@aaronlehmann
Contributor

@thaJeztah: We found many overlapping problems that contributed to this result.

Most of them are already fixed in swarmkit thanks to @LK4D4's work. While we're not completely done here, it might be a good exercise to give this setup another try after vendoring the version of swarmkit on its bump_v1.12.1 branch. I expect this should show improvement.

@thaJeztah
Member Author

Yes, @LK4D4 and I have done a lot of testing on those nodes; I'll keep them alive so that we have a setup to test on. Awesome work!

@LK4D4
Contributor

LK4D4 commented Aug 1, 2016

@thaJeztah everything is merged.

@icecrime
Contributor

icecrime commented Aug 1, 2016

Thanks @LK4D4!

@ralphtheninja
Contributor

@thaJeztah IIRC there has to be at least one issue with priority P0 to trigger a 1.12.1 release. Have you considered bumping this from P1 to P0?

@thaJeztah
Member Author

@ralphtheninja it wasn't reproducible in all environments (I'm keeping a number of test droplets alive on DigitalOcean, because those particular instances exposed this issue). We'll certainly consider it, but we're also investigating some other issues that we'd like to get to the bottom of before deciding on a 1.12.1 release.

@aaronlehmann
Contributor

@thaJeztah: All of the commits in LK4D4's list were vendored by #25159. We may be able to close this if you can confirm this issue is fixed.

For the 1.12.1 patch release, we're maintaining a special branch of Swarmkit that also has all of these commits.

@thaJeztah
Member Author

@aaronlehmann should we wait for moby/swarmkit#1299 ?

@aaronlehmann
Contributor

Sure.

@aaronlehmann
Contributor

I added that PR to @LK4D4's comment.

@aaronlehmann
Contributor

Added yet another PR...

@LK4D4
Contributor

LK4D4 commented Aug 5, 2016

@aaronlehmann @thaJeztah I've tested with the last two PRs and was not able to reproduce. Nodes recover very quickly, 10-15s at most.

@thaJeztah
Member Author

That's great news, @LK4D4. I'm not near a computer right now, but I'll give it a try tonight.

@aaronlehmann
Contributor

@thaJeztah @LK4D4: Have you confirmed that the fixes so far solve the problem? Let's close this ticket if that's the case, since swarmkit was vendored on Tuesday.

@thaJeztah
Member Author

@aaronlehmann the last results were in the screen recording I shared, and they looked stable; I think we can close this.
