
Flaky test: TestAPISwarmRaftQuorum #34988

Open
thaJeztah opened this issue Sep 26, 2017 · 8 comments · Fixed by #35173
Labels: area/swarm, area/testing, kind/bug

Comments

@thaJeztah (Member)

This test seems to be failing frequently, e.g. https://jenkins.dockerproject.org/job/Docker-PRs/45756 (#34908)

10:58:56 ----------------------------------------------------------------------
10:58:56 FAIL: docker_api_swarm_test.go:349: DockerSwarmSuite.TestAPISwarmRaftQuorum
10:58:56 
10:58:56 [d027470b02f36] waiting for daemon to start
10:58:56 [d027470b02f36] daemon started
10:58:56 
10:58:56 [d416ed90108aa] waiting for daemon to start
10:58:56 [d416ed90108aa] daemon started
10:58:56 
10:58:56 [db1f95dd1b76f] waiting for daemon to start
10:58:56 [db1f95dd1b76f] daemon started
10:58:56 
10:58:56 [d416ed90108aa] exiting daemon
10:58:56 [db1f95dd1b76f] exiting daemon
10:58:56 docker_api_swarm_test.go:374:
10:58:56     c.Assert(err.Error(), checker.Contains, "deadline exceeded")
10:58:56 ... obtained string = "Error response from daemon: rpc error: code = Internal desc = transport is closing"
10:58:56 ... substring string = "deadline exceeded"
10:58:56 
10:58:56 [d027470b02f36] exiting daemon
@thaJeztah added the area/swarm, area/testing, and kind/bug labels on Sep 26, 2017
@tophj-ibm (Contributor)

Same test, different error, from https://jenkins.dockerproject.org/job/Docker-PRs-s390x/6032/consoleFull:

12:30:39 ----------------------------------------------------------------------
12:30:39 FAIL: docker_api_swarm_test.go:349: DockerSwarmSuite.TestAPISwarmRaftQuorum
12:30:39 
12:30:39 [d5de6519706ea] waiting for daemon to start
12:30:39 [d5de6519706ea] daemon started
12:30:39 
12:30:39 [d51bb8eea1341] waiting for daemon to start
12:30:39 [d51bb8eea1341] daemon started
12:30:39 
12:30:39 [d389f1745572e] waiting for daemon to start
12:30:39 [d389f1745572e] daemon started
12:30:39 
12:30:39 [d51bb8eea1341] exiting daemon
12:30:39 [d389f1745572e] exiting daemon
12:30:39 docker_api_swarm_test.go:374:
12:30:39     c.Assert(err.Error(), checker.Contains, "deadline exceeded")
12:30:39 ... obtained string = "Error response from daemon: rpc error: code = Unknown desc = raft: failed to process the request: node lost leader status"
12:30:39 ... substring string = "deadline exceeded"
12:30:39 
12:30:39 [d5de6519706ea] exiting daemon
12:31:33 
12:31:33 ----------------------------------------------------------------------

@seemethere (Contributor)

Also failing on Docker CE

@tophj-ibm (Contributor)

I'm not a swarm expert, but I took a look at this and I see a couple of things.

In the test we shut down 2 of the 3 nodes, all of which are managers. When this happens, the remaining node prepares to step down from being the leader, because with three managers the raft quorum is two and a single remaining manager can no longer maintain an active quorum (I believe this step-down is a newer swarmkit change?). This is where the test flakiness occurs.

1.) If we hit the assert looking for "deadline exceeded" before the node steps down, it works as intended.

2.) If we hit the assert during or immediately after the step-down, a couple of errors are thrown because that node still had a few pending tasks; this accounts for some of what we are seeing here:

time="2017-10-06T17:05:50.801644156Z" level=warning msg="5083cee7072c32d9 stepped down to follower since quorum is not active" module=raft node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.801737638Z" level=info msg="5083cee7072c32d9 became follower at term 2" module=raft node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.801762084Z" level=info msg="raft.node: 5083cee7072c32d9 lost leader 5083cee7072c32d9 at term 2" module=raft node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.801944707Z" level=error msg="failed a store batch operation while processing tasks" error="raft: failed to process the request: node lost leader status" module=node node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.802276966Z" level=error msg="agent: session failed" error="rpc error: code = Canceled desc = context canceled" module="node/agent" node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.802315732Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=yhqcd5leqf91fd6zqqhrx9tc1 node.session=owhhja1dda0zxw3m3o4fl3kg3
time="2017-10-06T17:05:50.802355551Z" level=debug msg="agent: rebuild session" module="node/agent" node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.802935500Z" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Internal desc = transport is closing"

3.) And finally, if we wait a while for the node to step down and for all of its pending tasks to drain, we get the following error, which seems like a reasonable thing to check for:

The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online 

As for how to fix this, my guess is we don't want to look for case 1 anymore. I think a correct solution would be to wait until the last node is no longer the leader, then look for the case 3 error, and only then start the rest of the nodes back up (a rough sketch follows), although I'm not 100% sure this is something this test should be testing.
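
A rough sketch of that approach, reusing the suite's waitAndAssert helper and defaultReconciliationTimeout (illustrative only, not a finished patch; the exact error string and the restart step are assumptions based on the discussion above):

    // Wait for d1 to step down: with two of the three managers stopped, the
    // raft quorum of a three-manager cluster (two nodes) is lost, so service
    // creation against d1 should eventually fail with the "no leader" error
    // instead of a transient deadline or transport error.
    waitAndAssert(c, defaultReconciliationTimeout, func(c *check.C) (interface{}, check.CommentInterface) {
        _, err := cli.ServiceCreate(context.Background(), service.Spec, types.ServiceCreateOptions{})
        return err.Error(), nil
    }, checker.Contains, "Make sure more than half of the managers are online.")
    // Only after that error is observed, restart d2 and d3 and verify that the
    // swarm becomes operational again.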

ping @aaronlehmann on case number 2: are those errors a bug? Should the remaining tasks be handled gracefully in the event that a node steps down?

@aaronlehmann (Contributor)

Good observations. I think you are correct about the way to fix the test.

Those errors in case 2 look like they are expected behavior, but we could possibly do a better job of making it clear what's happening and that nothing is wrong beyond the loss of leadership. The only line I'm not completely sure about is:

time="2017-10-06T17:05:50.802935500Z" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Internal desc = transport is closing"

I can't think of a reason offhand why this would be affected by a leadership change. Each node is supposed to watch its own store, whether or not it's the leader. It might be worth looking into.

@kolyshkin (Contributor)

Got a PANIC from this test on powerpc: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/10621/console

12:21:01 ----------------------------------------------------------------------
12:21:01 FAIL: check_test.go:347: DockerSwarmSuite.TearDownTest
12:21:01
12:21:01 check_test.go:352:
12:21:01 d.Stop(c)
12:21:01 /go/src/github.com/docker/docker/internal/test/daemon/daemon.go:401:
12:21:01 t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
12:21:01 ... Error: Error while stopping the daemon de1d200a57725 : exit status 130
12:21:01
12:21:01
12:21:01 ----------------------------------------------------------------------
12:21:01 PANIC: docker_api_swarm_test.go:359: DockerSwarmSuite.TestAPISwarmRaftQuorum
12:21:01
12:21:01 [de1d200a57725] waiting for daemon to start
12:21:01 [de1d200a57725] daemon started
12:21:01
12:21:01 [d8d1db6c7402b] waiting for daemon to start
12:21:01 [d8d1db6c7402b] daemon started
12:21:01
12:21:01 [dcc72e34f5265] waiting for daemon to start
12:21:01 [dcc72e34f5265] daemon started
12:21:01
12:21:01 [d8d1db6c7402b] exiting daemon
12:21:01 [dcc72e34f5265] exiting daemon
12:21:01 [d8d1db6c7402b] waiting for daemon to start
12:21:01 [d8d1db6c7402b] daemon started
12:21:01
12:21:01 [de1d200a57725] daemon started
12:21:01 Attempt #2: daemon is still running with pid 20165
12:21:01 Attempt #3: daemon is still running with pid 20165
12:21:01 Attempt #4: daemon is still running with pid 20165
12:21:01 [de1d200a57725] exiting daemon
12:21:01 ... Panic: Fixture has panicked (see related PANIC)
12:21:01
12:21:01 ----------------------------------------------------------------------

The panic is induced by the test code in order to show a backtrace (which can be found in the bundle tarball).

Not sure if we want to reopen this.

@kolyshkin (Contributor)

The above has also happened on z (https://jenkins.dockerproject.org/job/Docker-PRs-s390x/10497/console) and is coming from #37358

@kolyshkin (Contributor)

Saw a failure on z again (https://jenkins.dockerproject.org/job/Docker-PRs-s390x/10846/console):

01:56:29.649 
01:56:29.649 ----------------------------------------------------------------------
01:56:29.649 FAIL: docker_api_swarm_test.go:359: DockerSwarmSuite.TestAPISwarmRaftQuorum
01:56:29.649 
01:56:29.649 [d3ed86deae02a] waiting for daemon to start
01:56:29.649 [d3ed86deae02a] daemon started
01:56:29.649 
01:56:29.649 [d102c545ad8bb] waiting for daemon to start
01:56:29.649 [d102c545ad8bb] daemon started
01:56:29.649 
01:56:29.649 [d9892df4c7108] waiting for daemon to start
01:56:29.650 [d9892df4c7108] daemon started
01:56:29.650 
01:56:29.650 [d102c545ad8bb] exiting daemon
01:56:29.650 [d9892df4c7108] exiting daemon
01:56:29.650 docker_api_swarm_test.go:385:
01:56:29.650     // d1 will eventually step down from leader because there is no longer an active quorum, wait for that to happen
01:56:29.650     waitAndAssert(c, defaultReconciliationTimeout, func(c *check.C) (interface{}, check.CommentInterface) {
01:56:29.650         _, err = cli.ServiceCreate(context.Background(), service.Spec, types.ServiceCreateOptions{})
01:56:29.650         return err.Error(), nil
01:56:29.650     }, checker.Contains, "Make sure more than half of the managers are online.")
01:56:29.651 docker_utils_test.go:435:
01:56:29.651     c.Assert(v, checker, args...)
01:56:29.651 ... obtained string = "Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
01:56:29.651 ... substring string = "Make sure more than half of the managers are online."
01:56:29.651 
01:56:29.651 [d3ed86deae02a] exiting daemon
01:57:38.096 
01:57:38.096 ----------------------------------------------------------------------

@kolyshkin (Contributor)

And another PR (#37703), also on z (from https://jenkins.dockerproject.org/job/Docker-PRs-s390x/10847/console):

01:52:07.639 
01:52:07.639 ----------------------------------------------------------------------
01:52:07.639 FAIL: docker_api_swarm_test.go:359: DockerSwarmSuite.TestAPISwarmRaftQuorum
01:52:07.639 
01:52:07.639 [d4541f9718a87] waiting for daemon to start
01:52:07.639 [d4541f9718a87] daemon started
01:52:07.640 
01:52:07.640 [d72c2dc9702d4] waiting for daemon to start
01:52:07.640 [d72c2dc9702d4] daemon started
01:52:07.640 
01:52:07.640 [d260eca6ccbb6] waiting for daemon to start
01:52:07.640 [d260eca6ccbb6] daemon started
01:52:07.640 
01:52:07.640 [d72c2dc9702d4] exiting daemon
01:52:07.640 [d260eca6ccbb6] exiting daemon
01:52:07.640 docker_api_swarm_test.go:385:
01:52:07.640     // d1 will eventually step down from leader because there is no longer an active quorum, wait for that to happen
01:52:07.640     waitAndAssert(c, defaultReconciliationTimeout, func(c *check.C) (interface{}, check.CommentInterface) {
01:52:07.640         _, err = cli.ServiceCreate(context.Background(), service.Spec, types.ServiceCreateOptions{})
01:52:07.641         return err.Error(), nil
01:52:07.641     }, checker.Contains, "Make sure more than half of the managers are online.")
01:52:07.641 docker_utils_test.go:435:
01:52:07.641     c.Assert(v, checker, args...)
01:52:07.641 ... obtained string = "Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
01:52:07.641 ... substring string = "Make sure more than half of the managers are online."
01:52:07.641 
01:52:07.641 [d4541f9718a87] exiting daemon
01:53:18.938 
01:53:18.938 ----------------------------------------------------------------------
