
Flaky test: TestAPISwarmRaftQuorum #34988

Open
thaJeztah opened this issue Sep 26, 2017 · 8 comments · Fixed by #35173
Labels: area/swarm, area/testing, kind/bug

Comments

@thaJeztah (Member)

This test seems to be failing frequently, e.g. https://jenkins.dockerproject.org/job/Docker-PRs/45756 (#34908)

10:58:56 ----------------------------------------------------------------------
10:58:56 FAIL: docker_api_swarm_test.go:349: DockerSwarmSuite.TestAPISwarmRaftQuorum
10:58:56 
10:58:56 [d027470b02f36] waiting for daemon to start
10:58:56 [d027470b02f36] daemon started
10:58:56 
10:58:56 [d416ed90108aa] waiting for daemon to start
10:58:56 [d416ed90108aa] daemon started
10:58:56 
10:58:56 [db1f95dd1b76f] waiting for daemon to start
10:58:56 [db1f95dd1b76f] daemon started
10:58:56 
10:58:56 [d416ed90108aa] exiting daemon
10:58:56 [db1f95dd1b76f] exiting daemon
10:58:56 docker_api_swarm_test.go:374:
10:58:56     c.Assert(err.Error(), checker.Contains, "deadline exceeded")
10:58:56 ... obtained string = "Error response from daemon: rpc error: code = Internal desc = transport is closing"
10:58:56 ... substring string = "deadline exceeded"
10:58:56 
10:58:56 [d027470b02f36] exiting daemon
@thaJeztah added the area/swarm, area/testing, and kind/bug labels on Sep 26, 2017
@tophj-ibm (Contributor)

Same test, different error, from https://jenkins.dockerproject.org/job/Docker-PRs-s390x/6032/consoleFull:

12:30:39 ----------------------------------------------------------------------
12:30:39 FAIL: docker_api_swarm_test.go:349: DockerSwarmSuite.TestAPISwarmRaftQuorum
12:30:39 
12:30:39 [d5de6519706ea] waiting for daemon to start
12:30:39 [d5de6519706ea] daemon started
12:30:39 
12:30:39 [d51bb8eea1341] waiting for daemon to start
12:30:39 [d51bb8eea1341] daemon started
12:30:39 
12:30:39 [d389f1745572e] waiting for daemon to start
12:30:39 [d389f1745572e] daemon started
12:30:39 
12:30:39 [d51bb8eea1341] exiting daemon
12:30:39 [d389f1745572e] exiting daemon
12:30:39 docker_api_swarm_test.go:374:
12:30:39     c.Assert(err.Error(), checker.Contains, "deadline exceeded")
12:30:39 ... obtained string = "Error response from daemon: rpc error: code = Unknown desc = raft: failed to process the request: node lost leader status"
12:30:39 ... substring string = "deadline exceeded"
12:30:39 
12:30:39 [d5de6519706ea] exiting daemon
12:31:33 
12:31:33 ----------------------------------------------------------------------

@seemethere (Contributor)

Also failing on Docker CE

@tophj-ibm (Contributor)

I'm not a swarm expert, but I took a look at this and I see a couple of things.

In the test we shut down 2 of the 3 nodes, all of which are managers. When this happens, the remaining node prepares to step down from being the leader, because with three managers the raft quorum is two and a single remaining manager can no longer maintain an active quorum (I believe this step-down is a newer swarmkit change?). This is where the test flakiness occurs.

1.) If we hit the assert looking for "deadline exceeded" before the node steps down, it works as intended.

2.) If we hit the assert during or immediately after the step-down, a couple of errors are thrown because that node still had a few pending tasks; this accounts for some of what we are seeing here:

time="2017-10-06T17:05:50.801644156Z" level=warning msg="5083cee7072c32d9 stepped down to follower since quorum is not active" module=raft node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.801737638Z" level=info msg="5083cee7072c32d9 became follower at term 2" module=raft node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.801762084Z" level=info msg="raft.node: 5083cee7072c32d9 lost leader 5083cee7072c32d9 at term 2" module=raft node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.801944707Z" level=error msg="failed a store batch operation while processing tasks" error="raft: failed to process the request: node lost leader status" module=node node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.802276966Z" level=error msg="agent: session failed" error="rpc error: code = Canceled desc = context canceled" module="node/agent" node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.802315732Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=yhqcd5leqf91fd6zqqhrx9tc1 node.session=owhhja1dda0zxw3m3o4fl3kg3
time="2017-10-06T17:05:50.802355551Z" level=debug msg="agent: rebuild session" module="node/agent" node.id=yhqcd5leqf91fd6zqqhrx9tc1
time="2017-10-06T17:05:50.802935500Z" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Internal desc = transport is closing"

3.) And finally, if we wait a while for the node to step down and for all of its pending tasks to drain, we get the following error, which seems like a reasonable thing to check for:

The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online 

As for how to fix this, my guess is we don't want to look for case 1 anymore. I think a correct solution would be to wait until the last node is no longer the leader, then look for the case 3 error, and only then start the rest of the nodes back up (a rough sketch follows), although I'm not 100% sure this is something this test should be testing.
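
A rough sketch of that approach, reusing the suite's waitAndAssert helper and defaultReconciliationTimeout (illustrative only, not a finished patch; the exact error string and the restart step are assumptions based on the discussion above):

    // Wait for d1 to step down: with two of the three managers stopped, the
    // raft quorum of a three-manager cluster (two nodes) is lost, so service
    // creation against d1 should eventually fail with the "no leader" error
    // instead of a transient deadline or transport error.
    waitAndAssert(c, defaultReconciliationTimeout, func(c *check.C) (interface{}, check.CommentInterface) {
        _, err := cli.ServiceCreate(context.Background(), service.Spec, types.ServiceCreateOptions{})
        return err.Error(), nil
    }, checker.Contains, "Make sure more than half of the managers are online.")
    // Only after that error is observed, restart d2 and d3 and verify that the
    // swarm becomes operational again.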

ping @aaronlehmann on case number 2: are those errors a bug? Should the remaining tasks be handled gracefully in the event that a node steps down?

@aaronlehmann (Contributor)

Good observations. I think you are correct about the way to fix the test.

Those errors in case 2 look like they are expected behavior, but we could possibly do a better job of making it clear what's happening and that nothing is wrong beyond the loss of leadership. The only line I'm not completely sure about is:

time="2017-10-06T17:05:50.802935500Z" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Internal desc = transport is closing"

I can't think of a reason offhand why this would be affected by a leadership change. Each node is supposed to watch its own store, whether or not it's the leader. It might be worth looking into.

@kolyshkin (Contributor)

Got a PANIC from this test on powerpc: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/10621/console

12:21:01 ----------------------------------------------------------------------
12:21:01 FAIL: check_test.go:347: DockerSwarmSuite.TearDownTest
12:21:01
12:21:01 check_test.go:352:
12:21:01 d.Stop(c)
12:21:01 /go/src/github.com/docker/docker/internal/test/daemon/daemon.go:401:
12:21:01 t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
12:21:01 ... Error: Error while stopping the daemon de1d200a57725 : exit status 130
12:21:01
12:21:01
12:21:01 ----------------------------------------------------------------------
12:21:01 PANIC: docker_api_swarm_test.go:359: DockerSwarmSuite.TestAPISwarmRaftQuorum
12:21:01
12:21:01 [de1d200a57725] waiting for daemon to start
12:21:01 [de1d200a57725] daemon started
12:21:01
12:21:01 [d8d1db6c7402b] waiting for daemon to start
12:21:01 [d8d1db6c7402b] daemon started
12:21:01
12:21:01 [dcc72e34f5265] waiting for daemon to start
12:21:01 [dcc72e34f5265] daemon started
12:21:01
12:21:01 [d8d1db6c7402b] exiting daemon
12:21:01 [dcc72e34f5265] exiting daemon
12:21:01 [d8d1db6c7402b] waiting for daemon to start
12:21:01 [d8d1db6c7402b] daemon started
12:21:01
12:21:01 [de1d200a57725] daemon started
12:21:01 Attempt #2: daemon is still running with pid 20165
12:21:01 Attempt #3: daemon is still running with pid 20165
12:21:01 Attempt #4: daemon is still running with pid 20165
12:21:01 [de1d200a57725] exiting daemon
12:21:01 ... Panic: Fixture has panicked (see related PANIC)
12:21:01
12:21:01 ----------------------------------------------------------------------

The panic is induced by the test code in order to show a backtrace (which can be found in the bundle tarball).

Not sure if we want to reopen this.

@kolyshkin (Contributor)

The above has also happened on z (https://jenkins.dockerproject.org/job/Docker-PRs-s390x/10497/console) and is coming from #37358

@kolyshkin (Contributor)

Saw a failure on z again (https://jenkins.dockerproject.org/job/Docker-PRs-s390x/10846/console):

01:56:29.649 
01:56:29.649 ----------------------------------------------------------------------
01:56:29.649 FAIL: docker_api_swarm_test.go:359: DockerSwarmSuite.TestAPISwarmRaftQuorum
01:56:29.649 
01:56:29.649 [d3ed86deae02a] waiting for daemon to start
01:56:29.649 [d3ed86deae02a] daemon started
01:56:29.649 
01:56:29.649 [d102c545ad8bb] waiting for daemon to start
01:56:29.649 [d102c545ad8bb] daemon started
01:56:29.649 
01:56:29.649 [d9892df4c7108] waiting for daemon to start
01:56:29.650 [d9892df4c7108] daemon started
01:56:29.650 
01:56:29.650 [d102c545ad8bb] exiting daemon
01:56:29.650 [d9892df4c7108] exiting daemon
01:56:29.650 docker_api_swarm_test.go:385:
01:56:29.650     // d1 will eventually step down from leader because there is no longer an active quorum, wait for that to happen
01:56:29.650     waitAndAssert(c, defaultReconciliationTimeout, func(c *check.C) (interface{}, check.CommentInterface) {
01:56:29.650         _, err = cli.ServiceCreate(context.Background(), service.Spec, types.ServiceCreateOptions{})
01:56:29.650         return err.Error(), nil
01:56:29.650     }, checker.Contains, "Make sure more than half of the managers are online.")
01:56:29.651 docker_utils_test.go:435:
01:56:29.651     c.Assert(v, checker, args...)
01:56:29.651 ... obtained string = "Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
01:56:29.651 ... substring string = "Make sure more than half of the managers are online."
01:56:29.651 
01:56:29.651 [d3ed86deae02a] exiting daemon
01:57:38.096 
01:57:38.096 ----------------------------------------------------------------------

@kolyshkin (Contributor)

And another PR (#37703), also on z (from https://jenkins.dockerproject.org/job/Docker-PRs-s390x/10847/console):

01:52:07.639 
01:52:07.639 ----------------------------------------------------------------------
01:52:07.639 FAIL: docker_api_swarm_test.go:359: DockerSwarmSuite.TestAPISwarmRaftQuorum
01:52:07.639 
01:52:07.639 [d4541f9718a87] waiting for daemon to start
01:52:07.639 [d4541f9718a87] daemon started
01:52:07.640 
01:52:07.640 [d72c2dc9702d4] waiting for daemon to start
01:52:07.640 [d72c2dc9702d4] daemon started
01:52:07.640 
01:52:07.640 [d260eca6ccbb6] waiting for daemon to start
01:52:07.640 [d260eca6ccbb6] daemon started
01:52:07.640 
01:52:07.640 [d72c2dc9702d4] exiting daemon
01:52:07.640 [d260eca6ccbb6] exiting daemon
01:52:07.640 docker_api_swarm_test.go:385:
01:52:07.640     // d1 will eventually step down from leader because there is no longer an active quorum, wait for that to happen
01:52:07.640     waitAndAssert(c, defaultReconciliationTimeout, func(c *check.C) (interface{}, check.CommentInterface) {
01:52:07.640         _, err = cli.ServiceCreate(context.Background(), service.Spec, types.ServiceCreateOptions{})
01:52:07.641         return err.Error(), nil
01:52:07.641     }, checker.Contains, "Make sure more than half of the managers are online.")
01:52:07.641 docker_utils_test.go:435:
01:52:07.641     c.Assert(v, checker, args...)
01:52:07.641 ... obtained string = "Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
01:52:07.641 ... substring string = "Make sure more than half of the managers are online."
01:52:07.641 
01:52:07.641 [d4541f9718a87] exiting daemon
01:53:18.938 
01:53:18.938 ----------------------------------------------------------------------
