Flaky test: TestAPISwarmRaftQuorum #34988
Same test, different error, from https://jenkins.dockerproject.org/job/Docker-PRs-s390x/6032/consoleFull
Also failing on Docker CE.
I'm not a swarm expert, but I took a look at this and I see a couple of things. The test shuts down 2 of the 3 nodes, all of which are managers. When this happens, the remaining node prepares to step down as leader because there is no longer an active quorum (I believe this is a newer swarmkit change?). This is where the test flakiness occurs:

1. If we hit the assert looking for "deadline exceeded" before the node steps down, the test works as intended.
2. If we hit the assert during or immediately after the step-down, a couple of errors are thrown because that node still had a few pending tasks; those account for some of what we are seeing here:
3. If we instead wait a while for the node to step down and for all the tasks to drain, we get the following error, which seems like a reasonable thing to check for.

As for how to fix this, my guess is that we no longer want to look for case 1. A correct solution would be to wait until the last node is no longer the leader, then check for the case 3 error, and only then start the other nodes back up (see the sketch after this comment), although I'm not 100% sure this is something this test should be testing. Ping @aaronlehmann: in case 2, are those errors a bug? Should the remaining tasks be handled gracefully when a node steps down?
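To make the proposed fix concrete, here is a minimal sketch of what the wait could look like, written against the public Engine API client rather than the integration-cli daemon helpers the real test uses. The `waitForStepDown` helper, the timeout values, and the `"no leader"` error substring are all assumptions; the real fix would match whatever error swarmkit actually returns once the step-down settles.

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

// waitForStepDown repeatedly attempts a swarm write (a service create, as in
// the test) against the surviving manager. While the node still believes it
// is the leader, the write fails with the transient "deadline exceeded"
// error (case 1); once it has stepped down, the error should change to a
// stable "no leader" message (case 3), at which point the test can assert
// safely and restart the other managers. The error substring is an assumption.
func waitForStepDown(cli *client.Client, timeout time.Duration) error {
	spec := swarm.ServiceSpec{
		TaskTemplate: swarm.TaskSpec{
			ContainerSpec: &swarm.ContainerSpec{Image: "busybox:latest"},
		},
	}
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err := cli.ServiceCreate(ctx, spec, types.ServiceCreateOptions{})
		cancel()
		if err != nil && strings.Contains(err.Error(), "no leader") {
			return nil // case 3: the step-down has settled
		}
		// nil or "deadline exceeded": the node has not stepped down yet.
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("manager did not step down within %v", timeout)
}

func main() {
	// FromEnv assumes DOCKER_HOST points at the surviving manager.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	if err := waitForStepDown(cli, 30*time.Second); err != nil {
		fmt.Println(err)
	}
}
```

Only after `waitForStepDown` returns successfully would the test assert on the case 3 error and bring the stopped managers back up; the single assert on "deadline exceeded" goes away, which removes the race.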
Good observations. I think you are correct about the way to fix the test. Those errors in case 2 look like they are expected behavior, but we could possibly do a better job of making it clear what's happening and that nothing is wrong beyond the loss of leadership. The only line I'm not completely sure about is:
I can't think of a reason offhand why this would be affected by a leadership change. Each node is supposed to watch its own store, whether or not it's the leader. It might be worth looking into.
Got a PANIC from this test on powerpc: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/10621/console. The panic is induced deliberately in order to show a backtrace (which can be found in the bundle tarball). Not sure if we want to reopen this.
The above has also happened on z (https://jenkins.dockerproject.org/job/Docker-PRs-s390x/10497/console) and is coming from #37358 |
Saw a failure on z again (https://jenkins.dockerproject.org/job/Docker-PRs-s390x/10846/console):
And another PR (#37703), also on z (from https://jenkins.dockerproject.org/job/Docker-PRs-s390x/10847/console):
This test seems to be failing frequently, e.g. https://jenkins.dockerproject.org/job/Docker-PRs/45756 (#34908)