RandomizedRaftTest.livenessTestWithNoSnapshot fails because member is ACTIVE not READY #10545

Closed
oleschoenburg opened this issue Sep 28, 2022 · 3 comments · Fixed by #10640
Assignees
Labels
kind/bug: Categorizes an issue or PR as a bug
kind/flake: Categorizes issue or PR as related to a flaky test
version:8.1.1: Marks an issue as being completely or in parts released in 8.1.1
version:8.2.0-alpha1: Marks an issue as being completely or in parts released in 8.2.0-alpha1
version:8.2.0: Marks an issue as being completely or in parts released in 8.2.0

Comments

@oleschoenburg
Member

org.opentest4j.AssertionFailedError: 
 expected: READY
 but was: ACTIVE
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at io.atomix.raft.ControllableRaftContexts.lambda$assertAllMembersAreReady$20(ControllableRaftContexts.java:516)
	at java.base/java.util.HashMap$Values.forEach(HashMap.java:1065)
	at io.atomix.raft.ControllableRaftContexts.assertAllMembersAreReady(ControllableRaftContexts.java:516)
	at io.atomix.raft.RandomizedRaftTest.livenessTest(RandomizedRaftTest.java:226)
	at io.atomix.raft.RandomizedRaftTest.livenessTestWithNoSnapshot(RandomizedRaftTest.java:141)

surefire-reports.zip

@oleschoenburg oleschoenburg added the kind/flake Categorizes issue or PR as related to a flaky test label Sep 28, 2022
@oleschoenburg
Member Author

Here's what jqwik has to say:

                              |-------------------jqwik-------------------
tries = 3                     | # of calls to property
checks = 3                    | # of not rejected calls
generation = RANDOMIZED       | parameters are randomly generated
after-failure = PREVIOUS_SEED | use the previous seed
when-fixed-seed = ALLOW       | fixing the random seed is allowed
edge-cases#mode = NONE        | edge cases are not explicitly generated
seed = -7949969560383210802   | random seed to reproduce generated values

I can't reproduce it with those seeds.

@deepthidevaki deepthidevaki self-assigned this Oct 7, 2022
@deepthidevaki
Contributor

Was able to reproduce it with seed = "6434342535110401459" on main at commit bb335699f503a467a58f354c9631305a57f39e4d. @oleschoenburg was also able to reproduce it with the same seed. No idea why the seed from the CI does not reproduce the failure.

@deepthidevaki
Contributor

This is what happened:

  1. Node 1 is the leader
  2. The leader sends an AppendRequest to node 2. Before sending, it updates RaftMemberContext#inFlightAppendCount to 1.
  3. Node 1 steps down
  4. Node 1 becomes leader again
  5. Node 1 resets the member context of 2. So inFlightAppendCount is reset to 0
  6. Node 1 sends a new append request to node 2, inFlightAppendCount is set to 1
  7. The first request times out. On handling the response it decrements the inFlightAppendCount to 0. BUG! This request was sent when this node was leader in the previous term. So the response should not be handled in this term.
  8. The second request times out. On handling the response it decrements the inFlightAppendCount to -1.
  9. The next time the leader attempts to send an append request, RaftMemberContext#canAppend returns false because inFlightAppendCount != 0. As a result, the leader will never send a heartbeat or a new append request to node 2, so node 2 can never become ready.
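
To make the sequence above easier to follow, here is a minimal, self-contained sketch that replays steps 2 to 8 against a toy counter. It is not the Zeebe/Atomix code: `MemberContext`, `LeaderRole`, `sendAppend`, and the `guardClosedRole` flag are hypothetical stand-ins, and `canAppend` is assumed to be a plain `inFlightAppendCount == 0` check, which may be simpler than the real method. The guarded variant only illustrates the idea named in the fix title ("do not handle response if role is already closed").

```java
public class InFlightAppendSketch {

  /** Hypothetical stand-in for RaftMemberContext: tracks only in-flight appends. */
  static final class MemberContext {
    int inFlightAppendCount = 0;

    // Assumption: the leader may only send when nothing is in flight.
    boolean canAppend() {
      return inFlightAppendCount == 0;
    }

    void startAppend() {
      inFlightAppendCount++;
    }

    void completeAppend() {
      inFlightAppendCount--;
    }

    void reset() {
      inFlightAppendCount = 0;
    }
  }

  /** Hypothetical leader role for one term; closed when the node steps down. */
  static final class LeaderRole {
    private final MemberContext member;
    private final boolean guardClosedRole;
    private boolean open = true;

    LeaderRole(MemberContext member, boolean guardClosedRole) {
      this.member = member;
      this.guardClosedRole = guardClosedRole;
    }

    /** Sends an append and returns the handler that runs on response or timeout. */
    Runnable sendAppend() {
      member.startAppend();
      return () -> {
        if (guardClosedRole && !open) {
          return; // response belongs to a closed role, so it is ignored
        }
        member.completeAppend();
      };
    }

    void close() {
      open = false;
    }
  }

  /** Replays the scenario and returns the final inFlightAppendCount for node 2. */
  static int simulate(boolean guardClosedRole) {
    MemberContext node2 = new MemberContext();

    LeaderRole firstTerm = new LeaderRole(node2, guardClosedRole);
    Runnable firstTimeout = firstTerm.sendAppend(); // step 2: count = 1
    firstTerm.close();                              // step 3: node 1 steps down

    LeaderRole secondTerm = new LeaderRole(node2, guardClosedRole); // step 4
    node2.reset();                                  // step 5: count = 0
    Runnable secondTimeout = secondTerm.sendAppend(); // step 6: count = 1

    firstTimeout.run();  // step 7: stale response from the previous term
    secondTimeout.run(); // step 8: response for the current term
    return node2.inFlightAppendCount;
  }

  public static void main(String[] args) {
    System.out.println("without guard: " + simulate(false));
    System.out.println("with guard:    " + simulate(true));
  }
}
```

Without the guard the counter ends at -1, so the assumed `canAppend` check stays false (step 9). With the guard the stale handler is a no-op and the counter returns to 0.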

@deepthidevaki deepthidevaki added the kind/bug Categorizes an issue or PR as a bug label Oct 7, 2022
zeebe-bors-camunda bot added a commit that referenced this issue Oct 10, 2022
10656: [Backport stable/8.1] fix(raft): do not handle response if role is already closed r=deepthidevaki a=backport-action

# Description
Backport of #10640 to `stable/8.1`.

closes #10545

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Oct 10, 2022
10657: [Backport stable/8.0] ci: merge deploy and auto-merge workflows into unified CI workflow r=oleschoenburg a=oleschoenburg

manual backport of #10616

10659: [Backport stable/8.0] fix(raft): do not handle response if the role is closed r=oleschoenburg a=deepthidevaki

## Description

Backports #10640 

Changes to the `ControlledRaftContext` are not backported as the original code does not exist in this version.

closes #10545 

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@korthout korthout added the version:8.1.1 Marks an issue as being completely or in parts released in 8.1.1 label Oct 13, 2022
@korthout korthout added version:8.2.0-alpha1 Marks an issue as being completely or in parts released in 8.2.0-alpha1 release/8.0.8 labels Nov 1, 2022
@npepinpe npepinpe added the version:8.2.0 Marks an issue as being completely or in parts released in 8.2.0 label Apr 5, 2023

4 participants