Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing the conditions for fetching remote master history #89472

Merged
merged 5 commits into from Aug 22, 2022

Conversation

masseyke
Copy link
Member

As part of the master stability health check, when a non-master node thinks that it has seen the master go null a number of times it checks with that master to see if it also thinks it has gone null. This check is triggered by a cluster change event telling us that the current master is null and the previous was not (and after checking that the master has gone null repeatedly). The problem is that it is possible that the master went null due to something like a long GC pause, and that same pause could mean that the check times out. Once the master comes back and we get a cluster change even that the master is not null, we're not reaching out to the master for its history. So when a user query comes in all we know is that there was a timeout exception. That means we would get a YELLOW response when we ought to get a GREEN. This commit changes the check in the cluster changed event so that it also queries the master if it has just become non-null (and still checks that the master has gone null repeatedly).
This commit also fixes a couple of minor bugs in the test code.
Closes #89431

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for fixing this Keith

@masseyke
Copy link
Member Author

@elasticsearchmachine update branch

@masseyke
Copy link
Member Author

@elasticmachine update branch

@masseyke masseyke merged commit 5b3d51d into elastic:main Aug 22, 2022
@masseyke masseyke deleted the fix/stable-master-check branch August 22, 2022 16:18
elasticsearchmachine pushed a commit that referenced this pull request Aug 22, 2022
Unmuting the test muted by #89501 because it has been fixed in #89472
masseyke added a commit to masseyke/elasticsearch that referenced this pull request Sep 13, 2022
Fixing the conditions that the health API uses to determine when to check with a master node for its view
of master history if the master appears to have gone null repeatedly.
elasticsearchmachine pushed a commit that referenced this pull request Sep 13, 2022
…0040)

Fixing the conditions that the health API uses to determine when to check with a master node for its view
of master history if the master appears to have gone null repeatedly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] StableMasterDisruptionIT testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable failing
4 participants