Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a wait-for-status parameter to the health_report API #107963

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

masseyke
Copy link
Member

This adds a wait-for-status parameter to the health_report API, much like the one already in the cluster health API. This is mainly useful in integration tests, where we want to wait until the cluster is in a known state before proceeding. For a multi-node cluster it can take a little time for the health node to come up and start reporting.
Closes #107796

Copy link

Documentation preview:

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

…masseyke/elasticsearch into adding-wait-for-status-to-health-report
@DaveCTurner
Copy link
Contributor

Hmm I'm a bit hesitant to do this. All those ?wait_for_X parameters in GET _cluster/health are really only useful in tests, and they're kind of a pain to maintain in the production API. Are you sure there's no other way to achieve this? For ESIntegTestCase variants we should be able to reach into the cluster to wait more directly. For REST tests, maybe an entirely separate API that's only available in those tests would be better?

@DaveCTurner
Copy link
Contributor

Also we're only waiting on cluster state updates here, but I believe the health report API will change state for reasons that don't directly relate to a cluster state update.

@masseyke
Copy link
Member Author

Also we're only waiting on cluster state updates here, but I believe the health report API will change state for reasons that don't directly relate to a cluster state update.

Yeah good point. The reason I originally set out to deal with was the status being UNKNOWN, which is because the health service isn't' online yet. And we get a cluster state change event when it comes online.

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Apr 26, 2024

I originally set out to deal with was the status being UNKNOWN

I could see value in waiting until the health service comes up, similar to how TransportMasterNodeAction waits (for 30s by default, controllable by ?master_timeout=) if there's no master at the time the request arrives. Indeed I think waiting like that could make sense as the default behaviour.

@nielsbauman
Copy link
Contributor

@dct @masseyke I'm afraid we currently don't have a way to determine whether the health service is online. The log line Node [...] is selected as the current health node. doesn't mean the full health service is online. After this log line (which will result in a cluster state update as Keith mentioned), the LocalHealthMonitor instances on the individual nodes will still have to send their health info to the health node. So, even if we wait for the cluster state to include a health node, there's still a chance that some health indicators will report UNKNOWN.

Either we'll have to implement a mechanism for determining whether the health service is online, or we're back to the original idea of this PR -- where we'd still need to address David's point regarding health updates not necessarily resulting in cluster state updates. I admittedly don't immediately have a good idea for such a mechanism.

@DaveCTurner
Copy link
Contributor

Either we'll have to implement a mechanism for determining whether the health service is online, or...

More strongly, I think the health API should wait to respond in the situation where it hasn't heard from all nodes yet, only returning UNKNOWN after some timeout.

@nielsbauman
Copy link
Contributor

Ah yeah that could work I think. I wonder what would be a good default behavior. If a cluster has trouble starting up (i.e. it's not just slowly starting up, it's not starting up) or a new node has issues, it might be confusing if the Health API takes 10/30/whatever seconds to load. Also, we'd need to take managed/internal use cases into account (e.g. the health page on cloud.elastic.co and maybe AutoOps in the future).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] CoreWithSecurityClientYamlTestSuiteIT test {yaml=health/10_basic/cluster health basic test} failing
4 participants