Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance documentation for the Gardenlet's /healthz endpoint #3359

Merged
merged 1 commit into from Jan 8, 2021

Conversation

danielfoehrKn
Copy link
Contributor

@danielfoehrKn danielfoehrKn commented Jan 7, 2021

How to categorize this PR?

/kind bug
/priority normal
/area documentation

What this PR does / why we need it:
Based on #2925 but uses direct clients instead of listers.

The health manager changes the gardenlets /healthz endpoint to DOWN if it cannot renew the lease in the Garden cluster.
However, this does not cover the case when the Gardener Extended API Server is down.

The problem is, that Seeds can be queried from the lister even though the Gardener Extended API Server is down. This can be prevented by using a direct client.

Advantage:

  • Gardenlet fails when Gardener Extended API Server is down. I think this is the correct behaviour, as the Gardenlet cannot work without it.

Disadvantage:

  • more network traffic (one request per 10 s)
  • I propose to decrease the ping interval from 10s -> 20s to save network traffic.

UPDATE: THis PR now only includes enhanced documentation for the Gardenlet's /healthz endpoint.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Enhance documentation for Gardenlet's /healthz endpoint.

@danielfoehrKn danielfoehrKn requested a review from a team as a code owner January 7, 2021 16:18
@gardener-robot gardener-robot added kind/bug Bug priority/normal size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 7, 2021
@timuthy
Copy link
Contributor

timuthy commented Jan 8, 2021

However, this does not cover the case when the Gardener Extended API Server is down.

I'm generally not objecting but curious if you experienced any particular problems leaving the Gardenlet up and running while the API server was down? So far I was quite confident that a component like Gardenlet can deal with such intermittent issues.

@rfranzke
Copy link
Member

rfranzke commented Jan 8, 2021

I'm still struggling a bit with this with respect to the advantages we get compared with the network I/O?

@timebertt
Copy link
Member

If network I/O is a big concern, we might wait for #3109 and use the metadata only client. This should be enough for this scenario, as we don't need the spec and saves network traffic.

@rfranzke
Copy link
Member

rfranzke commented Jan 8, 2021

Yeah, maybe. I'm still in the phase of understanding the motivation for this change and if we experienced any particular problem with it (or if we foresee problems if we don't introduce this now).

@timebertt
Copy link
Member

Alternatively, we could also keep the watch but use some event handlers for delete events instead of using the lister.
Though, this would make the logic event-driven instead of level-driven, which might lead to issues, when we miss events.

@danielfoehrKn
Copy link
Contributor Author

I'm generally not objecting but curious if you experienced any particular problems leaving the Gardenlet up and running while the API server was down?

No, I did not.
My main motivation was that I "expected" it to fail because I thought that the /healthzendpoint should always indicate whether the Gardenlet is healthy / can work.

Thinking about it again, we might be better off just documenting the behaviour - the /healthzendpoint only states that Gardenlet has a lease with the Garden cluster.

@rfranzke
Copy link
Member

rfranzke commented Jan 8, 2021

Yeah, @danielfoehrKn, maybe that's a good first step to clarify on what can be expected.

@timebertt
Copy link
Member

I guess, we should generally improve our health checks and do a clean separation between liveness and readiness (ref gardener-attic/gardener-resource-manager#102 (comment)).
Then we can report an unready state, when caches haven't been synced for a minute or so. This would also cover the stale lease/seed case addressed here.

@gardener-robot gardener-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 8, 2021
@danielfoehrKn
Copy link
Contributor Author

I guess, we should generally improve our health checks and do a clean separation between liveness and readiness (ref gardener-attic/gardener-resource-manager#102 (comment)).
Then we can report an unready state, when caches haven't been synced for a minute or so. This would also cover the stale lease/seed case addressed here.

Maybe you can open an issue for this describing this in more detail?
For this PR, I would only include the documentation - if that is fine with you.

@gardener-robot gardener-robot added the area/documentation Documentation related label Jan 8, 2021
@danielfoehrKn danielfoehrKn changed the title Direct client for gardenlet liveness-probe Enhance documentation for the Gardenlet's /healthz endpoint Jan 8, 2021
Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@rfranzke rfranzke merged commit 6749697 into gardener:master Jan 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/documentation Documentation related kind/bug Bug size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants