Fix Load Balancer Health Checks #724
Force-pushed from 1f3810c to a6e61bd.
This looks good to me in general.
A few quick points:
- We should probably do a bit of extra QA/testing to make sure this behaves as expected end-to-end for both the `Local` and `Cluster` policies. I can help with that.
- Should we add a quick section to our implementation details documentation describing how health checking is configured for each policy (and perhaps the reasoning, so that customers don't wonder why they can't configure it on their own anymore)?
- Could you please add a changelog item highlighting the change, especially as far as it concerns the "interface changes" (annotations being dropped in favor of a static and better LB health check configuration)?
- We should check in with the corresponding team internally to make sure the changes here will be reflected in our production documentation page as well.

Forgot one point: can you please update the annotations documentation accordingly as well?
Overall this looks awesome. Thank you Braden.
The only thing I have slight concerns with is the removal of the annotations without a deprecation period.
PR needs a rebase now.
```diff
@@ -2184,327 +2184,63 @@ func Test_buildHealthCheck(t *testing.T) {
 		errMsgPrefix string
 	}{
 		{
 			name: "tcp health check",
```
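For reference, the table-driven cases being discussed follow this general shape. This is an illustrative sketch with placeholder fields and a stand-in `buildHealthCheckSketch` helper, not the repo's actual `buildHealthCheck` signature:

```go
package main

import "fmt"

// healthCheck is a placeholder result type; the real test exercises the
// controller's buildHealthCheck, whose exact types are not shown here.
type healthCheck struct {
	Protocol string
	Port     int32
	Path     string
}

// buildHealthCheckSketch is a hypothetical helper standing in for the
// function under test.
func buildHealthCheckSketch(protocol string, port int32, path string) healthCheck {
	hc := healthCheck{Protocol: protocol, Port: port}
	if protocol == "http" {
		hc.Path = path // only HTTP checks carry a request path
	}
	return hc
}

func main() {
	// Table-driven cases, mirroring the "tcp health check" entry above.
	cases := []struct {
		name     string
		protocol string
		port     int32
		path     string
		want     healthCheck
	}{
		{name: "tcp health check", protocol: "tcp", port: 10256,
			want: healthCheck{Protocol: "tcp", Port: 10256}},
		{name: "http health check", protocol: "http", port: 10256, path: "/healthz",
			want: healthCheck{Protocol: "http", Port: 10256, Path: "/healthz"}},
	}
	for _, c := range cases {
		got := buildHealthCheckSketch(c.protocol, c.port, c.path)
		fmt.Printf("%s: ok=%v\n", c.name, got == c.want)
	}
}
```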
Should we bring some/all of the tests back now that we still support it (through the fallback annotation)?
I have added them back
When `ExternalTrafficPolicy=Local`, the configured health check node port will be used, which indicates whether the node has active pods.
In both scenarios, the change will have a positive effect on load balancing behaviour. It ensures that during lifecycle changes within the cluster, such as node autoscaling, node taints, and pods going up and down, we don't send traffic to components that are not in a state to serve traffic.
* A new annotation, `service.beta.kubernetes.io/do-loadbalancer-revert-to-old-health-check`, has been added to
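The annotation key above comes from the changelog entry; the lookup helper below is a hypothetical sketch of how such a fallback flag might be read from a Service's annotations, not the controller's actual code:

```go
package main

import "fmt"

// Key taken from the changelog entry above.
const revertAnnotation = "service.beta.kubernetes.io/do-loadbalancer-revert-to-old-health-check"

// useOldHealthCheck reports whether a Service opts back into the previous
// health check behaviour via the fallback annotation. (Hypothetical helper.)
func useOldHealthCheck(annotations map[string]string) bool {
	return annotations[revertAnnotation] == "true"
}

func main() {
	fmt.Println(useOldHealthCheck(map[string]string{revertAnnotation: "true"})) // true
	fmt.Println(useOldHealthCheck(map[string]string{}))                         // false
}
```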
Could you please add the new annotation to the annotations documentation?
added!
Our current health check implementation is flawed. The external load balancer should not be responsible for health checking the pods, since Kubernetes is already doing that. It should instead be health checking the worker nodes themselves. The behaviour also depends on the `externalTrafficPolicy` set on the Service itself.

If `externalTrafficPolicy=Local`, a health check node port is created to serve health checks for the node. If there are one or more active pods on the node, the health check returns a 200 and indicates how many healthy pods are running. If there are no active pods, it returns a 503 and fails the health checks. It is necessary for the LB to health check this dedicated endpoint to ensure we respect the lifecycle of the worker node and the pods running on it.

If `externalTrafficPolicy=Cluster`, we should health check kube-proxy running on the node. This is controlled by `--healthz-bind-address`, which defaults to `0.0.0.0:10256`. Health checking kube-proxy allows us to follow the lifecycle of the node: whether it is up and ready to serve traffic, or whether we should stop routing traffic because it is tainted, is being removed due to scaling down, or is being removed for maintenance. In this case, kube-proxy is responsible for understanding the health of each individual pod and managing which pods should receive traffic. In our current behaviour, an unhealthy pod can incorrectly mark an entire worker node as unhealthy, which isn't the case. If/when we make the switch to full Cilium, we will need to make sure we set `--kube-proxy-replacement-healthz-bind-address`.

This change delegates pod healthiness to Kubernetes, which it should have always done. The LB instead starts health checking the Kubernetes components themselves to ensure they are up and ready to process traffic, and we can stop sending traffic to them when they are no longer healthy or are being drained for any number of reasons. Pod unhealthiness will no longer mark nodes as unhealthy. This should also fix a number of edge cases customers are experiencing when using cluster autoscaling or pod autoscalers.
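The policy switch described above can be sketched as follows. The field names mirror the Kubernetes Service API, but this is an illustrative sketch with stand-in types, not the controller's implementation:

```go
package main

import "fmt"

// Minimal stand-ins for the relevant Service fields.
type serviceSpec struct {
	ExternalTrafficPolicy string
	HealthCheckNodePort   int32
}

// kube-proxy's default --healthz-bind-address is 0.0.0.0:10256.
const kubeProxyHealthzPort int32 = 10256

// healthCheckPort picks the node port the load balancer should probe.
func healthCheckPort(spec serviceSpec) int32 {
	if spec.ExternalTrafficPolicy == "Local" {
		// Local: the per-Service health check node port answers 200 while
		// the node has ready endpoints for the Service, and 503 otherwise.
		return spec.HealthCheckNodePort
	}
	// Cluster: probe kube-proxy's healthz endpoint so the LB follows the
	// node lifecycle rather than individual pod health.
	return kubeProxyHealthzPort
}

func main() {
	fmt.Println(healthCheckPort(serviceSpec{ExternalTrafficPolicy: "Local", HealthCheckNodePort: 30999})) // 30999
	fmt.Println(healthCheckPort(serviceSpec{ExternalTrafficPolicy: "Cluster"}))                           // 10256
}
```

Either way the LB probes a node-level endpoint, so a single failing pod can no longer take a whole node out of rotation.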
Force-pushed from 2d86330 to 3d31011, then from 3d31011 to eb49f1f.
LGTM. Please squash-merge once we're ready.