
# [Proposal] Enhance Resource Health Monitoring within App CR #717

Open · wants to merge 1 commit into base: develop

## Conversation

varshaprasad96

Addresses the issue: carvel-dev/kapp-controller#1412

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
netlify bot commented Jan 23, 2024

Deploy Preview for carvel ready!

- Latest commit: 5ae1a1f
- Latest deploy log: https://app.netlify.com/sites/carvel/deploys/65b0187fc1ec37000878d7b2
- Deploy Preview: https://deploy-preview-717--carvel.netlify.app


> We intend to extend the existing App API by adding a new status condition to expose the system's health. To do so, the following needs to be implemented:
>
> 1. The controller reconciling the App CR needs to dynamically set up watches for the resources being deployed by the package.
Author:

It would be helpful to know more and discuss how we can enable watches in the App reconciler. It looks like, currently, the kapp command is called and the App status is populated based on its output:
https://github.com/carvel-dev/kapp-controller/blob/df87efdcf0c0c140ff644c8286257cd38a74fd42/pkg/app/app_deploy.go#L25

If we could return the list of resources being created (which is currently present in the command output) and dynamically set up watches on them, it would make this easier.

For reference, this is how we do it in Rukpak: https://github.com/operator-framework/rukpak/blob/6a8a84c9aff05efaba7b05992704ad38462a7ee8/internal/controllers/bundledeployment/bundledeployment.go#L389-L402.
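
For illustration, here is a minimal sketch of what dynamically registered watches could look like, assuming the deploy step can hand back the GVRs it applied. It uses client-go's dynamic shared informer factory; the function and parameter names are hypothetical, not existing kapp-controller code.

```go
package watch

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

// StartWatches registers one informer per deployed GVR and calls enqueueApp
// whenever any of those resources changes, so the owning App gets re-reconciled.
func StartWatches(client dynamic.Interface, gvrs []schema.GroupVersionResource,
	enqueueApp func(), stopCh <-chan struct{}) {

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
	for _, gvr := range gvrs {
		inf := factory.ForResource(gvr).Informer()
		inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { enqueueApp() },
			UpdateFunc: func(oldObj, newObj interface{}) { enqueueApp() },
			DeleteFunc: func(obj interface{}) { enqueueApp() },
		})
	}
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}
```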

Comment:

This might not be ideal, but you can find all the resources by taking the involved GroupKinds from the ConfigMap associated with the kapp app, then listing/watching/informing using the appropriate app label selector.
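
A rough sketch of that approach, assuming the app label key/value can be read from the kapp app's ConfigMap metadata (the `kapp.k14s.io/app` key and the helper name below are illustrative): the factory from the sketch above could be swapped for a filtered one, so the cache only ever holds objects managed by this App.

```go
package watch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
)

// filteredFactory builds a dynamic informer factory scoped to the resources
// carrying this App's kapp label, limiting both the watch streams and the cache size.
func filteredFactory(client dynamic.Interface, appLabelValue string) dynamicinformer.DynamicSharedInformerFactory {
	return dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client,
		10*time.Minute,
		metav1.NamespaceAll,
		func(opts *metav1.ListOptions) {
			opts.LabelSelector = "kapp.k14s.io/app=" + appLabelValue
		},
	)
}
```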

Contributor:

Another possible approach would be to have kapp write some information about specific resources to a file upon reconciliation and have this information copied over to the status.
This is how we already get the used group versions and namespaces into the App status, using output from the `--app-metadata-file-output` flag.

This, however, means the information would only be refreshed whenever the App syncs.


> #### Use Case: Monitoring the state of resources
>
> Kapp currently has the `inspect` command which lists the resources deployed and their current statuses. The output of the command is also printed out as a part of the App's status if enabled through `rawOptions` while creating the CR.
Author:

To confirm: the output of the inspect command in the App's status is populated during deploy, after which it is not dynamically updated when the health of any resource changes? Am I missing anything here?

Contributor:

That is correct, it would only report the health of resources when a reconciliation occurs.

One correction (though it is not very relevant to the context) would be that `inspect` is not a part of `rawOptions`; enabling it would look more like:

```yaml
# ..... other spec
deploy:
  - kapp:
      inspect: {}
# ....
```

> ## Open Questions:
>
> 1. Can using informers to watch resources increase cache size, potentially impacting performance?
> 2. Can the output in the `inspect` status field be combined with that of the proposed `healthy` condition?
Author:

I guess this is what needs to happen eventually. It doesn't make sense to have two status fields serving a similar purpose. The healthy condition could just list all the resources (instead of only the failed/unhealthy ones), or vice versa.

Before refactoring the proposal accordingly, I would like to confirm the use case of `inspect` and make sure of the direction we would like to go in.

Contributor:

The inspect section initially just aggregated the statuses of all resources after a finished reconciliation. It was essentially the output of the kapp inspect command.
We disabled it by default in favour of reducing the number of API calls we make.

Since it is a separate feature altogether, I think we can work towards having a separate section for the additional information we want to surface.

Author:

> Since it is a separate feature altogether, I think we can work towards having a separate section for the additional information we want to surface.

If we are watching all the resources in the cluster anyway and triggering reconciles, I am wondering whether calling an inspect command on top of that is necessary. If it is, this may also end up loading the API server, since I would expect more reconciliations to be triggered by the dependent resources.

Additionally, the second aspect is `inspect` and `healthy` showing conflicting information to the user at any point in time (I haven't looked into the `inspect` codebase yet, but I assume the controller client and the one used with kapp are different?).

Given:
- `inspect` would show a superset: the health status of all the resources.
- `healthy` would show only the unhealthy resources.

If we decide to support both of them, then we should probably make them mutually exclusive?

@100mik (Contributor) left a comment:

Thank you so much for putting this together!
I aggregated some of my thoughts into comments.

So far the two goals that stand out for me are:

  1. Immediate reporting of failures for certain resources
  2. Structured per-resource reporting in case of failure

I would be curious to know if there is something else we are looking for that I am missing 🙏🏼
Let's take the discussion forward in the community meeting 🚀

> We intend to extend the existing App API by adding a new status condition to expose the system's health. To do so, the following needs to be implemented:
>
> 1. The controller reconciling the App CR needs to dynamically set up watches for the resources being deployed by the package.
> 2. Introduce a `Healthy` condition in App CR's `status` [field][app_cr_status].
Contributor:

I believed the original suggestion was to introduce conditions per resource as well; is that not required for your use case?
To illustrate, something like:

```yaml
- type: HealthCheck/someapi/someversion/somenamespace/resource
  status: "False"
  message: 'Failed to meet condition: "some more information"'
```

This could live in a separate field such as `status.resourceConditions`.

Comment:

I am not convinced we need to report a condition for every resource. Imagine an App whose result is hundreds of resources created on the cluster. The primary need is to signal to a user "this app is degraded". Including a subset of the unhealthy resources would be useful. I'd like to ensure we don't get anywhere close to the 1.5MB size limit for data in etcd, and whenever there's an unbounded data set (the number of resources, in this case), I start to get concerned.

From there, the user can go investigate further.

Author:

+1 to what Andy mentioned. A single condition that consists of a consolidated list of unhealthy objects is sufficient. Something like this:

```yaml
- lastTransitionTime: "2023-08-02T04:24:27Z"
  message: 'unhealthy resources: ["apiextensions.k8s.io/v1/CustomResourceDefinition/my.new.crd":"InvalidVersion", "deployments/test-ns/my-deploy":"MinimumReplicasUnavailable", "pods/test-ns/standalone-pod":"ImagePullBackoff"]'
```
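
To make that concrete, here is a minimal sketch of assembling such a consolidated message with lexicographic ordering and a hard size cap, so the condition stays well away from the etcd object size limit mentioned above. The type and function names are illustrative, not existing kapp-controller code.

```go
package health

import (
	"fmt"
	"sort"
	"strings"
)

// maxMessageLen is an illustrative cap, far below etcd's object size limit.
const maxMessageLen = 4096

// UnhealthyResource identifies a watched resource that failed its health check.
type UnhealthyResource struct {
	ID     string // e.g. "deployments/test-ns/my-deploy"
	Reason string // e.g. "MinimumReplicasUnavailable"
}

// HealthyConditionMessage builds the Message of the proposed Healthy condition
// from the set of unhealthy resources.
func HealthyConditionMessage(unhealthy []UnhealthyResource) string {
	if len(unhealthy) == 0 {
		return "all watched resources are healthy"
	}
	entries := make([]string, 0, len(unhealthy))
	for _, r := range unhealthy {
		entries = append(entries, fmt.Sprintf("%q:%q", r.ID, r.Reason))
	}
	// Lexicographic ordering, as proposed, keeps the message stable across syncs.
	sort.Strings(entries)

	msg := "unhealthy resources: [" + strings.Join(entries, ", ") + "]"
	if len(msg) > maxMessageLen {
		msg = msg[:maxMessageLen] + "...(truncated)"
	}
	return msg
}
```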


> 7. All other unspecified resources will be considered healthy.
>
> If any of the watched resources is unhealthy, the `Message` field of the healthy condition will have the statuses of the unhealthy resources ordered lexicographically.
Contributor:

I think it would be helpful to note cases where the `ReconcileFailed` condition is not present but the `Healthy` condition is false.
If this is not a possibility, is the problem we are trying to solve having more structured information about failed resources surfaced?

Comment:

See my previous comment regarding not wanting to list every unhealthy resource.


> If any of the watched resources is unhealthy, the `Message` field of the healthy condition will have the statuses of the unhealthy resources ordered lexicographically.
>
> Since the resources deployed by the App reconciler have informers created for them, any change in resource state will trigger a reconcile that in turn will re-evaluate the health of all resources.
Contributor:

There is value in being able to treat some resources as more critical! (carvel-dev/kapp-controller#1279 comes to mind.)

Today, we have a mechanism which leads to immediate reconciliation on failure. However, in case of repeated failure, the reconciler exponentially backs off, meaning it would take longer to reconcile the app again if it has already failed more than 3 times (for example).
Worth noting that the longest the app waits will always be equal to its `syncPeriod`.

This prevents an app or a set of apps that is doomed to fail from hogging the reconciliation queue. Would we want something similar here as well?
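
For reference, a generic sketch of that backoff shape using client-go's workqueue rate limiter; this is illustrative, not kapp-controller's actual implementation, and capping the delay at `syncPeriod` is an assumption taken from the description above.

```go
package requeue

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newAppRateLimiter returns a per-item rate limiter whose delay grows
// exponentially on repeated failures, starting at one second and never
// exceeding the App's syncPeriod.
func newAppRateLimiter(syncPeriod time.Duration) workqueue.RateLimiter {
	return workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, syncPeriod)
}

// Usage sketch: a health-watch queue built on this limiter would requeue an
// App almost immediately on the first failure and progressively later on repeats:
//
//	q := workqueue.NewRateLimitingQueue(newAppRateLimiter(30 * time.Minute))
//	q.AddRateLimited(appKey)
```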




> Kapp currently has the `inspect` command which lists the resources deployed and their current statuses. The output of the command is also printed out as a part of the App's status if enabled through `rawOptions` while creating the CR.
>
> Though this command provides information about the resources created by the respective App CR, it does so by sending API requests during reconciliation. Instead, using informers provides the additional advantages of real-time updates, efficient resource utilization, and reduced load on the API server.
Contributor:

If we work with informers, it might be interesting to see what optimal number of watched resources we would recommend, keeping resource utilisation in mind.
(Just a note, not something this proposal should address.)

Comment on lines +50 to +56:

> 5. An APIService resource will be healthy if/when:
>    - `Available` type condition in status is true.
>
> 6. A CustomResourceDefinition resource will be healthy if/when:
>    - `StoredVersions` has the expected API version for the CRD.
>
> 7. All other unspecified resources will be considered healthy.
Member:

Is there any case where 5, 6, or 7 would not lead to a deployment failure? If so, do we really need to report health for these resource types?
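
Independent of that question, here is a minimal sketch of how the quoted rules (5: APIService `Available`, 6: CRD `StoredVersions`, 7: everything else healthy by default) could be evaluated against unstructured objects from the informers; the helper names are illustrative, not existing kapp-controller code.

```go
package health

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// isHealthy applies the proposed rules: an APIService needs an Available=True
// condition, a CRD needs the expected version in status.storedVersions, and
// any unspecified kind is treated as healthy.
func isHealthy(obj *unstructured.Unstructured, expectedCRDVersion string) bool {
	switch obj.GetKind() {
	case "APIService":
		return hasCondition(obj, "Available", "True")
	case "CustomResourceDefinition":
		stored, _, _ := unstructured.NestedStringSlice(obj.Object, "status", "storedVersions")
		for _, v := range stored {
			if v == expectedCRDVersion {
				return true
			}
		}
		return false
	default:
		// Rule 7: all other unspecified resources are considered healthy.
		return true
	}
}

// hasCondition reports whether a status condition of the given type carries
// the given status value.
func hasCondition(obj *unstructured.Unstructured, condType, condStatus string) bool {
	conds, _, _ := unstructured.NestedSlice(obj.Object, "status", "conditions")
	for _, c := range conds {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == condType && cond["status"] == condStatus {
			return true
		}
	}
	return false
}
```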


> ## Open Questions:
>
> 1. Can using informers to watch resources increase cache size, potentially impacting performance?
Member:

I think this will be the case since the same kapp-controller can be in charge of hundreds of apps, and if we do this for all the apps, we might end up getting informers for every resource in the cluster.

Author:

We could have this as an optional feature to start with, similar to how `inspect` is right now?

Comment:

If this proposal is implemented, health would ultimately be determined by evaluating only the following kinds:

- Pod
- ReplicationController
- ReplicaSet
- Deployment
- StatefulSet
- APIService
- CustomResourceDefinition

This would mean the additional overhead is at most 6 more informers, with label selectors limiting the cache contents to just what kapp-controller is managing. We don't have to have informers for the App resources that do not contribute to the health condition, right?
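
A small sketch of that bounding, assuming the deploy step can report which GVRs the App actually created (the variable and function names below are illustrative):

```go
package health

import "k8s.io/apimachinery/pkg/runtime/schema"

// healthRelevantGVRs is the fixed set of kinds that feed the Healthy condition.
var healthRelevantGVRs = []schema.GroupVersionResource{
	{Version: "v1", Resource: "pods"},
	{Version: "v1", Resource: "replicationcontrollers"},
	{Group: "apps", Version: "v1", Resource: "replicasets"},
	{Group: "apps", Version: "v1", Resource: "deployments"},
	{Group: "apps", Version: "v1", Resource: "statefulsets"},
	{Group: "apiregistration.k8s.io", Version: "v1", Resource: "apiservices"},
	{Group: "apiextensions.k8s.io", Version: "v1", Resource: "customresourcedefinitions"},
}

// gvrsToWatch intersects what the App actually deployed with the
// health-relevant set, so an App that deploys none of these kinds
// gets no extra informers at all.
func gvrsToWatch(deployed map[schema.GroupVersionResource]bool) []schema.GroupVersionResource {
	var out []schema.GroupVersionResource
	for _, gvr := range healthRelevantGVRs {
		if deployed[gvr] {
			out = append(out, gvr)
		}
	}
	return out
}
```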

Comment:

Yes, it'd only need to be a subset of all APIs

Author:

Based on the discussion in the community meeting today:

The two use cases for setting up informers to watch resources are:

1. Health monitoring and aggregating status.
2. Triggering a reconcile if any resource is unhealthy.

- From OLM's end, the use case we want to fulfil is (1).
- (2) is something that can cause performance issues, in terms of continuously reconciling for unhealthy resources even if we have informers set up for a limited number of GVKs (especially on clusters where kapp-controller is managing a large number of App CRs).

If (2) is not to be addressed, then to maintain modularity in kapp-controller's functionality, @joaopapereira suggested we explore having a separate controller, which could be optionally enabled, to monitor the health status of individual resources.
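
As a very rough illustration of that optional, separate controller idea (shown with controller-runtime for brevity even though kapp-controller's own wiring differs; the flag and helper names are hypothetical):

```go
package main

import (
	"flag"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	enableHealthMonitoring := flag.Bool("enable-health-monitoring", false,
		"watch deployed resources and surface a Healthy condition on App CRs")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// The existing App reconciler keeps its current behaviour; the health
	// controller is only registered when explicitly enabled.
	if *enableHealthMonitoring {
		setupHealthController(mgr)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}

// setupHealthController is a placeholder for wiring up the hypothetical
// health-monitoring reconciler: it would watch the health-relevant kinds
// and update the Healthy condition on App statuses.
func setupHealthController(mgr ctrl.Manager) {
	// Intentionally left as a stub in this sketch.
	_ = mgr
}
```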


> 1. Can using informers to watch resources increase cache size, potentially impacting performance?
> 2. Can the output in the `inspect` status field be combined with that of the proposed `healthy` condition?

Member:

Another open question: if kapp-controller starts reacting to all changes in the cluster, what will happen to performance in general? At that point, kapp-controller could become the major consumer of CPU in the whole cluster.
