Design doc for offline processing #7142


@nrfox (Contributor) commented Feb 20, 2024

Describe the change

Adds a design doc for "offline processing" of Kiali data.

Relates to #7076

POC PR: #7136

@jshaughn (Collaborator) left a comment:

@nrfox Looks good, no fundamental issues for me, just a bunch of suggested edits for clarity/grammar.


# Motivation

Kiali currently has acceptable performance below a certain threshold of scale. That "scale" could be the number of pods, services, Istio resources, namespaces, etc., but in general Kiali reaches a certain threshold of one or more of these factors and becomes noticeably slow. Typically this manifests as very long page load times (30s+) and very slow (30s+) API responses. Kiali should remain performant even at larger scale.

Notably, Kiali does most processing within the lifecycle of a request. This means that before the API responds to a request, it fetches data from external dependencies (e.g. Prometheus, Jaeger), performs some processing (graph generation, validations), and then transforms the results into an API response. If any one of those tasks performs poorly, or fails entirely, the response can be extremely slow or fail altogether.

# Solution

A "Kiali model" has been [previously discussed](https://github.com/kiali/kiali/discussions/4080): an in-memory, pre-computed cache of data that Kiali derives, such as Validations, Health, and TLS status. Kiali would compute this data outside of a request and then cache it for some period of time. This KEP expands on that idea by providing a specific framework for how to compute and cache that data. This framework is henceforth called the "controller model" and it follows the same pattern as most Kube controllers.

A typical Kubernetes controller continually watches some objects for changes, and when an object does change, it reads the current state, does some work to get things to the desired state, then updates the status of the object. For example, a deployment controller might watch for deployments to be created, and on creation the deployment controller: reads the deployment spec --> creates pods according to the spec --> updates the deployment status with the pods it created. When someone reads that deployment from the Kube API server, the API server does not compute the deployment status; it simply serves up what is saved in etcd.

This KEP proposes that Kiali follow a similar pattern and have different controllers compute and cache in memory the data that the Kiali API returns. These controllers will run in the same binary as Kiali and there won't be any additional deployment requirements. Each controller can read/watch one or more sources, such as `VirtualService` objects from the kube API, or data outside of Kube, like proxy status for workload `Health`. After gathering the inputs, the controllers would compute something like Validations and then update the Kiali Cache.

![Validations Controller](Validations_Controller.png "Validations Controller")

The controller watches each source and, when one changes, it validates the object and then updates the Kiali Cache with the validation. When the frontend asks for validations, the API reads what is in the Kiali Cache. Because validations are served directly from memory rather than computed on the fly, API response times are very fast and remain that way even as the number of objects grows.
> Review comment (Collaborator): The example is fine, although since validations can involve multiple objects (configs) it may make sense to perform all validations on any change?

An advantage of the controller model is being able to reuse Kubernetes libraries and patterns for building controllers that handle setting up watches, parallel processing, retries on failure, etc. Most of the sources will come from Kubernetes. Non-Kubernetes sources can be implemented with polling if they do not have some kind of "watch" mechanism. Kubernetes sources will be updated almost instantaneously, making this a near "real time" solution. Non-Kubernetes sources will be limited by how often they poll the source, but probably not more than every 15-30s. This amount of lag is acceptable for Kiali's use cases and is a reasonable tradeoff for better performance.

There are a few downsides to this approach:

1. Caching more objects in memory will require greater memory usage. The Kiali cache is an in-memory cache, and storing more objects in memory will lead to an increase in memory consumption. This can be mitigated somewhat by only storing the results of computations in the Kiali cache, for example storing a graph traffic-map rather than all of the individual metrics used in its generation. There are also some optimizations to be made by reducing the amount of memory consumed by the Kubernetes cache that Kiali uses: https://github.com/kiali/kiali/issues/7017. This could offset increased memory consumption by the Kiali cache. Ultimately, though, there's no free lunch: storing more objects in memory will require more memory. Kiali will need to keep the size of this cache reasonably small.