Performance problems with custom dashboards discovery #3660

Closed
lucasponce opened this issue Feb 1, 2021 · 17 comments
@lucasponce
Contributor

Querying workload details [1] shows a significant number of queries between v1.26 and v1.29 in a highly populated cluster.

Info is provided by @primeroz (thanks a lot!)

[screenshot provided by @primeroz]

So, this deserves an investigation into what could be going on.

[1] http://localhost:20001/kiali/console/namespaces/twodotoh/workloads/gateway-external-write?tab=info&duration=600&refresh=0

@primeroz

primeroz commented Feb 1, 2021

For reference, this is a relatively big cluster

  • 80 nodes
  • 250 pods within the mesh, mostly in the same namespace as the workload page that fails to load

When loading the workload page (the workload has about 30 running pods), Kiali will often OOM, jumping from a normal usage of a few hundred MB to its limit of 6GB.

Sometimes, especially on 1.29, it also caused Prometheus to OOM, raising its RAM usage from an average of 14GB to 24GB (its limit).

@jotak
Contributor

jotak commented Feb 2, 2021

hey @primeroz @lucasponce ,
I've found quite a serious bug in our prom client code, but I have no reason to think it can cause your issue. Still... it might be interesting to test in your environment!
I've also added some trace-level logs.
All is there: #3664

Checking the new logs at a small scale, I don't find anything unexpected. Loading a Workload Details page triggers just 5 Prometheus queries, all of them expected (it's certainly possible to optimize, as there's some redundancy, but nothing huge like what we saw in your case):

2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{destination_workload_namespace="default",destination_workload="ball-base"}[60s]) > 0
2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{source_workload_namespace="default",source_workload="ball-base"}[60s]) > 0
2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{destination_service_namespace="default",source_workload_namespace!="default"}[60s]) > 0
2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{destination_service_namespace="default"}[60s]) > 0
2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{source_workload_namespace="default"}[60s]) > 0

So... unless the bug I fixed somehow triggers large side effects that I'm not seeing, I can't yet explain how you ended up with something like 3500 Prometheus hits for a single workload page refresh.

It would be nice to run my PR in your environment and check the logs (we're on Slack to help with this).

@lucasponce
Contributor Author

@jotak is it possible that the minigraph is so populated in such an environment that it creates that load in the details page?

@primeroz

primeroz commented Feb 2, 2021

@jotak I can definitely run your version. Do you have a built image of Kiali I can pull? Also, how do I enable the extra tracing?

I did some more tests last night and v1.26, after initially working, started to fail in the same way. I could see some "version is too old" errors related to the caching, but had no time to investigate further.

I was planning to get back to it later today and do some more debugging.

Also, I was trying to figure out how to get the Kiali custom dashboard to show up for the Kiali workload. I guess I need to make Kiali part of the mesh? (It is not, in my testing.)

@lucasponce
Contributor Author

> I did some more tests last night and v1.26, after initially working, started to fail in the same way. I could see some "version is too old" errors related to the caching, but had no time to investigate further.

I can confirm this is due to the bump of the client-go libraries; it should be harmless, just noisy in the logs.

@jotak
Contributor

jotak commented Feb 2, 2021

> @jotak is it possible that the minigraph is so populated in such an environment that it creates that load in the details page?

If I'm correct, all the topology & associated health data are taken from just these three queries, which should stay the same regardless of the number of nodes in the graph:

rate(istio_requests_total{destination_service_namespace="default",source_workload_namespace!="default"}[60s]) > 0
rate(istio_requests_total{destination_service_namespace="default"}[60s]) > 0
rate(istio_requests_total{source_workload_namespace="default"}[60s]) > 0

So I don't think the graph size should matter, at least in terms of prom queries, unless I'm missing something.

@jotak
Contributor

jotak commented Feb 2, 2021

> @jotak I can definitely run your version. Do you have a built image of Kiali I can pull? Also, how do I enable the extra tracing?

I've just pushed this image: quay.io/jotak/kiali:dev. To increase the logging level, set the Kiali env var LOG_LEVEL to trace.
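
For reference, a minimal sketch of setting that env var directly on the Kiali deployment's container (plain Kubernetes container env; if you deploy via the operator there may be a dedicated CR field for the log level, so treat this as illustrative):

```yaml
# Snippet of the Kiali Deployment's container spec (illustrative):
containers:
  - name: kiali
    image: quay.io/jotak/kiali:dev
    env:
      - name: LOG_LEVEL
        value: "trace"   # enables the new trace-level logs from #3664
```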

> I did some more tests last night and v1.26, after initially working, started to fail in the same way. I could see some "version is too old" errors related to the caching, but had no time to investigate further.
>
> I was planning to get back to it later today and do some more debugging.
>
> Also, I was trying to figure out how to get the Kiali custom dashboard to show up for the Kiali workload. I guess I need to make Kiali part of the mesh? (It is not, in my testing.)

It probably depends on how you installed Istio and Kiali. Basically, there's a CRD defined by Kiali that must be installed, named "monitoringdashboards.monitoring.kiali.io"; do you have it?
Kiali doesn't have to be part of the mesh for that; however, Prometheus has to scrape the Kiali pod. You can configure an alternate Prometheus URL if you don't want to mix these metrics with the mesh metrics, through the "custom_dashboards" configuration here: https://github.com/kiali/kiali-operator/blob/master/deploy/kiali/kiali_cr.yaml#L419-L432. If you leave the custom_dashboards.prometheus config empty, it will just pick the main/Istio Prometheus config instead.
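
For illustration, a minimal sketch of that part of the Kiali CR, following the custom_dashboards section of the linked kiali_cr.yaml; the exact field paths and the Prometheus URL below are assumptions, not a verified configuration:

```yaml
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
spec:
  external_services:
    custom_dashboards:
      enabled: true
      prometheus:
        # Optional alternate Prometheus for custom dashboard metrics (assumed field path).
        # Leave this empty to fall back to the main/Istio Prometheus configuration.
        url: "http://prometheus-workloads.monitoring:9090"
```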

@primeroz

primeroz commented Feb 2, 2021

After chatting with @jotak (thanks, you were super helpful), it turned out that the issue has to do with the custom dashboards.

Of note, the biggest impact was when opening the Application or Workload page for an app (label app) that had about 120 pods in it.

The timings are 89s for getCustomDashboardRefs and getWorkloads

[screenshot of the timing traces]

I disabled the custom dashboards and now Kiali is as snappy as it can possibly be; RAM usage is also back to normal.

I will upgrade back to 1.29 to confirm this issue has nothing to do with the version; then it might be good to change the scope of this issue, or close it and create another one.

@primeroz

primeroz commented Feb 2, 2021

Also on 1.29, with custom dashboards disabled, Kiali's performance is good.

Is this because of the autodiscovery functionality, where I have N_OF_PODS_IN_APP * N_OF_DISCOVER_ON_METRICS queries in parallel? (So in my case, 120 * 13 = 1560.)
It would be great to be able to disable autodiscovery and rely on https://kiali.io/documentation/latest/runtimes-monitoring/#pods-annotations

Anyway @jotak @lucasponce, might it be good to close this case in favour of a new one about custom dashboards, to avoid confusion?

Thanks for your help!

@lucasponce
Contributor Author

lucasponce commented Feb 2, 2021

> Anyway @jotak @lucasponce, might it be good to close this case in favour of a new one about custom dashboards, to avoid confusion?

Oh, I've just re-read the whole thread.

Good finding; thanks @primeroz and @jotak for the time spent working on it.

If the dashboards issue is the main root cause, I think the same issue can be re-used; if it's an additional/different but related topic, then it could be split into another one for better tracking.

I think I missed the context in my first response.

@jotak jotak changed the title Performance problems between Kiali v1.26 and v1.29 Performance problems with custom dashboards discovery Feb 2, 2021
@jotak
Contributor

jotak commented Feb 2, 2021

I renamed this issue. Some findings to share:

The issue is mostly seen on applications or workloads that have many pods, because many pods x high metrics cardinality = an exploding volume of data. There is good news, however: most queries that Kiali runs are fine. Just one query is really a problem, and it's not a central feature: it's the api.Series prom endpoint, used to get all series matching a given labelset, and it is used exclusively in the custom dashboards discovery process.

So there are several things we can do:

  • The easy, no-code workaround is just to disable custom dashboards.

  • We could add a second flag in the custom dashboards config to disable dashboards discovery while keeping the rest of the feature. This is interesting because we would still have the capability of self-debugging Kiali (and it proved to be useful right here; OK, that's a chicken-and-egg problem).

  • This specific query should use a low cancellation timeout in its context (10s maybe?), because it impacts the loading time of the Workload details page, and we don't want it to create a bad experience when, 99% of the time, there will be no dashboard found.

  • Prometheus v2.24, the latest version, has improved its API [1] in a way that should be very helpful here. Instead of calling api.Series, we can call api.LabelValues for the label __name__ with the labelset to match on (previously, this endpoint didn't accept a labelset to match on as a parameter). This endpoint should eliminate the cardinality issue and return just 1 item per metric family (i.e. regardless of the number of pods per app), hence it should be much more efficient in our case. But this solution only works for recent Prometheus, so it will have to live alongside the alternative/legacy solution, with both in the code (see the sketch after this list).

[1] https://prometheus.io/docs/prometheus/2.24/querying/api/#querying-label-values
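
For illustration, here is a minimal Go sketch of points 3 and 4, using prometheus/client_golang's v1 API and assuming a client version that includes the matcher parameter for LabelValues from prometheus/client_golang#828; the Prometheus address and label matcher are made up for the example:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"}) // illustrative address
	if err != nil {
		panic(err)
	}
	prom := promv1.NewAPI(client)

	// Point 3: bound the discovery call so a slow query cannot delay the
	// whole Workload Details page.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	match := []string{`{namespace="twodotoh"}`} // illustrative labelset to match on
	end := time.Now()
	start := end.Add(-15 * time.Minute)

	// Legacy path: /api/v1/series returns one labelset per matching series,
	// so the result size grows with pod count and metric cardinality.
	series, _, err := prom.Series(ctx, match, start, end)
	fmt.Println("series:", len(series), err)

	// Prometheus >= 2.24 path: /api/v1/label/__name__/values with matchers
	// returns one entry per metric name, regardless of the number of pods.
	names, _, err := prom.LabelValues(ctx, "__name__", match, start, end)
	fmt.Println("metric names:", len(names), err)
}
```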

@jotak
Contributor

jotak commented Feb 2, 2021

> N_OF_PODS_IN_APP * N_OF_DISCOVER_ON_METRICS queries in parallel? (So in my case, 120 * 13 = 1560.)
> It would be great to be able to disable autodiscovery and rely on https://kiali.io/documentation/latest/runtimes-monitoring/#pods-annotations

Yes, +1, I think it's the first thing to do.
Perhaps a setting with values true/false/auto: if set to "auto", which would be the default, discovery would be automatically skipped when the number of pods found reaches an arbitrary threshold.
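
For illustration, a sketch of how such a setting could look in the Kiali CR, using the discovery_enabled / discovery_auto_threshold names from the commits referenced below; the nesting under custom_dashboards and the threshold value are assumptions:

```yaml
spec:
  external_services:
    custom_dashboards:
      enabled: true
      # "true" / "false" / "auto"; in "auto" mode (the proposed default),
      # discovery is skipped once the pod count reaches the threshold.
      discovery_enabled: "auto"
      # Pod count above which discovery is skipped in "auto" mode (illustrative value).
      discovery_auto_threshold: 10
```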

@jotak jotak self-assigned this Feb 2, 2021
@jotak jotak added the backlog Triaged Issue added to backlog label Feb 2, 2021
@jotak jotak added this to Backlog in Sprint 52 via automation Feb 2, 2021
@jshaughn
Collaborator

jshaughn commented Feb 2, 2021

> Prometheus v2.24, the latest version, has improved its API [1]

I guess I'd vote for options 2+4: a second config value, used only for older Prom versions (it could later be deprecated), specifying true/false/auto (or maybe a pod threshold instead of auto), and then the new API approach used for eligible Proms.

Or, to keep it simpler, options 1+4: disable discovery unless we have a newer Prom, then use the new API.

@primeroz

primeroz commented Feb 2, 2021

I would really love options 2+4, especially if 2 can land faster than option 4.

I am running Prometheus 2.24, so if you need any testing in my setup (the one that caused the issue), I can totally do that.

thanks for looking into this

jotak added a commit to jotak/kiali-operator that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660
jotak added a commit to jotak/swscore that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali#3660
@jotak
Contributor

jotak commented Feb 3, 2021

I've opened a PR for option 2. Regarding option 4 (the Prometheus API update), we'll have to wait a little bit for prometheus/client_golang#828 to get merged and for us to update our Go client.

jotak added a commit to jotak/kiali-operator that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660
jotak added a commit that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of #3660
jotak added a commit to jotak/swscore that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali#3660
jotak added a commit to jotak/kiali-operator that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660
jotak added a commit to kiali/kiali-operator that referenced this issue Feb 3, 2021
* Performance / custom dashboards: new configs

- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660

* mazz feedback
jotak added a commit to kiali/kiali-operator that referenced this issue Feb 4, 2021
* Performance / custom dashboards: new configs

- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660

* mazz feedback
jotak added a commit that referenced this issue Feb 4, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of #3660
@lucasponce lucasponce removed this from Backlog in Sprint 52 Feb 12, 2021
@lucasponce lucasponce added this to Backlog in Sprint 53 via automation Feb 12, 2021
@jotak
Contributor

jotak commented Feb 15, 2021

Closing this one, as the initial issue has been worked around.
I've opened #3704 as a next step; it can't be done until the next Prom Go client release.

@jotak jotak closed this as completed Feb 15, 2021
Sprint 53 automation moved this from Backlog to Done Feb 15, 2021
@jmazzitelli
Collaborator

jmazzitelli commented Sep 22, 2021

@primeroz Once PR 4367 is merged and released (it should ship next week with v1.41), we'd like to make sure this helps you further. (In fact, I don't know if you have a dev environment to try it in, but perhaps you can try the PR build now? If you don't have a dev environment in which you can build Kiali from that PR, I could build you a test image and publish it on quay.io if you want to try it out before release.)

The idea is that you shouldn't have to disable things; the hope is that the Prometheus query request is faster now. I don't have an environment as large as yours (80 nodes, tons of pods), so I can't test in an environment that mimics yours.

UPDATE: I built and published a test image based on PR 4367; if someone wants to test it, use this image: quay.io/jmazzitelli/kiali:pr4367
