Performance problems with custom dashboards discovery #3660

Closed
lucasponce opened this issue Feb 1, 2021 · 17 comments
@lucasponce
Contributor

Querying workload details [1] shows a significant number of queries between v1.26 and v1.29 in a highly populated cluster.

Info is provided by @primeroz (thanks a lot!)

[screenshot provided by @primeroz]

So, this deserves an investigation into what could be going on.

[1] http://localhost:20001/kiali/console/namespaces/twodotoh/workloads/gateway-external-write?tab=info&duration=600&refresh=0

@primeroz

primeroz commented Feb 1, 2021

For reference, this is a relatively big cluster

  • 80 nodes
  • 250 pods within the mesh, mostly in the same namespace as the workload page that fails to load

When loading the workload page (the workload has about 30 running pods), Kiali will often OOM, jumping from a normal usage of a few hundred MB to its limit of 6GB.

Sometimes, especially on 1.29, it also caused Prometheus to OOM, raising its RAM usage from an average of 14GB to 24GB (its limit).

@jotak
Contributor

jotak commented Feb 2, 2021

hey @primeroz @lucasponce ,
I've found quite a serious bug in our prom client code, but I have no reason to think it can cause your issue. Still... it might be interesting to test in your environment!
I've also added some trace-level logs.
All is there: #3664

Checking the new logs at a small scale, I don't find anything unexpected. Loading a Workload Details page triggers just 5 Prometheus queries, all of them expected (it's certainly possible to optimize, as there's some redundancy, but nothing huge like what we saw in your case):

2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{destination_workload_namespace="default",destination_workload="ball-base"}[60s]) > 0
2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{source_workload_namespace="default",source_workload="ball-base"}[60s]) > 0
2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{destination_service_namespace="default",source_workload_namespace!="default"}[60s]) > 0
2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{destination_service_namespace="default"}[60s]) > 0
2021-02-02T10:32:30Z INF getRequestRatesForLabel: rate(istio_requests_total{source_workload_namespace="default"}[60s]) > 0

So... unless the bug I fixed somehow triggers large side effects that I'm not seeing, I can't yet explain how you ended up with something like 3500 Prometheus hits for a single workload page refresh.

It would be nice to run my PR in your environment and check the logs (we're on Slack to help with this).

@lucasponce
Contributor Author

@jotak is it possible that the minigraph is so populated in such an environment that it creates that load in the details page?

@primeroz

primeroz commented Feb 2, 2021

@jotak I can definitely run your version. Do you have a built image of Kiali I can pull? Also, how do I enable the extra tracing?

I did some more tests last night and v1.26, after initially working, started to fail in the same way. I could see some "version is too old" errors related to the caching, but had no time to investigate further.

I was planning to get back to it later today and do some more debugging.

Also, I was trying to figure out how to get the Kiali custom dashboard to show up for the Kiali workload. I guess I need to make Kiali part of the mesh? (It is not, in my testing.)

@lucasponce
Contributor Author

> I did some more tests last night and v1.26, after initially working, started to fail in the same way. I could see some "version is too old" errors related to the caching, but had no time to investigate further.

I can confirm this is due to the bump of the client-go libraries; it should be harmless, just noisy in the logs.

@jotak
Contributor

jotak commented Feb 2, 2021

> @jotak is it possible that the minigraph is so populated in such an environment that it creates that load in the details page?

If I'm correct, all the topology & associated health data are taken from just these three queries, which should stay the same regardless of the number of nodes in the graph:

rate(istio_requests_total{destination_service_namespace="default",source_workload_namespace!="default"}[60s]) > 0
rate(istio_requests_total{destination_service_namespace="default"}[60s]) > 0
rate(istio_requests_total{source_workload_namespace="default"}[60s]) > 0

So I don't think the graph size should matter, at least in terms of prom queries, unless I'm missing something.

@jotak
Contributor

jotak commented Feb 2, 2021

> @jotak I can definitely run your version. Do you have a built image of Kiali I can pull? Also, how do I enable the extra tracing?

I've just pushed this image: quay.io/jotak/kiali:dev. To increase the logging level, set the Kiali env var LOG_LEVEL to trace.
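
For reference, a minimal sketch of setting that env var directly on the Kiali deployment's container (plain Kubernetes container env; if you deploy via the operator there may be a dedicated CR field for the log level, so treat this as illustrative):

```yaml
# Snippet of the Kiali Deployment's container spec (illustrative):
containers:
  - name: kiali
    image: quay.io/jotak/kiali:dev
    env:
      - name: LOG_LEVEL
        value: "trace"   # enables the new trace-level logs from #3664
```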

> I did some more tests last night and v1.26, after initially working, started to fail in the same way. I could see some "version is too old" errors related to the caching, but had no time to investigate further.
>
> I was planning to get back to it later today and do some more debugging.
>
> Also, I was trying to figure out how to get the Kiali custom dashboard to show up for the Kiali workload. I guess I need to make Kiali part of the mesh? (It is not, in my testing.)

It probably depends on how you installed Istio and Kiali. Basically, there's a CRD defined by Kiali that must be installed, named "monitoringdashboards.monitoring.kiali.io"; do you have it?
Kiali doesn't have to be part of the mesh for that; however, Prometheus has to scrape the Kiali pod. You can configure an alternate Prometheus URL if you don't want to mix these metrics with the mesh metrics, through the "custom_dashboards" configuration here: https://github.com/kiali/kiali-operator/blob/master/deploy/kiali/kiali_cr.yaml#L419-L432. If you leave the custom_dashboards.prometheus config empty, it will just pick the main/Istio Prometheus config instead.
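
For illustration, a minimal sketch of that part of the Kiali CR, following the custom_dashboards section of the linked kiali_cr.yaml; the exact field paths and the Prometheus URL below are assumptions, not a verified configuration:

```yaml
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
spec:
  external_services:
    custom_dashboards:
      enabled: true
      prometheus:
        # Optional alternate Prometheus for custom dashboard metrics (assumed field path).
        # Leave this empty to fall back to the main/Istio Prometheus configuration.
        url: "http://prometheus-workloads.monitoring:9090"
```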

@primeroz

primeroz commented Feb 2, 2021

After chatting with @jotak (thanks, you were super helpful), it turned out that the issue has to do with the custom dashboards.

Of note, the biggest impact was when opening the Application or Workload page for an app (label app) that had about 120 pods in it.

The timings are 89s for getCustomDashboardRefs and getWorkloads

[screenshot of the timing traces]

I disabled the custom dashboards and now Kiali is as snappy as it can possibly be; RAM usage is also back to normal.

I will upgrade back to 1.29 to confirm this issue has nothing to do with the version; then it might be good to change the scope of this issue, or close it and create another one.

@primeroz

primeroz commented Feb 2, 2021

Also on 1.29, with custom dashboards disabled, Kiali's performance is good.

Is this because of the autodiscovery functionality, where I have N_OF_PODS_IN_APP * N_OF_DISCOVER_ON_METRICS queries in parallel? (So in my case, 120 * 13 = 1560.)
It would be great to be able to disable autodiscovery and rely on https://kiali.io/documentation/latest/runtimes-monitoring/#pods-annotations

Anyway @jotak @lucasponce, might it be good to close this case in favour of a new one about custom dashboards, to avoid confusion?

Thanks for your help!

@lucasponce
Contributor Author

lucasponce commented Feb 2, 2021

> Anyway @jotak @lucasponce, might it be good to close this case in favour of a new one about custom dashboards, to avoid confusion?

Oh, I've just re-read the whole thread.

Good finding; thanks @primeroz and @jotak for the time spent working on it.

If the dashboards issue is the main root cause, I think the same issue can be re-used; if it's an additional/different but related topic, then it could be split into another one for better tracking.

I think I missed the context in my first response.

@jotak jotak changed the title Performance problems between Kiali v1.26 and v1.29 Performance problems with custom dashboards discovery Feb 2, 2021
@jotak
Contributor

jotak commented Feb 2, 2021

I renamed this issue. Some findings to share:

The issue is mostly seen on applications or workloads that have many pods, because many pods x high metrics cardinality = an exploding volume of data. There is good news, however: most queries that Kiali runs are fine. Just one query is really a problem, and it's not a central feature: it's the api.Series prom endpoint, used to get all series matching a given labelset, and it is used exclusively in the custom dashboards discovery process.

So there are several things we can do:

  • The easy, no-code workaround is just to disable custom dashboards.

  • We could add a second flag in the custom dashboards config to disable dashboards discovery while keeping the rest of the feature. This is interesting because we would still have the capability of self-debugging Kiali (and it proved to be useful right here; OK, that's a chicken-and-egg problem).

  • This specific query should use a low cancellation timeout in its context (10s maybe?), because it impacts the loading time of the Workload details page, and we don't want it to create a bad experience when, 99% of the time, there will be no dashboard found.

  • Prometheus v2.24, the latest version, has improved its API [1] in a way that should be very helpful here. Instead of calling api.Series, we can call api.LabelValues for the label __name__ with the labelset to match on (previously, this endpoint didn't accept a labelset to match on as a parameter). This endpoint should eliminate the cardinality issue and return just 1 item per metric family (i.e. regardless of the number of pods per app), hence it should be much more efficient in our case. But this solution only works for recent Prometheus, so it will have to live alongside the alternative/legacy solution, with both in the code (see the sketch after this list).

[1] https://prometheus.io/docs/prometheus/2.24/querying/api/#querying-label-values
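
For illustration, here is a minimal Go sketch of points 3 and 4, using prometheus/client_golang's v1 API and assuming a client version that includes the matcher parameter for LabelValues from prometheus/client_golang#828; the Prometheus address and label matcher are made up for the example:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"}) // illustrative address
	if err != nil {
		panic(err)
	}
	prom := promv1.NewAPI(client)

	// Point 3: bound the discovery call so a slow query cannot delay the
	// whole Workload Details page.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	match := []string{`{namespace="twodotoh"}`} // illustrative labelset to match on
	end := time.Now()
	start := end.Add(-15 * time.Minute)

	// Legacy path: /api/v1/series returns one labelset per matching series,
	// so the result size grows with pod count and metric cardinality.
	series, _, err := prom.Series(ctx, match, start, end)
	fmt.Println("series:", len(series), err)

	// Prometheus >= 2.24 path: /api/v1/label/__name__/values with matchers
	// returns one entry per metric name, regardless of the number of pods.
	names, _, err := prom.LabelValues(ctx, "__name__", match, start, end)
	fmt.Println("metric names:", len(names), err)
}
```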

@jotak
Contributor

jotak commented Feb 2, 2021

> N_OF_PODS_IN_APP * N_OF_DISCOVER_ON_METRICS queries in parallel? (So in my case, 120 * 13 = 1560.)
> It would be great to be able to disable autodiscovery and rely on https://kiali.io/documentation/latest/runtimes-monitoring/#pods-annotations

Yes, +1, I think it's the first thing to do.
Perhaps a setting with values true/false/auto: if set to "auto", which would be the default, discovery would be automatically skipped when the number of pods found reaches an arbitrary threshold.
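
For illustration, a sketch of how such a setting could look in the Kiali CR, using the discovery_enabled / discovery_auto_threshold names from the commits referenced below; the nesting under custom_dashboards and the threshold value are assumptions:

```yaml
spec:
  external_services:
    custom_dashboards:
      enabled: true
      # "true" / "false" / "auto"; in "auto" mode (the proposed default),
      # discovery is skipped once the pod count reaches the threshold.
      discovery_enabled: "auto"
      # Pod count above which discovery is skipped in "auto" mode (illustrative value).
      discovery_auto_threshold: 10
```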

@jotak jotak self-assigned this Feb 2, 2021
@jotak jotak added the backlog Triaged Issue added to backlog label Feb 2, 2021
@jotak jotak added this to Backlog in Sprint 52 via automation Feb 2, 2021
@jshaughn
Collaborator

jshaughn commented Feb 2, 2021

> Prometheus v2.24, the latest version, has improved its API [1]

I guess I'd vote for options 2+4: a second config value, used only for older Prom versions (it could later be deprecated), specifying true/false/auto (or maybe a pod threshold instead of auto), and then the new API approach used for eligible Proms.

Or, to keep it simpler, options 1+4: disable discovery unless we have a newer Prom, then use the new API.

@primeroz

primeroz commented Feb 2, 2021

I would really love options 2+4, especially if 2 can land faster than option 4.

I am running Prometheus 2.24, so if you need any testing in my setup (the one that caused the issue), I can totally do that.

thanks for looking into this

jotak added a commit to jotak/kiali-operator that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660
jotak added a commit to jotak/swscore that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali#3660
@jotak
Contributor

jotak commented Feb 3, 2021

I've opened a PR for option 2. Regarding option 4 (the Prometheus API update), we'll have to wait a little bit for prometheus/client_golang#828 to get merged and for us to update our Go client.

jotak added a commit to jotak/kiali-operator that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660
jotak added a commit that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of #3660
jotak added a commit to jotak/swscore that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali#3660
jotak added a commit to jotak/kiali-operator that referenced this issue Feb 3, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660
jotak added a commit to kiali/kiali-operator that referenced this issue Feb 3, 2021
* Performance / custom dashboards: new configs

- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660

* mazz feedback
jotak added a commit to kiali/kiali-operator that referenced this issue Feb 4, 2021
* Performance / custom dashboards: new configs

- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of kiali/kiali#3660

* mazz feedback
jotak added a commit that referenced this issue Feb 4, 2021
- discovery_enabled (true/false/auto) to switch discovery mode
- discovery_auto_threshold: pods threshold above which discovery is
  skipped in auto mode

Part of #3660
@lucasponce lucasponce removed this from Backlog in Sprint 52 Feb 12, 2021
@lucasponce lucasponce added this to Backlog in Sprint 53 via automation Feb 12, 2021
@jotak
Contributor

jotak commented Feb 15, 2021

Closing this one, as the initial issue has been worked around.
I've opened #3704 as a next step; it can't be done until the next Prom Go client release.

@jotak jotak closed this as completed Feb 15, 2021
Sprint 53 automation moved this from Backlog to Done Feb 15, 2021
@jmazzitelli
Collaborator

jmazzitelli commented Sep 22, 2021

@primeroz Once PR 4367 is merged and released (it should ship next week with v1.41), we'd like to make sure this helps you further. (In fact, I don't know if you have a dev environment to try it in, but perhaps you can try the PR build now? If you don't have a dev environment in which you can build Kiali from that PR, I could build you a test image and publish it on quay.io if you want to try it out before release.)

The idea is that you shouldn't have to disable things; the hope is that the Prometheus query request is faster now. I don't have an environment as large as yours (80 nodes, tons of pods), so I can't test in an environment that mimics yours.

UPDATE: I built and published a test image based on PR 4367; if someone wants to test it, use this image: quay.io/jmazzitelli/kiali:pr4367
