Performance problems with custom dashboards discovery #3660
For reference, this is a relatively big cluster.
When loading the workload page (the workload has about 30 running pods), Kiali will often OOM, jumping from its normal usage of a few hundred MB to its limit of 6 GB. Sometimes, especially on 1.29, it also caused Prometheus to OOM, raising its RAM usage from an average of 14 GB to 24 GB (its limit).
Hey @primeroz @lucasponce, checking the new logs at a small scale, I don't find anything unexpected. Loading a Workload Details page triggers just 5 Prometheus queries, all of them expected (it's certainly possible to optimize, there's some redundancy, but nothing huge like what we saw in your case):
So ... unless the bug I fixed somehow triggers large side effects that I'm not seeing, I can't say yet how you ended up with something like 3500 hits on Prometheus for a single workload page refresh. It would be nice to run my PR in your environment and check the logs (we're on Slack to help with this).
@jotak is it possible that the minigraph is so populated in such an environment that it creates that load in the detail page?
@jotak I can definitely run your version; do you have a built image of Kiali I can pull? Also, how do I enable the extra tracing? I did some more tests last night and 1.26, after initially working, started to fail in the same way. I could see some … Was planning to get back to it later today and do some more debugging. Also, I was trying to figure out how to get the Kiali custom dashboard to show up for the Kiali workload. I guess I need to make Kiali part of the mesh? (It is not, in my testing.)
I can confirm this is due to the bump of the client-go libraries; it should be harmless, just noisy in the logs.
If I'm correct, all the topology & associated health data are taken from just these three queries, which should stay the same whatever the number of nodes in the graph:
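(The query expressions themselves weren't captured in this thread. As an illustrative sketch only, queries of this shape, issued here through the official Go client against standard Istio telemetry metrics, return one series per label combination; their cost depends on metric cardinality, not on how many nodes end up drawn in the graph. The PromQL below is an assumption, not Kiali's exact queries, and the Prometheus address is a placeholder.)

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Illustrative topology/health-style queries over Istio telemetry.
	queries := []string{
		`sum(rate(istio_requests_total{reporter="destination"}[10m])) by (destination_workload_namespace, destination_workload, response_code)`,
		`sum(rate(istio_requests_total{reporter="source"}[10m])) by (source_workload_namespace, source_workload, response_code)`,
		`sum(rate(istio_tcp_sent_bytes_total{reporter="source"}[10m])) by (source_workload_namespace, source_workload)`,
	}
	for _, q := range queries {
		val, warns, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			fmt.Println("query failed:", err)
			continue
		}
		if len(warns) > 0 {
			fmt.Println("warnings:", warns)
		}
		// Instant queries come back as a vector: one sample per series.
		if vec, ok := val.(model.Vector); ok {
			fmt.Printf("%d series returned\n", len(vec))
		}
	}
}
```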
So I don't think the graph size should matter, at least in terms of Prometheus queries, unless I'm missing something.
I've just pushed this image:
It probably depends on how you installed Istio and Kiali. Basically, there's a CRD defined by Kiali that must be installed, named "monitoringdashboards.monitoring.kiali.io"; do you have it?
After chatting with @jotak (thanks, you were super helpful), it turned out that the issue had to do with the Custom Dashboards. Worth noting that the biggest impact was when opening the … The timings are 89s for … I disabled the custom dashboards and now Kiali is as snappy as it can possibly be; RAM usage is also normal now. I will upgrade back to 1.29 to confirm this issue has nothing to do with the version; then it might be good to change the scope of the issue / close this and create another one.
Also on 1.29, with custom dashboards disabled, the performance of Kiali is good. Is this because of the …
Anyway @jotak @lucasponce, it might be good to either close this case in favour of a new one about … Thanks for your help!
Oh, I've been re-reading the whole thread. Good finding; thanks @primeroz and @jotak for the time working on it. If the dashboards issue is the main root cause, I think the same issue can be reused; if it's an additional/different but related topic, then it could be split into another one for better tracking. I think I missed the context in my first response.
I renamed this issue. Some findings to share: the issue is mostly seen on applications or workloads that have many pods, because many pods x high metrics cardinality = an exploding volume of data. There is good news, however: most queries that Kiali runs are fine. Just one query is really a problem, and it's not a central feature: it's the custom dashboards discovery query. So there are several options we could consider: …
[1] https://prometheus.io/docs/prometheus/2.24/querying/api/#querying-label-values
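(For context on the API approach referred to below as "option 4": Prometheus 2.24 added `match[]` selectors on the label-values endpoint [1], which lets a client ask only for the metric names attached to a set of pods instead of pulling back whole series. A minimal sketch with the official Go client, assuming a client_golang version that includes the matchers parameter from prometheus/client_golang#828; the namespace and pod regex are placeholders taken from the URL in the issue description.)

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// List metric names for the workload's pods only. The match[] selector
	// on /api/v1/label/__name__/values requires Prometheus >= 2.24.
	names, warns, err := promAPI.LabelValues(ctx, "__name__",
		[]string{`{namespace="twodotoh", pod=~"gateway-external-write-.*"}`},
		time.Now().Add(-10*time.Minute), time.Now())
	if err != nil {
		panic(err)
	}
	if len(warns) > 0 {
		fmt.Println("warnings:", warns)
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```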
Yes, +1. I think it's the first thing to do.
I guess I'd vote for option 2+4: a second config value, used only for older Prometheus and deprecable later, specifying true/false/auto (or maybe a pod threshold instead of auto), and then the new API approach used for eligible Proms. Or, to keep it simpler, option 1+4: disable discovery unless we have a newer Prom, then use the API.
I would really love option 2+4, especially if 2 can land faster than option 4. I am running Prometheus 2.24, so if you need any testing in my setup (the one that caused the issue), I can totally do that. Thanks for looking into this!
- discovery_enabled (true/false/auto): switch discovery mode
- discovery_auto_threshold: pod threshold above which discovery is skipped in auto mode

Part of kiali/kiali#3660
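(To make the semantics of these two options concrete, here is a sketch of the decision logic they imply; the function below is inferred from the option names and is not Kiali's actual code.)

```go
// discoveryEnabled reports whether dashboards discovery should run for a
// workload. Inferred from the option names above, not Kiali's implementation.
func discoveryEnabled(mode string, autoThreshold, podCount int) bool {
	switch mode {
	case "true":
		return true
	case "false":
		return false
	default: // "auto": skip the expensive discovery for workloads with many pods
		return podCount <= autoThreshold
	}
}
```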
I've opened a PR for option 2; for option 4 (the Prometheus API update), we'll have to wait a bit until prometheus/client_golang#828 gets merged and we update our Go client.
Closing this one, as the initial issue has been worked around.
@primeroz Once PR 4367 is merged and released (it should ship next week with v1.41), we'd like to make sure this helps you further. (In fact, I don't know if you have a dev environment to try, but perhaps you can try the PR build now? If you don't have a dev environment in which you can build Kiali from that PR, I could build you a test image and publish it on quay.io if you want to try it out before release.) The idea is that you shouldn't have to disable things; the hope is that the Prometheus query is faster now. I don't have an environment as large as yours (80 nodes, tons of pods), so I can't test in an environment that mimics yours. UPDATE: I built and published a test image based on PR 4367. If someone wants to test it, use this image:
Querying workload details [1] shows a significant increase in the number of queries between v1.26 and v1.29 in a highly populated cluster.
Info is provided by @primeroz (thanks a lot!)
So this deserves an investigation into what could be going on.
[1] http://localhost:20001/kiali/console/namespaces/twodotoh/workloads/gateway-external-write?tab=info&duration=600&refresh=0