Evaluate Prometheus resource usage #2980
Linkerd dashboard deployment summary pages (http://127.0.0.1:50750/namespaces/foo/deployments/bar) appear to be extremely resource-intensive for Prometheus when the deployment in question has multiple upstream and downstream relationships. For example, simply opening 4 of those pages at the same time was sufficient to drive a 4X increase in the CPU utilization of linkerd-prometheus.
As a follow-up: I replicated the same experiment, but left open pages for a deployment that is purely standalone and has neither upstreams nor downstreams. No CPU usage increase was seen for linkerd-prometheus, so this really does seem to be a question of how complex the graph is for the deployment in question.
@siggy what do you think about putting some rules in to pre-calculate the deployment pages? It's a tough tradeoff.
@grampelberg We tried recording rules a while back with mixed results, though it may be worth another look. I'm also optimistic there may be some optimizations around dashboard query load.
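For reference, a recording rule that pre-aggregates a dashboard query might look like the sketch below. The rule name, evaluation interval, window, and label set here are assumptions for illustration, not Linkerd's actual rules; `response_total` is the Linkerd proxy's response counter metric.

```yaml
# Sketch of a Prometheus recording rule pre-computing per-deployment
# success/failure rates so dashboard pages can read a single series
# instead of aggregating raw proxy metrics on every page load.
groups:
  - name: linkerd-dashboard.rules   # illustrative name
    interval: 10s
    rules:
      - record: deployment:response_total:rate1m
        expr: |
          sum(rate(response_total[1m])) by (namespace, deployment, classification)
```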
After running a few tests on GKE and AKS, here are my two main observations so far.
Environment setup on AKS:
Slow cooker configuration:
@ihcsim Just to confirm, does this mean we're running 100 slow-cooker pods at 100qps each? If so, I recommend turning qps down to 1 (or ~10), as 100qps may put undue pressure on the Kubernetes nodes and linkerd-proxy. We really only want pressure on Prometheus, which should not vary with qps (and if it does, I'd love to hear about it).
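A lower-rate slow_cooker invocation along those lines might look like this; the `-qps` and `-concurrency` flags are from the slow_cooker README, while the target URL is a placeholder, not part of this thread's setup.

```shell
# Sketch: drive modest, steady load at a placeholder in-cluster target.
# qps is per connection, so this is ~10 qps across 1 connection per pod.
slow_cooker -qps 10 -concurrency 1 http://webapp.default.svc.cluster.local:8080
```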
@siggy Thanks for the tips. Unrelated to qps, in my last round of tests I saw some
Yeah, it's a TopRoutes query from the public API to Prometheus:
I think you're right that it's a timeout, but I'm not totally sure. We have pretty good metrics around the gRPC clients in the control plane. Have a look at the Prometheus metrics in the
Just to add to this one: I'm seeing Prometheus use all the CPU available in a worker node when I open the Linkerd dashboard. Memory doesn't seem to be a big issue for me, and it doesn't seem to be related to which page in the dashboard is open; even opening just the "overview" page is enough to trigger the CPU spike. I don't notice the issue when I only open Grafana; it only happens with Linkerd's own dashboard.

I was running it on a 4-core node and it was using 100% of CPU, starving all other pods. I edited the Prometheus deployment to add a limit of 1 core, and that makes the dashboard a bit flaky (and usually pretty slow), even with only 1 tab. I currently have ~100 meshed pods. This specific cluster is running with 10 nodes (

Please let me know if I can provide more information that could be useful.
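The 1-core limit described above can also be applied without hand-editing the manifest, via `kubectl set resources`; the namespace and deployment name here assume a default Linkerd install.

```shell
# Cap the Linkerd Prometheus deployment at 1 CPU core
kubectl -n linkerd set resources deploy/linkerd-prometheus --limits=cpu=1
```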
@brianstorti Thanks for bringing this up. Can you try updating the
I'm curious about how much memory Prometheus is consuming.
Hmm... I am a bit surprised by this number. On AKS, with 4 cores and 14GB of memory, I was able to get to about 1,000 pods before my Prometheus started to suffocate. Do you have many other workloads sharing the same node as Prometheus? For bigger clusters, I find using a node selector and taint/toleration to isolate Prometheus to be helpful.

I am also curious about your service-to-service communication pattern. Do you have X meshed clients talking to Y meshed servers, where Y >> X (by many times)?

Finally, can you port-forward to your Linkerd Prometheus and run the following PromQL for me? These queries will impose additional load on your Prometheus, so don't do it on a prod cluster. You might also have to scale down the number of meshed pods.
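The specific queries weren't captured in this extract. For context only, cardinality-diagnosis queries of the kind being asked for typically look like the following; these are illustrative stand-ins, not the queries from the original comment, and the `topk` one is expensive on a large TSDB.

```promql
# Total number of active series in the head block
prometheus_tsdb_head_series

# Top 10 metric names by series count (expensive; avoid on prod)
topk(10, count by (__name__) ({__name__=~".+"}))
```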
Here are the Prometheus queries:

Here you can see the CPU and memory usage (this is Prometheus running in a "dedicated" 4-core node):
Not many, but yeah, I was not using a node selector, so it was sharing the worker node with a few other pods. Now I'm running Prometheus on a dedicated 4-core node, but still seeing it use 100% of CPU.
We have one meshed service that receives requests from ~15 clients, and a service that sends requests to ~15 other services, but other than that, things are pretty evenly distributed. I can try the configmap change later today and let you know if it changes something.
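The node-isolation approach mentioned above can be sketched as a deployment pod-spec fragment. The `dedicated=prometheus` label/taint pair is an assumption; it presumes the target nodes were labeled and tainted with that key beforehand.

```yaml
# Sketch: pin Prometheus to a dedicated node pool.
# Assumes nodes carry the label dedicated=prometheus and the
# taint dedicated=prometheus:NoSchedule (names are illustrative).
spec:
  template:
    spec:
      nodeSelector:
        dedicated: prometheus
      tolerations:
        - key: dedicated
          operator: Equal
          value: prometheus
          effect: NoSchedule
```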
@ihcsim I tried applying these changes to
Background
In #2922 a user reported a linkerd-prometheus pod using 30GB mem and 10GB ephemeral storage. Many factors contribute to Prometheus' resource usage, including:

- `prometheus_tsdb_head_series` == ~500k (== ~300 linkerd proxies x ~1700 metrics/proxy)
- `scrape_interval: 10s`
- `--storage.tsdb.retention.time=6h`
- Read load (from `linkerd dashboard` and Grafana)

Current state
Replicating the above setup with Prometheus v2.10.0 decreased steady-state memory usage from 10GB to 5GB, and high read-load memory from 12GB to 8GB. This change will ship in #2979.
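As a quick sanity check, the head-series figure quoted in the Background section is consistent with the per-proxy metric count:

```python
# ~300 meshed proxies, each exporting ~1700 time series,
# lands near the ~500k reported by prometheus_tsdb_head_series.
proxies = 300
series_per_proxy = 1700
head_series = proxies * series_per_proxy
print(head_series)  # 510000, i.e. roughly ~500k
```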
Proposal
Evaluate Prometheus resource usage, the goal being one or more of these outcomes:

- Decreased query load from `linkerd dashboard` and Grafana (via recording rules and/or fewer queries per page)
- Configurable resource settings (via `linkerd install`)

/cc @jamesallen-vol @suever @complex64 (thanks for the user reports!)