
Evaluate Prometheus resource usage #2980
Open · siggy opened this issue Jun 21, 2019 · 12 comments

siggy (Member) commented Jun 21, 2019

Background

In #2922 a user reported a linkerd-prometheus pod using 30GB mem and 10GB ephemeral storage. Many factors contribute to Prometheus' resource usage, including:

  • total time series (prometheus_tsdb_head_series ≈ 500k, i.e. ~300 linkerd proxies × ~1700 metrics/proxy)
  • scrape_interval: 10s
  • --storage.tsdb.retention.time=6h
  • read load (via linkerd dashboard and Grafana)
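
A minimal sketch of where these knobs live, for reference; the values mirror the list above, and exact defaults may differ by Linkerd version:

# Sketch only, not a complete config.
# From the linkerd-prometheus-config ConfigMap (prometheus.yml):
global:
  scrape_interval: 10s              # every proxy is scraped every 10 seconds

# Container args on the linkerd-prometheus Deployment:
#   --storage.tsdb.retention.time=6h   # samples kept for 6 hours
#   --config.file=/etc/prometheus/prometheus.yml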

Current state

Replicating the above setup with Prometheus v2.10.0 decreased steady-state memory usage from 10GB to 5GB, and memory under high read load from 12GB to 8GB. This change will ship in #2979.

Proposal

Evaluate Prometheus resource usage, with the goal of reaching one or more of these outcomes:

  1. Linkerd default install changes
    • upgrade to Prometheus 2.11 when WAL compression lands
    • decrease set of metrics exported from proxy (or drop during collection)
    • optimize reads from linkerd dashboard and Grafana (via recording rules and/or fewer queries per page)
    • modify storage.tsdb.retention.time
    • modify storage.tsdb.retention.size
    • modify scrape interval
    • ephemeral storage limits
  2. Linkerd user tunable settings (via linkerd install)
    • storage.tsdb.retention.time
    • storage.tsdb.retention.size
    • scrape interval
    • ephemeral storage limits
  3. Document to the user how best to manage resource usage. This could involve modifying the linkerd-prometheus installation to use persistent volumes, etc. (Dashboard slow and Prometheus using large amount of resources #2922 (comment))

/cc @jamesallen-vol @suever @complex64 (thanks for the user reports!)

memory (Contributor) commented Jul 26, 2019

Linkerd dashboard deployment summary pages (http://127.0.0.1:50750/namespaces/foo/deployments/bar) seem to be extremely resource-intensive for Prometheus if the deployment in question has multiple upstream and downstream relationships, for example:

[screenshot: deployment summary page with multiple upstream and downstream relationships]

Simply opening 4 of those pages at the same time was sufficient to drive a 4x increase in the CPU utilization of linkerd-prometheus:

[screenshot: linkerd-prometheus CPU utilization]

memory (Contributor) commented Jul 26, 2019

As a follow-up: I repeated the same experiment, but left pages open for a deployment that is purely standalone, with neither upstreams nor downstreams. No CPU usage increase was seen for linkerd-prometheus, so this really does seem to be a question of how complex the graph is for the deployment in question.

grampelberg (Contributor) commented
@siggy what do you think about putting some rules in to pre-calculate the deployment pages? It's a tough tradeoff.

siggy (Member, Author) commented Jul 29, 2019

@grampelberg We tried recording rules a while back with mixed results, though it may be worth another look. I'm also optimistic that there are some optimizations to be had around dashboard query load.
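
For anyone who wants to experiment, a recording rule that pre-aggregates the kind of per-deployment queries the dashboard issues would look roughly like this (the group, rule names, and label choices are illustrative only, not something we ship):

# Illustrative recording rules, loaded via a rule_files entry in prometheus.yml.
groups:
  - name: linkerd-dashboard-precompute
    interval: 10s
    rules:
      - record: deployment:response_total:rate1m
        expr: |
          sum(rate(response_total{direction="inbound"}[1m]))
            by (namespace, deployment, classification)
      - record: deployment:request_total:rate1m
        expr: |
          sum(rate(request_total{direction="inbound"}[1m]))
            by (namespace, deployment)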

ihcsim (Contributor) commented Aug 27, 2019

After running a few tests on GKE and AKS, here are my two main observations so far.

  1. With the lifecycle test suite configured to run 100 slow-cooker pods, each generating traffic at 100 qps, the Linkerd Prometheus pod started to experience readiness and liveness probe failures. Changing the probe timeout alone from the default of 1 second to 60 seconds (sketched below) got me a lot further, until eventually the node ran out of memory and evicted the pod.
  2. Launching the dashboard definitely causes CPU spikes, which diminish when the browser is closed. When operating in stressed environments with multiple dashboards open, they started to fail, as seen in the screenshot below.

[screenshot: dashboard pages failing under load]
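
The probe change from item 1 is just a timeoutSeconds bump on the Prometheus container; roughly this fragment of the linkerd-prometheus Deployment (only the probe fields are shown, other values left at their defaults):

# Fragment of the linkerd-prometheus container spec; only probe fields shown.
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  timeoutSeconds: 60      # raised from the 1s default for this test
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  timeoutSeconds: 60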

Environment setup on AKS:

Infrastructure Resources    Count
Nodes                       10
Pods                        400
Total CPU cores             20
Total memory                70GB

Deployment Name             Pod count
slow-cooker                 100
bb-broadcast                100
bb-p2p                      100
bb-terminus                 100

Slow cooker configuration:

  • qps: 100
  • concurrency: 1

siggy (Member, Author) commented Aug 27, 2019

@ihcsim Just to confirm, does this mean we're running 100 slow-cooker pods at 100 qps each? If so, I recommend turning qps down to 1 (or ~10), as 100 qps may put undue pressure on the Kubernetes nodes and linkerd-proxy. We really only want pressure on Prometheus, which should not vary with qps (and if it does, I'd love to hear about it).

ihcsim (Contributor) commented Aug 27, 2019

@siggy Thanks for the tips.

Unrelated to qps, in my last round of tests I saw some "context canceled" logs in the public-api. Is this a query that the public-api was trying to send to Prometheus, where the context was canceled (due to a context timeout?) because Prometheus was unresponsive?

linkerd linkerd-controller-569bb9cfd8-q9r6s public-api time="2019-08-27T22:22:42Z" level=error msg="Query(sum(increase(route_response_total{direction=\"inbound\", dst=~\"(kubernetes.default.svc.cluster.local|bb-broadcast.default.svc.cluster.local)(:\\\\d+)?\", namespace=\"default\", pod=\"bb-broadcast-7b8454d865-lg9vw\"}[1m])) by (rt_route, dst, classification)) failed with: Get http://linkerd-prometheus.linkerd.svc.cluster.local:9090/api/v1/query?query=sum%28increase%28route_response_total%7Bdirection%3D%22inbound%22%2C+dst%3D~%22%28kubernetes.default.svc.cluster.local%7Cbb-broadcast.default.svc.cluster.local%29%28%3A%5C%5Cd%2B%29%3F%22%2C+namespace%3D%22default%22%2C+pod%3D%22bb-broadcast-7b8454d865-lg9vw%22%7D%5B1m%5D%29%29+by+%28rt_route%2C+dst%2C+classification%29: context canceled"

siggy (Member, Author) commented Aug 27, 2019

Yeah, it's a TopRoutes query from the public API to Prometheus:

routeReqQuery = "sum(increase(route_response_total%s[%s])) by (%s, dst, classification)"

I think you're right that it's a timeout, but I'm not totally sure. We have pretty good metrics around the gRPC clients in the control-plane. Have a look at Prometheus metrics in the linkerd-controller Prometheus job. The Linkerd Health Grafana dashboard is probably a good place to start.
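
For reference, the public-api issues that query through the Prometheus Go client with a request-scoped context, so a slow or overloaded Prometheus shows up client-side as a canceled or timed-out context. A stripped-down sketch of the pattern (not the actual public-api code; the address and timeout here are assumptions):

// Sketch of a Prometheus query with a deadline, using prometheus/client_golang.
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    client, err := api.NewClient(api.Config{
        Address: "http://linkerd-prometheus.linkerd.svc.cluster.local:9090",
    })
    if err != nil {
        panic(err)
    }
    promAPI := promv1.NewAPI(client)

    // If Prometheus can't answer before the deadline, the error surfaces as
    // "context deadline exceeded" / "context canceled", as in the log above.
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    query := `sum(increase(route_response_total{direction="inbound"}[1m])) by (rt_route, dst, classification)`
    val, warnings, err := promAPI.Query(ctx, query, time.Now())
    if err != nil {
        fmt.Println("query failed:", err)
        return
    }
    if len(warnings) > 0 {
        fmt.Println("warnings:", warnings)
    }
    fmt.Println(val)
}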

brianstorti (Contributor) commented Sep 16, 2019

Just to add to this one: I'm seeing Prometheus use all the CPU available on a worker node when I open the Linkerd dashboard. Memory doesn't seem to be a big issue for me, and it doesn't seem to be related to which page in the dashboard is open; even opening just the "overview" page is enough to trigger the CPU spike.

I don't notice the issue when I only open Grafana; it only happens with Linkerd's own dashboard.

I was running it on a 4-core node and it was using 100% of the CPU, starving all other pods. I edited the Prometheus deployment to add a limit of 1 core, and that makes the dashboard a bit flaky (and usually pretty slow), even with only one tab open.
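
The limit is just a resources stanza on the prometheus container in the linkerd-prometheus Deployment, roughly this (the value is what I used, not a recommendation):

# Sketch of the edit to the prometheus container spec.
resources:
  limits:
    cpu: "1"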

[screenshot: 2019-09-16 15:16:40]

I currently have ~100 meshed pods. This specific cluster is running 10 nodes (m5.xlarge, 4 cores and 16GB each) on Linkerd 2.5.

Please let me know if I can provide more information that could be useful.

ihcsim (Contributor) commented Sep 17, 2019

@brianstorti Thanks for bringing this up. Can you try updating the linkerd-prometheus-config config map with the changes in https://github.com/linkerd/linkerd2/pull/3401/files#diff-26bef37e1506c3f8b33144756cb7e919R62-R68? I am very curious to see if it will resolve the intense CPU consumption you are seeing. (Note that you may need to redeploy the Linkerd control plane to see the difference, in case cadvisor is already emitting a lot of metrics.)
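
In case it helps to see the shape of such a change: trimming high-cardinality cadvisor series is done with metric_relabel_configs on the scrape job. The fragment below is only a generic illustration of that technique (the metric names kept here are examples), not a copy of the PR; see the linked diff for the actual rules.

# Generic illustration of trimming cadvisor series in a scrape config.
- job_name: kubernetes-nodes-cadvisor
  # ...existing scrape settings...
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: container_(cpu_usage_seconds_total|memory_working_set_bytes)
      action: keep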

Memory doesn't seem to be a big issue for me

I'm curious about how much memory Prometheus is consuming. kubectl -n linkerd top po should give us an idea.

~100 meshed pods

Hmm... I am a bit surprised by this number. On AKS, with 4 cores and 14GB of memory, I was able to get to about 1,000 pods before my Prometheus started to suffocate. Do you have many other workloads sharing the same node as Prometheus? For bigger clusters, I find using a node selector and taint/toleration to isolate Prometheus to be helpful.
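
If you go that route, the isolation is just a nodeSelector plus a matching taint/toleration on the linkerd-prometheus pod spec; the label and taint key below are made up for this sketch.

# Sketch: dedicate a node to Prometheus.
#   kubectl label node <node> workload=prometheus
#   kubectl taint node <node> workload=prometheus:NoSchedule
# Then on the linkerd-prometheus pod spec:
nodeSelector:
  workload: prometheus
tolerations:
  - key: workload
    operator: Equal
    value: prometheus
    effect: NoSchedule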

I am also curious about your service-to-service communication pattern. Do you have X meshed clients talking to Y meshed servers, where Y >> X (by many times)?

Finally, can you port-forward to your Linkerd Prometheus and run the following PromQL for me? These queries will impose additional load on your Prometheus, so don't do this on a prod cluster. You might also have to scale down the number of meshed pods.

kubectl -n linkerd port-forward svc/linkerd-prometheus 9090
topk(5, count({job="linkerd-proxy"}) by (__name__))
topk(5, count({job="kubernetes-nodes-cadvisor"}) by (__name__))

brianstorti (Contributor) commented
Here are the results of the Prometheus queries:

[screenshots: results of the two topk queries, 2019-09-17]

Here you can see the CPU and memory usage (this is Prometheus running on a "dedicated" 4-core node):

[screenshot: linkerd-prometheus CPU and memory usage]

Do you have many other workloads sharing the same node as Prometheus?

Not many, but yeah, I was not using a node selector, so it was sharing the worker node with a few other pods. Now I'm running Prometheus on a dedicated 4-core node, but I still see it using 100% of the CPU.

I am also curious about your service-to-service communication pattern. Do you have X meshed clients talking to Y meshed servers, where Y >> X (by many times)?

We have one meshed service that receives requests from ~15 clients, and a service that sends requests to ~15 other services, but other than that, things are pretty evenly distributed.

I can try the configmap change later today and let you know if it changes anything.

brianstorti (Contributor) commented Sep 17, 2019

@ihcsim I tried applying these changes to linkerd-prometheus-config and restarted all Linkerd deployments, but didn't notice any difference in the CPU usage. Memory usage did drop significantly though.

[screenshot: 2019-09-17 11:49:52]
