
Evaluate Prometheus resource usage #2980
Open · siggy opened this issue Jun 21, 2019 · 12 comments

siggy (Member) commented Jun 21, 2019

Background

In #2922 a user reported a linkerd-prometheus pod using 30GB mem and 10GB ephemeral storage. Many factors contribute to Prometheus' resource usage, including:

  • total time series (prometheus_tsdb_head_series ≈ 500k, i.e. ~300 linkerd proxies × ~1700 metrics/proxy)
  • scrape_interval: 10s
  • --storage.tsdb.retention.time=6h
  • read load (via linkerd dashboard and Grafana)
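
A minimal sketch of where these knobs live, for reference; the values mirror the list above, and exact defaults may differ by Linkerd version:

# Sketch only, not a complete config.
# From the linkerd-prometheus-config ConfigMap (prometheus.yml):
global:
  scrape_interval: 10s              # every proxy is scraped every 10 seconds

# Container args on the linkerd-prometheus Deployment:
#   --storage.tsdb.retention.time=6h   # samples kept for 6 hours
#   --config.file=/etc/prometheus/prometheus.yml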

Current state

Replicating the above setup with Prometheus v2.10.0 decreased steady-state memory usage from 10GB to 5GB, and memory under high read load from 12GB to 8GB. This change will ship in #2979.

Proposal

Evaluate Prometheus resource usage, with the goal of reaching one or more of these outcomes:

  1. Linkerd default install changes
    • upgrade to Prometheus 2.11 when WAL compression lands
    • decrease set of metrics exported from proxy (or drop during collection)
    • optimize reads from linkerd dashboard and Grafana (via recording rules and/or fewer queries per page)
    • modify storage.tsdb.retention.time
    • modify storage.tsdb.retention.size
    • modify scrape interval
    • ephemeral storage limits
  2. Linkerd user tunable settings (via linkerd install)
    • storage.tsdb.retention.time
    • storage.tsdb.retention.size
    • scrape interval
    • ephemeral storage limits
  3. Document to the user how best to manage resource usage. This could involve modifying the linkerd-prometheus installation to use persistent volumes, etc. (Dashboard slow and Prometheus using large amount of resources #2922 (comment))

/cc @jamesallen-vol @suever @complex64 (thanks for the user reports!)

memory (Contributor) commented Jul 26, 2019

Linkerd dashboard deployment summary pages (http://127.0.0.1:50750/namespaces/foo/deployments/bar) seem to be extremely resource-intensive for Prometheus if the deployment in question has multiple upstream and downstream relationships, for example:

[screenshot: deployment summary page with multiple upstream and downstream relationships]

Simply opening 4 of those pages at the same time was sufficient to drive a 4x increase in the CPU utilization of linkerd-prometheus:

[screenshot: linkerd-prometheus CPU utilization]

memory (Contributor) commented Jul 26, 2019

As a follow-up: I repeated the same experiment, but left pages open for a deployment that is purely standalone, with neither upstreams nor downstreams. No CPU usage increase was seen for linkerd-prometheus, so this really does seem to be a question of how complex the graph is for the deployment in question.

grampelberg (Contributor) commented
@siggy what do you think about putting some rules in to pre-calculate the deployment pages? It's a tough tradeoff.

siggy (Member, Author) commented Jul 29, 2019

@grampelberg We tried recording rules a while back with mixed results, though it may be worth another look. I'm also optimistic that there are some optimizations to be had around dashboard query load.
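
For anyone who wants to experiment, a recording rule that pre-aggregates the kind of per-deployment queries the dashboard issues would look roughly like this (the group, rule names, and label choices are illustrative only, not something we ship):

# Illustrative recording rules, loaded via a rule_files entry in prometheus.yml.
groups:
  - name: linkerd-dashboard-precompute
    interval: 10s
    rules:
      - record: deployment:response_total:rate1m
        expr: |
          sum(rate(response_total{direction="inbound"}[1m]))
            by (namespace, deployment, classification)
      - record: deployment:request_total:rate1m
        expr: |
          sum(rate(request_total{direction="inbound"}[1m]))
            by (namespace, deployment)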

ihcsim (Contributor) commented Aug 27, 2019

After running a few tests on GKE and AKS, here are my two main observations so far.

  1. With the lifecycle test suite configured to run 100 slow-cooker pods, each generating traffic at 100 qps, the Linkerd Prometheus pod started to experience readiness and liveness probe failures. Changing the probe timeout alone from the default of 1 second to 60 seconds (sketched below) got me a lot further, until eventually the node ran out of memory and evicted the pod.
  2. Launching the dashboard definitely causes CPU spikes, which diminish when the browser is closed. When operating in stressed environments with multiple dashboards open, they started to fail, as seen in the screenshot below.

[screenshot: dashboard pages failing under load]
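
The probe change from item 1 is just a timeoutSeconds bump on the Prometheus container; roughly this fragment of the linkerd-prometheus Deployment (only the probe fields are shown, other values left at their defaults):

# Fragment of the linkerd-prometheus container spec; only probe fields shown.
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  timeoutSeconds: 60      # raised from the 1s default for this test
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  timeoutSeconds: 60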

Environment setup on AKS:

Infrastructure Resources    Count
Nodes                       10
Pods                        400
Total CPU cores             20
Total memory                70GB

Deployment Name             Pod count
slow-cooker                 100
bb-broadcast                100
bb-p2p                      100
bb-terminus                 100

Slow cooker configuration:

  • qps: 100
  • concurrency: 1

siggy (Member, Author) commented Aug 27, 2019

@ihcsim Just to confirm, does this mean we're running 100 slow-cooker pods at 100 qps each? If so, I recommend turning qps down to 1 (or ~10), as 100 qps may put undue pressure on the Kubernetes nodes and linkerd-proxy. We really only want pressure on Prometheus, which should not vary with qps (and if it does, I'd love to hear about it).

ihcsim (Contributor) commented Aug 27, 2019

@siggy Thanks for the tips.

Unrelated to qps, in my last round of tests I saw some "context canceled" logs in the public-api. Is this a query that the public-api was trying to send to Prometheus, where the context was canceled (due to a context timeout?) because Prometheus was unresponsive?

linkerd linkerd-controller-569bb9cfd8-q9r6s public-api time="2019-08-27T22:22:42Z" level=error msg="Query(sum(increase(route_response_total{direction=\"inbound\", dst=~\"(kubernetes.default.svc.cluster.local|bb-broadcast.default.svc.cluster.local)(:\\\\d+)?\", namespace=\"default\", pod=\"bb-broadcast-7b8454d865-lg9vw\"}[1m])) by (rt_route, dst, classification)) failed with: Get http://linkerd-prometheus.linkerd.svc.cluster.local:9090/api/v1/query?query=sum%28increase%28route_response_total%7Bdirection%3D%22inbound%22%2C+dst%3D~%22%28kubernetes.default.svc.cluster.local%7Cbb-broadcast.default.svc.cluster.local%29%28%3A%5C%5Cd%2B%29%3F%22%2C+namespace%3D%22default%22%2C+pod%3D%22bb-broadcast-7b8454d865-lg9vw%22%7D%5B1m%5D%29%29+by+%28rt_route%2C+dst%2C+classification%29: context canceled"

siggy (Member, Author) commented Aug 27, 2019

Yeah, it's a TopRoutes query from the public API to Prometheus:

routeReqQuery = "sum(increase(route_response_total%s[%s])) by (%s, dst, classification)"

I think you're right that it's a timeout, but I'm not totally sure. We have pretty good metrics around the gRPC clients in the control-plane. Have a look at Prometheus metrics in the linkerd-controller Prometheus job. The Linkerd Health Grafana dashboard is probably a good place to start.
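
For reference, the public-api issues that query through the Prometheus Go client with a request-scoped context, so a slow or overloaded Prometheus shows up client-side as a canceled or timed-out context. A stripped-down sketch of the pattern (not the actual public-api code; the address and timeout here are assumptions):

// Sketch of a Prometheus query with a deadline, using prometheus/client_golang.
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    client, err := api.NewClient(api.Config{
        Address: "http://linkerd-prometheus.linkerd.svc.cluster.local:9090",
    })
    if err != nil {
        panic(err)
    }
    promAPI := promv1.NewAPI(client)

    // If Prometheus can't answer before the deadline, the error surfaces as
    // "context deadline exceeded" / "context canceled", as in the log above.
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    query := `sum(increase(route_response_total{direction="inbound"}[1m])) by (rt_route, dst, classification)`
    val, warnings, err := promAPI.Query(ctx, query, time.Now())
    if err != nil {
        fmt.Println("query failed:", err)
        return
    }
    if len(warnings) > 0 {
        fmt.Println("warnings:", warnings)
    }
    fmt.Println(val)
}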

brianstorti (Contributor) commented Sep 16, 2019

Just to add to this one: I'm seeing Prometheus use all the CPU available on a worker node when I open the Linkerd dashboard. Memory doesn't seem to be a big issue for me, and it doesn't seem to be related to which page in the dashboard is open; even opening just the "overview" page is enough to trigger the CPU spike.

I don't notice the issue when I only open Grafana; it only happens with Linkerd's own dashboard.

I was running it on a 4-core node and it was using 100% of the CPU, starving all other pods. I edited the Prometheus deployment to add a limit of 1 core, and that makes the dashboard a bit flaky (and usually pretty slow), even with only one tab open.
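
The limit is just a resources stanza on the prometheus container in the linkerd-prometheus Deployment, roughly this (the value is what I used, not a recommendation):

# Sketch of the edit to the prometheus container spec.
resources:
  limits:
    cpu: "1"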

[screenshot: 2019-09-16 15:16:40]

I currently have ~100 meshed pods. This specific cluster is running 10 nodes (m5.xlarge, 4 cores and 16GB each) on Linkerd 2.5.

Please let me know if I can provide more information that could be useful.

ihcsim (Contributor) commented Sep 17, 2019

@brianstorti Thanks for bringing this up. Can you try updating the linkerd-prometheus-config config map with the changes in https://github.com/linkerd/linkerd2/pull/3401/files#diff-26bef37e1506c3f8b33144756cb7e919R62-R68? I am very curious to see if it will resolve the intense CPU consumption you are seeing. (Note that you may need to redeploy the Linkerd control plane to see the difference, in case cadvisor is already emitting a lot of metrics.)
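
In case it helps to see the shape of such a change: trimming high-cardinality cadvisor series is done with metric_relabel_configs on the scrape job. The fragment below is only a generic illustration of that technique (the metric names kept here are examples), not a copy of the PR; see the linked diff for the actual rules.

# Generic illustration of trimming cadvisor series in a scrape config.
- job_name: kubernetes-nodes-cadvisor
  # ...existing scrape settings...
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: container_(cpu_usage_seconds_total|memory_working_set_bytes)
      action: keep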

Memory doesn't seem to be a big issue for me

I'm curious about how much memory Prometheus is consuming. kubectl -n linkerd top po should give us an idea.

~100 meshed pods

Hmm... I am a bit surprised by this number. On AKS, with 4 cores and 14GB of memory, I was able to get to about 1,000 pods before my Prometheus started to suffocate. Do you have many other workloads sharing the same node as Prometheus? For bigger clusters, I find using a node selector and taint/toleration to isolate Prometheus to be helpful.
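
If you go that route, the isolation is just a nodeSelector plus a matching taint/toleration on the linkerd-prometheus pod spec; the label and taint key below are made up for this sketch.

# Sketch: dedicate a node to Prometheus.
#   kubectl label node <node> workload=prometheus
#   kubectl taint node <node> workload=prometheus:NoSchedule
# Then on the linkerd-prometheus pod spec:
nodeSelector:
  workload: prometheus
tolerations:
  - key: workload
    operator: Equal
    value: prometheus
    effect: NoSchedule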

I am also curious about your service-to-service communication pattern. Do you have X meshed clients talking to Y meshed servers, where Y >> X (by many times)?

Finally, can you port-forward to your Linkerd Prometheus and run the following PromQL for me? These queries will impose additional load on your Prometheus, so don't do this on a prod cluster. You might also have to scale down the number of meshed pods.

kubectl -n linkerd port-forward svc/linkerd-prometheus 9090
topk(5, count({job="linkerd-proxy"}) by (__name__))
topk(5, count({job="kubernetes-nodes-cadvisor"}) by (__name__))

brianstorti (Contributor) commented
Here are the results of the Prometheus queries:

[screenshots: results of the two topk queries, 2019-09-17]

Here you can see the CPU and memory usage (this is Prometheus running on a "dedicated" 4-core node):

[screenshot: linkerd-prometheus CPU and memory usage]

Do you have many other workloads sharing the same node as Prometheus?

Not many, but yeah, I was not using a node selector, so it was sharing the worker node with a few other pods. Now I'm running Prometheus on a dedicated 4-core node, but I still see it using 100% of the CPU.

I am also curious about your service-to-service communication pattern. Do you have X meshed clients talking to Y meshed servers, where Y >> X (by many times)?

We have one meshed service that receives requests from ~15 clients, and a service that sends requests to ~15 other services, but other than that, things are pretty evenly distributed.

I can try the configmap change later today and let you know if it changes anything.

brianstorti (Contributor) commented Sep 17, 2019

@ihcsim I tried applying these changes to linkerd-prometheus-config and restarted all Linkerd deployments, but didn't notice any difference in the CPU usage. Memory usage did drop significantly though.

[screenshot: 2019-09-17 11:49:52]
