-
We run Prometheus in a sharded config (2 shards, 1 replica) in about 40 different k8s clusters that host a large production service. They're pretty big instances: each shard scrapes about 3 million time-series per scrape cycle and evaluates ~300 rules. They also remote-write their metrics to a global Thanos cluster we run. We've been running all of the zones on Prometheus 2.30.3 since February 2022. With 2.30.3, each Prometheus instance was consuming ~15 vCPUs. We recently updated two of our largest clusters to 2.44.0, and we were (pleasantly) shocked to find that CPU utilization had dropped to around 5-6 vCPUs per instance. There have obviously been a ton of changes between 2.30.3 and 2.44.0, but I'm curious which change or changes could have caused such a dramatic improvement in CPU utilization. I skimmed through the release notes for all the releases, but nothing jumped out, or I just missed it.
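For readers not familiar with sharded Prometheus, here is a minimal sketch of one common way to set it up (hashmod relabelling plus remote_write); the job name, shard label, modulus, and Thanos endpoint below are illustrative placeholders, not our actual config:

```yaml
global:
  external_labels:
    shard: "0"                           # placeholder shard label

scrape_configs:
  - job_name: kubernetes-pods            # placeholder job
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Spread discovered targets across 2 shards by hashing the target address.
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_hash
        action: hashmod
      # Each shard keeps only the targets whose hash matches its shard index.
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep

remote_write:
  # Placeholder endpoint for the global Thanos cluster.
  - url: https://thanos-receive.example.com/api/v1/receive
```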
-
Quite a few, but you're right, we could outline those and be prouder of it (: It sounds like it deserves a blog post (: WDYT? @bboreham, would you like to give us some screenshots of heap (mem) and CPU graphs before & after? (:
-
All profiles are for 30 seconds.

shard1-pprof-before.gz shows 282 CPU-seconds, so 9.2 CPUs active. shard1-pprof-after.gz shows 218 CPU-seconds, 6.4 CPUs active. This is a bit less than the 15->6 you first mentioned, but still a decent drop.

In "before" we have 190s in scrapePool.Sync, plus 65s in background garbage collection. In "after" we have 118s in scrapePool.Sync, plus 50s in garbage collection. The detail confirms that #12048 and #12084 gave big improvements.

Even after this, nearly all the time is going into producing Labels to show in the 'dropped targets' view, which I have proposed to restrict. When using Kubernetes this issue can be avoided by filtering targets using namespaces and selectors in preference to drop rules.
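To make that last point concrete, here is an illustrative comparison; the job name, namespace, and label are made-up examples, not a recommendation for any particular setup. Discovering every pod and then dropping the unwanted ones with relabelling means Prometheus still builds Labels for every dropped target:

```yaml
scrape_configs:
  - job_name: my-app                     # example job
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keeps only pods labelled app=my-app, but every other pod is still
      # discovered and relabelled first, so dropped targets still cost CPU.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: my-app
        action: keep
```

whereas filtering server-side with namespaces and selectors means those targets are never returned by the Kubernetes API in the first place:

```yaml
scrape_configs:
  - job_name: my-app                     # example job
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - my-app-namespace           # example namespace
        selectors:
          - role: pod
            label: "app=my-app"          # server-side label selector
```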