Replies: 5 comments 1 reply
-
There is already the If you know your data well, you can adjust the block duration (
-
@faangbait I think you are correctly understanding that Prometheus was designed with strict constraints, primarily that everything runs as one process. A common approach for large estates is to split the metrics across multiple Prometheus servers, e.g. by app or by region. There are several projects which take the core of Prometheus and split pieces of it to run distributed across multiple machines. I haven't kept up with all of them, but Mimir does not need to bring all the data to one place for most queries; see the PromCon talk about query-sharding.

On the point of detail about deduplicating strings: none of the currently-merged implementations do that. I wrote up some details in this document. There have been some proposals to do it (#5316, #11833), plus I am working on another, described as option 5 in the doc. However, labels are ~30% of the memory in a typical Prometheus (see my PromCon talk), so this is not going to solve your whole problem.

I am minded to click the "convert to discussion" button.
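To illustrate the "split the metrics across multiple Prometheus servers" approach: Prometheus supports horizontal sharding of scrape targets via `hashmod` relabeling, where each server keeps only the targets whose address hashes to its shard number. A minimal sketch (job name, service-discovery config, and shard count are illustrative, not from the thread):

```yaml
# prometheus.yml for shard 0 of 4 (run one server per shard,
# identical except for the regex in the keep rule)
scrape_configs:
  - job_name: app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # hash each target's address into one of 4 buckets
      - source_labels: [__address__]
        modulus: 4
        target_label: __tmp_shard
        action: hashmod
      # keep only targets belonging to this server's shard
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```

Each shard then holds roughly 1/4 of the series, at the cost of needing a query layer (federation, Thanos, Mimir, etc.) to see the whole picture.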
-
That piece of the puzzle seems well-architected. The piece I'm curious about is specifically aggregating at the scrape level.

Say you've got 10,000 pods that will be horizontally scaled. You don't particularly care about the metrics of any particular pod, but you need to know the average. Prometheus puts an `instance` label on each scraped target, which makes a lot of sense, because you need to know what to scrape next time too. We could remove the `instance` label by just setting it to a static value and honoring it, but that wouldn't remove the address meta-label, which means a 10,000x explosion. So you shard the scrapers, and that's fine, but then you lose the aggregation. Victoria structures its memory differently but arrives at the same problem.

It seems like the key reason everyone arrives at the same answer is that the only way to take an average is to load all the values into memory. So... we're back to infinite memory. Zapier's aggregator version of Pushgateway seems like the best solution, but pushing metrics is generally bad, and they don't have much development help.
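For what it's worth, the usual workaround for "I only care about the average, not the pods" is a recording rule that aggregates away `instance` at rule-evaluation time, so dashboards and alerts query a single cheap series while the raw per-pod series still pay the memory cost in TSDB. A sketch, assuming a hypothetical `http_requests_total` counter:

```yaml
# recording-rules.yml: pre-aggregate across instances
groups:
  - name: aggregate-away-instance
    rules:
      # average per-second rate across all pods of a job;
      # metric and rule names here are illustrative
      - record: job:http_requests:rate5m:avg
        expr: avg without (instance) (rate(http_requests_total[5m]))
```

This reduces query cost, but as noted above it doesn't touch ingestion-time cardinality, which is the actual memory problem being described.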
-
Until you do. It's not uncommon, especially in non-trivial networks, to have hot spots or misbehaving infrastructure that disproportionately affects a handful of instances. These few instances are still "healthy", but are serving enough slow requests to trip our p99 SLO threshold, and we have to cordon the node for investigation. Without the
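To make that trade-off concrete: with classic histograms, the fleet-wide p99 and the per-instance p99 are separate aggregations, and only the latter can pinpoint the misbehaving handful. A sketch, assuming a hypothetical `request_duration_seconds` histogram:

```yaml
# recording-rules.yml: both views of the same latency histogram
groups:
  - name: latency-slo
    rules:
      # fleet-wide p99: cheap to store, but hides hot instances
      - record: job:request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (le, job) (rate(request_duration_seconds_bucket[5m])))
      # per-instance p99: needed to find which nodes to cordon
      - record: instance:request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (le, job, instance) (rate(request_duration_seconds_bucket[5m])))
```

If the `instance` label is aggregated away at scrape time, the second rule has nothing left to compute over, which is exactly the investigation scenario described above.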
-
I like the idea that metrics are aggregated most of the time, at the point of scrape as far as possible, but that there is a button on the UI to say "give me full detail here" that stops the aggregation within some scope for a period, e.g. 10 minutes. I'm not aware of specific work in the Prometheus domain to achieve this, but I have been in some blue-sky conversations about it. One nit:
You can distribute computation of sums of values and sums of counts, then bring those together to average. At least one downstream project does this.
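The distributed-average point can be sketched as recording rules: each shard computes local partial sums, and a global layer divides total sum by total count, which is exact, unlike averaging per-shard averages. Metric and rule names below are illustrative:

```yaml
# per-shard rule file: each Prometheus computes local partial aggregates
groups:
  - name: distributed-average
    rules:
      # sum of observed values across this shard's instances
      - record: job:request_duration_seconds_sum:rate5m
        expr: sum without (instance) (rate(request_duration_seconds_sum[5m]))
      # count of observations across this shard's instances
      - record: job:request_duration_seconds_count:rate5m
        expr: sum without (instance) (rate(request_duration_seconds_count[5m]))
```

A global query layer can then compute the true average as `sum(job:request_duration_seconds_sum:rate5m) / sum(job:request_duration_seconds_count:rate5m)`, without ever loading the raw per-instance samples in one place.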
-
Proposal
We've got a product that needs to roll into prod and is going to demand (pick a number) terabytes of memory to support the cardinality it's sending.
Are you guys making any progress towards, like, quantum memory or a singularity? How exactly should we answer this request?
No matter how you slice it, you're making us store string text for every time series, so at minimum we're talking half a kilobyte of overhead for six kilobytes of data. And sure, we can shard, but why can't I say, "actually, how about we DON'T keep two hours of samples in memory; we aggregate after five minutes and send it straight to Thanos so it becomes bwplotka's problem."
Agent was a good start, but it just pushes the problem further down the pipe, because eventually we've got to load everything back into memory to aggregate it, no?
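One partial version of "aggregate and ship it downstream" that exists today is combining recording rules with a filtered `remote_write`: record the aggregates locally, then forward only those series to the long-term store. A sketch (the endpoint URL is illustrative, and the `job:` prefix is just the common recording-rule naming convention):

```yaml
# prometheus.yml: ship only pre-aggregated recording-rule series
remote_write:
  - url: http://thanos-receive.example:19291/api/v1/receive
    write_relabel_configs:
      # keep only series produced by recording rules
      # (conventionally prefixed "job:"); raw per-instance
      # series are dropped from the outgoing stream
      - source_labels: [__name__]
        regex: "job:.*"
        action: keep
```

This shrinks what goes over the wire and what the remote store keeps, but, as the proposal points out, the local Prometheus still holds the raw head block in memory, so it doesn't solve the two-hours-of-samples problem.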