Replies: 5 comments 1 reply
-
There is already the If you know your data well, you can adjust the block duration (
-
@faangbait I think you are correctly understanding that Prometheus was designed with strict constraints, primarily that everything runs as one process. A common approach for large estates is to split the metrics across multiple Prometheus servers, e.g. by app or by region. There are several projects which take the core of Prometheus and split pieces of it to run distributed across multiple machines. I haven't kept up with all of them, but Mimir does not need to bring all the data to one place for most queries; see the PromCon talk about query-sharding.

On the point of detail about deduplicating strings: none of the currently-merged implementations do that. I wrote up some details in this document. There have been some proposals to do it (#5316, #11833), plus I am working on another, described as option 5 in the doc. However, labels are ~30% of the memory in a typical Prometheus (see my PromCon talk), so this is not going to solve your whole problem.

I am minded to click the "convert to discussion" button.
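To illustrate the "split the metrics across multiple Prometheus servers" approach: Prometheus supports horizontal sharding of scrape targets via `hashmod` relabeling, where each server keeps only the targets whose address hashes to its shard number. A minimal sketch (job name, service-discovery config, and shard count are illustrative, not from the thread):

```yaml
# prometheus.yml for shard 0 of 4 (run one server per shard,
# identical except for the regex in the keep rule)
scrape_configs:
  - job_name: app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # hash each target's address into one of 4 buckets
      - source_labels: [__address__]
        modulus: 4
        target_label: __tmp_shard
        action: hashmod
      # keep only targets belonging to this server's shard
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```

Each shard then holds roughly 1/4 of the series, at the cost of needing a query layer (federation, Thanos, Mimir, etc.) to see the whole picture.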
-
That piece of the puzzle seems well-architected. The piece I'm curious about is specifically aggregating at the scrape level.

Say you've got 10,000 pods that will be horizontally scaled. You don't particularly care about the metrics of any particular pod, but you need to know the average. Prometheus puts an `instance` label on each scraped target, which makes a lot of sense, because you need to know what to scrape next time too. We could remove the `instance` label by just setting it to a static value and honoring it, but that wouldn't remove the address meta-label, which means a 10,000x explosion. So you shard the scrapers, and that's fine, but then you lose the aggregation. Victoria structures its memory differently but arrives at the same problem.

It seems like the key reason everyone arrives at the same answer is that the only way to take an average is to load all the values into memory. So... we're back to infinite memory. Zapier's aggregator version of Pushgateway seems like the best solution, but pushing metrics is generally bad, and they don't have much development help.
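For what it's worth, the usual workaround for "I only care about the average, not the pods" is a recording rule that aggregates away `instance` at rule-evaluation time, so dashboards and alerts query a single cheap series while the raw per-pod series still pay the memory cost in TSDB. A sketch, assuming a hypothetical `http_requests_total` counter:

```yaml
# recording-rules.yml: pre-aggregate across instances
groups:
  - name: aggregate-away-instance
    rules:
      # average per-second rate across all pods of a job;
      # metric and rule names here are illustrative
      - record: job:http_requests:rate5m:avg
        expr: avg without (instance) (rate(http_requests_total[5m]))
```

This reduces query cost, but as noted above it doesn't touch ingestion-time cardinality, which is the actual memory problem being described.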
-
Until you do. It's not uncommon, especially in non-trivial networks, to have hot spots or misbehaving infrastructure that disproportionately affects a handful of instances. These few instances are still "healthy", but are serving enough slow requests to trip our p99 SLO threshold, and we have to cordon the node for investigation. Without the
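To make that trade-off concrete: with classic histograms, the fleet-wide p99 and the per-instance p99 are separate aggregations, and only the latter can pinpoint the misbehaving handful. A sketch, assuming a hypothetical `request_duration_seconds` histogram:

```yaml
# recording-rules.yml: both views of the same latency histogram
groups:
  - name: latency-slo
    rules:
      # fleet-wide p99: cheap to store, but hides hot instances
      - record: job:request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (le, job) (rate(request_duration_seconds_bucket[5m])))
      # per-instance p99: needed to find which nodes to cordon
      - record: instance:request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (le, job, instance) (rate(request_duration_seconds_bucket[5m])))
```

If the `instance` label is aggregated away at scrape time, the second rule has nothing left to compute over, which is exactly the investigation scenario described above.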
-
I like the idea that metrics are aggregated most of the time, at the point of scrape as far as possible, but that there is a button on the UI to say "give me full detail here" that stops the aggregation within some scope for a period, e.g. 10 minutes. I'm not aware of specific work in the Prometheus domain to achieve this, but I have been in some blue-sky conversations about it. One nit:
You can distribute computation of sums of values and sums of counts, then bring those together to average. At least one downstream project does this.
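The distributed-average point can be sketched as recording rules: each shard computes local partial sums, and a global layer divides total sum by total count, which is exact, unlike averaging per-shard averages. Metric and rule names below are illustrative:

```yaml
# per-shard rule file: each Prometheus computes local partial aggregates
groups:
  - name: distributed-average
    rules:
      # sum of observed values across this shard's instances
      - record: job:request_duration_seconds_sum:rate5m
        expr: sum without (instance) (rate(request_duration_seconds_sum[5m]))
      # count of observations across this shard's instances
      - record: job:request_duration_seconds_count:rate5m
        expr: sum without (instance) (rate(request_duration_seconds_count[5m]))
```

A global query layer can then compute the true average as `sum(job:request_duration_seconds_sum:rate5m) / sum(job:request_duration_seconds_count:rate5m)`, without ever loading the raw per-instance samples in one place.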
-
Proposal
We've got a product that needs to roll into prod and is going to demand (pick a number) terabytes of memory to support the cardinality it's sending.
Are you guys making any progress towards, like, quantum memory or a singularity? How exactly should we answer this request?
No matter how you slice it, you're making us store string text for every time series, so at minimum we're talking half a kilobyte of overhead for six kilobytes of data. And sure, we can shard, but why can't I say, "actually, how about we DON'T keep two hours of samples in memory; we aggregate after five minutes and send it straight to Thanos so it becomes bwplotka's problem."
Agent was a good start, but it just pushes the problem further down the pipe, because eventually we've got to load everything back into memory to aggregate it, no?
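One partial version of "aggregate and ship it downstream" that exists today is combining recording rules with a filtered `remote_write`: record the aggregates locally, then forward only those series to the long-term store. A sketch (the endpoint URL is illustrative, and the `job:` prefix is just the common recording-rule naming convention):

```yaml
# prometheus.yml: ship only pre-aggregated recording-rule series
remote_write:
  - url: http://thanos-receive.example:19291/api/v1/receive
    write_relabel_configs:
      # keep only series produced by recording rules
      # (conventionally prefixed "job:"); raw per-instance
      # series are dropped from the outgoing stream
      - source_labels: [__name__]
        regex: "job:.*"
        action: keep
```

This shrinks what goes over the wire and what the remote store keeps, but, as the proposal points out, the local Prometheus still holds the raw head block in memory, so it doesn't solve the two-hours-of-samples problem.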