`scrape_config` can use "prefix set" inclusion/exclusion instead of regex #10955

Freyert · 2022-05-02T21:24:50Z

Proposal

Use case. Why is this important?
When I onboard a new Prometheus exporter into a system, my teammates and I review the exposed metrics to determine which are valuable to retain and which should be dropped.

Prometheus metric names are usually prefixed with the "subsystem" that the metric belongs to. During metric selection I usually pick 2-3 metrics from each subsystem. The disparate subsystems, with varying numbers of child metrics, make efficient regular expressions very difficult in my experience, and I assume that most users in my situation concatenate all the metric names into a single regex using |. I expect this concatenated "monster regex" to be both time and space inefficient.

In my experience, the Go regex engine compiles | patterns into a loop over each alternative. If we evaluate that loop over N metric names and there are M alternatives, the algorithm is O(M*N) to decide whether to keep or drop each metric. I would also expect O(M) memory to hold the patterns.

Set inclusion would require upfront time to build the set data structure, but should be much more efficient per comparison.

Let's look at an example scrape config from Prometheus Trainings:

action: keep
source_labels: [__name__]
regex: '(api_|http_).*'

Regex works excellently because we want to keep all the metrics of these two subsystems, and drop the rest. What if there are many subsystems of http and you only want to select a handful of metrics?

action: keep
source_labels: [__name__]
regex: '(api_|http_(headers_|body_bytes_)).*' # just for demonstration

A more ergonomic (maybe) and performant (maybe) solution would be to use "prefix set inclusion" via a trie.

action: keep
source_labels: [__name__]
label_prefixes:
- api_*
- http_headers_*
- http_body_bytes_*

My actual case is even more extreme: I have about 15 different subsystems, each containing one or two metrics that I want to keep. In my situation I would prefer to explicitly list all the metrics I want to keep and have Prometheus compile the list into an efficient prefix matching structure.
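
To make the idea concrete, here is a minimal sketch of a prefix set backed by a character trie. It is only an illustration of the proposal, not anything Prometheus implements: the `PrefixSet` class and its `matches` method are hypothetical names, and the trailing `*` from the `label_prefixes` example is assumed to be stripped before the prefixes are stored.

```python
class PrefixSet:
    """A set of string prefixes backed by a character trie (hypothetical helper)."""

    _END = object()  # sentinel key marking the end of a stored prefix

    def __init__(self, prefixes):
        self._root = {}
        for prefix in prefixes:
            node = self._root
            for ch in prefix:
                # Walk or create one trie level per character.
                node = node.setdefault(ch, {})
            node[self._END] = True

    def matches(self, name):
        """Return True if any stored prefix is a prefix of name."""
        node = self._root
        for ch in name:
            if self._END in node:
                return True  # a stored prefix ended here
            node = node.get(ch)
            if node is None:
                return False  # no stored prefix continues with this character
        return self._END in node


# Prefixes from the example above, with the trailing "*" removed (assumption).
keep = PrefixSet(["api_", "http_headers_", "http_body_bytes_"])
print(keep.matches("api_requests_total"))
print(keep.matches("http_body_size"))
```

Each lookup walks at most one trie level per character of the metric name, so its cost depends on the name length rather than on how many prefixes were registered.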

In my mind the situation here is similar to pattern matching URLs in a web framework or load balancer. Prefix matching is preferred due to efficiencies, but regex is great to have for captures and complex patterns.

roidelapluie · 2022-05-02T22:40:36Z

Maintainer

Is this causing actual performance issues? We should only run relabeling once per metric, even when scraped multiple times.

Additionally, Prometheus regexes are optimized for prefix patterns: #7453

However, there might be interest in supporting matching directly in client libraries (client_java and client_python apparently can do that), so you could pass a list in the scrape parameters and the filtered-out metrics would never use network resources:

params:
 metrics_match[]:
   - http_rest.*
   - http_api.*

cc @fstab for more context

Freyert · 2022-05-03T18:31:04Z

Author

I just went through the exercise of actually building the regular expression so we'll see if there are performance implications.

You could say that there is a performance issue around precision with my example.

I have 89 metrics I want to keep, but the exporter exposes 1461. I spent a few hours coming up with a regular expression that will match as few metrics as possible without being exhausting to maintain. I was able to reduce my kept metrics to 270 which is 181 more metrics than I actually want to keep.

This is the regular expression:

If you examine the regex in more depth, it is entirely prefix matching, but that is cumbersome to deduce on account of the regex syntax.

Really, the best way for me to move forward with a regex-based solution is to keep a list of the individual metrics I want and the group prefixes, then join them together with |.
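
A minimal sketch of that workflow, assuming the lists are maintained separately and the regex is generated from them. The helper name `build_keep_regex` and the sample metric names are hypothetical. Python's `re.fullmatch` stands in here for the fully anchored matching that Prometheus applies to relabel regexes.

```python
import re


def build_keep_regex(exact_names, prefixes):
    """Join exact metric names and prefix patterns into one alternation."""
    # Escape each entry so it is matched literally.
    alternatives = [re.escape(name) for name in exact_names]
    # A prefix becomes "<literal>.*" so anything after it is accepted.
    alternatives += [re.escape(prefix) + ".*" for prefix in prefixes]
    return "(" + "|".join(alternatives) + ")"


# Hypothetical lists: one exact metric name and two group prefixes.
pattern = build_keep_regex(["go_goroutines"], ["api_", "http_headers_"])
print(pattern)
print(bool(re.fullmatch(pattern, "api_requests_total")))
```

The generated string can then be pasted into the `regex` field of a `keep` rule, while the lists stay readable and easy to review.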

I think I understand the trick behind #7453. If people did not add ^ ... $ anchors when doing prefix searches in metric queries, the regex engine would do a lot of unnecessary searching within the string, when it could just use the HasPrefix or HasSuffix methods.

I can see how that works with some small regexes for label matching.
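
Roughly, my understanding of that kind of fast path looks like the sketch below. This is not the code from #7453: `compile_matcher` is a hypothetical helper, and it assumes patterns are fully anchored and that `.` also matches newlines.

```python
import re

# Characters that have special meaning in a regex.
_META = set(".^$*+?{}[]\\|()")


def compile_matcher(pattern):
    """Return a function str -> bool for a fully anchored pattern."""
    if pattern.endswith(".*"):
        literal = pattern[:-2]
        if not any(ch in _META for ch in literal):
            # "<literal>.*" is a plain prefix test, so skip the regex engine.
            return lambda s: s.startswith(literal)
    # Anything more complex falls back to the regex engine.
    compiled = re.compile(pattern, re.DOTALL)
    return lambda s: compiled.fullmatch(s) is not None


is_http = compile_matcher("http_.*")          # takes the prefix fast path
is_kept = compile_matcher("(api_|http_).*")   # falls back to the regex
print(is_http("http_requests_total"), is_kept("node_cpu_seconds_total"))
```

Note that the fast path only triggers for a single literal prefix. An alternation of many prefixes still goes through the regex engine in this sketch.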

roidelapluie · 2022-05-08T22:39:00Z

Maintainer

Maybe that's a good case to fix the exporter, too.

roidelapluie · 2022-07-01T15:21:11Z

Maintainer

Another idea is to use a list:

- source_labels: [__name__]
  regex: '(api_|http_(headers_|body_bytes_)).*'
  target_label: __tmp_keep
  replacement: keep
- source_labels: [__name__]
  regex: 'mongodb_(foo|bar)'
  target_label: __tmp_keep
  replacement: keep
- source_labels: [__tmp_keep]
  regex: keep
  action: keep
- regex: __tmp_keep
  action: labeldrop
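
To see why this list behaves like a single combined keep rule, here is a small simulation of the rule semantics. It is a simplified sketch, not Prometheus's relabeling code: `apply_rules` is a hypothetical helper, `replacement` is treated as literal text (no `$1` expansion), regexes are fully anchored, and the first regex is written in a balanced form, which is an assumption about the intended pattern.

```python
import re


def apply_rules(labels, rules):
    """Apply simplified relabel rules; return new labels, or None if dropped."""
    labels = dict(labels)
    for rule in rules:
        action = rule.get("action", "replace")  # "replace" is the default action
        regex = re.compile(rule.get("regex", "(.*)"), re.DOTALL)
        if action == "labeldrop":
            # Remove every label whose name matches the regex.
            labels = {k: v for k, v in labels.items() if not regex.fullmatch(k)}
            continue
        # Source label values are joined with ";"; missing labels count as "".
        value = ";".join(labels.get(n, "") for n in rule.get("source_labels", []))
        matched = regex.fullmatch(value)
        if action == "keep":
            if not matched:
                return None  # the series is dropped
        elif action == "replace" and matched:
            labels[rule["target_label"]] = rule["replacement"]
    return labels


RULES = [
    {"source_labels": ["__name__"], "regex": "(api_|http_(headers_|body_bytes_)).*",
     "target_label": "__tmp_keep", "replacement": "keep"},
    {"source_labels": ["__name__"], "regex": "mongodb_(foo|bar)",
     "target_label": "__tmp_keep", "replacement": "keep"},
    {"source_labels": ["__tmp_keep"], "regex": "keep", "action": "keep"},
    {"regex": "__tmp_keep", "action": "labeldrop"},
]

print(apply_rules({"__name__": "mongodb_foo"}, RULES))
print(apply_rules({"__name__": "mongodb_baz"}, RULES))
```

Each `replace` rule only marks a series with the temporary label, so more short rules can be appended without growing any single regex. The final two rules do the actual filtering and cleanup.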

2 replies

unmilb Aug 24, 2023

@roidelapluie, have you tested the above solution, or is it just an idea? When I tried the same, it did not work. I have multiple jobs and want only, let's say, 1000 metrics, but the regex gets too long, which does not look pretty and is not easy to maintain.

roidelapluie Aug 24, 2023
Maintainer

Yes, it should work.
