
Prometheus counter decreases by 1 for some time series data #13950

Open
ashishvaishno opened this issue Apr 18, 2024 · 7 comments

Comments

@ashishvaishno

What did you do?

I recently noticed a huge spike in one of our metrics.

If you look at the highlighted value 7756564 at epoch 1712719298.819, the new entry is 1 less than the previous one. This is the reason for the spike in the rate/increase functions.
There was no restart of Prometheus or the target in this case. What can contribute to this dip in value?
[Screenshot 2024-04-17 at 10 24 46]

Below is a graph of the data for 2 weeks:
[Screenshot 2024-04-17 at 10 28 45]

Here is a screenshot of the spike:
[Screenshot 2024-04-17 at 10 32 10]

What did you expect to see?

I would expect the counter not to decrease.

What did you see instead? Under which circumstances?

We are running an HA setup of Prometheus (2 StatefulSets) with Thanos.

System information

Linux 5.10.192-183.736.amzn2.x86_64 x86_64

Prometheus version

prometheus, version 2.45.0 (branch: HEAD, revision: 8ef767e396bf8445f009f945b0162fd71827f445)
  build user:       root@920118f645b7
  build date:       20230623-15:09:49
  go version:       go1.20.5
  platform:         linux/amd64
  tags:             netgo,builtinassets,stringlabels

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

I have enabled debug logs on Prometheus now; I will update the thread if I see something.
@prymitive
Contributor

This is unlikely to be a bug in Prometheus; most likely it's a problem on your end.
If you look at the timestamps you’ll notice that they are duplicated: there are always two samples ~20ms apart from each other. You might be scraping the same target twice, or two different targets end up with identical time series.
When everything works smoothly you won’t notice any problems. But if there’s a delay with either of these scrapes, it might result in data like you see above, mostly because the timestamp of each sample is the beginning of the scrape request.
If one scrape starts and gets delayed on a DNS lookup or connect attempt, but the other one is fast, then the slow scrape might end up with a lower timestamp but a higher value.
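
For illustration, a sketch of how two interleaved scrapes of the same series can end up stored out of order (timestamps and values below are made up, not taken from your data):

  scrape A starts at t=100.000, stalls on DNS/connect, and only reads the counter
           after it has been incremented to 1001
           -> stored sample: (t=100.000, value=1001)
  scrape B starts at t=100.020, completes quickly, and reads the counter while it is
           still 1000
           -> stored sample: (t=100.020, value=1000)

  merged and ordered by timestamp: 1001 followed by 1000
  rate()/increase() treats the drop as a counter reset and produces a large spike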

@ashishvaishno
Author

ashishvaishno commented Apr 19, 2024

@prymitive Is there a way to handle this? I need to run 2 StatefulSets of Prometheus, and while Thanos does take care of deduplication, this delay might be difficult to manage, right?

@prymitive
Contributor

Handle what exactly?
In Prometheus you’re supposed to have unique labels on all time series. Automatic injection of job and instance labels usually ensures this.
So first you need to understand why you have two scrapes that result in the same time series.
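
For context, a minimal sketch (not the actual configuration in this issue) of how series from different targets normally stay distinct, and where that can break:

  scrape_configs:
    - job_name: kubernetes-services-pods
      kubernetes_sd_configs:
        - role: pod
      # Prometheus attaches job="kubernetes-services-pods" and instance="<host:port>"
      # to every scraped sample, which normally keeps targets apart.
      # If relabeling overwrites or drops instance, or honor_labels lets two targets
      # expose identical label sets, their samples can collapse into one series.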

@ashishvaishno
Author

ashishvaishno commented Apr 19, 2024

@prymitive
I have different labels for the metrics. Since we have two Prometheus instances, they both scrape data at a 60s interval, offset by when each StatefulSet started. Thanos takes care of de-duplication in these situations. Now, if I understood your point correctly, there are a "few" moments in time when the scrape times of the counter are slightly off, and when the data is aggregated and queried in Thanos I get the issue.
Or do I have the wrong understanding?

Example: these are the labels for my metric:

request_count_total{app="test", exported_id="test-594b9d94fc-kgdcg", exported_service="test", id="test-594b9d94fc-kgdcg", instance="172.26.19.57", job="kubernetes-services-pods", name="test", namespace="test", pod_template_hash="594b9d94fc", prometheus="monitoring/prometheus-stack-kube-prom-prometheus", service="test", system="INTERNET"}

On prom-0, I have this value:
[Screenshot 2024-04-19 at 10 46 56]
On prom-1, I have this value:
[Screenshot 2024-04-19 at 10 47 20]

On the Thanos Querier:

[Screenshot 2024-04-19 at 10 46 34]

The scrape duration on these endpoints is less than 0.1 sec as well.

@prymitive
Contributor

If you use Thanos and that’s where you see this problem, then maybe Thanos is merging two counters from two different Prometheus servers into a single time series?
Try your query on both Prometheus servers directly; if that works, then you need to add some unique external labels on each Prometheus.
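
One way to do that check (a sketch; the selector is based on the labels quoted earlier in this thread):

  # Run on prom-0 and prom-1 separately, not through Thanos:
  resets(request_count_total{job="kubernetes-services-pods"}[1d])
  # A non-zero result on an individual server points at the target or the scrape;
  # zero on both servers but spikes through Thanos points at deduplication/merging.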

@ashishvaishno
Author

ashishvaishno commented Apr 19, 2024

@prymitive I am already adding a global external label promethues_replica: $(POD_NAME) in the Prometheus config, which is then used in Thanos queries for de-duplication as --query.replica-label=prometheus_replica.

@roidelapluie
Member

Indeed, samples 20ms apart come from two different Prometheus servers. It looks like a configuration issue on the Thanos side.

In your last comment you have a typo: promethues_replica. Is it like that in your config too?
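
For reference, deduplication only works when the external label name on each Prometheus matches the replica label given to the Thanos Querier exactly; a sketch using the names from this thread:

  # prometheus.yml on each replica
  global:
    external_labels:
      prometheus_replica: $(POD_NAME)   # must match the Querier flag below exactly

  # Thanos Querier flag
  --query.replica-label=prometheus_replica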
