
Prometheus counter decreases by 1 for some time series data #13950

Open
ashishvaishno opened this issue Apr 18, 2024 · 7 comments

Comments

@ashishvaishno

What did you do?

I recently noticed a huge spike in one of our metrics.

If you look at the highlighted value 7756564 at epoch 1712719298.819, the new entry is 1 less than the previous one. This is the reason for the spike in the rate/increase functions.
There was no restart of Prometheus or the target in this case. What can contribute to this dip in value?
[Screenshot 2024-04-17 at 10 24 46]

Below is a graph of the data for 2 weeks:
[Screenshot 2024-04-17 at 10 28 45]

Here is a screenshot of the spike:
[Screenshot 2024-04-17 at 10 32 10]

What did you expect to see?

I would expect the counter not to decrease.

What did you see instead? Under which circumstances?

We are running an HA setup of Prometheus (2 StatefulSets) with Thanos.

System information

Linux 5.10.192-183.736.amzn2.x86_64 x86_64

Prometheus version

prometheus, version 2.45.0 (branch: HEAD, revision: 8ef767e396bf8445f009f945b0162fd71827f445)
  build user:       root@920118f645b7
  build date:       20230623-15:09:49
  go version:       go1.20.5
  platform:         linux/amd64
  tags:             netgo,builtinassets,stringlabels

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

I have enabled debug logs on Prometheus now; I will update the thread if I see something.
@prymitive
Contributor

This is unlikely to be a bug in Prometheus; most likely it's a problem on your end.
If you look at the timestamps you’ll notice that they are duplicated: there are always two samples ~20ms apart from each other. You might be scraping the same target twice, or two different targets end up with identical time series.
When everything works smoothly you won’t notice any problems. But if there’s a delay with either of these scrapes, it might result in data like you see above, mostly because the timestamp of each sample is the beginning of the scrape request.
If one scrape starts and gets delayed on a DNS lookup or connect attempt, but the other one is fast, then the slow scrape might end up with a lower timestamp but a higher value.
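
For illustration, a sketch of how two interleaved scrapes of the same series can end up stored out of order (timestamps and values below are made up, not taken from your data):

  scrape A starts at t=100.000, stalls on DNS/connect, and only reads the counter
           after it has been incremented to 1001
           -> stored sample: (t=100.000, value=1001)
  scrape B starts at t=100.020, completes quickly, and reads the counter while it is
           still 1000
           -> stored sample: (t=100.020, value=1000)

  merged and ordered by timestamp: 1001 followed by 1000
  rate()/increase() treats the drop as a counter reset and produces a large spike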

@ashishvaishno
Author

ashishvaishno commented Apr 19, 2024

@prymitive Is there a way to handle this? I need to run 2 StatefulSets of Prometheus, and while Thanos does take care of deduplication, this delay might be difficult to manage, right?

@prymitive
Contributor

Handle what exactly?
In Prometheus you’re supposed to have unique labels on all time series. Automatic injection of job and instance labels usually ensures this.
So first you need to understand why you have two scrapes that result in the same time series.
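
For context, a minimal sketch (not the actual configuration in this issue) of how series from different targets normally stay distinct, and where that can break:

  scrape_configs:
    - job_name: kubernetes-services-pods
      kubernetes_sd_configs:
        - role: pod
      # Prometheus attaches job="kubernetes-services-pods" and instance="<host:port>"
      # to every scraped sample, which normally keeps targets apart.
      # If relabeling overwrites or drops instance, or honor_labels lets two targets
      # expose identical label sets, their samples can collapse into one series.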

@ashishvaishno
Author

ashishvaishno commented Apr 19, 2024

@prymitive
I have different labels for the metrics. Since we have two Prometheus instances, they both scrape data at a 60s interval, offset by when each StatefulSet started. Thanos takes care of de-duplication in these situations. Now, if I understood your point correctly, there are a "few" moments in time when the scrape times of the counter are slightly off, and when the data is aggregated and queried in Thanos I get the issue.
Or do I have the wrong understanding?

Example: these are the labels for my metric:

request_count_total{app="test", exported_id="test-594b9d94fc-kgdcg", exported_service="test", id="test-594b9d94fc-kgdcg", instance="172.26.19.57", job="kubernetes-services-pods", name="test", namespace="test", pod_template_hash="594b9d94fc", prometheus="monitoring/prometheus-stack-kube-prom-prometheus", service="test", system="INTERNET"}

On prom-0, I have this value:
[Screenshot 2024-04-19 at 10 46 56]
On prom-1, I have this value:
[Screenshot 2024-04-19 at 10 47 20]

On the Thanos Querier:

[Screenshot 2024-04-19 at 10 46 34]

The scrape duration on these endpoints is less than 0.1 sec as well.

@prymitive
Contributor

If you use Thanos and that’s where you see this problem, then maybe Thanos is merging two counters from two different Prometheus servers into a single time series?
Try your query on both Prometheus servers directly; if that works, then you need to add some unique external labels on each Prometheus.
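
One way to do that check (a sketch; the selector is based on the labels quoted earlier in this thread):

  # Run on prom-0 and prom-1 separately, not through Thanos:
  resets(request_count_total{job="kubernetes-services-pods"}[1d])
  # A non-zero result on an individual server points at the target or the scrape;
  # zero on both servers but spikes through Thanos points at deduplication/merging.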

@ashishvaishno
Author

ashishvaishno commented Apr 19, 2024

@prymitive I am already adding a global external label promethues_replica: $(POD_NAME) in the Prometheus config, which is then used in Thanos queries for de-duplication as --query.replica-label=prometheus_replica.

@roidelapluie
Member

Indeed, samples 20ms apart come from two different Prometheus servers. It looks like a configuration issue on the Thanos side.

In your last comment you have a typo: promethues_replica. Is it like that in your config too?
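
For reference, deduplication only works when the external label name on each Prometheus matches the replica label given to the Thanos Querier exactly; a sketch using the names from this thread:

  # prometheus.yml on each replica
  global:
    external_labels:
      prometheus_replica: $(POD_NAME)   # must match the Querier flag below exactly

  # Thanos Querier flag
  --query.replica-label=prometheus_replica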
