What did you do?
We started experiencing data corruption after we began using remote_write.queue_config.sample_age_limit. We want to drop old samples after our disaster recovery tests so that we see fresh data as soon as possible (we do not care much about samples scraped during the test itself). After these tests, which are basically the only time sample_age_limit applies, since we try to make sure we never hit it during normal operation, we have seen unexpected counter resets as well as new, unexpected time series.
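The relevant part of the remote_write configuration looks roughly like this (the endpoint URL and the limit value are illustrative placeholders, not our exact production settings):

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push   # placeholder endpoint
    queue_config:
      # Drop samples older than this instead of retrying them indefinitely.
      # The value shown here is illustrative.
      sample_age_limit: 30m
```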
What did you expect to see?
Hitting sample_age_limit should only drop old data.
What did you see instead? Under which circumstances?
Our production setup is Prometheus with remote write directed towards a Mimir cluster. When the remote write endpoint is not reachable for longer than sample_age_limit, we have seen corrupted data ingested into the remote storage. In both of the cases below it is just 2-3 samples, right after the remote write endpoint resumes operation, Prometheus drops the old data, and ingestion starts again.
So far we have noticed two cases:
Unexpected samples get appended to an existing time series. The timestamps do not align with our scrape_interval, and the values do not fit the rest of the series either. This probably happens to any metric type, but we have only noticed it with counters, because the "fake" values are interpreted as a counter reset, causing huge spikes in our graphs (see the query sketch after the second case below).
A new time series is created, again with just a few samples right after remote_write resumes ingesting data into the remote storage, just like in the previous case. We have noticed this for series that are clearly nonsense; it looks as if existing series get mixed together, as in this example from a test where Prometheus was scraping only its own exposed metrics and node_exporter: go_gc_duration_seconds_count{cluster="local-test", instance="localhost:9100", job="node-exporter", name="systemd-networkd.service", state="activating", type="notify-reload"}
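The bogus values are easiest to spot as counter resets on the remote-storage side. A sketch of such a query, wrapped in a recording rule (the metric, window, and rule name are just an example, not part of our setup):

```yaml
# Illustrative only: flags counter resets on a metric that should reset
# only when the scraped process actually restarts.
groups:
  - name: sample-age-limit-repro
    rules:
      - record: repro:go_gc_duration_seconds_count:resets15m
        expr: resets(go_gc_duration_seconds_count[15m])
```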
We have been able to reproduce the above in a setup with a locally running Prometheus binary configured with remote_write towards our staging Mimir cluster. The outage was simulated with iptables -I OUTPUT -d __IP__ -j DROP.
We have also tried to reproduce it in a setup with two Prometheus instances, one serving as a remote_write receiver (see the attached docker-compose). In this scenario we have not seen any corrupted data in the receiving Prometheus, but we do get log lines complaining about corrupted data.
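Purely as an illustrative sketch of that topology (image tag, ports, and file names here are assumptions, not the attached compose file), the two-instance setup is along these lines: the receiver runs with --web.enable-remote-write-receiver and the sender points its remote_write at it.

```yaml
# Sketch only -- not the attached compose file.
services:
  receiver:
    image: prom/prometheus:v2.50.1
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-remote-write-receiver   # accepts pushes on /api/v1/write
    ports:
      - "9091:9090"
  sender:
    image: prom/prometheus:v2.50.1
    volumes:
      # sender.yml is assumed to contain a remote_write block with
      # url: http://receiver:9090/api/v1/write and the same
      # queue_config.sample_age_limit as above.
      - ./sender.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
```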
System information
Linux 6.5.0-27-generic x86_64
Prometheus version
prometheus, version 2.50.1 (branch: HEAD, revision: 8c9b0285360a0b6288d76214a75ce3025bce4050)
build user: root@6213bb3ee580
build date: 20240226-11:36:26
go version: go1.21.7
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
Prometheus configuration file
Alertmanager version
No response
Alertmanager configuration file
No response
Logs