Corrupting data written to remote storage in case sample_age_limit is hit #13979

Open
david-vavra opened this issue Apr 23, 2024 · 3 comments · May be fixed by #14078

david-vavra commented Apr 23, 2024

What did you do?

We started experiencing data corruption after we began using remote_write.queue_config.sample_age_limit. We want to drop old samples after our disaster recovery tests so that fresh data is visible as soon as possible (we do not care much about samples scraped during the test itself). After these tests, which are basically the only time sample_age_limit applies since we try not to hit it during normal operation, we have seen unexpected counter resets as well as new, unexpected time series.
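
For reference, the setting lives under remote_write.queue_config; the following is just a minimal excerpt, with the URL and limit taken from the test configuration attached below:

remote_write:
- url: http://receive-prometheus:9090/api/v1/write
  queue_config:
    # Samples older than this are dropped instead of being retried;
    # in our setup this should only trigger after a prolonged outage.
    sample_age_limit: 1m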

What did you expect to see?

Hitting sample_age_limit should only drop old data.

What did you see instead? Under which circumstances?

Our production setup is Prometheus with remote write directed towards a Mimir cluster. When the remote write endpoint is not reachable for longer than sample_age_limit, we see corruption in the data ingested into remote storage. In both cases below it affects just 2-3 samples, right after the remote write endpoint resumes operation and Prometheus drops the old data and starts ingesting again.

So far we have noticed two cases:

  1. Unexpected samples get appended to an existing time series. The timestamp does not align with our scrape_interval, and the value does not fit the series either. It probably happens with any metric type, but we have only noticed it with counters, as the "fake" values are interpreted as a counter reset, causing huge spikes in our graphs.
  2. A new time series is created, again with just a few samples right after remote write resumes sending data to remote storage. The cases we have noticed are clearly nonsense; it seems that existing series get mixed together, as in this example from a test where Prometheus scraped only its own metrics and node_exporter - go_gc_duration_seconds_count{cluster="local-test", instance="localhost:9100", job="node-exporter", name="systemd-networkd.service", state="activating", type="notify-reload"}

We have been able to reproduce the above with a locally running Prometheus binary configured with remote_write towards our staging Mimir cluster. The outage was simulated with iptables -I OUTPUT -d __IP__ -j DROP.

We have also tried to reproduce it in a setup with two Prometheus instances, one serving as a remote write receiver (see the attached docker-compose). In this scenario we haven't seen any corrupted data in the receiving Prometheus, but we do get logs complaining about corrupted data; see the Logs section below.

System information

Linux 6.5.0-27-generic x86_64

Prometheus version

prometheus, version 2.50.1 (branch: HEAD, revision: 8c9b0285360a0b6288d76214a75ce3025bce4050)
  build user:       root@6213bb3ee580
  build date:       20240226-11:36:26
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo,builtinassets,stringlabels

Prometheus configuration file

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  evaluation_interval: 30s
  external_labels:
    __replica__: prometheus-local-test-0
    cluster: local-test
scrape_configs:
- job_name: prometheus-scif-monitoring
  honor_timestamps: false
  track_timestamps_staleness: false
  scrape_interval: 30s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /metrics
  scheme: http
  enable_compression: false
  follow_redirects: false
  enable_http2: false
  static_configs:
  - targets:
    - localhost:9090
- job_name: node-exporter
  honor_timestamps: false
  track_timestamps_staleness: false
  scrape_interval: 5s
  scrape_timeout: 1s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  scheme: http
  enable_compression: false
  follow_redirects: false
  enable_http2: false
  metrics_path: /metrics
  static_configs:
  - targets:
    - node-exporter:9100
remote_write:
- url: http://receive-prometheus:9090/api/v1/write
  remote_timeout: 12s
  follow_redirects: true
  enable_http2: true
  queue_config:
    capacity: 100
    max_shards: 3
    min_shards: 12
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s
    sample_age_limit: 1m
  metadata_config:
    send: true
    send_interval: 1m
    max_samples_per_send: 100

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

prometheus          | ts=2024-04-23T10:06:18.038Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://receive-prometheus:9090/api/v1/write\": context deadline exceeded"
prometheus          | ts=2024-04-23T10:07:26.043Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://receive-prometheus:9090/api/v1/write\": context deadline exceeded"
prometheus          | ts=2024-04-23T10:08:34.050Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://receive-prometheus:9090/api/v1/write\": context deadline exceeded"
prometheus          | ts=2024-04-23T10:09:11.336Z caller=dedupe.go:112 component=remote level=info remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Remote storage resharding" from=12 to=3
receive-prometheus  | ts=2024-04-23T10:09:11.998Z caller=write_handler.go:77 level=error component=web msg="Error appending remote write" err="label name \"job\" is not unique: invalid sample"
receive-prometheus  | ts=2024-04-23T10:09:11.998Z caller=write_handler.go:77 level=error component=web msg="Error appending remote write" err="label name \"job\" is not unique: invalid sample"
prometheus          | ts=2024-04-23T10:09:12.029Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"job\" is not unique: invalid sample"
receive-prometheus  | ts=2024-04-23T10:09:12.035Z caller=write_handler.go:134 level=error component=web msg="Out of order sample from remote write" err="duplicate sample for timestamp" series="{__name__=\"go_gc_duration_seconds\", __replica__=\"prometheus-local-test-0\", cluster=\"local-test\", instance=\"node-exporter:9100\", job=\"node-exporter\"}" timestamp=1713866946725
receive-prometheus  | ts=2024-04-23T10:09:12.035Z caller=write_handler.go:77 level=error component=web msg="Error appending remote write" err="label name \"job\" is not unique: invalid sample"
prometheus          | ts=2024-04-23T10:09:12.035Z caller=dedupe.go:112 component=remote level=error remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="non-recoverable error" count=2000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: duplicate sample for timestamp"

david-vavra commented Apr 23, 2024

Attached is the docker compose we use to replicate the issue.
docker-compose.yml

EDIT:
Steps to reproduce:

  1. run the attached compose (a rough sketch of a comparable setup is shown after these steps)
  2. simulate an outage of the remote write receiving endpoints: date; docker pause mimir receive-prometheus
  3. wait for at least sample_age_limit
  4. resume remote write: date; docker unpause mimir receive-prometheus
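
Since the attachment is not reproduced inline, here is only a rough sketch of what a comparable Compose setup could look like; it is not the attached file. The receiver relies on Prometheus's --web.enable-remote-write-receiver flag, the Mimir service paused in step 2 is omitted, and the images, tags, and mounted paths are illustrative:

services:
  prometheus:
    image: prom/prometheus:v2.50.1
    volumes:
      # the scrape + remote_write configuration shown above
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  receive-prometheus:
    image: prom/prometheus:v2.50.1
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      # accept remote write requests on /api/v1/write
      - --web.enable-remote-write-receiver
    volumes:
      - ./receive-prometheus.yml:/etc/prometheus/prometheus.yml
  node-exporter:
    image: prom/node-exporter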


FUSAKLA commented May 1, 2024

@marctc have you experienced similar issues? You might have the context as the author of this feature 🙏


FUSAKLA commented May 10, 2024

Attempt to fix this: #14078
