
Prometheus too old sample issue #13972

Open
zbd20 opened this issue Apr 23, 2024 · 0 comments

Comments


zbd20 commented Apr 23, 2024

What did you do?

I am using Prometheus v2.47.0 in a production environment, with samples sent from a Prometheus Agent to a Prometheus Server via remote write. At first everything was normal, but one day both the Prometheus Agent and the Prometheus Server started logging errors simultaneously. From then on, remote write kept failing: samples could no longer be sent from the Agent to the Server, and the Server had no data to serve. The related logs are as follows.

What did you expect to see?

No error or warning logs; Prometheus remote write keeps working properly.

What did you see instead? Under which circumstances?

**Logs of the Prometheus Server**

ts=2024-04-19T19:00:23.278Z caller=head.go:1298 level=info component=tsdb msg="Head GC completed" caller=truncateOOO duration=177.89714ms
ts=2024-04-19T19:00:23.291Z caller=compact.go:708 level=info component=tsdb msg="Found overlapping blocks during compaction" ulid=01HVVVPBKJ6ZPPFT1ZAKNJQ5D0
ts=2024-04-19T19:00:34.900Z caller=compact.go:464 level=info component=tsdb msg="compact blocks" count=2 mint=1712858400000 maxt=1713052800000 ulid=01HVVVPBKJ6ZPPFT1ZAKNJQ5D0 sources="[01HVVMTJJF4M4N0AD5DB4GWJHK 01HVVVP46HXTE8Y0V059C1PVW1]" duration=11.618858661s
ts=2024-04-19T19:00:34.993Z caller=db.go:1463 level=warn component=tsdb msg="Overlapping blocks found during reloadBlocks" detail="[mint: 1713376800000, maxt: 1713384000000, range: 2h0m0s, blocks: 2]: <ulid: 01HVVMTWRFJ98FFAR80Q1V16T7, mint: 1713247200000, maxt: 1713441600000, range: 54h0m0s>, <ulid: 01HVVVPAX03W5C02BERVRNPVYM, mint: 1713376800000, maxt: 1713384000000, range: 2h0m0s>\n[mint: 1713384000000, maxt: 1713391200000, range: 2h0m0s, blocks: 2]: <ulid: 01HVVMTWRFJ98FFAR80Q1V16T7, mint: 1713247200000, maxt: 1713441600000, range: 54h0m0s>, <ulid: 01HVVVPB1KFK9BPRXH4YTSSS13, mint: 1713384000000, maxt: 1713391200000, range: 2h0m0s>\n[mint: 1713391200000, maxt: 1713398400000, range: 2h0m0s, blocks: 2]: <ulid: 01HVVMTWRFJ98FFAR80Q1V16T7, mint: 1713247200000, maxt: 1713441600000, range: 54h0m0s>, <ulid: 01HVVVPB7M5R29B6QPS0Q6H4PH, mint: 1713391200000, maxt: 1713398400000, range: 2h0m0s>"
ts=2024-04-19T19:00:35.159Z caller=db.go:1619 level=info component=tsdb msg="Deleting obsolete block" block=01HVVMTJJF4M4N0AD5DB4GWJHK
ts=2024-04-19T19:00:35.162Z caller=db.go:1619 level=info component=tsdb msg="Deleting obsolete block" block=01HVVVP46HXTE8Y0V059C1PVW1
ts=2024-04-19T19:00:35.179Z caller=compact.go:708 level=info component=tsdb msg="Found overlapping blocks during compaction" ulid=01HVVVPQ6WR9TWGNWHKADKC9CD
ts=2024-04-19T19:00:59.310Z caller=compact.go:464 level=info component=tsdb msg="compact blocks" count=4 mint=1713247200000 maxt=1713441600000 ulid=01HVVVPQ6WR9TWGNWHKADKC9CD sources="[01HVVMTWRFJ98FFAR80Q1V16T7 01HVVVPAX03W5C02BERVRNPVYM 01HVVVPB1KFK9BPRXH4YTSSS13 01HVVVPB7M5R29B6QPS0Q6H4PH]" duration=24.146765124s
ts=2024-04-19T19:00:59.335Z caller=db.go:1619 level=info component=tsdb msg="Deleting obsolete block" block=01HVVVPAX03W5C02BERVRNPVYM
ts=2024-04-19T19:00:59.552Z caller=db.go:1619 level=info component=tsdb msg="Deleting obsolete block" block=01HVVMTWRFJ98FFAR80Q1V16T7
ts=2024-04-19T19:00:59.555Z caller=db.go:1619 level=info component=tsdb msg="Deleting obsolete block" block=01HVVVPB1KFK9BPRXH4YTSSS13
ts=2024-04-19T19:00:59.557Z caller=db.go:1619 level=info component=tsdb msg="Deleting obsolete block" block=01HVVVPB7M5R29B6QPS0Q6H4PH
ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:28.546Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:30.490Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:34.358Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:39.393Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"

......

ts=2024-04-23T07:30:00.472Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-23T07:30:05.544Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-23T07:30:10.706Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-23T07:30:15.741Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-23T07:30:20.798Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-23T07:30:25.829Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-23T07:30:30.895Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-23T07:30:36.030Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"

**Logs of the Prometheus Agent**
ts=2024-04-19T10:37:15.381Z caller=db.go:621 level=info msg="series GC completed" duration=57.876035ms
ts=2024-04-19T12:37:15.440Z caller=db.go:621 level=info msg="series GC completed" duration=58.146495ms
ts=2024-04-19T12:37:15.440Z caller=checkpoint.go:100 level=info msg="Creating checkpoint" from_segment=108 to_segment=109 mint=1713529927000
ts=2024-04-19T12:37:18.083Z caller=db.go:691 level=info msg="WAL checkpoint complete" first=108 last=109 duration=2.701587989s
ts=2024-04-19T14:37:18.174Z caller=db.go:621 level=info msg="series GC completed" duration=87.806249ms
ts=2024-04-19T16:37:18.276Z caller=db.go:621 level=info msg="series GC completed" duration=99.576596ms
ts=2024-04-19T16:37:18.277Z caller=checkpoint.go:100 level=info msg="Creating checkpoint" from_segment=110 to_segment=111 mint=1713544323000
ts=2024-04-19T16:37:21.258Z caller=db.go:691 level=info msg="WAL checkpoint complete" first=110 last=111 duration=3.081670852s
ts=2024-04-19T18:37:21.330Z caller=db.go:621 level=info msg="series GC completed" duration=70.392311ms
ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:37:21.427Z caller=db.go:621 level=info msg="series GC completed" duration=94.488464ms
ts=2024-04-19T20:37:21.428Z caller=checkpoint.go:100 level=info msg="Creating checkpoint" from_segment=112 to_segment=113 mint=1713558496000
ts=2024-04-19T20:37:24.536Z caller=db.go:691 level=info msg="WAL checkpoint complete" first=112 last=113 duration=3.203556221s
ts=2024-04-19T20:37:31.335Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:37:55.483Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1713558806 minSendTimestamp=1713559055
ts=2024-04-19T20:38:05.482Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1713558806 minSendTimestamp=1713559065
ts=2024-04-19T20:38:15.483Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1713558806 minSendTimestamp=1713559075
ts=2024-04-19T20:38:25.484Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1713558806 minSendTimestamp=1713559085
ts=2024-04-19T20:38:31.694Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:38:35.483Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1713558806 minSendTimestamp=1713559095
ts=2024-04-19T20:38:45.483Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1713558806 minSendTimestamp=1713559105
ts=2024-04-19T20:38:55.483Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1713558806 minSendTimestamp=1713559115

......

ts=2024-04-23T07:23:47.309Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-23T07:24:47.769Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-23T07:25:48.331Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-23T07:26:48.895Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-23T07:27:49.299Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-23T07:28:49.735Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-23T07:29:50.300Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-23T07:30:51.184Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
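The `lastSendTimestamp`/`minSendTimestamp` values in the resharding warnings are Unix seconds. Decoding them (illustration only) shows the Agent's last successful send lines up with the moment the first "too old sample" error appeared on the Server:

```python
from datetime import datetime, timezone

last_send = 1713558806  # lastSendTimestamp from the first resharding warning
min_send = 1713559055   # minSendTimestamp from the same log line

print(datetime.fromtimestamp(last_send, tz=timezone.utc).isoformat())
# 2024-04-19T20:33:26+00:00 -- coincides with the first "too old sample" error
print(min_send - last_send)  # 249 -> the last send is 249s beyond the threshold
```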

System information

No response

Prometheus version

prometheus, version 2.47.0 (branch: HEAD, revision: efa34a5840661c29c2e362efa76bc3a70dccb335)
  build user:       root@4f2c12e526ab
  build date:       20231002-15:09:56
  go version:       go1.20.8
  platform:         linux/amd64
  tags:             netgo,builtinassets,stringlabels

Prometheus configuration file

# the Prometheus Server configuration:
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: ccos-monitoring/k8s
    prometheus_replica: prometheus-k8s-0
  keep_dropped_targets: 1
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
scrape_configs: []
storage:
  tsdb:
    out_of_order_time_window: 1w
alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: prometheus_replica
  - regex: null
    target_label: ccos_io_alert_source
    replacement: platform
    action: replace
  - action: labeldrop
    regex: prometheus
  - action: labeldrop
    regex: prometheus_replica
  alertmanagers:
  - path_prefix: /
    scheme: https
    tls_config:
      insecure_skip_verify: false
      server_name: alertmanager-main.ccos-monitoring.svc
      ca_file: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - ccos-monitoring
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    api_version: v2
    relabel_configs:
    - action: keep
      source_labels:
      - __meta_kubernetes_service_name
      regex: alertmanager-main
    - action: keep
      source_labels:
      - __meta_kubernetes_endpoint_port_name
      regex: web

---
# the Prometheus Agent configuration:
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: ccos-monitoring/agent-0
    prometheus_replica: prometheus-agent-0-0
  keep_dropped_targets: 1
scrape_configs:
- job_name: serviceMonitor/ccos-apiserver-operator/ccos-apiserver-operator/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - ccos-apiserver-operator
  scrape_interval: 30s
  scheme: https
  ......
remote_write:
- url: https://prometheus-k8s-0.ccos-monitoring:9091/api/v1/write
  remote_timeout: 30s
  name: prometheus-k8s-0
  write_relabel_configs:
  - target_label: __tmp_ccos_cluster_id__
    replacement: 965a1484-7d9c-4b94-a69f-6353792022a2
    action: replace
  - regex: __tmp_ccos_cluster_id__
    action: labeldrop
  bearer_token: *********
  tls_config:
    insecure_skip_verify: true
  queue_config:
    capacity: 10000
    min_shards: 1
    max_shards: 500
    max_samples_per_send: 2000
    batch_send_deadline: 10s
    min_backoff: 30ms
    max_backoff: 5s
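As a simplified model of the failure (an assumption sketched for illustration, not Prometheus source code): with out-of-order ingestion enabled, the server rejects a remote-write sample as "too old" when its timestamp falls before roughly the head's max time minus the configured `out_of_order_time_window`:

```python
# Simplified illustration of the acceptance rule, assuming the error fires when
# a sample is older than (head max time - out_of_order_time_window).
OOO_WINDOW_MS = 7 * 24 * 3600 * 1000  # out_of_order_time_window: 1w, in ms

def would_accept(sample_ts_ms: int, head_max_ts_ms: int,
                 ooo_window_ms: int = OOO_WINDOW_MS) -> bool:
    """Return True if a remote-write sample would be ingested under this model."""
    return sample_ts_ms >= head_max_ts_ms - ooo_window_ms
```

Under this model, once the Agent's oldest unsent samples drift more than a week behind the Server's head, every retried batch keeps failing with HTTP 500, which would match the days of identical retries in the logs above.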

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

No response
