query: with grafana gives inconsistent dashboard with every refresh #5501

Closed · maartengo opened this issue Jul 14, 2022 · 23 comments · Fixed by #5583

@maartengo

Thanos, Prometheus and Golang version used:

Object Storage Provider:

What happened:
When using the Thanos Querier as the data source in Grafana, the dashboard data becomes inconsistent on every refresh (see the animation below). This doesn't happen if Grafana points directly to the Thanos sidecar. The store gateway is not in use.

It may well be a configuration issue, but we have no idea what could cause it.

(Animation: GIF showing the dashboard data changing on each refresh)

What you expected to happen:
Consistent data between refreshes

How to reproduce it (as minimally and precisely as possible):
We haven't tried to reproduce this in a minimal setup, but it happens on all of our environments (10+), all running the same configuration.
I can supply the complete values.yaml privately if needed, but it boils down to:

```yaml
thanos:
  storegateway:
    enabled: false
  query:
    enabled: true
    replicaLabel:
      - prometheus_replica
    dnsDiscovery:
      sidecarsService: 'prometheus-stack-kube-prom-thanos-discovery'
      sidecarsNamespace: 'prometheus'

kube-prometheus-stack:
  grafana:
    sidecar:
      datasources:
        url: http://prometheus-stack-thanos-query:9090/
        initDatasources: true
      dashboards:
        searchNamespace: ALL
        labelValue: null # Needs to be null in order to load our dashboards
  prometheus:
    replicas: 3
    thanosService:
      enabled: true
    thanosServiceMonitor:
      enabled: true
    service:
      sessionAffinity: 'ClientIP'
    prometheusSpec:
      thanos:
        objectStorageConfig:
          key: config-file.yaml
          name: thanos-secret
```

Chart:

```yaml
  - name: kube-prometheus-stack
    version: 36.6.2
    repository: https://prometheus-community.github.io/helm-charts
  - name: thanos
    version: 10.5.5
    repository: https://charts.bitnami.com/bitnami
```

Full logs to relevant components:

  • Grafana: no logging occurs
  • Query: no logging occurs

Anything else we need to know:

Environment:
K8S on AKS. First time deploying Thanos.

@maartengo
Author

maartengo commented Jul 14, 2022

Some things we tried:

  • Running one or multiple instances of Prometheus
  • Running one or multiple instances of the Querier
  • Running without Prometheus, only with data from the store gateway

We also noticed that the data shown in the Grafana dashboards is not necessarily wrong in itself; it is data from a different graph/view within the same dashboard. For example, a graph that should only contain 0 and 1 values shows CPU values (0-100). It also looks like data from multiple views is mixed together rather than simply swapped.

@GiedriusS
Member

It's by design due to how deduplication works - in the beginning, we need to select one "true" series. Thanos uses a heuristic - it chooses the series with the most up-to-date information. So, on each refresh, you might get a different "true" series. The typical way to solve this is to have query-frontend in front which caches past results so they won't change on each refresh. Let me know if that helps.
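For anyone who wants to try this, a minimal sketch of running query-frontend with an in-memory response cache in front of an existing querier (the downstream URL is illustrative and flag names can differ slightly between versions - check `thanos query-frontend --help` for yours):

```bash
# Hypothetical example: in-memory response cache config for query-frontend.
cat > response-cache.yaml <<'EOF'
type: IN-MEMORY
config:
  max_size: "512MB"
  max_size_items: 0
  validity: 0s
EOF

# Put query-frontend in front of the querier so repeated range queries are
# served from the cache instead of being re-evaluated on every refresh.
thanos query-frontend \
  --http-address=0.0.0.0:9090 \
  --query-frontend.downstream-url=http://prometheus-stack-thanos-query:9090 \
  --query-range.split-interval=24h \
  --query-range.response-cache-config-file=response-cache.yaml
```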

@maartengo
Author

Unfortunately, we get the same effect with the query-frontend. The displayed data keeps changing on refresh, even with a static time range selected. The results are correct when viewing a single dashboard panel, but that was already the case before we used the query-frontend.

@rouke-broersma

> @GiedriusS: It's by design due to how deduplication works - in the beginning, we need to select one "true" series. Thanos uses a heuristic - it chooses the series with the most up-to-date information. So, on each refresh, you might get a different "true" series. The typical way to solve this is to have query-frontend in front which caches past results so they won't change on each refresh. Let me know if that helps.

Hi, what we seem to be seeing is that thanos-query returns data belonging to other panels/queries of the same dashboard within a single query result. We can usually find another panel in the same dashboard where the wrong graph would make sense, and on refresh the graph usually moves to the correct panel.

If we inspect the query response, we can see that it contains data that does not make sense for the given query. For example, we have a query that should only return 1 or 0. Among the 1s and 0s there are, for some reason, time series with values like 1945. This should not be possible. If we execute the query standalone from the thanos-query UI, the correct result is returned, but when it is executed from a Grafana dashboard, a mixed result is returned.
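One way to make this visible from the command line (a sketch - `response.json` stands for whatever you saved from Grafana's query inspector, and the jq path assumes the standard Prometheus range-query response format):

```bash
# Print each returned series next to its maximum sample value; for a metric
# that should only be 0 or 1, anything larger means foreign samples leaked in.
jq -r '.data.result[] | [(.metric | tostring), ([.values[][1] | tonumber] | max)] | @tsv' response.json
```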

@roelvdberg

roelvdberg commented Jul 15, 2022

I have removed Grafana from the test. I ran a bash script that executes multiple queries asynchronously, and the data was scrambled again after a few attempts.

The bash script I used is below:

```bash
curl 'http://prometheus-stack-thanos-query-frontend:9090/api/v1/query_range' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Referer: ' \
  -H 'content-type: application/x-www-form-urlencoded' \
  --data-raw 'query=mcps%3Akubelet_volume_stats_available%3Apercentage_mcps_workload%7BcustomerEnv%3D%22dev1%22%2CcustomerKey%3D%22ismcpsdev-dev3%22%7D&start=1657874085&end=1657877685&step=15' \
  --compressed > q1.json &

curl 'http://prometheus-stack-thanos-query-frontend:9090/api/v1/query_range' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Referer: ' \
  -H 'content-type: application/x-www-form-urlencoded' \
  --data-raw 'query=mcps%3Acontainer_resource_memory_usage%3Apercentage%7BcustomerEnv%3D%22dev1%22%2CcustomerKey%3D%22ismcpsdev-dev3%22%7D&start=1657874085&end=1657877685&step=15' \
  --compressed > q2.json &

curl 'http://prometheus-stack-thanos-query-frontend:9090/api/v1/query_range' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Referer: ' \
  -H 'content-type: application/x-www-form-urlencoded' \
  --data-raw 'query=mcps%3Acontainer_resource_cpu_usage%3Apercentage%7BcustomerEnv%3D%22dev1%22%2CcustomerKey%3D%22ismcpsdev-dev3%22%7D&start=1657874085&end=1657877685&step=15' \
  --compressed > q3.json &

curl 'http://prometheus-stack-thanos-query-frontend:9090/api/v1/query_range' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'Referer: ' \
  -H 'content-type: application/x-www-form-urlencoded' \
  --data-raw 'query=mcps%3Anode_memory_pressure%3Atrue%7BcustomerEnv%3D%22dev1%22%2CcustomerKey%3D%22ismcpsdev-dev3%22%7D&start=1657874085&end=1657877685&step=15' \
  --compressed > q4.json &
```
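To check the four result files for cross-contamination, something like this works (assuming the standard Prometheus response format; each file should only contain the recording rule its query asked for):

```bash
# List which metric names actually ended up in each result file; any name
# other than the one queried means the responses were mixed up.
for f in q1.json q2.json q3.json q4.json; do
  echo "== $f"
  jq -r '.data.result[].metric.__name__' "$f" | sort | uniq -c
done
```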

@ddreier

ddreier commented Jul 15, 2022

We're seeing the same thing after upgrading Thanos from 0.24.0 to 0.27.0, with Grafana 9.0.3 and Prometheus 2.32.1.

@UUIDNIE

UUIDNIE commented Jul 15, 2022

Can confirm, saw the same issue using Grafana 9.0.3 with Thanos 0.27.0 + Prometheus 2.32.1.

@GiedriusS
Member

Is it still reproducible if you disable deduplication? 🤔
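For reference, deduplication can be disabled per request - either by unchecking "Use Deduplication" in the Thanos Query UI or, as a sketch reusing one of the queries from the script above, by passing `dedup=false` to the HTTP API:

```bash
# Same range query as in the reproduction script, with deduplication disabled.
curl 'http://prometheus-stack-thanos-query:9090/api/v1/query_range' \
  -H 'content-type: application/x-www-form-urlencoded' \
  --data-raw 'query=mcps%3Anode_memory_pressure%3Atrue%7BcustomerEnv%3D%22dev1%22%2CcustomerKey%3D%22ismcpsdev-dev3%22%7D&start=1657874085&end=1657877685&step=15&dedup=false' \
  --compressed > q_nodedup.json
```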

@maartengo
Author

This still happens even without deduplication

@GiedriusS
Member

GiedriusS commented Jul 20, 2022

Mhm, I see it now on my own Thanos cluster on some of the panels but it's still unclear to me what causes it or how to reproduce it consistently. Perhaps if any of you have some spare time, could you please try this with different versions between 0.24.0 & 0.27.0 to see when it starts occurring?

If it happens even without deduplication, then 9a8c984 d218e60 shouldn't cause it. 🤔

Maybe 54a0deb is the culprit? Could you please try to see if reverting it helps? 🤔 However, my cluster doesn't have this commit yet so it must not be it.

@rphua

rphua commented Jul 22, 2022

We've tested with 0.24.0, 0.25.0, and 0.26.0; none of them seem to have the issue. It only starts occurring with 0.27.0.

@cuyoung

cuyoung commented Jul 23, 2022

We saw the same behavior after upgrading from 0.25.2 to 0.27.0. Downgrading just the Querier to 0.26.0 resolved the issue.

In our case, Grafana was receiving out-of-order samples and illogical results (values in the billions when they should be in the hundreds).

@ddreier

ddreier commented Jul 25, 2022

We had gone back down to Thanos 0.24.0, but have since moved to 0.26.0 and are not having any problems.

@gwpries

gwpries commented Aug 3, 2022

Another org checking in that has had to downgrade to 0.26. We upgraded to 0.27 and saw some pretty wild values on our Grafana dashboards (something that only reports 0 or 1 showing results in the thousands). Ended up here trying to find out if anyone else was having the issue.

@GiedriusS
Member

Could you please try reverting #5410 and see if it helps? I'll try looking into this once I have some time.

@dswarbrick

I also encountered this after upgrading from 0.26.0 to 0.27.0. Initially it appeared to be present only when using the query-frontend, but I later encountered it with just the plain querier. It's also not specific to Grafana - I noticed the same weird data even when using the Thanos Query web UI.

@GiedriusS
Member

GiedriusS commented Aug 10, 2022

Was able to reproduce this myself after updating. Reverting #5410 doesn't help, so it must be something more serious. Will try to reproduce it locally && fix it. Must be some kind of race condition.

GiedriusS added a commit to GiedriusS/thanos that referenced this issue Aug 10, 2022
Fix a data race between Respond() and query/queryRange functions by
returning an extra optional function from instrumented functions that
releases the resources i.e. calls Close().

Cannot reproduce the following race:

```
==================
WARNING: DATA RACE
Write at 0x00c00566fa00 by goroutine 562:
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/giedrius/go/pkg/mod/github.com/vinted/prometheus@v1.8.2-0.20220808145920-5c879a061105/promql/engine.go:1450 +0x8044
  github.com/prometheus/prometheus/promql.(*evaluator).rangeEval()
      /home/giedrius/go/pkg/mod/github.com/vinted/prometheus@v1.8.2-0.20220808145920-5c879a061105/promql/engine.go:1060 +0x2684
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/giedrius/go/pkg/mod/github.com/vinted/prometheus@v1.8.2-0.20220808145920-5c879a061105/promql/engine.go:1281 +0x42a4
  github.com/prometheus/prometheus/promql.(*evaluator).rangeEval()
      /home/giedrius/go/pkg/mod/github.com/vinted/prometheus@v1.8.2-0.20220808145920-5c879a061105/promql/engine.go:1060 +0x2684
  github.com/prometheus/prometheus/promql.(*evaluator).eval()
      /home/giedrius/go/pkg/mod/github.com/vinted/prometheus@v1.8.2-0.20220808145920-5c879a061105/promql/engine.go:1281 +0x42a4
  github.com/prometheus/prometheus/promql.(*evaluator).Eval()
      /home/giedrius/go/pkg/mod/github.com/vinted/prometheus@v1.8.2-0.20220808145920-5c879a061105/promql/engine.go:989 +0xf5
  github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt()
      /home/giedrius/go/pkg/mod/github.com/vinted/prometheus@v1.8.2-0.20220808145920-5c879a061105/promql/engine.go:645 +0xa77
  github.com/prometheus/prometheus/promql.(*Engine).exec()
      /home/giedrius/go/pkg/mod/github.com/vinted/prometheus@v1.8.2-0.20220808145920-5c879a061105/promql/engine.go:595 +0x71e
  github.com/prometheus/prometheus/promql.(*query).Exec()
      /home/giedrius/go/pkg/mod/github.com/vinted/prometheus@v1.8.2-0.20220808145920-5c879a061105/promql/engine.go:197 +0x250
  github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).query()
      /home/giedrius/dev/thanos/pkg/api/query/v1.go:387 +0xbf2
  github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).query-fm()

  ...
  Previous read at 0x00c00566fa00 by goroutine 570:
  github.com/prometheus/prometheus/promql.(*Point).MarshalJSON()
      <autogenerated>:1 +0x4e
  encoding/json.addrMarshalerEncoder()
      /usr/lib/go-1.19/src/encoding/json/encode.go:495 +0x1af
  encoding/json.condAddrEncoder.encode()
      /usr/lib/go-1.19/src/encoding/json/encode.go:959 +0x94
  encoding/json.condAddrEncoder.encode-fm()
      <autogenerated>:1 +0xa4
  encoding/json.arrayEncoder.encode()
      /usr/lib/go-1.19/src/encoding/json/encode.go:915 +0x10e
  encoding/json.arrayEncoder.encode-fm()
      <autogenerated>:1 +0x90
  encoding/json.sliceEncoder.encode()

```

Should fix thanos-io#5501.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
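A minimal sketch of the pattern the commit message describes - the query function hands back a release callback, and the HTTP handler only invokes it after the response has been encoded (hypothetical names, not the actual Thanos code):

```go
package main

import (
	"encoding/json"
	"net/http"
)

type result struct {
	Values []float64 `json:"values"`
}

// runQuery stands in for the instrumented query/queryRange call: instead of
// closing the query internally, it hands the cleanup back to the caller.
func runQuery() (result, func()) {
	res := result{Values: []float64{0, 1, 1}}
	release := func() {
		// In Thanos this is where query.Close() would run, returning
		// sample buffers to a pool for reuse.
	}
	return res, release
}

func handler(w http.ResponseWriter, _ *http.Request) {
	res, release := runQuery()
	// Encode first, release afterwards: releasing the buffers while the JSON
	// encoder is still reading them is the race the commit fixes.
	_ = json.NewEncoder(w).Encode(res)
	release()
}

func main() {
	http.HandleFunc("/api/v1/query_range", handler)
	_ = http.ListenAndServe(":8080", nil)
}
```

Releasing too early lets a concurrent query's evaluator reuse those buffers while they are still being marshalled, which would explain responses containing samples that belong to other panels.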
GiedriusS added a commit that referenced this issue Aug 10, 2022
* api: fix race between Respond() and query/queryRange

* CHANGELOG: add item

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
@GiedriusS GiedriusS reopened this Aug 10, 2022
@GiedriusS
Member

Could you please try out main-2022-08-10-d00a713a or later and see if it helps?
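If anyone wants to test it quickly: assuming the per-commit images are published to the usual quay.io repository (adjust the registry and tag to whatever your deployment uses), pulling the image looks like:

```bash
docker pull quay.io/thanos/thanos:main-2022-08-10-d00a713a
```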

GiedriusS added a commit to vinted/thanos that referenced this issue Aug 11, 2022
@GiedriusS
Member

Can't reproduce this anymore with d00a713, will close this issue in a few days if nothing comes up.

@hanjm
Member

hanjm commented Aug 19, 2022

@yeya24 could you help release 0.27.1 for this important bug fix 😄

@yeya24
Contributor

yeya24 commented Aug 19, 2022

> @yeya24 could you help release 0.27.1 for this important bug fix 😄

cc @matej-g and @wiardvanrij

@Antiarchitect
Contributor

Having the same troubles. Please issue a patch release :)

@wiardvanrij
Member

Still catching up, but https://github.com/thanos-io/thanos/releases/tag/v0.28.0 would be the fastest way forward to get unblocked. I'll discuss whether someone can make a patch release.
