Need clarification on Thanos Query deduplication behaviour #7128
Please clarify how deduplication works in Thanos Query with the latest changes after the v0.32.3 release. Do I need to set `--query.replica-label=replica` in Thanos Query when the `replica` global external label in Prometheus is set to a per-instance value (`prometheus-0`, `prometheus-1`)?

If I instead use a `write_relabel_configs` rule (before remote write) to rewrite `replica` to one common value, and then deduplicate in Query on that same label, I see the following in the Receiver logs for queries over the latest 5 or 15 minutes:

```
matchers:<name:"__name__" value:"kube_configmap_created" > aggregates:COUNT aggregates:SUM without_replica_labels:"replica"
msg="Series: started fanout streams" status="store LabelSets: {receive="true", receive_replica="thanos-receive-0", ...
```

In other words, the global `replica` external label ends up with the same value (from both Prometheus replicas during remote write) in the Thanos Receiver WAL, and when I query with dedup enabled (`--query.replica-label=replica`), the series returned from the Receiver carry `without_replica_labels:"replica"`, i.e. the label is stripped. Is this acceptable behaviour? I need to understand how deduplication works in Query when two sets of metrics from Prometheus are written to Thanos and later queried through the Query component.
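For reference, the per-instance setup the first option describes would look roughly like this in the Prometheus configuration (a sketch; the `cluster` label and its values are illustrative, not from my actual setup):

```yaml
# Prometheus global external labels, set differently on each replica:
# prometheus-0 gets replica="prometheus-0", prometheus-1 gets replica="prometheus-1".
global:
  external_labels:
    cluster: eks-us-east-1    # illustrative cluster-identifying label
    replica: prometheus-0     # unique per Prometheus replica
```

With per-instance values like these, `--query.replica-label=replica` tells Thanos Query to treat series that differ only in the `replica` label as duplicates of one another.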
Hi,
Has the deduplication behaviour of Thanos Query changed since the v0.32.3 release due to the fix in #6697? We see a lot of metric gaps in our Grafana graphs when we query metric data over the last 3, 5, or 12 hours.
Setup
We have multiple EKS clusters spread across the us-east-1 and us-west-2 regions, where Prometheus (deployed by prometheus-operator as a StatefulSet with 2 replicas) remote-writes the metrics of each EKS cluster to our Thanos Receiver (a StatefulSet with multiple replicas). Each Prometheus instance has a `replica` external label pointing to that instance, e.g. `prometheus-0` and `prometheus-1`.
As metrics from the various EKS clusters are written to the Thanos Receiver WAL, we set a unique group-specific label on each metric (via the `write_relabel_configs` of the Prometheus remote-write), and a `receive_replica` label (specific to each Receiver instance) is added before the blocks are uploaded to the backend S3 bucket.
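A minimal sketch of the remote-write side described above (the endpoint URL, label name, and value are assumptions for illustration):

```yaml
remote_write:
  - url: https://thanos-receive.example.com/api/v1/receive   # assumed Receiver endpoint
    write_relabel_configs:
      # Attach a group-specific label to every series before remote write.
      - target_label: group                # hypothetical label name
        replacement: eks-us-east-1-group   # illustrative value
```

The `receive_replica` label is then attached on the Receiver side (via its external-label configuration), not by Prometheus.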
Thanos Query connects to the Receiver instances and to a sharded Store API, and is used as the data source for Grafana. In Thanos Query we currently pass `--query.replica-label` separately for the `replica` and `receive_replica` labels to deduplicate metric queries.
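The dedup configuration described above would look roughly like this on the Query container (the flag values follow the label names used above; the endpoints are placeholders):

```yaml
# Thanos Query container args (sketch; endpoint addresses are illustrative)
args:
  - query
  - --endpoint=thanos-receive-headless.monitoring.svc:10901   # assumed Receiver gRPC endpoint
  - --endpoint=thanos-store.monitoring.svc:10901              # assumed Store gRPC endpoint
  # Strip both replica labels during deduplication:
  - --query.replica-label=replica
  - --query.replica-label=receive_replica
```

Passing `--query.replica-label` once per label is how Thanos Query is told to ignore both labels when merging series.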
Queries over the most recent 5, 10, 15, or 30 minutes work fine; that data is served by the Receiver, which retains the latest metrics for up to 4 hours before pushing them to S3. The issue appears when we query longer ranges such as the last 3, 6, or 10 hours: gaps show up in the metric graphs, meaning the deduplicated set for a particular metric is dropped or not completely returned (by PromQL) to Grafana. There was no query timeout in Grafana. We also tested all Thanos components (Receiver, Query, and Store) with the latest v0.34.0 release. Note that if we disable dedup (i.e. remove both `replica` and `receive_replica` from `--query.replica-label` on Thanos Query), all the missing metrics for those time ranges become visible in the Grafana graph.
So, we need clarification on how this deduplication works with the latest fix for #6697, since I noticed `without_replica_labels:` set with the labels above (the global `replica` and `receive_replica`) in the queries returned by the Receiver and Store when dedup is enabled in Thanos Query.