query: Error with "dnssrv+" service discovery with intermediate CNAME #5679

Closed
Tassatux opened this issue Sep 7, 2022 · 7 comments · Fixed by #5716

Comments

@Tassatux

Tassatux commented Sep 7, 2022

Thanos, Prometheus and Golang version used:
thanos:v0.28.0

What happened:

When Thanos query is deployed in a Kubernetes cluster where CoreDNS is configured to avoid superfluous DNS requests (with autopath @kubernetes and pods verified), the DNS response may contain a CNAME, which results in an error in storeAPIs address resolution.

This happens when using a "relative" DNS name of the form service.namespace, without the full cluster DNS domain.

The StoreAPI endpoint is properly discovered when Thanos query starts, but a few seconds later the resolution fails and the endpoint is removed.

It looks like this is because only SRV-type records in the response are handled in https://github.com/thanos-io/thanos/blob/v0.28.0/pkg/discovery/dns/miekgdns/resolver.go#L37

What you expected to happen:

The CNAME should be followed to get the real SRV value.
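
To illustrate the expected behaviour, here is a minimal sketch of following a CNAME within a single answer section until SRV records are found. It is not the Thanos resolver code and does not use the miekg/dns types: the `record` struct, `srvFromAnswer`, `canonical` and the hop limit of 8 are all hypothetical names and values chosen for this example. The sample records mirror the response from the error log.

```go
package main

import (
	"fmt"
	"strings"
)

// record is a hypothetical, simplified view of one DNS answer record.
// It is NOT the miekg/dns type used by Thanos.
type record struct {
	Name   string // owner name of the record
	Type   string // "CNAME" or "SRV"
	Target string // CNAME target, or SRV target host
	Port   uint16 // SRV only
}

// canonical lowercases a name and ensures exactly one trailing dot,
// so that names can be compared as plain strings.
func canonical(s string) string {
	return strings.ToLower(strings.TrimSuffix(s, ".")) + "."
}

// srvFromAnswer returns the SRV records reachable from qname, following
// CNAME records found in the same answer section. The hop limit (an
// assumption of this sketch) guards against CNAME loops.
func srvFromAnswer(qname string, answer []record) ([]record, error) {
	const maxHops = 8
	name := canonical(qname)
	for hop := 0; hop <= maxHops; hop++ {
		var srvs []record
		next := ""
		for _, r := range answer {
			if canonical(r.Name) != name {
				continue
			}
			switch r.Type {
			case "SRV":
				srvs = append(srvs, r)
			case "CNAME":
				next = canonical(r.Target)
			}
		}
		if len(srvs) > 0 {
			return srvs, nil
		}
		if next == "" {
			return nil, fmt.Errorf("no SRV record for %q", name)
		}
		name = next // follow the alias and scan the answer again
	}
	return nil, fmt.Errorf("CNAME chain for %q exceeds %d hops", qname, maxHops)
}

func main() {
	// Answer section as returned with autopath enabled.
	answer := []record{
		{Name: "_grpc._tcp.query-test.test.test.svc.cluster.local.", Type: "CNAME",
			Target: "_grpc._tcp.query-test.test.svc.cluster.local."},
		{Name: "_grpc._tcp.query-test.test.svc.cluster.local.", Type: "SRV",
			Target: "query-test.test.svc.cluster.local.", Port: 10901},
	}
	srvs, err := srvFromAnswer("_grpc._tcp.query-test.test.test.svc.cluster.local.", answer)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range srvs {
		fmt.Printf("%s:%d\n", s.Target, s.Port)
	}
}
```

The point of the sketch is only that a CNAME in the answer is treated as an alias to follow rather than as an invalid SRV record; a real fix would have to work with the resolver's actual record types.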

How to reproduce it (as minimally and precisely as possible):

Update the Kubernetes CoreDNS configuration to include both autopath @kubernetes and pods verified:

kubernetes cluster.local in-addr.arpa ip6.arpa {
  pods verified
  fallthrough in-addr.arpa ip6.arpa
}
autopath @kubernetes

Restart the CoreDNS pods, then deploy the following manifests:

---
apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: query-test
  name: query-test
  namespace: test
spec:
  ports:
  - name: grpc
    port: 10901
  selector:
    app: query-test
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: query-main
  namespace: test
spec:
  selector:
    matchLabels:
      app: query-main
  template:
    metadata:
      labels:
        app: query-main
    spec:
      containers:
      - name: query
        args:
        - query
        - --store=dnssrv+_grpc._tcp.query-test.test
        image: quay.io/thanos/thanos:v0.28.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: query-test
  namespace: test
spec:
  selector:
    matchLabels:
      app: query-test
  template:
    metadata:
      labels:
        app: query-test
    spec:
      containers:
      - args:
        - query
        - --grpc-address=0.0.0.0:10901
        image: quay.io/thanos/thanos:v0.28.0
        name: query
        ports:
        - containerPort: 10901
          name: grpc

Full logs to relevant components:

Query logs:

level=info ts=2022-09-07T12:39:27.90183412Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-09-07T12:39:27.903174142Z caller=query.go:724 msg="starting query node"
level=info ts=2022-09-07T12:39:27.907765004Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-09-07T12:39:27.908028331Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2022-09-07T12:39:27.908649481Z caller=tls_config.go:195 service=http/server component=query msg="TLS is disabled." http2=false
level=info ts=2022-09-07T12:39:27.911832627Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2022-09-07T12:39:27.912201467Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=info ts=2022-09-07T12:39:32.920763992Z caller=endpointset.go:381 component=endpointset msg="adding new query with [storeAPI rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]" address=10.43.60.110:10901 extLset=
level=error ts=2022-09-07T12:39:57.914719788Z caller=query.go:555 msg="failed to resolve addresses for storeAPIs" err="lookup SRV records \"_grpc._tcp.query-test.test\": invalid SRV response record _grpc._tcp.query-test.test.test.svc.cluster.local.\t5\tIN\tCNAME\t_grpc._tcp.query-test.test.svc.cluster.local."

Anything else we need to know:

Without autopath @kubernetes and pods verified:

root@debug-pod:/# grep search /etc/resolv.conf
search test.svc.cluster.local svc.cluster.local cluster.local
root@debug-pod:/# dig +search +noall +answer +additional SRV _grpc._tcp.query-test.test
_grpc._tcp.query-test.test.svc.cluster.local. 5	IN SRV 0 100 10901 query-test.test.svc.cluster.local.
query-test.test.svc.cluster.local. 5 IN	A	10.43.60.110

tcpdump:

13:00:42.364131 eth0  Out ifindex 3 86:5d:56:72:50:08 ethertype IPv4 (0x0800), length 141: 10.42.2.20.40814 > 10.43.0.10.53: 53282+ [1au] SRV? _grpc._tcp.query-test.test.default.svc.cluster.local. (93)
13:00:42.368655 eth0  In  ifindex 3 ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 234: 10.43.0.10.53 > 10.42.2.20.40814: 53282 NXDomain*- 0/1/1 (186)
13:00:42.369493 eth0  Out ifindex 3 86:5d:56:72:50:08 ethertype IPv4 (0x0800), length 133: 10.42.2.20.41211 > 10.43.0.10.53: 25643+ [1au] SRV? _grpc._tcp.query-test.test.svc.cluster.local. (85)
13:00:42.371152 eth0  In  ifindex 3 ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 279: 10.43.0.10.53 > 10.42.2.20.41211: 25643*- 1/0/2 SRV query-test.test.svc.cluster.local.:10901 0 100 (231)

With autopath @kubernetes and pods verified:

root@debug-pod:/# dig +search +noall +answer +additional SRV _grpc._tcp.query-test.test
_grpc._tcp.query-test.test.test.svc.cluster.local. 5 IN	CNAME _grpc._tcp.query-test.test.svc.cluster.local.
_grpc._tcp.query-test.test.svc.cluster.local. 5	IN SRV 0 100 10901 query-test.test.svc.cluster.local.
query-test.test.svc.cluster.local. 5 IN	A	10.43.60.110
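
For context, the owner name of the CNAME above is the first search-list candidate for the relative name. The following is a minimal sketch of how a stub resolver expands a relative name with a search list. The `candidates` function is a hypothetical helper written for this example, and `ndots` is assumed to be 5 (the usual Kubernetes pod default); only the `search` line of /etc/resolv.conf is shown in this report.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates lists, in order, the fully qualified names a stub resolver
// would try for name, given a search list and an ndots threshold.
// Hypothetical helper for illustration only.
func candidates(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // already fully qualified, no expansion
	}
	var expanded []string
	for _, domain := range search {
		expanded = append(expanded, name+"."+domain+".")
	}
	asIs := name + "."
	if strings.Count(name, ".") >= ndots {
		// Enough dots: try the name as given first, then the search list.
		return append([]string{asIs}, expanded...)
	}
	// Too few dots: try the search list first, then the name as given.
	return append(expanded, asIs)
}

func main() {
	// Search list taken from the resolv.conf output above.
	search := []string{"test.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	// ndots=5 is an assumption; the report does not show the options line.
	for _, c := range candidates("_grpc._tcp.query-test.test", search, 5) {
		fmt.Println(c)
	}
}
```

Without autopath, the client walks this list itself until one candidate resolves. With autopath, CoreDNS performs the search server-side and answers the first candidate with a CNAME pointing at the name that actually resolved, which is why the CNAME shows up in the SRV response.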

tcpdump:

13:01:33.400013 eth0  Out ifindex 3 86:5d:56:72:50:08 ethertype IPv4 (0x0800), length 141: 10.42.2.20.58526 > 10.43.0.10.53: 41154+ [1au] SRV? _grpc._tcp.query-test.test.default.svc.cluster.local. (93)
13:01:33.404428 eth0  In  ifindex 3 ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 397: 10.43.0.10.53 > 10.42.2.20.58526: 41154*- 2/0/2 CNAME _grpc._tcp.query-test.test.svc.cluster.local., SRV query-test.test.svc.cluster.local.:10901 0 100 (349)

I saw that a way to follow CNAMEs was previously added to the LookupIPAddr function via #5271.
Could we get the same for SRV lookups?

Thanks.

@matej-g
Collaborator

matej-g commented Sep 8, 2022

Thanks for the detailed report @Tassatux! This sounds like a reasonable request to me. Are you happy to take this or shall I open this for others to work on? 🙂

@Tassatux
Author

Tassatux commented Sep 8, 2022

I'm not sure how to properly fix it, so I prefer that someone with more Go knowledge take a look. 🙂

@h20220025

Can I work on this one??

@matej-g
Collaborator

matej-g commented Sep 9, 2022

Sure @h20220025 go for it 🚀

@bwplotka
Member

bwplotka commented Sep 14, 2022

Are you still on it @h20220025 ?

@Atharva-Shinde
Contributor

Hello👋 I'd like to work on this issue :)

@matej-g
Collaborator

matej-g commented Sep 16, 2022

Hey @Atharva-Shinde I'd say go for it 🚀, maybe @h20220025 didn't get a chance to pick this up after all 🙃
