apiserver "Failed to process a Pod event", "failed with error error trying to connect; dns error: failed to lookup address information: Name does not resolve" #332

Closed
karmingc opened this issue Nov 1, 2022 · 18 comments

Comments

karmingc commented Nov 1, 2022

Hello, I recently tried to deploy v0.2.2 to our cluster, but I'm seeing repeated error logs in the apiserver pods.

Version: v0.2.2
Deployed with ArgoCD using /yamlgen/deploy/bottlerocket-update-operator.yaml.
node:

...
  OS Image:                   Bottlerocket OS 1.10.1 (aws-k8s-1.22)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.8+bottlerocket
  Kubelet Version:            v1.22.15-eks-57e9f9a
  Kube-Proxy Version:         v1.22.15-eks-57e9f9a
...
{"v":0,"name":"apiserver","msg":"Failed to process a Pod event","level":50,"hostname":"brupop-apiserver-7cb6bd59bf-drx5h","pid":1,"time":"2022-11-01T19:54:32.219503796+00:00","target":"apiserver::api","line":120,"file":"apiserver/src/api/mod.rs","err":"failed to perform initial object list: HyperError: error trying to connect: dns error: failed to lookup address information: Name does not resolve"}
{"v":0,"name":"apiserver","msg":"failed with error error trying to connect: dns error: failed to lookup address information: Name does not resolve","level":50,"hostname":"brupop-apiserver-7cb6bd59bf-drx5h","pid":1,"time":"2022-11-01T19:54:32.219979544+00:00","target":"kube_client::client::builder","line":164,"file":"/src/.cargo/registry/src/github.com-1ecc6299db9ec823/kube-client-0.71.0/src/client/builder.rs"}

I'm a bit unsure about the intricacies of how the apiserver works, but I would appreciate any help on this.

jpmcb (Contributor) commented Nov 2, 2022

The API server is attempting to list agent pods in the update operator namespace where the labels look like:

brupop.bottlerocket.aws/component=agent 

Can you give an overview of what's in the brupop-bottlerocket-aws namespace? Maybe the agent pods don't have that label

kubectl get all -n brupop-bottlerocket-aws

And to see what labels are under one of the agent pods:

kubectl describe -n brupop-bottlerocket-aws pods/brupop-agent-{ID}
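Or, equivalently, list just the agent pods by that label:

kubectl get pods -n brupop-bottlerocket-aws -l brupop.bottlerocket.aws/component=agent --show-labels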

karmingc (Author) commented Nov 2, 2022

hi @jpmcb,

$ kubectl get all                           
NAME                                                READY   STATUS    RESTARTS   AGE
pod/brupop-agent-lr5m6                              1/1     Running   0          112s
pod/brupop-apiserver-7cb6bd59bf-5fzx2               1/1     Running   0          112s
pod/brupop-apiserver-7cb6bd59bf-8rd59               1/1     Running   0          112s
pod/brupop-apiserver-7cb6bd59bf-pndh9               1/1     Running   0          112s
pod/brupop-controller-deployment-6484476846-h8j8q   1/1     Running   0          111s

NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/brupop-apiserver           ClusterIP   172.20.10.204   <none>        443/TCP   112s
service/brupop-controller-server   ClusterIP   172.20.107.67   <none>        80/TCP    112s

NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/brupop-agent   1         1         1       1            1           <none>          112s

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/brupop-apiserver               3/3     3            3           112s
deployment.apps/brupop-controller-deployment   1/1     1            1           112s

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/brupop-apiserver-7cb6bd59bf               3         3         3       112s
replicaset.apps/brupop-controller-deployment-6484476846   1         1         1       112s
$ kubectl describe pod/brupop-agent-lr5m6               
Name:             brupop-agent-lr5m6
Namespace:        brupop-bottlerocket-aws
Priority:         0
Service Account:  brupop-agent-service-account
Node:             ip-10-20-112-166.ec2.internal/10.20.112.166
Start Time:       Wed, 02 Nov 2022 12:13:29 -0400
Labels:           brupop.bottlerocket.aws/component=agent
                  controller-revision-hash=76489c4794
                  pod-template-generation=1
Annotations:      kubernetes.io/psp: eks.privileged
Status:           Running
IP:               10.20.125.225
IPs:
  IP:           10.20.125.225
Controlled By:  DaemonSet/brupop-agent
Containers:
  brupop:
    Container ID:  containerd://28b7ec0e8e6b439bc32e6e74fc2d170243d0595622c5c6df1895aa865322320e
    Image:         public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.2
    Image ID:      public.ecr.aws/bottlerocket/bottlerocket-update-operator@sha256:a6de31e1b3553e0c5b5401ec7f7cc435c150481f5c4827c061e523106b9748c0
    Port:          <none>
    Host Port:     <none>
    Command:
      ./agent
    State:          Running
      Started:      Wed, 02 Nov 2022 12:13:31 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  50Mi
    Requests:
      cpu:     10m
      memory:  50Mi
    Environment:
      MY_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /bin/apiclient from bottlerocket-apiclient (rw)
      /etc/brupop-tls-keys from bottlerocket-tls-keys (rw)
      /run/api.sock from bottlerocket-api-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4k5wf (ro)
      /var/run/secrets/tokens/ from bottlerocket-agent-service-account-token (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  bottlerocket-api-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/api.sock
    HostPathType:  Socket
  bottlerocket-apiclient:
    Type:          HostPath (bare host directory volume)
    Path:          /bin/apiclient
    HostPathType:  File
  bottlerocket-agent-service-account-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3600
  bottlerocket-tls-keys:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  brupop-tls
    Optional:    false
  kube-api-access-4k5wf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    3m14s  default-scheduler  Successfully assigned brupop-bottlerocket-aws/brupop-agent-lr5m6 to ip-10-20-112-166.ec2.internal
  Warning  FailedMount  3m14s  kubelet            MountVolume.SetUp failed for volume "bottlerocket-tls-keys" : secret "brupop-tls" not found
  Normal   Pulled       3m13s  kubelet            Container image "public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.2" already present on machine
  Normal   Created      3m13s  kubelet            Created container brupop
  Normal   Started      3m12s  kubelet            Started container brupop

gthao313 (Member) commented Nov 2, 2022

Hi @karmingc, I'm not sure if this is what's causing your issue, but I noticed that you only have one node (one agent) running on your EKS cluster. Unfortunately, brupop only supports clusters that have more than 2 or 3 nodes. (We will add more documentation on this.)

Meanwhile, can you provide more details on the behavior of the apiserver? Were the apiserver and controller pods stuck in Pending like this? If so, I think that might be the reason I mentioned above; otherwise, I'll investigate further. Thanks!

NAME                                           READY   STATUS    RESTARTS   AGE
brupop-agent-fdtbz                             1/1     Running   1          3m59s
brupop-apiserver-74b67c9d99-jchpc              0/1     Pending   0          75s
brupop-apiserver-74b67c9d99-knsqs              0/1     Pending   0          75s
brupop-apiserver-74b67c9d99-xcsgj              0/1     Pending   0          75s
brupop-controller-deployment-79576f4fb-gltj6   0/1     Pending   0          75s

karmingc (Author) commented Nov 2, 2022

@gthao313 It was done manually, but I do have more nodes on my cluster...

I just tried labelling other nodes, testing with 2 and 3 nodes with the label bottlerocket.aws/updater-interface-version=2.0.0, but I'm still seeing a similar DNS lookup issue.

$ kubectl get all                                                                                  
NAME                                                READY   STATUS    RESTARTS   AGE
pod/brupop-agent-9qnpt                              1/1     Running   0          4m2s
pod/brupop-agent-ljjkv                              1/1     Running   0          4m2s
pod/brupop-agent-nwqhs                              1/1     Running   0          24s
pod/brupop-apiserver-7cb6bd59bf-bmzsg               1/1     Running   0          4m2s
pod/brupop-apiserver-7cb6bd59bf-h5vxf               1/1     Running   0          4m2s
pod/brupop-apiserver-7cb6bd59bf-ncn25               1/1     Running   0          4m2s
pod/brupop-controller-deployment-6484476846-hgbpg   1/1     Running   0          4m1s

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/brupop-apiserver           ClusterIP   172.20.132.135   <none>        443/TCP   4m2s
service/brupop-controller-server   ClusterIP   172.20.190.242   <none>        80/TCP    4m2s

NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/brupop-agent   3         3         3       3            3           <none>          4m2s

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/brupop-apiserver               3/3     3            3           4m2s
deployment.apps/brupop-controller-deployment   1/1     1            1           4m2s

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/brupop-apiserver-7cb6bd59bf               3         3         3       4m3s
replicaset.apps/brupop-controller-deployment-6484476846   1         1         1       4m3s

logs from the apiserver pods

{"v":0,"name":"apiserver","msg":"failed with error error trying to connect: dns error: failed to lookup address information: Name does not resolve","level":50,"hostname":"brupop-apiserver-79bdb58bc6-p2zz6","pid":1,"time":"2022-11-02T17:26:51.736042453+00:00","target":"kube_client::client::builder","line":164,"file":"/src/.cargo/registry/src/github.com-1ecc6299db9ec823/kube-client-0.71.0/src/client/builder.rs"}
{"v":0,"name":"apiserver","msg":"Failed to process a Pod event","level":50,"hostname":"brupop-apiserver-79bdb58bc6-p2zz6","pid":1,"time":"2022-11-02T17:26:51.736064673+00:00","target":"apiserver::api","line":120,"file":"apiserver/src/api/mod.rs","err":"failed to perform initial object list: HyperError: error trying to connect: dns error: failed to lookup address information: Name does not resolve"}

gthao313 (Member) commented Nov 2, 2022

@karmingc Can you verify whether your pods consistently have this error? This seems incorrect to me, and it may be related to certificates:

Warning  FailedMount  3m14s  kubelet            MountVolume.SetUp failed for volume "bottlerocket-tls-keys" : secret "brupop-tls" not found

Did you install cert-manager before installing the update operator?
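(A quick way to check, assuming cert-manager lives in its usual cert-manager namespace:)

kubectl get pods -n cert-manager
kubectl get crd certificates.cert-manager.io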

karmingc (Author) commented Nov 2, 2022

Ah, we already have cert-manager installed in our cluster.

jpmcb (Contributor) commented Nov 2, 2022

You should have the brupop-tls secret (which is also used by the apiserver certificate):

So something like:

❯ kubectl get certificates -n brupop-bottlerocket-aws
NAME                           READY   SECRET       AGE
brupop-apiserver-certificate   True    brupop-tls   3h8m

❯ kubectl get secrets -n brupop-bottlerocket-aws
NAME                                            TYPE                                  DATA   AGE
brupop-agent-service-account-token-t26gs        kubernetes.io/service-account-token   3      3h10m
brupop-apiserver-service-account-token-jnsd9    kubernetes.io/service-account-token   3      3h10m
brupop-controller-service-account-token-skxnz   kubernetes.io/service-account-token   3      3h10m
brupop-tls                                      kubernetes.io/tls                     3      3h8m
default-token-cgfgk                             kubernetes.io/service-account-token   3      3h10m

Are you deploying through the default manifest found in the repository? I wonder if your deployment via Argo isn't including the cert-manager bits

karmingc (Author) commented Nov 2, 2022

Yes, we are deploying the manifest directly from here: https://github.com/bottlerocket-os/bottlerocket-update-operator/blob/develop/yamlgen/deploy/bottlerocket-update-operator.yaml.

It should be; at least, I'm seeing those resources in the brupop-bottlerocket-aws namespace.

$ kubectl get certificates
NAME                           READY   SECRET       AGE
brupop-apiserver-certificate   True    brupop-tls   10m

~ 
$ kubectl get secrets     
NAME                                            TYPE                                  DATA   AGE
brupop-agent-service-account-token-drwmx        kubernetes.io/service-account-token   3      10m
brupop-apiserver-service-account-token-h66gf    kubernetes.io/service-account-token   3      10m
brupop-controller-service-account-token-9c5cr   kubernetes.io/service-account-token   3      10m
brupop-tls                                      kubernetes.io/tls                     3      10m
default-token-pdj95                             kubernetes.io/service-account-token   3      10m

jpmcb (Contributor) commented Nov 2, 2022

Hmmm, curious! You might try deleting the entire brupop-bottlerocket-aws namespace (which should delete all the pods/resources) and reapplying the YAML. That way, when the agent pods start up again, they can get a proper mount. I'm not sure why mounting the brupop-tls secret would fail.
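Something like the following, from a checkout of the repository (the manifest path is the one mentioned above):

kubectl delete namespace brupop-bottlerocket-aws
kubectl apply -f yamlgen/deploy/bottlerocket-update-operator.yaml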

jpmcb modified the milestone: brupop 1.0.0 (Nov 2, 2022)
karmingc (Author) commented Nov 2, 2022

Hmmm, curious! You might try deleting the entire brupop-bottlerocket-aws namespace (which should delete all the pods/resources) and reapplying the YAML. That way, when the agent pods start up again, they can get a proper mount. I'm not sure why mounting the brupop-tls secret would fail.

Wouldn't that delete the secrets too?

Anyway, I tried that and the same error is shown...

I did, however, restart the apiserver deployment, and the mounting issue is no longer there:

$ kubectl describe pod/brupop-apiserver-c4f75879b-4qmw8
Name:             brupop-apiserver-c4f75879b-4qmw8
Namespace:        brupop-bottlerocket-aws
Priority:         0
Service Account:  brupop-apiserver-service-account
Node:             ip-10-20-127-68.ec2.internal/10.20.127.68
Start Time:       Wed, 02 Nov 2022 16:20:40 -0400
Labels:           brupop.bottlerocket.aws/component=apiserver
                  pod-template-hash=c4f75879b
Annotations:      kubectl.kubernetes.io/restartedAt: 2022-11-02T16:20:19-04:00
                  kubernetes.io/psp: eks.privileged
Status:           Running
IP:               10.20.125.102
IPs:
  IP:           10.20.125.102
Controlled By:  ReplicaSet/brupop-apiserver-c4f75879b
Containers:
  brupop:
    Container ID:  containerd://dcaac08a75b737780e2ea280008a8029fbd5638c761ef05866b1302742e738c5
    Image:         public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.2
    Image ID:      public.ecr.aws/bottlerocket/bottlerocket-update-operator@sha256:a6de31e1b3553e0c5b5401ec7f7cc435c150481f5c4827c061e523106b9748c0
    Port:          8443/TCP
    Host Port:     0/TCP
    Command:
      ./apiserver
    State:          Running
      Started:      Wed, 02 Nov 2022 16:20:41 -0400
    Ready:          True
    Restart Count:  0
    Liveness:       http-get https://:8443/ping delay=5s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get https://:8443/ping delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /etc/brupop-tls-keys from bottlerocket-tls-keys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2jh4w (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  bottlerocket-tls-keys:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  brupop-tls
    Optional:    false
  kube-api-access-2jh4w:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  60s   default-scheduler  Successfully assigned brupop-bottlerocket-aws/brupop-apiserver-c4f75879b-4qmw8 to ip-10-20-127-68.ec2.internal
  Normal  Pulled     59s   kubelet            Container image "public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.2" already present on machine
  Normal  Created    59s   kubelet            Created container brupop
  Normal  Started    59s   kubelet            Started container brupop

jpmcb (Contributor) commented Nov 4, 2022

You might try deploying the Kubernetes dnsutils pod into the brupop namespace to see if something is wrong with DNS resolution to/from the agent and the apiserver.
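The dnsutils manifest from the Kubernetes DNS-debugging docs is a convenient starting point (URL per those docs; adjust the namespace if you want the pod to land in the brupop namespace):

kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
kubectl exec -i -t dnsutils -- nslookup kubernetes.default.svc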

karmingc (Author) commented Nov 4, 2022

What would be the hostname to test for the agent?

I tried a couple and they didn't seem to error out so far.

❯ kubectl exec -i -t dnsutils -- nslookup brupop-apiserver.brupop-bottlerocket-aws
Server:		172.20.0.10
Address:	172.20.0.10#53

Name:	brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local
Address: 172.20.176.214


$ kubectl exec -i -t dnsutils -- nslookup brupop-controller-server.brupop-bottlerocket-aws
Server:		172.20.0.10
Address:	172.20.0.10#53

Name:	brupop-controller-server.brupop-bottlerocket-aws.svc.cluster.local
Address: 172.20.135.6


$ kubectl exec -ti dnsutils -- cat /etc/resolv.conf
search brupop-bottlerocket-aws.svc.cluster.local svc.cluster.local cluster.local ec2.internal
nameserver 172.20.0.10
options ndots:2

Edit: I forgot to provide the cluster IP for kube-dns:

$ kubectl get service kube-dns -n kube-system  

NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   172.20.0.10   <none>        53/UDP,53/TCP   486d

jpmcb (Contributor) commented Nov 4, 2022

DNS is working.

This is a problem where, for some reason, our Kubernetes client in the API server isn't able to reach the base Kubernetes API at kubernetes.default.svc in order to place an initial "watcher" on the agent pods.

What is the shape of your network? What CNI are you using? Are you using the node's host network?

I'm wondering if this is related to our usage of rustls, as mentioned here: kube-rs/kube#1071


Edit: are you able to upgrade to our 1.0.0 release? There were many small changes that went into it (and the logs look much better). We upgraded our Kubernetes client dependency in that release, so it would be interesting to see if this persists on the newest release.
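For context, the initial list/watch in question looks roughly like this (a minimal kube-rs sketch, not the actual brupop source; crate versions, features, and names are illustrative):

// Sketch of an apiserver-style watch on agent pods using kube-rs.
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::{Api, ListParams},
    runtime::watcher,
    Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // In-cluster config; per the logs above, the API server address is resolved
    // via the kubernetes.default.svc DNS name, which is the lookup that fails here.
    let client = Client::try_default().await?;

    // Watch agent pods by the component label in the operator namespace.
    let pods: Api<Pod> = Api::namespaced(client, "brupop-bottlerocket-aws");
    let lp = ListParams::default().labels("brupop.bottlerocket.aws/component=agent");

    // watcher() begins with an initial list ("failed to perform initial object
    // list" in the error above), then streams subsequent Pod events.
    watcher(pods, lp)
        .try_for_each(|event| async move {
            println!("{:?}", event);
            Ok(())
        })
        .await?;
    Ok(())
}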

karmingc (Author) commented Nov 7, 2022

That would indeed seem to be the case...

I added another container within the apiserver pods and used tcpdump to monitor calls to kubernetes.default.svc, and this is what we are seeing.
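(One way to get such a capture on clusters with ephemeral-container support is kubectl debug; this is only a sketch, since the exact approach used here isn't shown, and nicolaka/netshoot is just one image that ships tcpdump:)

kubectl debug -it -n brupop-bottlerocket-aws brupop-apiserver-ddcfb7d55-hj28q \
  --image=nicolaka/netshoot --target=brupop -- tcpdump -i any port 53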

using dig

15:20:30.587796 IP brupop-apiserver-ddcfb7d55-hj28q.49204 > kube-dns.kube-system.svc.cluster.local.53: 15341+ [1au] A? kubernetes.default.svc. (63)
15:20:30.588848 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.49204: 15341 NXDomain* 0/1/1 (138)
15:20:30.665433 IP brupop-apiserver-ddcfb7d55-hj28q.39373 > kube-dns.kube-system.svc.cluster.local.53: 55645+ PTR? 10.0.20.172.in-addr.arpa. (42)
15:20:30.666014 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.39373: 55645*- 1/0/0 PTR kube-dns.kube-system.svc.cluster.local. (118)

using curl

15:24:02.707393 IP brupop-apiserver-ddcfb7d55-hj28q.51918 > kube-dns.kube-system.svc.cluster.local.53: 9464+ A? kubernetes.default.svc.brupop-bottlerocket-aws.svc.cluster.local. (82)
15:24:02.707447 IP brupop-apiserver-ddcfb7d55-hj28q.51918 > kube-dns.kube-system.svc.cluster.local.53: 1779+ AAAA? kubernetes.default.svc.brupop-bottlerocket-aws.svc.cluster.local. (82)
15:24:02.708136 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.51918: 1779 NXDomain*- 0/1/0 (175)
15:24:02.708200 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.51918: 9464 NXDomain*- 0/1/0 (175)
15:24:02.708269 IP brupop-apiserver-ddcfb7d55-hj28q.53675 > kube-dns.kube-system.svc.cluster.local.53: 44054+ A? kubernetes.default.svc.svc.cluster.local. (58)
15:24:02.708296 IP brupop-apiserver-ddcfb7d55-hj28q.53675 > kube-dns.kube-system.svc.cluster.local.53: 64785+ AAAA? kubernetes.default.svc.svc.cluster.local. (58)
15:24:02.712202 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.53675: 44054 NXDomain*- 0/1/0 (151)
15:24:02.712204 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.53675: 64785 NXDomain*- 0/1/0 (151)
15:24:02.712355 IP brupop-apiserver-ddcfb7d55-hj28q.59823 > kube-dns.kube-system.svc.cluster.local.53: 50321+ A? kubernetes.default.svc.cluster.local. (54)
15:24:02.712387 IP brupop-apiserver-ddcfb7d55-hj28q.59823 > kube-dns.kube-system.svc.cluster.local.53: 12396+ AAAA? kubernetes.default.svc.cluster.local. (54)
15:24:02.713617 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.59823: 50321*- 1/0/0 A 172.20.0.1 (106)
15:24:02.713618 IP kube-dns.kube-system.svc.cluster.local.53 > brupop-apiserver-ddcfb7d55-hj28q.59823: 12396*- 0/1/0 (147)

This probably suggests that the current mechanism for reaching the base Kubernetes API does not go through the list of DNS search domains for hostname lookup, similar to what dig does.

I also updated to v1.0.0; the logs are clearer, but the problem persists.

Edit: this was done after updating ndots to 5 instead of the 2 shown in our earlier resolv.conf output.
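One way to read the two traces above, assuming a resolver that only applies the search list when the queried name has fewer dots than the ndots value (the names and answers below are taken from the tcpdump output):

# With ndots:2 (the value originally in resolv.conf), kubernetes.default.svc already
# has two dots, so such a resolver tries only the literal name and fails, like the dig trace:
kubernetes.default.svc.                                             -> NXDOMAIN
# With ndots:5, the name has fewer dots than ndots, so the search list is walked,
# like the curl trace, and the last candidate resolves:
kubernetes.default.svc.brupop-bottlerocket-aws.svc.cluster.local.   -> NXDOMAIN
kubernetes.default.svc.svc.cluster.local.                           -> NXDOMAIN
kubernetes.default.svc.cluster.local.                               -> A 172.20.0.1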

karmingc (Author) commented Nov 7, 2022

Possibly unrelated, but we are also noticing a significant increase in calls to CoreDNS with brupop. I'm not sure if this is due to aggressive retries, but it is somewhat worrisome as it throttles CoreDNS...

[Screenshot: CoreDNS query volume increase observed with brupop, 2022-11-07]

jpmcb (Contributor) commented Nov 17, 2022

What is the shape of your network? What CNI are you using? Are you using the node's host-network?

@karmingc are you using the hostNetwork?

hostNetwork: true

I found something possibly similar where, if you're using hostNetwork: true, you'll also want to set the DNS policy:

hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
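For placement, those fields would sit in the apiserver Deployment's pod spec, roughly like this (a sketch, not the shipped brupop manifest):

spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: brupop
          # ...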

kube-rs/kube#953 (comment)


If you're not using hostNetwork, can you tell me what your network looks like? What CNI are you using? Any firewall restrictions? Any special configuration of kube-dns?

duboisf commented Mar 31, 2023

Turns out the problem was our fault. We have a webhook mutator that injects an ndots value of 2:

  dnsConfig:
    options:
    - name: ndots
      value: "2"

I was sure I had checked that we had disabled it for this operator when we saw the DNS issues, but nope, it was being injected 🤦

Once we disabled the webhook mutator for this operator, our DNS issues were resolved.
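(For anyone hitting the same thing, one way to check whether such a dnsConfig was injected into the brupop pods; the label comes from the pod descriptions earlier in this thread:)

kubectl get pods -n brupop-bottlerocket-aws \
  -l brupop.bottlerocket.aws/component=apiserver \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.dnsConfig}{"\n"}{end}'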

Sorry for the trouble folks 😅

Oh and thanks for this project, it's simplifying our maintenance burden!

stmcginnis (Contributor) commented:

Awesome news! Thanks for reporting back with the details. That's very useful.

Sounds like this issue can be closed. :)
