
Node marking other nodes as Non-Ready after some time (~60s) #262

Open
MockyJoke opened this issue Jan 5, 2022 · 1 comment

MockyJoke commented Jan 5, 2022

I have a kubeadm k8s cluster with 5 nodes. I am able to perform kubectl logs on 4 out of the 5 nodes; it does not work on the remaining one.
On that problematic node, sudo wg returns the following:

user@ubuntuserver3:~$ sudo wg
interface: kilo0
  public key: <somekey>
  private key: (hidden)
  listening port: <somePort>

Which is strange, as on the other nodes I'm able to see a list of peers. At first I thought this might be due to the host itself, so I wiped the VM and reinstalled Ubuntu 20.04 on it. However, the issue remains.
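For comparison, on the healthy nodes the output also lists a peer section per remote node, roughly like the following (all keys, endpoints, and allowed IPs below are redacted placeholders, not my real values):

user@<healthy-node>:~$ sudo wg
interface: kilo0
  public key: <somekey>
  private key: (hidden)
  listening port: <somePort>

peer: <peer-public-key>
  endpoint: <peer-ip>:<somePort>
  allowed ips: <pod-cidr>, <node-ip>/32
  latest handshake: <recent>
  transfer: <rx> received, <tx> sent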

I took a deeper look into the issue and found the following:

  1. Running sudo wg within 60s after the kilo Pod is restarted actually returns some peers, but they are all gone after ~60s.
  2. I then tried to debug the source code and found:
  • On the problematic node, the Ready() function is returning false for the other, normal hosts.
  • The function returns false because this check fails: time.Now().Unix()-n.LastSeen < int64(checkInPeriod)*2/int64(time.Second)
    • My understanding is that this checks whether the node has been seen in the past 60s (the default value of checkInPeriod * 2); see the sketch after this list.
  3. I added some extra logging lines on the problematic host, and I was surprised to find that the LastSeen values for the other nodes are all ~60-80s before the current UTC time, so those nodes are treated as non-ready and are not added as peers to the wg conf.
  4. I tried to find out why they were all last seen 60-80s ago, and found that this cache-update-related logic is set to 5 minutes, though I'm not sure whether that is related to my issue.
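
To make the check concrete, here is a minimal, self-contained sketch of the condition as I read it. The constant value (30s default) and the names are my paraphrase of the Kilo source, not the actual code:

package main

import (
	"fmt"
	"time"
)

// checkInPeriod is the default check-in period as I understand it (30s),
// so 2*checkInPeriod gives the ~60s window I keep hitting.
const checkInPeriod = 30 * time.Second

type node struct {
	LastSeen int64 // Unix seconds of the node's last check-in
}

// ready mirrors the quoted condition: the node counts as ready only if it
// checked in within the last 2*checkInPeriod seconds.
func (n *node) ready() bool {
	return time.Now().Unix()-n.LastSeen < int64(checkInPeriod)*2/int64(time.Second)
}

func main() {
	fresh := &node{LastSeen: time.Now().Add(-10 * time.Second).Unix()}
	stale := &node{LastSeen: time.Now().Add(-70 * time.Second).Unix()}
	fmt.Println(fresh.ready(), stale.ready()) // true false
}

So a node whose LastSeen is 60-80s old, as in my case, fails this check.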

But I have probably missed something, as this issue is not observed on the other 4 of my 5 nodes.
Could you kindly point out whether, besides the cache-update-related logic I found, there is any other logic that also updates the LastSeen value for the nodes?

Thank you very much, and this is indeed a great project!

Regards,
MockyJoke

My kilo manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kilo
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kilo
data:
  cni-conf.json: |
    {
       "cniVersion":"0.3.1",
       "name":"kilo",
       "plugins":[
          {
             "name":"kubernetes",
             "type":"bridge",
             "bridge":"kube-bridge",
             "isDefaultGateway":true,
             "forceAddress":true,
             "mtu": 1420,
             "ipam":{
                "type":"host-local"
             }
          },
          {
             "type":"portmap",
             "snat":true,
             "capabilities":{
                "portMappings":true
             }
          }
       ]
    }
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kilo
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kilo
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - patch
  - watch
- apiGroups:
  - kilo.squat.ai
  resources:
  - peers
  verbs:
  - list
  - watch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kilo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kilo
subjects:
  - kind: ServiceAccount
    name: kilo
    namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kilo
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kilo
    app.kubernetes.io/part-of: kilo
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kilo
      app.kubernetes.io/part-of: kilo
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kilo
        app.kubernetes.io/part-of: kilo
    spec:
      serviceAccountName: kilo
      hostNetwork: true
      containers:
      - name: kilo
        image: squat/kilo
        args:
        - --kubeconfig=/etc/kubernetes/kubeconfig
        - --hostname=$(NODE_NAME)
        - --mesh-granularity=full
        - --subnet=10.100.0.0/16
        - --port=<someport>
        - --iptables-forward-rules=true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        ports:
        - containerPort: 1107
          name: metrics
        securityContext:
          privileged: true
        volumeMounts:
        - name: cni-conf-dir
          mountPath: /etc/cni/net.d
        - name: kilo-dir
          mountPath: /var/lib/kilo
        - name: kubeconfig
          mountPath: /etc/kubernetes
          readOnly: true
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
        - name: xtables-lock
          mountPath: /run/xtables.lock
          readOnly: false
      initContainers:
      - name: install-cni
        image: squat/kilo
        command:
        - /bin/sh
        - -c
        - set -e -x;
          cp /opt/cni/bin/* /host/opt/cni/bin/;
          TMP_CONF="$CNI_CONF_NAME".tmp;
          echo "$CNI_NETWORK_CONFIG" > $TMP_CONF;
          rm -f /host/etc/cni/net.d/*; 
          mv $TMP_CONF /host/etc/cni/net.d/$CNI_CONF_NAME
        env:
        - name: CNI_CONF_NAME
          value: 10-kilo.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              name: kilo
              key: cni-conf.json
        volumeMounts:
        - name: cni-bin-dir
          mountPath: /host/opt/cni/bin
        - name: cni-conf-dir
          mountPath: /host/etc/cni/net.d
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - name: cni-bin-dir
        hostPath:
          path: /opt/cni/bin
      - name: cni-conf-dir
        hostPath:
          path: /etc/cni/net.d
      - name: kilo-dir
        hostPath:
          path: /var/lib/kilo
      - name: kubeconfig
        configMap:
          name: kube-proxy
          items:
          - key: kubeconfig.conf
            path: kubeconfig
      - name: lib-modules
        hostPath:
          path: /lib/modules
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate


DeamonLuck commented Jan 7, 2022

I found out that the delay was caused by an incorrect DNS setting on the problematic host.
It's probably not related to Kilo.
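
In case it helps anyone else hitting this, here is a rough, standalone Go sketch of how one could check whether host DNS resolution is slow (this is my own diagnostic snippet, not part of Kilo; the hostname is a placeholder):

package main

import (
	"fmt"
	"net"
	"time"
)

// Time a single host-name lookup using the host's resolver configuration.
// Substitute a name the node actually needs to resolve, e.g. the API
// server's hostname.
func main() {
	start := time.Now()
	addrs, err := net.LookupHost("example.com") // placeholder hostname
	fmt.Printf("lookup took %v, addrs=%v, err=%v\n", time.Since(start), addrs, err)
}

A lookup that takes many seconds (or times out) would be consistent with the delayed LastSeen values described above.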

Please feel free to close this issue. Thank you!
