
Selector-less service with secondary IPs not working properly on Rocky 8/9 with latest kube-proxy #124587

Closed
fengye87 opened this issue Apr 28, 2024 · 30 comments
Labels
kind/support Categorizes issue or PR as a support question. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@fengye87

What happened?

I have a 4-node (1 control-plane and 3 workers) cluster setup. Each node has two NICs: one for the Pod network (inside 192.168.16.0/20) and the other for storage (inside 10.87.87.0/24).

I have a headless service whose endpoints an operator dynamically updates with storage-NIC IPs. I cannot reliably access this headless service from the nodes (e.g. nc -vz <SERVICE-IP> <PORT>), although I can access other services from the nodes without problems.

What did you expect to happen?

I should be able to access this headless service backed by secondary IPs from any node, just like other services.

How can we reproduce it (as minimally and precisely as possible)?

  1. Set up a cluster with at least 2 Rocky 8/9 nodes; each node needs two NICs
  2. Assign a different subnet and IP to each of the two NICs
  3. Create an nginx deployment, but run each Pod with host networking
  4. Create a headless service and set its endpoints to the secondary-NIC IPs of the nodes
  5. Try nc -vz <SERVICE-IP> 80 from any node. It will or will not succeed, depending on whether the service resolves to the current node or to another node

Anything else we need to know?

I've tried several combinations to rule out possible causes:

  1. The problem exists on at least Rocky 8 and 9, with Kubernetes 1.25.12 and 1.30.0, with Calico, Flannel and Cilium CNI
  2. The problem doesn't exist on Ubuntu 24.04, with Kubernetes 1.25.12, with Calico CNI

So it seems to me that it's not a CNI issue, but rather something related to kube-proxy (or its combination with the OS).

I also dug in a little with tcpdump. It seems the source node did transmit the packets, but the source IP is the primary NIC's, not the secondary one's. This could be the real cause, but I don't know what is causing it.

And I can confirm that:

  1. Connections between nodes via the secondary NIC are OK; I can access the port via the secondary NIC's IP directly without problems
  2. The firewall is stopped and disabled, and SELinux is turned off
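
For reference, I confirmed these with checks along the following lines (the IP and port are placeholders):

# direct connectivity over the secondary NIC
nc -vz <SECONDARY-NIC-IP> <PORT>
# firewall and SELinux state on each node
systemctl is-active firewalld
getenforce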

Kubernetes version

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.16", GitCommit:"c5f43560a4f98f2af3743a59299fb79f07924373", GitTreeState:"clean", BuildDate:"2023-11-15T22:39:12Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"30", GitVersion:"v1.30.0", GitCommit:"7c48c2bd72b9bf5c44d21d7338cc7bea77d0ad2a", GitTreeState:"clean", BuildDate:"2024-04-17T17:27:03Z", GoVersion:"go1.22.2", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.25) and server (1.30) exceeds the supported minor version skew of +/-1

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.3 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.3 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
$ uname -a
Linux kube-1 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 8 17:36:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@fengye87 fengye87 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 28, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 28, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fengye87
Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 28, 2024
@uablrek
Contributor

uablrek commented Apr 28, 2024

This seems to be a routing setup problem (which is not a K8s responsibility).

I also dug in a little with tcpdump. It seems the source node did transmit the packets, but the source IP is the primary NIC's, not the secondary one's. This could be the real cause, but I don't know what is causing it.

So, are packets transmitted from the node on NIC-storage with dst set to one of your configured endpoint addresses, but with the source IP taken from NIC-pod?

Are you trying from a node (i.e. main netns) or from within a POD?

Can you please provide the routing config from that environment (pod or node), like output from ip route?

@fengye87
Author

So, are packets transmitted from the node on NIC-storage with dst set to one of your configured endpoint addresses, but with the source IP taken from NIC-pod?

Yes, that's what I'm seeing from tcpdump

Are you trying from a node (i.e. main netns) or from within a POD?

From main netns

Can you please provide the routing config from that environment (pod or node), like output from ip route?

I've destroyed that cluster, but I did check ip route at the time and it was all normal for both NICs. There were correct routes for each NIC's CIDR.

@uablrek

@uablrek
Contributor

uablrek commented Apr 29, 2024

Each node has two NICs, one for the Pod network (inside 192.168.16.0/20)

Does this mean that all PODs, and the NICs on the nodes have addresses from this range?

That is an unusual setup, at least for IPv4. There's nothing wrong with it, in fact I would encourage it for IPv6, but I expect it not to be very well tested. The common way is to have a private CIDR for PODs, and node addresses (on the NIC) from a more "official" range. E.g. in KinD, PODs have 10.244.0.x addresses, while the nodes have addresses from the Docker network.

In your setup (if I got it right), you have not set up egress masquerading, I suppose? That would explain how packets can be sent on one NIC while carrying the source address of another. BTW, this is usually possible out-of-the-box for IPv4, but for IPv6 you must set the sysctls:

sudo sysctl -w net.ipv4.ip_nonlocal_bind=1
sudo sysctl -w net.ipv6.ip_nonlocal_bind=1
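
For illustration only: if you did want node-originated traffic leaving the storage NIC to carry that NIC's own address, a minimal masquerading rule could look like the line below (the interface name ens5 is an assumption; adjust to your setup).

# SNAT anything leaving the storage NIC to that NIC's own address
sudo iptables -t nat -A POSTROUTING -o ens5 -j MASQUERADE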

@fengye87
Author

fengye87 commented Apr 29, 2024

Does this mean that all PODs, and the NICs on the nodes have addresses from this range?

No. That is just the CIDR for the node's primary NIC. Pods (not using hostNetwork) get addresses from the default Pod CIDR, which does not overlap with either NIC's CIDR.

In fact, the primary network and the default pod network work perfectly. I've been using the same setup for years, and I double-checked that normal services (those with pod selectors) work fine. The problematic service is headless with secondary-NIC addresses as endpoints, something I only recently introduced into my setup for some tests.

@uablrek
Contributor

uablrek commented Apr 30, 2024

From https://kubernetes.io/docs/concepts/services-networking/service/#headless-services:

For headless Services, a cluster IP is not allocated, kube-proxy does not handle these Services, and there is no load balancing or proxying done by the platform for them.

So, for a headless service kube-proxy is not involved. It is CoreDNS that returns the set of endpoint addresses, which seems to be correct in your case (since the dest address is correct).

This problem can't be fixed by any update in K8s.

/remove-kind bug
/kind support

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 30, 2024
@uablrek
Contributor

uablrek commented Apr 30, 2024

Um, that raises another question: how can you use nc -vz <SERVICE-IP> 80?

A headless service has no SERVICE-IP. For testing I use:

apiVersion: v1
kind: Service
metadata:
  name: mconnect
spec:
  clusterIP: None
  ipFamilyPolicy: RequireDualStack
  ports:
  - port: 5001
    name: mconnect
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: mconnect-4
  labels:
    kubernetes.io/service-name: mconnect
addressType: IPv4
ports:
  - name: mconnect
    protocol: TCP
    port: 5001
endpoints:
  - addresses:
      - "192.168.3.201"
    conditions:
      ready: true
    nodeName: vm-201
  - addresses:
      - "192.168.3.202"
    conditions:
      ready: true
    nodeName: vm-202

and

# nslookup mconnect.default.svc.cluster.local
...
Name:   mconnect.default.svc.cluster.local
Address: 192.168.3.201
Name:   mconnect.default.svc.cluster.local
Address: 192.168.3.202
# kubectl get svc mconnect
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
mconnect   ClusterIP   None         <none>        5001/TCP   3m12s

@uablrek
Contributor

uablrek commented Apr 30, 2024

I also tested with a service without a selector, but without "clusterIP: None" (which is what would make it headless). Now there is a SERVICE-IP, or ClusterIP:

NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
mconnect   ClusterIP   12.0.39.151   <none>        5001/TCP   5m38s

Now I can test:

# nc -vz 12.0.39.151 5001
12.0.39.151 (12.0.39.151:5001) open

This works in my env, which is neither "Rocky" nor "Ubuntu". I can't install Rocky just for this test, but I downgraded my kernel to Linux 5.10.215 and it still works.

I will close this issue as it's not a bug in K8s.

/close

But if you only have one external endpoint for storage, I suggest you try "clusterIP: None" and use a symbolic address (a domain name).

@k8s-ci-robot
Contributor

@uablrek: Closing this issue.

In response to this:

/close

@fengye87
Author

I've re-created a cluster to reproduce the problem, please allow me to clarify the problem further.

I have 2 hosts with cleanly installed Rocky 9.3 minimal.

Node     Primary IP (default gateway)   Secondary IP
kube-1   192.168.27.13/20               10.87.87.201/24
kube-2   192.168.29.161/20              10.87.87.202/24

Then I set up the Kubernetes cluster with my Ansible playbook: ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i hosts.ini -k playbook.yml

Install Calico CNI: kubectl apply -f https://projectcalico.docs.tigera.io/manifests/calico.yaml

Install nmstate so that I can configure secondary IP via YAML:

kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/download/v0.82.0/nmstate.io_nmstates.yaml
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/download/v0.82.0/namespace.yaml
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/download/v0.82.0/service_account.yaml
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/download/v0.82.0/role.yaml
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/download/v0.82.0/role_binding.yaml
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/download/v0.82.0/operator.yaml

cat <<EOF | kubectl create -f -
apiVersion: nmstate.io/v1
kind: NMState
metadata:
  name: nmstate
---
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: kube-1-secondary-ip
spec:
  nodeSelector:
    kubernetes.io/hostname: kube-1
  desiredState:
    interfaces:
      - name: ens5
        type: ethernet
        state: up
        ipv4:
          enabled: true
          dhcp: false
          address:
            - ip: "10.87.87.201"
              prefix-length: 24
---
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: kube-2-secondary-ip
spec:
  nodeSelector:
    kubernetes.io/hostname: kube-2
  desiredState:
    interfaces:
      - name: ens5
        type: ethernet
        state: up
        ipv4:
          enabled: true
          dhcp: false
          address:
            - ip: "10.87.87.202"
              prefix-length: 24
EOF

Finally, create a nginx pod and service:

apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  ports:
    - port: 80
      name: http
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: nginx-1
  labels:
    kubernetes.io/service-name: nginx
addressType: IPv4
ports:
  - name: http
    port: 80
endpoints:
  - addresses:
      - "10.87.87.202"
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    kubernetes.io/hostname: kube-2
  hostNetwork: true
  containers:
    - name: nginx
      image: nginx
      securityContext:
        privileged: true
      ports:
        - containerPort: 80

Here you can notice that:

  • The service has no selector; its endpoints are set up manually
  • The endpoint slice contains the secondary IP only
  • The nginx pod is using hostNetwork, so it can be accessed via the secondary IP
  • The nginx pod is on kube-2

Now get nginx service ClusterIP:

fengye87@fengye87-dev:~$ kubectl get svc nginx
NAME    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
nginx   ClusterIP   10.103.166.117   <none>        80/TCP    31m

Then if I access the service IP (10.103.166.117) from kube-2, it works with no problem since the nginx pod is on the same node:

fengye87@fengye87-dev:~$ ssh root@192.168.29.161
root@192.168.29.161's password:
Last login: Tue Apr 30 16:56:46 2024 from 192.168.27.210
[root@kube-2 ~]# nc -vz 10.103.166.117 80
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 10.103.166.117:80.
Ncat: 0 bytes sent, 0 bytes received in 0.04 seconds.
[root@kube-2 ~]# nc -vz 10.87.87.202 80
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 10.87.87.202:80.
Ncat: 0 bytes sent, 0 bytes received in 0.04 seconds.

But if I access the service IP from kube-1, it fails (while accessing the secondary IP directly still works):

fengye87@fengye87-dev:~$ ssh root@192.168.27.13
root@192.168.27.13's password:
Last login: Tue Apr 30 16:56:23 2024 from 192.168.27.210
[root@kube-1 ~]# nc -vz 10.103.166.117 80
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: TIMEOUT.
[root@kube-1 ~]# nc -vz 10.87.87.202 80
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 10.87.87.202:80.
Ncat: 0 bytes sent, 0 bytes received in 0.04 seconds.

tcpdump output:

[root@kube-1 ~]# tcpdump -i ens5 port 80 -n
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens5, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:17:17.307353 IP 192.168.27.13.38498 > 10.87.87.202.http: Flags [S], seq 2360866232, win 64240, options [mss 1460,sackOK,TS val 4273983287 ecr 0,nop,wscale 7], length 0
17:17:18.354999 IP 192.168.27.13.38498 > 10.87.87.202.http: Flags [S], seq 2360866232, win 64240, options [mss 1460,sackOK,TS val 4273984335 ecr 0,nop,wscale 7], length 0
17:17:20.403046 IP 192.168.27.13.38498 > 10.87.87.202.http: Flags [S], seq 2360866232, win 64240, options [mss 1460,sackOK,TS val 4273986383 ecr 0,nop,wscale 7], length 0
17:17:24.435102 IP 192.168.27.13.38498 > 10.87.87.202.http: Flags [S], seq 2360866232, win 64240, options [mss 1460,sackOK,TS val 4273990415 ecr 0,nop,wscale 7], length 0
^C
4 packets captured
9 packets received by filter
0 packets dropped by kernel
[root@kube-2 ~]# tcpdump -i ens5 port 80 -n
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens5, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:17:17.308103 IP 192.168.27.13.38498 > 10.87.87.202.http: Flags [S], seq 2360866232, win 64240, options [mss 1460,sackOK,TS val 4273983287 ecr 0,nop,wscale 7], length 0
17:17:18.355345 IP 192.168.27.13.38498 > 10.87.87.202.http: Flags [S], seq 2360866232, win 64240, options [mss 1460,sackOK,TS val 4273984335 ecr 0,nop,wscale 7], length 0
17:17:20.403413 IP 192.168.27.13.38498 > 10.87.87.202.http: Flags [S], seq 2360866232, win 64240, options [mss 1460,sackOK,TS val 4273986383 ecr 0,nop,wscale 7], length 0
17:17:24.435467 IP 192.168.27.13.38498 > 10.87.87.202.http: Flags [S], seq 2360866232, win 64240, options [mss 1460,sackOK,TS val 4273990415 ecr 0,nop,wscale 7], length 0
^C
4 packets captured
4 packets received by filter
0 packets dropped by kernel

ip route output:

[root@kube-1 ~]# ip route
default via 192.168.16.3 dev ens4 proto dhcp src 192.168.27.13 metric 100
10.87.87.0/24 dev ens5 proto kernel scope link src 10.87.87.201 metric 101
172.16.79.128/26 via 10.87.87.202 dev tunl0 proto bird onlink
blackhole 172.16.126.64/26 proto bird
172.16.126.65 dev cali339b901821b scope link
172.16.126.66 dev calibb80de9a2ea scope link
172.16.126.67 dev calib15f670f129 scope link
172.16.126.68 dev calib25d52d18d5 scope link
172.16.126.69 dev calie86b836035f scope link
172.16.126.70 dev caliaf1513e6664 scope link
172.16.126.71 dev cali275567c8e6f scope link
192.168.16.0/20 dev ens4 proto kernel scope link src 192.168.27.13 metric 100
[root@kube-2 ~]# ip route
default via 192.168.16.3 dev ens4 proto dhcp src 192.168.29.161 metric 100
10.87.87.0/24 dev ens5 proto kernel scope link src 10.87.87.202 metric 101
blackhole 172.16.79.128/26 proto bird
172.16.126.64/26 via 10.87.87.201 dev tunl0 proto bird onlink
192.168.16.0/20 dev ens4 proto kernel scope link src 192.168.29.161 metric 100

I'll keep this env and am happy to provide further info for diagnosis. @uablrek

@uablrek
Contributor

uablrek commented Apr 30, 2024

Thanks, good info. That looks weird indeed. K8s (kube-proxy) only NATs the dest address; the rest is delegated to the cni-plugin and the OS.

That said, it would be very interesting to figure out how this can happen. One possibility is ip "rules". Can you please check:

ip rule

Another possibility is that the packet is routed twice, but I really don't understand how that can happen.

What proxy-mode are you using? iptables (default), or ipvs?

If you use proxy-mode=ipvs, please check:

ip add show dev kube-ipvs0
# and if possible
ipvsadm -ln

@uablrek
Contributor

uablrek commented Apr 30, 2024

There is another mystery: even though the src is wrong, the connect should succeed since the nodes have connectivity on both networks. That would be asymmetric routing, but it should work nevertheless.

Can you please use tcpdump -eni ens5 port 80 on kube-1?

What I am aiming at is to see whether the packets go to the default gw rather than to kube-2.

@fengye87
Author

fengye87 commented May 6, 2024

@uablrek Sorry for my delayed reply, I was on holidays.

IP rules:

[root@kube-1 ~]# ip rule
0:	from all lookup local
32766:	from all lookup main
32767:	from all lookup default
[root@kube-2 ~]# ip rule
0:	from all lookup local
32766:	from all lookup main
32767:	from all lookup default

I'm using the default iptables mode of kube-proxy.

tcpdump -eni ens5 port 80 output:

[root@kube-1 ~]# tcpdump -eni ens5 port 80
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens5, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:31:30.852170 52:54:00:ab:c5:69 > 52:54:00:f7:16:a8, ethertype IPv4 (0x0800), length 74: 192.168.27.13.45230 > 10.87.87.202.http: Flags [S], seq 2283491804, win 64240, options [mss 1460,sackOK,TS val 473069536 ecr 0,nop,wscale 7], length 0
10:31:31.859085 52:54:00:ab:c5:69 > 52:54:00:f7:16:a8, ethertype IPv4 (0x0800), length 74: 192.168.27.13.45230 > 10.87.87.202.http: Flags [S], seq 2283491804, win 64240, options [mss 1460,sackOK,TS val 473070543 ecr 0,nop,wscale 7], length 0
10:31:33.907011 52:54:00:ab:c5:69 > 52:54:00:f7:16:a8, ethertype IPv4 (0x0800), length 74: 192.168.27.13.45230 > 10.87.87.202.http: Flags [S], seq 2283491804, win 64240, options [mss 1460,sackOK,TS val 473072591 ecr 0,nop,wscale 7], length 0
10:31:37.939082 52:54:00:ab:c5:69 > 52:54:00:f7:16:a8, ethertype IPv4 (0x0800), length 74: 192.168.27.13.45230 > 10.87.87.202.http: Flags [S], seq 2283491804, win 64240, options [mss 1460,sackOK,TS val 473076623 ecr 0,nop,wscale 7], length 0
^C
4 packets captured
6 packets received by filter
0 packets dropped by kernel
[root@kube-2 ~]# tcpdump -eni ens5 port 80
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens5, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:31:30.852492 52:54:00:ab:c5:69 > 52:54:00:f7:16:a8, ethertype IPv4 (0x0800), length 74: 192.168.27.13.45230 > 10.87.87.202.http: Flags [S], seq 2283491804, win 64240, options [mss 1460,sackOK,TS val 473069536 ecr 0,nop,wscale 7], length 0
10:31:31.858953 52:54:00:ab:c5:69 > 52:54:00:f7:16:a8, ethertype IPv4 (0x0800), length 74: 192.168.27.13.45230 > 10.87.87.202.http: Flags [S], seq 2283491804, win 64240, options [mss 1460,sackOK,TS val 473070543 ecr 0,nop,wscale 7], length 0
10:31:33.906855 52:54:00:ab:c5:69 > 52:54:00:f7:16:a8, ethertype IPv4 (0x0800), length 74: 192.168.27.13.45230 > 10.87.87.202.http: Flags [S], seq 2283491804, win 64240, options [mss 1460,sackOK,TS val 473072591 ecr 0,nop,wscale 7], length 0
10:31:37.938939 52:54:00:ab:c5:69 > 52:54:00:f7:16:a8, ethertype IPv4 (0x0800), length 74: 192.168.27.13.45230 > 10.87.87.202.http: Flags [S], seq 2283491804, win 64240, options [mss 1460,sackOK,TS val 473076623 ecr 0,nop,wscale 7], length 0
^C
4 packets captured
6 packets received by filter
0 packets dropped by kernel

@fengye87
Author

fengye87 commented May 7, 2024

So I created another cluster with a similar setup, only this time the host OS is Debian 12. As I said, the above scenario works in this setup. Below is some output:

root@kube-1:~# nc -vz 10.111.222.200 80
10.111.222.200: inverse host lookup failed: Unknown host
(UNKNOWN) [10.111.222.200] 80 (http) open
root@kube-1:~# tcpdump -i any port 80 -n
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
13:34:48.678067 ens5  Out IP 192.168.28.51.51940 > 10.87.87.212.80: Flags [S], seq 2217316973, win 64240, options [mss 1460,sackOK,TS val 3242569523 ecr 0,nop,wscale 7], length 0
13:34:48.678554 ens4  In  IP 10.87.87.212.80 > 192.168.28.51.51940: Flags [S.], seq 171408541, ack 2217316974, win 65160, options [mss 1460,sackOK,TS val 3117473550 ecr 3242569523,nop,wscale 7], length 0
13:34:48.678623 ens5  Out IP 192.168.28.51.51940 > 10.87.87.212.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3242569523 ecr 3117473550], length 0
13:34:48.678699 ens5  Out IP 192.168.28.51.51940 > 10.87.87.212.80: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 3242569523 ecr 3117473550], length 0
13:34:48.679175 ens4  In  IP 10.87.87.212.80 > 192.168.28.51.51940: Flags [F.], seq 1, ack 2, win 510, options [nop,nop,TS val 3117473551 ecr 3242569523], length 0
13:34:48.679200 ens5  Out IP 192.168.28.51.51940 > 10.87.87.212.80: Flags [.], ack 2, win 502, options [nop,nop,TS val 3242569524 ecr 3117473551], length 0
^C
6 packets captured
8 packets received by filter
0 packets dropped by kernel
root@kube-2:~# tcpdump -i any port 80 -n
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
13:34:48.804312 ens5  In  IP 192.168.28.51.51940 > 10.87.87.212.80: Flags [S], seq 2217316973, win 64240, options [mss 1460,sackOK,TS val 3242569523 ecr 0,nop,wscale 7], length 0
13:34:48.804418 ens4  Out IP 10.87.87.212.80 > 192.168.28.51.51940: Flags [S.], seq 171408541, ack 2217316974, win 65160, options [mss 1460,sackOK,TS val 3117473550 ecr 3242569523,nop,wscale 7], length 0
13:34:48.804659 ens5  In  IP 192.168.28.51.51940 > 10.87.87.212.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3242569523 ecr 3117473550], length 0
13:34:48.804744 ens5  In  IP 192.168.28.51.51940 > 10.87.87.212.80: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 3242569523 ecr 3117473550], length 0
13:34:48.804990 ens4  Out IP 10.87.87.212.80 > 192.168.28.51.51940: Flags [F.], seq 1, ack 2, win 510, options [nop,nop,TS val 3117473551 ecr 3242569523], length 0
13:34:48.805210 ens5  In  IP 192.168.28.51.51940 > 10.87.87.212.80: Flags [.], ack 2, win 502, options [nop,nop,TS val 3242569524 ecr 3117473551], length 0
^C
6 packets captured
9 packets received by filter
0 packets dropped by kernel

So we can see both In and Out packets in tcpdump. But on the Rocky 9 setup, packets show up in only one direction:

[root@kube-1 ~]# tcpdump -i any port 80 -n
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
13:36:48.598318 ens5  Out IP 192.168.27.13.51528 > 10.87.87.202.http: Flags [S], seq 3710117211, win 64240, options [mss 1460,sackOK,TS val 570587282 ecr 0,nop,wscale 7], length 0
13:36:49.619041 ens5  Out IP 192.168.27.13.51528 > 10.87.87.202.http: Flags [S], seq 3710117211, win 64240, options [mss 1460,sackOK,TS val 570588303 ecr 0,nop,wscale 7], length 0
13:36:51.667001 ens5  Out IP 192.168.27.13.51528 > 10.87.87.202.http: Flags [S], seq 3710117211, win 64240, options [mss 1460,sackOK,TS val 570590351 ecr 0,nop,wscale 7], length 0
13:36:55.699009 ens5  Out IP 192.168.27.13.51528 > 10.87.87.202.http: Flags [S], seq 3710117211, win 64240, options [mss 1460,sackOK,TS val 570594383 ecr 0,nop,wscale 7], length 0
^C
4 packets captured
17 packets received by filter
0 packets dropped by kernel
[root@kube-2 ~]# tcpdump -i any port 80 -n
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
13:36:48.597978 ens5  In  IP 192.168.27.13.51528 > 10.87.87.202.http: Flags [S], seq 3710117211, win 64240, options [mss 1460,sackOK,TS val 570587282 ecr 0,nop,wscale 7], length 0
13:36:49.618401 ens5  In  IP 192.168.27.13.51528 > 10.87.87.202.http: Flags [S], seq 3710117211, win 64240, options [mss 1460,sackOK,TS val 570588303 ecr 0,nop,wscale 7], length 0
13:36:51.666354 ens5  In  IP 192.168.27.13.51528 > 10.87.87.202.http: Flags [S], seq 3710117211, win 64240, options [mss 1460,sackOK,TS val 570590351 ecr 0,nop,wscale 7], length 0
13:36:55.698380 ens5  In  IP 192.168.27.13.51528 > 10.87.87.202.http: Flags [S], seq 3710117211, win 64240, options [mss 1460,sackOK,TS val 570594383 ecr 0,nop,wscale 7], length 0
^C
4 packets captured
11 packets received by filter
0 packets dropped by kernel

So I think the original problem can be divided into two parts:

  1. Why Rocky 9 does not reply to asymmetrically routed packets. This is not related to kube-proxy, probably an OS config thing; I'll dig into it.
  2. In this scenario the source IP is not what one would expect. Even though it should work with asymmetric routing, one would expect the source IP to be the secondary NIC's IP. Could you confirm whether this is the designed behavior of kube-proxy? @uablrek

@fengye87
Author

fengye87 commented May 7, 2024

  1. Why Rocky 9 does not reply to asymmetrically routed packets. This is not related to kube-proxy, probably an OS config thing; I'll dig into it.

Yep, it's an OS config thing. It turns out Rocky sets rp_filter to 1 by default. If I set it to 0, the problem goes away. And this value is 0 by default on Debian, which is why the problem doesn't exist on Debian hosts.
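
Roughly what that looks like, as a sketch (which keys matter depends on the distro defaults, since the kernel combines the conf.all and per-interface values; 2 means "loose" mode, which also tolerates asymmetric routing):

# check the current reverse-path filter settings (global and the storage NIC)
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.ens5.rp_filter
# relax them on the running system
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ens5.rp_filter=0
# persist across reboots
printf 'net.ipv4.conf.all.rp_filter = 0\nnet.ipv4.conf.ens5.rp_filter = 0\n' > /etc/sysctl.d/99-rp-filter.conf
sysctl --system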

@uablrek
Contributor

uablrek commented May 7, 2024

Could you confirm whether this is the designed behavior of kube-proxy?

As I said, kube-proxy is not involved in routing at all. It NATs the dest and leaves routing to the OS and/or the cni-plugin. That's why this problem can't be fixed in K8s code.
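
For reference, you can inspect the DNAT entry it installs for a ClusterIP with something like the following (using the nginx ClusterIP from your earlier output; the exact chain layout depends on the kube-proxy version):

# the KUBE-SERVICES chain in the nat table holds the per-service entry points
iptables -t nat -L KUBE-SERVICES -n | grep 10.103.166.117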

@fengye87
Author

fengye87 commented May 7, 2024

OK, I see. Thanks a lot for your help @uablrek, it really clarified the problem for me. And would you mind pointing me in the right direction on how to fix this, so that packets for this service go in and out through the secondary NIC only?

@uablrek
Contributor

uablrek commented May 7, 2024

I can't reproduce the asymmetric routing in my env. The routing tables you included last in #124587 (comment) should direct packets on kube-1 to dest 10.87.87.202 via ens5 with src 10.87.87.201:

10.87.87.0/24 dev ens5 proto kernel scope link src 10.87.87.201 metric 101

Not directed via ens5 with src 192.168.27.13 as in your trace:

17:17:17.307353 IP 192.168.27.13.38498 > 10.87.87.202.http: Flags [S], seq 2360866232, win 64240, options [mss ...

If I can attend the sig/network meeting on May 9, I will ask if anybody can explain how this can happen. But in any case, I think all will agree that this is not a K8s bug.

@fengye87
Author

fengye87 commented May 7, 2024

The routing tables you included last in #124587 (comment) should direct packets on kube-1 to dest 10.87.87.202 via ens5 with src 10.87.87.201:

Yes, that's the strange part of it. I'm not very good at iptables, but is it possible that iptables has chosen the wrong source IP?

If I can attend the sig/network meeting on May 9, I will ask if anybody can explain how this can happen. But in any case, I think all will agree that this is not a K8s bug.

That would be great then, thanks again!

@uablrek
Contributor

uablrek commented May 7, 2024

I succeeded in getting asymmetric routing 😄, but I had to trash the iptables rules set up by kube-proxy:

# iptables -t nat -Z
# iptables -t nat -L -nv
(narrowed...)
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
    0     0 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    0     0 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully
# (do a test access here... and after:)
# iptables -t nat -L -nv
(narrowed...)
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    8   480 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
    1    60 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    1    60 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

You see one hit on the MASQUERADE rule caused by the access. What MASQUERADE does is NAT the src address to one on the outgoing interface. In your case it would NAT (whateverIP)->10.87.87.201.

Now, if I deliberately remove the MASQUERADE rule:

# iptables -t nat -D KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully

Then I get asymmetric routing!
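
In your case, one way to see whether that SNAT actually happened for a test connection is to look at the conntrack table on kube-1 (assuming conntrack-tools is installed):

# with SNAT applied, the reply-direction tuple points back at 10.87.87.201; without it, at the primary IP
conntrack -L -p tcp | grep 10.87.87.202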

@uablrek
Contributor

uablrek commented May 7, 2024

So, this can still be a bug in kube-proxy. But it is more likely that something in your environment is wrong, since it works in mine unless I deliberately break the setup.

@fengye87
Author

fengye87 commented May 8, 2024

Hmm, that's interesting, because my iptables -t nat -L -nv output is:

Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
  229 15376 RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
    0     0 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    0     0 MASQUERADE  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

The packet count of the MASQUERADE rule is always 0 (on both the Debian and the Rocky clusters).

@uablrek
Contributor

uablrek commented May 8, 2024

Hm, please check:

sysctl -w net.bridge.bridge-nf-call-iptables=1

If this is 0, weird things happen. (But I tried setting it to 0 in my env and still didn't get asymmetric routing...)

@uablrek
Contributor

uablrek commented May 8, 2024

The call to the KUBE-MARK-MASQ is made from the KUBE-SVC-* chain for the service. My env:

Chain KUBE-SVC-LZMPKBY6MU3PEIQ4 (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    1    60 KUBE-MARK-MASQ  tcp  --  *      *      !11.0.0.0/16          12.0.204.152         /* default/mconnect:mconnect cluster IP */ tcp dpt:5001
    0     0 KUBE-SEP-VMQIYBFKOHI4ZCKJ  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/mconnect:mconnect -> 192.168.3.3:5001 */ statistic mode random probability 0.50000000000
    1    60 KUBE-SEP-6DTZE6SG2FKOJ6OR  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/mconnect:mconnect -> 192.168.3.4:5001 */

Please check the addresses and counters in your env.

@fengye87
Author

fengye87 commented May 8, 2024

Hm, please check:

sysctl -w net.bridge.bridge-nf-call-iptables=1

The value is already 1:

[root@kube-1 ~]# sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1

I can see the packet count incremented in other rules, but not in the MASQUERADE rule of the KUBE-POSTROUTING chain. See the diffs below:

(Screenshots: diffs of iptables -t nat -L -nv counters taken before and after the test access)

And the diff is almost identical if I change the nginx pod back to the pod network and the service back to a selector-based one.

@uablrek
Contributor

uablrek commented May 8, 2024

The problem seems to be that your "KUBE-SVC-*" chains don't have a KUBE-MARK-MASQ rule as mine do (#124587 (comment)).

I don't know why, but please provide your kube-proxy configmap and the service manifest.

@uablrek
Contributor

uablrek commented May 8, 2024

I found it 😄

You MUST provide clusterCIDR in your kube-proxy config!

This allows kube-proxy to distinguish traffic from pods (no MASQ needed), and from anything else (MASQ needed, in our case the node).

My kube-proxy conf has this item:

clusterCIDR: "11.0.0.0/16,1100::/48"

If I remove it, the KUBE-MARK-MASQ rule does not appear in the "KUBE-SVC-*" chains, and I get asymmetric routing.

So, this is not a bug in K8s, I was right about that. But it was a hard problem to find.

@uablrek
Contributor

uablrek commented May 8, 2024

To sum up:

  • This is not a headless service (clusterIP=None); the service has a clusterIP but no selector
  • When the clusterIP is addressed from the main netns on a node, and the endpoint is external, the src must be masqueraded
  • To allow kube-proxy to insert masquerading rules, the clusterCIDR must be provided in the kube-proxy conf

(there are other possibilities than setting clusterCIDR; please check pkg/proxy/util/localdetector.go if you are interested)
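
If your cluster is kubeadm-style, with kube-proxy running as a DaemonSet configured from a ConfigMap named kube-proxy, setting it could look roughly like this (the CIDR below is only an example; use your actual pod CIDR):

# see what clusterCIDR is currently set to
kubectl -n kube-system get configmap kube-proxy -o yaml | grep clusterCIDR
# edit the config.conf key and set, e.g.:  clusterCIDR: "172.16.0.0/16"
kubectl -n kube-system edit configmap kube-proxy
# restart kube-proxy so it regenerates its iptables rules
kubectl -n kube-system rollout restart daemonset kube-proxy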

@fengye87
Author

fengye87 commented May 9, 2024

You MUST provide clusterCIDR in your kube-proxy config!

Yes, I can confirm that you're right. After setting a proper clusterCIDR, the asymmetric routing problem went away.

This is not a headless service (clusterIP=None), the service has a clusterIP, but not a selector

Yes, I've misused "headless" here. I'll change the issue title.

Thank you for your patience, really helped me out of this problem.

@fengye87 fengye87 changed the title Headless service with secondary IPs not working properly on Rocky 8/9 with latest kube-proxy Selector-less service with secondary IPs not working properly on Rocky 8/9 with latest kube-proxy May 9, 2024