Wrong udp conntrack entries are populated after a pod is killed using --force #105657

Closed
pmarag opened this issue Oct 13, 2021 · 22 comments · Fixed by #106163
Labels
area/kube-proxy kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments


pmarag commented Oct 13, 2021

What happened?

On a single-node Kubernetes cluster deployed using minikube:

[root@control-plane ~]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
control-plane.minikube.internal Ready control-plane,master 46h v1.22.2 10.45.253.76 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.el8.x86_64 docker://20.10.6

We have two pods handling SIP traffic. One pod sends SIP OPTIONS requests to the other on port 5060 through a service.

ea1-tasftapp ClusterIP 10.97.122.198 <none> 5060/UDP 46h

tas-sip-b2bua-tasftapp-69c574b88c-7nvsm 1/1 Running 0 13m 10.244.0.35 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tassnf-78fd487c7d-t9b2j 1/1 Running 0 45h 10.244.0.24 control-plane.minikube.internal <none> <none>

The connection in conntrack looks like this:

udp 17 119 src=10.244.0.24 dst=10.97.122.198 sport=5060 dport=5060 src=10.244.0.35 dst=10.244.0.24 sport=5060 dport=5060 [ASSURED] mark=0 use=1

If we delete the pod behind the ea1-tasftapp service by using kubectl delete pod, a new one is created:

tas-sip-b2bua-tasftapp-69c574b88c-7rdgm 1/1 Running 0 47s 10.244.0.36 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tassnf-78fd487c7d-t9b2j 1/1 Running 0 45h 10.244.0.24 control-plane.minikube.internal <none> <none>

and the conntrack entry is updated correctly:

udp 17 119 src=10.244.0.24 dst=10.97.122.198 sport=5060 dport=5060 src=10.244.0.36 dst=10.244.0.24 sport=5060 dport=5060 [ASSURED] mark=0 use=1

If we delete the pod using --force, a new pod is created:

tas-sip-b2bua-tasftapp-69c574b88c-gvx9g 1/1 Running 0 22s 10.244.0.37 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tassnf-78fd487c7d-t9b2j 1/1 Running 0 45h 10.244.0.24 control-plane.minikube.internal <none> <none>

but the conntrack entry is updated with the node IP instead of the new pod's IP. Because of this, connectivity between the two pods doesn't work.

udp 17 29 src=10.244.0.24 dst=10.97.122.198 sport=5060 dport=5060 [UNREPLIED] src=10.97.122.198 dst=10.45.253.76 sport=5060 dport=9817 mark=0 use=1

The problem is solved only by restarting the pod that initiated the connection (tas-sip-b2bua-tassnf-78fd487c7d-t9b2j) or by restarting kube-proxy.
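In practice the two workarounds amount to something like the following (a sketch; the kube-proxy label assumes the default kubeadm/minikube DaemonSet, and the ClusterIP is the one shown in the conntrack entry above):

# restart kube-proxy by deleting its pod so the DaemonSet recreates it
kubectl -n kube-system delete pod -l k8s-app=kube-proxy
# or clear the stale UDP conntrack entries for the service ClusterIP on the node
conntrack -D -p udp --orig-dst 10.97.122.198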

The issue appears only on Kubernetes 1.22; on version 1.21 everything works properly.

What did you expect to happen?

After the pod is killed, a new conntrack entry should be created pointing to the IP of the new pod.

How can we reproduce it (as minimally and precisely as possible)?

Establish UDP connectivity between a pod and a service, then force delete the pod that the service routes the traffic to.
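A condensed sketch of that scenario (names are illustrative; a complete reproducer with manifests is attached further down in this thread):

# UDP server behind a ClusterIP service, plus a client pod that keeps sending to the service name
kubectl apply -f options-responder.yaml
kubectl apply -f options-sender.yaml
# force delete the pod backing the service while the client keeps sending
kubectl delete pod <responder-pod-name> --force
# on the node, the conntrack entry for the service now shows the node IP in the reply tuple
conntrack -L -p udp | grep <service-cluster-ip>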

Anything else we need to know?

No response

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

On premises deployment.

OS version

NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"

Linux control-plane.minikube.internal 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

minikube version: v1.23.2
commit: 0a0ad764652082477c00d51d2475284b5d39ceed

Related plugins (CNI, CSI, ...) and versions (if applicable)

[root@control-plane ~]# calicoctl version
Client Version: v3.20.2
Git commit: dcb4b76a
Cluster Version: v3.20.2
Cluster Type: k8s,kdd,bgp,kubeadm

@pmarag pmarag added the kind/bug Categorizes issue or PR as related to a bug. label Oct 13, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 13, 2021

pmarag commented Oct 13, 2021

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 13, 2021

pmarag commented Oct 14, 2021

/area kube-proxy


aojea commented Oct 14, 2021

Can you share the manifests and steps to reproduce it?
It's not clear to me which pods belong to the service in this case.

The conntrack entry using the node IP depends on the iptables rules configured on the host; if you target a service that doesn't exist, most probably the CNI source-NATs the pod traffic to the host IP.
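To check that on the node, something like this helps (a sketch; substitute the actual ClusterIP):

# which NAT rules does kube-proxy currently have for the service ClusterIP?
iptables-save -t nat | grep 10.97.122.198
# which conntrack entries exist for it?
conntrack -L -p udp --orig-dst 10.97.122.198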


vjhpe commented Oct 14, 2021

Let me try to describe our setup, as reproducing our deployment would not be possible.

# kubectl get nodes -o wide
NAME                              STATUS   ROLES                  AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION          CONTAINER-RUNTIME
control-plane.minikube.internal   Ready    control-plane,master   3d    v1.22.2   10.45.253.76   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)   4.18.0-305.el8.x86_64   docker://20.10.6

# kubectl get svc --namespace tas -o wide --selector platform=tas-scif-sip
NAME           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE     SELECTOR
ea1-redis      ClusterIP   10.101.100.25    <none>        6379/TCP            2d23h   app=redis,chart=redis,environment=dev,platform=tas-scif-sip,release=ea1
ea1-tasftapp   ClusterIP   10.97.122.198    <none>        5060/UDP            2d23h   app=tasftapp,chart=tas-sip-b2bua,environment=dev,platform=tas-scif-sip,release=ea1
ea1-tassnf     NodePort    10.105.71.73     <none>        5060:30191/UDP      2d23h   app=snf,chart=tas-sip-b2bua,environment=dev,platform=tas-scif-sip,release=ea1
ea1-telegraf   ClusterIP   10.104.241.139   <none>        8186/TCP,9273/TCP   2d23h   app=telegraf,chart=metrics,environment=dev,platform=tas-scif-sip,release=ea1

The issue here is between the services "ea1-tassnf" and "ea1-tasftapp". Both these services have 1 pod behind them.

# kubectl get pods --namespace tas -o wide --selector platform=tas-scif-sip
NAME                                      READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
metrics-telegraf-cd5b5d79f-xfrx2          1/1     Running   0          2d23h   10.244.0.19   control-plane.minikube.internal   <none>           <none>
redis-tasredis-79f77dd7f8-4xmq9           1/1     Running   0          2d23h   10.244.0.17   control-plane.minikube.internal   <none>           <none>
tas-sip-b2bua-tasftapp-69c574b88c-gvx9g   1/1     Running   0          25h     10.244.0.37   control-plane.minikube.internal   <none>           <none>
tas-sip-b2bua-tassnf-78fd487c7d-t9b2j     1/1     Running   0          2d23h   10.244.0.24   control-plane.minikube.internal   <none>           <none>

The 'tas-sip-b2bua-tassnf' pod sends periodic heartbeats (SIP protocol, over UDP) to 'tas-sip-b2bua-tasftapp'. The heartbeats are addressed to the service name 'ea1-tasftapp'.
Everything works fine until we simulate a failure by force deleting the pod 'tas-sip-b2bua-tasftapp' (kubectl delete pod ... --force). This results in a new 'tas-sip-b2bua-tasftapp' pod, but the new pod never receives the heartbeats; the 'tas-sip-b2bua-tassnf' pod keeps sending them, and they time out.
The root cause seems to be the conntrack entries (see the explanation in the issue description).
If we restart kube-proxy, the heartbeats start working again.
Also, as stated earlier, the exact same scenario works fine on Kubernetes v1.21; the issue appears from v1.22.


aojea commented Oct 14, 2021

Most of the conntrack issues with UDP were fixed, so I'm surprised this worked before and not now :/
Can you run kube-proxy with verbosity 7 and attach the logs after reproducing the issue? (Please share roughly the time when you delete the pod and the names of the pods so I can find them easily in the logs.)
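Bumping the verbosity is usually done by editing the kube-proxy DaemonSet (a sketch, assuming the default kubeadm/minikube setup where kube-proxy runs as a DaemonSet labeled k8s-app=kube-proxy):

# add --v=7 to the kube-proxy container args, then let the pods restart
kubectl -n kube-system edit daemonset kube-proxy
# stream the logs while reproducing the issue
kubectl -n kube-system logs -l k8s-app=kube-proxy -f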


pmarag commented Oct 14, 2021

@aojea I have the logs from exactly when we started our test scenario and deleted the pod. The verbosity is 9; I hope this is not a problem. Please find them attached! I also attach the exact same test on Kubernetes 1.21 (where everything works as expected). I can see some differences, but to be honest I wasn't able to find what exactly is causing the issue.
kube-proxy-kubernetes-1.21.txt
kube-proxy-kubernetes-1.22.log

Since we can always replicate the issue, I can provide whatever else is needed or may help with the troubleshooting.


aojea commented Oct 14, 2021

EDIT: the analysis below is WRONG.

On 1.22:

I1012 10:43:50.515733 1 bounded_frequency_runner.go:296] sync-runner: ran, next possible in 1s, periodic in 1h0m0s
I1012 10:43:56.533279 1 config.go:260] Calling handler.OnEndpointSliceUpdate
I1012 10:43:56.533356 1 endpointslicecache.go:358] Setting endpoints for "tas/ea1-tasftapp:sip" to [10.244.0.29:5060]
I1012 10:43:56.533376 1 endpointslicecache.go:358] Setting endpoints for "tas/ea1-tasftapp:sip" to [10.244.0.29:5060]
I1012 10:43:56.533429 1 proxier.go:845] "Syncing iptables rules"

On 1.21:

I1012 12:54:20.458590 1 config.go:260] Calling handler.OnEndpointSliceUpdate
I1012 12:54:20.906930 1 config.go:260] Calling handler.OnEndpointSliceUpdate
I1012 12:54:20.906992 1 endpointslicecache.go:354] Setting endpoints for "tas/ea1-tasftapp:sip" to [10.244.0.15:5060]
I1012 12:54:20.907097 1 proxier.go:841] "Stale service" protocol="udp" svcPortName="tas/ea1-tasftapp:sip" clusterIP="10.110.43.106"
I1012 12:54:20.907114 1 proxier.go:857] "Syncing iptables rules"

On 1.21 it deletes the conntrack entry for the service:

I1012 12:54:20.989553 1 proxier.go:1658] "Deleting conntrack stale entries for Services" ips=[10.110.43.106]
I1012 12:54:20.989584 1 conntrack.go:66] Clearing conntrack entries [-D --orig-dst 10.110.43.106 -p udp]

On 1.22 it doesn't detect the stale service 🤔 and it calls the slice cache twice but the handler once.

@danwinship does this ring a bell?


aojea commented Oct 15, 2021

If we delete the pod behind the ea1-tasftapp service by using kubectl delete pod, a new one is created:
If we delete the pod using force, a new pod is created:

@pmarag do I read it correctly that the problem only happens when using --force to delete the pod, and that without --force it works fine?


vjhpe commented Oct 15, 2021

@aojea, actually our latest tests show otherwise, let me explain. The ea1-tasftapp pod used to take time to terminate (in its SIGTERM handler), and this issue was observed only when we deleted the pod with --force, not with graceful termination or with a large grace period (e.g. --grace-period=10).
Now we have corrected a bug in our application's SIGTERM handler, and the application terminates very fast.
With this update in the application, we observe the issue in all cases, i.e. regardless of whether the pod is deleted gracefully or with force.
Maybe this gives a clue? When the application used to take time to terminate, we didn't get this issue when the pod was deleted with --grace-period=10, but if we shortened the grace period (e.g. --grace-period=5) or used --force, the issue appeared.


aojea commented Oct 15, 2021

It's not clear to me that this is related to the Kubernetes version; the logs are missing some information, but in the 1.22 logs the logic seems correct.

iptables rule created

I1012 10:43:48.812390 1 traffic.go:91] [DetectLocalByCIDR (10.244.0.0/16)] Jump Not Local: [-m comment --comment "tas/ea1-tasftapp:sip cluster IP" -m udp -p udp -d 10.97.122.198/32 --dport 5060 ! -s 10.244.0.0/16 -j KUBE-MARK-MASQ]

stale endpoint detected

I1012 10:43:48.825804 1 endpointslicecache.go:358] Setting endpoints for "tas/ea1-tasftapp:sip" to [10.244.0.28:5060]
I1012 10:43:48.825822 1 endpoints.go:553] Stale endpoint tas/ea1-tasftapp:sip -> 10.244.0.28:5060

entries deleted

I1012 10:43:48.907046 1 proxier.go:1694] "Deleting stale endpoint connections" endpoints=[{Endpoint:10.244.0.28:5060 ServicePortName:tas/ea1-tasftapp:sip}]
I1012 10:43:48.907093 1 conntrack.go:66] Clearing conntrack entries [-D --orig-dst 10.97.122.198 --dst-nat 10.244.0.28 -p udp]
I1012 10:43:48.910404 1 proxier.go:812] "SyncProxyRules complete" elapsed="84.61535ms"
I1012 10:43:48.910433 1 bounded_frequency_runner.go:296] sync-runner: ran, next possible in 1s, periodic in 1h0m0s

Less than one second later, a new endpoint is added to the service and the stale service is detected:

I1012 10:43:49.698095 1 config.go:260] Calling handler.OnEndpointSliceUpdate
I1012 10:43:50.432177 1 config.go:260] Calling handler.OnEndpointSliceUpdate
I1012 10:43:50.432273 1 endpointslicecache.go:358] Setting endpoints for "tas/ea1-tasftapp:sip" to [10.244.0.29:5060]
I1012 10:43:50.432577 1 proxier.go:829] "Stale service" protocol="udp" svcPortName="tas/ea1-tasftapp:sip" clusterIP="10.97.122.198"

and its entries are deleted

I1012 10:43:50.512309 1 proxier.go:1681] "Deleting conntrack stale entries for Services" ips=[10.97.122.198]
I1012 10:43:50.512350 1 conntrack.go:66] Clearing conntrack entries [-D --orig-dst 10.97.122.198 -p udp]

I don't know if you have a race somewhere, in the application or with the CNI; I suggest you try this with KIND, which has less "networking overhead".
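For completeness, spinning up a kind cluster pinned to the affected version looks roughly like this (the node image tag is an assumption; pick the tag published for your kind release):

kind create cluster --name conntrack-repro --image kindest/node:v1.22.2
kubectl cluster-info --context kind-conntrack-repro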


pmarag commented Oct 15, 2021

@aojea We already tried with KIND; actually that is the CNI used in the environment that my colleague @vjhpe is using. The behavior is exactly the same. As I said, I'm always available to provide further logs from any Kubernetes version (1.20, 1.21, or 1.22). Right now the behavior is the same using Calico or KIND. The scenario works on 1.20 and 1.21 but not on 1.22.


aojea commented Oct 15, 2021

Wow, can you attach the full kube-proxy logs?

I'd like to check the configuration options and the API calls.


pmarag commented Oct 15, 2021

@aojea Thank you again for the support and effort. Please find all the information below:

Kubernetes 1.21

[root@control-plane ~]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
control-plane.minikube.internal Ready control-plane,master 2d23h v1.21.5 10.45.253.76 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.el8.x86_64 docker://20.10.6

[root@control-plane ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

[root@control-plane ~]# kubectl get svc --namespace tas -o wide --selector platform=tas-scif-sip
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
ea1-redis ClusterIP 10.109.170.31 <none> 6379/TCP 21s app=redis,chart=redis,environment=dev,platform=tas-scif-sip,release=ea1
ea1-tasftapp ClusterIP 10.110.133.170 <none> 5060/UDP 21s app=tasftapp,chart=tas-sip-b2bua,environment=dev,platform=tas-scif-sip,release=ea1
ea1-tassnf NodePort 10.96.49.152 <none> 5060:32492/UDP 21s app=snf,chart=tas-sip-b2bua,environment=dev,platform=tas-scif-sip,release=ea1
ea1-telegraf ClusterIP 10.109.214.39 <none> 8186/TCP,9273/TCP 21s app=telegraf,chart=metrics,environment=dev,platform=tas-scif-sip,release=ea1

[root@control-plane ~]# kubectl get pods --namespace tas -o wide --selector platform=tas-scif-sip
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metrics-telegraf-cd5b5d79f-c2lq8 1/1 Running 0 35s 10.244.0.17 control-plane.minikube.internal <none> <none>
redis-tasredis-79f77dd7f8-rssh4 1/1 Running 0 35s 10.244.0.19 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tasftapp-69c574b88c-nqhxb 1/1 Running 0 35s 10.244.0.22 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tassnf-78fd487c7d-c6j6t 1/1 Running 0 35s 10.244.0.18 control-plane.minikube.internal <none> <none>

[root@control-plane ~]# conntrack -L | grep 10.110.133.170
udp 17 119 src=10.244.0.18 dst=10.110.133.170 sport=5060 dport=5060 src=10.244.0.22 dst=10.244.0.18 sport=5060 dport=5060 [ASSURED] mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 294 flow entries have been shown.

After force delete:

[root@control-plane ~]# kubectl get pods --namespace tas -o wide --selector platform=tas-scif-sip
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metrics-telegraf-cd5b5d79f-c2lq8 1/1 Running 0 2m3s 10.244.0.17 control-plane.minikube.internal <none> <none>
redis-tasredis-79f77dd7f8-rssh4 1/1 Running 0 2m3s 10.244.0.19 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tasftapp-69c574b88c-mtsrp 1/1 Running 0 15s 10.244.0.23 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tassnf-78fd487c7d-c6j6t 1/1 Running 0 2m3s 10.244.0.18 control-plane.minikube.internal <none> <none>

[root@control-plane ~]# conntrack -L | grep 10.110.133.170
udp 17 119 src=10.244.0.18 dst=10.110.133.170 sport=5060 dport=5060 src=10.244.0.23 dst=10.244.0.18 sport=5060 dport=5060 [ASSURED] mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 315 flow entries have been shown.

Kube-proxy logs with verbosity 9

kubernetes_1.21_conntrack_issue.log

Kubernetes 1.22

[root@control-plane ~]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
control-plane.minikube.internal Ready control-plane,master 4d1h v1.22.2 10.45.253.76 <none> Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-305.el8.x86_64 docker://20.10.6

[root@control-plane ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2"

[root@control-plane ~]# kubectl get svc --namespace tas -o wide --selector platform=tas-scif-sip
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
ea1-redis ClusterIP 10.100.207.51 <none> 6379/TCP 2m17s app=redis,chart=redis,environment=dev,platform=tas-scif-sip,release=ea1
ea1-tasftapp ClusterIP 10.102.145.128 <none> 5060/UDP 2m17s app=tasftapp,chart=tas-sip-b2bua,environment=dev,platform=tas-scif-sip,release=ea1
ea1-tassnf NodePort 10.107.207.247 <none> 5060:31779/UDP 2m17s app=snf,chart=tas-sip-b2bua,environment=dev,platform=tas-scif-sip,release=ea1
ea1-telegraf ClusterIP 10.109.139.216 <none> 8186/TCP,9273/TCP 2m17s app=telegraf,chart=metrics,environment=dev,platform=tas-scif-sip,release=ea1

[root@control-plane ~]# kubectl get pods --namespace tas -o wide --selector platform=tas-scif-sip
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metrics-telegraf-cd5b5d79f-tqbb8 1/1 Running 0 2m38s 10.244.0.41 control-plane.minikube.internal <none> <none>
redis-tasredis-79f77dd7f8-x2fs7 1/1 Running 0 2m38s 10.244.0.43 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tasftapp-69c574b88c-jhdv4 1/1 Running 0 2m38s 10.244.0.42 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tassnf-78fd487c7d-sb447 1/1 Running 0 2m38s 10.244.0.40 control-plane.minikube.internal <none> <none>

[root@control-plane ~]# conntrack -L | grep 10.102.145.128
udp 17 119 src=10.244.0.40 dst=10.102.145.128 sport=5060 dport=5060 src=10.244.0.42 dst=10.244.0.40 sport=5060 dport=5060 [ASSURED] mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 284 flow entries have been shown.

After force delete:

[root@control-plane ~]# kubectl get pods --namespace tas -o wide --selector platform=tas-scif-sip
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metrics-telegraf-cd5b5d79f-tqbb8 1/1 Running 0 5m44s 10.244.0.41 control-plane.minikube.internal <none> <none>
redis-tasredis-79f77dd7f8-x2fs7 1/1 Running 0 5m44s 10.244.0.43 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tasftapp-69c574b88c-xjwtg 1/1 Running 0 103s 10.244.0.44 control-plane.minikube.internal <none> <none>
tas-sip-b2bua-tassnf-78fd487c7d-sb447 1/1 Running 0 5m44s 10.244.0.40 control-plane.minikube.internal <none> <none>

[root@control-plane ~]# conntrack -L | grep 10.102.145.128
udp 17 29 src=10.244.0.40 dst=10.102.145.128 sport=5060 dport=5060 [UNREPLIED] src=10.102.145.128 dst=10.45.253.76 sport=5060 dport=7845 mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 318 flow entries have been shown.

Kube-proxy logs with verbosity 9

[kubernetes_1.22_conntrack_issue.log](https://github.com/kubernetes/kubernetes/files/7353289/kubernetes_1.22_conntrack_issue.log)

This is again from my environment using Calico, but as I said, the behavior is exactly the same with kindnet.


aojea commented Oct 17, 2021

I can't reproduce it :/

Install a UDP service with a server:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-deployment
  labels:
    app: MyApp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: MyApp
  template:
    metadata:
      labels:
        app: MyApp
    spec:
      containers:
      - name: agnhost
        image: k8s.gcr.io/e2e-test-images/agnhost:2.21
        args:
          - netexec
          - --http-port=80
          - --udp-port=80
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nodeport-service
spec:
  type: NodePort
  selector:
    app: MyApp
  ports:
    - protocol: UDP
      port: 80
      targetPort: 80

Get the new service IP

kubectl get service
NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
kubernetes         ClusterIP   10.96.0.1      <none>        443/TCP        3d21h
nodeport-service   NodePort    10.96.73.155   <none>        80:31409/UDP   13m

Run another pod polling the udp service constantly

kubectl run test -it --image k8s.gcr.io/e2e-test-images/agnhost:2.21 --command -- ash
while true; do echo "hostname" | nc -u -w 1 10.96.73.155 80 ; sleep 1;  done

In another terminal watch the conntrack entries inside the node

watch -c "conntrack -L | grep 10.96.73.155"

After doing kubectl delete --force on the pod behind the service, the conntrack entries are correct and the app keeps working.


pmarag commented Oct 18, 2021

@aojea Thank you again for the effort. We are working to isolate the issue and provide what is needed in order for you to simulate it. We will come back as soon as possible with an update.


vjhpe commented Oct 27, 2021

@aojea Sorry for coming back late on this. We were trying our best to isolate this problem and provide you a way to reproduce it.
We now have a way to consistently reproduce this problem without requiring all of our containers, and I think I know what triggers it. Our pod had an initContainer that just slept for 5 seconds. The wrong conntrack entry occurs only when this pause initContainer is present. As soon as I remove the initContainer, we do not see the issue anymore, no matter how many times I delete the pod (--force or gracefully).
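For reference, the pause initContainer is nothing more than a short sleep in the pod spec; a minimal sketch (image and values are placeholders, the real spec is in the yaml attached below):

  initContainers:
  - name: pause
    image: busybox:1.34
    command: ["sh", "-c", "sleep 5"]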

I have written 2 small Java applications, which you can deploy to simulate the problem in your k8s cluster. I have built container images for these 2 applications which you can pull into your environment -

  • options-responder: This listens for SIP OPTIONS over UDP (port 5060), and just responds for every incoming request.
  • options-sender: This sends SIP OPTIONS request every 1 sec to options-responder.

The images are available on docker hub -

docker pull vjhpe/options-responder:1.0.0
docker pull vjhpe/options-sender:1.0.0

I've attached 2 yaml files that will deploy these into k8s -
options-svc.zip

kubectl apply -f .\options-responder-pause.yaml
kubectl apply -f .\options-sender.yaml

> kubectl get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
options-responder-887f76c9d-77vl9   1/1     Running   0          34s   10.244.2.94   minikube-m03   <none>           <none>
options-sender-5977bcd74b-jz9kg     1/1     Running   0          6s    10.244.2.95   minikube-m03   <none>           <none>

> kubectl get svc -o wide
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE   SELECTOR
kubernetes          ClusterIP   10.96.0.1       <none>        443/TCP    16d   <none>
options-responder   ClusterIP   10.111.87.18    <none>        5060/UDP   37s   active=yes,app=options-uas
options-sender      ClusterIP   10.102.42.111   <none>        5060/UDP   9s    app=options-uac

> kubectl get nodes -o wide
NAME           STATUS   ROLES                  AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION                      CONTAINER-RUNTIME
minikube       Ready    control-plane,master   16d   v1.22.2   192.168.49.2   <none>        Ubuntu 20.04.2 LTS   5.10.16.3-microsoft-standard-WSL2   docker://20.10.8
minikube-m02   Ready    <none>                 25h   v1.22.2   192.168.49.3   <none>        Ubuntu 20.04.2 LTS   5.10.16.3-microsoft-standard-WSL2   docker://20.10.8
minikube-m03   Ready    <none>                 25h   v1.22.2   192.168.49.4   <none>        Ubuntu 20.04.2 LTS   5.10.16.3-microsoft-standard-WSL2   docker://20.10.8

We can check the logs from the options-sender pod while it is exchanging packets successfully -

> kubectl logs -f options-sender-5977bcd74b-jz9kg
log4j:WARN No appenders could be found for logger (options-sender).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Received OPTIONS response 200
Received OPTIONS response 200
Received OPTIONS response 200
Received OPTIONS response 200

Now we kill the options-responder pod -

> kubectl delete pod options-responder-887f76c9d-77vl9 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "options-responder-887f76c9d-77vl9" force deleted
PS C:\tutorials\options-responder\k8s> kubectl get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
options-responder-887f76c9d-59qc6   1/1     Running   0          9s    10.244.2.96   minikube-m03   <none>           <none>
options-sender-5977bcd74b-jz9kg     1/1     Running   0          83s   10.244.2.95   minikube-m03   <none>           <none>

The options-sender pod soon starts getting timeouts, and this continues -

Received OPTIONS response 200
Received OPTIONS response 200
Received timeout event - javax.sip.TimeoutEvent[source=gov.nist.javax.sip.SipProviderImpl@319b92f3]
Received timeout event - javax.sip.TimeoutEvent[source=gov.nist.javax.sip.SipProviderImpl@319b92f3]
Received timeout event - javax.sip.TimeoutEvent[source=gov.nist.javax.sip.SipProviderImpl@319b92f3]
Received timeout event - javax.sip.TimeoutEvent[source=gov.nist.javax.sip.SipProviderImpl@319b92f3]
Received timeout event - javax.sip.TimeoutEvent[source=gov.nist.javax.sip.SipProviderImpl@319b92f3]
Received timeout event - javax.sip.TimeoutEvent[source=gov.nist.javax.sip.SipProviderImpl@319b92f3]

I monitor the conntrack entries around the time I delete the pod -

$ while true; do sudo conntrack -L | grep 5060; sleep 1; done
tcp      6 431998 ESTABLISHED src=192.168.49.2 dst=192.168.49.4 sport=50600 dport=10250 src=192.168.49.4 dst=192.168.49.2 sport=10250 dport=50600 [ASSURED] mark=0 use=1
udp      17 119 src=10.244.2.95 dst=10.111.87.18 sport=5060 dport=5060 src=10.244.2.94 dst=10.244.2.95 sport=5060 dport=5060 [ASSURED] mark=0 use=1
conntrack v1.4.5 (conntrack-tools): 8 flow entries have been shown.
tcp      6 431997 ESTABLISHED src=192.168.49.2 dst=192.168.49.4 sport=50600 dport=10250 src=192.168.49.4 dst=192.168.49.2 sport=10250 dport=50600 [ASSURED] mark=0 use=1
conntrack v1.4.5 (conntrack-tools): 9 flow entries have been shown.
tcp      6 431996 ESTABLISHED src=192.168.49.2 dst=192.168.49.4 sport=50600 dport=10250 src=192.168.49.4 dst=192.168.49.2 sport=10250 dport=50600 [ASSURED] mark=0 use=1
udp      17 29 src=10.244.2.95 dst=10.111.87.18 sport=5060 dport=5060 [UNREPLIED] src=10.111.87.18 dst=192.168.49.4 sport=5060 dport=63760 mark=0 use=1
conntrack v1.4.5 (conntrack-tools): 9 flow entries have been shown.
tcp      6 431995 ESTABLISHED src=192.168.49.2 dst=192.168.49.4 sport=50600 dport=10250 src=192.168.49.4 dst=192.168.49.2 sport=10250 dport=50600 [ASSURED] mark=0 use=1
udp      17 29 src=10.244.2.95 dst=10.111.87.18 sport=5060 dport=5060 [UNREPLIED] src=10.111.87.18 dst=192.168.49.4 sport=5060 dport=63760 mark=0 use=1
conntrack v1.4.5 (conntrack-tools): 9 flow entries have been shown.

For the UNREPLIED UDP entry, we still see the wrong IP (the node IP) on the return path, and it is never cleared.

I think if you deploy this into your k8s cluster, you should be able to reproduce the problem, and also figure out why adding an initContainer with a small pause introduces a wrong conntrack entry that is never deleted!
Let me know if you face any issues deploying this.


thockin commented Oct 28, 2021

/assign @aojea


aojea commented Oct 28, 2021

Also as stated earlier, the exact same scenario works fine on kubernetes v1.21. This issue appears from v1.22.

I think if you deploy this into your k8s cluster, you should be able to reproduce this problem. And also figure out why adding an initContainer with a small pause introduces wrong iptable entry which is never deleted!
Let me know if you face any issues in deploying this.

Is the Kubernetes version still making any difference, or was that a red herring?


pmarag commented Oct 29, 2021

Is the Kubernetes version still making any difference, or was that a red herring?

Using the pods provided by my colleague, the initial statement is still valid. The scenario works fine on Kubernetes v1.21:

[root@control-plane ~]# kubectl get nodes -o wide
NAME                              STATUS   ROLES                  AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION          CONTAINER-RUNTIME
control-plane.minikube.internal   Ready    control-plane,master   16d   v1.21.5   10.45.253.76   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)   4.18.0-305.el8.x86_64   docker://20.10.6

[root@control-plane tmp]# kubectl get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE                              NOMINATED NODE   READINESS GATES
options-responder-887f76c9d-dfb2q   1/1     Running   0          70s   10.244.0.18   control-plane.minikube.internal   <none>           <none>
options-sender-5977bcd74b-2kgm7     1/1     Running   0          61s   10.244.0.19   control-plane.minikube.internal   <none>           <none>

[root@control-plane tmp]# kubectl get svc -o wide
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE   SELECTOR
kubernetes          ClusterIP   10.96.0.1       <none>        443/TCP    16d   <none>
options-responder   ClusterIP   10.96.49.152    <none>        5060/UDP   82s   active=yes,app=options-uas
options-sender      ClusterIP   10.109.214.39   <none>        5060/UDP   73s   app=options-uac

[root@control-plane tmp]# conntrack -L | grep 5060
udp      17 119 src=10.244.0.19 dst=10.96.49.152 sport=5060 dport=5060 src=10.244.0.18 dst=10.244.0.19 sport=5060 dport=5060 [ASSURED] mark=0 use=1

[root@control-plane tmp]# kubectl delete pod options-responder-887f76c9d-dfb2q --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "options-responder-887f76c9d-dfb2q" force deleted

[root@control-plane tmp]# kubectl get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
options-responder-887f76c9d-hjr88   1/1     Running   0          27s     10.244.0.20   control-plane.minikube.internal   <none>           <none>
options-sender-5977bcd74b-2kgm7     1/1     Running   0          3m39s   10.244.0.19   control-plane.minikube.internal   <none>           <none>

[root@control-plane tmp]# conntrack -L | grep 5060
udp      17 119 src=10.244.0.19 dst=10.96.49.152 sport=5060 dport=5060 src=10.244.0.20 dst=10.244.0.19 sport=5060 dport=5060 [ASSURED] mark=0 use=1


aojea commented Oct 30, 2021

/triage accepted
I can reproduce it; there is definitely something odd here.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 30, 2021

aojea commented Oct 30, 2021

Great catch, this is something worse than the UDP conntrack problem.


aojea commented Nov 8, 2021

And the cherry-pick to the 1.22 branch: #106239
