Stable 2605.6.0 shutdown order prevents closing of TCP connections #213

Closed
awprice opened this issue Oct 18, 2020 · 31 comments · Fixed by flatcar/init#30 or flatcar/coreos-cloudinit#10
Labels
kind/bug Something isn't working

Comments

@awprice commented Oct 18, 2020

Description

When shutting down a node running Flatcar Stable 2605.6.0, we have noticed that the order of termination of processes prevents the closing of long-running TCP connections. In Flatcar Stable 2512.5.0 however, we do not see this issue - long-running TCP connections are closed, with a FIN and RST being sent to the client prior to the node being shutdown.

Impact

This impacts our Kubernetes control plane nodes running Flatcar - when they are shut down, the Kubernetes API server running on the nodes is unable to notify its clients that it has shut down and that they should establish a new connection to a different node. As a result, the clients continue using the broken connection for ~5-15 minutes until it is timed out/reset on the client side.

Environment and steps to reproduce

  1. Set-up:
  • Start a node running 2605.6.0 on AWS.
  • Run either the Kubernetes API server or a web server as a static pod
  • Have a client on a second node (can be running any version of Flatcar Stable) connect to the Kubernetes API server or web server with a long-running TCP connection
  • Start a tcpdump process on the client node, observing traffic between server and client
  • Shut down the node running the server.
  • Observe the lack of FIN/RST packets being sent to the client when the node is shut down
  2. Task:
  • Node was running as a control plane node in the cluster, with kubelet and containerd configured as systemd units.
  • Kubernetes API server run as a static pod
  3. Action(s):
  • Shut down the node through the AWS console
  4. Error:
  • No error, but the lack of TCP FIN/RST packets being sent to the client on node shutdown

Expected behavior

  • I expect TCP FIN/RST packets to be sent to clients when the node is shut down. We see this correct behaviour with Stable 2512.5.0.

Additional information

Please see example tcpdump packet captures from each version:

10.34.224.1: Kubernetes service IP of Kubernetes API server
10.36.1.149: Client
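For reference, a capture along these lines can be taken on the client node with something like the following (the interface choice and exact filter are assumptions; 10.34.224.1 is the service IP listed above):

sudo tcpdump -i any -e -n 'tcp and host 10.34.224.1 and port 443'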

2605.6.0
Node terminated, but with no FIN/RST; the client continuously attempts to ACK until the connection is reset. It is re-established right after to a second control plane node.

07:17:18.888927 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 110: 10.34.224.1.443 > 10.36.1.149.57206: Flags [P.], seq 1083385:1083427, ack 38639, win 442, options [nop,nop,TS val 2565543848 ecr 3417911959], length 42
07:17:18.889018 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 140: 10.34.224.1.443 > 10.36.1.149.57206: Flags [P.], seq 1083427:1083499, ack 38639, win 442, options [nop,nop,TS val 2565543848 ecr 3417911959], length 72
07:17:18.889027 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 632: 10.34.224.1.443 > 10.36.1.149.57206: Flags [P.], seq 1083499:1084063, ack 38639, win 442, options [nop,nop,TS val 2565543848 ecr 3417911959], length 564
07:17:18.889115  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1084063, win 8855, options [nop,nop,TS val 3417911963 ecr 2565543848], length 0
07:17:29.889391 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 4202: 10.34.224.1.443 > 10.36.1.149.57206: Flags [P.], seq 1084063:1088197, ack 38639, win 442, options [nop,nop,TS val 2565554848 ecr 3417911963], length 4134
07:17:29.889623  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1088197, win 8855, options [nop,nop,TS val 3417922964 ecr 2565554848], length 0
07:17:29.908150 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 1275: 10.34.224.1.443 > 10.36.1.149.57206: Flags [P.], seq 1088197:1089404, ack 38639, win 442, options [nop,nop,TS val 2565554864 ecr 3417922964], length 1207
07:17:29.908497  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 110: 10.36.1.149.57206 > 10.34.224.1.443: Flags [P.], seq 38639:38681, ack 1089404, win 8855, options [nop,nop,TS val 3417922982 ecr 2565554864], length 42
07:17:29.949836 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 68: 10.34.224.1.443 > 10.36.1.149.57206: Flags [.], ack 38681, win 442, options [nop,nop,TS val 2565554909 ecr 3417922982], length 0
07:18:00.300264  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1089404, win 8855, options [nop,nop,TS val 3417953374 ecr 2565554909], length 0
07:18:30.508331  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1089404, win 8855, options [nop,nop,TS val 3417983582 ecr 2565554909], length 0
07:19:00.716327  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1089404, win 8855, options [nop,nop,TS val 3418013790 ecr 2565554909], length 0
07:19:30.924322  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1089404, win 8855, options [nop,nop,TS val 3418043998 ecr 2565554909], length 0
07:20:01.132290  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1089404, win 8855, options [nop,nop,TS val 3418074206 ecr 2565554909], length 0
07:20:31.340313  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1089404, win 8855, options [nop,nop,TS val 3418104414 ecr 2565554909], length 0
07:21:01.548278  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1089404, win 8855, options [nop,nop,TS val 3418134622 ecr 2565554909], length 0
07:21:31.756304  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1089404, win 8855, options [nop,nop,TS val 3418164830 ecr 2565554909], length 0
07:22:01.964276  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [.], ack 1089404, win 8855, options [nop,nop,TS val 3418195038 ecr 2565554909], length 0
07:22:32.172294  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.57206 > 10.34.224.1.443: Flags [R.], seq 38681, ack 1089404, win 8855, options [nop,nop,TS val 3418225246 ecr 2565554909], length 0
07:22:32.175523  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 76: 10.36.1.149.37530 > 10.34.224.1.443: Flags [S], seq 1721832127, win 64240, options [mss 1460,sackOK,TS val 3418225249 ecr 0,nop,wscale 7], length 0
07:22:32.175746 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 76: 10.34.224.1.443 > 10.36.1.149.37530: Flags [S.], seq 1137501786, ack 1721832128, win 62643, options [mss 8961,sackOK,TS val 1657888926 ecr 3418225249,nop,wscale 7], length 0
07:22:32.175794  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.37530 > 10.34.224.1.443: Flags [.], ack 1, win 502, options [nop,nop,TS val 3418225249 ecr 1657888926], length 0
07:22:32.176264  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 76: 10.36.1.149.37532 > 10.34.224.1.443: Flags [S], seq 1774965486, win 64240, options [mss 1460,sackOK,TS val 3418225250 ecr 0,nop,wscale 7], length 0
07:22:32.176544  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 277: 10.36.1.149.37530 > 10.34.224.1.443: Flags [P.], seq 1:210, ack 1, win 502, options [nop,nop,TS val 3418225250 ecr 1657888926], length 209
07:22:32.177052 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 76: 10.34.224.1.443 > 10.36.1.149.37532: Flags [S.], seq 979143818, ack 1774965487, win 62643, options [mss 8961,sackOK,TS val 1657888927 ecr 3418225250,nop,wscale 7], length 0
07:22:32.177056 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 68: 10.34.224.1.443 > 10.36.1.149.37530: Flags [.], ack 210, win 488, options [nop,nop,TS val 1657888927 ecr 3418225250], length 0

2512.5.0
Connection is terminated with FIN/RST, re-established right after to a second control plane node.

23:14:54.520062  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 110: 10.36.1.149.54850 > 10.34.224.1.443: Flags [P.], seq 3886:3928, ack 538196, win 2422, options [nop,nop,TS val 3388967651 ecr 1653593574], length 42
23:14:54.520691 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 68: 10.34.224.1.443 > 10.36.1.149.54850: Flags [.], ack 3928, win 481, options [nop,nop,TS val 1653593577 ecr 3388967651], length 0
23:14:56.064592 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 114: 10.34.224.1.443 > 10.36.1.149.54850: Flags [P.], seq 538196:538242, ack 3928, win 481, options [nop,nop,TS val 1653595121 ecr 3388967651], length 46
23:14:56.105251  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.54850 > 10.34.224.1.443: Flags [.], ack 538242, win 2422, options [nop,nop,TS val 3388969236 ecr 1653595121], length 0
23:14:56.113731 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 68: 10.34.224.1.443 > 10.36.1.149.54850: Flags [F.], seq 538242, ack 3928, win 481, options [nop,nop,TS val 1653595170 ecr 3388969236], length 0
23:14:56.113993  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 99: 10.36.1.149.54850 > 10.34.224.1.443: Flags [P.], seq 3928:3959, ack 538243, win 2422, options [nop,nop,TS val 3388969245 ecr 1653595170], length 31
23:14:56.114133  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.54850 > 10.34.224.1.443: Flags [F.], seq 3959, ack 538243, win 2422, options [nop,nop,TS val 3388969245 ecr 1653595170], length 0
23:14:56.114613 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 56: 10.34.224.1.443 > 10.36.1.149.54850: Flags [R], seq 3586342138, win 0, length 0
23:14:56.114768 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 56: 10.34.224.1.443 > 10.36.1.149.54850: Flags [R], seq 3586342138, win 0, length 0
23:14:56.115365  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 76: 10.36.1.149.42056 > 10.34.224.1.443: Flags [S], seq 3853743467, win 64240, options [mss 1460,sackOK,TS val 3388969246 ecr 0,nop,wscale 7], length 0
23:14:56.115454  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 76: 10.36.1.149.42058 > 10.34.224.1.443: Flags [S], seq 497973755, win 64240, options [mss 1460,sackOK,TS val 3388969246 ecr 0,nop,wscale 7], length 0
23:14:56.116108 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 56: 10.34.224.1.443 > 10.36.1.149.42058: Flags [R.], seq 0, ack 497973756, win 0, length 0
23:14:56.116181 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 76: 10.34.224.1.443 > 10.36.1.149.42056: Flags [S.], seq 1172611798, ack 3853743468, win 62643, options [mss 8961,sackOK,TS val 2068830134 ecr 3388969246,nop,wscale 7], length 0
23:14:56.116191  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 68: 10.36.1.149.42056 > 10.34.224.1.443: Flags [.], ack 1, win 502, options [nop,nop,TS val 3388969247 ecr 2068830134], length 0
23:14:56.116510  In 0a:4b:2d:e8:11:56 ethertype IPv4 (0x0800), length 277: 10.36.1.149.42056 > 10.34.224.1.443: Flags [P.], seq 1:210, ack 1, win 502, options [nop,nop,TS val 3388969247 ecr 2068830134], length 209
23:14:56.117160 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 68: 10.34.224.1.443 > 10.36.1.149.42056: Flags [.], ack 210, win 488, options [nop,nop,TS val 2068830135 ecr 3388969247], length 0
23:14:56.340916 Out ee:ee:ee:ee:ee:ee ethertype IPv4 (0x0800), length 1787: 10.34.224.1.443 > 10.36.1.149.42056: Flags [P.], seq 1:1720, ack 210, win 488, options [nop,nop,TS val 2068830359 ecr 3388969247], length 1719
@pothos (Member) commented Oct 19, 2020

Hi,
I did not try to reproduce your setup, but a quick test with ncat -l PORT shows that a node in general does send the FIN on shutdown. Maybe something is different in the way the API server is terminated, or it's the network setup, or the packet got lost somehow. In any case, the FIN is not the relevant part, because it only closes the writing side of the node's connection. The RST is the interesting part: it comes from the kernel when no open socket exists on that port to receive the packet. That can happen in the late stage of the shutdown, when the server is terminated but the kernel is still running - but only if the machine shuts down slowly enough; otherwise it will only happen once the system comes up again. It is the same node but just rebooted, right - no BGP load balancing? The clock time in the second log looks like there was no reboot…

In summary, I would say there is no guarantee of getting a RST from a node that is shutting down - that's why TCP keepalive timeouts exist. There is an issue about using keepalive in kubectl, for example: kubernetes/kubernetes#94301
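As a side note, a rough sketch of the kernel-side keepalive knobs on a Linux client (keepalive only applies to sockets that actually enable SO_KEEPALIVE, which is what the linked kubectl issue is about; the lowered value is only an illustration, not a recommendation):

# show the current defaults
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
# example: start probing after 60s idle instead of the default 7200s
sudo sysctl -w net.ipv4.tcp_keepalive_time=60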

@awprice (Author) commented Oct 19, 2020

Thanks @pothos, some answers to your questions:

  • This isn't a reboot of the node - it's a shutdown of an old Kubernetes API server node, with new nodes running the API server ready to accept requests. Kubernetes service IPs and endpoints are used to send traffic to the new nodes.
  • I don't think the packet is getting lost - I can reproduce this every single time the node is shut down on 2605.6.0, but it doesn't occur on 2512.5.0

We've been able to work around this issue on Flatcar 2605.6.0 by writing a very simple systemd unit that sends a SIGTERM to the kube-apiserver process on shutdown. When shutting down the node we see the expected behaviour - a FIN/RST being sent, the same as 2512.5.0.

The fact that this systemd unit changes the shutdown order by terminating the API server earlier suggests to me that the shutdown ordering changed in 2605.6.0.

[Unit]
Description=kube-apiserver graceful shutdown service
After=network.target
Before=kubelet.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=echo "Kube-apiserver shutdown started"
ExecStopPost=pkill -SIGTERM kube-apiserver

[Install]
WantedBy=multi-user.target
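If anyone wants to reuse this, a minimal sketch of installing it (the unit file name is a hypothetical choice; because the unit is ordered Before=kubelet.service, its ExecStopPost runs during shutdown only after kubelet has stopped, matching the intent described in the follow-up comment below):

# save the unit above as (hypothetical name):
#   /etc/systemd/system/kube-apiserver-shutdown.service
sudo systemctl daemon-reload
sudo systemctl enable --now kube-apiserver-shutdown.service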

@pothos (Member) commented Oct 19, 2020

Ah, thanks - so not a shutdown of the machine but of kubelet.service? Would that be systemctl stop kubelet.service or something else? Can you paste your systemctl cat kubelet.service definition?

@awprice (Author) commented Oct 19, 2020

@pothos No - I am talking about a shutdown of the node/machine. The inclusion of kubelet.service in the above systemd unit is to ensure that the kube-apiserver process is terminated after the kubelet is shut down, to prevent it from being started back up.

@pothos (Member) commented Oct 19, 2020

Ok, yes, the behavior of systemd could have changed in the new version. This may be a bug, a documented change, or a change of undefined behavior that the service relied on. To reproduce this and find a small example, it would be good if you could provide the relevant service units. Then we can try to adjust the service units and/or maybe file an upstream bug.

@awprice (Author) commented Oct 20, 2020

@pothos Thank you for your diagnosis on what could be the issue.

Here are the relevant service units that we run on the nodes with kube-apiserver. It's a pretty simple setup.

Let me know if there is anything else I can provide you with.

containerd.service

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
ExecStartPre=/sbin/modprobe overlay
ExecStartPre=/opt/bin/containerd-init.sh
ExecStart=/opt/containerd/bin/containerd --log-level=warn --config=/etc/containerd/config.toml

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

/opt/bin/containerd-init.sh is a script to download and install the containerd binaries.
This systemd unit is similar to the upstream one from the containerd project: https://github.com/containerd/containerd/blob/master/containerd.service

kubelet.service

[Unit]
Wants=node-init.service
After=node-init.service
[Service]
EnvironmentFile=/etc/kubernetes/kubelet/service_env
ExecStart=/opt/bin/kubelet_v1.17.9 \
--config=/etc/kubernetes/kubeconfig/kubelet.conf \
--cni-conf-dir=/etc/cni/net.d \
--network-plugin=cni \
--container-runtime=remote \
--container-runtime-endpoint="unix:///run/containerd/containerd.sock" \
--image-pull-progress-deadline=5m \
--register-node=true \
--cloud-provider=aws \
--kubeconfig=/etc/kubernetes/kubeconfig/kubelet-kubeconfig.yaml \
--node-labels=role=controller,customer=controller,node.kubernetes.io/controller=true,runtime=remote \
--vmodule="server=10,auth=10" \
--volume-plugin-dir="/var/lib/kubelet/kubelet-plugins/exec/" \
--node-ip=10.1.136.199 \
--register-with-taints=infrastructure=controller:NoSchedule 
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target

node-init.service

[Unit]
Description=Node Init
Requires=network-online.target
After=network-online.target
[Service]
TimeoutSec=10min
ExecStart=/opt/bin/node-init.sh
RemainAfterExit=yes
Type=oneshot
[Install]
WantedBy=multi-user.target

/opt/bin/node-init.sh is a script to download Kubelet and other binaries.

The Kubernetes API Server is run as a static pod, with the pod spec in the manifest folder that the Kubelet reads. It runs on host network.

@awprice (Author) commented Oct 30, 2020

I've tested Flatcar Stable 2605.7.0 and can confirm that this issue is still occurring.

@pothos Any update on your investigation into this? Are the details provided above sufficient, or is there more I can provide?

@dongsupark added the kind/bug label Nov 2, 2020
@pothos (Member) commented Nov 4, 2020

Thanks for the details, also about the custom containerd. I've been on vacation for two weeks. For the Kubernetes setup I'll probably use Lokomotive as I guess the node-init.sh script alone wouldn't help me much?

What I verified now is that on a plain Flatcar Stable VM on a bridge, running docker run -d --rm --network host alpine nc -i 65000 -l -p 8888 followed by a system shutdown does send a FIN, so that an ncat --recv-only 192.168.122.xy 8888 client terminates.
Next I'll see if the containerd unit you use makes any difference.
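On the client side, a quick way to watch for the FIN/RST in that test is something like the following (the interface name is an assumption):

sudo tcpdump -n -i eth0 'tcp port 8888 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'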

@awprice (Author) commented Nov 5, 2020

@pothos If it helps, I've come up with an even simpler method for reproducing this.

  1. Create an instance running Flatcar Stable 2605.7.0 on a cloud provider. In our case we started an m5.xlarge on AWS.
  2. Create and start containerd.service. You don't need a custom containerd config.toml for this; I started it without one.
  3. Using crictl (https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md), create a pod: crictl runp pod.json
  4. Pull the ubuntu:18.04 image: crictl pull ubuntu:18.04
  5. Create a container in the pod: crictl create <pod id> container.json pod.json
  6. Start the container: crictl start <container id>
  7. Exec into the container: crictl exec -it <container id> bash
  8. Install netcat in the container: apt update && apt -y install netcat
  9. Start the netcat server in the container: nc -i 65000 -l -p 8888 -v
  10. Start a second instance running any distro; I chose Ubuntu 18.04.
  11. Connect to the netcat server: ncat --recv-only 172.31.23.40 8888 -vv
  12. Observe a successful connection on both the client and server.
  13. Stop the Flatcar instance; in our case we stopped it in the AWS console.
With the above, you will observe the lack of a FIN being sent to the client on shutdown. (A consolidated command sequence for steps 3-9 is sketched after the JSON files below.)

I repeated the same steps as above, but with Flatcar 2512.5.0 and observed a FIN being sent to the client.

pod.json

{
  "metadata": {
    "name": "pod-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "hdishd83djaidwnduwk28bcsb"
  },
  "logDirectory": "/tmp",
  "linux": {
    "security_context": {
      "namespace_options": {
        "network": 2
      }
    }
  }
}

network: 2 runs the pod/sandbox on host network.

container.json

{
  "metadata": {
    "name": "sleep"
  },
  "image":{
    "image": "ubuntu:18.04"
  },
  "command": [
    "sleep",
    "86400"
  ],
  "log_path":"ubuntu.log",
  "linux": {
  }
}
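As referenced above, a consolidated sketch of steps 3-9, with the IDs captured into shell variables (same pod.json and container.json as above; crictl prints the created IDs to stdout):

POD_ID=$(crictl runp pod.json)
crictl pull ubuntu:18.04
CTR_ID=$(crictl create "$POD_ID" container.json pod.json)
crictl start "$CTR_ID"
crictl exec -it "$CTR_ID" bash
# inside the container:
#   apt update && apt -y install netcat
#   nc -i 65000 -l -p 8888 -v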

@pothos (Member) commented Nov 5, 2020

Great, I'll have a look!

@pothos (Member) commented Nov 6, 2020

crictl doesn't work with the containerd we ship by default even when the cri plugin is enabled. Can you provide your /etc/containerd/config.toml (if used) and /opt/bin/containerd-init.sh files to me so that I can reproduce your custom containerd setup?

@awprice (Author) commented Nov 6, 2020

@pothos As stated above, I was able to reproduce this without a custom /etc/containerd/config.toml - the file did not exist, and containerd falls back to its defaults.

Additionally, I got crictl to work with the containerd version Flatcar ships, with the CRI plugin enabled.

Have you set the CONTAINER_RUNTIME_ENDPOINT env variable for crictl?

# export CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/containerd/containerd.sock
# ./crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
# ./crictl pull ubuntu:18.04
Image is up to date for sha256:56def654ec22f857f480cdcc640c474e2f84d4be2e549a9d16eaba3f397596e9
# ./crictl images
IMAGE                      TAG                 IMAGE ID            SIZE
docker.io/library/ubuntu   18.04               56def654ec22f       26.7MB
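For completeness, the endpoint can also be set persistently in crictl's config file instead of via the environment variable (this is crictl's standard config file location):

cat <<'EOF' | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///var/run/containerd/containerd.sock
EOF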

@pothos (Member) commented Nov 6, 2020

The containerd service file is created by torcx under /run/systemd/system/containerd.service and references /run/torcx/unpack/docker/usr/share/containerd/config.toml, which sets /run/docker/libcontainerd/docker-containerd.sock as the socket address. That's what I used, but the runtime doesn't match:

sudo crictl -r unix:///var/run/docker/libcontainerd/docker-containerd.sock runp pod.json
FATA[0000] run pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: runtime "io.containerd.runc.v1" binary not installed "containerd-shim-runc-v1": file does not exist: unknown 

@awprice (Author) commented Nov 6, 2020

@pothos I used the following containerd systemd unit:

# cat /etc/systemd/system/containerd.service 
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
ExecStartPre=/sbin/modprobe overlay
ExecStart=/usr/bin/containerd --log-level=warn --config=/etc/containerd/config.toml

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

@pothos (Member) commented Nov 6, 2020

I still get this error even with a vanilla Docker. See the following Container Linux Config I used to set up Docker without using torcx:

(Needs to be transpiled to JSON with ct.)

ignition-docker.yaml

@pothos (Member) commented Nov 6, 2020

Can you provide the contents of /etc/containerd/config.toml, please?

@pothos (Member) commented Nov 6, 2020

Ah, now I got it running, I also added the binaries from https://github.com/containerd/containerd/releases/tag/v1.2.14 to /opt/docker/

@awprice (Author) commented Nov 6, 2020

As stated above, we are not using Docker at all in our stack. I'm not sure why you keep trying to reproduce this with Docker - I haven't mentioned the use of it.

I'll mention it again - I can reproduce this without a file in /etc/containerd/config.toml. We are using the default containerd config.

You can view this default containerd config with containerd config default.

We aren't using containerd 1.2.14 either - we are using 1.3.7 included with Flatcar.
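For reference, a sketch of inspecting those defaults and optionally writing them out to the path containerd reads (writing the file is not required for the reproduction above):

containerd config default
containerd config default | sudo tee /etc/containerd/config.toml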

@pothos (Member) commented Nov 6, 2020

I'm not trying to reproduce it with Docker but with crictl, as you wrote - maybe I use a different version (1.19.0), which results in the error binary not installed "containerd-shim-runc-v1"? Since this binary is not shipped with Flatcar, I found out that it is part of the upstream containerd release I mentioned.

@pothos (Member) commented Nov 6, 2020

Same problem with crictl version v1.17.0:

$ sudo CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/containerd/containerd.sock crictl  runp pod.json
FATA[0000] run pod sandbox failed: rpc error: code = Unknown desc = failed to create containerd task: runtime "io.containerd.runc.v1" binary not installed "containerd-shim-runc-v1": file does not exist: unknown

@pothos (Member) commented Nov 6, 2020

Ok, so the difference is that you use upstream containerd 1.3.7 and not the one included in Flatcar, where the containerd-shim-runc-v1 binary was missing. I copied the upstream binaries to /run/torcx/bin and can now run it successfully.

@pothos (Member) commented Nov 6, 2020

When shutting down, I can see something like systemd-shutdown: Waiting for containerd-shim, containerd-shim, because KillMode=process means the container processes are not terminated when the systemd service is stopped. They are left behind, and nothing kills them on shutdown until systemd has waited long enough to kill them - at which point there is no guarantee for the FIN to be sent, because the network is most likely already down. This is the normal behavior, and it's not up to systemd to find out what should actually terminate the container processes in this case. It seems KillMode=process was chosen to leave containers running while the containerd service is merely restarted. A workaround for a specific process like yours makes sense, but in general there is no way to find the processes that should be terminated, and I think this is something to report to upstream containerd.

@awprice (Author) commented Nov 6, 2020

@pothos Yes - I noticed this too, but what concerns me is the behaviour change from Flatcar 2512 to 2605. On 2512 we received the FIN on shutdown - this suggests something has changed between releases.

As you noted, KillMode=process is used to prevent the containers from being terminated when the containerd process is restarted. I would prefer to keep this behaviour, as I see it as beneficial.

@pothos (Member) commented Nov 6, 2020

If you don't expect containerd to be restarted, you can also comment out the KillMode=process line and the containers will be terminated on shutdown. It then depends on Kubernetes being able to restart the pods when the containerd service restarts, which could happen if it crashes or gets restarted via something like Ansible.
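A minimal sketch of that change as a systemd drop-in rather than editing the unit file directly (the drop-in path is an assumption based on the containerd.service name above; control-group is systemd's default KillMode, so this effectively undoes KillMode=process):

# /etc/systemd/system/containerd.service.d/10-killmode.conf  (hypothetical path)
[Service]
KillMode=control-group

# then:
#   sudo systemctl daemon-reload
#   sudo systemctl restart containerd.service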

@pothos (Member) commented Nov 6, 2020

Yes, I'll try to get logs for the shutdown process to see when the network goes down, when SIGTERM is sent, and when SIGKILL is sent. Maybe the shim processes didn't react to SIGTERM?
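One way to pull the unit stop ordering out afterwards, assuming a persistent journal (very late systemd-shutdown messages may only reach the console/serial log rather than the journal):

# after the node comes back up, look at the previous boot
journalctl -b -1 -o short-monotonic -u systemd-networkd -u containerd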

@pothos (Member) commented Nov 6, 2020

By the way, the containerd-shim-runc-v1 processes do nothing on SIGTERM in my case:

$ sudo pkill -f -15 containerd-shim-runc-v1
$ ps -ef | grep containerd-shim-runc-v1
root        1487       1  0 09:49 ?        00:00:00 /run/torcx/bin/containerd-shim-runc-v1 -namespace k8s.io -id 8ab325e4ef8b41721ff272437e0f0363d203a801e10078848c683687aaca31f4 -address /run/containerd/containerd.sock
root        1542       1  0 09:49 ?        00:00:00 /run/torcx/bin/containerd-shim-runc-v1 -namespace k8s.io -id 05c453318404b76883ff81fce60258dd7e6e052c3285e2d9f35bb4cdf479bb01 -address /run/containerd/containerd.sock

This means the child process is also not terminated until sudo pkill -f -9 containerd-shim-runc-v1.

I don't know if systemd stops at the parent process or continues with the child processes where SIGTERM should have an effect.
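One way to check whether the workload processes hang off the shim (the PID is taken from the listing above):

ps -o pid,ppid,comm --ppid 1487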

@pothos (Member) commented Nov 9, 2020

Looking at the source code, systemd-shutdown does not signal only the parents first but the children as well, so the issue above with the containerd-shim-runc binary is not causing the problem - as you suspected, it is the point at which systemd sends the SIGTERM. The shim process only dying on SIGKILL merely delays the poweroff.

Overall the system behaves as intended: the remaining processes are terminated with SIGTERM after the normal systemd units are stopped, including systemd-networkd. This makes sense, because otherwise systemd-shutdown would be sending SIGTERM to processes that still belong to normal systemd units and have a proper way of being terminated.
However, in 2512.5.0 with systemd 243 there is a different behavior when stopping systemd-networkd: the interface stays up, while on 2605.6.0 with systemd 245 the interface is brought down when systemd-networkd is stopped. I didn't find anything related to this in the systemd NEWS file, though.

I'm not sure what the desired behavior on systemd-networkd stop should be, because both approaches have their pros and cons. In the meantime, if you want your process to be able to send the FIN, it needs to be stopped by a systemd unit, like you do with your workaround. Since there is no guarantee of getting the packet anyway, the TCP keepalive option should be used to detect when a node gets terminated.

@pothos (Member) commented Nov 9, 2020

This change here seems relevant: systemd/systemd@8006035

The previous behavior can be restored with:

[Network]
KeepConfiguration=dhcp-on-stop

@pothos (Member) commented Nov 9, 2020

Thanks for reporting this!
I will file a PR to re-enable this behavior for the rootfs networkd, but not for the initramfs networkd. The transition from the initramfs should deconfigure the interface to return the address lease, and also to prevent the interface from looking manually configured when the rootfs networkd sees it.

pothos added a commit to flatcar/init that referenced this issue Nov 9, 2020
The default behavior in systemd-networkd was changed in v244 from
keeping the IP addresses and routes on service stop to deconfiguring
them:
systemd/systemd@8006035
Deconfiguring means that on system shutdown the DHCP address is
properly released but also has the side effect that orphaned processes
not part of a systemd unit don't have network connectivity when they
get the broadcasted SIGTERM.
Restore the previous behavior and hope that DHCP servers recognize
the system again on reboot and hand out the same address, so that we
don't have to rely on the address release (which, anyway, is not sent
for a crashing system either). The default can of course be changed
by the user.
In the initramfs the KeepConfiguration=no behavior is desired because
otherwise the IP address is not released, which can cause problems when
a different DHCP client configuration is set on first boot and the
DHCP server, unable to recognize the rootfs system, would keep two
addresses allocated.

Fixes flatcar/Flatcar#213
pothos added a commit to flatcar/coreos-cloudinit that referenced this issue Nov 9, 2020
The default behavior in systemd-networkd was changed in v244 from
keeping the IP addresses and routes on service stop to deconfiguring
them:
systemd/systemd@8006035
Deconfiguring means that on system shutdown the DHCP address is
properly released but also has the side effect that orphaned processes
not part of a systemd unit don't have network connectivity when they
get the broadcasted SIGTERM.
Restore the previous behavior and hope that DHCP servers recognize
the system again on reboot and hand out the same address, so that we
don't have to rely on the address release (which, anyway, is not sent
for a crashing system either).

Fixes flatcar/Flatcar#213
@awprice (Author) commented Nov 12, 2020

Thanks for looking into this @pothos! I've tested out

[Network]
KeepConfiguration=dhcp-on-stop

in our clusters and it solves the shutdown issue for us. I really appreciate you taking the time to discover/test this!
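For anyone applying the same setting before a fixed release lands, a minimal sketch of a networkd unit carrying it (the file name and interface match are assumptions - adapt them to whatever .network file already matches your uplink):

# /etc/systemd/network/10-dhcp-keep.network  (hypothetical path)
[Match]
Name=eth0

[Network]
DHCP=yes
KeepConfiguration=dhcp-on-stop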

@pothos (Member) commented Nov 16, 2020

This will be part of the next bugfix release for the Stable 2605 branch.
