Stable 2605.6.0 shutdown order prevents closing of TCP connections #213
Comments
Hi, in summary I would say that there is no guarantee of getting a RST from a node that is shutting down, and that is why TCP keepalive timeouts exist. There is an open issue about using keepalive in kubectl, for example: kubernetes/kubernetes#94301
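As a client-side mitigation along the lines suggested here, TCP keepalive can be enabled per socket so a dead peer is detected within seconds instead of the kernel default of roughly two hours. A minimal sketch (the option names are Linux-specific; the chosen timeouts are illustrative, not from this thread):

```python
import socket

def enable_keepalive(sock: socket.socket,
                     idle: int = 10, interval: int = 5, count: int = 3) -> None:
    """Enable TCP keepalive on a socket (Linux option names).

    With these example values a dead peer is detected after roughly
    idle + interval * count = 25 seconds.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idle time before the first keepalive probe:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between subsequent probes:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Number of unanswered probes before the connection is reset:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
```

This is what the linked kubectl issue asks for on the Kubernetes client side; any long-lived client connection benefits the same way.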
Thanks @pothos, some answers to your questions:
We've been able to work around this issue on Flatcar 2605.6.0 by writing a very simple systemd unit that sends a SIGTERM to the API server process. The fact that this systemd unit changes the shutdown order by moving the API server's termination earlier suggests to me that the shutdown ordering in 2605.6.0 has changed.
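A sketch of what such a workaround unit could look like (the unit name, the ordering dependency, and the `pkill` match are assumptions; the reporter's actual unit was not included in this comment):

```ini
# /etc/systemd/system/kube-apiserver-shutdown.service (hypothetical name)
[Unit]
Description=Send SIGTERM to kube-apiserver before the network goes down
# Units are stopped in reverse order, so ordering this unit After=
# systemd-networkd means its ExecStop runs while the node still has
# its addresses and routes, letting the API server send its FIN.
After=systemd-networkd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# Matching the static-pod API server process by name is an assumption:
ExecStop=/usr/bin/pkill -TERM -f kube-apiserver

[Install]
WantedBy=multi-user.target
```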
Ah, thanks, so not a shutdown of the machine but of the service?
@pothos No, I am talking about a shutdown of the node/machine. The inclusion of
Ok, yes, the behavior of systemd could have changed in the new version. This may be a bug, a documented change, or a change of undefined behavior that the service relied on. To reproduce this and find a small example, it would help if you could provide the relevant service units. Then we can try to adjust the service units and/or file an upstream bug.
@pothos Thank you for your diagnosis of what could be the issue. Here are the relevant service units that we run on the nodes with kube-apiserver. It's a pretty simple setup; let me know if there is anything else I can provide. containerd.service
kubelet.service
node-init.service
The Kubernetes API server runs as a static pod, with the pod spec in the manifest folder that the kubelet reads. It runs on the host network.
I've tested Flatcar Stable 2605.7.0 and can confirm that this issue still occurs. @pothos Any update on your investigation? Are the details provided above sufficient, or is there more I can provide?
Thanks for the details, also about the custom containerd. I've been on vacation for two weeks. For the Kubernetes setup I'll probably use Lokomotive, as I guess the setup is comparable. What I verified now is that with a plain Flatcar Stable VM on a bridge
@pothos If it helps, I've come up with an even simpler method for reproducing/replicating this.
With the above, you will observe the lack of a FIN on shutdown. I repeated the same steps as above, but with Flatcar 2512.5.0, and observed a FIN being sent.

pod.json

```json
{
  "metadata": {
    "name": "pod-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "hdishd83djaidwnduwk28bcsb"
  },
  "logDirectory": "/tmp",
  "linux": {
    "security_context": {
      "namespace_options": {
        "network": 2
      }
    }
  }
}
```
container.json

```json
{
  "metadata": {
    "name": "sleep"
  },
  "image": {
    "image": "ubuntu:18.04"
  },
  "command": [
    "sleep",
    "86400"
  ],
  "log_path": "ubuntu.log",
  "linux": {}
}
```
Great, I'll have a look!
@pothos As stated above, I was able to reproduce this without a custom containerd. Additionally, I got the same result. Have you set the
The containerd service file is created by torcx under
@pothos I used the following containerd systemd unit:
I still get this error even with vanilla Docker. See the following Container Linux Config I used to set up Docker without using torcx (needs to be transpiled to JSON with the Container Linux Config Transpiler):
Can you provide the contents of
Ah, now I got it running; I also added the binaries from https://github.com/containerd/containerd/releases/tag/v1.2.14 to
As stated above, we are not using Docker at all in our stack. I'm not sure why you keep trying to reproduce this with Docker; I haven't mentioned the use of it. I'll mention it again: I can reproduce this without a custom config file. You can view the default containerd config on the node. We aren't using containerd 1.2.14 either; we are using the 1.3.7 included with Flatcar.
I'm not trying to reproduce it with Docker but the
Same problem with
Ok, so the difference is that you use upstream containerd 1.3.7 and not the one included in Flatcar, where the containerd-shim-runc-v1 binary was missing. I copied the upstream binaries to
When shutting down, I can see something like
@pothos Yes, I noticed this too, but what concerns me is the behaviour change from Flatcar 2512 to 2605. On 2512 we received the FIN on shutdown, which suggests something has changed between releases. As you noted
If you don't expect containerd to be restarted you can also comment out the
Yes, I'll try to get logs of the shutdown process to see when the network goes down, when SIGTERM is sent, and when SIGKILL is sent. Maybe the shim processes didn't react to SIGTERM?
By the way, the containerd-shim-runc-v1 processes do nothing on SIGTERM in my case:
This means the child process is also not terminated until SIGKILL. I don't know if systemd stops at the parent process or continues with the child processes, where SIGTERM should have an effect.
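The behaviour described here can be demonstrated with a small stand-in for the shim (a hypothetical sketch, not the actual shim code): a parent that ignores SIGTERM keeps running, and its child is only affected once it is signalled directly, which is what systemd-shutdown does.

```python
import os
import signal
import subprocess
import time

# A stand-in "shim": it ignores SIGTERM and keeps a child ("sleep") running.
shim = subprocess.Popen(
    ["python3", "-c",
     "import signal, subprocess, time\n"
     "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
     "child = subprocess.Popen(['sleep', '60'])\n"
     "print(child.pid, flush=True)\n"
     "time.sleep(60)\n"],
    stdout=subprocess.PIPE,
)
# The child PID is printed only after the handler is installed, so the
# SIGTERM below cannot race with the signal() call.
child_pid = int(shim.stdout.readline())

os.kill(shim.pid, signal.SIGTERM)  # no effect: the "shim" ignores SIGTERM
time.sleep(0.5)
shim_alive = shim.poll() is None   # True: the parent survived SIGTERM

# systemd-shutdown signals children as well, so SIGTERM does reach "sleep":
os.kill(child_pid, signal.SIGTERM)
```

The parent here only goes away on SIGKILL, matching the delay observed before poweroff.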
Looking at the source code, systemd-shutdown does not signal only the parents first but the children as well, so the above issue with the containerd-shim-runc binary is not causing the problem. Rather, it is the point at which systemd sends the SIGTERM, as you suspected. The shim process only dying at SIGKILL merely causes a delay before the system can power off.

Overall the system behaves as intended: the remaining processes are terminated with SIGTERM after the normal systemd units are stopped, including systemd-networkd. This makes sense, because otherwise systemd-shutdown would send the SIGTERM to all processes, including those that are normal systemd units and have a proper way defined to be terminated. I'm not sure what the desired behavior of stopping systemd-networkd should be, because both approaches have their pros and cons. In the meantime, if you want your process to be able to send the FIN, it needs to be stopped by a systemd unit, as you do with your workaround. Since there is no guarantee of getting the packet anyway, the TCP keepalive option should be used to detect when a node gets terminated.
This change here seems relevant: systemd/systemd@8006035 The previous behavior was to keep the IP addresses and routes configured when systemd-networkd stops.
Thanks for reporting this!
The default behavior in systemd-networkd was changed in v244 from keeping the IP addresses and routes on service stop to deconfiguring them: systemd/systemd@8006035 Deconfiguring means that on system shutdown the DHCP address is properly released, but it also has the side effect that orphaned processes not part of a systemd unit have no network connectivity when they get the broadcasted SIGTERM. Restore the previous behavior and hope that DHCP servers recognize the system again on reboot and hand out the same address, so that we don't have to rely on the address release (which, anyway, is not sent for a crashing system either). The default can of course be changed by the user. In the initramfs the KeepConfiguration=no behavior is desired, because otherwise the IP address is not released, which can cause problems when a different DHCP client configuration is set on first boot: the DHCP server would not be able to recognize the rootfs system and would keep two addresses allocated. Fixes flatcar/Flatcar#213
Thanks for looking into this @pothos! I've tested out the KeepConfiguration change in our clusters and it solves the shutdown issue for us. I really appreciate you taking the time to discover and test this!
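For users who want to apply this themselves before the fixed release, a sketch of a networkd drop-in (the file name and the [Match] pattern are assumptions about the node; KeepConfiguration= is the option the fix changes the default of):

```ini
# /etc/systemd/network/00-keep.network (hypothetical file name)
[Match]
Name=eth*

[Network]
DHCP=yes
# Keep addresses and routes when systemd-networkd stops, so processes
# terminated later in the shutdown sequence can still send their FIN.
KeepConfiguration=yes
```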
This will be part of the next bugfix release for the Stable 2605 branch. |
Description
When shutting down a node running Flatcar Stable 2605.6.0, we have noticed that the order in which processes are terminated prevents the closing of long-running TCP connections. On Flatcar Stable 2512.5.0, however, we do not see this issue: long-running TCP connections are closed, with a FIN and RST being sent to the client before the node shuts down.
Impact
This impacts our Kubernetes control plane nodes running Flatcar: when they are shut down, the Kubernetes API server running on them is unable to notify its clients that it has shut down and that they should establish a new connection to a different node. As a result, the clients continue using the broken connection for ~5-15 minutes until it is timed out/reset on the client side.
Environment and steps to reproduce
Expected behavior
Additional information
Please see example tcpdump packet captures from each version:
- 10.34.224.1: Kubernetes service IP of the Kubernetes API server
- 10.36.1.149: Client

2605.6.0
Node terminated, but with no FIN/RST; the client continuously attempts to ACK until the connection is RST. Re-established right after to a second control plane node.

2512.5.0
Connection is terminated with FIN/RST, re-established right after to a second control plane node.