only configure swap if swap is enabled #120784

elezar · 2023-09-20T18:51:24Z

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

This PR fixes the startup of Kubernetes on systems using cgroupv2 where swap is not enabled. This can be demonstrated on such a system using kind and any of the Kubernetes versions 1.28.0, 1.28.1, or 1.28.2.

Because a memory.swap.max value is set, containerd and runc try to write it to the container's cgroup, where that file does not exist, and containers fail to start with messages such as:

Sep 20 09:48:54 k8s-dra-driver-cluster-control-plane kubelet[318]: E0920 09:48:54.918262     318 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with RunContainerError: \"failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set unified resource \\\"memory.swap.max\\\": open /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-poddfde4e625667960b6c2b494dd6946943.slice/cri-containerd-4085d63ae91fe66c6fd8ea3e7d2d20d3f7492ef51f11621e2dcbb7f47d345d0f.scope/memory.swap.max: no such file or directory: unknown\"" pod="kube-system/kube-apiserver-k8s-dra-driver-cluster-control-plane" podUID="dfde4e625667960b6c2b494dd6946943"

This is independent of whether NodeSwap is enabled or not.

Note that the issue seems to have been introduced in #118764. This fix extends the logic added in #119486 so that it also applies to cgroupv2 systems with swap disabled.

Special notes for your reviewer:

Assuming:

cgroupv2 is being used (/sys/fs/cgroup/cgroup.controllers is present)
swap is disabled (swapon -s shows nothing)
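
The two assumptions above can also be checked programmatically. The following is a rough sketch, assuming that /proc/swaps starts with a header line followed by one line per active swap device; the function name activeSwapDevices is made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// activeSwapDevices counts the device entries in /proc/swaps-style content.
// The first line is treated as the column header and skipped.
// Hypothetical helper, written only for illustration.
func activeSwapDevices(procSwaps string) int {
	lines := strings.Split(strings.TrimSpace(procSwaps), "\n")
	count := 0
	for i, line := range lines {
		if i == 0 {
			continue // header line (or the single empty string for empty input)
		}
		if strings.TrimSpace(line) != "" {
			count++
		}
	}
	return count
}

func main() {
	// cgroup v2 check: the same file named in the assumption list.
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	fmt.Println("cgroup.controllers present:", err == nil)

	data, err := os.ReadFile("/proc/swaps")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/swaps:", err)
		os.Exit(1)
	}
	fmt.Println("active swap devices:", activeSwapDevices(string(data)))
}
```

A host matching the assumptions would report that cgroup.controllers is present and that there are zero active swap devices.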

This should be reproducible in kind with the following:

kind create cluster --retain --image kindest/node:v1.28.0 --name test-cluster

which will create a single-node kind cluster using k8s v1.28.0. This will cause an error when starting the control-plane node.

Running:

kind export logs --name test-cluster

will ensure that the kubelet logs are available. These can be inspected using:

~$ grep memory.swap.max /tmp/769685610/test-cluster-control-plane/kubelet.log | tail -1
Sep 20 19:59:32 foo-control-plane kubelet[263]: E0920 19:59:32.930675     263 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with RunContainerError: \"failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set unified resource \\\"memory.swap.max\\\": open /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-podc576cd164244f9e5e46e146c0d642304.slice/cri-containerd-8729d25356d4538d93740e37da5106fbce905d354c50af77aae373ea9d68b010.scope/memory.swap.max: no such file or directory: unknown\"" pod="kube-system/etcd-foo-control-plane" podUID="c576cd164244f9e5e46e146c0d642304"

Where this particular example shows the etcd container failing to start.

Furthermore, we can confirm that this error is not present in an earlier k8s version:

$ kind create cluster --image kindest/node:v1.27.6 --name test-cluster
Creating cluster "test-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.27.6) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-test-cluster"
You can now use your cluster with:

Does this PR introduce a user-facing change?

Fixed a bug where containers would not start on cgroupv2 systems where swap is disabled.

k8s-ci-robot · 2023-09-20T18:51:33Z

Hi @elezar. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

elezar · 2023-09-20T18:53:11Z

/cc @iholder101 @pacoxu

klueska · 2023-09-20T19:43:57Z

/cc

klueska · 2023-09-20T19:45:33Z

/triage accepted
/priority important-soon

klueska · 2023-09-20T19:48:25Z

/ok-to-test

elezar · 2023-09-21T21:07:10Z

/retest

elezar · 2023-09-22T10:22:24Z

/retest

elezar · 2023-09-22T11:38:18Z

/retest

iholder101 · 2023-09-22T15:32:55Z

Thank you @elezar!
/lgtm

k8s-ci-robot · 2023-09-22T15:33:04Z

LGTM label has been added.

Git tree hash: ee094ed71c873dc446ef4634e38341d6c210e2d5

iholder101 · 2023-09-22T15:34:50Z

@mrunalp @klueska PTAL

klueska

Minor comment otherwise LGTM

klueska · 2023-09-25T12:00:10Z

pkg/kubelet/kuberuntime/kuberuntime_container_linux.go

+	if swapControllerAvailable() {
+		if swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo); utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
+			// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
+			// https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
+			switch m.memorySwapBehavior {
+			case kubelettypes.LimitedSwap:
+				swapConfigurationHelper.ConfigureLimitedSwap(lcr, pod, container)
+			default:
+				swapConfigurationHelper.ConfigureUnlimitedSwap(lcr)
+			}
+		} else {
+			swapConfigurationHelper.ConfigureNoSwap(lcr)
 		}
-	} else {
-		swapConfigurationHelper.ConfigureNoSwap(lcr)
+	} else if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
+		klog.ErrorS(errors.New("no cgroup swap controller present"), "ignoring NodeSwap feature", "pod", klog.KObj(pod), "containerName", container.Name)
 	}


Is there some way to clean up these if/elses to read more nicely? In particular, this is hard to look at:

if swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo); utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {

What about:

if !swapControllerAvailable() && !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
	// Nothing to do
}
if !swapControllerAvailable() && utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
	klog.ErrorS(errors.New("no cgroup swap controller present"), "ignoring NodeSwap feature", "pod", klog.KObj(pod), "containerName", container.Name)
}
if swapControllerAvailable() && !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
	swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo)
	swapConfigurationHelper.ConfigureNoSwap(lcr)
}
if swapControllerAvailable() && utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
	// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
	// https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
	swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo)
	switch m.memorySwapBehavior {
	case kubelettypes.LimitedSwap:
		swapConfigurationHelper.ConfigureLimitedSwap(lcr, pod, container)
	default:
		swapConfigurationHelper.ConfigureUnlimitedSwap(lcr)
	}
}

I was wondering about the nesting too. What about adding an addSwapResources method that we call here instead?

func (m *kubeGenericRuntimeManager) addSwapResources(...) {
	if !swapControllerAvailable() && !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
		return
	}
	if !swapControllerAvailable() {
		klog.ErrorS(errors.New("no cgroup swap controller present"), "ignoring NodeSwap feature", "pod", klog.KObj(pod), "containerName", container.Name)
		return
	}
	swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo)
	if !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
		swapConfigurationHelper.ConfigureNoSwap(lcr)
		return
	}
	// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
	// https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
	switch m.memorySwapBehavior {
	case kubelettypes.LimitedSwap:
		swapConfigurationHelper.ConfigureLimitedSwap(lcr, pod, container)
	default:
		swapConfigurationHelper.ConfigureUnlimitedSwap(lcr)
	}
}

definitely seems cleaner to me, though I'd still prefer including some variant of utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) on all if statements to make it clear

@klueska I reworked this a bit with quick-returns where applicable. If you really want me to check the feature gate on all if statements, I can do so.
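
For readers following the thread, the quick-return shape under discussion can be reduced to a small pure function. This is only a sketch of the decision order debated in this review: the names decideSwapAction and swapAction, and the plain string used for the swap behavior, are stand-ins invented for this example and are not the code that was merged.

```go
package main

import "fmt"

// swapAction names the outcome of the decision; hypothetical type.
type swapAction string

const (
	actionSkip      swapAction = "skip"      // no swap controller: leave swap untouched
	actionNoSwap    swapAction = "no-swap"   // controller present, NodeSwap off
	actionLimited   swapAction = "limited"   // controller present, NodeSwap on, limited
	actionUnlimited swapAction = "unlimited" // controller present, NodeSwap on, default
)

// decideSwapAction mirrors the early-return ordering from the review:
// first the controller check, then the feature gate, then the behavior.
func decideSwapAction(controllerAvailable, nodeSwapEnabled bool, behavior string) swapAction {
	if !controllerAvailable {
		return actionSkip
	}
	if !nodeSwapEnabled {
		return actionNoSwap
	}
	if behavior == "LimitedSwap" { // illustrative stand-in for kubelettypes.LimitedSwap
		return actionLimited
	}
	return actionUnlimited
}

func main() {
	for _, controller := range []bool{false, true} {
		for _, gate := range []bool{false, true} {
			fmt.Printf("controller=%t gate=%t -> %s\n",
				controller, gate, decideSwapAction(controller, gate, "LimitedSwap"))
		}
	}
}
```

Keeping the controller check first means the feature gate never has to be consulted when there is nothing to configure, which is the flattening both reviewers were asking for.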

sftim

Some feedback - hope it helps

sftim · 2023-09-26T15:37:24Z

pkg/kubelet/kuberuntime/kuberuntime_container_linux.go

+	// swap is only configured if a swap controller is available.
+	if swapControllerAvailable() {
+		if swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo); utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
+			// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:


Suggested change

// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:

// NOTE(ehashman): Behavior is defined in the opencontainers runtime spec:

sftim · 2023-09-26T15:38:01Z

pkg/kubelet/kuberuntime/kuberuntime_container_linux_test.go

(nit) In tests, please write “QoS” not “Qos”

sftim · 2023-09-26T15:39:29Z

pkg/kubelet/kuberuntime/kuberuntime_container_linux.go

-	} else {
-		swapConfigurationHelper.ConfigureNoSwap(lcr)
+	} else if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
+		klog.ErrorS(errors.New("no cgroup swap controller present"), "ignoring NodeSwap feature", "pod", klog.KObj(pod), "containerName", container.Name)


Try to write the error logging as if the feature is already stable. The value of the feature gate doesn't seem very relevant here.

Thanks @sftim. I elected to always log (at InfoS) that we're not configuring swap, and to include the SwapBehavior value as well. This should be enough to give users a signal that it's being skipped without making the logic in the function overly complex. Hope this is sufficient from your side.

sftim · 2023-09-26T15:40:25Z

pkg/kubelet/kuberuntime/kuberuntime_container_linux.go

-			swapConfigurationHelper.ConfigureLimitedSwap(lcr, pod, container)
-		default:
-			swapConfigurationHelper.ConfigureUnlimitedSwap(lcr)
+	// swap is only configured if a swap controller is available.


Suggested change

// swap is only configured if a swap controller is available.

// swap is only configured if a swap cgroup controller is available.

Thanks @sftim. Should be addressed in latest revision. Note that I factored out this logic into a method, so the comment is now included as part of its docstring.

This change bypasses all logic to set swap in the linux container resources if a swap controller is not available on node. Failing to do so may cause errors in runc when starting a container with a swap configuration -- even if this is set to 0.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

klueska · 2023-09-27T11:34:05Z

@elezar thanks for tracking down this bug and fixing it!
Once this change gets merged we should backport it for the next 1.28 patch release.

/lgtm
/approve

k8s-ci-robot · 2023-09-27T11:34:13Z

LGTM label has been added.

Git tree hash: 55212c517565fed5dae6a3642f336905ac1c4b7d

k8s-ci-robot · 2023-09-27T11:34:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elezar, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [klueska]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 20, 2023

elezar mentioned this pull request Sep 20, 2023

Fix swap #120783

Closed

k8s-ci-robot requested review from matthyx and yujuhong September 20, 2023 18:52

k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 20, 2023

k8s-ci-robot requested a review from klueska September 20, 2023 19:43

elezar force-pushed the fix-startup-failure-on-non-swap branch 2 times, most recently from 6e6273c to 73107ae Compare September 21, 2023 21:03

elezar requested a review from iholder101 September 22, 2023 12:19

k8s-ci-robot assigned iholder101 Sep 22, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 22, 2023

pacoxu moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage Sep 25, 2023

elezar mentioned this pull request Sep 25, 2023

do not touch swap for cgroup v1 if not available #119486

Merged

klueska reviewed Sep 25, 2023

sftim reviewed Sep 26, 2023

elezar force-pushed the fix-startup-failure-on-non-swap branch from 73107ae to 394bcaf Compare September 26, 2023 19:37

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 26, 2023

k8s-ci-robot assigned klueska Sep 27, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 27, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 27, 2023

k8s-ci-robot merged commit 05f4099 into kubernetes:master Sep 27, 2023
13 of 14 checks passed

SIG Node PR Triage automation moved this from Needs Approver to Done Sep 27, 2023

k8s-ci-robot added this to the v1.29 milestone Sep 27, 2023

klueska mentioned this pull request Sep 28, 2023

Automated cherry pick of #120784: Use local isCgroup2UnifiedMode consistently #120924

Merged

k8s-ci-robot added a commit that referenced this pull request Sep 29, 2023

Merge pull request #120924 from klueska/automated-cherry-pick-of-#120…

5be61dd

…784-upstream-release-1.28 Automated cherry pick of #120784: Use local isCgroup2UnifiedMode consistently

elezar deleted the fix-startup-failure-on-non-swap branch October 13, 2023 10:43
