only configure swap if swap is enabled #120784

Merged

elezar
Contributor

@elezar elezar commented Sep 20, 2023

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

This PR fixes the startup of Kubernetes on systems using cgroupv2 where swap is not enabled. The failure can be reproduced on such a system using kind with Kubernetes 1.28.0, 1.28.1, or 1.28.2.

Since the memory.swap.max value is set, this causes containerd and runc to try to write a value for the cgroup causing containers to fail when starting with messages such as:

Sep 20 09:48:54 k8s-dra-driver-cluster-control-plane kubelet[318]: E0920 09:48:54.918262     318 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with RunContainerError: \"failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set unified resource \\\"memory.swap.max\\\": open /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-poddfde4e625667960b6c2b494dd6946943.slice/cri-containerd-4085d63ae91fe66c6fd8ea3e7d2d20d3f7492ef51f11621e2dcbb7f47d345d0f.scope/memory.swap.max: no such file or directory: unknown\"" pod="kube-system/kube-apiserver-k8s-dra-driver-cluster-control-plane" podUID="dfde4e625667960b6c2b494dd6946943"

This is independent of whether NodeSwap is enabled or not.

Note that the issue seems to have been introduced in #118764. This fix extends the logic added in #119486 so that it also applies to cgroupv2 systems with swap disabled.

Special notes for your reviewer:

Assuming:

  • cgroupv2 is being used (/sys/fs/cgroup/cgroup.controllers is present)
  • swap is disabled (swapon -s shows nothing)
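
The two assumptions above can also be checked programmatically. This is a minimal sketch, assuming that reading /proc/swaps is an acceptable substitute for swapon -s; the helper names fileExists and swapDeviceCount are made up for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// swapDeviceCount counts the entries in /proc/swaps-formatted text,
// skipping the header line and any blank lines.
func swapDeviceCount(procSwaps string) int {
	count := 0
	first := true
	sc := bufio.NewScanner(strings.NewReader(procSwaps))
	for sc.Scan() {
		if first {
			// The first line is the column header.
			first = false
			continue
		}
		if strings.TrimSpace(sc.Text()) != "" {
			count++
		}
	}
	return count
}

func main() {
	cgroupV2 := fileExists("/sys/fs/cgroup/cgroup.controllers")
	data, err := os.ReadFile("/proc/swaps")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/swaps:", err)
		os.Exit(1)
	}
	swapOff := swapDeviceCount(string(data)) == 0
	fmt.Printf("cgroupv2=%v swapDisabled=%v affectedSetup=%v\n", cgroupV2, swapOff, cgroupV2 && swapOff)
}
```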

This should be reproducible in kind with the following:

kind create cluster --retain --image kindest/node:v1.28.0 --name test-cluster

which will create a single-node kind cluster using k8s v1.28.0. This will cause an error when starting the control-plane node.

Running:

kind export logs --name test-cluster

will ensure that the kubelet logs are available. These can be inspected using:

~$ grep memory.swap.max /tmp/769685610/test-cluster-control-plane/kubelet.log | tail -1
Sep 20 19:59:32 foo-control-plane kubelet[263]: E0920 19:59:32.930675     263 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with RunContainerError: \"failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set unified resource \\\"memory.swap.max\\\": open /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-podc576cd164244f9e5e46e146c0d642304.slice/cri-containerd-8729d25356d4538d93740e37da5106fbce905d354c50af77aae373ea9d68b010.scope/memory.swap.max: no such file or directory: unknown\"" pod="kube-system/etcd-foo-control-plane" podUID="c576cd164244f9e5e46e146c0d642304"

This particular example shows the etcd container failing to start.

Furthermore, we can confirm that this error is not present in an earlier k8s version:

$ kind create cluster --image kindest/node:v1.27.6 --name test-cluster
Creating cluster "test-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.27.6) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-test-cluster"
You can now use your cluster with:

Does this PR introduce a user-facing change?

Fixed a bug where containers would not start on cgroupv2 systems where swap is disabled.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 20, 2023
@k8s-ci-robot
Contributor

Hi @elezar. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 20, 2023
@elezar elezar mentioned this pull request Sep 20, 2023
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 20, 2023
@elezar
Contributor Author

elezar commented Sep 20, 2023

/cc @iholder101 @pacoxu

@klueska
Contributor

klueska commented Sep 20, 2023

/cc

@klueska
Contributor

klueska commented Sep 20, 2023

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 20, 2023
@klueska
Contributor

klueska commented Sep 20, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 20, 2023
@elezar elezar force-pushed the fix-startup-failure-on-non-swap branch 2 times, most recently from 6e6273c to 73107ae Compare September 21, 2023 21:03
@elezar
Contributor Author

elezar commented Sep 21, 2023

/retest

@elezar
Contributor Author

elezar commented Sep 22, 2023

/retest

@elezar
Contributor Author

elezar commented Sep 22, 2023

/retest

@iholder101
Contributor

Thank you @elezar!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 22, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: ee094ed71c873dc446ef4634e38341d6c210e2d5

@iholder101
Contributor

@mrunalp @klueska PTAL

@pacoxu pacoxu moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage Sep 25, 2023
Contributor

@klueska klueska left a comment

Minor comment otherwise LGTM

Comment on lines 111 to 126
if swapControllerAvailable() {
if swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo); utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
// https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
switch m.memorySwapBehavior {
case kubelettypes.LimitedSwap:
swapConfigurationHelper.ConfigureLimitedSwap(lcr, pod, container)
default:
swapConfigurationHelper.ConfigureUnlimitedSwap(lcr)
}
} else {
swapConfigurationHelper.ConfigureNoSwap(lcr)
}
} else {
swapConfigurationHelper.ConfigureNoSwap(lcr)
} else if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
klog.ErrorS(errors.New("no cgroup swap controller present"), "ignoring NodeSwap feature", "pod", klog.KObj(pod), "containerName", container.Name)
}
Contributor

@klueska klueska Sep 25, 2023

Is there some way to clean up these if/elses so they read more nicely? In particular, this is hard to look at:

if swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo); utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {

What about:

	if !swapControllerAvailable() && !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
		// Nothing to do
	}
	if !swapControllerAvailable() && utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
		klog.ErrorS(errors.New("no cgroup swap controller present"), "ignoring NodeSwap feature", "pod", klog.KObj(pod), "containerName", container.Name)
	}
	if swapControllerAvailable() && !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
		swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo)
		swapConfigurationHelper.ConfigureNoSwap(lcr)
	}
	if swapControllerAvailable() && utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
		// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
		// https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
		swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo)
		switch m.memorySwapBehavior {
		case kubelettypes.LimitedSwap:
			swapConfigurationHelper.ConfigureLimitedSwap(lcr, pod, container)
		default:
			swapConfigurationHelper.ConfigureUnlimitedSwap(lcr)
		}
	}

Contributor Author

I was wondering about the nesting too. What about adding an addSwapResources method that we call here instead?

func (m *kubeGenericRuntimeManager) addSwapResources(...) {
	if !swapControllerAvailable() && !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
		return
	}
	if !swapControllerAvailable() {
		klog.ErrorS(errors.New("no cgroup swap controller present"), "ignoring NodeSwap feature", "pod", klog.KObj(pod), "containerName", container.Name)
		return
	}
	swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo)
	if !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
		swapConfigurationHelper.ConfigureNoSwap(lcr)
		return
	}
	// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
	// https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
	switch m.memorySwapBehavior {
	case kubelettypes.LimitedSwap:
		swapConfigurationHelper.ConfigureLimitedSwap(lcr, pod, container)
	default:
		swapConfigurationHelper.ConfigureUnlimitedSwap(lcr)
	}
}
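
To compare the two proposals in this thread, the sketch below models each one as a pure function over three booleans and returns a label for the action taken. All names here (swapAction, decideFourWay, decideEarlyReturn) are hypothetical and exist only for this comparison; the real code calls helper methods and logs rather than returning labels.

```go
package main

import "fmt"

// swapAction labels what the swap configuration step would do.
type swapAction string

const (
	actionSkip      swapAction = "skip"
	actionSkipLog   swapAction = "skip-and-log"
	actionNoSwap    swapAction = "no-swap"
	actionLimited   swapAction = "limited-swap"
	actionUnlimited swapAction = "unlimited-swap"
)

// decideFourWay mirrors the four explicit (controller, gate) cases.
func decideFourWay(controller, gate, limited bool) swapAction {
	if !controller && !gate {
		return actionSkip
	}
	if !controller && gate {
		return actionSkipLog
	}
	if controller && !gate {
		return actionNoSwap
	}
	if limited {
		return actionLimited
	}
	return actionUnlimited
}

// decideEarlyReturn mirrors the quick-return variant, where later checks
// rely on earlier returns instead of repeating both conditions.
func decideEarlyReturn(controller, gate, limited bool) swapAction {
	if !controller && !gate {
		return actionSkip
	}
	if !controller {
		return actionSkipLog
	}
	if !gate {
		return actionNoSwap
	}
	if limited {
		return actionLimited
	}
	return actionUnlimited
}

func main() {
	bools := []bool{false, true}
	for _, controller := range bools {
		for _, gate := range bools {
			for _, limited := range bools {
				a := decideFourWay(controller, gate, limited)
				b := decideEarlyReturn(controller, gate, limited)
				fmt.Printf("controller=%v gate=%v limited=%v fourWay=%s earlyReturn=%s\n",
					controller, gate, limited, a, b)
			}
		}
	}
}
```

Under this model the two shapes agree on every input, so the choice between them is about readability rather than behaviour.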

Contributor

@klueska klueska Sep 25, 2023

definitely seems cleaner to me, though I'd still prefer including some variant of utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) on all if statements to make it clear

Contributor Author

@klueska I reworked this a bit with quick-returns where applicable. If you really want me to check the feature gate on all if statements, I can do so.

Contributor

@sftim sftim left a comment

Some feedback - hope it helps

// swap is only configured if a swap controller is available.
if swapControllerAvailable() {
if swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo); utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
Contributor

Suggested change
// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
// NOTE(ehashman): Behavior is defined in the opencontainers runtime spec:

Contributor Author

Done.

Contributor

(nit) In tests, please write “QoS” not “Qos”

Contributor Author

Updated.

} else {
swapConfigurationHelper.ConfigureNoSwap(lcr)
} else if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
klog.ErrorS(errors.New("no cgroup swap controller present"), "ignoring NodeSwap feature", "pod", klog.KObj(pod), "containerName", container.Name)
Contributor

Try to write the error logging as if the feature is already stable. The value of the feature gate doesn't seem very relevant here.

Contributor Author

Thanks @sftim. I elected to always log (at InfoS) that we're not configuring swap, and to include the SwapBehavior value as well. This should be enough to signal to users that it's being skipped without making the logic in the function overly complex. Hope this is sufficient from your side.

swapConfigurationHelper.ConfigureLimitedSwap(lcr, pod, container)
default:
swapConfigurationHelper.ConfigureUnlimitedSwap(lcr)
// swap is only configured if a swap controller is available.
Contributor

Suggested change
// swap is only configured if a swap controller is available.
// swap is only configured if a swap cgroup controller is available.

Contributor Author

Thanks @sftim. Should be addressed in latest revision. Note that I factored out this logic into a method, so the comment is now included as part of its docstring.

This change bypasses all logic to set swap in the linux container
resources if a swap controller is not available on node. Failing
to do so may cause errors in runc when starting a container with
a swap configuration -- even if this is set to 0.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
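
The commit message's point that even a swap value of 0 can trigger the failure can be illustrated with a small model. In this sketch, linuxResources is a hypothetical stand-in for the container resources passed to the runtime, and configureNoSwap is an assumed approximation of a "no swap" helper that sets the unified memory.swap.max key to 0. Neither is copied from the kubelet source.

```go
package main

import "fmt"

// linuxResources is a hypothetical, simplified stand-in for the Linux
// container resources handed to the runtime. Unified holds cgroup v2
// interface files keyed by file name.
type linuxResources struct {
	Unified map[string]string
}

// configureNoSwap requests a swap limit of 0 (assumed behaviour for this sketch).
func configureNoSwap(r *linuxResources) {
	if r.Unified == nil {
		r.Unified = map[string]string{}
	}
	r.Unified["memory.swap.max"] = "0"
}

// applySwapConfig leaves the resources untouched when no swap controller
// is available, so the runtime is never asked to write memory.swap.max.
func applySwapConfig(r *linuxResources, controllerAvailable bool) {
	if !controllerAvailable {
		return
	}
	configureNoSwap(r)
}

func main() {
	without := &linuxResources{}
	applySwapConfig(without, false)
	_, requested := without.Unified["memory.swap.max"]
	fmt.Println("controller absent, memory.swap.max requested:", requested)

	with := &linuxResources{}
	applySwapConfig(with, true)
	fmt.Println("controller present, memory.swap.max =", with.Unified["memory.swap.max"])
}
```

The distinction that matters is between "key set to 0" and "key not set at all": only the latter avoids asking the runtime to open a file that does not exist.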
@elezar elezar force-pushed the fix-startup-failure-on-non-swap branch from 73107ae to 394bcaf Compare September 26, 2023 19:37
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 26, 2023
@klueska
Contributor

klueska commented Sep 27, 2023

@elezar thanks for tracking down this bug and fixing it!
Once this change gets merged we should backport it for the next 1.28 patch release.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 27, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 55212c517565fed5dae6a3642f336905ac1c4b7d

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elezar, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 27, 2023
@k8s-ci-robot k8s-ci-robot merged commit 05f4099 into kubernetes:master Sep 27, 2023
13 of 14 checks passed
SIG Node PR Triage automation moved this from Needs Approver to Done Sep 27, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Sep 27, 2023
k8s-ci-robot added a commit that referenced this pull request Sep 29, 2023
…784-upstream-release-1.28

Automated cherry pick of #120784: Use local isCgroup2UnifiedMode consistently
@elezar elezar deleted the fix-startup-failure-on-non-swap branch October 13, 2023 10:43
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
8 participants