
unable to set memory limit to 20971520 (current usage: 21401600, peak usage: 21536768): unknown #3986

Open
113xiaoji opened this issue Aug 18, 2023 · 17 comments


@113xiaoji

113xiaoji commented Aug 18, 2023

Description

With the logic from #3931, bindfd was dropped in favor of memfd. The pod has two containers: a main container and a sidecar. The sidecar container's memory request is set to 10MB and its limit to 20MB. When I delete the pod and wait for it to be rebuilt, I get the following error:

Steps to reproduce the issue

1. Create a container with a memory limit set to 20MB.
2. Start it using the memfd method.
3. Check the value in memory.usage_in_bytes.

Alternatively, when used with Kubernetes:
The pod has two containers: a primary container and a sidecar. The sidecar container's memory request is set to 10MB and its limit to 20MB. When I delete the pod and wait for it to be rebuilt, the error below occurs.
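
To double-check step 3, here is a minimal sketch (not runc code; the cgroup path is a placeholder you would replace with the real pod/container directory) that reads the cgroup v1 memory controller files:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readCgroupFile returns the trimmed contents of a single cgroup control file.
func readCgroupFile(dir, name string) (string, error) {
	data, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	// Placeholder path; substitute the container's memory cgroup directory.
	dir := "/sys/fs/cgroup/memory/kubepods/<pod>/<container>"
	for _, name := range []string{
		"memory.limit_in_bytes",
		"memory.usage_in_bytes",
		"memory.max_usage_in_bytes",
	} {
		val, err := readCgroupFile(dir, name)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s = %s\n", name, val)
	}
}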

Describe the results you received and expected

Error Log:

    Message:      failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set memory limit to 20971520 (current usage: 21401600, peak usage: 21536768): unknown

The runc code that produces this error (setMemory):

func setMemory(path string, val int64) error {
	if val == 0 {
		return nil
	}

	err := cgroups.WriteFile(path, cgroupMemoryLimit, strconv.FormatInt(val, 10))
	if !errors.Is(err, unix.EBUSY) {
		return err
	}

	// EBUSY means the kernel can't set new limit as it's too low
	// (lower than the current usage). Return more specific error.
	usage, err := fscommon.GetCgroupParamUint(path, cgroupMemoryUsage)
	if err != nil {
		return err
	}
	max, err := fscommon.GetCgroupParamUint(path, cgroupMemoryMaxUsage)
	if err != nil {
		return err
	}

	return fmt.Errorf("unable to set memory limit to %d (current usage: %d, peak usage: %d)", val, usage, max)
}

And the call site that wraps it (the procHooks case during container init):

		case procHooks:
			// Setup cgroup before prestart hook, so that the prestart hook could apply cgroup permissions.
			if err := p.manager.Set(p.config.Config.Cgroups.Resources); err != nil {
				return fmt.Errorf("error setting cgroup config for procHooks process: %w", err)
			}
			if p.intelRdtManager != nil {
				if err := p.intelRdtManager.Set(p.config.Config); err != nil {
					return fmt.Errorf("error setting Intel RDT config for procHooks process: %w", err)
				}
			}
			if len(p.config.Config.Hooks) != 0 {
				s, err := p.container.currentOCIState()
				if err != nil {
					return err
				}
				// initProcessStartTime hasn't been set yet.
				s.Pid = p.cmd.Process.Pid
				s.Status = specs.StateCreating
				hooks := p.config.Config.Hooks

				if err := hooks[configs.Prestart].RunHooks(s); err != nil {
					return err
				}
				if err := hooks[configs.CreateRuntime].RunHooks(s); err != nil {
					return err
				}
			}
			// Sync with child.
			if err := writeSync(p.messageSockPair.parent, procResume); err != nil {
				return err
			}
			sentResume = true

I checked memory.move_charge_at_immigrate; it is not enabled (0), and I'm on cgroup v1.

Upon examining the kernel 4.18 source code:

static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
	struct cgroup_subsys_state *css;
	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
	struct mem_cgroup *from;
	struct task_struct *leader, *p;
	struct mm_struct *mm;
	unsigned long move_flags;
	int ret = 0;

	/* charge immigration isn't supported on the default hierarchy */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return 0;

	/*
	 * Multi-process migrations only happen on the default hierarchy
	 * where charge immigration is not used.  Perform charge
	 * immigration if @tset contains a leader and whine if there are
	 * multiple.
	 */
	p = NULL;
	cgroup_taskset_for_each_leader(leader, css, tset) {
		WARN_ON_ONCE(p);
		p = leader;
		memcg = mem_cgroup_from_css(css);
	}
	if (!p)
		return 0;

	/*
	 * We are now commited to this value whatever it is. Changes in this
	 * tunable will only affect upcoming migrations, not the current one.
	 * So we need to save it, and keep it going.
	 */
	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
	if (!move_flags)
		return 0;

	from = mem_cgroup_from_task(p);

	VM_BUG_ON(from == memcg);

	mm = get_task_mm(p);
	if (!mm)
		return 0;
	/* We move charges only when we move a owner of the mm */
	if (mm->owner == p) {
		VM_BUG_ON(mc.from);
		VM_BUG_ON(mc.to);
		VM_BUG_ON(mc.precharge);
		VM_BUG_ON(mc.moved_charge);
		VM_BUG_ON(mc.moved_swap);

		spin_lock(&mc.lock);
		mc.mm = mm;
		mc.from = from;
		mc.to = memcg;
		mc.flags = move_flags;
		spin_unlock(&mc.lock);
		/* We set mc.moving_task later */

		ret = mem_cgroup_precharge_mc(mm);
		if (ret)
			mem_cgroup_clear_mc();
	} else {
		mmput(mm);
	}
	return ret;
}

For cgroup v2 (the default hierarchy), the code returns 0 immediately:

/* charge immigration isn't supported on the default hierarchy */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return 0;

If move_charge_at_immigrate is 0, it also returns 0 immediately:

	/*
	 * We are now commited to this value whatever it is. Changes in this
	 * tunable will only affect upcoming migrations, not the current one.
	 * So we need to save it, and keep it going.
	 */
	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
	if (!move_flags)
		return 0;

The issue disappears when I switch back to runc 1.1.2 or use the memfd-bind binary.

Question 1: At that time, what was consuming the memory? memfd shouldn't consume the container's memory.
@lifubang @cyphar

What version of runc are you using?

master

Host OS information

NAME="EulerOS"
VERSION="2.0 (SP10x86_64)"
ID="euleros"
VERSION_ID="2.0"
PRETTY_NAME="EulerOS 2.0 (SP10x86_64)"
ANSI_COLOR="0;31"

Host kernel information

Linux PaaSOM-1 4.18.0-147.5.2.14.h1050.eulerosv2r10.x86_64 #1 SMP Sun Oct 16 18:12:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

@lifubang
Member

lifubang commented Aug 18, 2023

At that time, what was consuming the memory? memfd shouldn't consume the container's memory.

I think runc init was consuming the memory.
After we write the runc binary to the memfd, it uses runc init's memory.
You can test with the example in https://man7.org/linux/man-pages/man2/memfd_create.2.html.

Sorry, the above description is wrong.

@lifubang
Member

Question 1: At that time, what was consuming the memory? memfd shouldn't consume the container's memory.

I have done some tests. It may depend on the Go version.
If you use Go 1.20, the issue may disappear.

Which version of Go were you using when you saw this problem?

@113xiaoji
Author

Question 1: At that time, what was consuming the memory? memfd shouldn't consume the container's memory.

I have done some tests. It may depend on the Go version. If you use Go 1.20, the issue may disappear.

Which version of Go were you using when you saw this problem?

I'm using Go version 1.19.6. Let me switch to version 1.20+ and try again. Thank you.

@113xiaoji 113xiaoji reopened this Aug 18, 2023
@113xiaoji
Author

I switched the Go version to 1.20.7, but the problem still persists.

Warning  FailedStart       7s                kubelet            Error: failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set memory limit to 20971520 (current usage: 23396352, peak usage: 23408640): unknown

@lifubang

@cyphar
Member

cyphar commented Aug 20, 2023

I'm not sure if the cgroup configuration for procHooks is actually necessary -- we already set up the cgroups far earlier. IIRC the reason for this being added in the first place is that folks wanted to configure cgroups in hooks and we used to do a Set afterwards (which would overwrite those changes), but we removed that because it was completely unnecessary -- and I suspect this one is unnecessary as well (@kolyshkin wdyt?).

What pre-start or create-runtime hooks are you running?

It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?

@113xiaoji
Author

113xiaoji commented Aug 20, 2023

It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?
How can I check? I don't think so; I don't believe it is a soft limit.

What pre-start or create-runtime hooks are you running?

I checked the config.json; there is a prestart hook:

	"hooks": {
		"prestart": [{
			"path": "/var/lib/docker/hooks/remount_sys.sh",
			"args": ["remount_sys.sh"]
		}]
	},

The script has configured devices.allow, and I'm not sure if it's related to this issue.

@cyphar


@113xiaoji 113xiaoji reopened this Aug 21, 2023
@113xiaoji
Author

It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?

This point also confuses me. How can we confirm whether the original limit is using a soft limit? I think it's highly unlikely that it is.
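
For what it's worth, under cgroup v1 a soft limit would live in memory.soft_limit_in_bytes, next to memory.limit_in_bytes; a rough sketch to compare the two (the path is again a placeholder, and an unset soft limit reads back as a very large number, i.e. effectively no soft limit):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readValue returns the trimmed contents of a cgroup control file, or the error text.
func readValue(dir, name string) string {
	data, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return "error: " + err.Error()
	}
	return strings.TrimSpace(string(data))
}

func main() {
	// Placeholder path; substitute the container's memory cgroup directory.
	dir := "/sys/fs/cgroup/memory/kubepods/<pod>/<container>"
	fmt.Println("hard limit:", readValue(dir, "memory.limit_in_bytes"))
	fmt.Println("soft limit:", readValue(dir, "memory.soft_limit_in_bytes"))
}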

@kolyshkin
Contributor

I'm not sure if the cgroup configuration for procHooks is actually necessary -- we already set up the cgroups far earlier. IIRC the reason for this being added in the first place is that folks wanted to configure cgroups in hooks and we used to do a Set afterwards (which would overwrite those changes), but we removed that because it was completely unnecessary -- and I suspect this one is unnecessary as well (@kolyshkin wdyt?).

The cgroups setup is split between Apply (which no longer does Set, but merely creates a cgroup and adds a pid to it) and Set (which actually sets the limits). From a cursory look at the code, we do need to call Set here.

OTOH it looks like we call Set (and run CreateRuntime hook) twice in case we're running in host mount namespace, and I can't figure out why. Alas, I see no integration tests related to host mntns. Anyway, this is orthogonal to this issue.

@kolyshkin
Contributor

OTOH it looks like we call Set (and run CreateRuntime hook) twice in case we're running in host mount namespace, and I can't figure out why. Alas, I see no integration tests related to host mntns. Anyway, this is orthogonal to this issue.

Addressed by #3996.

@113xiaoji
Author

Do you need me to provide any additional information?

@kolyshkin
Contributor

Well, it is clear what's happening -- higher memory usage due to switching from bindfd to memfd.

This is being addressed in #3987. If you want to use current runc HEAD, the workaround is to raise the memory limits.

@113xiaoji
Author

Well, it is clear what's happening -- higher memory usage due to switching from bindfd to memfd.

This is being addressed in #3987. If you want to use current runc HEAD, the workaround is to raise the memory limits.

I think there are still some points we haven't analyzed clearly. According to the previous analysis, memfd should only consume the host's memory, not the container's. So in theory, if the host has enough memory, switching from bindfd to memfd should not cause the container to fail to start.
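
One way to probe this is a minimal sketch (assuming golang.org/x/sys/unix is available) that writes a chunk of data into a memfd: run it from inside a cgroup with a low memory limit and watch that cgroup's memory.usage_in_bytes to see whether the memfd pages get charged there.

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Create an anonymous in-memory file, similar in spirit to the memfd that
	// runc now uses for its cloned binary after dropping bindfd.
	fd, err := unix.MemfdCreate("memfd-demo", unix.MFD_CLOEXEC)
	if err != nil {
		fmt.Fprintln(os.Stderr, "memfd_create:", err)
		os.Exit(1)
	}
	defer unix.Close(fd)

	// Write ~16 MiB into the memfd. The expectation being tested is that these
	// pages are charged to the memory cgroup of the process doing the writing.
	chunk := make([]byte, 1<<20)
	for i := 0; i < 16; i++ {
		if _, err := unix.Write(fd, chunk); err != nil {
			fmt.Fprintln(os.Stderr, "write:", err)
			os.Exit(1)
		}
	}

	fmt.Println("wrote 16 MiB into the memfd; check memory.usage_in_bytes for this process's cgroup, then press Enter to exit")
	fmt.Scanln()
}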

@lifubang
Member

@113xiaoji #3987 has been merged, could you please test whether you can still reproduce your issue on the main branch? Thanks.

@113xiaoji
Author

OK, I will try to reproduce it later.

@113xiaoji
Author

@113xiaoji #3987 has been merged, could you please test whether you can still reproduce your issue on the main branch? Thanks.

I have tested with the latest main branch, without the runc_nodmz build tag, and the issue is no longer reproducible. However, what still confuses me is why the earlier memfd approach did consume the container's memory.

@lifubang
Member

without the runc_nodmz build tag, and the issue is no longer reproducible

✌️ Thanks.

what still confuses me is why the earlier memfd approach did consume the container's memory

Yes, this question still has no answer at this time.
