
unable to set memory limit to 20971520 (current usage: 21401600, peak usage: 21536768): unknown #3986

Open
113xiaoji opened this issue Aug 18, 2023 · 17 comments


@113xiaoji

113xiaoji commented Aug 18, 2023

Description

With the logic from #3931, bindfd was dropped in favor of memfd. The pod has two containers: a main container and a sidecar. The sidecar container's memory request is set to 10MB and its limit to 20MB. When I delete the pod and wait for it to be rebuilt, I get the following error:

Steps to reproduce the issue

1. Create a container with a memory limit set to 20MB.
2. Start it using the memfd method.
3. Check the value in memory.usage_in_bytes.

Alternatively, when used with Kubernetes:
The pod has two containers: a primary container and a sidecar. The sidecar container's memory request is set to 10MB and its limit to 20MB. When I delete the pod and wait for it to be rebuilt, the error below occurs.
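
To double-check step 3, here is a minimal sketch (not runc code; the cgroup path is a placeholder you would replace with the real pod/container directory) that reads the cgroup v1 memory controller files:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readCgroupFile returns the trimmed contents of a single cgroup control file.
func readCgroupFile(dir, name string) (string, error) {
	data, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	// Placeholder path; substitute the container's memory cgroup directory.
	dir := "/sys/fs/cgroup/memory/kubepods/<pod>/<container>"
	for _, name := range []string{
		"memory.limit_in_bytes",
		"memory.usage_in_bytes",
		"memory.max_usage_in_bytes",
	} {
		val, err := readCgroupFile(dir, name)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s = %s\n", name, val)
	}
}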

Describe the results you received and expected

Error Log:

    Message:      failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set memory limit to 20971520 (current usage: 21401600, peak usage: 21536768): unknown

The runc code that produces this error (setMemory):

func setMemory(path string, val int64) error {
	if val == 0 {
		return nil
	}

	err := cgroups.WriteFile(path, cgroupMemoryLimit, strconv.FormatInt(val, 10))
	if !errors.Is(err, unix.EBUSY) {
		return err
	}

	// EBUSY means the kernel can't set new limit as it's too low
	// (lower than the current usage). Return more specific error.
	usage, err := fscommon.GetCgroupParamUint(path, cgroupMemoryUsage)
	if err != nil {
		return err
	}
	max, err := fscommon.GetCgroupParamUint(path, cgroupMemoryMaxUsage)
	if err != nil {
		return err
	}

	return fmt.Errorf("unable to set memory limit to %d (current usage: %d, peak usage: %d)", val, usage, max)
}

And the call site that wraps it (the procHooks case during container init):

		case procHooks:
			// Setup cgroup before prestart hook, so that the prestart hook could apply cgroup permissions.
			if err := p.manager.Set(p.config.Config.Cgroups.Resources); err != nil {
				return fmt.Errorf("error setting cgroup config for procHooks process: %w", err)
			}
			if p.intelRdtManager != nil {
				if err := p.intelRdtManager.Set(p.config.Config); err != nil {
					return fmt.Errorf("error setting Intel RDT config for procHooks process: %w", err)
				}
			}
			if len(p.config.Config.Hooks) != 0 {
				s, err := p.container.currentOCIState()
				if err != nil {
					return err
				}
				// initProcessStartTime hasn't been set yet.
				s.Pid = p.cmd.Process.Pid
				s.Status = specs.StateCreating
				hooks := p.config.Config.Hooks

				if err := hooks[configs.Prestart].RunHooks(s); err != nil {
					return err
				}
				if err := hooks[configs.CreateRuntime].RunHooks(s); err != nil {
					return err
				}
			}
			// Sync with child.
			if err := writeSync(p.messageSockPair.parent, procResume); err != nil {
				return err
			}
			sentResume = true

I checked memory.move_charge_at_immigrate; it is not enabled (0), and I'm on cgroup v1.

Upon examining the kernel 4.18 source code:

static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
	struct cgroup_subsys_state *css;
	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
	struct mem_cgroup *from;
	struct task_struct *leader, *p;
	struct mm_struct *mm;
	unsigned long move_flags;
	int ret = 0;

	/* charge immigration isn't supported on the default hierarchy */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return 0;

	/*
	 * Multi-process migrations only happen on the default hierarchy
	 * where charge immigration is not used.  Perform charge
	 * immigration if @tset contains a leader and whine if there are
	 * multiple.
	 */
	p = NULL;
	cgroup_taskset_for_each_leader(leader, css, tset) {
		WARN_ON_ONCE(p);
		p = leader;
		memcg = mem_cgroup_from_css(css);
	}
	if (!p)
		return 0;

	/*
	 * We are now commited to this value whatever it is. Changes in this
	 * tunable will only affect upcoming migrations, not the current one.
	 * So we need to save it, and keep it going.
	 */
	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
	if (!move_flags)
		return 0;

	from = mem_cgroup_from_task(p);

	VM_BUG_ON(from == memcg);

	mm = get_task_mm(p);
	if (!mm)
		return 0;
	/* We move charges only when we move a owner of the mm */
	if (mm->owner == p) {
		VM_BUG_ON(mc.from);
		VM_BUG_ON(mc.to);
		VM_BUG_ON(mc.precharge);
		VM_BUG_ON(mc.moved_charge);
		VM_BUG_ON(mc.moved_swap);

		spin_lock(&mc.lock);
		mc.mm = mm;
		mc.from = from;
		mc.to = memcg;
		mc.flags = move_flags;
		spin_unlock(&mc.lock);
		/* We set mc.moving_task later */

		ret = mem_cgroup_precharge_mc(mm);
		if (ret)
			mem_cgroup_clear_mc();
	} else {
		mmput(mm);
	}
	return ret;
}

For cgroup v2 (the default hierarchy), the code returns 0 immediately:

/* charge immigration isn't supported on the default hierarchy */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return 0;

If move_charge_at_immigrate is 0, it also returns 0 immediately:

	/*
	 * We are now commited to this value whatever it is. Changes in this
	 * tunable will only affect upcoming migrations, not the current one.
	 * So we need to save it, and keep it going.
	 */
	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
	if (!move_flags)
		return 0;

The issue disappears when I switch back to runc 1.1.2 or use the memfd-bind binary.

Question 1: At that time, what was consuming the memory? memfd shouldn't consume the container's memory.
@lifubang @cyphar

What version of runc are you using?

master

Host OS information

NAME="EulerOS"
VERSION="2.0 (SP10x86_64)"
ID="euleros"
VERSION_ID="2.0"
PRETTY_NAME="EulerOS 2.0 (SP10x86_64)"
ANSI_COLOR="0;31"

Host kernel information

Linux PaaSOM-1 4.18.0-147.5.2.14.h1050.eulerosv2r10.x86_64 #1 SMP Sun Oct 16 18:12:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

@lifubang
Member

lifubang commented Aug 18, 2023

At that time, what was consuming the memory? memfd shouldn't consume the container's memory.

I think runc init was consuming the memory.
After we write the runc binary to the memfd, it uses runc init's memory.
You can test with the example in https://man7.org/linux/man-pages/man2/memfd_create.2.html.

Sorry, the above description is wrong.

@lifubang
Member

Question 1: At that time, what was consuming the memory? memfd shouldn't consume the container's memory.

I have done some tests. It may depend on the Go version.
If you use Go 1.20, the issue may disappear.

Which version of Go were you using when you saw this problem?

@113xiaoji
Author

Question 1: At that time, what was consuming the memory? memfd shouldn't consume the container's memory.

I have done some tests. It may depend on the Go version. If you use Go 1.20, the issue may disappear.

Which version of Go were you using when you saw this problem?

I'm using Go version 1.19.6. Let me switch to version 1.20+ and try again. Thank you.

@113xiaoji 113xiaoji reopened this Aug 18, 2023
@113xiaoji
Author

I switched the Go version to 1.20.7, but the problem still persists.

Warning  FailedStart       7s                kubelet            Error: failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set memory limit to 20971520 (current usage: 23396352, peak usage: 23408640): unknown

@lifubang

@cyphar
Member

cyphar commented Aug 20, 2023

I'm not sure if the cgroup configuration for procHooks is actually necessary -- we already set up the cgroups far earlier. IIRC the reason for this being added in the first place is that folks wanted to configure cgroups in hooks and we used to do a Set afterwards (which would overwrite those changes), but we removed that because it was completely unnecessary -- and I suspect this one is unnecessary as well (@kolyshkin wdyt?).

What pre-start or create-runtime hooks are you running?

It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?

@113xiaoji
Author

113xiaoji commented Aug 20, 2023

It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?
How can I check? I don't think so; I don't believe it is a soft limit.

What pre-start or create-runtime hooks are you running?

I checked the config.json; there is a prestart hook:

	"hooks": {
		"prestart": [{
			"path": "/var/lib/docker/hooks/remount_sys.sh",
			"args": ["remount_sys.sh"]
		}]
	},

The script has configured devices.allow, and I'm not sure if it's related to this issue.

@cyphar


@113xiaoji 113xiaoji reopened this Aug 21, 2023
@113xiaoji
Author

It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?

This point also confuses me. How can we confirm whether the original limit is using a soft limit? I think it's highly unlikely that it is.
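
For what it's worth, under cgroup v1 a soft limit would live in memory.soft_limit_in_bytes, next to memory.limit_in_bytes; a rough sketch to compare the two (the path is again a placeholder, and an unset soft limit reads back as a very large number, i.e. effectively no soft limit):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readValue returns the trimmed contents of a cgroup control file, or the error text.
func readValue(dir, name string) string {
	data, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return "error: " + err.Error()
	}
	return strings.TrimSpace(string(data))
}

func main() {
	// Placeholder path; substitute the container's memory cgroup directory.
	dir := "/sys/fs/cgroup/memory/kubepods/<pod>/<container>"
	fmt.Println("hard limit:", readValue(dir, "memory.limit_in_bytes"))
	fmt.Println("soft limit:", readValue(dir, "memory.soft_limit_in_bytes"))
}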

@kolyshkin
Contributor

I'm not sure if the cgroup configuration for procHooks is actually necessary -- we already set up the cgroups far earlier. IIRC the reason for this being added in the first place is that folks wanted to configure cgroups in hooks and we used to do a Set afterwards (which would overwrite those changes), but we removed that because it was completely unnecessary -- and I suspect this one is unnecessary as well (@kolyshkin wdyt?).

The cgroups setup is split between Apply (which no longer does Set, but merely creates a cgroup and adds a pid to it) and Set (which actually sets the limits). From a cursory look at the code, we do need to call Set here.

OTOH it looks like we call Set (and run CreateRuntime hook) twice in case we're running in host mount namespace, and I can't figure out why. Alas, I see no integration tests related to host mntns. Anyway, this is orthogonal to this issue.

@kolyshkin
Contributor

OTOH it looks like we call Set (and run CreateRuntime hook) twice in case we're running in host mount namespace, and I can't figure out why. Alas, I see no integration tests related to host mntns. Anyway, this is orthogonal to this issue.

Addressed by #3996.

@113xiaoji
Author

Do you need me to provide any additional information?

@kolyshkin
Contributor

Well, it is clear what's happening -- higher memory usage due to switching from bindfd to memfd.

This is being addressed in #3987. If you want to use current runc HEAD, the workaround is to raise the memory limits.

@113xiaoji
Author

Well, it is clear what's happening -- higher memory usage due to switching from bindfd to memfd.

This is being addressed in #3987. If you want to use current runc HEAD, the workaround is to raise the memory limits.

I think there are still some points we haven't analyzed clearly. According to the previous analysis, memfd should only consume the host's memory, not the container's. So in theory, if the host has enough memory, switching from bindfd to memfd should not cause the container to fail to start.
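
One way to probe this is a minimal sketch (assuming golang.org/x/sys/unix is available) that writes a chunk of data into a memfd: run it from inside a cgroup with a low memory limit and watch that cgroup's memory.usage_in_bytes to see whether the memfd pages get charged there.

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Create an anonymous in-memory file, similar in spirit to the memfd that
	// runc now uses for its cloned binary after dropping bindfd.
	fd, err := unix.MemfdCreate("memfd-demo", unix.MFD_CLOEXEC)
	if err != nil {
		fmt.Fprintln(os.Stderr, "memfd_create:", err)
		os.Exit(1)
	}
	defer unix.Close(fd)

	// Write ~16 MiB into the memfd. The expectation being tested is that these
	// pages are charged to the memory cgroup of the process doing the writing.
	chunk := make([]byte, 1<<20)
	for i := 0; i < 16; i++ {
		if _, err := unix.Write(fd, chunk); err != nil {
			fmt.Fprintln(os.Stderr, "write:", err)
			os.Exit(1)
		}
	}

	fmt.Println("wrote 16 MiB into the memfd; check memory.usage_in_bytes for this process's cgroup, then press Enter to exit")
	fmt.Scanln()
}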

@lifubang
Member

@113xiaoji #3987 has been merged, could you please test whether you can still reproduce your issue on the main branch? Thanks.

@113xiaoji
Author

OK, I will try to reproduce it later.

@113xiaoji
Author

@113xiaoji #3987 has been merged, could you please test whether you can still reproduce your issue on the main branch? Thanks.

I have tested with the latest main branch, without the runc_nodmz build tag, and the issue is no longer reproducible. However, what still confuses me is why the earlier memfd approach did consume the container's memory.

@lifubang
Member

without the runc_nodmz build tag, and the issue is no longer reproducible

✌️ Thanks.

what still confuses me is why the earlier memfd approach did consume the container's memory

Yes, this question still has no answer at this time.
