[rfc] how best to reduce runc's effects on cgroup memory limits #4021

Open
cyphar opened this issue Sep 17, 2023 · 6 comments

@cyphar
Member

cyphar commented Sep 17, 2023

Temporarily "unresolving" to un-hide on GitHub; maybe there are some follow-up discussions to be had (NOT for this PR, fwiw), but we can move the comment elsewhere.

We initially set this limit to 4MB, and raised it to 6MB when later versions of runc became more memory-hungry (I think that was part of a security fix?)

While looking that up, I found:

Honestly, from the perspective of Moby/Docker (and other "higher level" runtimes), I don't think it's "ideal":

  • It requires higher-level runtimes to have knowledge about the implementation's requirements.
  • It requires them to hard-code limits based on the "worst case" scenario (e.g. would crun have a lower requirement?).
  • It prevents users from setting the limit to what their container process requires at runtime (which may be lower than the memory needed during init).

This is orthogonally related to a (rather heated) discussion we had recently about moby/moby#45511, during which a question was raised: "what does the runtime-spec actually describe?"

  • Is it declarative? That is, does it define the container "at runtime" (after the execve), meaning the spec describes the user's intent w.r.t. the requirements for the container's process?
  • Or does it describe the container including the initial init (runc) stage, meaning the spec defines what's needed to construct the container (including the capabilities and resources needed to do so)?

The difference is very subtle, and in MOST cases the two are interchangeable (which is likely why they were never explicitly defined); however, there are situations such as the one being discussed here (and the one I linked) where the two interpretations diverge.

My (personal) ideal would be to somehow get to the "declarative" variant: the user describes their intent ("make it so!"), and the runtime(s) take care of "whatever is needed" to make it happen. This would be the most user-friendly and most "portable" approach, as it would take implementation-specific conditions out of the equation. But of course, this may not be something that can be realized (easily), beyond that ship possibly having sailed (runc effectively being the "reference implementation", and the runtime-spec usually being adjusted to fit, not the reverse).

That said, MAYBE there are some options if we take the "features" option into account (opencontainers/runtime-spec#1130); perhaps an addition there could either specify the minimum requirements for the runtime that's used, or advertise the runtime's own requirements (i.e., for this specific case: add XXk of memory to what's specified in the provided OCI spec).

Originally posted by @thaJeztah in #3987 (comment)
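
As a purely hypothetical illustration of that "features" idea (none of these fields exist in the features schema proposed in opencontainers/runtime-spec#1130), a runtime could advertise its own memory overhead so that higher-level runtimes can pad the user's limit instead of hard-coding a worst case:

```go
// Hypothetical sketch only: neither field below exists in the runtime-spec
// "features" schema; this just illustrates what advertising a runtime's
// memory overhead could look like.
package main

import (
	"encoding/json"
	"fmt"
)

// FeaturesMemory is an invented extension to the features document.
type FeaturesMemory struct {
	// MinimumLimit would be the smallest memory limit the runtime can
	// start a container under, in bytes.
	MinimumLimit uint64 `json:"minimumLimit,omitempty"`
	// InitOverhead would be the extra memory (in bytes) a higher-level
	// runtime should add to the user's limit to cover container init.
	InitOverhead uint64 `json:"initOverhead,omitempty"`
}

func main() {
	f := FeaturesMemory{MinimumLimit: 512 << 10, InitOverhead: 3 << 20}
	out, _ := json.MarshalIndent(f, "", "  ")
	// A higher-level runtime (e.g. Moby) could read this and pad limits.
	fmt.Println(string(out))
}
```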

@cyphar
Member Author

cyphar commented Sep 17, 2023

To be honest, I don't think we can have a hard guarantee for memory usage; for one, peak memory usage is not deterministic (and, as evidenced by the CentOS test failures, it seems we can sometimes end up using 2.5MB more in some environments).

I think we also can't do something simple like applying the cgroup settings later, because we have to join the cgroup before unsharing the cgroup namespace.

Maybe we could work around it by joining the cgroup just for cgroupns creation, leaving the cgroup, and then getting re-attached right before the exec. It would make things more complicated, but it would take runc out of the equation for memory usage. We couldn't do this for containers that have CAP_SYS_PTRACE (for the same reasons as with runc-dmz), because an attacker could inject code into the joining process and run outside of cgroup restrictions. For runc exec we could just delay the cgroup attach, because we don't need to create the namespace. @kolyshkin is this idea insane?

Originally posted by @cyphar in #3987 (comment)
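
A minimal sketch of the join/unshare/leave/re-attach ordering described above, assuming cgroup v2; the cgroup path is made up and race handling is omitted, so this illustrates the idea rather than runc's actual implementation:

```go
// Illustrative sketch only (not runc's implementation): join the target
// cgroup just long enough to pin it as the cgroupns root, move back out
// for setup, and re-attach immediately before the final exec.
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// attach moves the calling process into the given cgroup (cgroup v2).
func attach(cgroupPath string) error {
	// Writing "0" to cgroup.procs moves the writing process itself.
	return os.WriteFile(cgroupPath+"/cgroup.procs", []byte("0"), 0o644)
}

func main() {
	const containerCgroup = "/sys/fs/cgroup/mycontainer" // hypothetical path

	// 1. Join the container's cgroup only so that the new cgroup
	//    namespace is rooted at the container's cgroup.
	if err := attach(containerCgroup); err != nil {
		panic(err)
	}
	if err := unix.Unshare(unix.CLONE_NEWCGROUP); err != nil {
		panic(err)
	}

	// 2. Leave again, so setup work (and the Go runtime's memory) is not
	//    charged against the container's memory limit. NOTE: as discussed
	//    above, this is unsafe if the container has CAP_SYS_PTRACE, since
	//    the workload could inject code into a process outside the cgroup.
	if err := attach("/sys/fs/cgroup"); err != nil {
		panic(err)
	}

	// ... expensive container setup would happen here, outside the limit ...

	// 3. Re-attach right before exec, so only the workload is accounted.
	if err := attach(containerCgroup); err != nil {
		panic(err)
	}
	if err := unix.Exec("/bin/true", []string{"true"}, nil); err != nil {
		panic(err)
	}
}
```

For the runc exec case described above, steps 1 and 2 would be skipped entirely and only the final attach-then-exec would remain.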

@thaJeztah
Member

Thanks! Also /cc @neersighted who was in one of those conversations at the time.

@rata
Contributor

rata commented Sep 18, 2023

The following is conceptually a more "radical" approach, with a lot of semantic changes, but it's something to bear in mind.

I was at All Systems Go last week; systemd is trying an approach where the parent serializes the info the child will need and forks into a small child, and the child recovers the serialized info, applies it, and execs. This seems to be promising for systemd as a way to reduce memory usage, as the child can be much smaller.

I think they expect to merge that PR this week; it is called systemd-executor or something like that (I haven't searched for the PR, though).
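
Roughly, the pattern being described looks like the following Go sketch (reconstructed from the description above; the helper path and JSON wire format are invented here, not systemd's actual mechanism): the large parent serializes the state the child needs, then runs a deliberately small helper binary that deserializes it and finishes the job.

```go
// Sketch of the "serialize state, exec a small helper" pattern. The helper
// path and the JSON format are hypothetical, purely for illustration.
package main

import (
	"encoding/json"
	"os"
	"os/exec"
)

// State is whatever the small helper needs to finish the job.
type State struct {
	CgroupPath string   `json:"cgroupPath"`
	Argv       []string `json:"argv"`
}

func main() {
	st := State{
		CgroupPath: "/sys/fs/cgroup/mycontainer", // hypothetical
		Argv:       []string{"/bin/true"},
	}

	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	// Write the serialized state into the pipe up front; this is fine for
	// small payloads (well under the pipe buffer), which is the point of
	// the pattern: the child stays small because it only deserializes.
	if err := json.NewEncoder(w).Encode(st); err != nil {
		panic(err)
	}
	w.Close()

	cmd := exec.Command("/usr/lib/tiny-executor") // hypothetical small binary
	cmd.ExtraFiles = []*os.File{r}                // becomes fd 3 in the child
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```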

@cyphar
Member Author

cyphar commented Sep 18, 2023

@rata That is very similar to how runc currently works. We fork a child (runc init) and we give it a parsed form of the container configuration. We have to do an exec because of how Go works (there's no way around that, sadly), but the idea is very similar.

That being said, I do wonder where the 5-7MB of memory usage is coming from. We do a lot of work in runc init by necessity, but I would expect something more like 1MB of memory usage.

EDIT: Some local tests on my machine show that the peak memory usage by runc before starting the container is actually about 2.2-3MB. It seems possible that something about our CI setups leads to more memory usage (up to 7MB on our CentOS CI machines). While 2-3MB is worse than 1MB or 0MB, I don't think it's that unreasonable. These tests were done with #3987 applied. I can try to go back to a pre-memfd version of runc if needed, but RUNC_DMZ=legacy has no effect on the memory usage.
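
For reference, a sketch of one way to take measurements like these, assuming cgroup v2 on a kernel new enough to expose memory.peak (5.19+); the cgroup path is hypothetical:

```go
// Read a cgroup's memory high-water mark. Assumes cgroup v2 and a kernel
// with memory.peak (5.19+); the path below is a hypothetical example.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func peakBytes(cgroupPath string) (uint64, error) {
	data, err := os.ReadFile(cgroupPath + "/memory.peak")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// e.g. run "runc run" in this cgroup first, then inspect the peak.
	peak, err := peakBytes("/sys/fs/cgroup/mycontainer")
	if err != nil {
		panic(err)
	}
	fmt.Printf("peak: %.1f MiB\n", float64(peak)/(1<<20))
}
```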

@rata
Contributor

rata commented Sep 19, 2023

@cyphar no, it is very different, because while runc forks several times, it never execs into a very small binary to apply the cgroup limits; it just applies them while running the runc binary. Therefore, the ~13MB runc binary imposes a floor on how low the cgroup memory limit can be set.

@cyphar
Member Author

cyphar commented Sep 19, 2023

The runc binary size doesn't affect the cgroup limit. I can run a container with a 2.5MB limit (with RUNC_DMZ=legacy, no less) on my machine. If the runc binary size had any effect on memory usage, that shouldn't be possible.

I think this is because we never re-exec runc after joining the cgroup, so the memory accounting for the binary is attributed to the host. The memory usage also isn't from binary cloning; I think that has been pretty well established at this point.

My pet theory was that runc exits so quickly that the GC doesn't have a chance to run (if you try to run with GODEBUG=gctrace=1, there is no output) and so we are just looking at the total memory usage of runc without anything being freed. But some playing around with debug.FreeOSMemory seems to indicate that we might actually be using all of that memory.
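
A sketch of the debug.FreeOSMemory experiment described above; it reports the Go runtime's view of the heap (not the cgroup's accounting), but if HeapInuse barely drops after forcing a collection, the memory is genuinely live rather than uncollected garbage:

```go
// Force a GC and return freed pages to the OS, then compare the runtime's
// heap stats before and after. Little change in HeapInuse would indicate
// the memory is actually in use, not just waiting for a GC cycle.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func snapshot(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: HeapInuse=%d KiB, HeapReleased=%d KiB\n",
		label, m.HeapInuse/1024, m.HeapReleased/1024)
}

func main() {
	// ... the work whose memory usage is in question would happen here ...
	snapshot("before")
	debug.FreeOSMemory() // forces a GC and returns unused pages to the OS
	snapshot("after")
}
```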
