[rfc] how best to reduce runc's effects on cgroup memory limits #4021

Open
cyphar opened this issue Sep 17, 2023 · 6 comments

@cyphar
Member

cyphar commented Sep 17, 2023

Temporarily "unresolving" to un-hide on GitHub; maybe there are some follow-up discussions to be had (NOT for this PR, fwiw), but we can move the comment elsewhere.

We initially set this limit to 4MB, and raised it to 6MB when later versions of runc became more memory-hungry (I think that was part of a security fix?)

While looking that up, I found:

Honestly, from the perspective of Moby/Docker (and other "higher level" runtimes), I don't think it's "ideal":

  • It requires higher-level runtimes to have knowledge about the implementation's requirements.
  • It requires them to hard-code limits based on the "worst case" scenario (e.g. would crun have a lower requirement?).
  • It prevents users from setting the limit to what their container process requires at runtime (which may be lower than the memory needed during init).

This is orthogonally related to a (rather heated) discussion we had recently about moby/moby#45511, during which a question was raised: "what does the runtime-spec actually describe?"

  • Is it declarative? That is, does it define the container "at runtime" (after the execve), meaning the spec describes the user's intent w.r.t. the requirements for the container's process?
  • Or does it describe the container including the initial init (runc) stage, meaning the spec defines what's needed to construct the container (including the capabilities and resources needed to do so)?

The difference is very subtle, and in MOST cases the two are interchangeable (which is likely why they were never explicitly defined); however, there are situations such as the one being discussed here (and the one I linked) where the two interpretations diverge.

My (personal) ideal would be to somehow get to the "declarative" variant: the user describes their intent ("make it so!"), and the runtime(s) take care of "whatever is needed" to make it happen. This would be the most user-friendly and most "portable" approach, as it would take implementation-specific conditions out of the equation. But of course, this may not be something that can be realized (easily), beyond that ship possibly having sailed (runc effectively being the "reference implementation", and the runtime-spec usually being adjusted to fit, not the reverse).

That said, MAYBE there are some options if we take the "features" option into account (opencontainers/runtime-spec#1130); perhaps an addition there could either specify the minimum requirements for the runtime that's used, or advertise the runtime's own requirements (i.e., for this specific case: add XXk of memory to what's specified in the provided OCI spec).

Originally posted by @thaJeztah in #3987 (comment)
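
As a purely hypothetical illustration of that "features" idea (none of these fields exist in the features schema proposed in opencontainers/runtime-spec#1130), a runtime could advertise its own memory overhead so that higher-level runtimes can pad the user's limit instead of hard-coding a worst case:

```go
// Hypothetical sketch only: neither field below exists in the runtime-spec
// "features" schema; this just illustrates what advertising a runtime's
// memory overhead could look like.
package main

import (
	"encoding/json"
	"fmt"
)

// FeaturesMemory is an invented extension to the features document.
type FeaturesMemory struct {
	// MinimumLimit would be the smallest memory limit the runtime can
	// start a container under, in bytes.
	MinimumLimit uint64 `json:"minimumLimit,omitempty"`
	// InitOverhead would be the extra memory (in bytes) a higher-level
	// runtime should add to the user's limit to cover container init.
	InitOverhead uint64 `json:"initOverhead,omitempty"`
}

func main() {
	f := FeaturesMemory{MinimumLimit: 512 << 10, InitOverhead: 3 << 20}
	out, _ := json.MarshalIndent(f, "", "  ")
	// A higher-level runtime (e.g. Moby) could read this and pad limits.
	fmt.Println(string(out))
}
```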

@cyphar
Member Author

cyphar commented Sep 17, 2023

To be honest, I don't think we can have a hard guarantee for memory usage; for one, peak memory usage is not deterministic (and, as evidenced by the CentOS test failures, it seems we can sometimes end up using 2.5MB more in some environments).

I think we also can't do something simple like applying the cgroup settings later, because we have to join the cgroup before unsharing the cgroup namespace.

Maybe we could work around it by joining the cgroup just for cgroupns creation, leaving the cgroup, and then getting re-attached right before the exec. It would make things more complicated, but it would take runc out of the equation for memory usage. We couldn't do this for containers that have CAP_SYS_PTRACE (for the same reasons as with runc-dmz), because an attacker could inject code into the joining process and run outside of cgroup restrictions. For runc exec we could just delay the cgroup attach, because we don't need to create the namespace. @kolyshkin is this idea insane?

Originally posted by @cyphar in #3987 (comment)
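
A minimal sketch of the join/unshare/leave/re-attach ordering described above, assuming cgroup v2; the cgroup path is made up and race handling is omitted, so this illustrates the idea rather than runc's actual implementation:

```go
// Illustrative sketch only (not runc's implementation): join the target
// cgroup just long enough to pin it as the cgroupns root, move back out
// for setup, and re-attach immediately before the final exec.
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// attach moves the calling process into the given cgroup (cgroup v2).
func attach(cgroupPath string) error {
	// Writing "0" to cgroup.procs moves the writing process itself.
	return os.WriteFile(cgroupPath+"/cgroup.procs", []byte("0"), 0o644)
}

func main() {
	const containerCgroup = "/sys/fs/cgroup/mycontainer" // hypothetical path

	// 1. Join the container's cgroup only so that the new cgroup
	//    namespace is rooted at the container's cgroup.
	if err := attach(containerCgroup); err != nil {
		panic(err)
	}
	if err := unix.Unshare(unix.CLONE_NEWCGROUP); err != nil {
		panic(err)
	}

	// 2. Leave again, so setup work (and the Go runtime's memory) is not
	//    charged against the container's memory limit. NOTE: as discussed
	//    above, this is unsafe if the container has CAP_SYS_PTRACE, since
	//    the workload could inject code into a process outside the cgroup.
	if err := attach("/sys/fs/cgroup"); err != nil {
		panic(err)
	}

	// ... expensive container setup would happen here, outside the limit ...

	// 3. Re-attach right before exec, so only the workload is accounted.
	if err := attach(containerCgroup); err != nil {
		panic(err)
	}
	if err := unix.Exec("/bin/true", []string{"true"}, nil); err != nil {
		panic(err)
	}
}
```

For the runc exec case described above, steps 1 and 2 would be skipped entirely and only the final attach-then-exec would remain.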

@thaJeztah
Member

Thanks! Also /cc @neersighted who was in one of those conversations at the time.

@rata
Contributor

rata commented Sep 18, 2023

The following is conceptually a more "radical" approach, with a lot of semantic changes, but it's something to bear in mind.

I was at All Systems Go last week; systemd is trying an approach where the parent serializes the info the child will need and forks into a small child, and the child recovers the serialized info, applies it, and execs. This seems to be promising for systemd as a way to reduce memory usage, as the child can be much smaller.

I think they expect to merge that PR this week; it is called systemd-executor or something like that (I haven't searched for the PR, though).
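
Roughly, the pattern being described looks like the following Go sketch (reconstructed from the description above; the helper path and JSON wire format are invented here, not systemd's actual mechanism): the large parent serializes the state the child needs, then runs a deliberately small helper binary that deserializes it and finishes the job.

```go
// Sketch of the "serialize state, exec a small helper" pattern. The helper
// path and the JSON format are hypothetical, purely for illustration.
package main

import (
	"encoding/json"
	"os"
	"os/exec"
)

// State is whatever the small helper needs to finish the job.
type State struct {
	CgroupPath string   `json:"cgroupPath"`
	Argv       []string `json:"argv"`
}

func main() {
	st := State{
		CgroupPath: "/sys/fs/cgroup/mycontainer", // hypothetical
		Argv:       []string{"/bin/true"},
	}

	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	// Write the serialized state into the pipe up front; this is fine for
	// small payloads (well under the pipe buffer), which is the point of
	// the pattern: the child stays small because it only deserializes.
	if err := json.NewEncoder(w).Encode(st); err != nil {
		panic(err)
	}
	w.Close()

	cmd := exec.Command("/usr/lib/tiny-executor") // hypothetical small binary
	cmd.ExtraFiles = []*os.File{r}                // becomes fd 3 in the child
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```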

@cyphar
Member Author

cyphar commented Sep 18, 2023

@rata That is very similar to how runc currently works. We fork a child (runc init) and we give it a parsed form of the container configuration. We have to do an exec because of how Go works (there's no way around that, sadly), but the idea is very similar.

That being said, I do wonder where the 5-7MB of memory usage is coming from. We do a lot of work in runc init by necessity, but I would expect something more like 1MB of memory usage.

EDIT: Some local tests on my machine show that the peak memory usage by runc before starting the container is actually about 2.2-3MB. It seems possible that something about our CI setups leads to more memory usage (up to 7MB on our CentOS CI machines). While 2-3MB is worse than 1MB or 0MB, I don't think it's that unreasonable. These tests were done with #3987 applied. I can try to go back to a pre-memfd version of runc if needed, but RUNC_DMZ=legacy has no effect on the memory usage.
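
For reference, a sketch of one way to take measurements like these, assuming cgroup v2 on a kernel new enough to expose memory.peak (5.19+); the cgroup path is hypothetical:

```go
// Read a cgroup's memory high-water mark. Assumes cgroup v2 and a kernel
// with memory.peak (5.19+); the path below is a hypothetical example.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func peakBytes(cgroupPath string) (uint64, error) {
	data, err := os.ReadFile(cgroupPath + "/memory.peak")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// e.g. run "runc run" in this cgroup first, then inspect the peak.
	peak, err := peakBytes("/sys/fs/cgroup/mycontainer")
	if err != nil {
		panic(err)
	}
	fmt.Printf("peak: %.1f MiB\n", float64(peak)/(1<<20))
}
```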

@rata
Contributor

rata commented Sep 19, 2023

@cyphar no, it is very different, because while runc forks several times, it never execs into a very small binary to apply the cgroup limits; it just applies them while running the runc binary. Therefore, the ~13MB runc binary imposes a floor on how low the cgroup memory limit can be set.

@cyphar
Member Author

cyphar commented Sep 19, 2023

The runc binary size doesn't affect the cgroup limit. I can run a container with a 2.5MB limit (with RUNC_DMZ=legacy, no less) on my machine. If the runc binary size had any effect on memory usage, that shouldn't be possible.

I think this is because we never re-exec runc after joining the cgroup, so the memory accounting for the binary is attributed to the host. The memory usage also isn't from binary cloning; I think that has been pretty well established at this point.

My pet theory was that runc exits so quickly that the GC doesn't have a chance to run (if you try to run with GODEBUG=gctrace=1, there is no output) and so we are just looking at the total memory usage of runc without anything being freed. But some playing around with debug.FreeOSMemory seems to indicate that we might actually be using all of that memory.
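
A sketch of the debug.FreeOSMemory experiment described above; it reports the Go runtime's view of the heap (not the cgroup's accounting), but if HeapInuse barely drops after forcing a collection, the memory is genuinely live rather than uncollected garbage:

```go
// Force a GC and return freed pages to the OS, then compare the runtime's
// heap stats before and after. Little change in HeapInuse would indicate
// the memory is actually in use, not just waiting for a GC cycle.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func snapshot(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: HeapInuse=%d KiB, HeapReleased=%d KiB\n",
		label, m.HeapInuse/1024, m.HeapReleased/1024)
}

func main() {
	// ... the work whose memory usage is in question would happen here ...
	snapshot("before")
	debug.FreeOSMemory() // forces a GC and returns unused pages to the OS
	snapshot("after")
}
```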
