dmz: don't use runc-dmz in complicated capability setups #4137

cyphar · 2023-12-08T06:50:01Z

Due to the fact that runc-dmz is an intermediate binary without any special set-capability file attributes, using runc-dmz for containers with a non-root user can result in different capability sets being applied after the second execve().

Linux capabilities are quite complicated, and there are loads of different interactions between file and process capability sets, so we should just go with the most conservative rule to determine if we can't use runc-dmz -- if the inheritable, permitted, and bounding sets are not equal to the ambient set then we don't use runc-dmz.

Fixes: dac4171 ("runc-dmz: reduce memfd binary cloning cost with small C binary")
Fixes #4125 and is a safe alternative to #4129.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
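As a rough illustration of the conservative rule described above, the check could look something like the following. This is a minimal sketch, not the code from this PR: the `Capabilities` type, `sameSet`, and `capsAllowDmz` are hypothetical names, and capability sets are modelled as plain string slices.

```go
package main

import "fmt"

// Capabilities is a hypothetical stand-in for the capability sets in a
// container's process config; it is not runc's actual type.
type Capabilities struct {
	Bounding    []string
	Effective   []string
	Inheritable []string
	Permitted   []string
	Ambient     []string
}

// sameSet reports whether a and b contain the same capability names,
// ignoring order and duplicates.
func sameSet(a, b []string) bool {
	inA := make(map[string]bool, len(a))
	for _, s := range a {
		inA[s] = true
	}
	inB := make(map[string]bool, len(b))
	for _, s := range b {
		if !inA[s] {
			return false
		}
		inB[s] = true
	}
	return len(inA) == len(inB)
}

// capsAllowDmz applies the conservative rule from the description: runc-dmz
// is only considered when the inheritable, permitted, and bounding sets are
// all equal to the ambient set.
func capsAllowDmz(c Capabilities) bool {
	return sameSet(c.Inheritable, c.Ambient) &&
		sameSet(c.Permitted, c.Ambient) &&
		sameSet(c.Bounding, c.Ambient)
}

func main() {
	// Bounding, effective, and permitted are set, but ambient is left empty.
	cfg := Capabilities{
		Bounding:  []string{"CAP_CHOWN", "CAP_KILL"},
		Effective: []string{"CAP_CHOWN", "CAP_KILL"},
		Permitted: []string{"CAP_CHOWN", "CAP_KILL"},
	}
	fmt.Println("use runc-dmz:", capsAllowDmz(cfg))
}
```

Under this rule, any config whose ambient set is empty while the other sets are populated falls back to the memfd path.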

113xiaoji · 2023-12-09T01:55:55Z

Is memfd still supported? How do we enable it once this PR has been merged?

cyphar · 2023-12-10T08:59:18Z

@113xiaoji RUNC_DMZ=legacy still works to disable runc-dmz, and this adds one more case where we use a /proc/self/exe memfd.

AkihiroSuda · 2023-12-11T10:32:58Z

libcontainer/container_linux.go

+	return true
+}
+
+func shouldUseDmzBinary(p *Process, c *configs.Config) bool {


Can we just make dmz opt-in via an env or a CLI flag?

runc-dmz: Inheritable capabilities are dropped when they previously weren't #4125 (comment)

cyphar · 2023-12-14

That would be the simplest solution, but it seems like a bit of a shame to have this code and not use it... Should we remove the SELinux logic too?

AkihiroSuda · 2024-01-03

Probably we should define a ternary env var like RUNC_USE_DMZ=(1|0|auto).

The default value should be auto, however, for runc v1.2, I'd suggest to just treat this as an alias for 0 (false) to minimize the incompatibility.

In a future version of runc, we may implement more clever logic for auto.
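To make the proposal above concrete, here is a minimal sketch of how such a ternary variable might be parsed. `RUNC_USE_DMZ` is only the name suggested in this comment, and `dmzMode`, `parseDmzMode`, and `useDmz` are hypothetical helpers, not existing runc code. Following the suggestion for v1.2, `auto` is simply treated as off.

```go
package main

import (
	"fmt"
	"os"
)

// dmzMode is a hypothetical three-state setting for the proposed variable.
type dmzMode int

const (
	dmzAuto dmzMode = iota // default when unset
	dmzOff
	dmzOn
)

// parseDmzMode maps the proposed values 1, 0, and auto to a mode.
// An unset (empty) variable is treated as auto.
func parseDmzMode(val string) (dmzMode, error) {
	switch val {
	case "", "auto":
		return dmzAuto, nil
	case "0":
		return dmzOff, nil
	case "1":
		return dmzOn, nil
	default:
		return dmzAuto, fmt.Errorf("invalid RUNC_USE_DMZ value %q", val)
	}
}

// useDmz treats auto as an alias for off, as suggested for v1.2. A later
// version could replace this with smarter detection.
func useDmz(m dmzMode) bool {
	return m == dmzOn
}

func main() {
	mode, err := parseDmzMode(os.Getenv("RUNC_USE_DMZ"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("use runc-dmz:", useDmz(mode))
}
```

Keeping `auto` as a distinct parsed state means the default can change later without changing the accepted values.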

cc @lifubang WDYT?

lifubang · 2024-01-03

RUNC_DMZ=legacy can already disable the dmz feature. Do you mean you are worried that there will be more incompatibilities not covered in #4158? But we should keep in mind: if we set the default value to legacy, will the k8s e2e test case in this area fail? How can we improve this test case in k8s?

rata · 2024-01-03

I think we might mask this in k8s by disabling runc-dmz when the container is not running as root. I think if it runs as root we don't need to change the capabilities.

I'm not sure if the root detection is hard or not safe and that is why it wasn't done here. I haven't looked into it.

cyphar · 2024-01-10

Someone mentioned that checking if we are root would be sufficient. To be honest, I struggle to understand all of the interactions of capabilities with everything else in the kernel (some of the functions in commoncap are actual line noise to my eyes).

The issue is that runc binary overwrites are only relevant for uid 0 in most cases. However, if runc-dmz is only used for unprivileged container users maybe that'd be okay for now (not uid 0 and no caps).

rata · 2023-12-18T11:14:13Z

Please ping when this is ready for review now :-D

rata · 2023-12-18T16:10:24Z

From the PR description:

if the inheritable, permitted, and bounding sets are not equal to the ambient set then we don't use runc-dmz.

btw, it seems Kubernetes sets only bounding, effective and permitted (therefore, those will be different to ambient, that isn't set). I think this will open the "Kubernetes e2e tests fail with new runc" can of worms. That seems better than regressing on this, but saying it so we are warned :-)

rata · 2024-01-08T17:49:48Z

@cyphar friendly ping?

rata · 2024-01-11T11:27:55Z

@cyphar everything fine? Was the repo delete a mistake?

cyphar · 2024-01-11T12:24:05Z

I will recreate PRs in a bit, yeah it was a mistake 😅. I still have everything locally.

kolyshkin · 2024-01-17T20:59:57Z

@cyphar perhaps we'd better make runc-dmz non-default.

cyphar · 2024-01-17T21:45:27Z

Yeah, I agree.

lifubang · 2024-01-20T01:43:30Z

I did some more research and found that all of this stems from CVE-2019-5736: when we were dealing with it, we didn't want to deal with #!script name args.... I looked at the execve code and found that Linux supports only two kinds of executable: binfmt (flat/elf/misc/elf_fdpic) and the script format. Before the kernel starts the program in execve, it does roughly two things:

1. Reads 256 bytes from the program file into a buffer (https://github.com/torvalds/linux/blob/9d64bf433c53cab2f48a3fff7a1f2a696bc5229a/fs/exec.c#L1662-L1668).
2. Checks whether the first two bytes are #!; if so, the file is in script format and the real entrypoint can be read from it (https://github.com/torvalds/linux/blob/master/fs/binfmt_script.c#L34-L42).

If this is workable, maybe we can defend against it another way. For example:

1. Declare in runtime-spec that the container's entrypoint must be a file inside the container or mounted into it; otherwise we MUST return an error.
2. Read 256 bytes from the program file into a buffer.
3. Check whether the first two bytes are #!; if so, the file is in script format and we can read the real entrypoint.
4. Check whether the root of the entrypoint and the root of the container are the same; if not, return an error. (This is easy; I have written it in a private fork.)
5. Remove all of the code related to CVE-2019-5736.

If this is worth researching, I will continue working on it. Looking forward to your feedback. @cyphar @kolyshkin

cyphar · 2024-01-20T05:49:44Z

@lifubang There is a TOCTOU race if you read the file and then execute it afterwards. The file can change underneath you and the only way to exec the file is to actually call execve. It's not that we "didn't want to deal with #!" -- there is no other solution for CVE-2019-5736 unless we add support for restricting execve to the kernel. The entrypoint problem is recursive, and the core issue is we shouldn't be parsing executable files in userspace because the only way to have the correct behaviour for execve is to actually do the execve (which is unsafe).

Myself and several other maintainers spent several months dealing with this issue, if you think you've found a simple workaround it probably means you haven't considered all aspects of the attack. It's possible that we missed something, but I doubt we missed something as simple as you describe.

lifubang · 2024-01-22T10:09:17Z

@cyphar perhaps we'd better make runc-dmz non-default.

Even if we make runc-dmz non-default, we still need to recreate this PR to fix this problem, and perhaps we can make the check for disabling runc-dmz stricter.

cyphar · 2024-01-22T10:15:01Z

If we make it non-default we only need to handle the security-related cases that affect runc-dmz. If a user enables runc-dmz and something breaks, they've found out they'll need to disable it.

lifubang · 2024-01-23T10:35:02Z

@lifubang There is a TOCTOU race if you read the file and then execute it afterwards. The file can change underneath you and the only way to exec the file is to actually call execve. It's not that we "didn't want to deal with #!" -- there is no other solution for CVE-2019-5736 unless we add support for restricting execve to the kernel. The entrypoint problem is recursive, and the core issue is we shouldn't be parsing executable files in userspace because the only way to have the correct behaviour for execve is to actually do the execve (which is unsafe).

Myself and several other maintainers spent several months dealing with this issue, if you think you've found a simple workaround it probably means you haven't considered all aspects of the attack. It's possible that we missed something, but I doubt we missed something as simple as you describe.

Thanks, I know it's very difficult and complex. I'd like to raise one last idea; if it isn't valuable, I will give up.
I looked into the execveat syscall: if we open the container's entrypoint program file as an fd with the O_CLOEXEC flag, execveat returns ENOENT when the fd refers to a program that requires an interpreter (such as a script starting with "#!"). So maybe we can use this behaviour to protect against CVE-2019-5736?

1. Declare in runtime-spec that the container's entrypoint must be a file inside the container or mounted into it; otherwise we MUST return an error.
2. Read the first 2 bytes from the program file into a buffer.
3. Check whether the first two bytes are #!.
   - If yes, it is a shebang (script) file; we follow the link to get the final real entrypoint file name, then open it as an fd with the O_CLOEXEC flag.
   - If no, it is a binfmt file; we open it as an fd with the O_CLOEXEC flag.
4. Check whether the root of the fd and the root of the container are the same; if not, return an error.
5. Use the fd from step 3 to call execveat to start or exec into the container.

I think a malicious process can only modify the content of the file referred to by the fd; if it becomes a script-format file, execveat will block it because of O_CLOEXEC.
Could you help point out where the race condition is in this flow?

cyphar · 2024-01-23T12:53:00Z

@lifubang For the record it would've been helpful if you mentioned BINPRM_FLAGS_PATH_INACCESSIBLE, which is what makes your O_CLOEXEC approach block #! scripts. I'd completely forgotten that this behaviour exists. 😅

My honest opinion is that I don't think BINPRM_FLAGS_PATH_INACCESSIBLE is a strong enough security boundary. I suspect it'd be possible to use the ELF loader to get /proc/self/exe loaded as a library, which would also give the container access to the file. It also just feels messy and I don't feel comfortable removing code we know protects against this issue (as well as DirtyCOW-style issues).

For the record, the race condition was that after step 3 the file contents can be changed, but since BINPRM_FLAGS_PATH_INACCESSIBLE blocks #! execution the race condition doesn't exist. Also, the interpreter of a #! script can also be a #! script AFAIK, so we would need to resolve things recursively, but that's a minor issue.

The long-term solution to removing the CVE-2019-5736 code is to restrict re-opening through magic-links in the kernel. Here is a talk I gave on this topic in May of last year. I have some prototypes and will work on them when I get back from vacation at the beginning of March.

rata · 2024-01-26T14:48:03Z

@lifubang very nice research, but I think your approach doesn't work with the exploit we created at Kinvolk in 2019 for that CVE: https://kinvolk.io/blog/2019/02/runc-breakout-vulnerability-mitigated-on-flatcar-linux/.

It uses LD_PRELOAD and __attribute__((constructor)) to exploit it, it is not using an interpreter (it executes a binary).

I agree with @cyphar that BINPRM_FLAGS_PATH_INACCESSIBLE, sadly, doesn't seem like a strong-enough barrier. But very nice research, though!

lifubang · 2024-02-01T10:36:47Z

It uses LD_PRELOAD and __attribute__((constructor)) to exploit it, it is not using an interpreter (it executes a binary).

Thanks, I tested it and found that LD_PRELOAD and __attribute__((constructor)) still need to use /proc/self/exe for the exploit. In my approach, though, I changed runc's behaviour: runc will reject any entrypoint that does not belong to the container's filesystem jail. For example, if you use /proc/self/exe as the value of .process.args, runc will be able to detect it and refuse to execve it.

lifubang · 2024-02-01T10:42:11Z

I suspect it'd be possible to use the ELF loader to get /proc/self/exe loaded as a library,

I don't know how to test this behavior.
I tested the LD_PRELOAD and __attribute__((constructor)) approach mentioned by @rata and found that, when the preload program runs, /proc/self/exe has become the container's entrypoint, not the runc init binary. So maybe /proc/self/exe will also have been reset to the correct value by the time ELF loading starts.

rata · 2024-02-01T11:13:31Z

@lifubang

Thanks, I tested it and found that LD_PRELOAD and __attribute__((constructor)) still need to use /proc/self/exe for the exploit. In my approach, though, I changed runc's behaviour: runc will reject any entrypoint that does not belong to the container's filesystem jail. For example, if you use /proc/self/exe as the value of .process.args, runc will be able to detect it and refuse to execve it.

I don't follow, the example I linked to doesn't use /proc/self/exe in the entrypoint, it uses /usr/bin/sh. The self-exe thingy is used in a library that is LD_PRELOAD'ed, and that is it. What am I missing?

lifubang · 2024-02-01T11:32:47Z

What am I missing?

/exe -> /proc/self/exe

lifubang · 2024-02-01T11:36:52Z

I found that, when the preload program in foo.so runs, /proc/self/exe has become the container's entrypoint, not the runc init binary, as long as we don't use /proc/self/exe as the container's entrypoint.

@rata

rata · 2024-02-01T12:44:45Z

@lifubang ohh, sorry, I didn't understand what that meant. Thanks!

AkihiroSuda reviewed Dec 11, 2023

cyphar mentioned this pull request Dec 14, 2023

Avoid using runc-dmz if capabilities are set as non-root #4129

Closed

cyphar added this to the 1.2.0 milestone Dec 14, 2023

lifubang mentioned this pull request Jan 1, 2024

Reasons that can't use runc-dmz #4158

Open

cyphar closed this by deleting the head repository Jan 11, 2024
