
Revert "seccomp: block socket calls to AF_VSOCK in default profile" (commit 57b2290) #45317

Closed
hammerg opened this issue Apr 13, 2023 · 14 comments
Labels
area/security/seccomp area/security kind/enhancement status/0-triage

Comments

@hammerg

hammerg commented Apr 13, 2023

Description

In our project we would like to use vsock for guest-host communication. However, due to commit 57b2290, this now requires either a custom seccomp profile or a privileged container. I've read the reasoning in the commit message, but I fail to understand its logic.

The claim seems to be that they want to use vsock as well, and because they don't know how to limit connection attempts, the solution is to block everyone from using this feature, with "security" as the excuse. How is giving full privileges to a pod that only needs inter-process communication with its host more secure?

The existence of the vsock device is controlled by the virtual machine. If a host doesn't want to allow guest-to-host communication, it simply doesn't create the virtio-vsock device. And when this kind of communication is needed, there is either a listener application which expects these connections and can reject unwanted ones, or an application that initiates the connection to a guest application, in which case one might choose a lower port number that requires capabilities to bind (a minimal sketch of both sides follows below).
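
To make that pattern concrete, here is a minimal sketch of both sides of such a guest-host exchange, using Python's AF_VSOCK support (Linux-only, available since Python 3.7). The port number, payload, and the guest-CID check are purely illustrative assumptions, not part of any existing setup; CID 2 is the well-known address of the host.

```python
import socket

PORT = 5000  # illustrative port; vsock ports below 1024 require extra privileges to bind

def host_listener():
    """Host side: a listener that expects guest connections and can reject unwanted ones."""
    srv = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    srv.bind((socket.VMADDR_CID_ANY, PORT))     # accept connections from any guest CID
    srv.listen(1)
    conn, (peer_cid, peer_port) = srv.accept()  # vsock addresses are (CID, port) pairs
    if peer_cid != 3:                           # example policy: only accept guest CID 3
        conn.close()
    else:
        print("guest says:", conn.recv(4096))
        conn.close()
    srv.close()

def guest_connect():
    """Guest side (e.g. a process inside the VM): connect to the host's listener."""
    s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    s.connect((socket.VMADDR_CID_HOST, PORT))   # VMADDR_CID_HOST == 2, the well-known host CID
    s.sendall(b"hello from the guest")
    s.close()
```

Under the default profile introduced by commit 57b2290, it is the socket() call itself that is denied inside a container (it typically surfaces as an "operation not permitted" error), which is why a custom profile or a privileged/unconfined run is currently needed.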

So, since the default seccomp profile allows other IPC mechanisms, such as pipes and sockets, I request that this rule be removed. I believe it was reverted in an older branch as well.

Thanks,

Gal.
@hammerg hammerg added kind/feature status/0-triage labels Apr 13, 2023
@sam-thibault
Contributor

Noting PR for easy reference:

@sam-thibault
Contributor

@thaJeztah can you comment here?

@sam-thibault sam-thibault added area/security kind/enhancement and removed kind/feature labels Apr 13, 2023
@thaJeztah
Member

The claim seems to be that they want to use vsock as well, and because they don't know how to limit connection attempts, the solution is to block everyone from using this feature, with "security" as the excuse.

Allowing AF_VSOCK is an actual security issue. It's not namespaced in current kernels (#44670 (comment)), which means that containers can effectively escape their sandbox and obtain access outside of it.

The existence of the vsock device is controlled by the virtual machine. If a host doesn't want to allow guest-to-host communication, it simply doesn't create the virtio-vsock device.

For sure; that may not always be under the user's control, though: cloud providers may set up these sockets, and containers would then be able to access them.

All of that combined, the decision was to make this something a container must be explicitly granted access to, not a default. That said, I didn't see a strong rejection of an option that makes this easier to set up (without having to write a fully custom seccomp profile); partial discussion happened in #44670. I see that one was closed, but perhaps we need a ticket to discuss whether improvements can be made there.

@thaJeztah
Member

/cc @neersighted @AkihiroSuda

@hammerg
Author

hammerg commented Apr 13, 2023

Allowing AF_VSOCK is an actual security issue. It's not namespaced in current kernels (#44670 (comment)), which means that containers can effectively escape their sandbox and obtain access outside of it.

Unless I missed a change in Linux, AF_VSOCK can only be used for guest<->host communication. So the only "escape" a container gains is the ability to send messages to a known, expecting application on the host side. It is an implementation of the official virtio standard in the kernel, which apparently doesn't see it as a risk.

So what exactly is the security issue? And if this is a security issue, why is the answer to grant even more permissions by running the workload in a privileged pod?

For sure; that may not always be under the user's control, though: cloud providers may set up these sockets, and containers would then be able to access them.

That's exactly my point: if the host allows these sockets, why is the guest prohibited from using them?

All of that combined, the decision was to make this something a container must be explicitly granted access to, not a default. That said, I didn't see a strong rejection of an option that makes this easier to set up (without having to write a fully custom seccomp profile); partial discussion happened in #44670. I see that one was closed, but perhaps we need a ticket to discuss whether improvements can be made there.

I understand. Does this issue count as that ticket? ;-)

@neersighted
Member

neersighted commented Apr 13, 2023

The primary issue is that AF_VSOCK is not contained, and 'containers' are supposed to namespace every kernel resource by default. Allowing access to an abstract resource that can represent a high level of privilege out of the box is something that most users find objectionable, and surprising.

Now, this is not a Linux example, as I am less aware of which Linux users of AF_VSOCK exist, but in the Hyper-V world (AF_HYPERV is the Windows version of AF_VSOCK), one of the standard integration services allows execution of PowerShell scripts as the Administrator user, as well as arbitrary file copies, in all guest VMs. Anything with access to those AF_HYPERV sockets on the host (which are restricted to the Hyper-V service only, thanks to Windows' architecture) thus has maximum privilege in the VMs.

The motivation for the change in containerd was the same -- KubeVirt is doing something similar, and realized that allowing arbitrary Pods maximum privilege in their managed VMs was the status quo without filtering AF_VSOCK by default.

Similarly, we took the change in moby/moby as we had also filtered AF_VSOCK in the past, and generally intended to do so.

I don't think a blanket revert is a good idea, for several reasons:

  • AF_VSOCK really can represent a high level of privilege and is 'uncontainable' -- we generally don't want to make resources we can't containerize available by default
  • As this is a real increase in security posture, reverting it suddenly will be potentially dangerous and misleading to the majority of users for whom the change is positive
  • I don't think we should deviate too far from containerd's seccomp profile anyway; any revert in moby/moby is likely predicated on a revert in containerd/containerd first

Any breaking change (even fixing a bug) always breaks somebody, and the maintainers are always doing the best they can to try and balance many concerns. I think the right balance was struck here with the immediate change, but at the same time, there are likely additional users of containers who would like to use AF_VSOCK at some point and fully understand the implications, and there should be an option for them.

What we discussed in principle, and what seems like the best route forward here, is adding some option to allow AF_VSOCK that is more ergonomic than a custom seccomp profile. The idea of virtual capabilities was discussed, but generally rejected as capabilities really are a 'proper noun' concept in the Linux kernel.

Instead, it seems like some more granular mechanism for adding/removing 'entitlements' to a container is the way forward (see #32801); however, I don't think we've ruled out something more short-term either (e.g. maybe a daemon-level flag).

@hammerg
Author

hammerg commented Apr 18, 2023

The primary issue is that AF_VSOCK is not contained, and 'containers' are supposed to namespace every kernel resource by default. Allowing access to an abstract resource that can represent a high level of privilege out of the box is something that most users find objectionable, and surprising.

To me it is more surprising to have to grant full privileges when only a transport mechanism is required. The AF_VSOCK resource might not be "contained", but it is also a very limited resource that is easily controlled by the host.

Now, this is not a Linux example, as I am less aware of which Linux users of AF_VSOCK exist, but in the Hyper-V world (AF_HYPERV is the Windows version of AF_VSOCK), one of the standard integration services allows execution of PowerShell scripts as the Administrator user, as well as arbitrary file copies, in all guest VMs. Anything with access to those AF_HYPERV sockets on the host (which are restricted to the Hyper-V service only, thanks to Windows' architecture) thus has maximum privilege in the VMs.

The motivation for the change in containerd was the same -- KubeVirt is doing something similar, and realized that allowing arbitrary Pods maximum privilege in their managed VMs was the status quo without filtering AF_VSOCK by default.

So you give one example of a use case where a container's access to AF_VSOCK also implies administrator access, and force everyone to match that scheme? In our use case we need a way to communicate with the host without running a privileged container. After considering several options we chose AF_VSOCK, only to find out it was blocked a few months ago.

Similarly, we took the change in moby/moby as we had also filtered AF_VSOCK in the past, and generally intended to do so.

It was also not filtered in the past, and a previous attempt to filter it was reverted.

I don't think a blanket revert is a good idea, for several reasons:

* `AF_VSOCK` really can represent a high level of privilege and is 'uncontainable' -- we generally don't want to make resources we can't containerize available by default

I don't agree that it is a "high level of privilege". That is like saying a TCP socket (which unprivileged processes are permitted to use) represents a high level of privilege because it allows connections to resources on the host that are not containerized.

* As this is a real increase in security posture, reverting it suddenly will be potentially dangerous and misleading to the majority of users for whom the change is positive

What security posture? Can you give a concrete example of how it could be exploited?

* I don't think we should deviate too far from containerd's seccomp profile anyway; any revert in moby/moby is likely predicated on a revert in containerd/containerd first

Should I start a similar thread in the containerd/containerd project as well then?

Any breaking change (even fixing a bug) always breaks somebody, and the maintainers are always doing the best they can to try and balance many concerns. I think the right balance was struck here with the immediate change, but at the same time, there are likely additional users of containers who would like to use AF_VSOCK at some point and fully understand the implications, and there should be an option for them.

That didn't seem to bother you when access to AF_VSOCK was blocked, forcing previously unprivileged pods to run with full privileges.

What we discussed in principle, and what seems like the best route forward here, is adding some option to allow AF_VSOCK that is more ergonomic than a custom seccomp profile. The idea of virtual capabilities was discussed, but generally rejected as capabilities really are a 'proper noun' concept in the Linux kernel.

Instead, it seems like some more granular mechanism for adding/removing 'entitlements' to a container is the way forward (see #32801); however, I don't think we've ruled out something more short-term either (e.g. maybe a daemon-level flag).

I'm open to hearing about any solution that will save me long talks with our deployment engineers :-).

Gal.

@szaimen

szaimen commented May 1, 2023

Hi, I think I am running into this problem after updating Docker Desktop on Windows to v4.19, when trying to connect to a bind-mounted Unix socket inside a container as an unprivileged (non-root) user. Connecting to the socket as the root user inside the container still works, but it apparently broke with the update, because it also worked with the unprivileged user before the update.

How can I work around the problem for the time being? I tried --security-opt seccomp=unconfined as documented in https://docs.docker.com/engine/security/seccomp/#run-without-the-default-seccomp-profile but it didn't make it work...

@thaJeztah
Member

If disabling seccomp doesn't fix your issue, then it should be unrelated, and may be a general permissions issue or related to how Docker Desktop shares files between the host and the VM; does the socket inside the container have permissions that allow the non-root user to access it? If you have steps to reproduce, it might be better to report an issue in the Docker Desktop issue tracker (https://github.com/docker/for-win/issues)

@szaimen

szaimen commented May 1, 2023

Thanks! Reported as docker/for-win#13447

@cpuguy83
Member

cpuguy83 commented May 1, 2023

@hammerg

I don't agree that it is a "high level of privilege". That is like saying a TCP socket (which unprivileged processes are permitted to use) represents a high level of privilege because it allows connections to resources on the host that are not containerized.

This is not the same thing as vsock.
Vsock is a host-side resource, very much similar to a network device in the host network namespace.
Work is even being done in the kernel to actually move vsock into the network namespace.

With the default configuration, a container cannot bind to ports on host interfaces.
With the default configuration, a container cannot connect to ports on the host's localhost.
This is how we need to think about vsock sockets.

We could do something like if --net=host then allow vsock.
Maybe if the container has CAP_NET_ADMIN we could also allow it, but I think that is a real stretch.

@kingudk

kingudk commented Jun 6, 2023

I'd like to see a fix as well. We use containers for build and test, and one piece of our software uses AF_VSOCK as its sole means of communication, so we need to be able to access AF_VSOCK to test the application.

That build broke a while ago, and I have not had time to dig into it until now. I do understand why the change was made, but such a change of behaviour should be more visible.

I don't think that it should be reverted, but a way to enable access to AF_VSOCK should be provided (and should have been provided when it was blocked). For now I can sidestep the issue by running that specific build with the unconfined seccomp profile, but I'd rather have a proper solution.

This issue has seemingly been dormant since May 1; is there any progress on it?

@AkihiroSuda
Member

I don't think that it should be reverted, but a way to enable access to AF_VSOCK should be provided (and should have been provided when it was blocked).

You can do that by setting --security-opt seccomp=unconfined, or by providing a custom seccomp profile JSON that allows AF_VSOCK.
This will not be the default, though, for the security reasons discussed above.
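
For anyone taking the custom-profile route: you don't have to write a profile from scratch. One approach is to copy moby's default profile (profiles/seccomp/default.json in this repository) and relax its socket rule so that the socket syscall is allowed without the argument filter that excludes AF_VSOCK. A rough sketch of what the relaxed entry could look like follows; the exact shape of the default rule may vary between releases, so treat this as an assumption to verify against your version:

```json
{
  "names": ["socket"],
  "action": "SCMP_ACT_ALLOW",
  "args": []
}
```

The resulting file can then be passed with docker run --security-opt seccomp=/path/to/custom-profile.json, which keeps the rest of the default confinement in place instead of dropping it entirely with seccomp=unconfined.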

@neersighted
Member

Closing this in favor of #44670.

@neersighted neersighted closed this as not planned Apr 3, 2024