
Revert "seccomp: block socket calls to AF_VSOCK in default profile" (commit 57b2290) #45317

Closed
hammerg opened this issue Apr 13, 2023 · 14 comments
Labels
area/security/seccomp area/security kind/enhancement status/0-triage

Comments

@hammerg

hammerg commented Apr 13, 2023

Description

In our project we would like to use vsock for guest-host communication. However, due to commit 57b2290, this now requires either a custom seccomp profile or a privileged container. I've read the reasoning in the commit message, but I fail to understand its logic.

The claim seems to be that they want to use vsock as well, and because they don't know how to limit connection attempts, the solution is to block everyone from using this feature, with "security" as the excuse. How is giving full privileges to a pod that only needs inter-process communication with its host more secure?

The existence of the vsock device is controlled by the virtual machine. If a host doesn't want to allow guest-to-host communication, it simply doesn't create the virtio-vsock device. And when this kind of communication is needed, there is either a listener application which expects these connections and can reject unwanted ones, or an application that initiates the connection to a guest application, in which case one might choose a lower port number that requires capabilities to bind (a minimal sketch of both sides follows below).
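
To make that pattern concrete, here is a minimal sketch of both sides of such a guest-host exchange, using Python's AF_VSOCK support (Linux-only, available since Python 3.7). The port number, payload, and the guest-CID check are purely illustrative assumptions, not part of any existing setup; CID 2 is the well-known address of the host.

```python
import socket

PORT = 5000  # illustrative port; vsock ports below 1024 require extra privileges to bind

def host_listener():
    """Host side: a listener that expects guest connections and can reject unwanted ones."""
    srv = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    srv.bind((socket.VMADDR_CID_ANY, PORT))     # accept connections from any guest CID
    srv.listen(1)
    conn, (peer_cid, peer_port) = srv.accept()  # vsock addresses are (CID, port) pairs
    if peer_cid != 3:                           # example policy: only accept guest CID 3
        conn.close()
    else:
        print("guest says:", conn.recv(4096))
        conn.close()
    srv.close()

def guest_connect():
    """Guest side (e.g. a process inside the VM): connect to the host's listener."""
    s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    s.connect((socket.VMADDR_CID_HOST, PORT))   # VMADDR_CID_HOST == 2, the well-known host CID
    s.sendall(b"hello from the guest")
    s.close()
```

Under the default profile introduced by commit 57b2290, it is the socket() call itself that is denied inside a container (it typically surfaces as an "operation not permitted" error), which is why a custom profile or a privileged/unconfined run is currently needed.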

So, since the default seccomp profile allows other IPC mechanisms, such as pipes and sockets, I request that this rule be removed. I believe it was reverted in an older branch as well.

Thanks,

Gal.
@hammerg hammerg added kind/feature status/0-triage labels Apr 13, 2023
@sam-thibault
Contributor

Noting PR for easy reference:

@sam-thibault
Contributor

@thaJeztah can you comment here?

@sam-thibault sam-thibault added area/security kind/enhancement and removed kind/feature labels Apr 13, 2023
@thaJeztah
Member

The claim seems to be that they want to use vsock as well, and because they don't know how to limit connection attempts, the solution is to block everyone from using this feature, with "security" as the excuse.

Allowing AF_VSOCK is an actual security issue. It's not namespaced in current kernels (#44670 (comment)), which means that containers can effectively escape their sandbox and obtain access outside of it.

The existence of the vsock device is controlled by the virtual machine. If a host doesn't want to allow guest-to-host communication, it simply doesn't create the virtio-vsock device.

For sure; that may not always be under the user's control, though: cloud providers may set up these sockets, and containers would then be able to access them.

All of that combined, the decision was to make this something a container must be explicitly granted access to, not a default. That said, I didn't see a strong rejection of an option that makes this easier to set up (without having to write a fully custom seccomp profile); partial discussion happened in #44670. I see that one was closed, but perhaps we need a ticket to discuss whether improvements can be made there.

@thaJeztah
Member

/cc @neersighted @AkihiroSuda

@hammerg
Author

hammerg commented Apr 13, 2023

Allowing AF_VSOCK is an actual security issue. It's not namespaced in current kernels (#44670 (comment)), which means that containers can effectively escape their sandbox and obtain access outside of it.

Unless I missed a change in Linux, AF_VSOCK can only be used for guest<->host communication. So the only "escape" a container gains is the ability to send messages to a known, expecting application on the host side. It is an implementation of the official virtio standard in the kernel, which apparently doesn't see it as a risk.

So what exactly is the security issue? And if this is a security issue, why is the answer to grant even more permissions by running the workload in a privileged pod?

For sure; that may not always be under the user's control, though: cloud providers may set up these sockets, and containers would then be able to access them.

That's exactly my point: if the host allows these sockets, why is the guest prohibited from using them?

All of that combined, the decision was to make this something a container must be explicitly granted access to, not a default. That said, I didn't see a strong rejection of an option that makes this easier to set up (without having to write a fully custom seccomp profile); partial discussion happened in #44670. I see that one was closed, but perhaps we need a ticket to discuss whether improvements can be made there.

I understand. Does this issue count as that ticket? ;-)

@neersighted
Member

neersighted commented Apr 13, 2023

The primary issue is that AF_VSOCK is not contained, and 'containers' are supposed to namespace every kernel resource by default. Allowing access to an abstract resource that can represent a high level of privilege out of the box is something that most users find objectionable, and surprising.

Now, this is not a Linux example, as I am less aware of which Linux users of AF_VSOCK exist, but in the Hyper-V world (AF_HYPERV is the Windows version of AF_VSOCK), one of the standard integration services allows execution of PowerShell scripts as the Administrator user, as well as arbitrary file copies, in all guest VMs. Anything with access to those AF_HYPERV sockets on the host (which are restricted to the Hyper-V service only, thanks to Windows' architecture) thus has maximum privilege in the VMs.

The motivation for the change in containerd was the same -- KubeVirt is doing something similar, and realized that allowing arbitrary Pods maximum privilege in their managed VMs was the status quo without filtering AF_VSOCK by default.

Similarly, we took the change in moby/moby as we had also filtered AF_VSOCK in the past, and generally intended to do so.

I don't think a blanket revert is a good idea, for several reasons:

  • AF_VSOCK really can represent a high level of privilege and is 'uncontainable' -- we generally don't want to make resources we can't containerize available by default
  • As this is a real increase in security posture, reverting it suddenly will be potentially dangerous and misleading to the majority of users for whom the change is positive
  • I don't think we should deviate too far from containerd's seccomp profile anyway; any revert in moby/moby is likely predicated on a revert in containerd/containerd first

Any breaking change (even fixing a bug) always breaks somebody, and the maintainers are always doing the best they can to try and balance many concerns. I think the right balance was struck here with the immediate change, but at the same time, there are likely additional users of containers who would like to use AF_VSOCK at some point and fully understand the implications, and there should be an option for them.

What we discussed in principle, and what seems like the best route forward here, is adding some option to allow AF_VSOCK that is more ergonomic than a custom seccomp profile. The idea of virtual capabilities was discussed, but generally rejected as capabilities really are a 'proper noun' concept in the Linux kernel.

Instead, it seems like some more granular mechanism for adding/removing 'entitlements' to a container is the way forward (see #32801); however, I don't think we've ruled out something more short-term either (e.g. maybe a daemon-level flag).

@hammerg
Author

hammerg commented Apr 18, 2023

The primary issue is that AF_VSOCK is not contained, and 'containers' are supposed to namespace every kernel resource by default. Allowing access to an abstract resource that can represent a high level of privilege out of the box is something that most users find objectionable, and surprising.

To me it is more surprising to have to grant full privileges when only a transport mechanism is required. The AF_VSOCK resource might not be "contained", but it is also a very limited resource that is easily controlled by the host.

Now, this is not a Linux example, as I am less aware of which Linux users of AF_VSOCK exist, but in the Hyper-V world (AF_HYPERV is the Windows version of AF_VSOCK), one of the standard integration services allows execution of PowerShell scripts as the Administrator user, as well as arbitrary file copies, in all guest VMs. Anything with access to those AF_HYPERV sockets on the host (which are restricted to the Hyper-V service only, thanks to Windows' architecture) thus has maximum privilege in the VMs.

The motivation for the change in containerd was the same -- KubeVirt is doing something similar, and realized that allowing arbitrary Pods maximum privilege in their managed VMs was the status quo without filtering AF_VSOCK by default.

So you give one example of a use case where a container's access to AF_VSOCK also implies administrator access, and force everyone to match that scheme? In our use case we need a way to communicate with the host without running a privileged container. After considering several options we chose AF_VSOCK, only to find out it was blocked a few months ago.

Similarly, we took the change in moby/moby as we had also filtered AF_VSOCK in the past, and generally intended to do so.

It was also not filtered in the past, and a previous attempt to filter it was reverted.

I don't think a blanket revert is a good idea, for several reasons:

* `AF_VSOCK` really can represent a high level of privilege and is 'uncontainable' -- we generally don't want to make resources we can't containerize available by default

I don't agree that it is a "high level of privilege". That is like saying a TCP socket (which unprivileged processes are permitted to use) represents a high level of privilege because it allows connections to resources on the host that are not containerized.

* As this is a real increase in security posture, reverting it suddenly will be potentially dangerous and misleading to the majority of users for whom the change is positive

What security posture? Can you give a concrete example of how it could be exploited?

* I don't think we should deviate too far from containerd's seccomp profile anyway; any revert in moby/moby is likely predicated on a revert in containerd/containerd first

Should I start a similar thread in the containerd/containerd project as well then?

Any breaking change (even fixing a bug) always breaks somebody, and the maintainers are always doing the best they can to try and balance many concerns. I think the right balance was struck here with the immediate change, but at the same time, there are likely additional users of containers who would like to use AF_VSOCK at some point and fully understand the implications, and there should be an option for them.

That didn't seem to bother you when access to AF_VSOCK was blocked, forcing previously unprivileged pods to run with full privileges.

What we discussed in principle, and what seems like the best route forward here, is adding some option to allow AF_VSOCK that is more ergonomic than a custom seccomp profile. The idea of virtual capabilities was discussed, but generally rejected as capabilities really are a 'proper noun' concept in the Linux kernel.

Instead, it seems like some more granular mechanism for adding/removing 'entitlements' to a container is the way forward (see #32801); however, I don't think we've ruled out something more short-term either (e.g. maybe a daemon-level flag).

I'm open to hearing about any solution that will save me long talks with our deployment engineers :-).

Gal.

@szaimen

szaimen commented May 1, 2023

Hi, I think I am running into this problem after updating Docker Desktop on Windows to v4.19, when trying to connect to a bind-mounted Unix socket inside a container as an unprivileged (non-root) user. Connecting to the socket as the root user inside the container still works, but it apparently broke with the update, because it also worked with the unprivileged user before the update.

How can I work around the problem for the time being? I tried --security-opt seccomp=unconfined as documented in https://docs.docker.com/engine/security/seccomp/#run-without-the-default-seccomp-profile but it didn't make it work...

@thaJeztah
Member

If disabling seccomp doesn't fix your issue, then it should be unrelated, and may be a general permissions issue or related to how Docker Desktop shares files between the host and the VM; does the socket inside the container have permissions that allow the non-root user to access it? If you have steps to reproduce, it might be better to report an issue in the Docker Desktop issue tracker (https://github.com/docker/for-win/issues)

@szaimen

szaimen commented May 1, 2023

Thanks! Reported as docker/for-win#13447

@cpuguy83
Member

cpuguy83 commented May 1, 2023

@hammerg

I don't agree that it is a "high level of privilege". That is like saying a TCP socket (which unprivileged processes are permitted to use) represents a high level of privilege because it allows connections to resources on the host that are not containerized.

This is not the same thing as vsock.
Vsock is a host-side resource, very much similar to a network device in the host network namespace.
Work is even being done in the kernel to actually move vsock into the network namespace.

With the default configuration, a container cannot bind to ports on host interfaces.
With the default configuration, a container cannot connect to ports on the host's localhost.
This is how we need to think about vsock sockets.

We could do something like if --net=host then allow vsock.
Maybe if the container has CAP_NET_ADMIN we could also allow it, but I think that is a real stretch.

@kingudk

kingudk commented Jun 6, 2023

I'd like to see a fix as well. We use containers for build and test, and one piece of our software uses AF_VSOCK as its sole means of communication, so we need to be able to access AF_VSOCK to test the application.

That build broke a while ago, and I have not had time to dig into it until now. I do understand why the change was made, but such a change of behaviour should be more visible.

I don't think that it should be reverted, but a way to enable access to AF_VSOCK should be provided (and should have been provided when it was blocked). For now I can sidestep the issue by running that specific build with the unconfined seccomp profile, but I'd rather have a proper solution.

This issue has seemingly been dormant since May 1; is there any progress on it?

@AkihiroSuda
Member

I don't think that it should be reverted, but a way to enable access to AF_VSOCK should be provided (and should have been provided when it was blocked).

You can do that by setting --security-opt seccomp=unconfined, or by providing a custom seccomp profile JSON that allows AF_VSOCK.
This will not be the default, though, for the security reasons discussed above.
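
For anyone taking the custom-profile route: you don't have to write a profile from scratch. One approach is to copy moby's default profile (profiles/seccomp/default.json in this repository) and relax its socket rule so that the socket syscall is allowed without the argument filter that excludes AF_VSOCK. A rough sketch of what the relaxed entry could look like follows; the exact shape of the default rule may vary between releases, so treat this as an assumption to verify against your version:

```json
{
  "names": ["socket"],
  "action": "SCMP_ACT_ALLOW",
  "args": []
}
```

The resulting file can then be passed with docker run --security-opt seccomp=/path/to/custom-profile.json, which keeps the rest of the default confinement in place instead of dropping it entirely with seccomp=unconfined.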

@neersighted
Member

Closing this in favor of #44670.

@neersighted neersighted closed this as not planned Apr 3, 2024