
Kubelet needs SELinux mounts to allow auto volume relabeling #1123

Closed
sedlund opened this issue Feb 17, 2022 · 9 comments · Fixed by #1152

Comments


sedlund commented Feb 17, 2022

Description

Kubelet provides a mechanism for SELinux context relabeling, but Typhoon does not supply the bind mounts needed to enable it.

Also see: #935

Steps to Reproduce

Install Rook Ceph storage. Pods created with a PVC will attach storage, but upon accessing the volume you receive:

kaf pod-ephemeral.yaml                                 # kaf: alias for `kubectl apply -f`
pod/csi-rbd-demo-ephemeral-pod created

keti -n default csi-rbd-demo-ephemeral-pod -- bash     # keti: alias for `kubectl exec -ti`

root@csi-rbd-demo-ephemeral-pod:/# df -h /myspace
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0       976M  2.6M  958M   1% /myspace
root@csi-rbd-demo-ephemeral-pod:/# ls -ld /myspace
drwxrwxrwx. 3 root root 4096 Feb 17 15:27 /myspace
root@csi-rbd-demo-ephemeral-pod:/# ls -l /myspace
ls: cannot open directory '/myspace': Permission denied
root@csi-rbd-demo-ephemeral-pod:/#
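
The pod-ephemeral.yaml used above is not shown; a minimal sketch of an equivalent pod, assuming Rook's RBD StorageClass (the class name, image, and sizes here are assumptions), looks like:

apiVersion: v1
kind: Pod
metadata:
  name: csi-rbd-demo-ephemeral-pod
spec:
  containers:
    - name: demo
      image: debian:bullseye
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /myspace
          name: myspace
  volumes:
    - name: myspace
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: rook-ceph-block   # assumption
            resources:
              requests:
                storage: 1Gi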

/var/log/audit/audit.log

time->Thu Feb 17 16:20:42 2022
type=AVC msg=audit(1645114842.211:658): avc:  denied  { read } for  pid=9558 comm="ls" name="/" dev="rbd0" ino=2 scontext=system_u:system_r:container_t:s0:c591,c614 tcontext=system_u:object_r:unlabeled_t:s0 tclass=dir permissive=0
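
To confirm the denial from the host, a sketch of the checks on the affected node (the pod UID and PVC name are placeholders, and ausearch assumes the audit tools are installed):

# list recent SELinux denials on the node
sudo ausearch -m avc -ts recent
# check the volume's on-host label; a broken mount shows unlabeled_t
sudo ls -ldZ /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pvc-name>/mount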

Expected behavior

SELinux relabeling to work properly.

Environment

  • Platform: bare-metal
  • OS: fedora-coreos 35
  • Release: Typhoon v1.23.3

Possible Solution

From: rook/rook#7575 (comment)

Explains the issue well. Adding two bind mounts to the kubelet allows it to do relabeling.

  --volume /etc/selinux:/etc/selinux \
  --volume /sys/fs/selinux:/sys/fs/selinux
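
On FCOS, Typhoon runs the kubelet via podman from a systemd unit; a sketch of where these flags would land in kubelet.service (abbreviated, and the surrounding flags and image tag are illustrative, not the real unit contents):

ExecStart=/usr/bin/podman run --name kubelet \
  ...existing flags and volume mounts... \
  --volume /etc/selinux:/etc/selinux \
  --volume /sys/fs/selinux:/sys/fs/selinux \
  quay.io/poseidon/kubelet:v1.23.3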

log1cb0mb commented Feb 20, 2022

I completely forgot to submit a PR for this, which I had planned to do after figuring out the issue with Rook/Ceph dynamic PV mounts, as detailed in my comment.
A bit related on the topic: as mentioned in rook/rook#7575 (comment), the relabel flag in --volume /var/lib/kubelet:/var/lib/kubelet:rshared,z for the whole directory should also be removed, since it strips the container-specific context labeling once the kubelet restarts. I wonder if that flag is a typo, because we do not seem to do the same for Flatcar, where Docker is used to run the kubelet:

-v /var/lib/kubelet:/var/lib/kubelet:rshared \

dghubble (Member) commented Apr 25, 2022

@sedlund Can you provide a clear, minimal repro? Not one involving all of Rook, Ceph, or another whole system. AWS CSI volumes work fine, for example.

@log1cb0mb read more about the /var/lib/kubelet relabel in the original commit 72c94f1. People removing it in a personal fork (unsupported) run into problems like #1142

sedlund (Author) commented Apr 25, 2022

@dghubble I've had to switch to a kubeadm-based install because of containerd/containerd#6767, so I can't carry the torch for this one, sorry.

solacelost (Contributor) commented Apr 25, 2022

@dghubble - I've gone ahead and built you a minimal reproducer.

tempest.tf

provider "aws" {
  region = "us-east-2"
  shared_credentials_files = [
    "/home/<your user>/.aws/credentials"
  ]
}

provider "ct" {}

terraform {
  required_providers {
    ct = {
      source  = "poseidon/ct"
      version = "0.10.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "4.5.0"
    }
  }
}

module "tempest" {
  source = "git::https://github.com/poseidon/typhoon//aws/fedora-coreos/kubernetes?ref=v1.23.6"

  # AWS
  cluster_name = "tempest"
  dns_zone     = "<your zone>"
  dns_zone_id  = "<your zone id>"

  # configuration
  ssh_authorized_key = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMP2QHu8XD6z4OOftE9J6z9CIc3lhnE1yKI460mzmCB3 jharmison@gmail.com"

  # optional
  worker_count = 2
  worker_type  = "t3.small"
}

resource "local_file" "kubeconfig-tempest" {
  content  = module.tempest.kubeconfig-admin
  filename = "/home/<your user>/.kube/tempest.config"
}

aws-iam-secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
  namespace: kube-system
stringData:
  key_id: "<redacted>"
  access_key: "<redacted>"

storageclass.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: xfs
  type: io1
  iopsPerGB: "50"
  encrypted: "true"

Commands:

# deploy cluster
terraform apply -auto-approve
export KUBECONFIG=~/.kube/tempest.config
kubectl get nodes -w
# wait for nodes to come ready
# apply ebs CSI
kubectl apply -f aws-iam-secret.yaml
kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.5"
kubectl rollout status -w deploy/ebs-csi-controller -n kube-system
kubectl apply -f storageclass.yaml
# run a simple test using the default CSI-backed storageclass
kubectl apply -f https://raw.githubusercontent.com/yasker/kbench/main/deploy/fio.yaml
kubectl logs -l kbench=fio -f

My output:

$ kubectl logs -l kbench=fio -f
TEST_FILE: /volume/test
TEST_OUTPUT_PREFIX: test_device
TEST_SIZE: 30G
Benchmarking iops.fio into test_device-iops.json
fio: pid=0, err=13/file:filesetup.c:162, func=open, error=Permission denied
fio: pid=0, err=13/file:filesetup.c:162, func=open, error=Permission denied
fio: pid=0, err=13/file:filesetup.c:162, func=open, error=Permission denied
fio: pid=0, err=13/file:filesetup.c:162, func=open, error=Permission denied

solacelost (Contributor) commented Apr 25, 2022

I'll update my PR with a test of this exact same set of steps, simply using my branch, once I finish running through it.

edited to add: @sedlund I've been using a fork of Typhoon in which I've swapped containerd for cri-o on FCOS. There were some road bumps getting there (especially around Cilium and CNI), but they weren't insurmountable. I figure it goes against the ideas of Typhoon proper to add that much complexity to the options, but if there's willingness to look it over and consider an alternative runtime implementation, I'd be willing to contribute it.

dghubble (Member) commented

You're seeing that AWS CSI volumes show permission denied for file access? For simplicity, you should be able to remove kbench from the equation - just an alpine pod with the same mount, and touch a file to see the same symptom. To dig into why, can you inspect the container's volume from the host? Something doesn't add up. Are you using cri-o?
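
A minimal check along those lines (names hypothetical; assumes the default CSI-backed StorageClass applied earlier in this thread):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: relabel-test
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: relabel-test
spec:
  containers:
    - name: alpine
      image: alpine:3.15
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /volume
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: relabel-test
EOF
# fails with "Permission denied" while the volume is left unlabeled_t
kubectl exec relabel-test -- touch /volume/probe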

AWS CSI volumes are used on Typhoon regularly. They get relabeled as expected.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-ebs
provisioner: ebs.csi.aws.com
mountOptions:
  - context="system_u:object_r:container_file_t:s0"          <- notice
parameters:
  type: gp3
  fstype: ext4
  encrypted: "true"
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
sudo ls -alZ /var/lib/kubelet/pods/406d504b-f4d1-4eba-9b9b-9eeb0944e6bb/volumes/kubernetes.io~csi/pvc-8dfce316-d86e-437f-ae43-3e624b32a570/mount
total 66588
drwxr-xr-x. 3 root root system_u:object_r:container_file_t:s0     4096 Apr 25 16:01 .
drwxr-x---. 3 root root system_u:object_r:container_file_t:s0       40 Apr 21 03:28 ..
-rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 68161536 Apr 25 16:01 database.sqlite

I've written about why containerd was chosen over cri-o after a long period of evaluation here. I'm not looking to maintain both. It's important to me that Typhoon be what I actively run for real and what I can endorse.

solacelost (Contributor) commented

You're seeing that AWS CSI volumes show permission denied for file access?

Yes, the configs and commands I added here are exactly what I ran.

For simplicity, you should be able to remove kbench from the equation - just an alpine pod with the same mount, and touch a file to see the same symptom. To dig into why, can you inspect the container's volume from the host?

Sure, I can reprovision and do this tomorrow. It will show that the mount is lacking the container_file_t label. I've not been alone in having this problem (a year ago, as in the linked Rook issue), and I had it here as well.

Something doesn't add up. Are you using cri-o?

Not in the outputs linked above. Again, I ran exactly what I pasted, including an official Typhoon release. I'm using cri-o in my own cluster, also with the patch I PR'd, and it's working great.

AWS CSI volumes are used on Typhoon regularly. They get relabeled as expected.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-ebs
provisioner: ebs.csi.aws.com
mountOptions:
  - context="system_u:object_r:container_file_t:s0"          <- notice
parameters:
  type: gp3
  fstype: ext4
  encrypted: "true"
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

I have not been explicitly setting mountOptions on StorageClasses, and I did not expect to have to do so. The kubelet recognizes that SELinux relabelling is required on other systems and performs it automatically (as Typhoon does in the branch I PR'd yesterday). This is particularly useful with, for example, cri-o, which uses a MAC-style labelling system with unique contexts per pod, compared to containerd's simpler flat implementation of SELinux contexts. Though I can understand how setting mountOptions would work for you, it may make sense to document that somewhere, as it wasn't obvious to me.

I've written about why containerd was chosen over cri-o after a long period of evaluation here. I'm not looking to maintain both. It's important to me that Typhoon be what I actively run for real and what I can endorse.

I see no reason to revisit this and understand your points there. I'm using a oneshot unit with the experimental module-layering support and am happy to keep doing so, delaying releases to align with CRI-O as you mentioned. I'm also happy to maintain my own (private) fork and self-support in doing so. Typhoon's systems have worked well for me, and I appreciate the work you've put into it. I'm going to continue to maintain my patch set for my own infrastructure.


log1cb0mb commented Apr 26, 2022

@log1cb0mb read more about the /var/lib/kubelet relabel in the original commit 72c94f1. People removing it in a personal fork (unsupported) run into problems like #1142

@dghubble The relabel flag does solve or help with issues like #1142, but that particular issue is containerd's relabelling being broken. I already opened an issue about that: containerd/containerd#6767

As mentioned, relabelling the whole directory is not only harmful from a security-hardening perspective, since it removes container-specific context labels, but can also cause issues like kubernetes/kubernetes#69799.

The security-hardening details are in this comment of mine: rook/rook#7575 (comment)

For Typhoon, if the default configuration is being used, one should consider removing the relabel flag (:z) set on /var/lib/kubelet in the kubelet service: --volume /var/lib/kubelet:/var/lib/kubelet:rshared,z
Otherwise, volume mounts do get relabelled with a proper container/pod-specific context label/level (e.g. system_u:object_r:container_file_t:s0:c96,c345), but with the :z flag set on the whole of /var/lib/kubelet in the kubelet service, every service restart relabels everything, including volume mounts, back to the general system_u:object_r:container_file_t:s0 context, which means any container that escapes will be able to access the data of other containers.
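
A sketch of that failure mode (the paths and category pair are hypothetical):

# volume mount carries pod-specific MCS categories while the pod runs
sudo ls -dZ /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pvc>/mount
# system_u:object_r:container_file_t:s0:c96,c345

# restarting the kubelet re-applies the :z relabel across /var/lib/kubelet ...
sudo systemctl restart kubelet.service

# ... and the categories are stripped back to the flat context
sudo ls -dZ /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pvc>/mount
# system_u:object_r:container_file_t:s0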

From the original commit:

SELinux relabel /var/lib/kubelet so ConfigMaps can be read by containers

Isn't this part actually done by the container runtime, which relabels with pod-specific context labels, so why would the kubelet need to relabel it? More importantly, ConfigMap and similar volume mounts are generated at runtime and then labelled with the appropriate pod-specific context labels, so the volume is already accessible to the container anyway. The relabel from the kubelet is only applied once the kubelet service reloads.

In short, the kubelet itself should not perform any relabelling; it should simply pass relabelling info to the container runtime, as it is supposed to, and the container runtime should take care of the appropriate relabelling. This assumes the kubelet has those SELinux bind mounts, so that it can set "selinuxRelabel": true in the container info passed to the container runtime, telling it to perform the relabelling.
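
One way to see what the kubelet hands the runtime (the jq filter is an assumption about the shape of the inspect output):

sudo crictl ps                        # find a container ID using the volume
sudo crictl inspect <container-id> \
  | jq '.status.mounts[] | {containerPath, selinuxRelabel}'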

Some background on how I discovered this behaviour: with the NetApp ONTAP/Trident CSI and its volume mounts, the ONTAP version I was using did not support sending or attaching seclabel info to NFS mounts, which led to the kubelet failing to perform the SELinux relabelling operation with an error like:
podman[307369]: Error: lsetxattr /var/lib/kubelet/pods/bd55b362-d62b-46e3-8402-23963e3a51d4/volumes/kubernetes.io~csi/pvc-d91b6a09-ce32-4c80-b309-fcb84c30523b/mount: operation not supported

This led to the kubelet service failing to start completely, as the relabel operation kept failing; the workaround was to remove those mounts so that the kubelet wouldn't have to relabel any directory/mount without a seclabel attached to it.

dghubble (Member) commented

Alright, closing in on a concrete rationale.

Using an AWS CSI StorageClass without an explicit mount option, the mount will have the following context (and not be accessible from within the container):

drwx------. 2 root root system_u:object_r:unlabeled_t:s0      16384 Apr 26 16:11 lost+found

You expect the Kubelet to automatically relabel a volume. With the mount flags,

--volume /etc/selinux:/etc/selinux \
--volume /sys/fs/selinux:/sys/fs/selinux

the mount will have the following context (the category labels are random, of course) and be accessible from the container:

drwx------. 2 root root system_u:object_r:container_file_t:s0:c170,c708 16384 Apr 26 16:25 lost+found

And on recreate, the volume is relabeled again.

This is a better technical rationale than the various mentions of adding the flags to get apps or vendor products to work. @solacelost can you update your commit message with this info? Or I can formulate it so it's a good record.

/var/lib/kubelet

Kubelet mounts /var/lib/kubelet with z. If you remove this flag, the first breakage you'll hit is ConfigMaps not being readable. That was true 3 years ago with docker-shim and remains true with containerd, as well as in other cases (CNI init-containers). It's pragmatic, and the flag cannot be removed currently.
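
A quick spot-check of that case (the pod UID and ConfigMap name are placeholders):

# ConfigMap volume files must carry a container-readable context; without the z
# flag they keep whatever context /var/lib/kubelet inherited
sudo ls -alZ /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~configmap/<name>/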

FCOS nodes are still SELinux enforcing, and /var/lib/kubelet files are still context-labeled. They don't always use separate MLS labels per container, but as you've seen, components continue to have shaky SELinux handling here (it's not just containerd; some workloads need consistent MLS contexts between host reboots, if enabled). The z relabel avoids these cases but keeps most context and enforcement unchanged; it's a reasonable balance.

For the remainder of this thread, I'll focus on the OP's issue.

@dghubble dghubble changed the title kubelet needs bind mounts to allow selinux relabeling Kubelet needs SeLinux mounts to allow auto volume relabeling Apr 26, 2022
@dghubble dghubble changed the title Kubelet needs SeLinux mounts to allow auto volume relabeling Kubelet needs SELinux mounts to allow auto volume relabeling Apr 26, 2022
dghubble pushed a commit that referenced this issue Apr 27, 2022
fixes #1123

Enables the use of CSI drivers with a StorageClass that lacks an explicit context mount option. In cases where the kubelet lacks mounts for `/etc/selinux` and `/sys/fs/selinux`, it is unable to set the `:Z` option for the CRI volume definition automatically. See [KEP 1710](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/1710-selinux-relabeling/README.md#volume-mounting) for more information on how SELinux is passed to the CRI by Kubelet.

Prior to this change, a not-explicitly-labelled mount would have an `unlabeled_t` SELinux type on the host. Following this change, the Kubelet and CRI work together to dynamically relabel mounts that lack an explicit context specification every time it is rebound to a pod with SELinux type `container_file_t` and appropriate context labels to match the specifics for the pod it is bound to. This enables applications running in containers to consume dynamically provisioned storage on SELinux enforcing systems without explicitly setting the context on the StorageClass or PersistentVolume.
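
After the fix lands, a quick way to verify on a node (paths hypothetical): a freshly bound CSI volume should pick up container_file_t with per-pod categories instead of unlabeled_t, with no StorageClass mountOptions needed.

sudo ls -ldZ /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pvc>/mount
# system_u:object_r:container_file_t:s0:c170,c708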
Snaipe pushed a commit to aristanetworks/monsoon that referenced this issue Apr 13, 2023