[1.6/1.7] kubernetes ephemeral-storage limits not enforced with remote snapshotters #10095

Closed
Kern-- opened this issue Apr 19, 2024 · 3 comments

Kern-- commented Apr 19, 2024

Description

When using a remote snapshotter (or any other snapshotter that doesn't place snapshots under the containerd root directory), ephemeral storage limits are not enforced by the kubelet. The container can blow past its limits and keep running indefinitely.

The kubelet logs show errors like:

kubelet[3094]: E0419 15:57:23.046299    3094 cri_stats_provider.go:448] "Failed to get the info of the filesystem with mountpoint" err="failed to get device for dir \"/var/lib/containerd/io.containerd.snapshotter.v1.soci\": stat failed on /var/lib/containerd/io.containerd.snapshotter.v1.soci with error: no such file or directory" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.soci"

and

kubelet[3094]: E0419 15:56:55.022396    3094 kubelet.go:1436]  "Image garbage collection failed multiple times in a row" err="invalid capacity 0 on image filesystem"

It looks like the kubelet is unable to run ephemeral storage checks and image garbage collection because it's looking for image filesystem information in the wrong place.
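
To illustrate the failure mode, here is a minimal Go sketch of the lookup that appears to go wrong. It is not the kubelet or containerd code; `defaultSnapshotterRoot` and `checkImageFsPath` are hypothetical helpers, and the path layout is inferred only from the mountpoint in the first kubelet error above. The sketch assumes the reported image filesystem path is derived from the containerd root plus the snapshotter name, and shows that a `stat` on that path fails when the snapshotter keeps its data elsewhere.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// defaultSnapshotterRoot builds the path at which a snapshotter's data is
// ASSUMED to live: <containerd root>/io.containerd.snapshotter.v1.<name>.
// Hypothetical helper; the layout is taken from the kubelet log message.
func defaultSnapshotterRoot(containerdRoot, snapshotter string) string {
	return filepath.Join(containerdRoot, "io.containerd.snapshotter.v1."+snapshotter)
}

// checkImageFsPath stats the path, roughly as a stats provider would have to
// before it can report filesystem capacity and usage. Hypothetical helper.
func checkImageFsPath(path string) error {
	if _, err := os.Stat(path); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			return fmt.Errorf("image filesystem path %q does not exist: %w", path, err)
		}
		return err
	}
	return nil
}

func main() {
	// A remote snapshotter such as soci does not create this directory,
	// so the stat fails on hosts configured like the one in this report.
	path := defaultSnapshotterRoot("/var/lib/containerd", "soci")
	if err := checkImageFsPath(path); err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Println("found:", path)
}
```

Without a valid image filesystem path, no capacity can be reported for it, which would be consistent with the "invalid capacity 0 on image filesystem" error.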

Steps to reproduce the issue

  1. Configure containerd to use a remote snapshotter in a k8s environment
  2. Create a pod with an ephemeral storage limit:

     resources:
       limits:
         ephemeral-storage: 20M
       requests:
         ephemeral-storage: 10M

  3. Exec into the container and allocate more disk space than allowed:

     # fallocate -l 1G test1

  4. Observe that the pod does not get evicted and the kubelet logs show the errors above

Describe the results you received and expected

Expected: the pod is evicted and the kubelet logs show no errors.

Received: the pod keeps running past its limit and the kubelet logs show the errors above.

What version of containerd are you using?

containerd github.com/containerd/containerd 1.7.11 64b8a81

Any other relevant information

Related downstream issue awslabs/soci-snapshotter#1093

Show configuration if it is related to CRI plugin.

$ cat /etc/containerd/config.toml

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[proxy_plugins.soci]
type = "snapshot"
address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
discard_unpacked_layers = true
snapshotter = "soci"
# This line is required for containerd to send information about how to lazily load the image to the snapshotter
disable_snapshot_annotations = false

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

Kern-- commented Apr 19, 2024

Related to #9216


Kern-- commented Apr 19, 2024

From my investigation, this is fixed in 2.0/main by:

  1. "Split CRI image service from GRPC handler" (#9152), which refactored the CRI plugin to build a map from each snapshotter to its correct root directory, using the root key exported by the snapshotter or, failing that, the default hard-coded path
  2. "Add exports to proxy plugin config" (#9253), which allows proxy plugins to have exports
  3. "Snapshotters: Export the root path" (#10073), which exports the snapshotter root for the remaining snapshotters that didn't export it before
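
The resolution logic described in the first item can be sketched roughly as follows. This is a simplified illustration, not the code from #9152: the `pluginInfo` type and the `snapshotterRoots` function are hypothetical, and the soci root path used in `main` is an assumed example value.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// pluginInfo is a hypothetical stand-in for a loaded snapshotter plugin and
// the key/value pairs it exports.
type pluginInfo struct {
	Name    string
	Exports map[string]string
}

// snapshotterRoots maps each snapshotter name to its root directory. An
// exported "root" key wins; otherwise the default path under the containerd
// root is used.
func snapshotterRoots(containerdRoot string, plugins []pluginInfo) map[string]string {
	roots := make(map[string]string, len(plugins))
	for _, p := range plugins {
		if root, ok := p.Exports["root"]; ok && root != "" {
			roots[p.Name] = root
			continue
		}
		roots[p.Name] = filepath.Join(containerdRoot, "io.containerd.snapshotter.v1."+p.Name)
	}
	return roots
}

func main() {
	roots := snapshotterRoots("/var/lib/containerd", []pluginInfo{
		// No export: falls back to the default path under the containerd root.
		{Name: "overlayfs"},
		// Assumed example export for a proxy plugin that stores data elsewhere.
		{Name: "soci", Exports: map[string]string{"root": "/var/lib/soci-snapshotter-grpc"}},
	})
	fmt.Println(roots["overlayfs"])
	fmt.Println(roots["soci"])
}
```

With a map like this, the path reported as the image filesystem follows the snapshotter's actual location, which is why the second and third items (letting proxy plugins and the remaining snapshotters export a root) are needed for the fix to cover every snapshotter.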

Rebasing #9152 onto 1.6/1.7 would be tricky because there's a lot of structural change. #9216 was an attempt to fix this before the structural changes and would probably be a better starting point.


Kern-- commented May 6, 2024

This is fixed in containerd 1.7.16.

1.6 backport is still pending.

Kern-- closed this as completed May 10, 2024