Description

We are providing a service that deploys and manages pull through caches.
We use the storage.delete.enabled option to denote whether garbage collection of images for the pull through cache is enabled.
This setting on our side is mutable: users can control whether GC (under the hood, the storage.delete.enabled option) is enabled.
However, there is a case where this migration doesn't behave as expected. If you run a proxy with storage.delete.enabled=false, it adds entries to the scheduler-state.json file and, after the ttl has passed, tries to remove the blobs/manifests. With storage.delete.enabled=false the blob does not get removed, but the corresponding entry in the scheduler-state.json file is removed anyway. Hence, the blob can never be garbage collected, even if the proxy is later restarted with storage.delete.enabled=true.
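The faulty sequence can be reproduced with a minimal, self-contained model of the behavior described above (illustrative only, not the actual distribution code; the type and field names here are made up):

```go
package main

import (
	"errors"
	"fmt"
)

// errUnsupported mimics the "operation unsupported" error returned when
// deletion is disabled via storage.delete.enabled=false.
var errUnsupported = errors.New("operation unsupported")

type registry struct {
	deleteEnabled bool
	blobs         map[string]bool // blobs present on the file system
	schedule      map[string]bool // entries in scheduler-state.json
}

// onExpire models the delete attempt for an expired entry.
func (r *registry) onExpire(key string) error {
	if !r.deleteEnabled {
		return errUnsupported
	}
	delete(r.blobs, key)
	return nil
}

// expireAll models the current behavior: the scheduler entry is dropped
// even when the delete attempt fails.
func (r *registry) expireAll() {
	for key := range r.schedule {
		if err := r.onExpire(key); err != nil {
			fmt.Printf("Scheduler error returned from OnExpire(%s): %v\n", key, err)
		}
		delete(r.schedule, key) // entry removed regardless of the error
	}
}

func main() {
	const blob = "library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402"
	r := &registry{
		deleteEnabled: false,
		blobs:         map[string]bool{blob: true},
		schedule:      map[string]bool{blob: true},
	}
	r.expireAll()

	// The blob is still on disk, but no scheduler entry points at it any
	// more, so enabling deletion later can never garbage collect it.
	fmt.Println("blob on disk:", r.blobs[blob])
	fmt.Println("scheduler entry:", r.schedule[blob])
}
```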
Reproduce
Run a registry in proxy mode with an upstream configured and with GC disabled (storage.delete.enabled=false). The following config is used:
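The exact config file is not included here; a minimal config along these lines should reproduce the setup (the remoteurl value is an assumption):

```yaml
version: 0.1
log:
  level: info
storage:
  filesystem:
    rootdirectory: /var/lib/registry
  delete:
    enabled: false
http:
  addr: :5000
proxy:
  remoteurl: https://registry-1.docker.io
```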
Pull the alpine:3.14.0 image from the registry proxy.
Make sure your scheduler-state.json file contains entries for the pulled blobs and manifests. Then edit the scheduler-state.json file: set one of the blob layers to expire soon by updating its ExpiryData field. Let's do it for the library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402 blob.
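For illustration, an entry in scheduler-state.json looks roughly like this (the timestamp and EntryType value are illustrative; ExpiryData is the field the scheduler persists for the expiry time):

```json
{
  "library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402": {
    "Key": "library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402",
    "ExpiryData": "2024-01-10T10:45:00Z",
    "EntryType": 0
  }
}
```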
Then restart the proxy. When the expiry time has passed, it tries to delete the blob:
time="2024-01-10T10:45:19.748260501Z" level=error msg="Scheduler error returned from OnExpire(library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402): operation unsupported" go.version=go1.20.8 instance.id=1fceb0ff-4044-4e41-8119-0584e06cdba4 service=registry version=2.8.3
As expected, the blob is not removed because of the storage.delete.enabled=false option.
However, the corresponding entry gets removed from the scheduler-state.json.
At this point the blob leaks on the file system, and the proxy can no longer garbage collect it, even if storage.delete.enabled is set to true in the future.
The corresponding source code is distribution/registry/proxy/scheduler/scheduler.go, lines 176 to 206 at commit 3dda067.
Expected behavior

I would expect the switch from storage.delete.enabled=false to storage.delete.enabled=true to garbage collect all blobs/manifests at time.Now() + ttl, where time.Now() is the moment of the switch to storage.delete.enabled=true (the enablement of garbage collection).
I also wanted to discuss the storage.delete.enabled field itself. I feel it is not designed for the purpose we use it for. The proxy source code gives the impression that the proxy is designed to always run with GC enabled.
If the proxy has the strict requirement that storage.delete.enabled is always true, then this could be validated on startup, and the proxy could fail to start if the field is set to false.
On the other hand, I think it is useful to have a setting that controls whether GC is enabled.
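If that strict requirement exists, the startup validation could be as simple as the following sketch (the config struct here is hypothetical; the real registry configuration types differ):

```go
package main

import (
	"errors"
	"fmt"
)

// config is a hypothetical, simplified view of the registry settings
// relevant to this check.
type config struct {
	proxyEnabled  bool // a proxy.remoteurl is configured
	deleteEnabled bool // storage.delete.enabled
}

// validate rejects a proxy configuration that disables deletion, so the
// process fails fast instead of silently leaking blobs later.
func validate(c config) error {
	if c.proxyEnabled && !c.deleteEnabled {
		return errors.New("proxy mode requires storage.delete.enabled=true")
	}
	return nil
}

func main() {
	if err := validate(config{proxyEnabled: true, deleteEnabled: false}); err != nil {
		fmt.Println("refusing to start:", err)
	}
}
```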
For v3.0.0, where the ttl is configurable, the same issue can be translated as:
proxy: Switch from proxy.ttl=0 to proxy.ttl > 0 leaks blobs that never get garbage collected
ialidzhikov changed the title from "proxy: Switch from storage.delete.enabled=false to storage.delete.enabled=true leaks blobs that never get garbage collected" to "proxy: Switch from proxy.ttl=0 to proxy.ttl > 0 leaks blobs that never get garbage collected" on Feb 8, 2024.