
[cinder-csi-plugin:bug] Volume Snapshot deletion is not detecting errors #2294

Open
josedev-union opened this issue Jul 12, 2023 · 21 comments · May be fixed by #2369
Labels
kind/bug Categorizes issue or PR as related to a bug.

@josedev-union

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I am using the Delete policy for both the PV class and the VS (VolumeSnapshot) class.
We use TrilioVault for Kubernetes (TVK) as our K8s backup tool and run the TVK preflight plugin as part of the preflight check during cluster provisioning.
This preflight check plugin creates a PVC and a VS from that source PVC, then creates a restored PVC from the snapshot.
Once all checks are finished, it deletes all test resources.

In OpenStack, if a volume was created from a volume snapshot, that snapshot cannot be deleted before its child volume is deleted.
So the first attempt to delete the VS fails because it takes some time to delete the restored volume. Of course, deleting the source volume also fails, because it can only be deleted once the VS and the restored volume are gone.

But the OpenStack snapshot API is asynchronous, meaning it responds 202 (i.e., the call returns success) to the DELETE snapshot request every time, even if the deletion subsequently fails. In other words, the cs.Cloud.DeleteSnapshot func will never fail in our scenario, the k8s VS and VSC (VolumeSnapshotContent) objects are deleted without any issue, and the deletion is never requeued even though the OpenStack resources are still there.

err := cs.Cloud.DeleteSnapshot(id)

This leaves garbage resources (volumes of the source PVC and volume snapshots of the VS) on the OpenStack cloud.

What you expected to happen:
I think this could be considered more or less an OpenStack API issue, but I think we can work around it at the plugin level.
The CSI plugin would check the volume snapshot status after requesting the deletion. If the volume snapshot is removed, it returns OK. If the volume snapshot status changes to ERROR or remains Available (we can set a timeout for this check), it returns an error so the deletion can be requeued.
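A minimal sketch of that check, assuming a gophercloud v1-style block storage v3 client; the waitForSnapshotDeleted helper, the target statuses and the poll interval are illustrative, not existing plugin code:

```go
package snapshotcheck

import (
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/snapshots"
)

// waitForSnapshotDeleted polls the snapshot after the async DELETE request.
// It returns nil once the snapshot is gone (404), and an error if the
// snapshot enters an error state or is still present when the timeout
// expires, so the CSI DeleteSnapshot call can fail and be requeued.
func waitForSnapshotDeleted(client *gophercloud.ServiceClient, id string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		snap, err := snapshots.Get(client, id).Extract()
		if err != nil {
			if _, ok := err.(gophercloud.ErrDefault404); ok {
				return nil // snapshot no longer exists: deletion succeeded
			}
			return err // unexpected API error
		}
		if snap.Status == "error" || snap.Status == "error_deleting" {
			return fmt.Errorf("snapshot %s entered status %q while being deleted", id, snap.Status)
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("timed out waiting for snapshot %s to be deleted", id)
}
```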

How to reproduce it:
It can be reproduced by using these example resources https://github.com/kubernetes/cloud-provider-openstack/tree/master/examples/cinder-csi-plugin/snapshot

    1. Create a PVC.
    2. Create a VS from it.
    3. Create a PVC from the VS.
    4. Delete the VS first.

Anything else we need to know?:

  • Volume deletion is requeued correctly if it fails, because the OpenStack Cinder volume API reports the failure synchronously.
    Only volume snapshot deletion is the problem.
  • The VS object on k8s has been deleted but the OpenStack volume snapshot still exists, so the original PV cannot be deleted. This issue therefore produces garbage volumes as well as garbage volume snapshots.

Environment:

  • openstack-cloud-controller-manager(or other related binary) version:
    We just use cinder-csi-plugin. The version list is
    csiplugin: registry.k8s.io/provider-os/cinder-csi-plugin:v1.25.6
    snapshotter: k8s.gcr.io/sig-storage/csi-snapshotter:v6.2.1
    provisioner: k8s.gcr.io/sig-storage/csi-provisioner:v3.1.1
    attacher: k8s.gcr.io/sig-storage/csi-attacher:v3.5.1

  • OpenStack version: Ocata, Wallaby (I think all versions, even Antelope, will have the same issue)

  • Others:
    K8s: 1.25.9

@kayrus
Contributor

kayrus commented Jul 12, 2023

@josedev-union see a similar issue in Lirt/velero-plugin-for-openstack#78
Do you think a full clone will be a solution to your issue?

@josedev-union
Author

josedev-union commented Jul 12, 2023

@josedev-union see a similar issue in Lirt/velero-plugin-for-openstack#78 Do you think a full clone will be a solution to your issue?

@kayrus we don't have a plan to use Velero.
Btw, I don't think clone can replace snapshot. Volume clone is only possible for Available (idle) volumes, but in practice we have to back up applications on the fly, without downtime.

@kayrus
Contributor

kayrus commented Jul 12, 2023

@josedev-union I'm not asking you to use velero. I'm just sharing an issue that has the same consequences.

Volume clone is only possible for Available (idle) volumes

The volume can be cloned while it is in the in-use status. If I'm not mistaken, a shadow snapshot with force: true is created, then a new volume is restored from it. Can you try with an in-use volume in your environment: openstack volume create --source test-in-use-vol?

@jichenjc
Contributor

But the OpenStack snapshot API is asynchronous, meaning it responds 202 (i.e., the call returns success) to the DELETE snapshot request every time, even if the deletion subsequently fails. In other words, the cs.Cloud.DeleteSnapshot func will never fail in our scenario, the k8s VS and VSC (VolumeSnapshotContent) objects are deleted without any issue, and the deletion is never requeued even though the OpenStack resources are still there.

I think the key ask here is to wait until the snapshot is really deleted, just like

err = cs.Cloud.WaitSnapshotReady(snap.ID)

waits for the snapshot to become ready. Because the API returns a 202 response (https://docs.openstack.org/api-ref/block-storage/v3/index.html?expanded=delete-a-snapshot-detail#delete-a-snapshot), I think we should consider waiting for the snapshot to really be deleted (with a timeout, of course).

@jichenjc jichenjc added the kind/bug Categorizes issue or PR as related to a bug. label Jul 13, 2023
@kayrus
Contributor

kayrus commented Jul 13, 2023

@jichenjc if the snapshot is in an error state, the API won't allow deleting it. The logic must be a bit cleverer and more complicated, e.g. reset the snapshot status, then try to delete it. I faced the same issue in the velero plugin and am testing this fix in my env. If the tests are good, we can onboard this logic into CSI.
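Not the actual fix from that velero plugin PR, just a rough sketch of a reset-then-delete flow, assuming a gophercloud v1 block storage v3 client; the os-reset_status action is admin-only, and the helper name and the target status are assumptions:

```go
package snapshotreset

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/snapshots"
)

// resetAndDeleteSnapshot forces a stuck snapshot back into a deletable
// status via the admin-only os-reset_status action, then issues the delete.
func resetAndDeleteSnapshot(client *gophercloud.ServiceClient, id string) error {
	body := map[string]interface{}{
		"os-reset_status": map[string]interface{}{"status": "available"}, // assumed target status
	}
	_, err := client.Post(client.ServiceURL("snapshots", id, "action"), body, nil,
		&gophercloud.RequestOpts{OkCodes: []int{202}})
	if err != nil {
		return err
	}
	return snapshots.Delete(client, id).ExtractErr()
}
```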

@jichenjc
Contributor

That's a good suggestion, I will read the PR you provided :) thanks for the great info~

@josedev-union
Author

@jichenjc if the snapshot is in an error state, the API won't allow deleting it. The logic must be a bit cleverer and more complicated, e.g. reset the snapshot status, then try to delete it. I faced the same issue in the velero plugin and am testing this fix in my env. If the tests are good, we can onboard this logic into CSI.

I agree that it will be a bit complicated. But I think we need to keep in mind that resetting the status requires admin permission. We run cinder-csi-plugin using a tenant-scoped service account with the member role. (Downstream clusters will not normally have such wide permissions.)
I checked your PR and the status reset will result in an error due to the lack of permissions. Of course, without the reset operation, snapshot deletion will still fail for some other reason, like a timeout or an invalid status if the snapshot is in error_deleting status, for example.
It may seem that only the error message is different, but in fact this is a change in the role requirements for the plugin.

@kayrus
Contributor

kayrus commented Jul 13, 2023

@josedev-union we have two kinds of permissions in our OpenStack env: admin and cloud admin. Admin permissions allow resetting the status, cloud admin can force-delete the resource. If you also have such roles, then the status reset can be triggered. Nevertheless, this reset option should be toggleable in the CSI config.

@kayrus
Contributor

kayrus commented Jul 13, 2023

/assign

@jichenjc
Contributor

jichenjc commented Jul 14, 2023

Admin permissions allow resetting the status, cloud admin can force-delete the resource.

https://docs.openstack.org/keystone/latest/admin/service-api-protection.html#admin

so in your context (the right side is the definition from the link above)?

admin = project admin, cloud admin = system admin

@mdbooth
Contributor

mdbooth commented Jul 18, 2023

I think the key ask here is to wait until the snapshot is really deleted, just like

err = cs.Cloud.WaitSnapshotReady(snap.ID)

waits for the snapshot to become ready. Because the API returns a 202 response (https://docs.openstack.org/api-ref/block-storage/v3/index.html?expanded=delete-a-snapshot-detail#delete-a-snapshot), I think we should consider waiting for the snapshot to really be deleted (with a timeout, of course).

I think in this case we MUST NOT wait in the controller's reconcile loop for the resource to be deleted. Depending on the order of deletion this may take an arbitrarily long time. It may never be deleted if the user doesn't also delete the dependent resources. We must exit the reconcile loop and allow the manager to call us again with exponential backoff.
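A minimal sketch of this fail-fast approach, assuming a gRPC-style CSI handler; the cinderCloud interface and the ListVolumesBySnapshot helper are illustrative stand-ins, not the plugin's real API:

```go
package snapshotdelete

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// cinderCloud is a stand-in for the plugin's OpenStack client.
type cinderCloud interface {
	ListVolumesBySnapshot(snapshotID string) ([]string, error) // assumed helper
	DeleteSnapshot(snapshotID string) error
}

// deleteSnapshotNonBlocking fails immediately when the snapshot still has
// dependent volumes instead of waiting in the reconcile loop; the
// external-snapshotter sidecar then retries the RPC with exponential backoff.
func deleteSnapshotNonBlocking(cloud cinderCloud, snapshotID string) error {
	vols, err := cloud.ListVolumesBySnapshot(snapshotID)
	if err != nil {
		return status.Errorf(codes.Internal, "failed to list volumes created from snapshot %s: %v", snapshotID, err)
	}
	if len(vols) > 0 {
		return status.Errorf(codes.FailedPrecondition,
			"snapshot %s still has %d dependent volume(s); deletion will be retried", snapshotID, len(vols))
	}
	return cloud.DeleteSnapshot(snapshotID)
}
```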

@josedev-union
Author

I think in this case we MUST NOT wait in the controller's reconcile loop for the resource to be deleted. Depending on the order of deletion this may take an arbitrarily long time. It may never be deleted if the user doesn't also delete the dependent resources. We must exit the reconcile loop and allow the manager to call us again with exponential backoff.

Yeah, that could also be another solution.
The primary requirement is to requeue the deletion of the VS, either up to a specific limit of attempts or until the associated OpenStack resource is successfully removed.

@kayrus
Contributor

kayrus commented Jul 19, 2023

I just noticed that manila CSI doesn't have a finalizer, and during my tests I had a Manila PVC which was successfully deleted from the k8s API but was still stuck in OpenStack.

@jichenjc
Contributor

I think in this case we MUST NOT wait in the controller's reconcile loop for the resource to be deleted. Depending on the order of deletion this may take an arbitrarily long time. It may never be deleted if the user doesn't also delete the dependent resources. We must exit the reconcile loop and allow the manager to call us again with exponential backoff.

Agree, this might be another solution. It seems this is the better option, but the goal is the same: check the deletion instead of taking the 202 as completion.

@jeffyjf
Contributor

jeffyjf commented Sep 22, 2023

Hi, is anyone still working on this issue? @kayrus If you haven't started this work, could you leave it to me?

I think that we should introduce a finalizer to resolve the issue. When a user creates a volume from a snapshot, cinder-csi-plugin should add a finalizer to the snapshot. When the last volume created from this snapshot is deleted, cinder-csi-plugin will remove this finalizer.
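A minimal sketch of this finalizer idea, assuming a controller-runtime client; the finalizer name and the markSnapshotInUse helper are made up for illustration:

```go
package snapshotfinalizer

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Assumed finalizer name, not one the plugin defines today.
const snapshotInUseFinalizer = "cinder.csi.openstack.org/snapshot-in-use"

// markSnapshotInUse adds a finalizer to the source VolumeSnapshot when a
// volume is created from it, so the snapshot object cannot disappear while
// restored volumes still exist; the finalizer would be removed again once
// the last such volume is deleted.
func markSnapshotInUse(ctx context.Context, c client.Client, namespace, name string) error {
	vs := &unstructured.Unstructured{}
	vs.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "snapshot.storage.k8s.io", Version: "v1", Kind: "VolumeSnapshot",
	})
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, vs); err != nil {
		return err
	}
	if controllerutil.AddFinalizer(vs, snapshotInUseFinalizer) {
		return c.Update(ctx, vs)
	}
	return nil // finalizer already present
}
```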

@mdbooth
Contributor

mdbooth commented Sep 25, 2023

Hi, is anyone still working on this issue? @kayrus If you haven't started this work, could you leave it to me?

I think that we should introduce a finalizer to resolve the issue. When a user creates a volume from a snapshot, cinder-csi-plugin should add a finalizer to the snapshot. When the last volume created from this snapshot is deleted, cinder-csi-plugin will remove this finalizer.

Or alternatively each individual volume created from the snapshot would add its own finalizer. That would have the advantage of being its own reference count, and also giving the user a clue as to specifically which volumes are blocking the deletion of the snapshot.

I can't think of a case where I've seen this pattern used before, though. We should take a moment to consider if there's a reason for that.

@jeffyjf
Contributor

jeffyjf commented Sep 25, 2023

Hi @mdbooth

We all agree that if a volume is created from a snapshot, the snapshot shouldn't be deleted before the volume, right? But I haven't understood this sentence:

Or alternatively each individual volume created from the snapshot would add its own finalizer.

Do you mean adding a finalizer to the PersistentVolume? IMO, adding a finalizer to the PersistentVolume can't prevent the VolumeSnapshot from being deleted before the PersistentVolume. Or maybe you mean adding owner references to the PersistentVolume? But owner references also can't prevent the VolumeSnapshot from being deleted before the PersistentVolume.

For more information about finalizers and owner references, please refer to the links below.
[1] https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/
[2] https://kubernetes.io/docs/concepts/architecture/garbage-collection/

@mdbooth
Contributor

mdbooth commented Sep 26, 2023

I think that we should introduce a finalizer to resolve the issue. When a user creates a volume from a snapshot, cinder-csi-plugin should add a finalizer to the snapshot. When the last volume created from this snapshot is deleted, cinder-csi-plugin will remove this finalizer.

I was referring to this finalizer ☝️

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 28, 2024
@josedev-union
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 28, 2024