
Forcefully detach the volumes on pod deletion if kubelet is unavailable #67419

Closed
NickrenREN wants to merge 2 commits from the force-detach branch

Conversation

@NickrenREN (Contributor)

Forcefully detach the volumes on pod deletion if kubelet is unavailable

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #65392

Special notes for your reviewer:

Release note:

Forcefully detach the volumes on pod deletion if kubelet is unavailable

/sig storage
/kind bug
/assign @verult

@k8s-ci-robot added the release-note, sig/storage, kind/bug, and size/L labels on Aug 15, 2018
@k8s-ci-robot added the cncf-cla: yes label on Aug 15, 2018
@NickrenREN (Contributor, Author)

/retest

@gnufied (Member) commented Aug 16, 2018

/assign

@verult (Contributor) left a comment

This implementation only checks the grace period upon pod update, so it won't trigger pod deletion from the DSW at the right time.

I talked to @saad-ali offline, and his proposal is to delete the pod from the DSW as soon as deletionTimestamp is set, and to configure the detach timeout that waits for volume unmount to be 6 min plus the pod's deletionGracePeriod. If multiple pods on the same node have the volume mounted, go for the longest wait time to be safe, i.e. choose the latest pod deletionTimestamp + deletionGracePeriod to begin the 6 min countdown.
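A rough sketch of that timeout computation, assuming the core/v1 pod API; the function name, the standalone packaging, and the 6-minute constant are illustrative, not code from this PR:

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
)

// maxWaitBeforeForceDetach is the 6-minute unmount wait from the proposal
// above; the name is illustrative.
const maxWaitBeforeForceDetach = 6 * time.Minute

// detachDeadline picks the latest deletionTimestamp + deletionGracePeriod
// among the pods on the node that mount the volume, then adds the unmount
// wait, so the controller waits long enough for every pod.
func detachDeadline(pods []*v1.Pod) (time.Time, bool) {
	var latest time.Time
	found := false
	for _, pod := range pods {
		if pod.DeletionTimestamp == nil {
			continue // pod is not being deleted
		}
		grace := time.Duration(0)
		if pod.DeletionGracePeriodSeconds != nil {
			grace = time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second
		}
		if end := pod.DeletionTimestamp.Time.Add(grace); !found || end.After(latest) {
			latest, found = end, true
		}
	}
	if !found {
		return time.Time{}, false // no pod using the volume is being deleted
	}
	return latest.Add(maxWaitBeforeForceDetach), true
}

func main() {
	// In the controller this would receive the pods on the node that still
	// reference the volume; nil here just demonstrates the call.
	deadline, ok := detachDeadline(nil)
	fmt.Println(deadline, ok)
}
```

Taking the maximum deadline across pods matches the "longest wait time to be safe" point above.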

@@ -155,9 +156,30 @@ func DetermineVolumeAction(pod *v1.Pod, desiredStateOfWorld cache.DesiredStateOf
// should be detached or not
return keepTerminatedPodVolume
}
if isPodNeededToBeRemoved(pod) {
Contributor

This will remove all pods after the grace period, not just terminated pods.

Maybe we should have two different versions of the IsPodTerminated call, one that checks container status and one that doesn't (for this method).
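A minimal sketch of that suggestion, with illustrative helper names and conditions (the real IsPodTerminated may differ): one variant consults container statuses, the other looks only at deletion metadata, which matters when the kubelet is down and statuses are stale.

```go
package volumehelpers // illustrative package name, not from the PR

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// isPodTerminated mirrors the existing container-status-based check: the pod
// is in a terminal phase or none of its containers are still running.
func isPodTerminated(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodFailed ||
		pod.Status.Phase == v1.PodSucceeded ||
		noContainersRunning(pod.Status.ContainerStatuses)
}

// isPodDeletionPastGracePeriod skips container statuses entirely and looks
// only at deletion metadata, so it still fires when the kubelet is down and
// can no longer update the pod's container statuses.
func isPodDeletionPastGracePeriod(pod *v1.Pod) bool {
	if pod.DeletionTimestamp == nil {
		return false
	}
	grace := time.Duration(0)
	if pod.DeletionGracePeriodSeconds != nil {
		grace = time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second
	}
	return time.Now().After(pod.DeletionTimestamp.Time.Add(grace))
}

func noContainersRunning(statuses []v1.ContainerStatus) bool {
	for _, s := range statuses {
		if s.State.Running != nil {
			return false
		}
	}
	return true
}
```

DetermineVolumeAction could then use the deletion-only variant when deciding whether to drop the pod from the DSW, independent of stale container statuses.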

Contributor (Author)

The container status check is done in IsPodTerminated, which is also called in DetermineVolumeAction.
Should we still check container status after the grace period?

Member

Is there a case where DeletionTimestamp will be set for pods that are not being terminated? Again, I think we should verify with sig-node and confirm at what point we should consider a pod "deleted" and hence proceed with detaching the volumes it was using. cc @sjenning @derekwaynecarr

@gnufied (Member) commented Aug 22, 2018

I think this may not be enough to fix the bug: while this function will cause removal of the pod from the DSWP, I think the next function that adds pods will add it right back, and the volume will not get detached.

Contributor (Author)

Thanks for pointing this out, @gnufied.
@verult mentioned that there is a meeting about this; let's hold this PR for now, and if it is the way we want to go, I will update it.

@NickrenREN (Contributor, Author)

This implementation only checks the grace period upon pod update, so it won't trigger pod deletion from the DSW at the right time.

Not really: DetermineVolumeAction is also called by the DSWP in findAndRemoveDeletedPods, so pods on a broken node will eventually be detected.
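To illustrate both the concern and the reply, here is a deliberately simplified, self-contained model of the populator loop; the types and function bodies are illustrative, and only the findAndAddActivePods / findAndRemoveDeletedPods names come from the discussion above. The point is that the add pass and the remove pass need to agree on the same predicate, otherwise a pod dropped in one pass is re-added in the next.

```go
// Simplified model of the DSWP populator loop; not the actual
// attach/detach controller code.
package main

import (
	"fmt"
	"time"
)

type pod struct {
	name             string
	deletionDeadline *time.Time // deletionTimestamp + grace period, if the pod is being deleted
}

// desiredStateOfWorld is modeled as a set of pod names whose volumes should stay attached.
type desiredStateOfWorld map[string]bool

// shouldKeepPod is the single predicate both passes share. If only the remove
// pass consulted the deadline, the add pass would put the pod straight back.
func shouldKeepPod(p pod, now time.Time) bool {
	return p.deletionDeadline == nil || now.Before(*p.deletionDeadline)
}

func findAndAddActivePods(dsw desiredStateOfWorld, pods []pod, now time.Time) {
	for _, p := range pods {
		if shouldKeepPod(p, now) {
			dsw[p.name] = true
		}
	}
}

func findAndRemoveDeletedPods(dsw desiredStateOfWorld, pods []pod, now time.Time) {
	for _, p := range pods {
		if !shouldKeepPod(p, now) {
			delete(dsw, p.name)
		}
	}
}

func main() {
	past := time.Now().Add(-time.Minute) // deletion deadline already expired
	pods := []pod{{name: "stuck-pod", deletionDeadline: &past}}
	dsw := desiredStateOfWorld{"stuck-pod": true}

	// One populator iteration: the remove pass drops the pod.
	findAndAddActivePods(dsw, pods, time.Now())
	findAndRemoveDeletedPods(dsw, pods, time.Now())
	fmt.Println(dsw) // map[]

	// A later add pass does not bring it back, because it uses the same predicate.
	findAndAddActivePods(dsw, pods, time.Now())
	fmt.Println(dsw) // map[]
}
```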

@gnufied (Member) commented Aug 17, 2018

cc @sjenning. This is likely also related to the bug we saw on AWS, where volumes from shut-down nodes are not being detached because pods are left in the "Running" phase but have DeletionTimestamp set.

@@ -135,3 +135,117 @@ func TestFindAndAddActivePods_FindAndRemoveDeletedPods(t *testing.T) {
}

}

func TestFindAndRemoveDeletedPodsInFailedNodes(t *testing.T) {
Member

We should try to get some e2e tests for this. IIRC, this code path broke before too (but for a different reason).

@verult (Contributor) commented Aug 17, 2018

@NickrenREN ACK. It might be helpful to also document somewhere all the timing parameters involved, including deletionGracePeriod, the DSWP sync loop sleep period, and later the detach timeout for the actual detach.
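As a sketch of how those parameters could be summarized in one place (the grouping, names, and values below are documentation placeholders, not the PR's actual constants):

```go
// Illustrative grouping of the timing parameters mentioned above.
package attachdetachtiming

import "time"

const (
	// deletionGracePeriod: set per pod at deletion time
	// (pod.DeletionGracePeriodSeconds); typically 30s unless overridden.
	exampleDeletionGracePeriod = 30 * time.Second

	// DSWP sync loop sleep period: how often the populator re-lists pods and
	// runs findAndAddActivePods / findAndRemoveDeletedPods.
	examplePopulatorLoopSleepPeriod = 1 * time.Minute

	// Detach timeout: how long to wait for the volume to be unmounted before
	// forcing the detach (the 6 min figure from the discussion above).
	exampleForceDetachTimeout = 6 * time.Minute
)
```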

NickrenREN force-pushed the force-detach branch 2 times, most recently from da1e64b to 8871dc4 on August 20, 2018 at 12:56
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: NickrenREN
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: saad-ali

If they are not already assigned, you can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

NickrenREN force-pushed the force-detach branch 2 times, most recently from 09fe821 to 80a03cf on August 21, 2018 at 13:32
@NickrenREN (Contributor, Author)

@verult I added the comments in the dswp.go file.
@gnufied I added an e2e test case for this.
PTAL, thanks.

@yastij (Member) commented Aug 23, 2018

Some related work on the node shutdown side: #66213

@gnufied mentioned this pull request on Aug 24, 2018
NickrenREN changed the title from "Forcely detach the volumes on pod deletion if kubelet is unavailable" to "Forcefully detach the volumes on pod deletion if kubelet is unavailable" on Aug 27, 2018
@k8s-ci-robot added the sig/testing label on Aug 28, 2018
@NickrenREN (Contributor, Author)

/retest

@NickrenREN (Contributor, Author)

/close

@k8s-ci-robot (Contributor)

@NickrenREN: Closing this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

NickrenREN deleted the force-detach branch on September 18, 2018 at 05:11
Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/bug (Categorizes issue or PR as related to a bug.)
release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
sig/storage (Categorizes an issue or PR as relevant to SIG Storage.)
sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

If kubelet is unavailable, AttachDetachController fails to force detach on pod deletion
5 participants