
Forcefully detach the volumes on pod deletion if kubelet is unavailable #67419

Closed
NickrenREN wants to merge 2 commits from the force-detach branch

Conversation

@NickrenREN (Contributor)

Forcefully detach the volumes on pod deletion if kubelet is unavailable

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #65392

Special notes for your reviewer:

Release note:

Forcefully detach the volumes on pod deletion if kubelet is unavailable

/sig storage
/kind bug
/assign @verult

@k8s-ci-robot added the release-note, sig/storage, kind/bug, and size/L labels on Aug 15, 2018
@k8s-ci-robot added the cncf-cla: yes label on Aug 15, 2018
@NickrenREN (Contributor, Author)

/retest

@gnufied (Member) commented Aug 16, 2018

/assign

@verult (Contributor) left a comment

This implementation only checks the grace period upon pod update, so it won't trigger pod deletion from the DSW at the right time.

I talked to @saad-ali offline, and his proposal is to delete the pod from the DSW as soon as deletionTimestamp is set, and to configure the detach timeout that waits for volume unmount to be 6 min plus the pod's deletionGracePeriod. If multiple pods on the same node have the volume mounted, go for the longest wait time to be safe, i.e. choose the latest pod deletionTimestamp + deletionGracePeriod to begin the 6 min countdown.
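A rough sketch of that timeout computation, assuming the core/v1 pod API; the function name, the standalone packaging, and the 6-minute constant are illustrative, not code from this PR:

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
)

// maxWaitBeforeForceDetach is the 6-minute unmount wait from the proposal
// above; the name is illustrative.
const maxWaitBeforeForceDetach = 6 * time.Minute

// detachDeadline picks the latest deletionTimestamp + deletionGracePeriod
// among the pods on the node that mount the volume, then adds the unmount
// wait, so the controller waits long enough for every pod.
func detachDeadline(pods []*v1.Pod) (time.Time, bool) {
	var latest time.Time
	found := false
	for _, pod := range pods {
		if pod.DeletionTimestamp == nil {
			continue // pod is not being deleted
		}
		grace := time.Duration(0)
		if pod.DeletionGracePeriodSeconds != nil {
			grace = time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second
		}
		if end := pod.DeletionTimestamp.Time.Add(grace); !found || end.After(latest) {
			latest, found = end, true
		}
	}
	if !found {
		return time.Time{}, false // no pod using the volume is being deleted
	}
	return latest.Add(maxWaitBeforeForceDetach), true
}

func main() {
	// In the controller this would receive the pods on the node that still
	// reference the volume; nil here just demonstrates the call.
	deadline, ok := detachDeadline(nil)
	fmt.Println(deadline, ok)
}
```

Taking the maximum deadline across pods matches the "longest wait time to be safe" point above.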

@@ -155,9 +156,30 @@ func DetermineVolumeAction(pod *v1.Pod, desiredStateOfWorld cache.DesiredStateOf
// should be detached or not
return keepTerminatedPodVolume
}
if isPodNeededToBeRemoved(pod) {
Contributor

This will remove all pods after the grace period, not just terminated pods.

Maybe we should have two different versions of the IsPodTerminated call, one that checks container status and one that doesn't (for this method).
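A minimal sketch of that suggestion, with illustrative helper names and conditions (the real IsPodTerminated may differ): one variant consults container statuses, the other looks only at deletion metadata, which matters when the kubelet is down and statuses are stale.

```go
package volumehelpers // illustrative package name, not from the PR

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// isPodTerminated mirrors the existing container-status-based check: the pod
// is in a terminal phase or none of its containers are still running.
func isPodTerminated(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodFailed ||
		pod.Status.Phase == v1.PodSucceeded ||
		noContainersRunning(pod.Status.ContainerStatuses)
}

// isPodDeletionPastGracePeriod skips container statuses entirely and looks
// only at deletion metadata, so it still fires when the kubelet is down and
// can no longer update the pod's container statuses.
func isPodDeletionPastGracePeriod(pod *v1.Pod) bool {
	if pod.DeletionTimestamp == nil {
		return false
	}
	grace := time.Duration(0)
	if pod.DeletionGracePeriodSeconds != nil {
		grace = time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second
	}
	return time.Now().After(pod.DeletionTimestamp.Time.Add(grace))
}

func noContainersRunning(statuses []v1.ContainerStatus) bool {
	for _, s := range statuses {
		if s.State.Running != nil {
			return false
		}
	}
	return true
}
```

DetermineVolumeAction could then use the deletion-only variant when deciding whether to drop the pod from the DSW, independent of stale container statuses.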

Contributor (Author)

The container status check is done in IsPodTerminated, which is also called in DetermineVolumeAction.
Should we still check container status after the grace period?

Member

Is there a case where DeletionTimestamp will be set for pods that are not being terminated? Again, I think we should verify with sig-node and confirm at what point we should consider a pod "deleted" and hence proceed with detaching the volumes it was using. cc @sjenning @derekwaynecarr

@gnufied (Member) commented Aug 22, 2018

I think this may not be enough to fix the bug: while this function will cause removal of the pod from the DSWP, I think the next function that adds pods will add it right back, and the volume will not get detached.

Contributor (Author)

Thanks for pointing this out, @gnufied.
@verult mentioned that there is a meeting about this; let's hold this PR for now, and if it is the way we want to go, I will update it.

@NickrenREN (Contributor, Author)

This implementation only checks the grace period upon pod update, so it won't trigger pod deletion from the DSW at the right time.

Not really: DetermineVolumeAction is also called by the DSWP in findAndRemoveDeletedPods, so pods on a broken node will eventually be detected.
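To illustrate both the concern and the reply, here is a deliberately simplified, self-contained model of the populator loop; the types and function bodies are illustrative, and only the findAndAddActivePods / findAndRemoveDeletedPods names come from the discussion above. The point is that the add pass and the remove pass need to agree on the same predicate, otherwise a pod dropped in one pass is re-added in the next.

```go
// Simplified model of the DSWP populator loop; not the actual
// attach/detach controller code.
package main

import (
	"fmt"
	"time"
)

type pod struct {
	name             string
	deletionDeadline *time.Time // deletionTimestamp + grace period, if the pod is being deleted
}

// desiredStateOfWorld is modeled as a set of pod names whose volumes should stay attached.
type desiredStateOfWorld map[string]bool

// shouldKeepPod is the single predicate both passes share. If only the remove
// pass consulted the deadline, the add pass would put the pod straight back.
func shouldKeepPod(p pod, now time.Time) bool {
	return p.deletionDeadline == nil || now.Before(*p.deletionDeadline)
}

func findAndAddActivePods(dsw desiredStateOfWorld, pods []pod, now time.Time) {
	for _, p := range pods {
		if shouldKeepPod(p, now) {
			dsw[p.name] = true
		}
	}
}

func findAndRemoveDeletedPods(dsw desiredStateOfWorld, pods []pod, now time.Time) {
	for _, p := range pods {
		if !shouldKeepPod(p, now) {
			delete(dsw, p.name)
		}
	}
}

func main() {
	past := time.Now().Add(-time.Minute) // deletion deadline already expired
	pods := []pod{{name: "stuck-pod", deletionDeadline: &past}}
	dsw := desiredStateOfWorld{"stuck-pod": true}

	// One populator iteration: the remove pass drops the pod.
	findAndAddActivePods(dsw, pods, time.Now())
	findAndRemoveDeletedPods(dsw, pods, time.Now())
	fmt.Println(dsw) // map[]

	// A later add pass does not bring it back, because it uses the same predicate.
	findAndAddActivePods(dsw, pods, time.Now())
	fmt.Println(dsw) // map[]
}
```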

@gnufied (Member) commented Aug 17, 2018

cc @sjenning. This is likely also related to the bug we saw on AWS, where volumes from shut-down nodes are not being detached because pods are left in the "Running" phase but have DeletionTimestamp set.

@@ -135,3 +135,117 @@ func TestFindAndAddActivePods_FindAndRemoveDeletedPods(t *testing.T) {
}

}

func TestFindAndRemoveDeletedPodsInFailedNodes(t *testing.T) {
Member

We should try to get some e2e tests for this. IIRC, this code path broke before too (but for a different reason).

@verult (Contributor) commented Aug 17, 2018

@NickrenREN ACK. It might be helpful to also document somewhere all the timing parameters involved, including deletionGracePeriod, the DSWP sync loop sleep period, and later the detach timeout for the actual detach.
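As a sketch of how those parameters could be summarized in one place (the grouping, names, and values below are documentation placeholders, not the PR's actual constants):

```go
// Illustrative grouping of the timing parameters mentioned above.
package attachdetachtiming

import "time"

const (
	// deletionGracePeriod: set per pod at deletion time
	// (pod.DeletionGracePeriodSeconds); typically 30s unless overridden.
	exampleDeletionGracePeriod = 30 * time.Second

	// DSWP sync loop sleep period: how often the populator re-lists pods and
	// runs findAndAddActivePods / findAndRemoveDeletedPods.
	examplePopulatorLoopSleepPeriod = 1 * time.Minute

	// Detach timeout: how long to wait for the volume to be unmounted before
	// forcing the detach (the 6 min figure from the discussion above).
	exampleForceDetachTimeout = 6 * time.Minute
)
```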

NickrenREN force-pushed the force-detach branch 2 times, most recently from da1e64b to 8871dc4 on August 20, 2018 at 12:56
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: NickrenREN
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: saad-ali

If they are not already assigned, you can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

NickrenREN force-pushed the force-detach branch 2 times, most recently from 09fe821 to 80a03cf on August 21, 2018 at 13:32
@NickrenREN (Contributor, Author)

@verult I added the comments in the dswp.go file.
@gnufied I added an e2e test case for this.
PTAL, thanks.

@yastij (Member) commented Aug 23, 2018

Some related work on the node shutdown side: #66213

@gnufied mentioned this pull request on Aug 24, 2018
NickrenREN changed the title from "Forcely detach the volumes on pod deletion if kubelet is unavailable" to "Forcefully detach the volumes on pod deletion if kubelet is unavailable" on Aug 27, 2018
@k8s-ci-robot added the sig/testing label on Aug 28, 2018
@NickrenREN (Contributor, Author)

/retest

@NickrenREN (Contributor, Author)

/close

@k8s-ci-robot (Contributor)

@NickrenREN: Closing this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

NickrenREN deleted the force-detach branch on September 18, 2018 at 05:11
Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/bug (Categorizes issue or PR as related to a bug.)
release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
sig/storage (Categorizes an issue or PR as relevant to SIG Storage.)
sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

If kubelet is unavailable, AttachDetachController fails to force detach on pod deletion
5 participants