
StatefulSet PVC auto-delete implementation #99728

Merged

merged 5 commits into kubernetes:master from the control branch on Nov 18, 2021

Conversation


@mattcary mattcary commented Mar 3, 2021

/kind feature
/sig apps

What this PR does / why we need it:

This contains the implementation for the StatefulSet PVC auto-delete KEP

Which issue(s) this PR addresses:

kubernetes/enhancements#1847

Special notes for your reviewer:

This PR is unfortunately rather large.

The first two API commits were originally submitted in #99378, which was reverted when the implementation didn't get finished in time for 1.22.

The next three are this change: the implementation, unit tests, and finally the feature gate.

The unit tests required a bit of refactoring, as the faking was done at the wrong level to efficiently test the new behavior. So the change is quite large, but splitting it up would be very difficult at this point :-(

The final commit is the feature gate. I think I'm doing it the right way based on other feature gate implementations, but please let me know if I missed some key bit.

Does this PR introduce a user-facing change?

Adds a feature gate, StatefulSetAutoDeletePVC, which allows PVCs automatically created for StatefulSet pods to be deleted automatically when the StatefulSet is scaled down or deleted.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/1847-autoremove-statefulset-pvcs

@k8s-ci-robot added the following labels on Mar 3, 2021:
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • kind/feature - Categorizes issue or PR as related to a new feature.
  • sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
  • size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • needs-triage - Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • needs-priority - Indicates a PR lacks a `priority/foo` label and requires one.
@mattcary
Contributor Author

mattcary commented Mar 3, 2021

/cc @kk-src
/cc @msau42

@k8s-ci-robot
Contributor

@mattcary: GitHub didn't allow me to request PR reviews from the following users: kk-src.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @kk-src
/cc @msau42

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Mar 3, 2021
@mattcary force-pushed the control branch 2 times, most recently from 8160dd0 to e51c6dd on March 3, 2021 21:45
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 4, 2021
@mattcary force-pushed the control branch 4 times, most recently from 47724b0 to 2541894 on March 7, 2021 02:27
@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Mar 7, 2021
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 7, 2021
claimName := getPersistentVolumeClaimName(set, &templates[i], ordinal)
claim, err := spc.objectMgr.GetClaim(templates[i].Namespace, claimName)
switch {
case apierrors.IsNotFound(err):
Member

I'm wondering if we should return an error, or log a warning and continue. This would most likely happen if someone deleted the object out of band?

Contributor Author

This also happens when everything is getting created IIUC. In UpdateStatefulPod, above, we ignore an error from this method when appropriate.

I think it's nicer to do it that way, where the caller knows when it's ok to ignore missing things, than to plumb that logic down here.

I may eat my words on this when we do more complicated test cases, but for the time being at least to me it makes it easier to reason about.

Contributor

If objectMgr is fed by an informer, this is an inherently racy check (create object A, then check the cache until A shows up). Generally we don't warn in those cases, because that creates log spam (the race is roughly guaranteed to happen at least once on the first pass, and depending on watch latency, possibly longer).

If a PVC doesn't exist, retention is irrelevant. So I'd probably say apierrors.IsNotFound should not cause an error to be returned, and should log at a high verbosity level (i.e. not visible except during debugging, such as at 4 or 5). It doesn't matter whether it's a race on startup, a user-driven deletion, or a bug elsewhere in the controller: the controller has to accept that cross-object comparisons driven by inherently laggy caches can't be done transactionally. The way we handle that is to break our checks into simple invariants (if no PVC exists but should -> try to create; if a PVC shouldn't exist but does -> try to destroy) and to minimize the amount of noise we create in logs and events.

Contributor Author

I think that makes sense. Done.
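
For reference, a minimal sketch of the agreed-upon handling, assuming an informer-backed `getClaim` lookup (the helper and its wiring are stand-ins for the real controller plumbing; `apierrors` and `klog` are the actual libraries):

```go
package statefulset

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/klog/v2"
)

// getClaim stands in for the informer-backed lookup used by the controller.
var getClaim func(namespace, name string) (*v1.PersistentVolumeClaim, error)

// lookupClaim treats a missing claim as a non-error: informer caches are
// inherently laggy, and retention is irrelevant for a claim that doesn't
// exist, so log quietly at V(4) and let a later pass pick it up.
func lookupClaim(namespace, claimName string) (*v1.PersistentVolumeClaim, error) {
	claim, err := getClaim(namespace, claimName)
	switch {
	case apierrors.IsNotFound(err):
		klog.V(4).Infof("Claim %s/%s not found; deferring to a later reconcile pass", namespace, claimName)
		return nil, nil
	case err != nil:
		return nil, fmt.Errorf("could not retrieve claim %s/%s: %w", namespace, claimName, err)
	}
	return claim, nil
}
```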

case err == nil:
// A claim is stale if it doesn't match the pod's UID, including if the pod has no UID.
if hasStaleOwnerRef(pvc, pod) {
fmt.Printf("%s/%s on %s/%v\n", pod.Name, pod.GetUID(), pvc.Name, pvc.GetOwnerReferences())
Member

Should use klog to log

Contributor Author

Yikes, this is printf debugging that leaked out from the commit with the test :-P. Fixed.

if isStale, err := ssc.podControl.PodClaimIsStale(set, replicas[i]); err != nil {
return &status, err
} else if isStale {
// If a pod has a stale PVC, no more work can be done this round.
Member

What if someone manually created the PVC and didn't put an ownerref on it?

Contributor Author

That case of manual PVC creation will be caught in UpdateStatefulPod.

A stale owner ref is one that refers to a different pod, which indicates the race where a scale-down is followed immediately by a scale back up before the PVC can get deleted. Since things may be in the process of getting unmounted, it seems best to just wait for everything to settle down.
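
A minimal sketch of such a staleness check, consistent with the "claim is stale if it doesn't match the pod's UID" comment in the diff above (this assumes the relevant owner ref carries the pod's name):

```go
// hasStaleOwnerRef reports whether the claim carries an owner reference
// naming this pod but with a different UID, i.e. a previous incarnation of
// the pod left over from a scale-down/scale-up cycle.
func hasStaleOwnerRef(claim *v1.PersistentVolumeClaim, pod *v1.Pod) bool {
	for _, ref := range claim.GetOwnerReferences() {
		if ref.Name == pod.Name && ref.UID != pod.UID {
			return true
		}
	}
	return false
}
```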

@@ -108,23 +145,42 @@ func (spc *realStatefulPodControl) UpdateStatefulPod(set *apps.StatefulSet, pod
return err
}
}
// if the Pod's PVCs are not consistent with the StatefulSet's PVC deletion policy, update the PVC
Member

Any new logic in the controller should be feature gated.

Member

Oops, I'm reviewing one commit at a time and missed 89caf6d.

Contributor Author

Sorry, I thought it would be easier to split that out. I'm overthinking it!

Contributor Author

The feature gate checks happen in this commit now (see getPersistentVolumeClaimDeletePolicy in stateful-set_utils.go).

Also note that we will need to update claim ownerRefs if the feature gate is rolled back. So the feature gate check has to happen deeper down anyway.

Member

From a code organization perspective, the feature gate is buried deeply in the utils. It's hard to read the code and determine the controller behavior when it's enabled/disabled.

Contributor Author

I see that with StatefulSetMinReadySeconds the feature gate check is hoisted up as you describe, so I've done the same with this feature.
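
The hoisted check looks roughly like this (a sketch; `utilfeature` and `features` are the packages Kubernetes uses for gate checks, while the helper name is illustrative):

```go
package statefulset

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// autoDeletePVCEnabled centralizes the gate check so callers consult it once,
// up front, instead of the gate being buried deep in the utils.
func autoDeletePVCEnabled() bool {
	return utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC)
}
```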

// Otherwise the error may be due to some other update, and we'll have to reconcile the next time around.
// TODO: does the reconcile need to be sped up?
} else if !match {
if err := spc.UpdatePodClaimForDeletionPolicy(set, pod); err != nil {
Member

Maybe something to consider as a future optimization, but we could save an update if we create+update in one call.

Contributor Author

Yes, although it gets complicated: if the pod is new as well, it may not yet have the UID necessary for the owner ref.

So +1 to future optimization :-)

func getClaimPodName(set *apps.StatefulSet, claim *v1.PersistentVolumeClaim) string {
podName := ""

statefulClaimRegex := regexp.MustCompile(fmt.Sprintf(".*-(%s-[0-9]+)$", set.Name))
Member

pvc name could contain dashes too?

Contributor Author

Yes, but that's okay, because we inject the actual set name into the regex. Really we're just picking out the ordinal (that's also why we can't precompile the RE).

E.g. if the claim is foo-bar-baz-8 and set.Name = baz, then the RE matches foo-bar-(baz-8) and we do the right thing. If set.Name = bar-baz, then the RE matches foo-(bar-baz-8), and again we're okay.

At least I think so, maybe I missed something.
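
A standalone illustration of that extraction, runnable with the examples above (only the regex is from the diff; the function name and scaffolding are just for the demo):

```go
package main

import (
	"fmt"
	"regexp"
)

// claimPodName extracts the "<setName>-<ordinal>" suffix from a claim name.
// Injecting the actual set name into the pattern means dashes elsewhere in
// the claim name cannot confuse the match.
func claimPodName(setName, claimName string) string {
	re := regexp.MustCompile(fmt.Sprintf(".*-(%s-[0-9]+)$", setName))
	if m := re.FindStringSubmatch(claimName); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(claimPodName("baz", "foo-bar-baz-8"))     // baz-8
	fmt.Println(claimPodName("bar-baz", "foo-bar-baz-8")) // bar-baz-8
}
```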

if hasOwnerRef(claim, set) {
return false
}
podScaledDown := getOrdinal(pod) >= int(*set.Spec.Replicas)
Member

With parallel pod management can we still make the assumption that the highest ordinals are scaled down first?

Contributor Author

My understanding is that with parallel pod management, Replicas will still be set to the final number of replicas; we just don't know which of the condemned pods will get deleted first.

In other words, with parallel pod management as well as conventional, all the condemned pods will have ordinal >= Replicas (ordinals are zero-based), and we will set owner refs on all of those PVCs at the same time. We don't need to do anything different with respect to PVC deletion AFAICT.


This is correct: all condemned pods will have ordinal >= Replicas.

return false
}
default:
klog.Errorf("Unknown policy %v!", set.Spec.PersistentVolumeClaimDeletePolicy)
Member

should this return an error?

Contributor Author

This function only returns a bool. Since the policy should be validated, it seems excessive to return an error.

I have changed the behavior to match Retain, though, since I'm going to have to deal with a nil pointer anyway now that the field is optional in the API.

// Sometimes the version and kind are not set in pod.TypeMeta. These are necessary for the ownerRef.
// TODO: there must be a better way to do this other than hardcoding the pod version?
podMeta := pod.TypeMeta
if podMeta.APIVersion == "" {
Member

Does this happen in a real cluster or only unit tests?

Contributor Author

Yes, this was happening in a real cluster. Maybe this means something has to wait a reconcile loop for the pod to be created?
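
For context, the fallback under discussion amounts to something like this (a sketch; "v1" and "Pod" are the well-known values for core Pods):

```go
// Informer-provided objects can arrive with an empty TypeMeta, so fall back
// to the well-known values before building the ownerRef.
podMeta := pod.TypeMeta
if podMeta.APIVersion == "" {
	podMeta.APIVersion = "v1"
}
if podMeta.Kind == "" {
	podMeta.Kind = "Pod"
}
```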

// updated and false otherwise.
func setOwnerRef(target, owner metav1.Object, ownerType *metav1.TypeMeta) bool {
if owner.GetUID() == types.UID("") {
panic(fmt.Sprintf("Bad owner for %s", owner.GetName()))
Member

should we error instead?

Contributor Author

Oops, more debugging code. Removed.

}

if !utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) && (oldSet == nil || oldSet.Spec.PersistentVolumeClaimDeletePolicy == apps.Retain) {
newSet.Spec.PersistentVolumeClaimDeletePolicy = apps.Retain
Member

I think this can be addressed in the API PR, but the policy should be a pointer, and then dropping it would mean setting it to nil.
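
With that suggestion applied, the drop logic quoted above would become roughly (a sketch; field name as in this PR, with the type changed to a pointer so "dropped" means nil):

```go
if !utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) &&
	(oldSet == nil || oldSet.Spec.PersistentVolumeClaimDeletePolicy == nil) {
	// Gate is off and the old object never had the field set: clear it so
	// disabled clusters never persist the new field.
	newSet.Spec.PersistentVolumeClaimDeletePolicy = nil
}
```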

@@ -351,7 +351,7 @@ func buildControllerRoles() ([]rbacv1.ClusterRole, []rbacv1.ClusterRoleBinding)
rbacv1helpers.NewRule("update").Groups(appsGroup).Resources("statefulsets/finalizers").RuleOrDie(),
rbacv1helpers.NewRule("get", "create", "delete", "update", "patch").Groups(legacyGroup).Resources("pods").RuleOrDie(),
rbacv1helpers.NewRule("get", "create", "delete", "update", "patch", "list", "watch").Groups(appsGroup).Resources("controllerrevisions").RuleOrDie(),
rbacv1helpers.NewRule("get", "create").Groups(legacyGroup).Resources("persistentvolumeclaims").RuleOrDie(),
rbacv1helpers.NewRule("get", "create", "delete", "update", "list").Groups(legacyGroup).Resources("persistentvolumeclaims").RuleOrDie(),
Member

These new rules should be feature gated

Contributor Author

Done (will be in the next push). Also, I don't think delete is needed.
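
A sketch of the gated form (`rbacv1helpers` is the helper package already used in the quoted diff; the exact verb set is an assumption based on the discussion above, with delete left to the garbage collector):

```go
if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
	// Owner-ref management needs update/list/watch on PVCs; the actual
	// deletion is performed by garbage collection, not this controller.
	role.Rules = append(role.Rules,
		rbacv1helpers.NewRule("update", "list", "watch").Groups(legacyGroup).Resources("persistentvolumeclaims").RuleOrDie())
}
```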

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2021
@jingxu97
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 18, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 18, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kow3ns, mattcary, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jingxu97
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 18, 2021
@mattcary
Contributor Author

/retest

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@jingxu97
Contributor

/test pull-kubernetes-unit

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit b8af116 into kubernetes:master Nov 18, 2021
SIG Auth Old automation moved this from Changes Requested to Closed / Done Nov 18, 2021
@liggitt
Member

liggitt commented Nov 21, 2021

The delete PVC unit tests started flaking when this was merged:

https://storage.googleapis.com/k8s-triage/index.html?ci=0&pr=1&test=PVC

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/106568/pull-kubernetes-unit/1462136832413143040

Looks like there's a data race inside runTestOverPVCRetentionPolicies

Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/test
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/api-change - Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
  • kind/feature - Categorizes issue or PR as related to a new feature.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
  • sig/auth - Categorizes an issue or PR as relevant to SIG Auth.
  • sig/testing - Categorizes an issue or PR as relevant to SIG Testing.
  • size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
  • triage/accepted - Indicates an issue or PR is ready to be actively worked on.