StatefulSet PVC auto-delete implementation #99728
Conversation
@mattcary: GitHub didn't allow me to request PR reviews from the following users: kk-src. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 8160dd0 to e51c6dd
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
Force-pushed from 47724b0 to 2541894
claimName := getPersistentVolumeClaimName(set, &templates[i], ordinal)
claim, err := spc.objectMgr.GetClaim(templates[i].Namespace, claimName)
switch {
case apierrors.IsNotFound(err):
I'm wondering if we should error or we should log a warning and continue. This would most likely happen if someone deleted the object out of band?
This also happens when everything is getting created IIUC. In UpdateStatefulPod, above, we ignore an error from this method when appropriate.
I think it's nicer to do it that way, where the caller knows when it's ok to ignore missing things, than to plumb that logic down here.
I may eat my words on this when we do more complicated test cases, but for the time being at least to me it makes it easier to reason about.
If objectMgr is fed by an informer, this is an inherently racy check (create object A, then check the cache until it shows up). Generally we don't warn in those cases, because that creates log spam (the race is roughly guaranteed to happen at least once on the first pass, and depending on watch latency, possibly longer).
If a PVC doesn't exist, retention is irrelevant. So I'd say apierrors.IsNotFound should not cause an error to be returned, and should log only at a high verbosity level (i.e. not visible except during debugging, such as 4 or 5). It doesn't matter whether it's a race on startup, a user-driven deletion, or a bug elsewhere in the controller: the controller has to accept that cross-object comparisons driven by inherently laggy caches can't be done transactionally. The way we handle that is to break our checks into simple invariants (if no PVC exists but should -> try to create; if a PVC shouldn't exist but does -> try to destroy) and to minimize the amount of noise we create in logs and events.
I think that makes sense. Done.
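Based on this thread, the adopted handling presumably looks something like the following sketch (names taken from the diff above; the exact verbosity level and messages are assumptions):

claim, err := spc.objectMgr.GetClaim(templates[i].Namespace, claimName)
switch {
case apierrors.IsNotFound(err):
	// No PVC means retention is irrelevant; a missing claim is expected
	// during startup races, so log only at a debugging verbosity level
	// and let a later reconcile pick it up.
	klog.V(4).Infof("Claim %s for StatefulSet %s not found, will be handled on a later pass", claimName, set.Name)
case err != nil:
	return fmt.Errorf("could not retrieve claim %s: %w", claimName, err)
}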
case err == nil:
	// A claim is stale if it doesn't match the pod's UID, including if the pod has no UID.
	if hasStaleOwnerRef(pvc, pod) {
		fmt.Printf("%s/%s on %s/%v\n", pod.Name, pod.GetUID(), pvc.Name, pvc.GetOwnerReferences())
Should use klog to log
Yikes, this is printf debugging that leaked out of the commit with the test :-P. Fixed.
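For the record, the klog replacement for the leaked Printf would look something like this sketch (verbosity level and message wording assumed):

klog.V(4).Infof("Pod %s/%s (UID %s) has claim %s with owner refs %v",
	pod.Namespace, pod.Name, pod.GetUID(), pvc.Name, pvc.GetOwnerReferences())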
if isStale, err := ssc.podControl.PodClaimIsStale(set, replicas[i]); err != nil {
	return &status, err
} else if isStale {
	// If a pod has a stale PVC, no more work can be done this round.
What if someone manually created the PVC and didn't put an ownerref on it?
That case of a manually created PVC will be caught in UpdateStatefulPod.
A stale owner ref is one that points to a different pod, which indicates the race where a scale-down is followed immediately by a scale back up before the PVC can get deleted. Since things may be in the process of getting unmounted, it seems best to just wait for everything to settle down.
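A sketch of what hasStaleOwnerRef checks, as described here (the actual implementation in the PR may differ in details):

// A claim's owner ref is stale if it names this pod but carries a different
// UID, i.e. it still points at an earlier incarnation of the pod from
// before a scale-down/scale-up cycle.
func hasStaleOwnerRef(claim *v1.PersistentVolumeClaim, pod *v1.Pod) bool {
	for _, ref := range claim.GetOwnerReferences() {
		if ref.Name == pod.Name && ref.UID != pod.GetUID() {
			return true
		}
	}
	return false
}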
@@ -108,23 +145,42 @@ func (spc *realStatefulPodControl) UpdateStatefulPod(set *apps.StatefulSet, pod
		return err
	}
}
// if the Pod's PVCs are not consistent with the StatefulSet's PVC deletion policy, update the PVC |
Any new logic in the controller should be feature gated.
Oops, I'm reviewing one commit at a time and missed 89caf6d.
Sorry, I thought it would be easier to split that out. I'm overthinking it!
The feature gate checks happen in this commit now (see getPersistentVolumeClaimDeletePolicy in stateful-set_utils.go).
Also note that we will need to update claim ownerRefs if the feature gate is rolled back. So the feature gate check has to happen deeper down anyway.
From a code organization perspective, the feature gate is buried deep in the utils. It's hard to read the code and determine the controller's behavior when the gate is enabled or disabled.
I see that with StatefulSetMinReadySeconds the feature gate check is hoisted up as you describe, so I've done the same with this feature.
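The hoisted check presumably ends up looking something like this sketch (ClaimsMatchDeletionPolicy is a hypothetical helper name; UpdatePodClaimForDeletionPolicy appears in the diff below):

// Check the gate once at the call site so the utils can assume it is enabled.
if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
	match, err := spc.ClaimsMatchDeletionPolicy(set, pod) // hypothetical helper
	if err == nil && !match {
		if err := spc.UpdatePodClaimForDeletionPolicy(set, pod); err != nil {
			return err
		}
	}
}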
	// Otherwise the error may be due to some other update, and we'll have to reconcile the next time around.
	// TODO: does the reconcile need to be sped up?
} else if !match {
	if err := spc.UpdatePodClaimForDeletionPolicy(set, pod); err != nil {
Maybe something to consider as a future optimization, but we could save an update if we create+update in one call.
Yes, although it gets complicated: if the pod is new as well, it may not yet have the UID necessary for the owner ref.
So +1 to future optimization :-)
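A sketch of that future optimization (getPersistentVolumeClaim is a hypothetical helper that builds the claim object):

claim := getPersistentVolumeClaim(set, pod) // hypothetical helper
if pod.GetUID() != "" {
	// Attach the owner ref while building the claim, saving a separate
	// update call; only possible once the pod has been assigned a UID.
	claim.SetOwnerReferences(append(claim.GetOwnerReferences(), metav1.OwnerReference{
		APIVersion: "v1",
		Kind:       "Pod",
		Name:       pod.Name,
		UID:        pod.GetUID(),
	}))
}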
func getClaimPodName(set *apps.StatefulSet, claim *v1.PersistentVolumeClaim) string {
	podName := ""

	statefulClaimRegex := regexp.MustCompile(fmt.Sprintf(".*-(%s-[0-9]+)$", set.Name))
The PVC name could contain dashes too?
Yes, but that's okay, because we inject the actual set name into the regex. Really we're just picking out the ordinal (that's also why we can't precompile the RE).
E.g. if the claim is foo-bar-baz-8 and set.Name = baz, then the RE is foo-bar-(baz-\d+) and we do the right thing. If set.Name = bar-baz, then the RE is foo-(bar-baz-\d+) and again we're okay.
At least I think so; maybe I missed something.
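A runnable illustration of that reasoning:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	setName := "bar-baz" // a set name that itself contains a dash
	re := regexp.MustCompile(fmt.Sprintf(".*-(%s-[0-9]+)$", setName))
	if m := re.FindStringSubmatch("foo-bar-baz-8"); m != nil {
		fmt.Println(m[1]) // prints "bar-baz-8": the pod name, despite extra dashes
	}
}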
if hasOwnerRef(claim, set) {
	return false
}
podScaledDown := getOrdinal(pod) >= int(*set.Spec.Replicas)
With parallel pod management can we still make the assumption that the highest ordinals are scaled down first?
My understanding is that with parallel pod management, Replicas will still be set to the final number of replicas; we just don't know which of the condemned pods will get deleted first.
In other words, with parallel as well as conventional pod management, all the condemned pods will have ordinal >= Replicas, and we set owner refs on all of those PVCs at the same time. We don't need to do anything different w/rt PVC deletion AFAICT.
This is correct: all condemned pods will have ordinal >= Replicas.
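A small illustration of that invariant (ordinals are zero-based, so a scale-down from 5 to 3 replicas condemns ordinals 3 and 4):

package main

import "fmt"

func main() {
	replicas := int32(3) // target replica count after the scale-down
	for ordinal := 0; ordinal < 5; ordinal++ {
		condemned := ordinal >= int(replicas) // true for ordinals 3 and 4
		fmt.Printf("pod-%d condemned: %v\n", ordinal, condemned)
	}
}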
	return false
}
default:
	klog.Errorf("Unknown policy %v!", set.Spec.PersistentVolumeClaimDeletePolicy)
should this return an error?
This function only returns a bool, and since the policy should be validated, it seems excessive to return an error.
I have changed the behavior to match Retain, though, since I'm going to have to deal with a nil pointer anyway now that the field is optional in the API.
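With the field optional, the accessor mentioned earlier (getPersistentVolumeClaimDeletePolicy) presumably defaults as in this sketch (the pointer type and exact shape are assumptions; at this point in the PR the field is not yet a pointer):

// Fall back to Retain when the gate is off or the optional field is unset,
// rather than returning an error for an unknown policy.
func getPersistentVolumeClaimDeletePolicy(set *apps.StatefulSet) apps.PersistentVolumeClaimDeletePolicyType {
	if !utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) ||
		set.Spec.PersistentVolumeClaimDeletePolicy == nil {
		return apps.Retain
	}
	return *set.Spec.PersistentVolumeClaimDeletePolicy
}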
// Sometimes the version and kind are not set in pod.TypeMeta. These are necessary for the ownerRef.
// TODO: there must be a better way to do this other than hardcoding the pod version?
podMeta := pod.TypeMeta
if podMeta.APIVersion == "" {
Does this happen in a real cluster or only unit tests?
Yes, this was happening in a real cluster. Maybe this means something has to wait a reconcile loop for the pod to be created?
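Continuing the hunk above, the fallback presumably hardcodes the pod GVK when the cache-supplied object lacks it (a sketch):

podMeta := pod.TypeMeta
if podMeta.APIVersion == "" {
	// Objects read from informer caches frequently have empty TypeMeta,
	// so fall back to the well-known version and kind for pods.
	podMeta.APIVersion = "v1"
}
if podMeta.Kind == "" {
	podMeta.Kind = "Pod"
}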
// updated and false otherwise.
func setOwnerRef(target, owner metav1.Object, ownerType *metav1.TypeMeta) bool {
	if owner.GetUID() == types.UID("") {
		panic(fmt.Sprintf("Bad owner for %s", owner.GetName()))
should we error instead?
Oops, more debugging code. Removed.
}

if !utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) && (oldSet == nil || oldSet.Spec.PersistentVolumeClaimDeletePolicy == apps.Retain) {
	newSet.Spec.PersistentVolumeClaimDeletePolicy = apps.Retain
I think this can be addressed in the API PR, but the policy should be a pointer; dropping it would then be setting it to nil.
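A sketch of that pointer-based approach, in the style of the dropDisabledFields helpers used in the API strategies (names and placement assumed):

func dropDisabledFields(newSet, oldSet *apps.StatefulSet) {
	if !utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) &&
		(oldSet == nil || oldSet.Spec.PersistentVolumeClaimDeletePolicy == nil) {
		// With a pointer field, dropping the policy is simply clearing it to nil.
		newSet.Spec.PersistentVolumeClaimDeletePolicy = nil
	}
}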
@@ -351,7 +351,7 @@ func buildControllerRoles() ([]rbacv1.ClusterRole, []rbacv1.ClusterRoleBinding)
	rbacv1helpers.NewRule("update").Groups(appsGroup).Resources("statefulsets/finalizers").RuleOrDie(),
	rbacv1helpers.NewRule("get", "create", "delete", "update", "patch").Groups(legacyGroup).Resources("pods").RuleOrDie(),
	rbacv1helpers.NewRule("get", "create", "delete", "update", "patch", "list", "watch").Groups(appsGroup).Resources("controllerrevisions").RuleOrDie(),
-	rbacv1helpers.NewRule("get", "create").Groups(legacyGroup).Resources("persistentvolumeclaims").RuleOrDie(),
+	rbacv1helpers.NewRule("get", "create", "delete", "update", "list").Groups(legacyGroup).Resources("persistentvolumeclaims").RuleOrDie(),
These new rules should be feature gated
Done (will be in the next push). Also, I don't think delete is needed.
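The gated rule might end up looking roughly like this sketch (exact verb set and surrounding structure assumed; delete dropped per the comment above):

rule := rbacv1helpers.NewRule("get", "create").Groups(legacyGroup).Resources("persistentvolumeclaims").RuleOrDie()
if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
	// The retention feature additionally lists claims and updates owner refs.
	rule = rbacv1helpers.NewRule("get", "create", "update", "list").Groups(legacyGroup).Resources("persistentvolumeclaims").RuleOrDie()
}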
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kow3ns, mattcary, smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
/retest
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain Kubernetes repos.
/retest
/test pull-kubernetes-unit
/retest
The delete PVC unit tests started flaking when this was merged (see https://storage.googleapis.com/k8s-triage/index.html?ci=0&pr=1&test=PVC). Looks like there's a data race inside runTestOverPVCRetentionPolicies.
/kind feature
/sig apps
What this PR does / why we need it:
This contains the implementation for the StatefulSet PVC auto-delete KEP
Which issue(s) this PR addresses:
kubernetes/enhancements#1847
Special notes for your reviewer:
This PR is unfortunately rather large.
The first two api commits were originally submitted in #99378, which was reverted when the implementation didn't get finished in time for 1.22.
The next three are this change: the implementation, unit tests, and finally the feature gate.
The unit tests required a bit of refactoring, as the faking was done at the wrong level to efficiently test the new behavior. So it's quite large, but splitting it up would be very difficult at this point :-(
The final commit is the feature gate. I think I'm doing it the right way based on other feature gate implementations, but please let me know if I missed some key bit.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: