Install pdcsi driver by default #102701
Conversation
Welcome @leiyiz!
Hi @leiyiz. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign I'll take an initial pass through this before pinging the owners.
Also, Leiyi, can you squash your commits?
(force-pushed from 23109d1 to 99424ad)
done
A couple of initial comments.
Could you please add a description of what the PR is about and what the steps were to test it?
/assign
(force-pushed from 0fd95bd to 158a1da)
/retest
2 similar comments
/retest |
/retest |
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: csi-gce-pd-controller-psp
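For context, a PodSecurityPolicy of this shape typically carries a spec like the sketch below. The spec fields shown are illustrative assumptions about what a CSI controller PSP commonly needs, not the actual manifest from this PR:

```yaml
# Illustrative sketch only: the spec below is an assumption about a typical
# CSI controller PSP, not the manifest proposed in this PR.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: csi-gce-pd-controller-psp
spec:
  privileged: true          # CSI controllers often require privileged pods
  volumes: ["*"]            # allow all volume types
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```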
Ideally if this is gce-pd it should be under https://github.com/kubernetes/kubernetes/tree/master/cluster/gce/addons rather than https://github.com/kubernetes/kubernetes/tree/master/cluster/addons
If this is something that cannot reasonably be expected by default for all kubernetes clusters everywhere as of v1.22, I'm wary of accepting this.
@cheftako Does gce/addons work the same, but only get invoked for GCE clusters? If so, then I agree that makes sense. Please realize that due to the total lack of documentation for anything here, we've been poking around as best we can, so any clarifying guidance you can give is appreciated!

@spiffxp Is your concern around addons vs addons/gce? This is needed for any cluster that supports the gce-pd storage provisioner. I'm not sure what the scope of that is: it's in-tree, so technically in all k8s clusters, but I suspect it only works on GCE? If that's the case then we can more explicitly scope the GCE PD driver to install only on GCE clusters. But that means that once migration is enabled, the gce-pd provisioner will silently stop working; I'm not sure if that's WAI or not. See the above comment for our need for guidance on this :-)
Yes, gce/addons works the same way as addons but should only be invoked for GCE clusters.
Agreed. Please feel free to reach out if I can be of assistance.
I mean this in the nicest and kindest possible way. The reason it's so hard and undocumented to add features to this code is because people shouldn't be adding features to this code. It's been deprecated since forever, and is effectively unstaffed. I know, I know, the commit history shows that people are clearly still shoving things into this. But I need to genuinely ask: is there some way you could accomplish your goals without modifying cluster/addons?

Maybe this is my ignorance about the state of addons showing through. Is there not some way of installing out-of-tree manifests during or after cluster provisioning without having to use cluster/addons? Is cluster/addons still the only place things can be added? Surely not...
I think I'm lacking context here. Is this the in-tree to out-of-tree cloud-provider migration we're talking about?
Ideally, the "default" cluster that is stood up for e2e tests used to gate merges and releases for this OSS project does not actually depend on cloud-provider-specific features.

It comes primarily from a desire for speed. I'd rather we break our addiction to slow, flaky e2e tests that require real actual clouds and VMs, and instead consider how we could land things against quicker, lower-fidelity tests more frequently, such that the cost of change is lower.

Secondarily, it comes from wanting to level the playing field. In the world where all cloud-provider-specific code is out-of-tree, I think it's unreasonable to expect each cloud provider to get their own merge-blocking presubmit to this repo to ensure their specific features work. The world of today, where some cloud providers were lucky enough to be in-tree before the decision was made to move out-of-tree, is not a level playing field, and it's not clear to me why the project would want that to continue.

Or, tl;dr: I have not seen a proposal on what merge-blocking / release-blocking testing looks like in the post-cloud-provider-extraction world, but this doesn't seem like the right path forward.
Thanks for your detailed reply @spiffxp. Here's the background & what I understood coming into this PR.
It sounds like you're proposing that we actually just drop running gce-pd tests in k/k altogether. Given my last bullet point above, that might actually be reasonable: since we already have test coverage, and since the backing implementation will be the pd csi driver, it might be more appropriate to test gce-pd in the pd csi repo anyway. The pd csi repo e2e tests set up the cluster itself, so the pd csi driver gets installed without having to muck with kube-up.

@msau42 What do you think about this? If we want to go this way we should probably have a sig-storage meeting to get consensus.

Also, is my assumption about needing to install the driver in kube-up correct? I may not be understanding how ginkgo does parallel testing. Technically it would be possible to deploy the driver in hack/e2e-internal/e2e-up.sh, but I think that's even messier and less maintainable than putting it in cluster/.
It's not just gce tests. It's any test that requires a storage provider, such as statefulset and default storageclass tests, which also run on other non-gce environments. There are a number of non-framework tests that would need to be either moved out or refactored into the framework: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage. The effort will take a few releases, and if we depend on this, it will delay the overall cloud-provider extraction effort.

Given our time constraints, I would prefer an approach where we keep the tests, but make them release-informing instead of blocking. So we still install the csi driver as part of whatever cluster deployment mechanism those jobs use (whether it be kube-up or the new cloud-provider-gcp kube-up). Over time we will work on migrating the tests, but right now we'll lose too much coverage if we just drop them.
Is there a cloud agnostic csi driver that could be installed to meet the needs of these tests? |
We have csi hostpath and mock drivers but they are not able to cover all of the features and codepaths that a real storage driver can cover. |
I've got a longer response drafted somewhere, but my tldr is: can we meet your goals by not enabling these drivers by default, and enabling them solely on the jobs that require them? The longer response is about understanding what loss of coverage we're talking about specifically. I was looking at kind vs gce e2e default/parallel jobs as a proxy.
I had a similar, though not identical, thought, where we run these tests in optional, release-informing jobs instead of the blocking jobs of today. But I think from a cloud-provider-gcp perspective, if the goal of "gce kube-up" is to have a functional working cluster on gcp, we should enable the pd csi driver by default.
I'll help put together a list of the tests that are not covered by existing storage framework tests, but the gist is:
1) any test case that depends on a default storageclass (such as statefulset and statefulset upgrade). These are also test cases that we have been considering for a future stateful workload conformance profile.
2) legacy cloud provider specific tests
Even though kind has a default local storageclass, it doesn't exercise a lot of the multi-node logic in the scheduler and attach detach controller, or storage features such as volume expansion.
I started to look through the tests. As Michelle said, it is the storage tests that exercise interesting things like multi-node and attaching. The statefulset tests (which seem to be the only nonstorage tests that need a default storageclass) would probably work fine with a hostpath or similar provider.

It would also be easy to enable the pd csi driver per-job, but doing so would still, IIUC, require adding this deployment to cluster/gce. A key detail is that we need the GCE SA token that's in the master for the driver; otherwise much larger changes are needed to the e2e framework.
If a statefulset test requires any data resiliency across nodes, such as the upgrade test, then it needs a real storage system. |
There don't seem to be any upgrade tests? There's one that looks at pods rejected by a node, but that doesn't require a storageclass. Nothing else seems to upgrade a node, so having a PVC pinned to a node (i.e., hostpath) should work fine. Unless there are upgrade tests that aren't explicitly run from test/e2e/apps/statefulset.go?
Upgrade tests are run from test/e2e/upgrades/... It looks like they are currently broken; opened #103697 to track.

Edit: looks like the statefulset upgrade tests are run from https://testgrid.k8s.io/google-gce-upgrade#gce-1.13-1.14-upgrade-cluster-new&width=5 ... also still broken, but expected to be fixed shortly.
Ah, it looks like e2e/upgrades/apps/statefulset.go has an implicit dependency on a default storageclass. So the blast radius for this is bigger than I thought it was. Anyway, here's the reasoning I was trying to run down.
If we can no longer get gce-specific changes into kube-up, then that practically means we can't support gce-specific e2e tests. I'd love to have upstream tests (maybe they would have caught the CSI performance problem that is currently blocking CSI migration in GKE), but TBH we're only barely keeping up with the maintenance burden on our internal storage + CSI driver tests as it is, and if there's going to be a lot of friction to maintain k/k e2e storage tests until the framework is moved over to the glorious external cloud provider future, I'd prefer to cut out that burden now. i.e., abandon this PR and give up supporting gce in kube-up. Remove the default storage class in cluster. Concretely, that means that gce-pd, post csi migration, is only tested internally in GKE, and gce-pd on GCE is effectively unsupported.

Maybe this position is a little extreme, but frankly I'm also feeling burned: I spent a month scrounging to learn that kube-up was the right place to update e2e tests for pd csi, two weeks of my SWE's time figuring out how addons work and making a PR, had that PR languish for a month and miss the code freeze, only to learn at this point from a slightly different audience that kube-up shouldn't be changed and that (exaggerating slightly) nothing can be done until cloud-provider usage in e2e tests is totally refactored.

I have great respect for the complexity of the problem and the work you're doing keeping k8s going, the utility of the cloud provider extraction and the difficulty of getting people to contribute to such a project, and how it's impossible to deprecate something (kube-up) if people like me keep putting new features into it. I don't feel like anyone here has done anything wrong. Everyone is acting in good faith and in the interests of kubernetes. But I'm still frustrated. I don't think we're being very effective here. gce-pd CSI migration missed 1.22, which may very well delay cloud provider extraction itself, IIUC.
I share the frustration. I want to be reasonable here but it's just not clear to me that enabling this by default for all jobs is wise right before we are trying to stabilize our CI signal. I get the impression it could result in previously unseen bugs across a wide variety of tests. I would be far more comfortable if we knew which specific jobs really needed this to accomplish your goals. Do we have signal on how tests behave with this enabled? What testing is not possible until this is enabled?
I'm not the only approver in this directory, and I'm late to the party here. If SIG Testing was consulted on this KEP and I missed it, my bad. Is there someone else who's an approver here who was, and is comfortable articulating why this should proceed as originally planned?
# Optional: install pd csi driver
ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"
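As a sketch of how a kube-up style script might consume this flag to decide whether to stage the addon manifests: the function name and manifest path below are hypothetical stand-ins, not the actual cluster/gce code.

```shell
#!/bin/bash
# Sketch only: setup_addon_manifests and the manifest directory name are
# hypothetical, not the real helpers in cluster/gce.
ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"

setup_addon_manifests() {
  # In the real scripts this would copy addon manifests into the directory
  # watched by the addon-manager; here we just report what would happen.
  echo "staging manifests from $1"
}

if [[ "${ENABLE_PDCSI_DRIVER}" == "true" ]]; then
  setup_addon_manifests "gce-pd-csi-driver"
fi
```

With the flag left unset, the default of `true` applies and the manifests are staged; exporting `ENABLE_PDCSI_DRIVER=false` before invoking the script skips the install.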
The default value is in conflict with the comment
Suggested change:
  # Optional: install pd csi driver
- ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"
+ ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-false}"
# Optional: install pd csi driver
ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"
Ditto
Suggested change:
  # Optional: install pd csi driver
- ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"
+ ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-false}"
e.g. I could see this being a reasonable compromise, with agreement to remove this later, if cluster addons is truly the only way our jobs can install anything prior to running tests |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/close

This is being done in cloud-provider-gcp based on all the comments above; see kubernetes/cloud-provider-gcp#265
@mattcary: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind feature
This PR installs the pd-csi controller on the master node by default through a static manifest at kube-up, and the pd-csi plugin daemonset through the addon-manager. Also provided is an option to not install those components, by setting the environment variable ENABLE_PDCSI_DRIVER when invoking kube-up.

Fixes #97985
KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/625-csi-migration
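The opt-out described above relies on standard shell default substitution. A minimal sketch of just the flag semantics (the kube-up invocation itself is omitted):

```shell
#!/bin/bash
# Demonstrates the ${VAR:-default} pattern: the driver installs unless the
# caller exports ENABLE_PDCSI_DRIVER=false before running kube-up.
unset ENABLE_PDCSI_DRIVER
echo "unset -> ${ENABLE_PDCSI_DRIVER:-true}"   # prints "unset -> true"

ENABLE_PDCSI_DRIVER=false
echo "false -> ${ENABLE_PDCSI_DRIVER:-true}"   # prints "false -> false"
```

So a caller who wants a cluster without the driver would export ENABLE_PDCSI_DRIVER=false in their environment before invoking kube-up, and everyone else gets the driver by default.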