
Install pdcsi driver by default #102701

Closed
wants to merge 1 commit

Conversation

leiyiz
Contributor

@leiyiz leiyiz commented Jun 8, 2021

/kind feature

This PR installs the pd-csi controller on the master node by default through a static manifest during kube-up, and the pd-csi plugin DaemonSet through the addon-manager. It also provides an option to skip installing those components by setting the environment variable ENABLE_PDCSI_DRIVER when invoking kube-up.
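As a usage sketch (hypothetical invocation: the kube-up entry point is the standard cluster/kube-up.sh, and the default shown follows the description above rather than the exact diff):

# Bring up a GCE cluster with the PD CSI components installed (the proposed default).
./cluster/kube-up.sh

# Opt out of the controller static manifest and the node DaemonSet addon.
ENABLE_PDCSI_DRIVER=false ./cluster/kube-up.sh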

Fixes #97985

KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/625-csi-migration

NONE

@k8s-ci-robot added labels (Jun 8, 2021): kind/feature, do-not-merge/release-note-label-needed, size/XL, do-not-merge/needs-sig, needs-triage
@k8s-ci-robot
Contributor

Welcome @leiyiz!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test label (Jun 8, 2021)
@k8s-ci-robot
Contributor

Hi @leiyiz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added labels (Jun 8, 2021): cncf-cla: yes, needs-priority, do-not-merge/contains-merge-commits
@leiyiz
Contributor Author

leiyiz commented Jun 8, 2021

@mattcary

@k8s-ci-robot added labels (Jun 8, 2021): area/provider/gcp, area/test, sig/cloud-provider, sig/storage, sig/testing; removed: do-not-merge/needs-sig
@mattcary
Contributor

mattcary commented Jun 8, 2021

/assign

I'll take an initial pass through this before pinging the owners.

@mattcary
Contributor

mattcary commented Jun 8, 2021

Also, Leiyi, can you squash your commits?

@k8s-ci-robot removed the do-not-merge/contains-merge-commits label (Jun 8, 2021)
@leiyiz
Contributor Author

leiyiz commented Jun 8, 2021

Also, Leiyi, can you squash your commits?

done

Contributor

@mattcary mattcary left a comment


A couple of initial comments.

cluster/gce/config-default.sh (review thread: outdated, resolved)
cluster/gce/manifests/pdcsi-controller.yaml (review thread: outdated, resolved)
cluster/addons/pd-csi/pd-csi.yaml (review thread: outdated, resolved)
@mauriciopoppe
Member

Could you please add a description of what the PR is about and what were the steps to test it?

@Jiawei0227
Contributor

/assign

@leiyiz
Contributor Author

leiyiz commented Jul 9, 2021

/retest

2 similar comments
@leiyiz
Contributor Author

leiyiz commented Jul 9, 2021

/retest

@leiyiz
Contributor Author

leiyiz commented Jul 9, 2021

/retest

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: csi-gce-pd-controller-psp

@spiffxp
Member

spiffxp commented Jul 12, 2021

If this is something that cannot reasonably be expected by default for all kubernetes clusters everywhere as of v1.22, I'm wary of accepting this

@mattcary
Contributor

@cheftako Does gce/addons work the same, but only gets invoked for GCE clusters? If so, then I agree that makes sense.

Please realize due to the total lack of documentation for anything here we've been poking around the best we can. So any clarifying guidance you can give is appreciated!

@spiffxp Is your concern around addons vs addons/gce? This is needed for any cluster that supports the gce-pd storage provisioner. I'm not sure what the scope of that is---it's in-tree, so technically in all k8s clusters, but I suspect it only works on GCE? If that's the case then we can more explicitly scope the GCE PD driver to install only on GCE clusters. But that means once migration is enabled the gce-pd provisioner will silently stop working; I'm not sure if that's WAI or not.

See the above comment for our need for guidance on this :-)

@cheftako
Member

@cheftako Does gce/addons work the same, but only gets invoked for GCE clusters? If so, then I agree that makes sense.

Yes, gce/addons works the same way as addons but should only be invoked for GCE clusters.
I'm not sure if there are any other platforms still using cluster but I still think it makes the intent more clear.

Please realize due to the total lack of documentation for anything here we've been poking around the best we can. So any clarifying guidance you can give is appreciated!

Agreed. Please feel free to reach out if I can be of assistance.

@spiffxp
Member

spiffxp commented Jul 13, 2021

Please realize due to the total lack of documentation for anything here we've been poking around the best we can. So any clarifying guidance you can give is appreciated!

I mean this in the nicest and kindest possible way. The reason it's so hard and undocumented on how to add features to this code is because people shouldn't be adding features to this code. It's been deprecated since forever, and is effectively unstaffed.

I know, I know, the commit history shows that clearly people are still shoving things into this. But I need to genuinely ask, is there some way you could accomplish your goals without modifying cluster/? Like what about https://github.com/kubernetes/cloud-provider-gcp, is that not the place we're trying to have GCP-specific things tested? That seems to be where the image used in these manifests is coming from...

Maybe this is my ignorance about the state of addons showing through. Is there not some way of installing out-of-tree manifests during or after cluster provisioning without having to use cluster/addons? Is cluster/addons still the only place things can be added? Surely not...

But that means once migration is enabled the gce-pd provisioner will silently stop working, I'm not sure if that's WAI or not.

I think I'm lacking context here. Is this the in-tree to out-of-tree cloud-provider migration we're talking about?

Is your concern around addons vs addons/gce?

Ideally, the "default" cluster that is stood up for e2e tests used to gate merges and releases for this OSS project does not actually depend on cloud-provider-specific features. It comes primarily from a desire for speed. I'd rather we break our addiction to slow, flaky e2e tests that require real actual clouds and VMs... and instead consider how we could land things against quicker, lower-fidelity tests more frequently, such that the cost of change is lower.

Secondarily, it comes from wanting to level the playing field. In the world where all cloud-provider-specific code is out-of-tree, I think it's unreasonable to expect each cloud provider to get their own merge-blocking presubmit to this repo to ensure their specific features work. The world of today where some cloud providers were lucky enough to be in-tree before the decision was made to move out-of-tree is not a level playing field, and it's not clear to me why the project would want that to continue.

Or, tl;dr: I have not seen a proposal on what merge-blocking / release-blocking testing looks like in the post-cloud-provider-extraction world, but this doesn't seem like the right path forward

@mattcary
Contributor

Thanks for your detailed reply @spiffxp.

Here's the background & what I understood coming into this PR.

  • CSI migration is the deprecation of in-tree storage providers, in this case gce-pd, in favor of CSI drivers. It accomplishes the cloud provider migration for storage drivers as a side effect, but really it was aimed at pulling storage drivers in general out of k/k, not just cloud providers. See this KEP for more details.
  • kube-up is how e2e tests are run.
  • The PD CSI driver needs to be installed once, globally, in order to continue to run gce-pd tests in parallel post-migration. This needs to be done as part of kube-up and not part of the test suite in order for tests to run in parallel.
  • The gce-pd provider is only tested via the k/k e2e jobs (presubmit and some periodics, eg this).
  • The PD CSI driver is continuously tested from its repo (eg). We are also running csi migration jobs to test gce-pd under migration.

It sounds like you're proposing that we actually just drop running gce-pd tests in k/k altogether. Given my last bullet point above, that might actually be reasonable: we already have test coverage there, and since the backing implementation will be the pd csi driver, it might be more appropriate to test gce-pd in the pd csi repo anyway. The pd csi repo's e2e tests set up the cluster themselves, so the pd csi driver gets installed without having to muck with kube-up.

@msau42 What do you think about this? If we want to go this way we should probably have a sig-storage meeting to get consensus.

Also, is my assumption about needing to install the driver in kube-up correct? I may be misunderstanding how ginkgo does parallel testing.

Technically it would be possible to deploy the driver in hack/e2e-internal/e2e-up.sh, but I think that's even messier and less maintainable than putting it in cluster/.
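For concreteness, here is a rough, hypothetical sketch of what the gating inside cluster/ could look like; the helper name setup-addon-manifests and the source path below are meant to be illustrative of the existing GCE configure scripts, not this PR's literal diff:

# Hypothetical sketch: install the PD CSI components only when enabled.
if [[ "${ENABLE_PDCSI_DRIVER:-true}" == "true" ]]; then
  # Static manifest for the controller, run directly by the kubelet on the master
  # (source path is illustrative).
  cp "${KUBE_HOME}/kube-manifests/kubernetes/gce/pdcsi-controller.yaml" \
    /etc/kubernetes/manifests/
  # Node DaemonSet and RBAC objects handed off to the addon-manager.
  setup-addon-manifests "addons" "pd-csi"
fi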

@msau42
Member

msau42 commented Jul 13, 2021

It sounds like you're proposing that we actually just drop running gce-pd tests in k/k altogether

It's not just gce tests. It's any test that requires a storage provider, such as statefulset and default storageclass, which also run on other non-gce environments. There are a number of non-framework tests that would need to be either moved out or refactored into the framework: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage. The effort will take a few releases, and if we depend on this, will delay the overall cloud-provider extraction effort.

Given our time constraints, I would prefer an approach where we keep the tests, but make them release-informing instead of blocking. So we still install the csi driver as part of whatever cluster deployment mechanism those jobs use (whether it be kube-up or the new cloud-provider-gcp kube-up). Over time we will work on migrating the tests, but right now we'll lose too much coverage if we just drop them.

@spiffxp
Member

spiffxp commented Jul 13, 2021

Is there a cloud agnostic csi driver that could be installed to meet the needs of these tests?

@msau42
Member

msau42 commented Jul 13, 2021

Is there a cloud agnostic csi driver that could be installed to meet the needs of these tests?

We have csi hostpath and mock drivers but they are not able to cover all of the features and codepaths that a real storage driver can cover.

@spiffxp
Member

spiffxp commented Jul 14, 2021

I've got a longer response drafted somewhere but my tldr is, can we meet your goals by not enabling these drivers by default, and solely on the jobs that require them?

The longer response is about understanding what loss of coverage we're talking about specifically. I was looking at kind vs gce e2e default/parallel jobs as a proxy

@msau42
Member

msau42 commented Jul 14, 2021

Can we meet your goals by not enabling these drivers by default, and solely on the jobs that require them?

I had a similar, though not identical, thought: run these tests in optional, release-informing jobs instead of today's blocking jobs. But from a cloud-provider-gcp perspective, if the goal of "gce kube-up" is to have a functional, working cluster on gcp, I think we should enable the pd csi driver by default.

what loss of coverage

I'll help put together a list of the tests that are not covered by existing storage framework tests, but the gist is: 1) any test case that depends on a default storageclass (such as statefulset and statefulset upgrade). These are also test cases that we have been considering for a future stateful workload conformance profile. 2) legacy cloud provider specific tests

Even though kind has a default local storageclass, it doesn't exercise a lot of the multi-node logic in the scheduler and attach detach controller, and storage features such as volume expansion.

@mattcary
Contributor

I started to look through the tests. As Michelle said it is the storage tests that exercise interesting things like multi-node and attaching. The statefulset tests (which seem to be the only nonstorage tests that need a default storageclass) would probably work fine with a hostpath or similar provider?

It would also be easy to enable the pd csi driver per-job, but doing so would still IIUC require adding this deployment to cluster/gce. A key detail is that we need the GCE SA token that's in the master for the driver. Otherwise much larger changes are needed to the e2e framework.

@msau42
Member

msau42 commented Jul 14, 2021

The statefulset tests (which seem to be the only nonstorage tests that need a default storageclass) would probably work fine with a hostpath or similar provider?

If a statefulset test requires any data resiliency across nodes, such as the upgrade test, then it needs a real storage system.

@mattcary
Contributor

If a statefulset test requires any data resiliency across nodes, such as the upgrade test, then it needs a real storage system.

There don't seem to be any upgrade tests? There's one that looks at pods rejected by a node, but that doesn't require a storageclass.

Nothing else seems to upgrade a node, so having a PVC pinned to a node (ie, hostpath) should work fine.

Unless there are upgrade tests that aren't explicitly run from test/e2e/apps/statefulset.go?

@liggitt
Member

liggitt commented Jul 14, 2021

Unless there are upgrade tests that aren't explicitly run from test/e2e/apps/statefulset.go?

upgrade tests are run from test/e2e/upgrades/...

it looks like they are currently broken, opened #103697 to track

edit: looks like the statefulset upgrade tests are run from https://testgrid.k8s.io/google-gce-upgrade#gce-1.13-1.14-upgrade-cluster-new&width=5 ... also still broken, but expected to be fixed shortly

@mattcary
Contributor

Ah, it looks like e2e/upgrades/apps/statefulset.go has an implicit dependency on a default storageclass. So the blast radius for this is bigger than I thought it was.

Anyway, here's the reasoning I was trying to run down.

  1. e2e tests are cluster/kube-up + test/e2e
  2. By @spiffxp's comments, kube-up is deprecated, especially from the point of view of cloud provider support
  3. Therefore testing cloud-provider specific stuff in the e2e tests is deprecated and we should stop testing that in k/k.

If we can no longer get gce-specific changes into kube-up, then that practically means we can't support gce-specific e2e tests. I'd love to have upstream tests (maybe they would have caught the CSI performance problem that is currently blocking CSI migration in GKE), but TBH we're only barely keeping up with the maintenance burden on our internal storage + CSI driver tests as it is, and if there's going to be a lot of friction to maintain k/k e2e storage tests until the framework is moved over to the glorious external cloud provider future, I'd prefer to cut out that burden now.

ie, abandon this PR and give up supporting gce in kube-up. Remove the default storage class in cluster. Concretely that means that gce-pd, post csi migration, is only tested internally in GKE and gce-pd on GCE is effectively unsupported.

Maybe this position is a little extreme, but frankly I'm also feeling burned: I spent a month scrounging to learn that kube-up was the right place to update e2e tests for pd csi, then two weeks of my SWE's time figuring out how addons work and making a PR, then had that PR languish for a month and miss the code freeze, only to learn at this point, from a slightly different audience, that kube-up shouldn't be changed and that (exaggerating slightly) nothing can be done until cloud-provider usage in e2e tests is totally refactored.

I have great respect for the complexity of the problem and the work you're doing keeping k8s going, the utility of the cloud provider extraction and the difficulty of getting people to contribute to such a project, and how it's impossible to deprecate something (kube-up) if people like me keep putting new features into it. I don't feel like anyone here has done anything wrong. Everyone is acting with good faith & in the interests of kubernetes. But I'm still frustrated.

I don't think we're being very effective here. gce-pd CSI migration missed 1.22, which may very well delay cloud provider extraction itself IIUC.

Member

@spiffxp spiffxp left a comment


I share the frustration. I want to be reasonable here but it's just not clear to me that enabling this by default for all jobs is wise right before we are trying to stabilize our CI signal. I get the impression it could result in previously unseen bugs across a wide variety of tests. I would be far more comfortable if we knew which specific jobs really needed this to accomplish your goals. Do we have signal on how tests behave with this enabled? What testing is not possible until this is enabled?

I'm not the only approver in this directory, and I'm late to the party here. If SIG Testing was consulted on this KEP and I missed it, my bad. Is there someone else who's an approver here who was, and is comfortable articulating why this should proceed as originally planned?

Comment on lines +424 to +426
# Optional: install pd csi driver
ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"

Member


The default value is in conflict with the comment

Suggested change:
-  # Optional: install pd csi driver
-  ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"
+  # Optional: install pd csi driver
+  ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-false}"

Comment on lines +476 to +478
# Optional: install pd csi driver
ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"

Member


Ditto

Suggested change:
-  # Optional: install pd csi driver
-  ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"
+  # Optional: install pd csi driver
+  ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-false}"
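Both suggestions hinge on bash's ${VAR:-default} expansion: a value exported in the environment before kube-up runs overrides the literal default written in config-default.sh. A quick generic illustration (plain bash, not part of this PR's diff):

# If ENABLE_PDCSI_DRIVER is unset (or empty), the literal default applies.
unset ENABLE_PDCSI_DRIVER
echo "${ENABLE_PDCSI_DRIVER:-true}"   # prints "true"

# A value set in the environment at invocation time wins over the default.
export ENABLE_PDCSI_DRIVER=false
echo "${ENABLE_PDCSI_DRIVER:-true}"   # prints "false"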

@spiffxp
Member

spiffxp commented Jul 15, 2021

It would also be easy to enable the pd csi driver per-job, but doing so would still IIUC require adding this deployment to cluster/gce.

e.g. I could see this being a reasonable compromise, with agreement to remove this later, if cluster addons is truly the only way our jobs can install anything prior to running tests

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Oct 13, 2021)
@mattcary
Contributor

/close

This is being done in cloud-provider-gcp based on all the comments above, see kubernetes/cloud-provider-gcp#265

@k8s-ci-robot
Contributor

@mattcary: Closed this PR.

In response to this:

/close

This is being done in cloud-provider-gcp based on all the comments above, see kubernetes/cloud-provider-gcp#265

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/provider/gcp, area/test, cncf-cla: yes, kind/feature, lifecycle/stale, needs-priority, ok-to-test, release-note-none, sig/cloud-provider, sig/storage, sig/testing, size/XL, triage/accepted

Successfully merging this pull request may close these issues.

Install gce-pd-csi-driver by default in the kube-up script