
Install pdcsi driver by default #102701

Closed
wants to merge 1 commit

Conversation

leiyiz
Contributor

@leiyiz leiyiz commented Jun 8, 2021

/kind feature

This PR installs the pd-csi controller on the master node by default through a static manifest during kube-up, and the pd-csi plugin DaemonSet through the addon-manager. It also provides an option to skip installing those components by setting the environment variable ENABLE_PDCSI_DRIVER when invoking kube-up.
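As a usage sketch (hypothetical invocation: the kube-up entry point is the standard cluster/kube-up.sh, and the default shown follows the description above rather than the exact diff):

# Bring up a GCE cluster with the PD CSI components installed (the proposed default).
./cluster/kube-up.sh

# Opt out of the controller static manifest and the node DaemonSet addon.
ENABLE_PDCSI_DRIVER=false ./cluster/kube-up.sh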

Fixes #97985

KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/625-csi-migration

NONE

@k8s-ci-robot added labels (Jun 8, 2021): kind/feature, do-not-merge/release-note-label-needed, size/XL, do-not-merge/needs-sig, needs-triage
@k8s-ci-robot
Contributor

Welcome @leiyiz!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test label (Jun 8, 2021)
@k8s-ci-robot
Contributor

Hi @leiyiz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added labels (Jun 8, 2021): cncf-cla: yes, needs-priority, do-not-merge/contains-merge-commits
@leiyiz
Contributor Author

leiyiz commented Jun 8, 2021

@mattcary

@k8s-ci-robot added labels (Jun 8, 2021): area/provider/gcp, area/test, sig/cloud-provider, sig/storage, sig/testing; removed: do-not-merge/needs-sig
@mattcary
Contributor

mattcary commented Jun 8, 2021

/assign

I'll take an initial pass through this before pinging the owners.

@mattcary
Contributor

mattcary commented Jun 8, 2021

Also, Leiyi, can you squash your commits?

@k8s-ci-robot removed the do-not-merge/contains-merge-commits label (Jun 8, 2021)
@leiyiz
Contributor Author

leiyiz commented Jun 8, 2021

Also, Leiyi, can you squash your commits?

done

Contributor

@mattcary mattcary left a comment


A couple of initial comments.

cluster/gce/config-default.sh (review thread: outdated, resolved)
cluster/gce/manifests/pdcsi-controller.yaml (review thread: outdated, resolved)
cluster/addons/pd-csi/pd-csi.yaml (review thread: outdated, resolved)
@mauriciopoppe
Member

Could you please add a description of what the PR is about and what were the steps to test it?

@Jiawei0227
Contributor

/assign

@leiyiz
Contributor Author

leiyiz commented Jul 9, 2021

/retest

2 similar comments
@leiyiz
Contributor Author

leiyiz commented Jul 9, 2021

/retest

@leiyiz
Contributor Author

leiyiz commented Jul 9, 2021

/retest

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: csi-gce-pd-controller-psp

@spiffxp
Member

spiffxp commented Jul 12, 2021

If this is something that cannot reasonably be expected by default for all kubernetes clusters everywhere as of v1.22, I'm wary of accepting this

@mattcary
Contributor

@cheftako Does gce/addons work the same, but only gets invoked for GCE clusters? If so, then I agree that makes sense.

Please realize due to the total lack of documentation for anything here we've been poking around the best we can. So any clarifying guidance you can give is appreciated!

@spiffxp Is your concern around addons vs addons/gce? This is needed for any cluster that supports the gce-pd storage provisioner. I'm not sure what the scope of that is---it's in-tree, so technically in all k8s clusters, but I suspect it only works on GCE? If that's the case then we can more explicitly scope the GCE PD driver to install only on GCE clusters. But that means once migration is enabled the gce-pd provisioner will silently stop working; I'm not sure if that's WAI or not.

See the above comment for our need for guidance on this :-)

@cheftako
Member

@cheftako Does gce/addons work the same, but only gets invoked for GCE clusters? If so, then I agree that makes sense.

Yes, gce/addons works the same way as addons but should only be invoked for GCE clusters.
I'm not sure if there are any other platforms still using cluster but I still think it makes the intent more clear.

Please realize due to the total lack of documentation for anything here we've been poking around the best we can. So any clarifying guidance you can give is appreciated!

Agreed. Please feel free to reach out if I can be of assistance.

@spiffxp
Member

spiffxp commented Jul 13, 2021

Please realize due to the total lack of documentation for anything here we've been poking around the best we can. So any clarifying guidance you can give is appreciated!

I mean this in the nicest and kindest possible way. The reason it's so hard and undocumented on how to add features to this code is because people shouldn't be adding features to this code. It's been deprecated since forever, and is effectively unstaffed.

I know, I know, the commit history shows that clearly people are still shoving things into this. But I need to genuinely ask, is there some way you could accomplish your goals without modifying cluster/? Like what about https://github.com/kubernetes/cloud-provider-gcp, is that not the place we're trying to have GCP-specific things tested? That seems to be where the image used in these manifests is coming from...

Maybe this is my ignorance about the state of addons showing through. Is there not some way of installing out-of-tree manifests during or after cluster provisioning without having to use cluster/addons? Is cluster/addons still the only place things can be added? Surely not...

But that means once migration is enabled the gce-pd provisioner will silently stop working, I'm not sure if that's WAI or not.

I think I'm lacking context here. Is this the in-tree to out-of-tree cloud-provider migration we're talking about?

Is your concern around addons vs addons/gce?

Ideally, the "default" cluster that is stood up for e2e tests used to gate merges and releases for this OSS project does not actually depend on cloud-provider-specific features. It comes primarily from a desire for speed. I'd rather we break our addiction to slow, flaky e2e tests that require real actual clouds and VMs... and instead consider how we could land things against quicker, lower-fidelity tests more frequently, such that the cost of change is lower.

Secondarily, it comes from wanting to level the playing field. In the world where all cloud-provider-specific code is out-of-tree, I think it's unreasonable to expect each cloud provider to get their own merge-blocking presubmit to this repo to ensure their specific features work. The world of today where some cloud providers were lucky enough to be in-tree before the decision was made to move out-of-tree is not a level playing field, and it's not clear to me why the project would want that to continue.

Or, tl;dr: I have not seen a proposal on what merge-blocking / release-blocking testing looks like in the post-cloud-provider-extraction world, but this doesn't seem like the right path forward

@mattcary
Contributor

Thanks for your detailed reply @spiffxp.

Here's the background & what I understood coming into this PR.

  • CSI migration is the deprecation of in-tree storage providers, in this case gce-pd, in favor of CSI drivers. It accomplishes the cloud provider migration for storage drivers as a side effect, but really it was aimed at pulling storage drivers in general out of k/k, not just cloud providers. See this KEP for more details.
  • kube-up is how e2e tests are run.
  • The PD CSI driver needs to be installed once, globally, in order to continue to run gce-pd tests in parallel post-migration. This needs to be done as part of kube-up and not part of the test suite in order for tests to run in parallel.
  • The gce-pd provider is only tested via the k/k e2e jobs (presubmit and some periodics, eg this).
  • The PD CSI driver is continuously tested from its repo (eg). We are also running csi migration jobs to test gce-pd under migration.

It sounds like you're proposing that we actually just drop running gce-pd tests in k/k altogether. Given my last bullet point above, that might actually be reasonable: we already have test coverage there, and since the backing implementation will be the pd csi driver, it might be more appropriate to test gce-pd in the pd csi repo anyway. The pd csi repo's e2e tests set up the cluster themselves, so the pd csi driver gets installed without having to muck with kube-up.

@msau42 What do you think about this? If we want to go this way we should probably have a sig-storage meeting to get consensus.

Also, is my assumption about needing to install the driver in kube-up correct? I may be misunderstanding how ginkgo does parallel testing.

Technically it would be possible to deploy the driver in hack/e2e-internal/e2e-up.sh, but I think that's even messier and less maintainable than putting it in cluster/.
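For concreteness, here is a rough, hypothetical sketch of what the gating inside cluster/ could look like; the helper name setup-addon-manifests and the source path below are meant to be illustrative of the existing GCE configure scripts, not this PR's literal diff:

# Hypothetical sketch: install the PD CSI components only when enabled.
if [[ "${ENABLE_PDCSI_DRIVER:-true}" == "true" ]]; then
  # Static manifest for the controller, run directly by the kubelet on the master
  # (source path is illustrative).
  cp "${KUBE_HOME}/kube-manifests/kubernetes/gce/pdcsi-controller.yaml" \
    /etc/kubernetes/manifests/
  # Node DaemonSet and RBAC objects handed off to the addon-manager.
  setup-addon-manifests "addons" "pd-csi"
fi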

@msau42
Member

msau42 commented Jul 13, 2021

It sounds like you're proposing that we actually just drop running gce-pd tests in k/k altogether

It's not just gce tests. It's any test that requires a storage provider, such as statefulset and default storageclass, which also run on other non-gce environments. There are a number of non-framework tests that would need to be either moved out or refactored into the framework: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage. The effort will take a few releases, and if we depend on this, will delay the overall cloud-provider extraction effort.

Given our time constraints, I would prefer an approach where we keep the tests, but make them release-informing instead of blocking. So we still install the csi driver as part of whatever cluster deployment mechanism those jobs use (whether it be kube-up or the new cloud-provider-gcp kube-up). Over time we will work on migrating the tests, but right now we'll lose too much coverage if we just drop them.

@spiffxp
Member

spiffxp commented Jul 13, 2021

Is there a cloud agnostic csi driver that could be installed to meet the needs of these tests?

@msau42
Member

msau42 commented Jul 13, 2021

Is there a cloud agnostic csi driver that could be installed to meet the needs of these tests?

We have csi hostpath and mock drivers but they are not able to cover all of the features and codepaths that a real storage driver can cover.

@spiffxp
Member

spiffxp commented Jul 14, 2021

I've got a longer response drafted somewhere but my tldr is, can we meet your goals by not enabling these drivers by default, and solely on the jobs that require them?

The longer response is about understanding what loss of coverage we're talking about specifically. I was looking at kind vs gce e2e default/parallel jobs as a proxy

@msau42
Member

msau42 commented Jul 14, 2021

Can we meet your goals by not enabling these drivers by default, and solely on the jobs that require them?

I had a similar, though not identical, thought: run these tests in optional, release-informing jobs instead of today's blocking jobs. But from a cloud-provider-gcp perspective, if the goal of "gce kube-up" is to have a functional, working cluster on gcp, I think we should enable the pd csi driver by default.

what loss of coverage

I'll help put together a list of the tests that are not covered by existing storage framework tests, but the gist is: 1) any test case that depends on a default storageclass (such as statefulset and statefulset upgrade). These are also test cases that we have been considering for a future stateful workload conformance profile. 2) legacy cloud provider specific tests

Even though kind has a default local storageclass, it doesn't exercise a lot of the multi-node logic in the scheduler and attach detach controller, and storage features such as volume expansion.

@mattcary
Contributor

I started to look through the tests. As Michelle said it is the storage tests that exercise interesting things like multi-node and attaching. The statefulset tests (which seem to be the only nonstorage tests that need a default storageclass) would probably work fine with a hostpath or similar provider?

It would also be easy to enable the pd csi driver per-job, but doing so would still IIUC require adding this deployment to cluster/gce. A key detail is that we need the GCE SA token that's in the master for the driver. Otherwise much larger changes are needed to the e2e framework.

@msau42
Member

msau42 commented Jul 14, 2021

The statefulset tests (which seem to be the only nonstorage tests that need a default storageclass) would probably work fine with a hostpath or similar provider?

If a statefulset test requires any data resiliency across nodes, such as the upgrade test, then it needs a real storage system.

@mattcary
Contributor

If a statefulset test requires any data resiliency across nodes, such as the upgrade test, then it needs a real storage system.

There don't seem to be any upgrade tests? There's one that looks at pods rejected by a node, but that doesn't require a storageclass.

Nothing else seems to upgrade a node, so having a PVC pinned to a node (ie, hostpath) should work fine.

Unless there are upgrade tests that aren't explicitly run from test/e2e/apps/statefulset.go?

@liggitt
Member

liggitt commented Jul 14, 2021

Unless there are upgrade tests that aren't explicitly run from test/e2e/apps/statefulset.go?

upgrade tests are run from test/e2e/upgrades/...

it looks like they are currently broken, opened #103697 to track

edit: looks like the statefulset upgrade tests are run from https://testgrid.k8s.io/google-gce-upgrade#gce-1.13-1.14-upgrade-cluster-new&width=5 ... also still broken, but expected to be fixed shortly

@mattcary
Contributor

Ah, it looks like e2e/upgrades/apps/statefulset.go has an implicit dependency on a default storageclass. So the blast radius for this is bigger than I thought it was.

Anyway, here's the reasoning I was trying to run down.

  1. e2e tests are cluster/kube-up + test/e2e
  2. By @spiffxp's comments, kube-up is deprecated, especially from the point of view of cloud provider support
  3. Therefore testing cloud-provider specific stuff in the e2e tests is deprecated and we should stop testing that in k/k.

If we can no longer get gce-specific changes into kube-up, then that practically means we can't support gce-specific e2e tests. I'd love to have upstream tests (maybe they would have caught the CSI performance problem that is currently blocking CSI migration in GKE), but TBH we're only barely keeping up with the maintenance burden on our internal storage + CSI driver tests as it is, and if there's going to be a lot of friction to maintain k/k e2e storage tests until the framework is moved over to the glorious external cloud provider future, I'd prefer to cut out that burden now.

ie, abandon this PR and give up supporting gce in kube-up. Remove the default storage class in cluster. Concretely that means that gce-pd, post csi migration, is only tested internally in GKE and gce-pd on GCE is effectively unsupported.

Maybe this position is a little extreme, but frankly I'm also feeling burned: I spent a month scrounging to learn that kube-up was the right place to update e2e tests for pd csi, then two weeks of my SWE's time figuring out how addons work and making a PR, then had that PR languish for a month and miss the code freeze, only to learn at this point, from a slightly different audience, that kube-up shouldn't be changed and that (exaggerating slightly) nothing can be done until cloud-provider usage in e2e tests is totally refactored.

I have great respect for the complexity of the problem and the work you're doing keeping k8s going, the utility of the cloud provider extraction and the difficulty of getting people to contribute to such a project, and how it's impossible to deprecate something (kube-up) if people like me keep putting new features into it. I don't feel like anyone here has done anything wrong. Everyone is acting with good faith & in the interests of kubernetes. But I'm still frustrated.

I don't think we're being very effective here. gce-pd CSI migration missed 1.22, which may very well delay cloud provider extraction itself IIUC.

Member

@spiffxp spiffxp left a comment


I share the frustration. I want to be reasonable here but it's just not clear to me that enabling this by default for all jobs is wise right before we are trying to stabilize our CI signal. I get the impression it could result in previously unseen bugs across a wide variety of tests. I would be far more comfortable if we knew which specific jobs really needed this to accomplish your goals. Do we have signal on how tests behave with this enabled? What testing is not possible until this is enabled?

I'm not the only approver in this directory, and I'm late to the party here. If SIG Testing was consulted on this KEP and I missed it, my bad. Is there someone else who's an approver here who was, and is comfortable articulating why this should proceed as originally planned?

Comment on lines +424 to +426
# Optional: install pd csi driver
ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"

Member


The default value is in conflict with the comment

Suggested change:
-  # Optional: install pd csi driver
-  ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"
+  # Optional: install pd csi driver
+  ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-false}"

Comment on lines +476 to +478
# Optional: install pd csi driver
ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"

Member


Ditto

Suggested change:
-  # Optional: install pd csi driver
-  ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-true}"
+  # Optional: install pd csi driver
+  ENABLE_PDCSI_DRIVER="${ENABLE_PDCSI_DRIVER:-false}"
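Both suggestions hinge on bash's ${VAR:-default} expansion: a value exported in the environment before kube-up runs overrides the literal default written in config-default.sh. A quick generic illustration (plain bash, not part of this PR's diff):

# If ENABLE_PDCSI_DRIVER is unset (or empty), the literal default applies.
unset ENABLE_PDCSI_DRIVER
echo "${ENABLE_PDCSI_DRIVER:-true}"   # prints "true"

# A value set in the environment at invocation time wins over the default.
export ENABLE_PDCSI_DRIVER=false
echo "${ENABLE_PDCSI_DRIVER:-true}"   # prints "false"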

@spiffxp
Member

spiffxp commented Jul 15, 2021

It would also be easy to enable the pd csi driver per-job, but doing so would still IIUC require adding this deployment to cluster/gce.

e.g. I could see this being a reasonable compromise, with agreement to remove this later, if cluster addons is truly the only way our jobs can install anything prior to running tests

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Oct 13, 2021)
@mattcary
Contributor

/close

This is being done in cloud-provider-gcp based on all the comments above, see kubernetes/cloud-provider-gcp#265

@k8s-ci-robot
Contributor

@mattcary: Closed this PR.

In response to this:

/close

This is being done in cloud-provider-gcp based on all the comments above, see kubernetes/cloud-provider-gcp#265

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/provider/gcp, area/test, cncf-cla: yes, kind/feature, lifecycle/stale, needs-priority, ok-to-test, release-note-none, sig/cloud-provider, sig/storage, sig/testing, size/XL, triage/accepted

Successfully merging this pull request may close these issues.

Install gce-pd-csi-driver by default in the kube-up script