Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make subnet spec id field required for SSA to work with CC #3748

Merged
merged 5 commits into from Sep 27, 2022

Conversation

sedefsavas
Copy link
Contributor

@sedefsavas sedefsavas commented Sep 25, 2022

What this PR does / why we need it:
This PR makes the id field in the subnet spec required to solve SSA coauthoring issue in kubernetes-sigs/cluster-api#6320.

Before this PR, some fields in network.subnets list could be set, although the infrastructure was managed by CAPA and this was unintentional, just because there were no webhook checks in place.

After this PR, the only 2 possible cases we support are:

  • User can set any/all fields in subnet spec when using external infrastructure as long as id of the subnet is provided.
  • User cannot set any field in subnet spec if infrastructure is managed by CAPA.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3530
Fixes #3528

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 25, 2022
@sedefsavas
Copy link
Contributor Author

/test ?

@k8s-ci-robot
Copy link
Contributor

@sedefsavas: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sedefsavas
Copy link
Contributor Author

/test pull-cluster-api-provider-aws-e2e-clusterclass

@sedefsavas
Copy link
Contributor Author

/test pull-cluster-api-provider-aws-e2e-clusterclass

@sedefsavas
Copy link
Contributor Author

/test pull-cluster-api-provider-aws-e2e-clusterclass

@Skarlso
Copy link
Contributor

Skarlso commented Sep 26, 2022

@sedefsavas Hi 👋 For my sake, would you mind clarifying the abbreviations in the title, please? :) Is CC ClusterClass?

@sedefsavas
Copy link
Contributor Author

@Skarlso I added the issues that this PR solves.
SSA -> Server side apply
CC-> ClusterClass

@sedefsavas sedefsavas changed the title [WIP] Make subnet spec id field required for SSA to work with CC Make subnet spec id field required for SSA to work with CC Sep 26, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2022
@Skarlso
Copy link
Contributor

Skarlso commented Sep 27, 2022

@sedefsavas This is now ready for review, right? :)

@sedefsavas
Copy link
Contributor Author

sedefsavas commented Sep 27, 2022

@sedefsavas This is now ready for review, right? :)

Yes, ready. Passed ClusterClass tests.
Let me run the regular tests too.

@Skarlso
Copy link
Contributor

Skarlso commented Sep 27, 2022

Awesome, I shall take a crack at it. :)

@sedefsavas
Copy link
Contributor Author

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@@ -242,7 +242,7 @@ func (v *VPCSpec) IsIPv6Enabled() bool {
// SubnetSpec configures an AWS Subnet.
type SubnetSpec struct {
// ID defines a unique identifier to reference this resource.
ID string `json:"id,omitempty"`
ID string `json:"id"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what happens in case of existing clusters is that their ID will be set to a default empty string, right?
Does that affect the cluster at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means if a subnet spec is defined, it should have the id field.
Making id a unique identifier solves SSA coauthoring issue. More info in the CAPI issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I saw this one. Just was wondering about existing configs since now it will not just omit the value but update it to an empty string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For existing configs (by this I mean existing clusters), even though initially id field was empty, after the creation, ids are filled by CAPA controllers, so no existing cluster should have their id fields empty during upgrading to v1beta2.

If you are asking about using an existing template, they will no longer work for creating new clusters.
This is a breaking change for existing templates that use the unsupported use cases and the reason why we needed a new API version is because of this. We will need to document this properly with the v1beta2 release.

@@ -0,0 +1,28 @@
- op: add
Copy link
Contributor

@Skarlso Skarlso Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... (and this is probably my lack of knowledge in this area, but please bear with me) The way these different from currently configurable networking, like just plain adding your VPC ID into the cluster config, is that these values will now be managed fields and used in SSA. So the user doesn't have to change things in a config file but rather apply new patch values? Did I get that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this particular scenario tests BYO VPC and subnets which was not working before SSA and id field being compulsory.

@sedefsavas
Copy link
Contributor Author

Clusterclass and EKS tests passed.

On the unmanaged side, only multi-workload test failed because it is no longer a supported workflow (see the explanation in PR definition). Removed the test.

In summary, if the infrastructure is not external (i.e., id is not set), then the fields below cannot be filled (as we have in the multi-workload test).

  network:
    subnets:
      - availabilityZone: "${AWS_AVAILABILITY_ZONE_1}"
        cidrBlock: "10.0.0.0/24"

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Sep 27, 2022

@sedefsavas: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-e2e 042ce4573106a476a8736b27c0250ad1f77ead3a link false /test pull-cluster-api-provider-aws-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sedefsavas
Copy link
Contributor Author

All e2e tests passed before, the new commit only removed the multi-az test, which was failing due to its use case became unsupported with this PR. So, not running them again.

@Skarlso
Copy link
Contributor

Skarlso commented Sep 27, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 27, 2022
@richardcase
Copy link
Member

Clusterclass and EKS tests passed.

On the unmanaged side, only multi-workload test failed because it is no longer a supported workflow (see the explanation in PR definition). Removed the test.

In summary, if the infrastructure is not external (i.e., id is not set), then the fields below cannot be filled (as we have in the multi-workload test).

  network:
    subnets:
      - availabilityZone: "${AWS_AVAILABILITY_ZONE_1}"
        cidrBlock: "10.0.0.0/24"

@sedefsavas - sorry but are you saying that the previous logic of only setting az/cidrblock (and not id) didn't work previously? If it did work this seems quite a change in behaviour.

@richardcase
Copy link
Member

Based on a conversation in slack with @sedefsavas:

  • We'll mark this change in behaviour as a breaking change in the release notes
  • We'll try to come up with an alternative that people can use (perhaps explicitly specifying failure domains or something similar)

Not from the conversation, but it would be good to [re-]add an e2e test that explicitly selects a subset of AZs

So based on this:

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 27, 2022
@k8s-ci-robot k8s-ci-robot merged commit 3271b7a into kubernetes-sigs:main Sep 27, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.x milestone Sep 27, 2022
@Ankitasw
Copy link
Member

Ankitasw commented Nov 9, 2022

Fixes: #3536

@anmolsachan
Copy link

Based on a conversation in slack with @sedefsavas:

  • We'll mark this change in behaviour as a breaking change in the release notes
  • We'll try to come up with an alternative that people can use (perhaps explicitly specifying failure domains or something similar)

Not from the conversation, but it would be good to [re-]add an e2e test that explicitly selects a subset of AZs

So based on this:

/approve

@richardcase @Ankitasw Hi,

Although I understand the intent behind this change, it takes away the following functionalities from the user.

  1. Ability to create custom VPC using CAPI. One of the primary intents of using CAPI is to leave the responsibility of Infrastructure creation to CAPI. With this change, if I have to create 100 clusters with my custom VPC configuration, I will have to now provision VPCs using some other tool like terraform etc.

  2. Ability to customize the subnets - Count of public and private subnets, Subnet ranges.

Example:

network:
    subnets:
      - cidrBlock: 10.0.32.0/20
        isPublic: true
        availabilityZone: us-west-2c
      - cidrBlock: 10.0.160.0/20
        isPublic: false
        availabilityZone: us-west-2c
      - cidrBlock: 10.0.0.0/20
        isPublic: true
        availabilityZone: us-west-2a
      - cidrBlock: 10.0.128.0/20
        isPublic: false
        availabilityZone: us-west-2a
      - cidrBlock: 10.0.176.0/20
        isPublic: false
        availabilityZone: us-west-2a
      - cidrBlock: 10.0.16.0/20
        isPublic: true
        availabilityZone: us-west-2b
      - cidrBlock: 10.0.144.0/20
        isPublic: false
        availabilityZone: us-west-2b
      - cidrBlock: 10.0.192.0/20
        isPublic: false
        availabilityZone: us-west-2b
  1. Availability Zone selection. If I want CAPI to create the AWS VPC, this would also now stop me to choose the availability zones reliably in other APIs like :
    AwsMachinePool:
    AwsManagedMachinePool:

Would there be a possibility to provide solutions to the above?

@AverageMarcus
Copy link
Member

This is a huge breaking change for us!

We currently rely on being able to have complex subnet layouts that are still managed by CAPA and a lot of work has gone into CAPA recently by myself to make these complex subnet layouts workable via filters and the like.

Currently we do similar to the following:

spec:
  network:
    subnets:
      - cidrBlock: 10.0.0.0/24
        availabilityZone: eu-west-1a
        tags:
          subnet-role: control-plane
      - cidrBlock: 10.0.1.0/24
        availabilityZone: eu-west-1b
        tags:
          subnet-role: control-plane
... etc ...

This allowed us to use CAPA to manage all the network infrastructure while still having the ability to segment out infrastructure into specific subnets by making use of the subnet filters based on tags.

We also have enterprise customers with policies in place about specific tags needing to exist on all AWS resources, this would also now no longer be possible for them when using CAPA on its own.

This PR completely blocks that possibility and now means we need an external solution to configure all the network setup for our CAPA clusters.

@richardcase You stated "We'll try to come up with an alternative that people can use (perhaps explicitly specifying failure domains or something similar)" did this happen? Is there anywhere I can track / contribute to this?

I see that we're not the only ones hitting issues with this and an issue was opened (and closed without discussion) covering this topic - #3883

There has also been discussion on Slack about this problem: https://kubernetes.slack.com/archives/CD6U2V71N/p1675159432540519

I'd like to re-open the discussion around this. I know the problem was to solve for Cluster Class and server-side apply but it seems short-sighted to me to reduce the feature-set of CAPA to this degree.

@richardcase
Copy link
Member

@MarcusNoble - Coming up with the alternative did not happen yet and in all honesty it was forgotten about. The problem is that there was significant pressure from certain areas to get ClusterClass support into CAPA and the lack of a key on the subnets caused issues.

The specifying of subnets has been an area of pain for as long as I can remember, as the original implementation wasn't great.

In the slack discussion on this i did raise it was a change in behaviour but was told its undocumented and is a side effect. In hindsight, we should've been more diligent on this and not bowed to pressure to get clusterclass support out.

If we plan to change this then it will be a breaking API change and so we will need to bump the API version / CAPA version number.

We should take this opportunity to rethink how we handle the networking side of CAPA and once and for all deprecate the old subnet code which has been problematic for years. Lets add this to: #1484 (and also create a specific issue for this problem).

@richardcase
Copy link
Member

This change has also highlighted that we didn't include this change in the release notes, which is a good example as to why we should go back to using the Prow release notes plugin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tasks for adopting CAPI's Server Side Apply Add a required field to SubnetSpec to use as listMapKey in SSA
7 participants