Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SubnetSpec changes in v1beta2 break existing use-cases #4026

Closed
AverageMarcus opened this issue Jan 31, 2023 · 23 comments · Fixed by #4474
Closed

SubnetSpec changes in v1beta2 break existing use-cases #4026

AverageMarcus opened this issue Jan 31, 2023 · 23 comments · Fixed by #4474
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@AverageMarcus
Copy link
Member

/kind bug
/kind api-change

The changes introduced in #3748 (aimed at solving server side apply issues with Cluster Class) break existing use-cases around subnets created by CAPA.

The change makes the ID property on SubnetSpec now be a required field where previously this was optional. This change means that configuration to subnets used for CAPA clusters can now only be done if infrastructure is created by the user resulting in CAPA only allowing for either the default subnet layout or a bring-your-own network where the user would be required to create everything (VPC, subnets, etc.) externally before the creation of the AWSCluster CR.

With v1beta1 and earlier it was possible to specify some of the subnet configuration (such as the CIDR block and the AWS tags) but still have CAPA create the resources for you.

For example, we have a situation where we need more control over what resources end up in what subnets and rely on AWS tags and subnet filters to achieve this.

spec:
  network:
    subnets:
      - cidrBlock: 10.0.0.0/24
        availabilityZone: eu-west-1a
        tags:
          subnet-role: control-plane
      - cidrBlock: 10.0.1.0/24
        availabilityZone: eu-west-1b
        tags:
          subnet-role: control-plane
... etc ...

With this approach we can configure CAPA to create all the subnets we need, with the tags we can then filter on, without having to use an external process to create these resources. This is no longer possible with v1beta2.

In the PR it was acknowledged that this was a breaking change and that we'd need to come up with an alternative to solve this problem going forward but unfortunately was forgotten about so I'm opening this issue to have that discussion and try to come up with a solution to both the original issue the PR was fixing and the use cases currently in use that require configuring subnets in CAPA.

This also fits into the wider discussion around networking in CAPA and the improvements / changes desired.

Related:

Environment:

  • Cluster-api-provider-aws version: v2.0.2
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 31, 2023
@AverageMarcus
Copy link
Member Author

/cc @anmolsachan @joeunlog as you both either commented on the PR or raised an issue about this problem

@richardcase
Copy link
Member

/triage accepted
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Feb 1, 2023
@Skarlso
Copy link
Contributor

Skarlso commented Feb 1, 2023

To keep both things happy, what about adding a new entry called subnetMutation and then providing user values there and we would merge that with the existing settings?

Though, that might pose a challenge in finding the correct subnet to apply it to. Uh....

@luthermonson
Copy link
Contributor

Just a thought to add to the pile. It's my opinion that CAPA should only offer two methods to get a cluster...

  1. you are new to this and have aws secret/key and you want a kube cluster, this is great for small/medium sized orgs who are ok having 1 VPC per cluster and will likely never reach the 5 vpc soft limit amazon sets on new accounts any time soon. in this scenario capa maintains the entire infra lifecycle.
  2. you know everything, and you want to tell capa exactly what it should use, this is great for enterprise orgs who have different teams who manage the different pieces (networking team, storage team) and will need to be told where/when their cluster/nodes can exist. in this scenario capa manages nothing if IDs are provided...

I don't feel there should be a middle ground where CAPA acts on its own to try and reconcile infrastructure based on partial user input. this will lead to bloating the CAPA API just to avoid using an existing tool specialized in solving this problem. To get people who want to graduate from CAPA maintaining all infra to having specified infra the CAPA repo could maintain a sample terraform module to help migrate off the default CAPA infra.

@Skarlso the idea of adding subnetMutation won't stop, there will always be something else to mutate and code will fill up with if/else blocks trying to figure out what the user wants and when CAPA assumed wrong the users will just come complaining. CAPA is a consumer of AWS infrastructure, it shouldn't be in the business of creating/maintaining it, but I can concede that method 1 above lowers the barrier to entry to new users and should be maintained.

@Skarlso
Copy link
Contributor

Skarlso commented Feb 2, 2023

To add to this, if you would like to stay in the Kubernetes world but still use terraform together with CAPA, there is an option to use tf-controller. It's a Terraform Kubernetes controller which a lot of options and nice things to reconcile terraform manifests into a cluster. It's great if you are looking for a hybrid solution.

@AverageMarcus
Copy link
Member Author

CAPA is a consumer of AWS infrastructure, it shouldn't be in the business of creating/maintaining it,

Not sure I agree with this. The whole purpose of CAPA as far as I see it is to manage AWS infrastructure - creating, updating and deleting it.

There is a third (and like more) scenarios where CAPI/CAPA is used besides the two you proposed - there are many platform offerings built on top of it that leverage the standard nature of cluster-api to build a larger product for customers.

I think it's also worth pointing out that...

Manages the bootstrapping of VPCs, gateways, security groups and instances.

is the second listed feature of CAPA in the readme. If that is was change I think that's a much bigger discussion that would also need to include CAPI and other providers to ensure consistency.

@Skarlso
Copy link
Contributor

Skarlso commented Feb 2, 2023

There is a third (and like more) scenarios where CAPI/CAPA is used besides the two you proposed - there are many platform offerings built on top of it that leverage the standard nature of cluster-api to build a larger product for customers.

Sure, but you understand that CAPA cannot cater for every possible need of these platforms. :)

I, too, believe that most of CAPA's problems today stem from the fact it's trying to "merge" partial user settings with defaults and stuff that comes from AWS.

I wouldn't go as far as to completely rule out user input, but I sure as heck would recommend not touching unmanaged things AT ALL. There are only problems stemming from that route. I even created a proposal to remove settings any kind of tags for unmanaged content. We document everything that we require, then the user can set them up as they like. There are at least 3 issues because of how capa tries to juggle various tags on unmanaged entities.

I'm am of the same mind regarding other content, such as subnets. You should be able to have some user input, sure, but at some point, we just have to stop trying to merge partial input with defaults and information coming from AWS.

@AverageMarcus
Copy link
Member Author

I wouldn't go as far as to completely rule out user input, but I sure as heck would recommend not touching unmanaged things AT ALL. There are only problems stemming from that route. I even created a proposal to remove settings any kind of tags for unmanaged content. We document everything that we require, then the user can set them up as they like. There are at least 3 issues because of how capa tries to juggle various tags on unmanaged entities.

I agree with this. If the user is providing their own infrastructure then CAPA should be "read-only". But I do still think (and think you agree if I understood) that when CAPA is managing the infra it should be possible to configure them to some degree. My example of this is being able to set the CIDRs and AWS tags used for subnets.

@phillipsj
Copy link

We are currently investigating the use of CAPA and this is a huge blocker. The possibility that it can alter resources it doesn't manage with the permission set that is currently required is horrifying.

@AndiDog
Copy link
Contributor

AndiDog commented Feb 9, 2023

In Giant Swarm's fork, I have reverted the buggy changes and added a unit test on top which covers the "managed VPC/subnet" use case. With that change, it seems the functionality is again working as before.

See giantswarm#437.

We will use that in production in order to get unblocked, and then help drafting the design document for this issue ("Rethinking Networking in CAPA").

@k8s-triage-robot
Copy link

This issue is labeled with priority/critical-urgent but has not been updated in over 30 days, and should be re-triaged.
Critical-urgent issues must be actively worked on as someone's top priority right now.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority {important-soon, important-longterm, backlog}
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Mar 11, 2023
@Skarlso
Copy link
Contributor

Skarlso commented Mar 11, 2023

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Mar 11, 2023
@Skarlso
Copy link
Contributor

Skarlso commented Mar 11, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 11, 2023
@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2023
@k8s-triage-robot
Copy link

This issue is labeled with priority/critical-urgent but has not been updated in over 30 days, and should be re-triaged.
Critical-urgent issues must be actively worked on as someone's top priority right now.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority {important-soon, important-longterm, backlog}
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels May 11, 2023
@Skarlso
Copy link
Contributor

Skarlso commented May 11, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 11, 2023
@k8s-triage-robot
Copy link

This issue is labeled with priority/critical-urgent but has not been updated in over 30 days, and should be re-triaged.
Critical-urgent issues must be actively worked on as someone's top priority right now.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority {important-soon, important-longterm, backlog}
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jun 10, 2023
@Skarlso
Copy link
Contributor

Skarlso commented Jun 11, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 11, 2023
@k8s-triage-robot
Copy link

This issue is labeled with priority/critical-urgent but has not been updated in over 30 days, and should be re-triaged.
Critical-urgent issues must be actively worked on as someone's top priority right now.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority {important-soon, important-longterm, backlog}
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jul 11, 2023
@AndiDog
Copy link
Contributor

AndiDog commented Jul 11, 2023

We will try covering this issue in a prototype for the CAPA - Rethinking Networking Proposal. See also meeting minutes of CAPA office hours for what was discussed.

@vincepri
Copy link
Member

Thanks @AndiDog and @AverageMarcus for creating an alternative proposal.

I'm looking through the open issues here and I agree that this is a big regression that shouldn't have happened in the first case. @richardcase and other maintainers, we should prepare for a breaking change and revert the behavior in this case; this was one of the primary values for Cluster API AWS providers.

Was the SSA bug originally due to ClusterClass and Managed Topologies?

@vincepri
Copy link
Member

/triage accepted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants