
Force new clusters on changes to Workload Identity parameters. #2073

Open · wants to merge 1 commit into main
Conversation

mt-inside

The GCP API allows this to be changed in place, but it doesn't take
effect, so we should force cluster rebuilds.

I used Terraform on one of my clusters to turn on Workload Identity. I'd heard I needed to make a new cluster, but TF offered to change the setting in place, and the application was successful: GCP didn't reject the request, and the setting was visible on the cluster afterwards. However, it simply doesn't work: Pods with Workload Identity correctly configured continue to use the Node's Instance Service Account. I lost a lot of time debugging this and don't want anyone else to suffer the same.
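For context, the in-place change I applied was essentially the following (a sketch: the resource name, location, and project ID are placeholders, and the workload_identity_config syntax is per the 3.x provider):

```hcl
# Sketch of the cluster-level change that GCP accepted in place but
# that never took effect on the existing node pool. All names and the
# project ID are placeholders.
resource "google_container_cluster" "primary" {
  name     = "my-cluster"
  location = "us-central1"

  # Enables Workload Identity at the cluster level only.
  workload_identity_config {
    identity_namespace = "my-project.svc.id.goog"
  }
}
```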

@ghost added the size/xs label on May 15, 2020
@ghost requested a review from rileykarson on May 15, 2020 10:46
@rileykarson (Collaborator) left a comment

Just to confirm your experience with the feature, did you also change the node_config.workload_metadata_config.node_metadata to have a value of GKE_METADATA_SERVER, enabling workload identity at the node level as well?

Configuration is a little finicky: workload_identity_config enables the feature at the cluster level, but doesn't propagate the information to individual instances, which need to be updated using workload_metadata_config. Configuring the value on node pools will currently recreate them (hashicorp/terraform-provider-google#4041 is tracking update support; it's assigned in this sprint for our development team), but updating using gcloud/the Console should be possible today (and you can persist the change into Terraform by changing your config once the change is made out-of-band).
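Roughly, the node-pool half looks like this (a sketch with placeholder names; as above, changing this on an existing pool currently recreates it):

```hcl
# Sketch: pointing a node pool's workloads at the GKE metadata server.
# All names are placeholders; cluster/location must match your cluster.
resource "google_container_node_pool" "primary_nodes" {
  name     = "my-pool"
  location = "us-central1"
  cluster  = google_container_cluster.primary.name

  node_config {
    # Without this, Pods on the pool keep using the node's
    # Instance Service Account.
    workload_metadata_config {
      node_metadata = "GKE_METADATA_SERVER"
    }
  }
}
```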

@mt-inside (Author)

Ah, I did not; I didn't know about it. I think we should definitely cross-link the two in the documentation.

I use a separate Node Pool, and interestingly I still haven't set node_config.workload_metadata_config.node_metadata in my Terraform file, and yet there it is on the node pool of my recreated cluster. I assume that when creating a new node pool, that setting is inferred from the cluster's workload_identity_config setting, either by the TF provider or the GCP API? But if you update the cluster setting in place, that second change isn't made?

That's super-confusing and leads to situations like mine where everything worked in a newly-made test cluster (with only the cluster-level field set), but didn't work when that one setting was applied to an existing cluster.

Given that changing the setting on google_container_cluster needs a setting on another resource (google_container_node_pool) to change too, and that change can't currently be made in place, is the best course of action still to force recreation of the cluster? Either that, or document the coupling in BIG red letters ;)
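To make the coupling concrete, something like the following pair is what the warning would need to spell out (sketch; everything except the two Workload Identity attributes is a placeholder):

```hcl
# Both halves must be set for Workload Identity to actually work.
resource "google_container_cluster" "primary" {
  # ... other settings elided ...
  workload_identity_config {
    identity_namespace = "my-project.svc.id.goog" # cluster-level half
  }
}

resource "google_container_node_pool" "primary_nodes" {
  # ... other settings elided ...
  node_config {
    workload_metadata_config {
      node_metadata = "GKE_METADATA_SERVER" # per-pool half
    }
  }
}
```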

@rileykarson (Collaborator) commented May 15, 2020

> I assume that when creating a new node pool, that setting is inferred from the cluster's workload_identity_config setting, either by the TF provider or the GCP API? But if you update the cluster setting in place, that second change isn't made?

That's correct! The GKE API changes defaults for new pools (you can see that documented here).

> Given that changing the setting on google_container_cluster needs a setting on another resource (google_container_node_pool) to change too, and that change can't currently be made in place, is the best course of action still to force recreation of the cluster? Either that, or document the coupling in BIG red letters ;)

I'm good with the big red letters- while adding ForceNew to the field would lead to less confusing scenarios like you encountered, we'd then receive requests to re-enable it from users who expect the functionality, because it exists in the API. Plus, I think the intention with GKE is for clusters to be large & long-lived, so anything that could destroy clusters is something we'd prefer to avoid.

Would you like to take a shot at adding the warning to the docs, given that you just ran into this unintuitive behaviour and probably know what you would have liked to see there? Otherwise I can spin out an issue and get it triaged into a future sprint.

@mt-inside (Author)

That's correct! The GKE API changes defaults for new pools (you can see that documented here).

OK, I missed those docs; I just went off the getting-started guide and assumptions. My bad.

I'm good with the big red letters- while adding ForceNew to the field would lead to less confusing scenarios like you encountered, we'd then receive requests to re-enable it from users who expect the functionality, because it exists in the API. Plus, I think the intention with GKE is for clusters to be large & long-lived, so anything that could destroy clusters is something we'd prefer to avoid.

Agreed.

> Would you like to take a shot at adding the warning to the docs, given that you just ran into this unintuitive behaviour and probably know what you would have liked to see there? Otherwise I can spin out an issue and get it triaged into a future sprint.

I'll give it a shot this weekend :)
