
Force new clusters on changes to Workload Identity parameters. #2073

Open · wants to merge 1 commit into main
Conversation

mt-inside

The GCP API allows this to be changed in place, but it doesn't take
effect, so we should force cluster rebuilds.

I used Terraform on one of my clusters to turn on Workload Identity. I'd heard I needed to make a new cluster, but TF offered to change the setting in place, and the application was successful: GCP didn't reject the request, and the setting was visible on the cluster afterwards. However, it simply doesn't work: Pods with Workload Identity correctly configured continue to use the Node's Instance Service Account. I lost a lot of time debugging this and don't want anyone else to suffer the same.
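For context, the in-place change I applied was essentially the following (a sketch: the resource name, location, and project ID are placeholders, and the workload_identity_config syntax is per the 3.x provider):

```hcl
# Sketch of the cluster-level change that GCP accepted in place but
# that never took effect on the existing node pool. All names and the
# project ID are placeholders.
resource "google_container_cluster" "primary" {
  name     = "my-cluster"
  location = "us-central1"

  # Enables Workload Identity at the cluster level only.
  workload_identity_config {
    identity_namespace = "my-project.svc.id.goog"
  }
}
```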

@ghost added the size/xs label on May 15, 2020
@ghost requested a review from rileykarson on May 15, 2020 10:46
@rileykarson (Collaborator) left a comment

Just to confirm your experience with the feature, did you also change the node_config.workload_metadata_config.node_metadata to have a value of GKE_METADATA_SERVER, enabling workload identity at the node level as well?

Configuration is a little finicky: workload_identity_config enables the feature at the cluster level, but doesn't propagate the information to individual instances, which need to be updated using workload_metadata_config. Configuring the value on node pools will currently recreate them (hashicorp/terraform-provider-google#4041 is tracking update support; it's assigned in this sprint for our development team), but updating using gcloud/the Console should be possible today (and you can persist the change into Terraform by changing your config once the change is made out-of-band).
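Roughly, the node-pool half looks like this (a sketch with placeholder names; as above, changing this on an existing pool currently recreates it):

```hcl
# Sketch: pointing a node pool's workloads at the GKE metadata server.
# All names are placeholders; cluster/location must match your cluster.
resource "google_container_node_pool" "primary_nodes" {
  name     = "my-pool"
  location = "us-central1"
  cluster  = google_container_cluster.primary.name

  node_config {
    # Without this, Pods on the pool keep using the node's
    # Instance Service Account.
    workload_metadata_config {
      node_metadata = "GKE_METADATA_SERVER"
    }
  }
}
```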

@mt-inside (Author)

Ah, I did not; I didn't know about it. I think we should definitely cross-link the two in the documentation.

I use a separate Node Pool, and interestingly I still haven't set node_config.workload_metadata_config.node_metadata in my Terraform file, and yet there it is on the node pool of my recreated cluster. I assume that when creating a new node pool, that setting is inferred from the cluster's workload_identity_config setting, either by the TF provider or the GCP API? But if you update the cluster setting in place, that second change isn't made?

That's super-confusing and leads to situations like mine where everything worked in a newly-made test cluster (with only the cluster-level field set), but didn't work when that one setting was applied to an existing cluster.

Given that changing the setting on google_container_cluster needs a setting on another resource (google_container_node_pool) to change too, and that change can't currently be made in place, is the best course of action still to force recreation of the cluster? Either that, or document the coupling in BIG red letters ;)
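To make the coupling concrete, something like the following pair is what the warning would need to spell out (sketch; everything except the two Workload Identity attributes is a placeholder):

```hcl
# Both halves must be set for Workload Identity to actually work.
resource "google_container_cluster" "primary" {
  # ... other settings elided ...
  workload_identity_config {
    identity_namespace = "my-project.svc.id.goog" # cluster-level half
  }
}

resource "google_container_node_pool" "primary_nodes" {
  # ... other settings elided ...
  node_config {
    workload_metadata_config {
      node_metadata = "GKE_METADATA_SERVER" # per-pool half
    }
  }
}
```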

@rileykarson (Collaborator) commented May 15, 2020

> I assume that when creating a new node pool, that setting is inferred from the cluster's workload_identity_config setting, either by the TF provider or the GCP API? But if you update the cluster setting in place, that second change isn't made?

That's correct! The GKE API changes defaults for new pools (you can see that documented here).

> Given that changing the setting on google_container_cluster needs a setting on another resource (google_container_node_pool) to change too, and that change can't currently be made in place, is the best course of action still to force recreation of the cluster? Either that, or document the coupling in BIG red letters ;)

I'm good with the big red letters- while adding ForceNew to the field would lead to less confusing scenarios like you encountered, we'd then receive requests to re-enable it from users who expect the functionality, because it exists in the API. Plus, I think the intention with GKE is for clusters to be large & long-lived, so anything that could destroy clusters is something we'd prefer to avoid.

Would you like to take a shot at adding the warning to the docs, given that you just ran into this unintuitive behaviour and probably know what you would have liked to see there? Otherwise I can spin out an issue and get it triaged into a future sprint.

@mt-inside (Author)

That's correct! The GKE API changes defaults for new pools (you can see that documented here).

OK, I missed those docs; I just went off the getting-started guide and assumptions. My bad.

I'm good with the big red letters- while adding ForceNew to the field would lead to less confusing scenarios like you encountered, we'd then receive requests to re-enable it from users who expect the functionality, because it exists in the API. Plus, I think the intention with GKE is for clusters to be large & long-lived, so anything that could destroy clusters is something we'd prefer to avoid.

Agreed.

> Would you like to take a shot at adding the warning to the docs, given that you just ran into this unintuitive behaviour and probably know what you would have liked to see there? Otherwise I can spin out an issue and get it triaged into a future sprint.

I'll give it a shot this weekend :)
