Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Azure Workload Identities #2724

Closed
karlschriek opened this issue Apr 27, 2022 · 39 comments · Fixed by #3906
Closed

Support Azure Workload Identities #2724

karlschriek opened this issue Apr 27, 2022 · 39 comments · Fixed by #3906
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@karlschriek
Copy link

karlschriek commented Apr 27, 2022

Azure recently added Azure Workload Identities (see https://github.com/Azure/azure-workload-identity and https://azure.github.io/azure-workload-identity/docs/) to AKS. Officially the functionality is still in Preview, but the approach is quite stable and we are already using it in various production situations.

AWI is essentially the equivalent of AWS's IAM Roles for Service Accounts and works the same. I.e, you cluster becomes an OIDC identity provider and a specific service account in a specific namespace can be designated as federated principal to which Azure IAM roles can be attached. This is significantly more secure than using credentials (i.e. Service Principals with client secrets) or Managed Service Identities (for which the whole Node is able to assume the identity).

Also similarly to IRSA, AWI works by annotating a Service Account, as follows:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: external-dns
  labels:
    azure.workload.identity/use: 'true'
  annotations:
    azure.workload.identity/client-id: bbbbbbbb-xxx-yyyyy-mmmm-rrrrrrrrrr
    azure.workload.identity/service-account-token-expiration: '86400'
    azure.workload.identity/tenant-id: yyyyyyy-xxx-zzzz-xxxx-rrrrrrrrrrrrr

By attaching this ServiceAccount to a Pod, we get the following env vars in the Pod:

AZURE_AUTHORITY_HOST: https://login.microsoftonline.com/
AZURE_CLIENT_ID: bbbbbbbb-xxx-yyyyy-mmmm-rrrrrrrrrr
AZURE_FEDERATED_TOKEN_FILE: /var/run/secrets/azure/tokens/azure-identity-token
AZURE_TENANT_ID: yyyyyyy-xxx-zzzz-xxxx-rrrrrrrrrrrrr

At the location AZURE_FEDERATED_TOKEN_FILE a temporary token is mounted. To be able to use this we would just need to configure Azure authentication to use the AZURE_FEDERATED_TOKEN_FILE for login.

I would really love to see this supported in External DNS.

@karlschriek karlschriek added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 27, 2022
@karlschriek
Copy link
Author

We should keep an eye on Azure/go-autorest#680.

Once that is merged it should be fairly trivial to implement this.l

@darkn3rd
Copy link
Contributor

darkn3rd commented May 9, 2022

I am interested in this, in the scope of testing and updating the tutorial. What would be the contents of azure.json needed? Do we need to explicitly expose a token, can a service account impersonate the identity (service principal) without a need for a token?

@karlschriek
Copy link
Author

karlschriek commented Jun 2, 2022

Azure/go-autorest#680 has been merged, the functionality is included in the latest version (github.com/Azure/go-autorest/autorest/adal v0.9.20).

I hacked together a solution for this earlier by making the following changes in provider/azure/config.go:

  1. Added useWorkloadIdentity bool to the JSON string
type config struct {
       ...
	UseWorkloadIdentity         bool              `json:"useWorkloadIdentity" yaml:"useWorkloadIdentity"`
}

Add the following section to func getAccessToken

	// Try to retrieve token with AWI
	if cfg.UseWorkloadIdentity {
		log.Info("Using token provided by Workload Identity.")

		resource := "https://management.azure.com"
		awiClientId := os.Getenv("AZURE_CLIENT_ID")
		awiTenantId := os.Getenv("AZURE_TENANT_ID")

		jwtBytes, err := ioutil.ReadFile(os.Getenv("AZURE_FEDERATED_TOKEN_FILE"))
		if err != nil {
			return nil, fmt.Errorf("Failed to get Azure Workload Identity token for file: %v", err)
		}

		jwt := string(jwtBytes)

		oauthConfig, err := adal.NewOAuthConfig(environment.ActiveDirectoryEndpoint, awiTenantId)
		if err != nil {
			return nil, fmt.Errorf("failed to retrieve OAuth config: %v", err)
		}

		token, err := adal.NewServicePrincipalTokenFromFederatedToken(*oauthConfig, awiClientId, jwt, resource)

		return token, nil
	}

@darkn3rd the way Azure Workload Identities work is that the Service Account attached to the Pod is able to impersonate a Service Principal and fetch a token for that SP, which is injected into the Pod in a file at the location AZURE_FEDERATED_TOKEN_FILE. This token eventually expires (in which case the contents of the file is replaced with the new token). So one thing that is actually needed is logic that will redo the read from file in case the token is no longer valid.

@pinkfloydx33
Copy link

Looking forward to this; trying out the preview for MI support and hitting blocks like this across various tooling that doesn't support AZWI

@karlschriek
Copy link
Author

karlschriek commented Jul 15, 2022

Sorry for the long silence, I'll try to make some time to create a PR on this. I want to come back to the issue of refreshing the token. Is there some logic loop within external-dns that can take care of this? I.e., tries to write, fails because of lacking authorization, restarts the flow to get a token which will eventually hit the block of code under if cfg.UseWorkloadIdentity again?

@pinkfloydx33
Copy link

You may want to take a look at KEDA's implementation, specifically pkg/scalers/azure/azure_aad_workload_identity.go.

There seems to be two different ways they are connecting to Azure with AZWI. Depending on the resource, sometimes they call GetAzureADWorkloadIdentityToken and just pass the token file on every connection. They do this for Storage Queues, Service Bus and Log Analytics. It doesn't look like they do any sort of manual refresh, just pass the token each time...

For everything else (Key Vault, Appinsights, Data Explorer and Azure Monitor) it looks like they implement the autorest.Authorizer/adal.Refresher interfaces and pass them along as part of the connection. The rest is handled by msal/adal automagically. In these cases, the refresh logic seems to be pretty basic and ultimately triggered by azure/autorest.

I may be wrong here as I haven't dug into their implementation too extensively, but it appears to be fairly basic. The mechanism to use seems to be based upon the type of resource being accessed and what the underlying libs support for it. Those scalers connect to the azure resources multiple times per minute. In the simpler (first) case, it looks like they just read in the token on each new connection.

I'm not an expert on their implementation, nor yours... and I'm only a novice in go so take my peanut gallery comments for what they are :) That said I'm sure the keda implementation can at least provide some inspiration. They've made some changes since that first PR I linked so probably best to look at the current state of main

@karlschriek
Copy link
Author

karlschriek commented Jul 15, 2022

Just to be really clear, we actually don't need to do anything to refresh the token. The AWI webhook takes care of that. I.e., periodically the contents of the file located at the path stored in $AZURE_FEDERATED_TOKEN_FILE is updated with a new token. All we need to do is make sure that this file is re-read. @pinkfloydx33 This I think is what the first approach you are describing above implements.

So the simplest way to deal with this I think is to make sure that whatever function is calling getAccessToken either calls it every time its means to interact with a DNS zone, or at least calls for it again if the token it currently has cached isn't working anymore. I have limited knowledge of how this flow is implemented in external-dns, but I can confirm that just adding the code snippet I pasted a few comments up will only work initially. After the initial token expires it stops working.

@pinkfloydx33
Copy link

Right, this should only be "tricky" if external-dns is caching the file/token, since the AZWI webhook rotates it out as necessary.

That said you may want to look at Keda. They implement the first patten in a way that's very, very close to the official example. If you look at the sample , they don't call NewServicePrincipalTokenFromFederatedToken. Instead they use NewCredFromAssertion and do a bit of string manipulation on the passed in resource.

I don't know if the two approaches are equivalent, nor if they're expected to differ due to the underlying auth clients or target resource (vault vs dns).... But it might be worth trying out?

(sorry just spit balling here)

@karlschriek
Copy link
Author

karlschriek commented Jul 15, 2022

I see they use microsoft-authentication-library-for-go, not just adal. Without going into it too deeply, it actually looks like a less abstracted implementation, since they after calling NewCredFromAssertion go and construct an AADToken object. The getAccessToken function in external-dns is set up to return an adal.ServicePrincipalToken, which is exactly what you get back from the NewServicePrincipalTokenFromFederatedToken. I think I would stick with this.

@pinkfloydx33
Copy link

Ah OK makes sense. As I said above, I'm not too knowledgeable in any of the tools. Ignore me :)

@darkn3rd
Copy link
Contributor

darkn3rd commented Jul 15, 2022

How similar is Azure WI to Google's WI? Does cert-manager support Azure WI yet?

I found @karlschriek ticket cert-manager/cert-manager#5085

@karlschriek
Copy link
Author

karlschriek commented Jul 15, 2022 via email

@pinkfloydx33
Copy link

Also, at the moment AZWI is tied to App Registrations only. In private preview they now have support for managed identities. It is going to replace "pod-identity v1 preview" for anyone who might have that extension installed in their AKS cluster (the OSS project for pod-identity is sticking around so far as I know)

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 13, 2022
@pinkfloydx33
Copy link

AZWI is now in public preview with pod-identity being deprecated. Would greatly appreciate this. We are currently using the migration sidecar they've made available but would prefer something native.

@derage
Copy link

derage commented Oct 20, 2022

+1 for this as well. I spent a couple days on trying to solve this before I ran into this issue. As @pinkfloydx33 mentioned, a good work around is to use the workload identity sidecar mentioned in this article. It still took me a while to figure out how to get this working with external-dns (since this was my first time setup), so figured I would add more details.

first make sure you have the workload identity chart installed properly. Then you can follow a guide like this one to create the service account (just ignore giving access to key-vault and instead follow the doc in external-dns to give it access to the dns zone instead

Then for my setup Im using this chart as a dependency in a parent chart, so I was able to setup a secret file like this in that parent chart:

apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: azure-config-file
stringData:
  azure.json: |-
    {
      "tenantId": "{{ .Values.global.azure.tenantId }}",
      "subscriptionId": "{{ .Values.global.azure.subscriptionId }}",
      "resourceGroup": "{{ .Values.global.azure.resourceGroup }}",
      "userAssignedIdentityID": "{{ .Values.global.azure.externaldnsAzureClientId }}",
      "useManagedIdentityExtension": true
    }

Then in my values.yaml I was able to do this:

global:
  azure:
    resourceGroup: &resourceGroup "my-resrouce-group"
    tenantId: &tenantId "default"
    subscriptionId: &subscriptionId "default"
    externaldnsAzureClientId: test
externaldns:
  podAnnotations:
    azure.workload.identity/inject-proxy-sidecar: "true"
  provider: azure
  azure:
    resourceGroup: *resourceGroup
    secretName: azure-config-file
    useManagedIdentityExtension: true
  serviceAccount:
    create: false
    name: external-dns

The 2 key parts to pay attention to get this working is the json file need to have "useManagedIdentityExtension": true and the pod needs to have an annotation on it like azure.workload.identity/inject-proxy-sidecar: "true". Once these 2 things were setup properly it does get externaldns working with workload identities.

I was really hoping to have a zero-config parent chart I could install into our k8s clusters, so it would be really awesome if externaldns could support this new workload like external-secrets. That way we dont need to give the tenantid, subscriptionid and resourcgroup to the helm chart, it could just leverage the service account created.

@weisdd
Copy link
Contributor

weisdd commented Oct 27, 2022

Just to highlight it to whoever is subscribed to this issue, there's a new PR (#3111) that introduces support for AKS Workload Identity while also making sure an up-to-date federated token is used to request a new access token.
I tested it on one of my AKS clusters for more than 24h (expiration period for the AAD access token), everything went fine. So, I think it's robust enough, though would be great if someone else also volunteers to test it.

Here's an example how you could build your own image:

export IMAGE=quay.io/weisdd/external-dns
export VERSION=0.13.1-wi
docker buildx build --platform linux/amd64 . -t $IMAGE:$VERSION -f Dockerfile.mini --push

If you're not afraid to use my build in your lab (never trust strangers :D), then you could pull it here for amd64: quay.io/weisdd/external-dns:0.13.1-wi. The only difference in code compared to what you see in my PR is an additional diagnostics message:

var refreshFunc adal.TokenRefresh = func(context context.Context, resource string) (*adal.Token, error) {
    log.Info("Refreshing the token") // <-- here

@pinkfloydx33
Copy link

I can gladly give the image a try over the weekend (I'm off tomorrow) in my sandbox cluster.

Keda managed to solve the issue of the rotating tokens somehow and its been working quite nicely for a few months. I'm curious if anyone has taken a look at their implementation for a comparison?

@weisdd
Copy link
Contributor

weisdd commented Oct 27, 2022

@pinkfloydx33 Thanks for the upcoming test!

I hadn't looked at KEDA code before. Just to add a bit more to the context, I was recently asked to help with adoption of AKS, and, this week, began to look at a potential setup starting from the core components, which happened to be external-dns and cert-manager (and, of course, relying on Workload Identities looked more attractive than on Pod-Managed Identities). So, KEDA was simply out of scope at that moment.
By skimming through the code, I already see that KEDA relies on its own implementation, even has the following comment here:

// ADWorkloadIdentityCredential is a type that implements the TokenCredential interface.
// Once azure-sdk-for-go supports Workload Identity we can remove this and use default implementation
// https://github.com/Azure/azure-sdk-for-go/issues/15615

So, even though, there's some space for inspiration from KEDA implementation, a different library requires a different approach. Migration from adal to something else requires a bigger effort in terms of planning and consideration. Since Workload Identity has been moved to Preview in AKS just a couple of weeks back, I would rather wait and see how it goes further. Meanwhile, we should keep an eye on https://github.com/Azure/azure-workload-identity/blob/main/pkg/proxy/proxy.go, because it's an official sidecar proxy (used for apps that don't yet support WI, but do - Pod-Managed Identities) and is part of the the whole WI effort on AKS. - It's likely to incorporate the latest developments. :)

@pinkfloydx33
Copy link

Ok cool. I hadn't looked at KEDA's implementation; my golang is low-intermediate/high-beginner so it likely wouldn't have mattered if I did. I just knew it was there as they were the first major app I know of to offer support (it was actually completed on their end months before they finally released it). We were fortunate to be part of AZWI AKS private preview, so I've had plenty of time to migrate our apps and third-party stuff, including keda.

The only apps left for us to migrate are external-dns, cert-manager and akv2k8s... The latter of which has had zero movement on the subject since my initial issue in their github repo.

But for all three, we are currently using the side car approach and have stripped the pod identity v1 (aks native add-on) from our clusters entirely. It was truly a big deal for us. I've been tracking the PRs for certmanager/externaldns so that I can get rid of the sidecars...so anything I can do to help test I'm all for.

@pinkfloydx33
Copy link

Installed in my cluster as of ~7:30am EST. Working for both Azure DNS and Azure Private DNS (expected as they use same configuration code).

Will check back in 24h for refresh, etc.

@pinkfloydx33
Copy link

Each instance successfully refreshed their tokens, 5 minutes shy of 24 hours in both cases. Verified that external-dns was still able to sync records after the token refresh. All looks good.

@weisdd
Copy link
Contributor

weisdd commented Oct 31, 2022

@pinkfloydx33 thanks for the tests, and glad to hear everything went well :)

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 30, 2022
@zerodayyy
Copy link

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 30, 2022
@mblaschke
Copy link

any progress here? AAD Pod Identity is already officially depcreated and workload identity is going GA in March 2023.

@weisdd
Copy link
Contributor

weisdd commented Jan 22, 2023

@mblaschke there's a working solution here: #3111, but it has stayed unreviewed since the end of October. I had a brief discussion with one of the maintainers on Slack: https://kubernetes.slack.com/archives/C771MKDKQ/p1669924073459489 (if summarised: maintainers are busy, and there's kind of pause till a plugins provider is ready, so I'm unlikely to see any reviews in near future). That made me think there's no point in keeping the PR updated until I get pinged by maintainers, and it hasn't happened so far. If someone can manage to get some attention from maintainers, I can certainly update the PR and tackle their requests for changes if any.

As a side note, I see that many multi-cloud open source projects seem to lack interest or expertise in Azure, which makes it much harder to have some changes accepted into the main code base. I wish Microsoft organized a team of engineers who would help to drive changes in the projects as AKS is a big thing on regulated market.

@djsly
Copy link

djsly commented Apr 22, 2023

@weisdd what can I do to help unblock this issue/PR ? is it impossible to make it move forward at this point ?

@weisdd
Copy link
Contributor

weisdd commented Apr 22, 2023

@djsly Unfortunately, nothing has changed since my last comment. I still haven't got any reviews from maintainers.

@michelefa1988
Copy link

Following. I also would like to update to aad-workload-identity given it is will be depreciated in a few months. I also noticed that workload identity is no longer in preview mode

@christertime
Copy link

Anything that can done to help unblock this up? AAD Pod Identity is quickly reaching EOL

@weisdd
Copy link
Contributor

weisdd commented Jun 27, 2023

@christertime Well, the only way to go forward is to somehow bring maintainers' attention to this issue.

So, everyone is just stuck now using the workload identity sidecar proxy.

@ImIOImI
Copy link

ImIOImI commented Jul 27, 2023

It sounds like this isn't going to get merged before AAD Pod Identity's are EOL. What's the best workaround? MS docs mention the proxy sidecar shouldn't go to production and there's certainly no guarantee that it won't be anything other than a permanent hack given how long this issue has been open.

@pinkfloydx33
Copy link

We're using the proxy side car. Fingers crossed. It's a shame this hasn't been given priority.

@karlschriek
Copy link
Author

karlschriek commented Jul 27, 2023 via email

@michelefa1988
Copy link

Is it so hard for a maintainer to issue a not fully tested/ unstable BETA build which works on Azure/ AKS? Given some of us only need this features, I do not personally mind if the rest of the features are not fully tested. Creating a build should take minutes assuming the relevant pipelines are available. :(

@bigfleet
Copy link

bigfleet commented Aug 7, 2023

Pod identity use cases are now failing for new clusters on Azure (at least in my experience). Anybody using pod identity hopefully already fixed their chart versions on the basis of #2383 and maintainers can do a merge on #3111 , new release, and chart in sensible order. Azure needs your kind help, maintainers! Best wishes for your speedy completion of plugin work. Thank you!

@bigfleet
Copy link

For those following this ticket, there is movement on #3040 -- I have no idea how difficult "re-implement[ing] support for workload identity using the new library" would be, but that path does show activity.

@weisdd
Copy link
Contributor

weisdd commented Aug 25, 2023

@bigfleet I'll look into rewriting #3111 using the updated SDK once #3040 is merged. Most likely, it'll be a trivial change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet