Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed DBToken. Made the components uses Managed Identity #3584

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

anshulvermapatel
Copy link
Collaborator

@anshulvermapatel anshulvermapatel commented May 16, 2024

Which issue this PR addresses:

As part of the ARO-5512 , we needed to disable local auth for cosmosdb. In order to disable local authentication, this PR is to make our components rely on managed identities/entra id rather than keys for authentication/authorizing cosmosdb.

This is currently a draft PR as I will be testing it in the INT env first before going forward, however, reviews are appreciated.

I have manually changed some generated code here. That was because the PR I created to have the generated code changed is not merged yet. Once that is merged, I will generate the code from the go-cosmosdb repo itself just like how we usually do.

Fixes
ARO-7195
ARO-7314

Copy link

Please rebase pull request.

@github-actions github-actions bot removed the needs-rebase branch needs a rebase label May 17, 2024
Copy link
Collaborator

@bennerv bennerv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to remove:

  • rm docs/dbtoken-service.md
  • The load balancing rule to allow gateway to talk to the dbtoken service from pkg/deploy (should be in templates)

We probably don't need to peer the gateway and RP vnets anymore. We might be able to remove that peering as well. Which would be good for overall security posture.

if err != nil {
return nil, err
}
_, err = poller.PollUntilDone(ctx, nil)
if err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error logic handling is changing here. Should it be? Should we only ignore errors of http.StatusConflict from cosmosdb?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this method is to create/update so if there is already something there, it will just make an update which would be nothing if the triggers are same. I ran this multiple times on the same cosmosdb database and it didn't return http.StatusConflict even when the triggers were already there.

Comment on lines 85 to 82
ctx, cancel := context.WithTimeout(ctx, time.Minute*5)
defer cancel()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an over-arching context assocaited with this function already? is the old method triggerc.Create(ctx, trigger) blocking vs now we have to poll?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, we used to get access to create/modify triggers using the keys, however now since we are using managed identities at cosmosdb scope, we cannot do any management operations on triggers(as it being management resource) because for that, the request must go through ARM and not directly to database.
https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-setup-rbac#permission-model

@@ -71,6 +73,22 @@ func NewMasterKeyAuthorizer(ctx context.Context, log *logrus.Entry, token azcore
return cosmosdb.NewMasterKeyAuthorizer(getDatabaseKey(keys, log))
}

func NewTokenAuthorizer(ctx context.Context, log *logrus.Entry, cred azcore.TokenCredential, databaseAccountName string, scopes []string) (cosmosdb.Authorizer, error) {
scopes = append(scopes, fmt.Sprintf("https://%s.documents.azure.com/.default", databaseAccountName))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scope is incorrect for government cloud (and any non-Azure Public Cloud)

We should read the url from the env struct we use everywhere to be cloud aware.

https://learn.microsoft.com/en-us/azure/azure-government/compare-azure-government-global-azure

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! .. sorry for that, I'll correct it.

pkg/database/openshiftclusters.go Outdated Show resolved Hide resolved
pkg/database/subscriptions.go Outdated Show resolved Hide resolved
pkg/deploy/devconfig.go Outdated Show resolved Hide resolved
@@ -71,6 +73,22 @@ func NewMasterKeyAuthorizer(ctx context.Context, log *logrus.Entry, token azcore
return cosmosdb.NewMasterKeyAuthorizer(getDatabaseKey(keys, log))
}

func NewTokenAuthorizer(ctx context.Context, log *logrus.Entry, cred azcore.TokenCredential, databaseAccountName string, scopes []string) (cosmosdb.Authorizer, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a problem inherently with this function.

The token we request from AAD with the scope is a short-lived credential (~12 hrs max I believe). That means that at some point, the token will expire both from AAD MSI perspective and cosmosdb perspective. As a result, we may need some refreshing of the actual database client in all places where it's used.

Previously, we used the primary database keys which never expire but are rotatable. That's why we didn't need to implement any database client reconstruction due to token expiration. Now we more than likely do.

The best way to test this would be to run an RP for a long time (or get the expiration of the token from this function) and see what happens when you get past the expiration time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The azcore.TokenCredential, it is indeed an object for tokens but it does not statically contains the token. It has a getToken() method which gives a refreshed token every time we call the method.
There is a token refresh logic in place in this go-cosmosdb code in this PR which cache the token. Every time we make a request, it is check is the current token is expired or 5 mins left to expire, if so it calls getToken() to get an updated token else use the cached ones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible on quieter regions that we might not have a request for over 12 hours? Would that have any effect?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because everything a request goes to cosmosdb, there's a method Authorize() which gets called, and this method has the logic to acquire the token if the token is expired. The same logic which we were discussing lately.

@anshulvermapatel
Copy link
Collaborator Author

/azp run ci,e2e

Copy link

Pull request contains merge conflicts.

@github-actions github-actions bot added the needs-rebase branch needs a rebase label May 22, 2024
Copy link

Please rebase pull request.

@github-actions github-actions bot removed the needs-rebase branch needs a rebase label May 22, 2024
@anshulvermapatel anshulvermapatel added the firefly Issues or Pull requests owned by Team Firefly label May 27, 2024
@anshulvermapatel anshulvermapatel marked this pull request as ready for review May 27, 2024 14:37
Copy link
Contributor

@jaitaiwan jaitaiwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments. Nothing critical. Seems like this PR really needs folks to manually test - is there instructions for how one might do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any changes in this file that aren't formatting? A quick review seemed like there wasn't any meaningful changes to this file which make me think it should be removed from the PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I think prettier plugin changed this file. I will remove this changes.
I did remove this portion though:

az keyvault certificate import
--vault-name "$KEYVAULT_PREFIX-dbt"
--name dbtoken-server
--file secrets/localhost.pem >/dev/null

@@ -71,6 +73,22 @@ func NewMasterKeyAuthorizer(ctx context.Context, log *logrus.Entry, token azcore
return cosmosdb.NewMasterKeyAuthorizer(getDatabaseKey(keys, log))
}

func NewTokenAuthorizer(ctx context.Context, log *logrus.Entry, cred azcore.TokenCredential, databaseAccountName string, scopes []string) (cosmosdb.Authorizer, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible on quieter regions that we might not have a request for over 12 hours? Would that have any effect?

globalaccounts storage.AccountsClient
deployments features.DeploymentsClient
groups features.ResourceGroupsClient
// loadbalancers network.LoadBalancersClient
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just remove this entirely I think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove what? line 52 // loadbalancers network.LoadBalancersClient ?
I was not sure about if this is affecting something or not, so I just commented it out for now. I will remove that as well once I have more peer reviews on this.

globalaccounts: storage.NewAccountsClient(_env.Environment(), *config.Configuration.GlobalSubscriptionID, authorizer),
deployments: features.NewDeploymentsClient(_env.Environment(), config.SubscriptionID, authorizer),
groups: features.NewResourceGroupsClient(_env.Environment(), config.SubscriptionID, authorizer),
// loadbalancers: network.NewLoadBalancersClient(_env.Environment(), config.SubscriptionID, authorizer),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just remove this entirely I think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing

@anshulvermapatel
Copy link
Collaborator Author

/azp run ci,e2e

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@anshulvermapatel anshulvermapatel force-pushed the anshulvermapatel/ARO-7195 branch 3 times, most recently from f691b3b to d507229 Compare May 28, 2024 13:30
@niontive
Copy link
Contributor

Regarding impact for billing service:

  1. Billing uses master key authorization and doesn't go through DB token service. Consequently, we can remove db token service and not affect billing.

  2. Billing is using an older ARO-RP module release, so we can update the latest one and not affect billing.

  3. For rollout, don't disable local auth for CosmosDB. We'll migrate billing service to not use master key authorization, rollout billing, then we should be good to disable local auth for Cosmos DB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
firefly Issues or Pull requests owned by Team Firefly ready-for-review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants