Race Condition Issue with User Assigned Managed Identity's PrincipalId and SqlResourceSqlRoleAssignment #2816
Comments
The repro link no longer works, please could you provide a new copy of the code @zbuchheit |
Just rebuilt a test case for this. To keep creating new identities and assigning them, change the value of
import * as azure from "@pulumi/azure-native";
import * as random from "@pulumi/random";
import { env } from "process";

const resourceGroup = new azure.resources.ResourceGroup("resource-group");

const cosmosAccount = new azure.documentdb.DatabaseAccount("cosmos-account", {
  resourceGroupName: resourceGroup.name,
  locations: [
    {
      failoverPriority: 0,
      locationName: "East US",
    },
    {
      failoverPriority: 1,
      locationName: "West US",
    },
  ],
  databaseAccountOfferType: "Standard",
});

const cosmosDBDataContributorRoleDefinition =
  new azure.documentdb.SqlResourceSqlRoleDefinition(
    "cosmos-db-data-contributor-role-definition",
    {
      resourceGroupName: resourceGroup.name,
      accountName: cosmosAccount.name,
      type: "BuiltInRole",
      roleName: "Data Contributor",
      assignableScopes: [cosmosAccount.id],
      permissions: [
        {
          dataActions: ["Microsoft.DocumentDB/databaseAccounts/readMetadata"],
        },
      ],
    }
  );

const n = env["TEST_N"] ?? "";

const userAssignedIdentity = new azure.managedidentity.UserAssignedIdentity(
  "user-assigned-identity" + n,
  {
    resourceGroupName: resourceGroup.name,
  }
);

const sqlResourceSqlRoleAssignment =
  new azure.documentdb.SqlResourceSqlRoleAssignment(
    "sql-resource-sql-role-assignment" + n,
    {
      accountName: cosmosAccount.name,
      principalId: userAssignedIdentity.principalId,
      resourceGroupName: resourceGroup.name,
      roleAssignmentId: new random.RandomUuid("testRandomUuid" + n).result,
      roleDefinitionId: cosmosDBDataContributorRoleDefinition.id,
      scope: cosmosAccount.id,
    }
  );
|
After running 10 iterations of the above test I've not managed to reproduce the issue. It's possible that this might have been affected by #3042 in release v2.29.0 which could delay the subsequent creation very slightly while it's waiting for the read. This could also be dependent on the speed and latency of the machine performing the update in conjunction with the region being used. Adding This also appears quite similar to an occasional integration test failure we see during testing of the
|
Sorry I changed the referenced repo to private. This is the code. I broke it into two stacks so that I didn't have to wait for a cosmosdb instance to stand up.
Code Repro w/ Stack Reference
using Pulumi;
using Pulumi.AzureAD;
using System;
using System.Threading.Tasks;
using ManagedIdentity = Pulumi.AzureNative.ManagedIdentity;
using Pulumi.AzureNative.DocumentDB;

return await Pulumi.Deployment.RunAsync(async () =>
{
    const int initialDelayMilliseconds = 500;
    const int MaximumRetries = 10;
    const int ExponentialBackoffFactor = 2;
    const string CosmosDbBuiltInDataContributorId = "00000000-0000-0000-0000-000000000002";

    var stackReference = new StackReference($"{Deployment.Instance.OrganizationName}/cosmos-db-stack-reference/{Deployment.Instance.StackName}");
    var resourceGroupName = stackReference.RequireOutput("resourceGroupName").Apply(id => (string)id);

    var cosmosAccount = GetDatabaseAccount.Invoke(new GetDatabaseAccountInvokeArgs
    {
        ResourceGroupName = resourceGroupName,
        AccountName = stackReference.RequireOutput("cosmosDbAccountName").Apply(id => (string)id)
    });

    var userAssignedIdentity = new ManagedIdentity.UserAssignedIdentity("user-assigned-identity", new ManagedIdentity.UserAssignedIdentityArgs
    {
        ResourceGroupName = resourceGroupName
    }).PrincipalId;

    /*
    The following code is a workaround for a bug in the Azure SDK. The bug is that the PrincipalId property of the UserAssignedIdentity resource is not populated
    when the resource is created. This is a problem because we need the PrincipalId to assign the role to the identity. The workaround is to poll the resource
    until the PrincipalId is populated by reference AD. It would be preferable to be able to use the PrincipalId property of the UserAssignedIdentity resource as
    it only requires an api call to the Resource Manager API. The workaround requires an additional call to the AD API to get the Service Principal.
    */
    // var userAssignedIdentityPrincipalId = userAssignedIdentity.Apply(async principalId =>
    // {
    //     GetServicePrincipalResult? servicePrincipalResult = null;
    //     for (int attempt = 1; attempt <= MaximumRetries; attempt++)
    //     {
    //         try
    //         {
    //             servicePrincipalResult = await GetServicePrincipal.InvokeAsync(new GetServicePrincipalArgs
    //             {
    //                 ObjectId = principalId
    //             });
    //             Pulumi.Log.Debug($"Attempt {attempt} succeeded in fetching Service Principal.");
    //             return servicePrincipalResult;
    //         }
    //         catch (Exception e)
    //         {
    //             Pulumi.Log.Debug($"Attempt {attempt} failed to fetch Service Principal.");
    //             if (attempt == MaximumRetries)
    //             {
    //                 Pulumi.Log.Error($"Service Principal not resolved after {attempt} tries. Exception: {e.Message}");
    //                 throw;
    //             }
    //             int delay = initialDelayMilliseconds * (int)Math.Pow(ExponentialBackoffFactor, attempt);
    //             Pulumi.Log.Debug($"Waiting for {delay}ms before retrying.");
    //             await Task.Delay(delay);
    //         }
    //     }
    //     return servicePrincipalResult;
    // });

    var cosmosDBDataContributorRoleDefinition = GetSqlResourceSqlRoleDefinition.Invoke(new GetSqlResourceSqlRoleDefinitionInvokeArgs
    {
        AccountName = cosmosAccount.Apply(ca => ca.Name),
        ResourceGroupName = resourceGroupName,
        RoleDefinitionId = CosmosDbBuiltInDataContributorId
    }).Apply(roleDefinition => roleDefinition.Id);

    var sqlResourceSqlRoleAssignment = new SqlResourceSqlRoleAssignment($"sql-resource-sql-role-assignment", new SqlResourceSqlRoleAssignmentArgs
    {
        AccountName = cosmosAccount.Apply(account => account.Name),
        PrincipalId = userAssignedIdentity,
        ResourceGroupName = resourceGroupName,
        RoleAssignmentId = new Pulumi.Random.RandomUuid("testRandomUuid").Result,
        RoleDefinitionId = cosmosDBDataContributorRoleDefinition,
        Scope = cosmosAccount.Apply(account => account.Id),
    });
});
Referenced Stack
using Pulumi.AzureNative.Resources;
using System.Collections.Generic;

return await Pulumi.Deployment.RunAsync(() =>
{
    var resourceGroup = new ResourceGroup("resource-group", new ResourceGroupArgs{
        ResourceGroupName = "zbuchheit",
        Tags = { { "Environment", "Dev" }, {"Owner", "Zbuchheit"} },
    });

    var cosmosAccount = new Pulumi.AzureNative.DocumentDB.DatabaseAccount("cosmos-account", new Pulumi.AzureNative.DocumentDB.DatabaseAccountArgs{
        AccountName = "zbuchheit",
        CreateMode = Pulumi.AzureNative.DocumentDB.CreateMode.Default,
        ResourceGroupName = resourceGroup.Name,
        Locations = new[] {
            new Pulumi.AzureNative.DocumentDB.Inputs.LocationArgs
            {
                LocationName = resourceGroup.Location,
                FailoverPriority = 0,
                IsZoneRedundant = false
            }
        },
        DatabaseAccountOfferType = Pulumi.AzureNative.DocumentDB.DatabaseAccountOfferType.Standard,
        EnableFreeTier = true,
        Kind = Pulumi.AzureNative.DocumentDB.DatabaseAccountKind.GlobalDocumentDB,
        Tags = { { "Environment", "Dev" }, {"Owner", "Zbuchheit"} },
    });

    var sqlCosmosDBDatabase = new Pulumi.AzureNative.DocumentDB.SqlResourceSqlDatabase("sql-resource-sql-db", new()
    {
        AccountName = cosmosAccount.Name,
        DatabaseName = "zbuchheit",
        Location = resourceGroup.Location,
        Options = null,
        Resource = new Pulumi.AzureNative.DocumentDB.Inputs.SqlDatabaseResourceArgs
        {
            Id = "zbuchheit",
        },
        ResourceGroupName = resourceGroup.Name,
    });

    // Export the resource group and Cosmos DB account names for the dependent stack
    return new Dictionary<string, object?>
    {
        ["resourceGroupName"] = resourceGroup.Name,
        ["cosmosDbAccountName"] = cosmosAccount.Name,
    };
});
|
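For a sense of how long the commented-out polling workaround in the first stack could wait, here is a small sketch of its backoff arithmetic. The three constants are copied from the C# repro; the helper names `delayAfterAttempt` and `backoffSchedule` are hypothetical and only illustrate the calculation.

```typescript
// Constants copied from the C# repro above.
const initialDelayMilliseconds = 500;
const maximumRetries = 10;
const exponentialBackoffFactor = 2;

// Delay after a failed attempt (1-based), matching the C# expression
// initialDelayMilliseconds * ExponentialBackoffFactor^attempt.
function delayAfterAttempt(attempt: number): number {
  return initialDelayMilliseconds * Math.pow(exponentialBackoffFactor, attempt);
}

// The final attempt rethrows instead of waiting, so only
// maximumRetries - 1 delays are ever used.
function backoffSchedule(): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < maximumRetries; attempt++) {
    delays.push(delayAfterAttempt(attempt));
  }
  return delays;
}

const schedule = backoffSchedule();
const totalMs = schedule.reduce((sum, d) => sum + d, 0);
console.log(schedule[0], schedule[schedule.length - 1], totalMs);
```

With these values the first wait is 1,000 ms, the last is 256,000 ms, and the worst case adds up to 511,000 ms (about 8.5 minutes) of waiting before the workaround gives up.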
Thank you for the snippets! I tried but failed to reproduce it so far:
@zbuchheit Are you still able to repro the issue? If so, in which region? |
I will pull up my repo and try it and see if I can get it to occur. Will update with findings. Thanks |
@mikhailshilkov I did get it to happen again in eastus2 but I didn't capture verbose enough logs during that run so trying again. |
Looking at the code at version 2.10 of the provider, this is the source of the error: pulumi-azure-native/provider/pkg/provider/provider.go Lines 904 to 911 in cf2ad62
This helps us understand the original error message reported, just to help clarify which part we're looking at. The first half is the read error after the create failed, which is less surprising, given that the create failed:
The second half is the actual create error we need to focus on:
|
Indeed, we never see that for the normal role assignments, even though they are done for the same resources. It happens only for the Cosmos ones; Cosmos somehow is always in its own world in Azure. At the same time, for Cosmos role assignments we see it reliably. Even an added retry with a timeout of 5 min is not enough in most cases to ensure the identity is there. It's also a problem that has persisted for months, so it is nothing new or temporary in Azure. I'd assume this is the default behavior. |
I was able to catch verbose logs for this and pass them along to engineering. Adding a note in here for transparency. |
I've dug through the logs but can't identify anything we can do here to resolve the issue of what appears to be Azure's internal replication latency. Marking this as blocked until we can identify an actionable solution. |
well, if it's a consistent and expected behavior it should be solved at the provider level otherwise it just pushes this effort to every client. At least there should be an official workaround, and currently there doesn't seem to be one without using external tools like arm client. |
@danielrbradley Could you please add a summary/extract of HTTP logs to the issue? I.e., show which request errors, what the timing of the requests is, etc. Why is it not possible to retry the HTTP request on our side? |
@mikhailshilkov it's not retrying of HTTP requests that we need - it's retrying of the whole resource creation. What we're seeing is that after we've created the role, we can retrieve it correctly - even from other regions. However, after we've started the process of creating the It could be possible to build a new kind of retry mechanism for this, though I'd also consider if this should be a feature of the core Pulumi engine rather than this specific provider: pulumi/pulumi#7932 |
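To make the distinction concrete, here is a rough sketch of retrying a whole create operation rather than a single HTTP request. It is not the provider's or the engine's actual mechanism; `retryWhile` and all of its parameters are hypothetical names, and the caller supplies both the operation and the sleep function.

```typescript
type Sleep = (ms: number) => Promise<void>;

// Re-runs the entire operation (for example, a full resource create) while
// the error is classified as retryable, backing off exponentially in between.
async function retryWhile<T>(
  operation: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  sleep: Sleep,
  maxAttempts = 10,
  initialDelayMs = 500,
  factor = 2,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on the last attempt or on errors that are not transient.
      if (attempt >= maxAttempts || !isRetryable(error)) {
        throw error;
      }
      await sleep(initialDelayMs * Math.pow(factor, attempt));
    }
  }
}
```

In a real program `sleep` would typically wrap a timer; injecting it keeps the sketch independent of any runtime and easy to exercise with a fake.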
As an update from me, I opened a ticket with MSFT to see if we can get some traction on that side. I provided most of the logs and described the behavior we have seen. They said they are contacting the product team to do some investigation on it. 🤞 |
seems like there is a similar issue with kusto in TF hashicorp/terraform-provider-azurerm#18355 |
I'm posting a full HTTP log so that it's easier to discuss what is going on.
This is a service bug - the tenant exists but it's reported as if it's not (probably, some eventual consistency issue).
Maybe also related to Azure/go-autorest#634 |
What happened?
When using a User Assigned Managed Identity and attempting to use SqlResourceSqlRoleAssignment for Cosmos DB to give RBAC permissions, it will occasionally fail with the following error.
error: resource partially created but read failed autorest/azure: Service returned an error. Status=404 Code="NotFound" Message="Unable to find a SQL Role Assignment with ID [XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX].\r\nActivityId: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX, Microsoft.Azure.Documents.Common/2.14.0": Code="BadRequest" Message="The provided principal ID [XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX] was not found in the AAD tenant(s) [XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX] which are associated with the customer's subscription.\r\nActivityId: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0"
Subsequent runs will succeed in creating a SqlResourceSqlRoleAssignment.
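Since the failure is transient, a retry wrapper would need some way to tell this error apart from a genuine failure. The following is a minimal sketch that matches on the message text quoted above; the function name `isPrincipalNotYetVisible` is hypothetical, and matching on message strings is an assumption that would break if the service changed its wording.

```typescript
// Returns true when an error message looks like the Cosmos DB "principal not
// found in the AAD tenant" BadRequest reported in this issue.
function isPrincipalNotYetVisible(message: string): boolean {
  const isBadRequest = /Code="BadRequest"/.test(message);
  const mentionsMissingPrincipal =
    /principal ID \[[^\]]+\] was not found in the AAD tenant/.test(message);
  return isBadRequest && mentionsMissingPrincipal;
}

const sample =
  'Code="BadRequest" Message="The provided principal ID ' +
  "[XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX] was not found in the AAD tenant(s) " +
  '[XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX]"';
console.log(isPrincipalNotYetVisible(sample));
```

A plain 404 from the follow-up read does not match, so only the create error that matters here would be treated as retryable.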
Example
Example Repo with Referenced Code Above
Output of
pulumi about
CLI
Version 3.79.0
Go Version go1.21.0
Go Compiler gc
Plugins
NAME VERSION
azure-native 2.10.0
azuread 5.42.0
dotnet unknown
random 4.14.0
Host
OS darwin
Version 13.6
Arch arm64
Dotnet 7
Additional context
This could be an issue on the Microsoft side. When I create a normal role assignment, it performs a PUT against RoleAssignment and responds with a 201 Created immediately. When I attempt to create a SqlResourceSqlRoleAssignment, it responds 202 Accepted and has a status of "Enqueued" and eventually fails after several GET requests to check the status of the resource.
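The 202 Accepted flow described above follows the usual long-running-operation pattern: the client keeps issuing GET requests until the operation leaves its pending state. Below is a minimal sketch of such a loop. Only the "Enqueued" status comes from this report; the terminal status names, `pollOperation`, and its parameters are assumptions for illustration, and `getStatus` stands in for the real GET request.

```typescript
// Polls a status source until it reports a terminal state or the poll
// budget runs out. Any non-terminal status (e.g. "Enqueued") means "wait".
async function pollOperation(
  getStatus: () => Promise<string>,
  sleep: (ms: number) => Promise<void>,
  intervalMs = 1000,
  maxPolls = 30,
): Promise<string> {
  // Assumed terminal states; the real service may use others.
  const terminal = ["Succeeded", "Failed", "Canceled"];
  for (let i = 0; i < maxPolls; i++) {
    const status = await getStatus();
    if (terminal.indexOf(status) !== -1) {
      return status;
    }
    await sleep(intervalMs);
  }
  throw new Error(`operation still pending after ${maxPolls} polls`);
}
```

In the failing case reported here, such a loop would see "Enqueued" a few times and then a failed terminal state carrying the BadRequest error.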
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).