Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refreshing project involving machine learning services managed network settings causes resources to be replaced #3225

Open
pierskarsenbarg opened this issue Apr 18, 2024 · 13 comments
Labels
awaiting-upstream Awaiting upstream dependency blocked kind/bug Some behavior is incorrect or out of spec upstream/service

Comments

@pierskarsenbarg
Copy link
Member

pierskarsenbarg commented Apr 18, 2024

What happened?

When creating a MachineLearningService.Workspace it's possible to create the managednetwork outbound rules as a separate resource (and apparently customers are encouraged to do so by Microsoft). However, when you do this and run pulumi refresh it updates the workspace resource with the values set in the network settings resource and then deletes the network settings rule resource.

Then when the update happens, Pulumi recreates the settingsrule resource (because it no longer exists in Azure) and clears out the managed settings rule in the workspace and then the same thing happens all over again.

Example

Code can be found here: https://github.com/pulumi/customer-engineering/tree/master/customer-support/5019 (private repo)

Output of pulumi about

CLI
Version      3.113.0
Go Version   go1.22.2
Go Compiler  gc

Plugins
NAME          VERSION
azure-native  2.27.0
dotnet        unknown

Host
OS       darwin
Version  14.4.1
Arch     arm64

This project is written in dotnet: executable='/usr/local/share/dotnet/dotnet' version='8.0.201'

Current Stack: pierskarsenbarg/azure-ml-csharp/dev

TYPE                                                      URN
pulumi:pulumi:Stack                                       urn:pulumi:dev::azure-ml-csharp::pulumi:pulumi:Stack::azure-ml-csharp-dev
pulumi:providers:azure-native                             urn:pulumi:dev::azure-ml-csharp::pulumi:providers:azure-native::default_2_27_0
azure-native:resources:ResourceGroup                      urn:pulumi:dev::azure-ml-csharp::azure-native:resources:ResourceGroup::pl-cs-ml-rg
azure-native:storage:StorageAccount                       urn:pulumi:dev::azure-ml-csharp::azure-native:storage:StorageAccount::mlsa
azure-native:operationalinsights:Workspace                urn:pulumi:dev::azure-ml-csharp::azure-native:operationalinsights:Workspace::appInsights
azure-native:insights:Component                           urn:pulumi:dev::azure-ml-csharp::azure-native:insights:Component::component
azure-native:machinelearningservices/v20231001:Workspace  urn:pulumi:dev::azure-ml-csharp::azure-native:machinelearningservices/v20231001:Workspace::mlworkspace


Found no pending operations associated with dev

Backend
Name           pulumi.com
URL            https://app.pulumi.com/pierskarsenbarg
User           pierskarsenbarg
Organizations  pierskarsenbarg, karsenbarg, team-ce, demo
Token type     personal

Dependencies:
NAME                VERSION
Pulumi              3.61.0
Pulumi.AzureNative  2.27.0

Pulumi locates its logs in /var/folders/x8/cdd9j87s607fwpy0q62mfmmw0000gn/T/ by default

Additional context

If you look at state export that I've taken before and after the refresh, you can see here that the outboundRules input is empty and we have networksettingsrule resource here: https://github.com/pulumi/customer-engineering/blob/60d8ce2e2835bab670bce807e276934186ebb1fd/customer-support/5019/stack-before-refresh.json#L421-L479

but after the refresh the ML workspace now has the network settings in the outboundRule section: https://github.com/pulumi/customer-engineering/blob/60d8ce2e2835bab670bce807e276934186ebb1fd/customer-support/5019/stack-after-refresh.json#L324-L334 and the setting rule resource is gone.

I've also added https://github.com/pulumi/customer-engineering/blob/master/customer-support/5019/out.txt which are verbose logs and if you search for HTTP/2.0 404 Not Found you can see that it's the rules that can't be found.

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@pierskarsenbarg pierskarsenbarg added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Apr 18, 2024
@danielrbradley
Copy link
Member

This looks like a similar problem to #611 and #1112 - where certain sub-property collections are also represented as standalone resources.

One oddity here is that the standalone resource gets removed on refresh. We'd need to identify why this isn't returned correctly by the API as there might be something else going on. Otherwise, the existing mechanism should be sufficient to handle maintaining the existing property transparently if it wasn't originally set.

@danielrbradley danielrbradley removed the needs-triage Needs attention from the triage team label Apr 19, 2024
@danielrbradley
Copy link
Member

It looks like there's an issue with this resource's GET function. Running the repro myself showed this warning in the diagnostics:

Updating (dev)

View in Browser (Ctrl+O): https://app.pulumi.com/daniel-pulumi-corp/sdf/dev/updates/43

     Type                                                                Name         Status              Info
     pulumi:pulumi:Stack                                                 sdf-dev                          
 +   ├─ azure-native:resources:ResourceGroup                             pl-cs-ml-rg  created (0.66s)     
 +   ├─ azure-native:storage:StorageAccount                              mlsa         created (23s)       
 +   ├─ azure-native:operationalinsights:Workspace                       appInsights  created (31s)       
 +   ├─ azure-native:keyvault:Vault                                      vault        created (34s)       
 +   ├─ azure-native:insights:Component                                  component    created (1s)        
 +   ├─ azure-native:machinelearningservices/v20231001:Workspace         mlworkspace  created (33s)       
 +   └─ azure-native:machinelearningservices:ManagedNetworkSettingsRule  mlsrule      created (32s)       1 warning

Diagnostics:
  azure-native:machinelearningservices:ManagedNetworkSettingsRule (mlsrule):
    warning: Failed to read resource after Create. Please report this issue.
        Verbose logs contain more information, see https://www.pulumi.com/docs/support/troubleshooting/#verbose-logging.
Here's the repro, modified to include the missing Vault
using Pulumi;
using Pulumi.AzureNative.Resources;
using Pulumi.AzureNative.Storage;
using OperationalInsights = Pulumi.AzureNative.OperationalInsights;
using Pulumi.AzureNative.OperationalInsights.Inputs;
using Insights = Pulumi.AzureNative.Insights;
using Pulumi.AzureNative.KeyVault;
using MLS = Pulumi.AzureNative.MachineLearningServices;
using Pulumi.AzureNative.MachineLearningServices.Inputs;
using System;

return await Pulumi.Deployment.RunAsync(() =>
{
    var resourceGroup = new ResourceGroup("pl-cs-ml-rg", new ResourceGroupArgs
    {
        Location = "uksouth"
    });

    var storageAccount = new StorageAccount("mlsa", new Pulumi.AzureNative.Storage.StorageAccountArgs
    {
        ResourceGroupName = resourceGroup.Name,
        Sku = new Pulumi.AzureNative.Storage.Inputs.SkuArgs
        {
            Name = Pulumi.AzureNative.Storage.SkuName.Standard_LRS
        },
        Kind = Kind.StorageV2
    });

    var appInsights = new OperationalInsights.Workspace("appInsights", new()
    {
        Features = new WorkspaceFeaturesArgs
        {
            EnableLogAccessUsingOnlyResourcePermissions = true
        },
        PublicNetworkAccessForIngestion = OperationalInsights.PublicNetworkAccessType.Enabled,
        PublicNetworkAccessForQuery = OperationalInsights.PublicNetworkAccessType.Enabled,
        ResourceGroupName = resourceGroup.Name,
        RetentionInDays = 30,
        Sku = new WorkspaceSkuArgs
        {
            Name = OperationalInsights.WorkspaceSkuNameEnum.PerGB2018
        },
        WorkspaceCapping = new WorkspaceCappingArgs
        {
            DailyQuotaGb = -1
        },
        WorkspaceName = "pkmlcsopworkspace"
    });


    var insightsComponent = new Insights.Component("component", new()
    {
        ApplicationType = Insights.ApplicationType.Web,
        FlowType = "Redfield",
        IngestionMode = Insights.IngestionMode.LogAnalytics,
        Kind = "web",
        PublicNetworkAccessForIngestion = Insights.PublicNetworkAccessType.Enabled,
        PublicNetworkAccessForQuery = Insights.PublicNetworkAccessType.Enabled,
        RequestSource = "IbizaMachineLearningExtension",
        ResourceGroupName = resourceGroup.Name,
        ResourceName = "pkmlcsinworkspace",
        RetentionInDays = 90,
        WorkspaceResourceId = appInsights.Id
    });

    var vault = new Vault("vault", new VaultArgs
    {
        ResourceGroupName = resourceGroup.Name,
        Properties = new Pulumi.AzureNative.KeyVault.Inputs.VaultPropertiesArgs
        {
            TenantId = "xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx",
            Sku = new Pulumi.AzureNative.KeyVault.Inputs.SkuArgs
            {
                Family = "A",
                Name = Pulumi.AzureNative.KeyVault.SkuName.Standard,
            },
        }
    });

    var mlWorkspace = new MLS.V20231001.Workspace("mlworkspace", new()
    {
        ApplicationInsights = insightsComponent.Id,
        FriendlyName = "pk-ml-cs-workspace",
        Description = "",
        DiscoveryUrl = "https://uksouth.api.azureml.ms/discovery",
        HbiWorkspace = false,
        Identity = new MLS.V20231001.Inputs.ManagedServiceIdentityArgs
        {
            Type = MLS.V20231001.ManagedServiceIdentityType.SystemAssigned
        },
        KeyVault = vault.Id,
        ManagedNetwork = new MLS.V20231001.Inputs.ManagedNetworkSettingsArgs
        {
            IsolationMode = "AllowInternetOutbound",
            OutboundRules = new InputMap<object>()
        },
        PublicNetworkAccess = MLS.V20231001.PublicNetworkAccess.Enabled,
        ResourceGroupName = resourceGroup.Name,
        Sku = new MLS.V20231001.Inputs.SkuArgs
        {
            Name = "Basic",
            Tier = MLS.V20231001.SkuTier.Basic
        },
        StorageAccount = storageAccount.Id,
        WorkspaceName = "pk-ml-cs-workspace"
    });

    var rule = new MLS.ManagedNetworkSettingsRule("mlsrule", new()
    {
        ResourceGroupName = resourceGroup.Name,
        WorkspaceName = mlWorkspace.Name,
        Properties = new PrivateEndpointOutboundRuleArgs
        {
            Destination = new PrivateEndpointDestinationArgs
            {
                SparkEnabled = false,
                ServiceResourceId = storageAccount.Id,
                SubresourceTarget = "blob"
            },
            Category = MLS.RuleCategory.UserDefined,
            Type = "PrivateEndpoint"
        },
    });
});

Running this with verbose logs shows that it's attempting to perform the GET against https://management.azure.com/pk-ml-cs-workspace-2/outboundRules/mlsrule?api-version=2023-10-01.

According to the specification, this endpoint should be /subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.MachineLearningServices/workspaces/{workspaceName}/outboundRules/{ruleName}.

Running this under a debugger shows that the URL is being constructed from the resource's id field, which for all other resources starts with /subscriptions/0282681f-7a9e-424b-80b2-96babd57a8a1/resourceGroups/pl-cs-ml-rg20c310d8 etc, but this resource is starting with just the id of the ml workspace.

Side note: I don't seem able to delete this rule resource within the portal either - nothing happens when I click the delete button:
image

@pierskarsenbarg
Copy link
Member Author

pierskarsenbarg commented Apr 19, 2024

No you have to delete the ml workspace to delete the rule in the portal.

@danielrbradley
Copy link
Member

When performing the create, we calculate the id based on the specification and send the PUT request correctly.

The Azure response contains an id field which, if present, we take to be the authoritive source of the canonical Azure identifier for the resource:

if azureId, ok := response["id"].(string); ok {
id = azureId
}

However, on this API, Azure is returning an identifier which is missing the subscription and resource group context:

Original ID: "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/pl-cs-ml-rgxxxxxx/providers/Microsoft.MachineLearningServices/workspaces/pk-ml-cs-workspace-2/outboundRules/mlsrule"
Azure's ID: "pk-ml-cs-workspace-2/outboundRules/mlsrule"

This incorrect identifier is then used for subsequent requests, which then fail.

@pierskarsenbarg
Copy link
Member Author

It knows it’s attached to the workspace because it attaches the rule to the workspace which is weird. I thought it was like that vnet/subnet issue from before

@danielrbradley
Copy link
Member

Manually fixing the id field in the state (exporting, editing then importing) the performing a refresh shows only the diff of the Workspace resource:

     Type                                                                          Name         Plan       Info
     pulumi:pulumi:Stack                                                           sdf-dev                 
     ├─ azure-native:resources:ResourceGroup                                       pl-cs-ml-rg             
     ├─ azure-native:machinelearningservices/v20231001:ManagedNetworkSettingsRule  mlsrule                 
     ├─ azure-native:keyvault:Vault                                                vault                   
 ~   ├─ azure-native:machinelearningservices/v20231001:Workspace                   mlworkspace  update     [diff: ~managedNetwork,systemData]
     ├─ azure-native:storage:StorageAccount                                        mlsa                    
     ├─ azure-native:operationalinsights:Workspace                                 appInsights             
     └─ azure-native:insights:Component                                            component               

Resources:
    ~ 1 to update
    7 unchanged

Therefore there's two separate issues at play here:

  1. Azure's API is broken and returning the wrong id - causing subsequent operations to the rule to fail.
  2. We need to introduce the parent-child awareness to the Workspace to ensure it doesn't overwrite when all rules are managed externally (like VNets).

@danielrbradley
Copy link
Member

As it stands, I would advise against using the standalone rule resource until it's fixed by Azure as it's not currently workable given the broken API. At that point, it would be worth considering the enhancement to make it easier to manage rules just as external resources similar to VNets/Subnets.

@pierskarsenbarg
Copy link
Member Author

pierskarsenbarg commented Apr 19, 2024

Yeah I’ve already advised them not to but they were only doing it because MS told them to.

I don’t know how keen they are to start pushing MS to fix it but I’ll find out

@danielrbradley
Copy link
Member

Ok, marking as blocked as there's no reason to start work on the ability to use standalone rule resources until we can successfully rules as resources without errors from the API.

@danielrbradley danielrbradley added blocked awaiting-upstream Awaiting upstream dependency labels Apr 19, 2024
@Werner-Swart-83
Copy link

@danielrbradley which API endpoint is returning the wrong id? We are engaging with MS on this from our side.

@danielrbradley
Copy link
Member

@Werner-Swart-83 the incorrect ID is being returned after awaiting the result of the PUT /subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.MachineLearningServices/workspaces/{workspaceName}/outboundRules/{ruleName} creates a response which includes an id field with the value {workspaceName}/outboundRules/{ruleName}.

e.g. PUT /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/pl-cs-ml-rgxxxxxx/providers/Microsoft.MachineLearningServices/workspaces/pk-ml-cs-workspace-2/outboundRules/mlsrule
Responds with { "id": "pk-ml-cs-workspace-2/outboundRules/mlsrule", ... }

@Werner-Swart-83
Copy link

@danielrbradley MS has confirmed that the fix for that endpoint has been applied to WestEurope.

@danielrbradley
Copy link
Member

That's great! Thank you for the update @Werner-Swart-83

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-upstream Awaiting upstream dependency blocked kind/bug Some behavior is incorrect or out of spec upstream/service
Projects
None yet
Development

No branches or pull requests

3 participants