Unable to reliably create VirtualNetworkLink (404 not found error) #3138

Open
alecor191 opened this issue Mar 9, 2024 · 2 comments
Labels: area/providers, impact/reliability (Something that feels unreliable or flaky), kind/bug (Some behavior is incorrect or out of spec)

alecor191 commented Mar 9, 2024

What happened?

When creating a new VirtualNetworkLink resource, the operation fails with the following:

  azure-native:network:VirtualNetworkLink (vnl-dev-app-blue-hub-we)
    error: 1 error occurred:
	autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" 
	Message="The Resource 'Microsoft.Network/privateDnsZones/app-blue-dev-blue.privatelink.westeurope.azmk8s.io/virtualNetworkLinks/vnl-dev-app-blue-hub-we' 
          under resource group 'rg-dev-app-cluster-blue-we' 
          was not found. 
        For more details please go to https://aka.ms/ARMResourceNotFoundFix"

This is strange for the following reasons:

  • The action that is supposed to create the VirtualNetworkLink resource fails with a 404 saying the resource was not found. It doesn't seem like this should ever happen.

  • The VirtualNetworkLink resource was created successfully, despite the Pulumi stack failing [image omitted].

  • When manually deleting the created VirtualNetworkLink and running pulumi up again, it works, i.e. Pulumi is able to create the resource and complete the pulumi up operation successfully.

Example

Here is the code (TypeScript) used to create the VirtualNetworkLink resource:

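// Import path assumed; the original snippet did not show its imports.
import { VirtualNetworkLink } from '@pulumi/azure-native/network';

// privateZone, resourceGroup, and hubNetworkId are defined elsewhere in the program.
const vnetLinkName = 'vnl-dev-app-blue-hub-we';

const vnetLink = new VirtualNetworkLink(vnetLinkName, {
  location: 'Global',
  privateZoneName: privateZone.name,
  registrationEnabled: false,
  resourceGroupName: resourceGroup.name,
  virtualNetwork: {
    id: hubNetworkId,
  },
  virtualNetworkLinkName: vnetLinkName,
});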

Output of pulumi about

CLI
Version      3.109.0
Go Version   go1.22.0
Go Compiler  gc

Plugins
NAME    VERSION
nodejs  unknown

Host
OS       Microsoft Windows 11 Pro
Version  10.0.22631 Build 22631
Arch     x86_64

This project is written in nodejs: executable='C:\Program Files\nodejs\node.exe' version='v20.11.1'

Additional context

  • Using package "@pulumi/azure-native": "2.30.0"
  • The repro appears to be 100% if the Virtual Network and the Private DNS Zone are created together with the VirtualNetworkLink in the same pulumi up operation: it always fails when all resources need to be created, and it always works on re-run (when the VNet and DNS zone already exist).
  • Note that there shouldn't be a race condition, as the VirtualNetworkLink references both the VNet and the DNS zone (see code above), so Pulumi is aware of the dependency and should wait for both to be created.

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

alecor191 added the kind/bug and needs-triage labels on Mar 9, 2024
mjeffryes (Contributor) commented

I believe that the source of the 404 is that we do a read of each resource immediately after creation (this improves consistency with later refresh commands). What's probably happening is that the vnetLink create command returns successfully, but the resource is actually still initializing (possibly due to waiting on some initialization in the zone or vnet given the difference in behavior when those are already provisioned), and the read fails to find it.

One option would be to retry this 404 with a backoff to see whether it resolves on a later attempt.
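To make the retry idea concrete, here is a minimal TypeScript sketch. It is not the provider's actual code: `readWithRetryOn404`, `makeFlakyRead`, `isNotFound`, the `statusCode` error property, and the default attempt count and delay are all hypothetical names and values chosen for illustration. It simulates a read that returns 404 a few times after creation before the resource becomes visible.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical error shape: any thrown object carrying statusCode === 404.
function isNotFound(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { statusCode?: number }).statusCode === 404
  );
}

// Retries `read` while it fails with a 404, doubling the delay after each
// failed attempt. Any other error, or a 404 on the final attempt, is rethrown.
async function readWithRetryOn404<T>(
  read: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500,
  sleep: Sleep = defaultSleep,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await read();
    } catch (err) {
      if (!isNotFound(err) || attempt >= maxAttempts) {
        throw err;
      }
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Simulated eventually consistent service: the first `failuresBeforeSuccess`
// reads after "creation" fail with a 404, later reads succeed.
function makeFlakyRead(failuresBeforeSuccess: number): () => Promise<string> {
  let calls = 0;
  return async () => {
    calls++;
    if (calls <= failuresBeforeSuccess) {
      throw Object.assign(new Error("ResourceNotFound"), { statusCode: 404 });
    }
    return "vnl-dev-app-blue-hub-we";
  };
}

// Usage: two 404s, then success on the third attempt.
readWithRetryOn404(makeFlakyRead(2), 5, 10).then((name) =>
  console.log(`read succeeded: ${name}`),
);
```

Capping the number of attempts keeps a genuinely missing resource from stalling the operation indefinitely, and restricting the retry to 404 leaves other failures (authorization, validation) to surface immediately.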

mjeffryes added the area/providers and impact/reliability labels and removed the needs-triage label on Mar 12, 2024
alecor191 (Author) commented

Thanks a lot @mjeffryes for looking into this. Do you have any thoughts on whether this will be addressed, and by when? It would help us with planning: this is currently quite an issue in our automation pipeline, as we need to manually retry workflows.
