# Telemetry

## Getting Started

Starting in 2016, Portal telemetry has moved completely to the Kusto-based solution.

Pre-requisites

  • Kusto.Explorer: Application
  • Kusto Cluster Info
    • Name: AzPortal
    • Data Source: https://AzPortal.kusto.windows.net
  • Permissions
    • Partner teams wanting access need to join the Azure Portal Data group.

Kusto Documentation & Links

  • Documentation
  • Kusto Discussions

Who can I contact?

  • Ibiza Performance/Reliability - Telemetry PM for Ibiza Performance and Reliability Telemetry
  • Ibiza Create - Telemetry PM for Ibiza Create Telemetry
  • Azure Fx Gauge Team - Telemetry Team

## Kusto Telemetry

Supported Databases

| Name | Details |
| --- | --- |
| AzPtlCosmos | This is our main telemetry database. Data here is deduped, geo-coded, expanded and filtered. All the official dashboards/reports are based on this database. It is highly encouraged that you use this database for your needs. Data here is persisted for 120 days and excludes test traffic. |
| AzurePortal | There will be many scenarios where you want to debug your issues, e.g., debugging perf issues. To look at diagnostic events, this is the right database to use. This is the raw data coming from MDS directly to Kusto and it is unprocessed. Data here is persisted for 45 days. To filter out test traffic when querying this database, use userTypeHint == "". |
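
For example, here is a minimal sketch of the same count against each database; the 1-day window and the BladeReady filter are purely illustrative, and the column names are the ones described later in this document:

```
// AzPtlCosmos: test traffic is already filtered out.
ClientTelemetry
| where PreciseTimeStamp > ago(1d)
| where Action == "BladeReady"
| count

// AzurePortal: raw data, so exclude test traffic explicitly.
ClientTelemetry
| where PreciseTimeStamp > ago(1d)
| where Action == "BladeReady" and userTypeHint == ""
| count
```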

Supported Kusto Tables

| Database | Table Name | Details |
| --- | --- | --- |
| AzPtlCosmos | ClientTelemetry | This is all the client telemetry data that is collected from the portal. This is the main table that should be good for most scenarios. |
| AzPtlCosmos | ExtTelemetry | This holds client events data for extensions using the Extension Telemetry feature. |

Other useful Kusto tables are the ones where errors and warnings are getting logged. These tables are currently available only under AzurePortal database:

| Database | Table Name | Details |
| --- | --- | --- |
| AzurePortal | ClientEvents | This table contains errors and warnings thrown from the Framework and Hubs IFrames. |
| AzurePortal | ExtEvents | This table contains errors and warnings thrown from an extension's IFrame. Your extension will log to this table only if you have previously onboarded to the ExtTelemetry/ExtEvents tables. |
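
As a quick sketch, the public helper functions described later in this document can be used to pull recent errors for an extension; "Compute" here is only an example extension name:

```
// Top 10 errors logged by the Compute extension in the last hour (AzurePortal database).
Top10ExtErrorsFromLastHour("Compute", "Error", "portal.azure.com")
```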

Supported Functions

Supported Functions

Other functions in the databases are available for exploration but are mainly intended for internal usage and are subject to change at any time.

Query for Reported Numbers

On a weekly basis, we send out a Weekly Ibiza Status mail where we cover the KPI numbers for all extensions among other things. For folks not getting these emails, please join one of the groups in the screenshot below.

These emails have clickable Kusto links within the reported numbers. Clicking on these will take you to the Kusto query behind getting these numbers. We use functions to hide the complexity behind the queries that we use. To view the details about the queries, look under Functions\Public. Once you find the right function, if you right-click and do “Make a command script”, you will be able to see the details of that function. You can do this recursively for any functions underneath.
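
If you prefer to stay in the query window, Kusto's control commands can show the same information; GetCreateFunnel below is just one example of a public function referenced later in this document:

```
// List the functions exposed by the database, then show the body of one of them.
.show functions

.show function GetCreateFunnel
```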

Connection Scope

Supported Cosmos streams

While we have moved to Kusto, we still have streams that continue to exist. This could be required if you want to enable some E2E automation, write super-complex queries that Kusto is unable to handle or need data older than 120 days.

| Name | Schema | Cosmos Link |
| --- | --- | --- |
| Client Telemetry | DataSet=53004 Daily | ClientTelemetry |
| ClientTelemetryForKustoExport | DataSet=93405 Hourly | ClientTelemetry |

We plan to merge the ClientTelemetryForKustoExport stream into the ClientTelemetry stream very shortly. ClientTelemetryForKustoExport is the stream that currently feeds the Kusto database AzPtlCosmos.

ClientTelemetry (AzPtlCosmos)

Action

This represents an event in the portal.

The following actions are logged to ClientTelemetry table:
  • Blade events

    • BladeLoaded
      • Tracks the time it takes to open the blade and start seeing the part frames show up. BladeLoaded also includes loading and opening the action bar.
    • BladeLoadErrored
      • Triggered when loading a blade failed.
      • This event is used to track blade errors in our reliability metrics.
    • BladeOpened
      • Tracks the time it takes for BladeLoaded + all the parts to start loading. More specifically, it is when the blade’s Above The Fold lenses, parts and widgets have been created. It includes setting up the input bindings. The inputs themselves aren’t necessarily available yet (onInputsSet is not necessarily called yet). It also includes loading the collapsed state of the essentials part (if there is one).
    • BladeRevealed
      • All parts above the fold have called reveal content or resolved onInputsSet(). This action is triggered when a Blade is revealed but the parts within the blade may still be loading.
      • This event is used in our blade performance metrics.
    • BladeReady
      • All parts above the fold have resolved onInputsSet(). This action is triggered when a Blade Load is complete and it's ready for consumption by the user.
    • BladeFullOpened
      • Is the same as BladeOpened except it is for all the parts, not just the parts above the fold.
    • BladeFullRevealed
      • Is the same as BladeRevealed except it is for all the parts, not just the parts above the fold.
    • BladeFullReady
      • Is the same as BladeReady except it is for all the parts, not just the parts above the fold.
    • BladeButtonClicked
      • When the pin, unpin, maximize, minimize or close button on a blade is clicked.
    • CommandExecuted
      • When any of the Commands on a blade is clicked - like start, stop, etc.

    "name" column provides the name of the blade. This name is provided in "Extension/extension_name/Blade/blade_name" format.

  • Part events

    • PartClick
      • Triggered when a part is clicked.
    • PartLoaded
      • Tracks the time it takes for a part to start getting filled with some UI (e.g. … spinner)
    • PartErrored
      • Triggered when loading a part failed.
      • This event is used to track part errors in our reliability metrics.
    • PartReady
      • Triggered when the part has resolved onInputsSet().

    "name" column provides the name of the part. This name is provided in "Extension/extension_name/Blade/blade_name/Part/part_name" format.

  • Portal Ready events

    • TotalTimeToPortalReady
      • Tracks the time it takes to load the portal (load the splash screen and show the startboard or start rendering the blade if it was a deep link).
    • TotalTimeToStartBoardReady
      • Tracks the time to load the portal and show the startboard.
    • TotalTimeToDeepLinkReady
      • This event is triggered only if a user is using a deep link to call up the portal. It tracks the time it takes to load the portal and start rendering the deep linked blade.

    The portal load time is tracked in the "duration" column.

  • Extension events

    • LoadExtensions
      • Measures the time from when the Shell creates the extension's IFrame until the Shell receives the extension's manifest.
      • "actionModifier" = start is triggered when an extension starts loading
      • "actionModifier" = cancel is triggered when an extension fails loading
      • "actionModifier" = complete is triggered when an extension finishes loading
    • InitializeExtensions
      • Measures the time since Shell receives the extension manifest until Shell receives an RPC response stating that the extension's state is Initialized.
      • "actionModifier" = start is triggered when an extension starts being initialized
      • "actionModifier" = cancel is triggered when an extension's initialization fails
      • "actionModifier" = complete is triggered when an extension's initialization finishes

    "name" column provides the name of the extension which is being loaded/initialized.

  • Create events

    • CreateFlowLaunched
      • Triggered when a user expresses the intent to create a resource in the Portal by launching its create blade. This event is mostly logged from the Marketplace extension. This event can be found mostly in ExtTelemetry table (where the logs from Marketplace extension go) and only partially in ClientTelemetry table.
    • ProvisioningStarted / ProvisioningEnded
      • Triggered when a new deployment started/ended. This event is being logged for both custom and ARM deployments.
    • CreateDeploymentStart / CreateDeploymentEnd
      • Triggered only if the deployment is done using the ARM Provisioner provided by Framework. For ARM deployments, the order of the logged events for a deployment is: "ProvisioningStarted", "CreateDeploymentStart", "CreateDeploymentEnd" and "ProvisioningEnded". Note that "CreateDeploymentStart" and "CreateDeploymentEnd" are only logged if the deployment is accepted by ARM. "CreateDeploymentStart"/"CreateDeploymentEnd" logs contain the correlationId that can be used to search for the deployment's status in ARM.

    "name" column provides the name of the package getting deployed, while "data" column provides more information about the deployment.

  • Side Bar events

    • SideBarItemClicked
      • When one of the items on the Side Bar (except + or Browse All) is clicked.
    • SideBarFavorite
      • When a resource type is marked as a favorite
    • SideBarUnFavorite
      • When a resource type is removed as a favorite

ActionModifier

This is used in tandem with the Action field. It represents the status of a particular Action. For example, for BladeReady you will see ActionModifier values of start, complete, and cancel.

Area

This field usually gives the extension name associated with the particular Action. This is derived from either the Name field or the Source field, depending on the Action.

Blade

This field gives the Blade name associated with the particular Action. This is derived from either the Name field or the Source field, depending on the Action.

BrowserFamily

This field represents the name of the Browser used by the User. This is derived from the UserAgent field

BrowserMajorVersion

This field represents the Major Version of the Browser used by the User. This is derived from the UserAgent field

BrowserMinorVersion

This field represents the Minor Version of the Browser used by the User. This is derived from the UserAgent field

ClientTime

This field gives the actual time of the event according to the client's clock (which can be off based on the client settings). This is a good field to reconstruct the precise sequence of events.

Data

The Data field is the most dynamic field in telemetry. It is a JSON object with no set structure, and it often contains information specific to a particular Action.

Below is an example of the Data field for the Action "ProvisioningStarted":

	{
		"oldCreateApi": true,
		"launchingContext": {
		"galleryItemId": "Microsoft.SQLDatabase",
		"source": [
			"GalleryCreateBlade"
		],
		"menuItemId": "recentItems",
		"itemIndex": 0
		}
	}
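
Because Data is free-form JSON, values are typically pulled out with extractjson or parsejson. A minimal sketch against the example above (the JSON path follows the sample payload and may differ for other Actions):

```
// Count ProvisioningStarted events by the gallery item that launched them, last 7 days.
ClientTelemetry
| where PreciseTimeStamp > ago(7d)
| where Action == "ProvisioningStarted"
| extend galleryItemId = extractjson("$.launchingContext.galleryItemId", Data, typeof(string))
| summarize Count = count() by galleryItemId
| order by Count desc
```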

Duration

This field gives the duration a particular Action took to complete, in milliseconds. This value is non-zero only for Actions whose ActionModifier is a completion value such as "complete" or "succeeded".
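
For example, a sketch that uses Duration to look at blade load times; the extension name is hypothetical and "complete" is the ActionModifier described above:

```
// Median and 95th percentile BladeReady duration (ms) per blade for one extension, last 24 hours.
ClientTelemetry
| where PreciseTimeStamp > ago(1d)
| where Action == "BladeReady" and ActionModifier == "complete"
| where Name startswith "Extension/Microsoft_Azure_NewExtension/"
| summarize percentiles(Duration, 50, 95) by Name
```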

JourneyId

This field provides the journey Id for each action. A journey is basically a tiny sub-session within which a user navigates a flow of blades. This id allows us to identify the actions that a user took within any given journey, how many journeys a user interacted with, etc.
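
A small sketch of using JourneyId, e.g. to see how many distinct journeys a typical session contains (column names as described in this section):

```
// Average number of distinct journeys per session over the last day.
ClientTelemetry
| where PreciseTimeStamp > ago(1d)
| summarize Journeys = dcount(JourneyId) by SessionId
| summarize avg(Journeys)
```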

Lens

This field gives the Lens name associated with the particular Action. This is derived from either the Name field or the Source field, depending on the Action.

Name

The Name field usually changes its format based on the Action. In most scenarios, it has the following format:

    Extension/<extensionName>/Blade/<BladeName>/Lens/<LensName>/PartInstance/<PartName>

This field is usually used to identify the extension\Blade\Lens\Part associated with a particular Action.

SessionId

This represents each session that the user opens. The SessionId refreshes every time a user logs in or refreshes the portal.

Part

This field gives the Part name associated with the particular Action. This is derived from either the Name field or the Source field, depending on the Action.

PreciseTimeStamp

This field gives the time the event was logged by the server. It is in UTC.

UserId

This field identifies a user by PUID. We can use this to answer questions like daily active users, unique users using my feature, etc.
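
For instance, a sketch of a daily active users query built on UserId, scoped to one hypothetical extension via the Name format described above:

```
// Distinct users per day who loaded a blade from one extension, last 30 days.
ClientTelemetry
| where PreciseTimeStamp > ago(30d)
| where Action == "BladeReady"
| where Name startswith "Extension/Microsoft_Azure_NewExtension/"
| summarize DailyActiveUsers = dcount(UserId) by bin(PreciseTimeStamp, 1d)
| order by PreciseTimeStamp asc
```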

UserAgent

This represents the user agent of the user. This is a standard UserAgentString - User Agent

UserCity

This represents the city from which the user accessed the portal. We derive this from the user's client IP.

UserCountry

This represents the country from which the user accessed the portal. We derive this from the user's client IP.

Read more about Kusto query language.

# Portal Telemetry Overview

Ibiza portal tracks several pieces of information as users navigate through the portal. Extensions do not need to consume any APIs to have this information collected.

Note: Currently, telemetry is made available to partners through Kusto. In order to access the Portal logs, you must get access to the Azure Portal Data group.

You can access our Kusto cluster using Kusto Explorer or Kusto Web Explorer. Our Kusto cluster contains two databases:

  • AzurePortal - which contains the raw data
  • AzPtlCosmos - which is our main telemetry database used in all the official dashboards and reports. Data from this database is deduped, geo-coded, expanded and has test traffic filtered out.

There are two tables used for telemetry:

  • ClientTelemetry - contains telemetry logged by Framework and Hubs. In this table, you can find all the telemetry events (e.g. BladeLoaded, PartLoaded) which are logged by default for any extension which is registered in the portal.
  • ExtTelemetry - contains extension telemetry. As an extension author, you may log additional telemetry to this table.
    • Note: Your extension will log to this table only if you have onboarded to the telemetry services provided by Framework.

You can read more here about Kusto and about the data provided in our Kusto cluster.

Tracked Actions

The following actions are logged to ClientTelemetry table:
  • Blade events

    • BladeLoaded
      • Tracks the time it takes to open the blade and start seeing the part frames show up. BladeLoaded also includes loading and opening the action bar.
    • BladeLoadErrored
      • Triggered when loading a blade failed.
      • This event is used to track blade errors in our reliability metrics.
    • BladeOpened
      • Tracks the time it takes for BladeLoaded + all the parts to start loading. More specifically, it is when the blade’s Above The Fold lenses, parts and widgets have been created. It includes setting up the input bindings. The inputs themselves aren’t necessarily available yet (onInputsSet is not necessarily called yet). It also includes loading the collapsed state of the essentials part (if there is one).
    • BladeRevealed
      • All parts above the fold have called reveal content or resolved onInputsSet(). This action is triggered when a Blade is revealed but the parts within the blade may still be loading.
      • This event is used in our blade performance metrics.
    • BladeReady
      • All parts above the fold have resolved onInputsSet(). This action is triggered when a Blade Load is complete and it's ready for consumption by the user.
    • BladeFullOpened
      • Is the same as BladeOpened except it is for all the parts, not just the parts above the fold.
    • BladeFullRevealed
      • Is the same as BladeRevealed except it is for all the parts, not just the parts above the fold.
    • BladeFullReady
      • Is the same as BladeReady except it is for all the parts, not just the parts above the fold.
    • BladeButtonClicked
      • When the pin, unpin, maximize, minimize or close button on a blade is clicked.
    • CommandExecuted
      • When any of the Commands on a blade is clicked - like start, stop, etc.

    "name" column provides the name of the blade. This name is provided in "Extension/extension_name/Blade/blade_name" format.

  • Part events

    • PartClick
      • Triggered when a part is clicked.
    • PartLoaded
      • Tracks the time it takes for a part to start getting filled with some UI (e.g. … spinner)
    • PartErrored
      • Triggered when loading a part failed.
      • This event is used to track part errors in our reliability metrics.
    • PartReady
      • Triggered when the part has resolved onInputsSet().

    "name" column provides the name of the part. This name is provided in "Extension/extension_name/Blade/blade_name/Part/part_name" format.

  • Portal Ready events

    • TotalTimeToPortalReady
      • Tracks the time it takes to load the portal (load the splash screen and show the startboard or start rendering the blade if it was a deep link).
    • TotalTimeToStartBoardReady
      • Tracks the time to load the portal and show the startboard.
    • TotalTimeToDeepLinkReady
      • This event is triggered only if a user is using a deep link to call up the portal. It tracks the time it takes to load the portal and start rendering the deep linked blade.

    The portal load time is tracked in the "duration" column.

  • Extension events

    • LoadExtensions
      • Measures the time from when the Shell creates the extension's IFrame until the Shell receives the extension's manifest.
      • "actionModifier" = start is triggered when an extension starts loading
      • "actionModifier" = cancel is triggered when an extension fails loading
      • "actionModifier" = complete is triggered when an extension finishes loading
    • InitializeExtensions
      • Measures the time since Shell receives the extension manifest until Shell receives an RPC response stating that the extension's state is Initialized.
      • "actionModifier" = start is triggered when an extension starts being initialized
      • "actionModifier" = cancel is triggered when an extension's initialization fails
      • "actionModifier" = complete is triggered when an extension's initialization finishes

    "name" column provides the name of the extension which is being loaded/initialized.

  • Create events

    • CreateFlowLaunched
      • Triggered when a user expresses the intent to create a resource in the Portal by launching its create blade. This event is mostly logged from the Marketplace extension. This event can be found mostly in ExtTelemetry table (where the logs from Marketplace extension go) and only partially in ClientTelemetry table.
    • ProvisioningStarted / ProvisioningEnded
      • Triggered when a new deployment started/ended. This event is being logged for both custom and ARM deployments.
    • CreateDeploymentStart / CreateDeploymentEnd
      • Triggered only if the deployment is done using the ARM Provisioner provided by Framework. For ARM deployments, the order of the logged events for a deployment is: "ProvisioningStarted", "CreateDeploymentStart", "CreateDeploymentEnd" and "ProvisioningEnded". Note that "CreateDeploymentStart" and "CreateDeploymentEnd" are only logged if the deployment is accepted by ARM. "CreateDeploymentStart"/"CreateDeploymentEnd" logs contain the correlationId that can be used to search for the deployment's status in ARM.

    "name" column provides the name of the package getting deployed, while "data" column provides more information about the deployment.

  • Side Bar events

    • SideBarItemClicked
      • When one of the items on the Side Bar (except + or Browse All) is clicked.
    • SideBarFavorite
      • When a resource type is marked as a favorite
    • SideBarUnFavorite
      • When a resource type is removed as a favorite

Logging

There are two options for collecting telemetry and error/warning logs. You can either configure and use the Portal Framework's built-in telemetry services or you can utilize an entirely custom telemetry system.

We advise you to use the telemetry controller provided by Framework in order to take advantage of the system which is already in place.

Information should be collected in a way that ensures no personally identifiable information (PII) is captured. It is very important for security and compliance reasons that PII data is not sent to telemetry services, and you should have practices in place to ensure that this is enforced.

Onboarding to ExtTelemetry/ExtEvents tables

To start using the built-in controller provided by Framework for collecting telemetry and error/warning logs, just add this.EnablePortalLogging = true; in the constructor of your extension definition class:

  public Definition(ApplicationConfiguration applicationConfiguration)
  {
      this.EnablePortalLogging = true;
  }

You can read here more details about using the telemetry controller provided by Framework.

Logging telemetry to ExtTelemetry table

You can use the Portal telemetry APIs to log telemetry. However, before you do so you will need to initialize the telemetry service.

  // Initialize the telemetry functionality and make it available for use.
  MsPortalFx.Base.Diagnostics.Telemetry.initialize("ExtensionName", false /* traceBrowserInformation */ );

Note that you don't need to trace browser information to your particular extension as this data is collected globally. However, if you would like the browser information in your own telemetry store set traceBrowserInformation to true.

To log telemetry, you can call the trace method as shown below:

  MsPortalFx.Base.Diagnostics.Telemetry.trace({
      extension: "Microsoft_Azure_NewExtension",
      source: "Links",
      action: "LinkClicked",
      name: "Recommended",
      data: {...}
  });

Telemetry logs go to the ExtTelemetry table, which is available in Kusto in both the AzurePortal and AzPtlCosmos databases. The recommended format for the name column is 'Extension/Microsoft_Azure_NewExtension/Blade/NewBladeName', if the event is related to a blade. Please do not stringify the data and context columns when passing them through. These columns usually contain JSON values; pass their values as objects, since stringifying them results in double-encoded strings.
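
Once events start flowing, you can check them in Kusto. A sketch, assuming the ExtTelemetry table exposes the trace payload under column names matching the fields above; the action and name values are the hypothetical ones from the example:

```
// Recent custom telemetry events matching the example trace call (AzurePortal database).
ExtTelemetry
| where PreciseTimeStamp > ago(1h)
| where Action == "LinkClicked" and Name == "Recommended"
| project PreciseTimeStamp, Action, Name, Data
| take 100
```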

Logging errors/warnings to ExtEvents table

To log errors/warnings, you can call the error/warning methods as shown below:

  var log = new MsPortalFx.Base.Diagnostics.Log("logging_area");
  log.warning(errorMessage, code, args);
  log.error(errorMessage, code, args);

Args can be provided for additional information to get logged together with the message. Pass it as an object and do not stringify it before passing it through.

Errors and warnings are logged to ExtEvents table, which is available in Kusto only in AzurePortal database.

NOTE: Verbose logging is currently disabled in mpac/production, in order to prevent overly aggressive logging. We recommend using verbose logging only for debugging.

We have built the Extension Errors Dashboard to make it easier for you to analyze the errors and warnings thrown by your extension.

NOTE: In the charts from Extension Errors Dashboard, we aggregate the error messages by omitting the text which is within double quotes (") or single quotes ('). We consider those parts to be the dynamic part of the message (e.g. an id, a timestamp etc.). For example, a message like [Could not find part "PartName1"] will be treated as [Could not find part ""]. Please use this format for all the logged error messages, if you want them to be aggregated by our queries.

Available Power BI Dashboards

Following are some of the dashboards that we support. If you do not have access to any of these, please request the right permissions as described in Getting Started.

| Name | PowerBI Link | Metrics Description |
| --- | --- | --- |
| Portal User Adoption Dashboard | http://aka.ms/portalfx/dashboard/PortalUserAdoption | |
| Portal Performance Dashboard | http://aka.ms/portalfx/dashboard/PortalPerformance | Perf Docs |
| Portal Reliability Dashboard | http://aka.ms/portalfx/dashboard/PortalReliability | Reliability Docs |
| Portal Create Dashboard | http://aka.ms/portalfx/dashboard/PortalCreate | Create Docs |
| Extension Errors Dashboard | http://aka.ms/portalfx/dashboard/ExtensionErrors | Extension Errors Docs |

Collecting Feedback From Your Users

In February 2016 we introduced a standardized pane for collecting user feedback. We currently expose one method to extension developers.

Resource Deleted Survey

To ask a user why they deleted a resource use the openResourceDeletedFeedbackPane method:

  import * as FxFeedback from "Fx/Feedback";
  FxFeedback.openResourceDeletedFeedbackPane("displayNameOfTheDeletedResource", optionalObjectWithAnyAdditionalDataYouWantToLog);

Call this method after a user starts the deletion process for a resource. Shell will open the feedback pane with a standardized survey. The name of the resource you pass to the method will be shown to the user in the survey. Responses to this survey are logged to the telemetry tables. If the feedback pane is already open, calls to this method are no-ops.

Questions?

Read more about Kusto query language.

Ask questions on: https://stackoverflow.microsoft.com/questions/tagged?tagnames=ibiza-telemetry

How to view Live Telemetry

Using Fiddler

  1. Install Fiddler - http://www.telerik.com/fiddler
  2. Open Fiddler and configure the "Filters" as shown in the screenshot below.
  3. Open Portal and you should see all relevant telemetry logs emitted here.

NOTE

  • If the sign in flow would normally require 2FA (i.e. you are not already signed in), Fiddler will break the sign in flow
  • Fiddler can capture your passwords

Using Console Logs

  1. Enable Console Telemetry - https://portal.azure.com/?feature.consoletelemetry=true#
  2. Hit F12 and view the "Console" Tab.
  3. You will be able to see most telemetry logs within this window. The only known Action that doesn't show up here is CreateFlowLaunched.

Viewing Blade Names

Pressing Ctrl-Alt-D in the Ibiza portal shows some component loading times


## Extension Errors Dashboard

The Extension Errors Dashboard gives you the ability to look into the errors and warnings thrown by your extension.

To view the Extension Errors PowerBi dashboard follow this link: Extension Errors PowerBi dashboard

Prerequisites

NOTE: Your extension's errors/warnings will be tracked in this dashboard only if you have previously onboarded to the ExtTelemetry/ExtEvents tables.

Getting access to the Extension Errors Dashboard

In order to get access to the Extension Errors Dashboard, you will need to join the Azure Portal Data group.

Where to look for error/warning spikes

"Errors by Environment" and "Warnings by Environment" are the charts that you need to monitor. You should check to see if there are any significant spikes in the report.

There are three charts on each column:

  • Affected Users % = the number of users who hit at least one error, divided by the total number of users using the portal. This chart is very useful for detecting changes in the error percentage pattern.
  • Affected Users Count = the total number of users who had an error thrown by the portal.
  • Error Count = the total number of errors thrown by the portal.

In order to hide irrelevant spikes (where the portal is used by fewer than 10 users), you can select the option "Show Data" -> "Where total users > 10".

Find the cause of error/warning spikes

If you want to analyze a spike, you can drill down into the top errors thrown by your extension in a specific hour by going to the "1 Hour Error Drilldown" chart.

You can drill down into the errors thrown by the extension by using the following functions from Kusto (AzurePortal database):

  • query to get the error counts for a specific environment between a startTime and an endTime, grouped by a specific time granularity (e.g. 1 hour):
GetExtensionErrorCounts(datetime("2016-07-25 00:00:00"), datetime("2016-07-26 00:00:00"), "Compute", "Error", "portal.azure.com", 1h)
| where clientVersion == "4.12.102.0 (82a67ee.160722-1641)"
  • query to get the top 10 errors from last hour, independent of client version:
Top10ExtErrorsFromLastHour("Compute", "Error", "portal.azure.com")
  • query to get a complete list of all the error messages for a specific environment that follow a message pattern between a startTime and an endTime:
GetExtensionErrorsByAggregatedErrorMessage(datetime("2016-07-25 18:15:00"), datetime("2016-07-26 18:30:00"), "Compute", "Error", "portal.azure.com", 'message: Script error')
| where clientVersion == "4.12.102.0 (82a67ee.160722-1641)"
| take 1000

Query hints:

  • You can select all the error messages between startTime and endTime by using "*" when looking for the error message. Otherwise, you can search by the entire aggregated error message or just by a part of it (e.g. 'message: Script error').
  • ErrorType can be: "Error", "Warning" or "Verbose".
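
For example, a sketch that pulls every aggregated warning message for an extension over a short window, using "*" as the message pattern (extension name and time range are illustrative):

```
// All aggregated warning messages for the Compute extension in a 15-minute window.
GetExtensionErrorsByAggregatedErrorMessage(datetime("2016-07-25 18:15:00"), datetime("2016-07-25 18:30:00"), "Compute", "Warning", "portal.azure.com", "*")
| take 1000
```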

Another useful chart is the "Last 24 Hours Error Summary", which shows the errors thrown by an extension, aggregated over the last 24 hours.

NOTE: We aggregate the error messages by omitting the text which is within double quotes (") or single quotes ('). We consider those parts to be the dynamic part of the message (e.g. an id, a timestamp etc.). For example, a message like [Could not find part "PartName1"] will be treated as [Could not find part ""]. Please use this format for all the logged error messages, if you want them to be aggregated by our queries.

Additional information

  • All time stamps shown in this dashboard are UTC time stamps.
  • Currently, the dashboard refreshes automatically 8 times a day (the maximum number of scheduled refreshes allowed by PowerBI), during working hours: 8:00 AM, 9:30 AM, 11:00 AM, 12:30 PM, 2:00 PM, 3:30 PM, 5:00 PM and 6:30 PM (Pacific Time).

# Create Telemetry

The Ibiza Create Flow PowerBi dashboard gives you live access to your extension's create flow telemetry. To view the Ibiza Create Flow PowerBi dashboard follow this link: Ibiza Create Flow PowerBi dashboard

Prerequisites

Getting access to the Ibiza Create Flow Dashboard

In order to get access to the Create Flow Dashboard, you will need to get access to the Security Group 'Azure Portal Data' (auxdatapartners).

Optional: Running or Creating Modified Kusto Queries

If you want to run or create modified versions of the Kusto queries provided below in the code samples, you will need access to our Kusto data tables. How to get set up with Kusto and how to get access is explained here: PortalFx Telemetry - Getting Started

All queries performed here are against the AzurePortal.AzPtlCosmos database and we will assume you understand the ClientTelemetry table. To get up to speed on the ClientTelemetry table in the AzPtlCosmos database, check out: PortalFx Telemetry - Kusto Databases

Create Flow Funnel

Overview

This report gives you the ability to look at your extension's live create flow telemetry, giving you an overview of your create blade's usage and deployment success numbers.

Description of data fields

  • Create Flow Launched - The number of times your create blade has been opened by users.
  • Deployment Started - The number of times that a create has been started by your create blade.
  • Deployment Started % - A percentage representing how often your create blade being opened leads to a create being started.
  • Deployment Succeeded - The number of successful create deployments.
  • Deployment Succeeded % - The percentage of creates that led to a successful deployment. Unsuccessful deployments would include cancellations and failures.
  • Old Create API
    • If true, this means that your create blade is using the old deprecated Parameter Collector (Parameter Collector V1 or V2) and therefore your create telemetry is not dependable and is potentially inaccurate.
    • It is recommended to get this field to 'false', by updating your create blade to use the new Parameter Collector (Parameter Collector V3).
  • Custom Deployment
    • If true, this means that your create does not go through the official Arm Provisioner and therefore you only receive limited telemetry reporting.
    • If your experience's create deployments go through Arm but you are not using the Arm Provisioning then please refer to the Create Engine documentation

How these numbers are generated

Create Flow Funnel - Kusto Query

let timeSpan = 30d;
let startDate = GetStartDateForLastNDays(timeSpan);
let endDate = GetEndDateForTimeSpanQueries();
GetCreateFunnel(startDate, endDate)

Create Flow Funnel By Resource Name - Kusto Query

let timeSpan = 30d;
let startDate = GetStartDateForLastNDays(timeSpan);
let endDate = GetEndDateForTimeSpanQueries();
GetCreateFunnelByResourceName(startDate, endDate)

Create Flow Funnel Success Rate - Kusto Query

let timeSpan = 30d;
let startDate = GetStartDateForLastNDays(timeSpan);
let endDate = GetEndDateForTimeSpanQueries();
GetCreateFunnelByDay(startDate, endDate)
| extend DeploymentSucceededPerc = iff(DeploymentStarted == 0, 0.0, todouble(DeploymentSucceeded)/DeploymentStarted)
| project Date, ExtensionName, CreateBladeName, DeploymentSucceededPerc 

Create Flow Funnel Weekly Deployment Success - Kusto Query

let today = floor(now(),1d); 
let fri= today - dayofweek(today) - 2d; 
let sat = fri - 6d; 
let startDate = sat-35d; 
let endDate = fri; 
GetCreateFunnelByDay(startDate, endDate) 
| where Unsupported == 0 and CustomDeployment == 0 
| extend startOfWeek = iff(dayofweek(Date) == 6d,Date , Date - dayofweek(Date) - 1d) 
| summarize sum(CreateFlowLaunched), sum(DeploymentStartedWithExclusions), sum(DeploymentSucceeded) by startOfWeek, ExtensionName 
| extend ["Deployment Success %"] = iff(sum_DeploymentStartedWithExclusions == 0, todouble(0), bin(todouble(sum_DeploymentSucceeded)/sum_DeploymentStartedWithExclusions*100 + 0.05, 0.1)) 
| project startOfWeek, ["Extension"] = ExtensionName, ["Deployment Success %"]

Create Flow Origins

Overview

This report gives you an overview of where your create blade is being linked to and launched from.

Description of data fields

  • Create Flow Launched - The number of times your create blade has been opened by users.
  • New (%) - The percentage representing how often your create blade is opened from +New.
  • Browse (%) - The percentage representing how often your create blade is opened from Browse.
  • Marketplace (%) - The percentage representing how often your create blade is opened from the Marketplace.
  • DeepLink (%) - The percentage representing how often your create blade is opened from an internal or external link.

How these numbers are generated

Create Flow Origins - Kusto Query

let timeSpan = 30d;
let selectedData = 
GetClientTelemetryByTimeSpan(timeSpan, false)
| union (GetExtTelemetryByTimeSpan(timeSpan, false));

selectedData
| where Action == "CreateFlowLaunched"
| extend 
    CreateBladeName = _GetCreateBladeNameFromData(Data, ActionModifier),
    ExtensionId = _GetCreateExtensionNameFromData(Data, ActionModifier),
    OriginFromMenuItemId = extractjson("$.menuItemId", Data, typeof(string)),
    DataContext = extractjson("$.context", Data, typeof(string))
| extend OriginFromDataContext = extract('([^,"]*Blade[^,"]*)', 1, DataContext)
| project CreateBladeName, ExtensionId, Origin = iff(OriginFromMenuItemId == "recentItems" or OriginFromMenuItemId == "deepLinking", OriginFromMenuItemId, OriginFromDataContext), DataContext 
| extend Origin = iff(Origin == "", DataContext, Origin)
| summarize CreateFlowLaunched = count() by ExtensionId, CreateBladeName, Origin
| summarize 
    CreateFlowLaunched = sum(CreateFlowLaunched), 
    New = sum(iff(Origin contains "recentItems" or Origin contains "GalleryCreateMenuResultsListBlade", CreateFlowLaunched, 0)),
    Browse = sum(iff(Origin contains "BrowseResourceBlade", CreateFlowLaunched, 0)),
    Marketplace = sum(iff(Origin contains "GalleryItemDetailsBlade" or Origin contains "GalleryResultsListBlade" or Origin contains "GalleryHeroBanner", CreateFlowLaunched, 0)),
    DeepLink = sum(iff(Origin contains "deepLinking", CreateFlowLaunched, 0))
  by ExtensionId, CreateBladeName 
| join kind = inner (ExtensionLookup | extend ExtensionId = Extension) on ExtensionId
| project
    ["Extension Name"] = ExtensionName, 
    ["Create Blade Name"] = CreateBladeName, 
    ["Create Flow Launched"] = CreateFlowLaunched,
    ["+New"] = New,
    ["+New (%)"] = todouble(New) / CreateFlowLaunched,
    ["Browse"] = Browse,
    ["Browse (%)"] = todouble(Browse) / CreateFlowLaunched,
    ["Marketplace"] = Marketplace,
    ["Marketplace (%)"] = todouble(Marketplace) / CreateFlowLaunched,
    ["DeepLink"] = DeepLink,
    ["DeepLink (%)"] = todouble(DeepLink) / CreateFlowLaunched

Create Flow Errors

Overview

This report gives you an overview of your create blade's create errors, billing issues, and cancellations.

Description of data fields

  • Deployment Cancelled Count - The number of cancelled deployments.
  • Billing Error Count - The number of deployments that resulted in the billing error "no credit card on file".
  • Total Errors Count - The total number of deployments that resulted in a failure/error.
  • Deployment Failed % - The percentage of deployments that resulted in a failure/error.
  • Error Submitting Deployment Request % - The percentage of failed deployments that were caused by the error "Error Submitting Deployment Request".
  • Error Provisioning Resource Group % - The percentage of failed deployments that were caused by the error "Error Provisioning Resource Group".
  • Error Registering Resource Providers % - The percentage of failed deployments that were caused by the error "Error Registering Resource Providers".
  • Unknown Failure % - The percentage of failed deployments that were caused by the error "Unknown Failure".
  • Deployment Request Failed % - The percentage of failed deployments that were caused by the error "Deployment Request Failed".
  • Error Getting Deployment Status % - The percentage of failed deployments that were caused by the error "Error Getting Deployment Status".
  • Deployment Status Unknown % - The percentage of failed deployments that were caused by the error "Deployment Status Unknown".
  • Invalid Args % - The percentage of failed deployments that were caused by the error "Invalid Args".
  • Old Create API
    • If true, this means that your create blade is using the old deprecated Parameter Collector (Parameter Collector V1 or V2) and therefore your create telemetry is not dependable and is potentially inaccurate.
    • It is recommended to get this field to 'false', by updating your create blade to use the new Parameter Collector (Parameter Collector V3).
  • Custom Deployment
    • If true, this means that your create does not go through the official Arm Provisioner and therefore you only receive limited telemetry reporting.
    • If your experience's create deployments go through Arm but you are not using the Arm Provisioning then please refer to the Create Engine documentation

How these numbers are generated

Create Flow Errors - Kusto Query

let timeSpan = 30d;
let startDate = GetStartDateForLastNDays(timeSpan);
let endDate = GetEndDateForTimeSpanQueries();

let errors = 
(GetClientTelemetryByDateRange(startDate, endDate, false)
| union (GetExtTelemetryByDateRange(startDate, endDate, false))
| where Action == "ProvisioningEnded" and ActionModifier == "Failed" and isnotempty(Name))
| union (_GetArmCreateEvents(startDate, endDate) | where ExecutionStatus in ("Failed", "Cancelled", "BillingError") and isnotempty(Name))
| extend
    Date = bin(PreciseTimeStamp, 1d)
    | join kind = leftouter (
        _GetCreateBladesMapping(startDate, endDate)
        | project Date, Name, 
            ExtensionIdCurrent = ExtensionId, 
            CreateBladeNameCurrent = CreateBladeName, 
            UnsupportedCurrent = Unsupported, 
            CustomDeploymentCurrent = CustomDeployment
      ) on Date, Name 
    // Join again on previous day's mappings to cover case when mapping is not found in current day
    | join kind = leftouter (
        _GetCreateBladesMapping(startDate - 1d, endDate - 1d)
        | project Date, Name, 
            ExtensionIdPrev = ExtensionId, 
            CreateBladeNamePrev = CreateBladeName, 
            UnsupportedPrev = Unsupported, 
            CustomDeploymentPrev = CustomDeployment
        | extend Date = Date + 1d
      ) on Date, Name 
    // Use current day's mapping if available, otherwise, use previous day
    | extend 
        ExtensionId = iff(isnotempty(ExtensionIdCurrent), ExtensionIdCurrent, ExtensionIdPrev),
        CreateBladeName = iff(isnotempty(ExtensionIdCurrent), CreateBladeNameCurrent, CreateBladeNamePrev),
        Unsupported = iff(isnotempty(ExtensionIdCurrent), UnsupportedCurrent, UnsupportedPrev),
        CustomDeployment = iff(isnotempty(ExtensionIdCurrent), CustomDeploymentCurrent, CustomDeploymentPrev)
| where isnotempty(ExtensionId) and isnotempty(CreateBladeName)
| extend ExecutionStatus = iff(Action == "ProvisioningEnded", extractjson("$.provisioningStatus", Data, typeof(string)), ExecutionStatus);

errors
| summarize
    CustomDeploymentErrorsCount = count(Action == "ProvisioningEnded" and ActionModifier == "Failed"),
    // We exclude ProvisioningEnded events with ExecutionStatus in ("DeploymentFailed", "DeploymentCanceled") from ARMDeploymentErrorsCount as in this case, we get the number of deployments that failed or were cancelled from ARM.
    ARMDeploymentErrorsCount = count((Action startswith "CreateDeployment") or (Action == "ProvisioningEnded" and ExecutionStatus !in ("DeploymentFailed", "DeploymentCanceled")))
  by ExtensionId, CreateBladeName, Unsupported, CustomDeployment
| join kind = inner (
    errors
    | summarize
        ARMFailed = count(Action startswith "CreateDeployment" and ExecutionStatus == "Failed"),
        ARMCancelled = count(Action startswith "CreateDeployment" and ExecutionStatus == "Cancelled"),
        ARMBillingError = count(Action startswith "CreateDeployment" and ExecutionStatus == "BillingError"),
        Failed = count(Action == "ProvisioningEnded" and ExecutionStatus == "DeploymentFailed"),
        Cancelled = count(Action == "ProvisioningEnded" and ExecutionStatus == "DeploymentCanceled"),
        ErrorSubmitting = count(Action == "ProvisioningEnded" and ExecutionStatus == "ErrorSubmittingDeploymentRequest"),
        ErrorProvisioning = count(Action == "ProvisioningEnded" and ExecutionStatus == "ErrorProvisioningResourceGroup"),
        ErrorRegistering = count(Action == "ProvisioningEnded" and ExecutionStatus == "ErrorRegisteringResourceProviders"),
        ErrorGettingStatus = count(Action == "ProvisioningEnded" and ExecutionStatus == "ErrorGettingDeploymentStatus"),
        InvalidArgs = count(Action == "ProvisioningEnded" and ExecutionStatus == "InvalidArgs"),
        UnknownFailure = count(Action == "ProvisioningEnded" and ExecutionStatus == "UnknownFailure"),
        RequestFailed = count(Action == "ProvisioningEnded" and ExecutionStatus == "DeploymentRequestFailed"),
        StatusUnknown = count(Action == "ProvisioningEnded" and ExecutionStatus == "DeploymentStatusUnknown")
      by ExtensionId, CreateBladeName, Unsupported, CustomDeployment)
  on ExtensionId, CreateBladeName, Unsupported, CustomDeployment
| extend Failed = iff(CustomDeployment, Failed, ARMFailed)
| extend Cancelled = iff(CustomDeployment, Cancelled, ARMCancelled)
| extend BillingError = iff(CustomDeployment, 0, ARMBillingError)
| extend TotalCount = iff(CustomDeployment, CustomDeploymentErrorsCount, ARMDeploymentErrorsCount) - Cancelled - BillingError
| extend TotalCountDbl = todouble(TotalCount)
| join kind = leftouter (ExtensionLookup | extend ExtensionId = Extension) on ExtensionId
| project
    ["Extension Name"] = ExtensionName,
    ["Create Blade Name"] = CreateBladeName,
    ["Deployment Cancelled Count"] = Cancelled,
    ["Billing Error Count"] = BillingError,
    ["Total Errors Count"] = TotalCount,
    ["Deployment Failed %"] = iff(TotalCountDbl == 0.0, 0.0, Failed / TotalCountDbl),
    ["Error Submitting Deployment Request %"] = iff(TotalCountDbl == 0.0, 0.0, ErrorSubmitting / TotalCountDbl),
    ["Error Provisioning Resource Group %"] = iff(TotalCountDbl == 0.0, 0.0, ErrorProvisioning / TotalCountDbl),
    ["Error Registering Resource Providers %"] = iff(TotalCountDbl == 0.0, 0.0, ErrorRegistering / TotalCountDbl),
    ["Error Getting Deployment Status %"] = iff(TotalCountDbl == 0.0, 0.0, ErrorGettingStatus / TotalCountDbl),
    ["Invalid Args %"] = iff(TotalCountDbl == 0.0, 0.0, InvalidArgs / TotalCountDbl),
    ["Unknown Failure %"] = iff(TotalCountDbl == 0.0, 0.0, UnknownFailure / TotalCountDbl),
    ["Deployment Request Failed %"] = iff(TotalCountDbl == 0.0, 0.0, RequestFailed / TotalCountDbl),
    ["Deployment Status Unknown %"] = iff(TotalCountDbl == 0.0, 0.0, StatusUnknown / TotalCountDbl),
    ["Old Create API"] = Unsupported,
    ["Custom Deployment"] = CustomDeployment

Error Distribution

Overview

This report gives an overview of the errors that have occurred over the last week, including how they have changed since the week before, aka WoW (Week over Week).

Description of reports

  • Error Distribution - The high-level errors that occurred for all deployments in the last week.
  • Error Distribution By Extension - The number of create deployment errors that occurred for each extension over the last week.
  • Inner Error Distribution - Looks at the 'inner most error' inside of the error messages recorded for a create deployment failure. Error messages often become nested as they reach different points of provisioning and have different stages record the failure reason. The 'inner most error' in theory should be the original reason why a create deployment failed.

How these numbers are generated

Create Error Distribution - Kusto Query

let today = floor(now(),1d);
let sat = today - dayofweek(today) - 8d;
let fri =  sat + 6d;
ClientTelemetry
| where PreciseTimeStamp >= sat and PreciseTimeStamp < fri + 1d
| where Action == "ProvisioningEnded" and ActionModifier == "Failed"
| extend provisioningStatus = extractjson("$.provisioningStatus", Data, typeof(string)),
  isCustomProvisioning = extractjson("$.isCustomProvisioning", Data, typeof(string)),
  oldCreateApi = extractjson("$.oldCreateApi", Data, typeof(string)),
  launchingContext = extract('"launchingContext"\\s?:\\s?{([^}]+)', 1, Data)
| where isnotempty(launchingContext) and isempty(extract("^(\"telemetryId\":\"[^\"]*\")$", 1, launchingContext)) and oldCreateApi != "true" and isCustomProvisioning != "true" and provisioningStatus != "DeploymentCanceled"
| where Data !contains "We could not find a credit card on file for your azure subscription." 
| summarize ["Error Count"] = count() by ["Error"] = provisioningStatus
| order by ["Error Count"] desc

Create Error Distribution By Extension - Kusto Query

let today = floor(now(),1d);
let sat = today - dayofweek(today) - 8d;
let fri =  sat + 6d;
ClientTelemetry
| where PreciseTimeStamp >= sat and PreciseTimeStamp < fri + 1d
| where Action == "ProvisioningEnded" and ActionModifier == "Failed"
| extend provisioningStatus = extractjson("$.provisioningStatus", Data, typeof(string)),
  isCustomProvisioning = extractjson("$.isCustomProvisioning", Data, typeof(string)),
  oldCreateApi = extractjson("$.oldCreateApi", Data, typeof(string)),
  launchingContext = extract('"launchingContext"\\s?:\\s?{([^}]+)', 1, Data)
| where isnotempty(launchingContext) and isempty(extract("^(\"telemetryId\":\"[^\"]*\")$", 1, launchingContext)) and oldCreateApi != "true" and isCustomProvisioning != "true" and provisioningStatus != "DeploymentCanceled"
| where Data !contains "We could not find a credit card on file for your azure subscription." 
| summarize ["Error Count"] = count() by Extension
| order by ["Error Count"] desc

Create Error Distribution By Error Code - Kusto Query

let today = floor(now(),1d);
let sat = today - dayofweek(today) - 8d;
let fri =  sat + 6d;
ClientTelemetry
| where PreciseTimeStamp >= sat and PreciseTimeStamp < fri + 1d
| where Action == "ProvisioningEnded" and ActionModifier == "Failed"
| extend provisioningStatus = extractjson("$.provisioningStatus", Data, typeof(string)),
  isCustomProvisioning = extractjson("$.isCustomProvisioning", Data, typeof(string)),
  oldCreateApi = extractjson("$.oldCreateApi", Data, typeof(string)),
  eCode = extractjson("$.details.code", Data, typeof(string)),
  launchingContext = extract('"launchingContext"\\s?:\\s?{([^}]+)', 1, Data)
| where isnotempty(launchingContext) and isempty(extract("^(\"telemetryId\":\"[^\"]*\")$", 1, launchingContext)) and oldCreateApi != "true"
| where provisioningStatus != "DeploymentFailed" and provisioningStatus != "DeploymentCanceled" and isCustomProvisioning != "true"
| where Data !contains "We could not find a credit card on file for your azure subscription." 
| summarize ["Error Count"] = count() by ["Error Code"] = eCode
| order by ["Error Count"] desc 

Create Inner Most Error Distribution - Kusto Query

let today = floor(now(),1d);
let sat = today - dayofweek(today) - 8d;
let fri =  sat + 6d;
ClientTelemetry
| where PreciseTimeStamp >= sat and PreciseTimeStamp < fri+1d
| where Action == "ProvisioningEnded" and ActionModifier == "Failed"
| join kind = inner (ExtensionLookup | project Extension, ExtensionName) on Extension
| where Data !contains "We could not find a credit card on file for your azure subscription."
| extend datajson = parsejson(Data)
| extend provisioningStatus = tostring(datajson.provisioningStatus), isCustomProvisioning = tostring(datajson.isCustomProvisioning), oldCreateApi = tostring(datajson.oldCreateApi), launchingContext = extract('"launchingContext"\\s?:\\s?{([^}]+)', 1, Data)
| where isnotempty(launchingContext) and isempty(extract("^(\"telemetryId\":\"[^\"]*\")$", 1, launchingContext)) and oldCreateApi != "true" and isCustomProvisioning != "true" and provisioningStatus != "DeploymentCanceled"
| extend code1 = tostring(datajson.details.code), statusCode = tostring(datajson.details.deploymentStatusCode), details = datajson.details.properties.error.details[0]
| extend message= tostring(details.message), code2 = tostring(details.code)
| extend messagejson = parsejson(message)
| extend code3 = tostring(messagejson.code), code4 = tostring(messagejson.error.code), code5= tostring(messagejson.error.details[0].code)
| extend errorCode1 = iff(code5 == "", code4, code5)
| extend errorCode2 = iff(errorCode1 == "", code3, errorCode1)
| extend errorCode3 = iff(errorCode2 == "", code2, errorCode2)
| extend errorCode4 = iff(errorCode3 == "", code1, errorCode3)
| extend errorCode5 = iff(errorCode4 == "", statusCode, errorCode4)
| summarize ["ErrorCount"] = count() by ["Extension"] = ExtensionName, errorCode5
| order by ErrorCount desc

Next Steps:

Create Troubleshooting

Overview

Creates are when a user tries to provision a resource using the portal. The goal of the Create Flow Regressions alert is to generate awareness when our create reliability seems to be degrading. This can happen for a number of reasons; this alert does not attempt to distinguish the reasons why.

The alert fires any time the success rate drops more than 5% below the bar on average over an hour. MDM will send an alert each time this happens. The first thing to do is take a look at MDM by selecting the link at the bottom of the ICM; this will show a trend of how long the alert has been active and to what degree.

The numbers are the percentage of regression. For example, if the latest value is 10, it means the success rate has regressed by 10% below the bar. If it seems to be trending up, then this is a much bigger concern than one that spiked and then went down.

This bar is set on a blade-by-blade basis and can be adjusted as needed.

Types of Create Failures

There are three types of create failures:

  1. The create was successfully sent to ARM, but ARM eventually reported Failure rather than Success or Cancel
    • Billing errors such as no credit card are considered canceled creates rather than failures
  2. The create request was not accepted by ARM for any reason
  3. This is a custom create where the ProvisioningEnded is either missing or reports an error

Debugging Alerts

Follow the below documentation to understand and debug your create regressions that caused the alert.

Alert Regression Error Count

  If you want to see what errors are making up your regression percentage (over the last 24 hours ending at the datetime provided) and how many times those errors are occurring, the following query will give you the breakdown you are looking for (using websites as an example):

  GetCreateRegressionErrorCount(now(),"websitesextension","webhostingplancreateblade") [Run in Kusto.Explorer] [Run in Kusto.WebExplorer]

  This function is best used when trying to identify the main error that is causing your regression numbers to increase.

  Input Parameters:

  • End time – 24 hours ending at this end time will be the time span which is scanned for errors. Time range: [end time – 24 hours, end time]
  • Extension – the extension you are looking into
  • CreateBladeName – the name of the create blade which the errors occurred on

Output Result Columns: 

  • extension – the extension specified 
  • CreateBladeName – the create blade name specified 
  • ErrorCode – the error code that specifies the type of error that occurred 
  • Hits – the number of times this error occurred 
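
For instance, a sketch that orders the output by the Hits column to surface the dominant error first (same example extension and blade as above):

```
// Most frequent create regression errors for the websites create blade, last 24 hours.
GetCreateRegressionErrorCount(now(), "websitesextension", "webhostingplancreateblade")
| order by Hits desc
```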

Alert regression details

When things go wrong you will need to drill down. Once you have used GetCreateRegressionErrorCount to understand the main errors that are causing your regression numbers (over the last 24 hours ending at the datetime provided) to spike, you will need to understand what caused them. The following query shows all of the failed creates with their error messages for a specific extension and blade (using websites as an example):

GetCreateRegressionDetails(now(),"websitesextension","webhostingplancreateblade") [Run in Kusto.Explorer] [Run in Kusto.WebExplorer]

Input Parameters: 

  • End time – 24 hours ending at this end time will be the time span which is scanned for errors. Time range: [end time – 24 hours, end time]. 
  • Extension – The extension you are looking into. 
  • CreateBladeName – The name of the create blade which the errors occurred on. 

Output Result Columns: 

  • extension – The name of the extension. 
  • name – The name of the resource attempted to be created. 
  • CreateBladeName – The name of the create blade from which the create flow originated.  
  • status – The resulting status of the create. Regressions are represented only by failed creates, so this should always be marked as "Failed". 
  • MessageCode - In the case of a Failed status create flow, this typically is the name of the error which occurred. We try to always fill this information in for you, but if it is blank then you may have to go digging for this information in the data field. More information regarding this below. 
  • Message – In the case of a Failed status create flow, this typically is the resulting message of the error which occurred which provides context as to why the create flow was a failure.  We try to always fill this information in for you, but if it is blank then you may have to go digging for this information in the data field. More information regarding this below. 
  • StartTime – When the create was initiated. 
  • EndTime – When the create completed. If the EndTime is the same as the StartTime then the create failed to be initiated correctly or the information regarding its completion is lost. 
  • Duration – The length in time of the create from start to finish. If the duration is 00:00:00 then the create failed to be initiated correctly or the information regarding its completion is lost. 
  • telemetryId – The id which is used to identify the creates events which make up a create flow. 
  • userId – ID which represents the user that initiated the create. 
  • sessionId – ID which represents the sessions in which the create was initiated. 
  • CustomDeployment – Boolean representing if the create is a custom deployment and therefore was not initiated through the ARM provisioner.  
  • data – Contains all of the in-depth information regarding the different stages of the create flow  

The most interesting field is the data field. It contains JSON describing the details of the create. Understanding the data field is crucial to debugging everything from simple to complicated regression issues.

The data field:

To understand how the data field is created, one must understand the life cycle of the create flow. This process is slightly different for a standard deployment through ARM vs a custom deployment (one that does not use the ARM provisioner). 

  1. When a create is initiated, a ProvisioningStarted event is logged.
  2. Once the request for that deployment is received and acknowledged by ARM, a CreateDeploymentStart event is logged. (not logged for custom deployments)
  3. When the status of the completion of that deployment is available, a CreateDeploymentEnd event is logged. (not logged for custom deployments)
  4. Once the deployment is finished and the Portal has finished the create process, a ProvisioningEnded event is logged.

The data field contains all of the data from each of these logged create events (if available) to give you the information from each stage of the lifecycle. Each of these is represented by 3 main fields: 

  • action – The action logged (ProvisioningStarted, CreateDeploymentStart, CreateDeploymentEnd, ProvisioningEnded) 

  • actionModifier – The context in which the action was being logged. The available combinations of action and actionModifier are:

  | action | actionModifier |
  | --- | --- |
  | ProvisioningStarted | mark |
  | CreateDeploymentStart | Failed |
  | CreateDeploymentStart | Succeeded |
  | CreateDeploymentEnd | Canceled |
  | CreateDeploymentEnd | Failed |
  | CreateDeploymentEnd | Succeeded |
  | ProvisioningEnded | Failed |
  | ProvisioningEnded | Succeeded |
  • data – the data field for this particular create event which makes up part of the greater create flow

Digging for the error MessageCode field and Message field in the data field

So, we were either unable to provide you with the correct error code or message, or you are looking for more context and information. The way to go about this is to start digging into the data field.  

  1. Locate the last create event with data available inside of the data field. This is typically the ProvisioningEnded event, but if that is not available then use the CreateDeploymentEnd event. If neither of these are available, then the information has been lost for an unknown reason and it isn't available at all. 
  2. Look into the data field of the event until you find the details field  
  3. The details field should contain a hierarchy of error codes and error messages. The innermost error code or message should be the underlying cause of the deployment failure. 

All Creates

When looking for patterns it is sometimes better to see the good with the bad. The following query returns a single row for each create:

GetCreatesByDateRange(ago(1d),now()) [Run in Kusto.Explorer] [Run in Kusto.WebExplorer]

The results include:

  • Extension
  • Name - name of the asset type
  • CreateBladeName
  • Status - has one of the following values
    • Succeeded
    • Failed
    • Unknown
    • Canceled - (billing errors are included here)
  • telemetryId - unique ID for the deployment
  • CustomDeployment - if not an ARM deployment this is true

All Creates With Additional Details

To query with more details, use the following query:

GetCreateDetailsByDateRange(ago(1d),now()) [Run in Kusto.Explorer] [Run in Kusto.WebExplorer]

This adds the following properties, with multiple rows per telemetryId (each telemetryId == 1 create):

  • userId
  • sessionId
  • action
  • actionModifier
  • Data - this has a JSON string that contains most of the information needed for debugging

This function is best used when trying to identify the main error that is causing your regression numbers to increase.

Input Parameters:

  • End time – 24 hours ending at this end time will be the time span which is scanned for errors. Time range: [end time – 24 hours, end time]
  • Extension – the extension you are looking into

Output Result Columns:

  • Extension – the extension specified
  • CreateBladeName – the create blade name specified
  • ErrorCode – the overall error code that specifies the type of error that occurred
  • Hits – the number of times this error occurred

Alert query

The alert itself is driven from the following query:

CreateFlowRegressions(now()) [Run in Kusto.Explorer] [Run in Kusto.WebExplorer]

This has strangely named columns that are required by MDM, but essentially it tracks success percentage over the last 24 hours versus the success bar:

  • d_ExtensionName
  • d_CreateBladeName
  • m_CreateRegressionPercent - percentage of regression below the bar
  • m_CreateRegressionCount - number of creates over the last 24 hours
  • timestamp

The alert is generated any time the regression is more than 5% from the bar.

Alert bar

The bar is a value we've captured based on current performance. This should be raised over time as the create becomes more reliable. PMs from the portal team will help you remember that this is needed.

To see the current bar settings use the following query:

_CreateFlowRegressionOverrides() [Run in Kusto.Explorer] [Run in Kusto.WebExplorer]

  • Extension
  • CreateBladeName
  • Ignore - if true this extension is excluded from alerting
  • Bar - this is the success percentage expected
  • NormalizedCount - not used
  • Reason - notes about why the bar was set

Alert summaries base

The alert is very specific as per the rules of MDM and does not provide any context. To see the state of creates more clearly try the following query:

_CreateFlowRegressionsBase(now(),24h,50) [Run in Kusto.Explorer] [Run in Kusto.WebExplorer]

The parameters are the start time, number of hours to check, minimum number of creates required. The parameters shown are what drives the alert query. Using this and adding a filter for your extension will give you a pretty clear idea of the current state.

This query gives:

  • EndTime
  • Extension
  • CreateBladeName
  • Count
  • SuccessRate
  • SuccessBar
  • Regression

The simple version of this takes an extension name parameter and automatically filters to the necessary section. For example, for websites the query would be:

GetCreateRegressionExtSummary(now(),"websitesextension") [Run in Kusto.Explorer] [Run in Kusto.WebExplorer]

Performance

## Overview

Portal performance, from a customer's perspective, is seen across all experiences throughout the product. As an extension author you have a duty to keep your experience at or above the performance bar.

| Area | Sub-Area | 80th Percentile Bar | Telemetry Action | How is it measured? |
| --- | --- | --- | --- | --- |
| Blade | Revealed | See Power BI | BladeRevealed | Time it takes for the blade's OnInputsSet to resolve and all the parts on the blade and above the fold to reveal |
| Blade | FullRevealed | N/A | BladeFullRevealed | Same as Revealed, but all the parts on the blade reveal |
| Part | Revealed | See Power BI | PartRevealed | Time it takes for the part to be rendered and then the part's OnInputsSet to resolve or earlyReveal to be called |
| WxP | N/A | See Power BI | N/A | An overall experience score, calculated by weighting blade usage and the blade revealed time |

Extension performance

Extension performance affects both Blade and Part performance, because your extension is loaded and unloaded as and when it is required. When a user visits your resource blade for the first time, the Fx loads your extension and then requests the view model, so your Blade/Part performance is affected. If the user browses away from your experience and back again before your extension is unloaded, the second visit is obviously faster, since they don't pay the cost of loading the extension again.

Blade performance

Blade performance is spread across a couple of main areas:

  1. Blade's constructor
  2. Blade's OnInputsSet
  3. Any Parts within the Blade become revealed

All of which are encapsulated under the one BladeRevealed action.
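
As a rough illustration (the data context and cache names below are hypothetical), BladeRevealed effectively measures from the blade's construction until the promise returned from OnInputsSet resolves and the above-the-fold parts reveal, so slow work in either place shows up directly in the number:

```typescript
export class MyResourceBladeViewModel extends MsPortalFx.ViewModels.Blade {
    private _view: MsPortalFx.Data.EntityView<any, string>;

    constructor(container: MsPortalFx.ViewModels.ContainerContract, initialState: any, dataContext: any) {
        super();
        this.title("My resource");
        // Keep the constructor cheap: wire up views and observables only, no network calls.
        // "myResourceCache" is a hypothetical EntityCache on the area's data context.
        this._view = dataContext.myResourceCache.createView(container);
    }

    public onInputsSet(inputs: { id: string }): MsPortalFx.Base.Promise {
        // BladeRevealed is not logged until this promise resolves (and the parts above
        // the fold reveal), so avoid chaining unnecessary work onto it.
        return this._view.fetch(inputs.id);
    }
}
```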

Part performance

Similar to Blade performance, Part performance is spread across a couple of areas:

  1. Part's constructor
  2. Part's OnInputsSet

All of which are encapsulated under the one PartRevealed action.

WxP score

The WxP score is a per-extension Weighted eXPerience (WxP) score. It is calculated as follows:

WxP = (BladeViewsMeetingTheBar * 80thPercentileBar) / ((BladeViewsMeetingTheBar * 80thPercentileBar) + ∑(BladeViewsNotMeetingTheBar * ActualLoadTimePerBlade))

In the following example the 80th percentile bar is taken to be 4 seconds, so only Blade A meets it:

| Blade | 80th Percentile Times | Usage Count | Meets 80th Percentile Bar? |
| --- | --- | --- | --- |
| Blade A | 1.2 | 1000 | Yes |
| Blade B | 5 | 500 | No |
| Blade C | 6 | 400 | No |

WxP = (BladeViewsMeetingTheBar * 80thPercentileBar) / ((BladeViewsMeetingTheBar * 80thPercentileBar) + ∑(BladeViewsNotMeetingTheBar * ActualLoadTimePerBlade))
    = (1000 * 4) / ((1000 * 4) + ((500 * 5) + (400 * 6)))
    = 4000 / (4000 + 2500 + 2400)
    = 44.94%

As you can see, the model penalizes blades that miss the bar in proportion both to how often they are viewed and to how slow they actually are.

How to assess your performance

There are two methods to assess your performance:

  1. Visit the IbizaFx provided PowerBi report*

  2. Run Kusto queries locally to determine your numbers

    (*) To get access to the PowerBi dashboard reference the Telemetry onboarding guide, then access the following Extension performance/reliability report

The first method is definitely the easiest way to determine your current assessment as this is maintained on a regular basis by the Fx team. You can, if preferred, run queries locally but ensure you are using the Fx provided Kusto functions to calculate your assessment.

Performance Checklist

Performance Frequently Asked Questions (FAQ)

My Blade 'Revealed' is above the bar, what should I do

  1. Assess what is happening in your Blade's constructor and OnInputsSet.
  2. Can that be optimized?
  3. If there are any AJAX calls, wrap them with custom telemetry and ensure you aren't spending a large amount of time waiting on the result (see the sketch after this list).
  4. Check the revealed times of the Parts on the Blade using the Extension performance/reliability report; select your Extension and Blade in the filters.
  5. How many parts are on the blade?
    • If there's only a single part and you're not using a <TemplateBlade>, migrate your current blade over.
    • If there's a high number of parts (> 3), consider removing some of them.
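
The following is a rough sketch of wrapping an AJAX call with timing telemetry. The uri and action name are made up, MsPortalFx.Base.Net.ajax is the standard data-call API, and the exact shape of the Telemetry.trace payload should be verified against the Extension Telemetry logging guidance for your SDK version:

```typescript
function loadWebsitesWithTiming(): MsPortalFx.Base.Promise {
    var start = Date.now();
    return MsPortalFx.Base.Net.ajax({ uri: "/api/websites" }).then((websites: any) => {
        // Emit a custom event so the call's duration can be queried from ExtTelemetry.
        // Field names below follow the telemetry logging docs; confirm before shipping.
        MsPortalFx.Base.Diagnostics.Telemetry.trace({
            source: "WebsitesBlade",
            action: "LoadWebsitesAjax",
            data: { durationMs: Date.now() - start }
        });
        return websites;
    });
}
```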

My Part 'Revealed' is above the bar, what should I do

  1. Assess what is happening in your Part's constructor and OnInputsSet.
  2. Can that be optimized?
  3. If there are any AJAX calls, wrap them with custom telemetry and ensure you aren't spending a large amount of time waiting on the result.
  4. Do you have partial data before the OnInputsSet is fully resolved? If yes, you can reveal early, display the partial data, and handle loading UI for the individual components (see the sketch after this list).
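
A rough sketch of the early-reveal pattern follows. The member names and views are hypothetical; the container's revealContent() call is what surfaces as earlyReveal in telemetry:

```typescript
class QuotasPartViewModel {
    // Hypothetical members, assumed to be wired up in the constructor.
    private _container: MsPortalFx.ViewModels.PartContainerContract;
    private _essentialsView: MsPortalFx.Data.EntityView<any, string>;
    private _detailsView: MsPortalFx.Data.EntityView<any, string>;

    public onInputsSet(inputs: { id: string }): MsPortalFx.Base.Promise {
        // Reveal as soon as the cheap, essential data is ready; this removes the
        // blocking loading indicator and is logged as earlyReveal.
        this._essentialsView.fetch(inputs.id).then(() => {
            this._container.revealContent();
        });
        // The returned promise still tracks the full load; individual components
        // show their own loading UI until it resolves.
        return this._detailsView.fetch(inputs.id);
    }
}
```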

My WxP score is below the bar, what should I do

Using the Extension performance/reliability report you can see the WxP impact of each individual blade. Given the WxP calculation, if you are drastically under the bar it's likely that a high-usage blade is not meeting the performance bar; if you are just under the bar, it's likely that a low-usage blade is not meeting it.

Performance best practices

Writing fast extensions

When writing extensions, there are a few patterns you can follow to make sure you're getting the most performance out of the browser and the portal.

Use AMD

In the early days of the Azure Portal SDK, it was common to write extensions that bundled all scripts into a single file at compilation time. This generally happens if you use reference statements in your classes:

[DEPRECATED SYNTAX]
/// <reference path="../TypeReferences.d.ts" />
/// <reference path="WebsiteSelection.ts" />
/// <reference path="../Models/Website.ts" />
/// <reference path="../DataContexts/DataContexts.ts" />

module RemoteExtension {
    export class WebsitesBladeViewModel extends MsPortalFx.ViewModels.Blade {
    ...
    }
}

In the example above, modules are imported using <reference> elements. This combines those scripts into a single script file at compile time, leading to a relatively large file which needs to be downloaded every time your extension projects UI.

Since that time, we've introduced support for using Asynchronous Module Definition (AMD). Instead of bundling all scripts into a single monolithic file, the portal is now capable of downloading only the files needed to project the current UI onto the screen. This makes it faster to unload and reload an extension, and provides for generally better performance in the browser. In this case, by using AMD, the following files will only be loaded at runtime as they're required (instead of one large bundle):

[CORRECT SYNTAX]
import SecurityArea = require("../SecurityArea");
import ClientResources = require("ClientResources");
import Svg = require("../../_generated/Svg");

export class BladeViewModel extends MsPortalFx.ViewModels.Blade {
    ...
}

This leads to faster load time, and less memory consumption in the browser. You can learn more about the TypeScript module loading system in the official language specification.

Use QueryCache and EntityCache

When performing data access from your view models, it may be tempting to make data calls directly from the onInputsSet function. By using the QueryCache and EntityCache, you can control access to data through a single component. A single ref-counted cache can hold data across your entire extension. This has the benefits of:

  • Reduced memory consumption
  • Lazy loading of data
  • Fewer calls out to the network
  • Consistent UX for views over the same data.

Developers should use QueryCache and EntityCache for data access. These classes provide advanced caching and ref-counting. Internally, these make use of Data.Loader and Data.DataSet (which will be made FX-internal in the future).
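
As a rough sketch (the model shape, entity type name, and endpoint are hypothetical), the usual pattern is to declare the cache once on the area's DataContext and have each blade or part create a view over it, so views of the same data share a single fetch:

```typescript
interface RobotModel {
    name: KnockoutObservable<string>;   // hypothetical model shape
}

export class DataContext {
    // Created once per extension load and shared by every blade/part in the area.
    public robotsQuery = new MsPortalFx.Data.QueryCache<RobotModel, any>({
        entityTypeName: "Robot",
        sourceUri: MsPortalFx.Data.uriFormatter("/api/robots")   // hypothetical endpoint
    });
}

// In a blade or part view model:
//   this._robotsView = dataContext.robotsQuery.createView(container);
//   public onInputsSet(inputs: any): MsPortalFx.Base.Promise {
//       // Views over the same QueryCache share one network call and one copy of the data.
//       return this._robotsView.fetch({});
//   }
```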

To learn more, visit Querying for data.

Avoid unnecessary data reloading

As users navigate through the Ibiza UX, they will frequently revisit often-used resources within a short period of time. They might visit their favorite Website Blade, browse to see their Subscription details, and then return to configure/monitor their favorite Website. In such scenarios, ideally, the user would not have to wait through loading indicators while Website data reloads.

To optimize for this scenario, use the extendEntryLifetimes option common to QueryCache and EntityCache.

public websitesQuery = new MsPortalFx.Data.QueryCache<SamplesExtension.DataModels.WebsiteModel, any>({
    entityTypeName: SamplesExtension.DataModels.WebsiteModelType,
    sourceUri: MsPortalFx.Data.uriFormatter(Shared.websitesControllerUri),
    supplyData: (method, uri, headers, data) => {
        // ...
    },
    extendEntryLifetimes: true
});

QueryCache/EntityCache contain numerous cache entries, each of which are ref-counted based on not-disposed instances of QueryView/EntityView. When a user closes a Blade, typically a cache entry in the corresponding QueryCache/EntityCache will be removed, since all QueryView/EntityView instances will have been disposed. In the scenario where the user revisits their Website Blade, the corresponding cache entry will have to be reloaded via an ajax call, and the user will be subjected to loading indicators on the Blade and its Parts.

With extendEntryLifetimes, unreferenced cache entries will be retained for some amount of time, so when a corresponding Blade is reopened, data for the Blade and its Parts will already be loaded and cached. Here, calls to this._view.fetch() from a Blade or Part view model will return a resolved Promise, and the user will not see long-running loading indicators.

(Note - The time that unreferenced cache entries are retained in QueryCache/EntityCache is controlled centrally by the FX and the timeout will be tuned based on telemetry to maximize cache efficiency across extensions.)

For your scenario to make use of extendEntryLifetimes, it is very important that you take steps to keep your client-side QueryCache/EntityCache data caches consistent with server data. See Reflecting server data changes on the client for details.

Use paging for large data sets

When working with a large data set, extension authors should use the paging features of the grid. Paging allows deferred loading of rows, so even with a large dataset responsiveness can be maintained. Additionally it means many rows might not need to be loaded at all. To learn more about paging with grids, you can check out the samples:

\Client\Controls\Grid\ViewModels\PageableGridViewModel.ts

Use "map" and "filter" to reduce size of rendered data

Often, it is useful to use the Knockout projections to shape and filter model data loaded using QueryView and EntityView (see Shaping and filtering data).

Significant performance improvements can be achieved by reducing the number and size of the model objects bound to controls like grids, lists, and charts:

`\Client\Controls\Grid\ViewModels\SelectableGridViewModel.ts`
// Wire up the contents of the grid to the data view.
this._view = dataContext.personData.peopleQuery.createView(container);
var projectedItems = this._view.items
    .filter((person: SamplesExtension.DataModels.Person) => { return person.smartPhone() === "Lumia 520"; })
    .map((person: SamplesExtension.DataModels.Person) => {
        return <MappedPerson>{
            name: person.name,
            ssnId: person.ssnId
        };
    });

var personItems = ko.observableArray<MappedPerson>([]);
container.registerForDispose(projectedItems.subscribe(personItems));

In this example, map is used to project new model objects containing only those properties required to fill the columns of the grid. Additionally, filter is used to reduce the size of the array to just those items that will be rendered as grid rows.

Benefits to UI-rendering performance

Using the selectable grid SDK sample, we can see the benefits of using map to project objects with only those properties required by a grid row:

![Using knockout projections to map an array][mapping]

[mapping]: ../media/portalfx-performance/mapping.png

There is almost a 50% reduction in time with these optimizations, but also note that at 300 items it is still over 1.5s. Mapping to just the 2 columns in that selectable grid sample reduces the message size by 2/3 by using the technique described above. This reduces the time needed to transfer the view model as well as reducing memory usage.

Configuring CDN

Using the CDN

Extension authors may choose to use a CDN to serve static images, scripts, and stylesheets. The Azure Portal SDK does not require the use of a CDN, or the use of a particular CDN. However, extensions served from Azure can take advantage of the built-in CDN capabilities in the SDK.

Creating the CDN account

Follow this guide to set up your CDN account:

http://www.windowsazure.com/en-us/documentation/articles/cdn-how-to-use/

Configuring your CDN service

After creating your CDN, there are a few options that need to be set.

  • Make sure HTTP and HTTPS are enabled by clicking the "Enable HTTPS" command.
  • Make sure query string status is enabled by clicking the "Enable Query String" command.

Configuring your extension

To take advantage of the CDN capabilities in the Portal SDK, there are a few pieces that must be configured.

Configuring the Prefix

After setting up your CDN, you will receive a url which can be used to access your content. It will be in the form:

//<MyCDNNamespace>.vo.msecnd.net/

This is the prefix for your CDN service. Your production service should be configured to use this prefix. In your local web.config, you can set this with the following appSetting:

<add key="Microsoft.Portal.Extensions.SamplesExtension.ApplicationConfiguration.CdnPrefix" 
     value="//<MyCDNNamespace>.vo.msecnd.net/" />

Notice that neither http nor https is used in the url. This is important. It allows your page to request content based on the current protocol of the request. Oftentimes, this setting will be blank in web.config, and instead configured in a cscfg for a cloud service.

Reading the prefix from configuration

To read any FX configuration, you must have a class which inherits from ApplicationContext. This class needs to include a CdnPrefix property:

\SamplesExtension\Configuration\CustomApplicationContext.cs
[Export(typeof(ApplicationContext))]
internal class CustomApplicationContext : ApplicationContext
{
    private ApplicationConfiguration configuration;

    [ImportingConstructor]
    public CustomApplicationContext(ApplicationConfiguration configuration)
    {
        this.configuration = configuration;
    }

    public override bool IsDevelopmentMode
    {
        get
        {
            return this.configuration.IsDevelopmentMode;
        }
    }

    public override string CdnPrefix
    {
        get
        {
            return this.configuration.CdnPrefix;
        }
    }
}

This class will assign properties which are available in your web.config or *.cscfg. To read the values from those files, create a C# class which inherits from ConfigurationSettings and exports ApplicationConfiguration:

\SamplesExtension\Configuration\ApplicationConfiguration.cs
[Export(typeof(ApplicationConfiguration))]
public class ApplicationConfiguration : ConfigurationSettings
{
    /// <summary>
    /// Gets a value indicating whether development mode is enabled.
    /// Development mode turns minification off.
    /// </summary>
    /// <remarks>
    /// Development mode turns minification off.
    /// It also disables any caching that may be happening.
    /// </remarks>
    [ConfigurationSetting]
    public bool IsDevelopmentMode
    {
        get;
        private set;
    }

    /// <summary>
    /// Gets a value indicating a custom location where browser should 
	/// find cache-able content (rather than from the application itself).
    /// </summary>
    [ConfigurationSetting]
    public string CdnPrefix
    {
        get;
        private set;
    }
}

IIS / ASP.NET Configuration

Files are populated on the CDN using an origin-pull process: the first time a piece of content is requested, the CDN fetches it from your extension and caches it.

To enable this workflow, the CDN must be able to make an HTTP request to your extension. This would normally not be an issue, but some CDNs will make an HTTP 1.0 request. HTTP 1.0 technically does not support gzip/deflated content, so IIS does not enable compression by default. To turn this on, the noCompressionForHttp10 setting in <httpCompression> must be set to false:

http://www.iis.net/configreference/system.webserver/httpcompression
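
As a minimal sketch of that setting (exact placement depends on your hosting model; the httpCompression section is often locked at the server level, so it may need to be applied in applicationHost.config rather than web.config):

```xml
<system.webServer>
  <!-- Allow compressed responses to HTTP 1.0 requests, e.g. from some CDNs. -->
  <httpCompression noCompressionForHttp10="false" />
</system.webServer>
```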

The url used for the request is in the following form:

http://myextension.cloudapp.net/CDN/Content/jquery.js

The /CDN/ portion of this url is inserted after the host address, and before the rest of the route for requested content. The request handling code in the SDK automatically handles incoming requests of the form /CDN/Content/... and /Content/...

Invalidating content on the CDN

When you release, to ensure that users are served the latest static content rather than stale content, you need to configure versioning.

Configuring versioning of your Extension

Updating extensions

The portal shell relies on the environment version for making runtime decisions, e.g.:

  • invalidating cached manifests
  • invalidating static content served indirectly via CDN or directly from your extension

By default this value is populated based on the version attributes present in the extension assembly. First the runtime tries to find the AssemblyInformationalVersionAttribute attribute and get the value from there. If this attribute isn't defined in the assembly, the runtime searches for the AssemblyFileVersion attribute and gets the value from this attribute. You can check the version of your extensions by typing in window.fx.environment.version in the browser console from the extension frame.
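
For example, from the extension's IFrame in the browser developer tools:

```typescript
// Run in the browser console from the extension frame.
window.fx.environment.version;   // returns the extension's version string
```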

You should ensure that while building your extension assembly, the version number is correctly stamped and updated on every build. The assembly version is added to your assembly by specifying the assembly level attribute as shown below.

[assembly: System.Reflection.AssemblyFileVersion("5.0.0.56")]
[assembly: System.Reflection.AssemblyInformationalVersionAttribute("5.0.0.56 (COMPUTER.150701-1627)")]

You can also override this behavior by deriving from the ApplicationContext class and MEF-exporting the derived class as [Export(typeof(ApplicationContext))] and overriding the getter for the Version property on the class. If you do this, please ensure that the overridden getter returns a constant value for a specific build.

See also: AssemblyVersionAttribute, AssemblyInformationalVersionAttribute, and (Azure internal teams only) OneBranch versioning.

Once configured, content will be served directly from your extension, or via CDN if configured, using a versioned URL segment under /Content/, e.g. /Content/5.0.0.56/Scripts, /Content/5.0.0.56/Images.

Implications of changing the version

You should not introduce breaking changes in your server code (e.g. incompatibility between client and server code). Instead, leave a compatible version of the old code on the server for a few days, monitor its usage to ensure that customers/browsers are no longer accessing it (i.e. all users have switched to the newer version of your code, likely by refreshing the portal), and then delete the code. This can easily be accomplished by making new controllers/methods instead of making breaking changes to existing ones. If you do end up in a situation where you make a breaking change, users will likely see a broken experience until they reload the portal. You will need to contact the portal team in order to find a way to get past this issue.

FAQ

  • I am not seeing paths w/ versioning during debug.
    • Ensure IsDevelopmentMode in your *.config is set to false

Extension HomePage Caching

Server side caching of extension home pages

With the latest version of the SDK (5.0.302.85 or later) extension home pages can be cached (to different levels). This should help get slightly better load times, especially for clients with high network latency. Below are two example URLs from the portal running in production:

https://yourextension.contoso.com/
    ?extensionName=Your_Extension
    &shellVersion=5.0.302.85%20(production%23444e261.150819-1426)
    &traceStr=
    &sessionId=ece19d8501fb4d2cbe10db84b844c55b
    &l=en.en-us
    &trustedAuthority=portal.azure.com%3A
    #ece19d8501fb4d2cbe10db84b844c55b

You will notice that for the extension, the sessionId is passed in the query string part of the URL. This makes the extension essentially un-cacheable (because even if it was, we would generate a unique URL on each access essentially busting any cache – browser or server). If you enable server side caching of extension home pages, the URL will become:

https://yourextension.contoso.com/
    ?extensionName=Your_Extension
    &shellVersion=5.0.302.85%20(production%23444e261.150819-1426)
    &traceStr=
    &l=en.en-us
    &trustedAuthority=portal.azure.com%3A
    #ece19d8501fb4d2cbe10db84b844c55b

Notice that the sessionId is no longer present in the query string (only in the fragment). This allows the extension server to serve up the same version of the page to a returning browser (HTTP 304).

You need to do some work to enable caching on your extension server.

  1. There is a property Cacheability on your implementation of the Microsoft.Portal.Framework.ExtensionDefinition class.

  2. By default its value is ExtensionIFrameCacheability.None

  3. At the very least you should be able to set it to ExtensionIFrameCacheability.Server

Making this change assumes that you do not change the way your home page is rendered dynamically (different output for different requests). It assumes that if you do change the output, you only do so by also incrementing the value of Microsoft.Portal.Framework.ApplicationContext.Version. Note: In this mode, if you make live updates to your extension without bumping the version, some chunk of your customers may not see those for a while because of caching.

Client-side caching of extension home pages

The above version of the feature only enables server side caching. But there could be even more benefits if we could somehow cache on the client (avoid the network call altogether).

So we have added support for caching extension home pages in the browser itself. This can allow your extension to load with ZERO network calls from the browser (for a returning user). We believe that this should give us further performance and reliability improvements (fewer network calls => fewer network related errors).

To enable this, here are the steps you need to take:

  1. Move to a version of the SDK newer than 5.0.302.121.

  2. Implement persistent caching of your scripts. You should do this anyway to improve extension reliability. If you do not, you will see a higher impact on reliability as a result of home page caching.

  3. Ensure that your implementation of Microsoft.Portal.Framework.ApplicationContext.GetPageVersion() returns a stable value per build of your extension. We implement this for your extension by default by using the version of your assembly. If this value changes between different servers of the same deployment, the home page caching will not be very effective. Also, if this value does not change between updates to your extension, then existing users will continue to load the previous version of your extension even after you update.

  4. In your implementation of Microsoft.Portal.Framework.ExtensionDefinition update this property:

    public override ExtensionIFrameCacheability Cacheability
    {
        get
        {
            return ExtensionIFrameCacheability.Manifest;
        }
    }
  5. Contact the Portal team or submit a Work Item Request so we can update the value from our side.
    Sorry about this step. We added it to ensure backward compatibility. When all extensions have moved to newer SDKs, we can eliminate it.

Implications of client side caching

  1. An implication of this change is that when you roll out an update to your extension, it might take a couple of hours for it to reach all your customers. But the reality is that this can occur even with the existing deployment process. If a user has already loaded your extension in their session, they will not really get your newer version till they F5 anyway. So extension caching adds a little more latency to this process.

  2. If you use this mechanism, you cannot use extension versioning to roll out breaking changes to your extension. Instead, you should make server side changes in a non-breaking way and keep earlier versions of your server side code running for a few days.

We believe that the benefits of caching and fast load time generally outweigh these concerns.

How this works

We periodically load your extensions (from our servers) to get their manifests. We call this "manifest cache". The cache is updated every few minutes. This allows us to start up the portal without loading every extension to find out very basic information about it (like its name and its browse entry/entries, etc.) When the extension is actually interacted with, we still load the latest version of its code, so the details of the extension should always be correct (not the cached values). So this works out as a reasonable optimization. With the newer versions of the SDK, we include the value of GetPageVersion() of your extension in its manifest. We then use this value when loading your extension into the portal (see the pageVersion part of the query string below). So your extension URL might end up being something like:

https://YourExtension.contoso.com/
    ?extensionName=Your_Extension
    &shellVersion=5.0.302.85%20(production%23444e261.150819-1426)
    &traceStr=
    &pageVersion=5.0.202.18637347.150928-1117
    &l=en.en-us
    &trustedAuthority=portal.azure.com%3A
    #ece19d8501fb4d2cbe10db84b844c55b

On the server side, we match the value of pageVersion with the current value of ApplicationContext.GetPageVersion(). If those values match, we set the page to be browser cacheable for a long time (1 month). If the values do not match, we set no caching at all on the response. The no-caching case could happen during an upgrade, or if you had an unstable value of ApplicationContext.GetPageVersion(). This should provide a reliable experience even through updates. When the caching values are set, the browser will not even make a server request when loading your extension for the second time.

You will notice that we include the shellVersion also in the query string of the URL. This is just there to provide a mechanism to bust extension caches if we needed to.

How to test your changes

You can verify the behavior of different caching modes in your extension by launching the portal with the following query string:

https://portal.azure.com/
    ?Your_Extension=cacheability-manifest
    &feature.canmodifyextensions=true

This will cause the extension named "Your_Extension" to load with "manifest" level caching (instead of its default setting on the server). You also need to add "feature.canmodifyextensions=true" so that we know that the portal is running in test mode.

To verify that the browser serves your extension entirely from cache on subsequent requests:

  • Open F12 developer tools, switch to the network tab, filter the requests to only show "documents" (not JS, XHR or others).
  • Then navigate to your extension by opening one of its blades, you should see it load once from the server.
  • You will see the home page of your extension show up in the list of responses (along with the load time and size).
  • Then F5 to refresh the portal and navigate back to your extension. This time when your extension is served up, you should see the response served with no network activity. The response will show "(from cache)". If you see this, manifest caching is working as expected.

Co-ordinating these changes with the portal

Again, if you do make some of these changes, you still need to coordinate with the portal team to make sure that we make corresponding changes on our side too. Basically that will tell us to stop sending your extension the sessionId part of the query string in the URL (otherwise caching does not help at all). Sorry about this part, we had to do it in order to stay entirely backward compatible/safe.

Persistent Caching of scripts across extension updates

Making sure that scripts are available across extension updates

One problem that can impact the reliability of extensions is scripts failing to load. And one corner case where this problem can occur is when you update your extension.

Suppose you have V1 of your extension deployed to production and it references a script file /Content/Script_A_SHA1.js We add the SHA1 to ensure maximum cacheability of the script. Now a user visits the portal and starts interacting with your V1 extension. They haven’t yet started loading Script_A_SHA1.js perhaps because it is only used by a different blade. At this time you update the extension server to V2. The update includes a change to Script_A so now its URL becomes /Content/Script_A_SHA2.js. Now when the user does visit that blade, Script_A_SHA1.js is no longer on your server and the request to fetch it from the browser will most likely result in a 404. The use of a CDN might reduce the probability of this occurring. And you should use a CDN. But these user actions can occur over several hours and the CDN does not guarantee keeping data around (for any duration let alone hours). So this problem can/does still occur.

To avoid this issue, you can implement a class on your extension server that derives from Microsoft.Portal.Framework.IPersistentContentCache. The simplest way to do this is to derive from Microsoft.Portal.Framework.BlobStorageBackedPersistentContentCache and MEF-export your implementation, that is, decorate it with:

[Export(typeof(Microsoft.Portal.Framework.IPersistentContentCache))]

You just need to provide it a storage account connection string that can be used to store the scripts. Keep the storage account the same across upgrades of your extension.

We save all your JavaScript, CSS, and image files (basically anything under /Content/...) in this cache to make upgrades smoother.

The storage account is a third layer cache. Layer 1 is CDN. Layer 2 is in memory in your extension server. So it should get hit very rarely and once read, it should warm up the other layers. So we don't think you need to geo-distribute this layer. If we detect that it is getting hit too often, we will come up with a geo-distribution strategy. If you do use one account per region to handle this, you will need to find a way to synchronize them. You could do this by using a custom implementation of the Microsoft.Portal.Framework.IPersistentContentCache interface.

Example implementation as done in HubsExtension

using System;
using System.ComponentModel.Composition;
using Microsoft.Portal.Framework;

namespace <your.extension.namespace>
{
    /// <summary>
    /// The configuration for hubs content caching.
    /// </summary>
    [Export(typeof(HubsBlobStorageBackedContentCacheSettings))]
    internal class HubsBlobStorageBackedContentCacheSettings : ConfigurationSettings
    {
        /// <summary>
        /// Gets the hubs content cache storage connection string.
        /// </summary>
        [ConfigurationSetting(DefaultValue = "")]
        public SecureConfigurationConnectionString StorageConnectionString
        {
            get;
            private set;
        }
    }

    /// <summary>
    /// Stores content in blob storage as block blobs.
    /// Used to ensure that cached content is available to clients
    /// even when the extension server code is newer/older than the content requested.
    /// </summary>
    [Export(typeof(IPersistentContentCache))]
    internal class HubsBlobStorageBackedContentCache : BlobStorageBackedPersistentContentCache
    {
        /// <summary>
        /// Creates an instance of the cache.
        /// </summary>
        /// <param name="applicationContext"> Application context which has environment settings.</param>
        /// <param name="settings"> The content cache settings to use.</param>
        /// <param name="tracer"> The tracer to use for any logging.</param>
        [ImportingConstructor]
        public HubsBlobStorageBackedContentCache(
            ApplicationContext applicationContext,
            HubsBlobStorageBackedContentCacheSettings settings,
            ICommonTracer tracer)
            :base(settings.StorageConnectionString.ToString(), "HubsExtensionContentCache", applicationContext, tracer)
        {
        }
    }
}

web.config

    <add key="<your.extension.namespace>.HubsBlobStorageBackedContentCacheSettings.StorageConnectionString" value="" />

Verifying that persistent caching is working

  • Deploy a version of your extension. Examine the scripts it loads, they will be of the form prefix<sha hash>suffix.js
  • Use a blob explorer of your preference and verify that the scripts have been written to blob storage.
  • Then make changes to TS files in your solution, build and deploy a new version of your extension.
  • Look for scripts that have the same prefix and suffix but a different hash.
  • For those scripts try to request the original URL (from step 1) from your extension server (not via the cdn).
  • The script should still get served, but this time it is coming from the persistent cache.

Run portalcop to identify and resolve common performance issues

PortalCop

The Portal Framework team has built a tool called PortalCop that can help reduce code size and remove redundant RESX entries.

Installing PortalCop

Run the following command in the NuGet Package Manager Console.

Install-Package PortalFx.PortalCop -Source https://msazure.pkgs.visualstudio.com/DefaultCollection/_packaging/Official/nuget/v3/index.json -Version 1.0.0.339

Or run the following in a Windows command prompt.

nuget install PortalFx.PortalCop -Source https://msazure.pkgs.visualstudio.com/DefaultCollection/_packaging/Official/nuget/v3/index.json -Version 1.0.0.339

Running PortalCop

Namespace Mode

NOTE: If you do not use AMD, please do not run this mode in your codebase.

If there are nested namespaces in code (for example A.B.C.D) the minifier will only reduce the top level (A) name, leaving all remaining names uncompressed.

Example of uncompressible code and its minified version:

   MsPortalFx.Base.Utilities.SomeFunction();   // minifies to: a.Base.Utilities.SomeFunction();

As you implement your extension using our Framework, you may have done some namespace importing to help achieve better minification, like this:

   import FxUtilities = MsPortalFx.Base.Utilities;

which yields a better minified version:

   FxUtilities.SomeFunction();                 // minifies to: a.SomeFunction();

In the Namespace mode, the PortalCop tool will normalize imports to the Fx naming convention. It won’t collide with any predefined names you defined. Using this tool, we achieved up to 10% code reduction in most of the Shell codebase.

Review the changes after running the tool. In particular, be wary of string content changes: the tool does string mapping, not syntax-based replacement.

   portalcop Namespace

Resx

To reduce code size and save on localization costs, you can use the PortalCop RESX mode to find unused/redundant resx strings.

To list unused strings:
   portalcop Resx
   
To list and clean *.resx files:
    portalcop Resx AutoRemove

Constraints:

  • The tool may incorrectly flag resources as being un-used if your extension uses strings in unexpected formats. For example, if you try to dynamically read from resx based on string values.

    Utils.getResourceString(ClientResources.DeploymentSlots, slot);

    export function getResourceString(resources: any, value: string): string {
        var key = value && value.length ? value.substr(0, 1).toLowerCase() + value.substr(1) : value;
        return resources[key] || value;
    }

  • You need to review the changes after running the tool and make sure that they are valid because of the above constraint.

  • If using the AutoRemove option, you need to open up the RESX files in VisualStudio to regenerate the Designer.cs files.

  • If you find any more scenarios that the tool incorrectly identifies as unused please report to Ibiza Fx PM

Performance alerting

Coming soon. Please reach out to sewatson if you are interested.

Reliability

Overview

Reliability of the Portal is one of the top pain points from a customer's perspective. As an extension author you have a duty to keep your experience at or above the reliability bar.

| Area | Reliability Bar | Telemetry Action/s | How is it measured? |
| --- | --- | --- | --- |
| Extension | See Power BI | InitializeExtensions/LoadExtensions | ((# of LoadExtensions starts - # of InitializeExtensions or LoadExtensions failures) / # of LoadExtensions starts) * 100 |
| Blade | See Power BI | BladeLoaded vs BladeLoadErrored | ((# of BladeLoaded starts - # of BladeLoadErrored) / # of BladeLoaded starts) * 100 |
| Part | See Power BI | PartLoaded | ((# of PartLoaded starts - # of PartLoaded canceled) / # of PartLoaded starts) * 100 |

Extension reliability

This is core to your customers' experience: if the FX is unable to load your extension, it cannot surface any of your experience. Consequently your customers will be unable to manage or monitor their resources through the Portal.

Blade reliability

Second to Extension reliability, Blade reliability is the next most critical level. Blade reliability can be equated to a page loading in a website; a failure to load is a critical issue.

Part reliability

Parts are used throughout the portal, on both blades and dashboards. If a part fails to load, the user potentially:

  1. cannot navigate to a blade or the next blade
  2. does not see the critical data they expected on the dashboard
  3. etc...

Assessing extension reliability

There are two methods to assess your reliability:

  1. Visit the IbizaFx provided PowerBi report*

  2. Run Kusto queries locally to determine your numbers

    (*) To get access to the PowerBi dashboard reference the Telemetry onboarding guide, then access the following Extension performance/reliability report

The first method is definitely the easiest way to determine your current assessment as this is maintained on a regular basis by the Fx team. You can, if preferred, run queries locally but ensure you are using the Fx provided Kusto functions to calculate your assessment.

Checklist

There are a few items that the FX team advises all extensions to follow.

Code optimisations to improve extension reliability

Lazy initialization of data contexts and view model factories

The setDataContext API on view model factories was designed pre-AMD support in TypeScript and slows down extension load by increasing the amount of code downloaded on extension initialization. This also increases the risk of extension load failures due to increase in network activity. By switching to the setDataContextFactory method, we reduce the amount of code downloaded to the bare minimum. And the individual data contexts are loaded if and when required (e.g. if a blade that's opened requires it).

Old code:

this.viewModelFactories.Blades().setDataContext(new Blades.DataContext());

New code:

this.viewModelFactories.Blades().setDataContextFactory<typeof Blades>(
        "./Blades/BladesArea",
        (contextModule) => new contextModule.DataContext()
);

Reliability Frequently Asked Questions (FAQ)

My Extension is below the reliability bar, what should I do

Run the following query

GetExtensionFailuresSummary(ago(1d))
| where extensionName contains "Microsoft_Azure_Compute"

Update the extensionName to your extension, and increase the time range if the last 24 hours isn't sufficient. Address the highest-impact issues, by occurrence/affected users.

The query will return a summary of all the events where your extension failed to load.

| Field name | Definition |
| --- | --- |
| extensionName | The extension the error correlates to |
| errorState | The type of error that occurred |
| error | The specific error that occurred |
| Occurences | Number of occurrences |
| AffectedUsers | Number of affected users |
| AffectedSessions | Number of affected sessions |
| any_sessionId | A sample of an affected session |
| any_message | A sample message of what would normally be returned given errorState/error |

Once you have run the query you will be shown a list of errorStates and errors; for greater detail you can use the any_sessionId to investigate further.

Error States
| Error State | Definition | Action items |
| --- | --- | --- |
| FirstResponseNotReceived | The shell loaded the extension URL obtained from the config into an IFrame, however there wasn't any response from the extension. | 1. Scan the events table to see if there are any other relevant error messages during the time frame of the alert.<br>2. Try opening the extension URL directly in the browser - it should show the default page for the extension.<br>3. Open the dev tools network tab in your browser and try opening the extension URL appending the following query string parameter sessionId=testSessionId - this should open a blank page and all requests in the network tab should be 200 or 300 level responses (no failures). If there is a server error in the extension it will print out the error and a call stack if available. In case the failures are from a CDN domain, check if the same URL is accessible from the extension domain - if so, the CDN might be corrupt/out of sync, and flushing the CDN would mitigate the issue. |
| HomePageTimedOut | The index page failed to load within the max time period. | // Need steps to action on |
| ManifestNotReceived | The bootstrap logic was completed, however the extension did not return a manifest to the shell. The shell waits for a period of time and then times out. | 1. Open the dev tools network tab in your browser and try opening the extension URL appending the following query string parameter sessionId=testSessionId - this should open a blank page and all requests in the network tab should be 200 or 300 level responses (no failures). If there is a server error in the extension it will print out the error and a call stack if available. In case the failures are from a CDN domain, check if the same URL is accessible from the extension domain - if so, the CDN might be corrupt/out of sync, and flushing the CDN would mitigate the issue.<br>2. Scan the events table to see if there are any other relevant error messages during the time frame of the alert. |
| InvalidExtensionName | The name of the extension specified in the extensions JSON in config doesn't match the name of the extension in the extension manifest. | 1. Verify what the correct name of the extension should be, and if the name in config is incorrect, update it.<br>2. If the name in the manifest is incorrect, contact the relevant extension team to update the tag in their PDL with the right extension name and recompile. |
| InvalidManifest | The manifest that was received from the extension was invalid, i.e. it had validation errors. | Scan the error logs for all the validation errors in the extension manifest. |
| InvalidDefinition | The definition that was received from the extension was invalid, i.e. it had validation errors. | Scan the error logs for all the validation errors in the extension definition. |
| FailedToInitialize | The extension failed to initialize, with one or more calls to methods on the extension's entry point class failing. | 1. Look for the error code and, if present, the call stack in the message to get more details.<br>2. Scan the events table to get all the relevant error messages during the time frame of the alert.<br>3. These errors should have information about what exactly failed while trying to initialize the extension, e.g. the initialize endpoint, the getDefinition endpoint, etc. |
| TooManyRefreshes | The extension tried to reload itself within the IFrame multiple times. The error should specify the number of times it refreshed before the extension was disabled. | Scan the events table to see if there are any other relevant error messages during the time frame of the alert. |
| TooManyBootGets | The extension tried to send the bootGet message to request Fx scripts multiple times. The error should specify the number of times it refreshed before the extension was disabled. | Scan the events table to see if there are any other relevant error messages during the time frame of the alert. |
| TimedOut | The extension failed to load after the predefined timeout. | 1. Scan the events table to see if there are any other relevant error messages during the time frame of the alert.<br>2. Analyze the error messages to try to deduce whether the problem is on the extension side or the shell.<br>3. If the issue is with the extension, look at CPU utilization of the cloud service instances. If CPU utilization is high, it might explain why clients are timing out when requesting resources from the server. |
| MaxRetryAttemptsExceeded | This is a collation of the above events. | Inspect the sample message and follow the appropriate step above. |

My Blade is below the reliability bar, what should I do

First, run the following query, making sure you update the extension/time range.

GetBladeFailuresSummary(ago(1h))
| where extension == "Microsoft_Azure_Compute"

| Field name | Definition |
| --- | --- |
| extension | The extension the error correlates to |
| blade | The blade the error correlates to |
| errorReason | The error reason associated with the failure |
| Occurences | Number of occurrences |
| AffectedUsers | Number of affected users |
| AffectedSessions | Number of affected sessions |
| any_sessionId | A sample of an affected session |
| any_details | A sample message of what would normally be returned given extension/blade/errorReason |

Once you have that, correlate the error reasons with the below list to see the guided next steps.

| Error reason | Definition | Action items |
| --- | --- | --- |
| ErrorInitializing | The FX failed to initialize the blade due to an invalid definition. | 1. Verify the PDL definition of the given blade.<br>2. Verify the source opening the blade is passing the correct parameters.<br>3. Reference a sample session in the ClientEvents Kusto table; there should be correlating events before the blade failure. |
| ErrorLoadingExtension | The extension failed to load and therefore the blade was unable to load. | Refer to the guidance provided for extension reliability. |
| ErrorLoadingDefinition | The FX was unable to retrieve the blade definition from the extension. | Reference a sample session in the ClientEvents Kusto table; there should be correlating events before the blade failure. |
| ErrorLoadingExtensionAndDefinition | The FX was unable to retrieve the blade definition from the extension. | Reference a sample session in the ClientEvents Kusto table; there should be correlating events before the blade failure. |
| ErrorUnrecoverable | The FX failed to restore the blade during journey restoration because of an unexpected error. | This should not occur, but if it does, file a [shell bug](http://aka.ms/portalfx/shellbug). |

My Part is below the reliability bar, what should I do

First, run the following query, making sure you update the extension/time range.

GetPartFailuresSummary(ago(1h))
| where extension == "Microsoft_Azure_Compute"

| Field name | Definition |
| --- | --- |
| extension | The extension the error correlates to |
| blade | The blade the part is on; if blade === "Dashboard" then the part was loaded from a dashboard |
| part | The part the error correlates to |
| errorReason | The error reason associated with the failure |
| Occurences | Number of occurrences |
| AffectedUsers | Number of affected users |
| AffectedSessions | Number of affected sessions |
| any_sessionId | A sample of an affected session |
| any_details | A sample message of what would normally be returned given extension/blade/part/errorReason |

Once you have that, correlate the error reasons with the below list to see the guided next steps.

| Error reason | Definition | Action items |
| --- | --- | --- |
| TransitionedToErrorState | The part was unable to load and failed through its initialization or OnInputsSet. | Consult the any_details column; there should be a sample message explaining explicitly what the issue was. Commonly this is a nullRef. |
| ErrorLocatingPartDefinition | The FX was unable to determine the part definition. | The likely cause is that the extension has removed the part entirely from the PDL; this is not the guided pattern. See deprecating parts for the explicit guidance. __NEED LINK__ |
| ErrorAcquiringViewModel | The FX was unable to retrieve the part view model from the extension. | Correlate the start of the sample message with one of the following common explanations:<br>• ETIMEOUT - may be caused by a flooding of the RPC layer.<br>• Script error - depending on the exact message, this may be due to timeouts/latency issues/connection problems.<br>• Load timeout for modules - may be caused by a slow or lost connection.<br>• description: - a generic bucket; the message will define the issue further, for example null references.<br>For all of the above, if the message does not provide enough information, explore the raw events function or reference a sample session in the ClientEvents Kusto table, as there should be correlating events before the failure. |
| ErrorLoadingControl | The FX was unable to retrieve the control module. | Reach out to the FX team if you see a large number of these issues. |
| ErrorCreatingWidget | The FX failed to create the widget. | Check the sample message; it should indicate the explicit reason why it failed. This was probably a ScriptError or a failure to load the module. |
| OldInputsNotHandled | A user has a pinned representation of an old version of the tile and the extension author has changed the inputs in a breaking fashion. | If this happens you need to follow the guided pattern. __NEED LINK__ |

Alerts

This is in progress; if you are interested in adopting reliability alerts, please contact sewatson.

There are 3 types of alerts we will be firing:

  1. Extension reliability - this requires onboarding; please contact sewatson if you are interested
  2. Blade reliability hourly
  3. Part reliability hourly