API Proposal: Update PFI API to be easier to use #5625

JakeRadMSFT · 2021-02-16T18:09:15Z

Background and Motivation

The current Permutation Feature Importance (PFI) API is difficult to use. A few issues have already been opened asking to make it easier; we can use this issue to track a proposed API.

Prior Issue:
#4216

Example Support Issue to help developers use it:
dotnet/machinelearning-modelbuilder#1031 (comment)

The main issue with the API is that it returns an array and it's not easy to get back to the column name/feature name from the index.

VBuffer<ReadOnlyMemory<char>> nameBuffer = default;
// NOTE: The column name "Features" needs to match the featureColumnName used in the trainer; the name "SlotNames" is always the same regardless of trainer.
preprocessedTrainData.Schema["Features"].Annotations.GetValue("SlotNames", ref nameBuffer);
var featureColumnNames = nameBuffer.DenseValues().ToList();
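Once the slot names are available, the remaining step is to pair them with the PFI result array by position. The sketch below assumes the result array has one entry per slot, in the same order as the slot names. `SlotNameMapping.ByName` is a hypothetical helper, not part of ML.NET, and it is written against plain .NET interfaces so that it does not depend on any particular metrics type:

```csharp
using System;
using System.Collections.Generic;

public static class SlotNameMapping
{
    // Hypothetical helper (not an ML.NET API): pairs each slot name with the
    // result stored at the same index.
    public static Dictionary<string, T> ByName<T>(
        IReadOnlyList<string> slotNames, IReadOnlyList<T> perSlotResults)
    {
        if (slotNames.Count != perSlotResults.Count)
            throw new ArgumentException("Slot names and results must have the same length.");

        var byName = new Dictionary<string, T>(slotNames.Count);
        for (int i = 0; i < slotNames.Count; i++)
            byName.Add(slotNames[i], perSlotResults[i]); // throws if a slot name repeats
        return byName;
    }
}
```

With the snippet above, the names would first need converting to strings, for example `featureColumnNames.Select(n => n.ToString()).ToList()`, before being passed in together with the array returned by PermutationFeatureImportance.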

The second biggest issue (which actually comes earlier in the process :)) is that it's hard to know what to pass for the ISingleFeaturePredictionTransformer argument. Perhaps we can figure out how to extract it for them from the training pipeline?

// Option 1: extract the predictor; requires knowing the type in advance:
// var predictor = ((TransformerChain<RegressionPredictionTransformer<LightGbmRegressionModelParameters>>)mlModel).LastTransformer;

// Option 2: should always work, as long as you _know_ the predictor is the last transformer in the chain.
var predictor = ((IEnumerable<ITransformer>)mlModel).Last();

// Option 3: requires saving the model to disk and loading it back first
//var path = @"C:\Users\anvelazq\Desktop\PfiSample\model.zip";
//mlContext.Model.Save(mlModel, trainingDataView.Schema, path);
//var mlModel2 = mlContext.Model.Load(path, out var _);
//var predictor = ((TransformerChain<ITransformer>) mlModel2).LastTransformer;

If we can do that ... then we can just take in "Microsoft.ML.IEstimator<Microsoft.ML.ITransformer> estimator" similar to the CrossValidate APIs.

Proposed API

namespace Microsoft.ML
{
     public static class PermutationFeatureImportanceExtensions {

     public static System.Collections.Immutable.ImmutableArray<Microsoft.ML.Data.RegressionMetricsStatistics> PermutationFeatureImportance<TModel> (this Microsoft.ML.RegressionCatalog catalog, Microsoft.ML.ISingleFeaturePredictionTransformer<TModel> predictionTransformer, Microsoft.ML.IDataView data, string labelColumnName = "Label", bool useFeatureWeightFilter = false, int? numberOfExamplesToUse = default, int permutationCount = 1) where TModel : class;
+    public static System.Collections.Generic.Dictionary<string, Microsoft.ML.Data.RegressionMetricsStatistics> PermutationFeatureImportance<TModel> (this Microsoft.ML.RegressionCatalog catalog, Microsoft.ML.IEstimator<Microsoft.ML.ITransformer> estimator, Microsoft.ML.IDataView data, string labelColumnName = "Label", bool useFeatureWeightFilter = false, int? numberOfExamplesToUse = default, int permutationCount = 1) where TModel : class;
     }
}

You may find the Framework Design Guidelines helpful.

Usage Examples

This is how it works today: dotnet/machinelearning-modelbuilder#1031 (comment)
Below is how I think it should work. The key thing to note is the similarity to the CrossValidate API.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using PfiSample.Model;
using System.Collections.Immutable;

namespace PfiSample.ConsoleApp
{
    public static class ModelBuilder
    {
        private static string TRAIN_DATA_FILEPATH = @"C:\Users\anvelazq\Desktop\PfiSample\PfiSample.ConsoleApp\taxi-fare-train.csv";
        private static string MODEL_FILE = ConsumeModel.MLNetModelPath;

        // Create MLContext to be shared across the model creation workflow objects 
        // Set a random seed for repeatable/deterministic results across multiple trainings.
        private static MLContext mlContext = new MLContext(seed: 1);

        public static void CreateModel()
        {
            // Load Data
            IDataView trainingDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
                                            path: TRAIN_DATA_FILEPATH,
                                            hasHeader: true,
                                            separatorChar: ',',
                                            allowQuoting: true,
                                            allowSparse: false);

            // Build training pipeline and Train Model

            // Data process configuration with pipeline data transformations 
            var dataProcessPipeline = mlContext.Transforms.Categorical.OneHotEncoding(new[] { new InputOutputColumnPair("vendor_id", "vendor_id"), new InputOutputColumnPair("payment_type", "payment_type") })
                                      .Append(mlContext.Transforms.Concatenate("Features", new[] { "vendor_id", "payment_type", "rate_code", "passenger_count", "trip_time_in_secs", "trip_distance" }));
            // Set the training algorithm 
            var trainer = mlContext.Regression.Trainers.LightGbm(labelColumnName: "fare_amount", featureColumnName: "Features");

            IEstimator<ITransformer> trainingPipeline = dataProcessPipeline.Append(trainer);
           
            ITransformer mlModel = trainingPipeline.Fit(trainingDataView);
            
            // Calculate PFI
            CalculatePFI(mlContext, trainingDataView, trainingPipeline);
            
            // Evaluate quality of Model
            Evaluate(mlContext, trainingDataView, trainingPipeline);

            // Save model
            SaveModel(mlContext, mlModel, MODEL_FILE, trainingDataView.Schema);
        }


        private static void Evaluate(MLContext mlContext, IDataView trainingDataView, IEstimator<ITransformer> trainingPipeline)
        {
            // Cross-Validate with single dataset (since we don't have two datasets, one for training and for evaluate)
            // in order to evaluate and get the model's accuracy metrics
            Console.WriteLine("=============== Cross-validating to get model's accuracy metrics ===============");
            var crossValidationResults = mlContext.Regression.CrossValidate(trainingDataView, trainingPipeline, numberOfFolds: 5, labelColumnName: "fare_amount");
            PrintRegressionFoldsAverageMetrics(crossValidationResults);
        }

        private static void CalculatePFI(MLContext mlContext, IDataView trainingDataView, IEstimator<ITransformer> trainingPipeline)
        {
            Dictionary<string, RegressionMetricsStatistics> permutationFeatureImportance =
                mlContext
                .Regression
                .PermutationFeatureImportance(trainingPipeline, trainingDataView, permutationCount: 1, labelColumnName: "fare_amount");

            Console.WriteLine("Feature\tPFI");
            foreach (KeyValuePair<string, RegressionMetricsStatistics> entry in permutationFeatureImportance)
            {
                Console.WriteLine($"{entry.Key}\t{entry.Value.RSquared.Mean:F6}");
            }
        }

        private static void SaveModel(MLContext mlContext, ITransformer mlModel, string modelRelativePath, DataViewSchema modelInputSchema)
        {
            // Save/persist the trained model to a .ZIP file
            Console.WriteLine($"=============== Saving the model  ===============");
            mlContext.Model.Save(mlModel, modelInputSchema, GetAbsolutePath(modelRelativePath));
            Console.WriteLine("The model is saved to {0}", GetAbsolutePath(modelRelativePath));
        }

        public static string GetAbsolutePath(string relativePath)
        {
            FileInfo _dataRoot = new FileInfo(typeof(Program).Assembly.Location);
            string assemblyFolderPath = _dataRoot.Directory.FullName;

            string fullPath = Path.Combine(assemblyFolderPath, relativePath);

            return fullPath;
        }
    }
}

Alternative Designs

If there are objections or technical challenges to giving PFI an API similar to CrossValidate, I'm open to alternatives, but I don't know the ML.NET APIs well enough to come up with other patterns.

Risks

I think the biggest risk/challenge is that folks can do a lot of things with pipelines and models that make them incompatible with PFI. I also believe the time to calculate PFI grows with the number of columns, and certain transforms like OneHotHashEncoding can create hundreds of columns ...

ericstj · 2021-02-16T19:01:37Z

Changing return type is a binary breaking change. Is it possible to make this change additive instead?

JakeRadMSFT · 2021-02-16T19:05:11Z

I'm not sure the old one is really worth keeping ... if there is ever a good time to deprecate it .... but it could be an overload instead of replacing it.

ericstj · 2021-02-16T19:06:40Z

> I'm not sure the old one is really worth keeping

We can't take that position. Someone might have gone through the trouble to get the old one working. Is adding this new method worth breaking that customer? What if they cannot recompile the component that used this API?

You can't overload on return type, so we'd need to give it a different name.

JakeRadMSFT · 2021-02-16T19:08:45Z

It has different parameters as well

Microsoft.ML.ISingleFeaturePredictionTransformer predictionTransformer

->

Microsoft.ML.IEstimator<Microsoft.ML.ITransformer> estimator

JakeRadMSFT · 2021-02-16T19:09:50Z

> We can't take that position. Someone might have gone through the trouble to get the old one working. Is adding this new method worth breaking that customer? What if they cannot recompile the component that used this API?

Totally ... but if we ever V2 ... is it worth keeping then?

ericstj · 2021-02-17T20:04:34Z

We have some guidance now for how to better obsolete members. I'd expect us to follow that: https://github.com/dotnet/designs/blob/main/accepted/2020/better-obsoletion/better-obsoletion.md

> It has different parameters as well

Perfect! That allows for a compatible addition.
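The compatibility point in this exchange can be illustrated with a small sketch. C# does not consider the return type when distinguishing overloads, so two methods that differ only in return type cannot coexist, while a method whose parameter list differs can be added next to the existing one without touching it. All interfaces, classes, and method names below are hypothetical stand-ins, not ML.NET types:

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for ISingleFeaturePredictionTransformer<TModel>
// and IEstimator<ITransformer>.
public interface ITrainedPredictor { }
public interface IPipelineEstimator { }
public sealed class StubPredictor : ITrainedPredictor { }
public sealed class StubEstimator : IPipelineEstimator { }

public static class OverloadSketch
{
    // Existing shape: results indexed by feature slot.
    public static double[] Importance(ITrainedPredictor predictor) => new[] { 0.0 };

    // Additive overload: compiles because the parameter type differs.
    public static Dictionary<string, double> Importance(IPipelineEstimator estimator) =>
        new Dictionary<string, double> { ["feature"] = 0.0 };

    // Would not compile (error CS0111): same parameter list, only the return type differs.
    // public static Dictionary<string, double> Importance(ITrainedPredictor predictor) => null;
}
```

Existing callers that pass a trained predictor keep binding to the first method, so the addition is not a binary breaking change for them.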

ericstj · 2021-05-12T16:52:20Z

cc @michaelgsharp @eerhardt what do you think about the feasibility of this addition?

eerhardt · 2021-05-14T16:32:15Z

I like the idea of having more convenient return information, i.e. the Dictionary<string, RegressionMetricsStatistics> permutationFeatureImportance that automatically maps columns to their statistics, without forcing users to know how to do the mapping themselves.

The only concern I have is changing the input to be an IEstimator<ITransformer>. Doing this would mean you need to re-train your pipeline during the call to PermutationFeatureImportance, which isn't something that PFI does today.

Maybe an alternate solution would be to create another convenience API that gets the ISingleFeaturePredictionTransformer from a trained pipeline. Basically, a fully correct version of what this code is trying to do:

// Option 1: to extract predictor, requires to know the type in advance:
// var predictor = ((TransformerChain<RegressionPredictionTransformer<LightGbmRegressionModelParameters>>)mlModel).LastTransformer;

// Option 2: Should work always, as long as you _know_ the predictor is the last transformer in the chain.
var predictor = ((IEnumerable<ITransformer>)mlModel).Last();

That would allow users to easily get the input they need to pass into PermutationFeatureImportance.
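A minimal sketch of what such a convenience helper might look like follows. It is not an ML.NET API: `PredictorLookup.FindLastPredictor` is a hypothetical name, and the sketch assumes that ISingleFeaturePredictionTransformer<TModel> is covariant in TModel (so a test against `<object>` matches any model type) and that transformer chains can be enumerated as IEnumerable<ITransformer>, as Option 2 above already relies on:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public static class PredictorLookup
{
    // Hypothetical helper (not an ML.NET API): returns the last transformer in a
    // trained pipeline that is a single-feature prediction transformer, or null.
    public static ISingleFeaturePredictionTransformer<object> FindLastPredictor(ITransformer model)
    {
        // Assumption: the interface is covariant, so this matches any TModel.
        if (model is ISingleFeaturePredictionTransformer<object> predictor)
            return predictor;

        // Assumption: chains expose their steps as IEnumerable<ITransformer>.
        if (model is IEnumerable<ITransformer> chain)
        {
            // Walk backwards so the transformer closest to the end wins,
            // recursing into nested chains.
            foreach (ITransformer step in chain.Reverse())
            {
                var found = FindLastPredictor(step);
                if (found != null)
                    return found;
            }
        }

        return null;
    }
}
```

Unlike Option 2, this does not require the predictor to be the very last transformer, and it returns null instead of handing back an unrelated transformer when no predictor is found.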

JakeRadMSFT · 2021-05-20T00:46:30Z

@eerhardt
I trust your guidance - the key issue is that it's hard to use. This was just the proposal I came up with.

Re: convenient return information
@michaelgsharp - is the reason this is so awkward today the same as the classification thing we talked about ... where there just isn't "Dictionary" support in IDataView?

houghj16 · 2021-05-25T14:04:22Z

Adding links to similar issues calling for clarification or improvements to the PFI API:

AutoML

Handling Features and Weights

JakeRadMSFT · 2021-05-25T16:09:24Z

Thanks @houghj16!!

JakeRadMSFT changed the title ~~(WIP) API Proposal: Update PFI API to be easier to use~~ API Proposal: Update PFI API to be easier to use Feb 16, 2021

ericstj added the API, API breaking change, and enhancement labels Feb 16, 2021

ericstj removed the API breaking change label May 12, 2021

michaelgsharp mentioned this issue Sep 13, 2021

New PFI API #5934

Merged

michaelgsharp closed this as completed in #5934 Oct 1, 2021

michaelgsharp mentioned this issue Oct 27, 2021

PFI (Permutation Feature Importance) API needs to be simpler to use #4216

Closed

dotnet locked as resolved and limited conversation to collaborators Mar 17, 2022
