API Proposal: Update PFI API to be easier to use #5625

Closed
JakeRadMSFT opened this issue Feb 16, 2021 · 11 comments · Fixed by #5934
Labels
API (Issues pertaining the friendly API), enhancement (New feature or request)

Comments

@JakeRadMSFT
Contributor

JakeRadMSFT commented Feb 16, 2021

Background and Motivation

The current PFI API is difficult to use. We've had a few issues opened to make it easier but we can use this issue to track a proposed API.

Prior Issue:
#4216

Example Support Issue to help developers use it:
dotnet/machinelearning-modelbuilder#1031 (comment)

The main issue with the API is that it returns an array and it's not easy to get back to the column name/feature name from the index.

VBuffer<ReadOnlyMemory<char>> nameBuffer = default;
// NOTE: The column name "Features" must match the featureColumnName used in the trainer;
// the annotation name "SlotNames" is always the same regardless of trainer.
preprocessedTrainData.Schema["Features"].Annotations.GetValue("SlotNames", ref nameBuffer);
var featureColumnNames = nameBuffer.DenseValues().ToList();
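
Until the API returns names directly, the index-to-name lookup can be wrapped in a small helper. The sketch below is a hypothetical helper (`PfiNaming.ToNamedMetrics` is not part of ML.NET) that pairs slot names with per-slot results by position. It is generic so that it runs without ML.NET, and the numbers in `Main` are placeholders, not real PFI output. With ML.NET, `names` would be the `featureColumnNames` list from the snippet above and `metrics` the array returned by PermutationFeatureImportance.

```csharp
using System;
using System.Collections.Generic;

public static class PfiNaming
{
    // Pairs the i-th slot name with the i-th per-slot result.
    // Throws if the two lists disagree in length, since that usually means
    // the names were read from a different column than the one scored.
    public static Dictionary<string, T> ToNamedMetrics<T>(
        IReadOnlyList<string> names, IReadOnlyList<T> metrics)
    {
        if (names.Count != metrics.Count)
            throw new ArgumentException(
                $"Got {names.Count} slot names but {metrics.Count} results.");

        var result = new Dictionary<string, T>(names.Count);
        for (int i = 0; i < names.Count; i++)
            result.Add(names[i], metrics[i]); // throws on duplicate slot names
        return result;
    }
}

public static class PfiNamingDemo
{
    public static void Main()
    {
        // Placeholder values for illustration only.
        var names = new[] { "trip_distance", "passenger_count" };
        var scores = new[] { -0.5, -0.01 };

        foreach (var entry in PfiNaming.ToNamedMetrics(names, scores))
            Console.WriteLine($"{entry.Key}\t{entry.Value:F6}");
    }
}
```

Failing loudly on a length mismatch is a deliberate choice here: silently truncating would attach results to the wrong feature names.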

The second biggest issue (which actually comes earlier in the process :)) is that it's hard to know what to pass for the ISingleFeaturePredictionTransformer argument. Perhaps this is something we can figure out how to extract for them from the training pipeline?

// Option 1: Extract the predictor by casting; requires knowing the type in advance.
// var predictor = ((TransformerChain<RegressionPredictionTransformer<LightGbmRegressionModelParameters>>)mlModel).LastTransformer;

// Option 2: Should always work, as long as you _know_ the predictor is the last transformer in the chain.
var predictor = ((IEnumerable<ITransformer>)mlModel).Last();

// Option 3: Requires saving the model and loading it from disk first.
//var path = @"C:\Users\anvelazq\Desktop\PfiSample\model.zip";
//mlContext.Model.Save(mlModel, trainingDataView.Schema, path);
//var mlModel2 = mlContext.Model.Load(path, out var _);
//var predictor = ((TransformerChain<ITransformer>) mlModel2).LastTransformer;

If we can do that ... then we can just take in "Microsoft.ML.IEstimator<Microsoft.ML.ITransformer> estimator" similar to the CrossValidate APIs.

Proposed API

namespace Microsoft.ML
{
     public static class PermutationFeatureImportanceExtensions
     {
     public static System.Collections.Immutable.ImmutableArray<Microsoft.ML.Data.RegressionMetricsStatistics> PermutationFeatureImportance<TModel> (this Microsoft.ML.RegressionCatalog catalog, Microsoft.ML.ISingleFeaturePredictionTransformer<TModel> predictionTransformer, Microsoft.ML.IDataView data, string labelColumnName = "Label", bool useFeatureWeightFilter = false, int? numberOfExamplesToUse = default, int permutationCount = 1) where TModel : class;
+    public static System.Collections.Generic.Dictionary<string, Microsoft.ML.Data.RegressionMetricsStatistics> PermutationFeatureImportance (this Microsoft.ML.RegressionCatalog catalog, Microsoft.ML.IEstimator<Microsoft.ML.ITransformer> estimator, Microsoft.ML.IDataView data, string labelColumnName = "Label", bool useFeatureWeightFilter = false, int? numberOfExamplesToUse = default, int permutationCount = 1);
     }
}

You may find the Framework Design Guidelines helpful.

Usage Examples

This is how it works today: dotnet/machinelearning-modelbuilder#1031 (comment)
Below is how I think it should work. The key thing to note is the similarity to the CrossValidate API.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using PfiSample.Model;
using System.Collections.Immutable;

namespace PfiSample.ConsoleApp
{
    public static class ModelBuilder
    {
        private static string TRAIN_DATA_FILEPATH = @"C:\Users\anvelazq\Desktop\PfiSample\PfiSample.ConsoleApp\taxi-fare-train.csv";
        private static string MODEL_FILE = ConsumeModel.MLNetModelPath;

        // Create MLContext to be shared across the model creation workflow objects 
        // Set a random seed for repeatable/deterministic results across multiple trainings.
        private static MLContext mlContext = new MLContext(seed: 1);

        public static void CreateModel()
        {
            // Load Data
            IDataView trainingDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
                                            path: TRAIN_DATA_FILEPATH,
                                            hasHeader: true,
                                            separatorChar: ',',
                                            allowQuoting: true,
                                            allowSparse: false);

            // Build training pipeline and Train Model

            // Data process configuration with pipeline data transformations 
            var dataProcessPipeline = mlContext.Transforms.Categorical.OneHotEncoding(new[] { new InputOutputColumnPair("vendor_id", "vendor_id"), new InputOutputColumnPair("payment_type", "payment_type") })
                                      .Append(mlContext.Transforms.Concatenate("Features", new[] { "vendor_id", "payment_type", "rate_code", "passenger_count", "trip_time_in_secs", "trip_distance" }));
            // Set the training algorithm 
            var trainer = mlContext.Regression.Trainers.LightGbm(labelColumnName: "fare_amount", featureColumnName: "Features");

            IEstimator<ITransformer> trainingPipeline = dataProcessPipeline.Append(trainer);

            // Train the model (named mlModel so that the SaveModel call below can use it)
            ITransformer mlModel = trainingPipeline.Fit(trainingDataView);

            // Calculate PFI
            CalculatePFI(mlContext, trainingDataView, trainingPipeline);
            
            // Evaluate quality of Model
            Evaluate(mlContext, trainingDataView, trainingPipeline);

            // Save model
            SaveModel(mlContext, mlModel, MODEL_FILE, trainingDataView.Schema);
        }


        private static void Evaluate(MLContext mlContext, IDataView trainingDataView, IEstimator<ITransformer> trainingPipeline)
        {
            // Cross-Validate with single dataset (since we don't have two datasets, one for training and for evaluate)
            // in order to evaluate and get the model's accuracy metrics
            Console.WriteLine("=============== Cross-validating to get model's accuracy metrics ===============");
            var crossValidationResults = mlContext.Regression.CrossValidate(trainingDataView, trainingPipeline, numberOfFolds: 5, labelColumnName: "fare_amount");
            PrintRegressionFoldsAverageMetrics(crossValidationResults);
        }

        private static void CalculatePFI(MLContext mlContext, IDataView trainingDataView, IEstimator<ITransformer> trainingPipeline)
        {
            // Proposed overload: takes the untrained pipeline and returns results keyed by feature name
            Dictionary<string, RegressionMetricsStatistics> permutationFeatureImportance =
                mlContext
                .Regression
                .PermutationFeatureImportance(trainingPipeline, trainingDataView, permutationCount: 1, labelColumnName: "fare_amount");

            Console.WriteLine("Feature\tPFI");
            foreach (KeyValuePair<string, RegressionMetricsStatistics> entry in permutationFeatureImportance)
            {
                Console.WriteLine($"{entry.Key}\t{entry.Value.RSquared.Mean:F6}");
            }
        }

        private static void SaveModel(MLContext mlContext, ITransformer mlModel, string modelRelativePath, DataViewSchema modelInputSchema)
        {
            // Save/persist the trained model to a .ZIP file
            Console.WriteLine($"=============== Saving the model  ===============");
            mlContext.Model.Save(mlModel, modelInputSchema, GetAbsolutePath(modelRelativePath));
            Console.WriteLine("The model is saved to {0}", GetAbsolutePath(modelRelativePath));
        }

        public static string GetAbsolutePath(string relativePath)
        {
            FileInfo _dataRoot = new FileInfo(typeof(Program).Assembly.Location);
            string assemblyFolderPath = _dataRoot.Directory.FullName;

            string fullPath = Path.Combine(assemblyFolderPath, relativePath);

            return fullPath;
        }
    }
}

Alternative Designs

If there is any opposition to, or technical challenge in, making PFI have a similar API to CrossValidate ... I'm open to alternatives, but I don't know the ML.NET APIs well enough to come up with other patterns.

Risks

I think the biggest risk/challenge is that folks can do a lot of things with pipelines and models to make them incompatible with PFI. I believe PFI takes considerably longer to calculate as the number of feature columns grows, and transforms like OneHotHashEncoding can create hundreds of columns ...
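
To make the cost concern concrete, here is a minimal sketch of permutation feature importance on a toy model. It is not ML.NET's implementation: `ToyPfi`, the MSE-only metric, and the toy data are all assumptions made for illustration. In this sketch each feature column is shuffled and re-scored `permutationCount` times, so the work is one full pass over the data per feature per permutation, which is why hundreds of one-hot slots get expensive.

```csharp
using System;
using System.Linq;

public static class ToyPfi
{
    // Mean squared error of the model over all rows.
    static double Mse(Func<double[], double> model, double[][] rows, double[] labels) =>
        rows.Select((r, i) => Math.Pow(model(r) - labels[i], 2)).Average();

    // Returns, per feature, the average increase in MSE after shuffling
    // that feature's column. Makes featureCount * permutationCount
    // scoring passes over the data, plus one baseline pass.
    public static double[] Importance(
        Func<double[], double> model, double[][] rows, double[] labels,
        int permutationCount = 1, int seed = 1)
    {
        var rng = new Random(seed);
        double baseline = Mse(model, rows, labels);
        int featureCount = rows[0].Length;
        var importance = new double[featureCount];

        for (int f = 0; f < featureCount; f++)
        {
            for (int p = 0; p < permutationCount; p++)
            {
                // Fisher-Yates shuffle of the row indices.
                int[] order = Enumerable.Range(0, rows.Length).ToArray();
                for (int i = order.Length - 1; i > 0; i--)
                {
                    int j = rng.Next(i + 1);
                    (order[i], order[j]) = (order[j], order[i]);
                }

                // Copy the data, replacing only column f with shuffled values.
                double[][] permuted = rows.Select((r, i) =>
                {
                    var copy = (double[])r.Clone();
                    copy[f] = rows[order[i]][f];
                    return copy;
                }).ToArray();

                importance[f] += (Mse(model, permuted, labels) - baseline) / permutationCount;
            }
        }
        return importance;
    }
}

public static class ToyPfiDemo
{
    public static void Main()
    {
        // Toy data: the label depends only on feature 0.
        double[][] rows = Enumerable.Range(0, 50)
            .Select(i => new double[] { i, (i * 7) % 11 }).ToArray();
        double[] labels = rows.Select(r => 2.0 * r[0]).ToArray();
        Func<double[], double> model = r => 2.0 * r[0]; // ignores feature 1

        double[] importance = ToyPfi.Importance(model, rows, labels, permutationCount: 3);
        for (int f = 0; f < importance.Length; f++)
            Console.WriteLine($"feature {f}: MSE increase {importance[f]:F3}");
    }
}
```

Because the toy model ignores feature 1, shuffling that column leaves every prediction unchanged and its importance is exactly zero, while feature 0 shows a positive error increase.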

@JakeRadMSFT changed the title from "(WIP) API Proposal: Update PFI API to be easier to use" to "API Proposal: Update PFI API to be easier to use" Feb 16, 2021
@ericstj
Member

ericstj commented Feb 16, 2021

Changing return type is a binary breaking change. Is it possible to make this change additive instead?

@ericstj added the API (Issues pertaining the friendly API), API breaking change (The change(s) required for this issue may break the current API), and enhancement (New feature or request) labels Feb 16, 2021
@JakeRadMSFT
Contributor Author

JakeRadMSFT commented Feb 16, 2021

I'm not sure the old one is really worth keeping ... if there is ever a good time to deprecate it .... but it could be an overload instead of replacing it.

@ericstj
Member

ericstj commented Feb 16, 2021

I'm not sure the old one is really worth keeping

We can't take that position. Someone might have gone through the trouble to get the old one working. Is adding this new method worth breaking that customer? What if they cannot recompile the component that used this API?

You can't overload on return type; we'd need to give it a different name.

@JakeRadMSFT
Contributor Author

It has different parameters as well

Microsoft.ML.ISingleFeaturePredictionTransformer<TModel> predictionTransformer

->

Microsoft.ML.IEstimator<Microsoft.ML.ITransformer> estimator

@JakeRadMSFT
Contributor Author

We can't take that position. Someone might have gone through the trouble to get the old one working. Is adding this new method worth breaking that customer? What if they cannot recompile the component that used this API?

Totally ... but if we ever V2 ... is it worth keeping then?

@ericstj
Member

ericstj commented Feb 17, 2021

We have some guidance now for how to better obsolete members. I'd expect us to follow that: https://github.com/dotnet/designs/blob/main/accepted/2020/better-obsoletion/better-obsoletion.md

It has different parameters as well

Perfect! That allows for a compatible addition.
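
The compatibility point can be seen in plain C#. Two methods that differ only in return type do not compile (error CS0111), but an overload with a different parameter type is free to return something else, and existing call sites keep binding to the original method. The types below (`Trained`, `Pipeline`, `Catalog`) are hypothetical stand-ins, not ML.NET types, and the returned values are placeholders.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for a trained model and an untrained pipeline.
public sealed class Trained { }
public sealed class Pipeline { }

public static class Catalog
{
    // Existing method: results are positional.
    public static double[] Importance(Trained model) => new[] { 0.0, 0.0 };

    // Additive overload: the parameter type differs, so a different
    // return type is allowed and old call sites are unaffected.
    public static Dictionary<string, double> Importance(Pipeline pipeline) =>
        new Dictionary<string, double> { ["f0"] = 0.0, ["f1"] = 0.0 };

    // Not allowed: differs from the first method only by return type (CS0111).
    // public static Dictionary<string, double> Importance(Trained model) => ...;
}

public static class OverloadDemo
{
    public static void Main()
    {
        double[] byIndex = Catalog.Importance(new Trained());
        Dictionary<string, double> byName = Catalog.Importance(new Pipeline());
        Console.WriteLine($"{byIndex.Length} positional, {byName.Count} named");
    }
}
```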

@ericstj removed the API breaking change (The change(s) required for this issue may break the current API) label May 12, 2021
@ericstj
Member

ericstj commented May 12, 2021

cc @michaelgsharp @eerhardt what do you think about the feasibility of this addition?

@eerhardt
Member

I like the idea of having more convenient return information, i.e. the Dictionary<string, RegressionMetricsStatistics> permutationFeatureImportance that automatically maps columns to their statistics, without forcing users to know how to do the mapping themselves.

The only concern I have is changing the input to be an IEstimator<ITransformer>. Doing this would mean you need to re-train your pipeline during the call to PermutationFeatureImportance, which isn't something that PFI does today.

Maybe an alternate solution would be to create another convenience API that gets the ISingleFeaturePredictionTransformer from a trained pipeline. Basically, a fully correct version of what this code is trying to do:

// Option 1: to extract predictor, requires to know the type in advance:
// var predictor = ((TransformerChain<RegressionPredictionTransformer<LightGbmRegressionModelParameters>>)mlModel).LastTransformer;

// Option 2: Should work always, as long as you _know_ the predictor is the last transformer in the chain.
var predictor = ((IEnumerable<ITransformer>)mlModel).Last();

That would allow users to easily get the input they need to pass into PermutationFeatureImportance.
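
One possible shape for such a convenience helper is a search that walks a possibly nested chain from the end and returns the last step of a requested type, so the caller does not need to know where the predictor sits. The sketch below is hypothetical and uses stand-in types (`IStep`, `IPredictor`, `Chain`) so that it runs without ML.NET. Applying it to a real model, for example as `FindLast<ISingleFeaturePredictionTransformer<object>>(mlModel)`, rests on two untested assumptions: that the trained chain enumerates its transformers (as the Option 2 cast above relies on) and that the interface is covariant in its model type.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class ChainSearch
{
    // Returns the last element of type T in a possibly nested chain,
    // or null if there is none. A node is treated as a chain when it
    // enumerates reference-type children.
    public static T FindLast<T>(object node) where T : class
    {
        if (node is IEnumerable<object> children)
        {
            foreach (object child in children.Reverse())
            {
                T hit = FindLast<T>(child);
                if (hit != null) return hit;
            }
            return null;
        }
        return node as T;
    }
}

// Stand-in types for illustration; these are not ML.NET types.
public interface IStep { }
public interface IPredictor : IStep { string Name { get; } }
public sealed class Featurizer : IStep { }

public sealed class Predictor : IPredictor
{
    public string Name { get; }
    public Predictor(string name) { Name = name; }
}

public sealed class Chain : IStep, IEnumerable<IStep>
{
    private readonly IStep[] _steps;
    public Chain(params IStep[] steps) { _steps = steps; }
    public IEnumerator<IStep> GetEnumerator() => ((IEnumerable<IStep>)_steps).GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public static class ChainSearchDemo
{
    public static void Main()
    {
        // The predictor is nested and is not the final step of the outer chain.
        var model = new Chain(
            new Featurizer(),
            new Chain(new Featurizer(), new Predictor("inner")),
            new Featurizer());

        IPredictor predictor = ChainSearch.FindLast<IPredictor>(model);
        Console.WriteLine(predictor?.Name ?? "(none)");
    }
}
```

Searching by type rather than taking the last element covers pipelines where a post-processing step follows the trainer, which is the case where Option 2 breaks.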

@JakeRadMSFT
Contributor Author

@eerhardt
I trust your guidance - the key issue is that it's hard to use. This was just the proposal I came up with.

Re: convenient return information
@michaelgsharp - is the reason this is so awkward today the same as the classification thing we talked about ... where there just isn't "Dictionary" support in IDataView?

@JakeRadMSFT
Contributor Author

Thanks @houghj16!!

@dotnet locked as resolved and limited conversation to collaborators Mar 17, 2022