PermutationFeatureImportance not working with AutoML API #5247

Closed
jacobthamblett opened this issue Jun 17, 2020 · 12 comments
Labels
AutoML.NET Automating various steps of the machine learning process P3 Doc bugs, questions, minor issues, etc. question Further information is requested

Comments

@jacobthamblett

System information

  • Windows 10 Pro, build 18363
  • Visual Studio 2019 Professional
  • C# Console Application, in .NET Core 3.1

Issue

  • Using the AutoML API to generate a multiclass classification model from large network datasets stored in a CSV file. The model produced by the API gives accurate predictions, with reasonable results for the following metrics: MicroAccuracy, MacroAccuracy, LogLoss and LogLossReduction. However, finding out what feature selection the API implemented is proving impossible.

  • Following all the directions and documentation on implementing the PermutationFeatureImportance (PFI) method has had no success. It is possible to extract the pipeline from the AutoML BestRun model, and putting together the list of features in the custom class type it uses is not a problem either. However, there appears to be no LastTransformer attribute on the BestRun model produced by the API. According to the official documentation on how to run PFI on a multiclass model, this is one of the main hurdles.

  • Following the multiclass PFI documentation more exactly, by defining a new pipeline with a single multiclass classification algorithm, still throws an error. This is not ideal in any case, as the new pipeline does not necessarily match the one used by the AutoML API's model, which is the model the PFI metrics are needed for.

Source code / logs

Example code following Multiclassification PFI Implementation from ML.Net Documentation, using pipeline extracted from AutoML bestRun Model:

//PFI code attempting to extract and use pipeline from AutoML bestRun model
            var featureColumns =
                new string[] { "Unnamed0","FlowID","SourceIP","SourcePort","DestinationIP","DestinationPort","Protocol","Timestamp","FlowDuration","TotalFwdPackets","TotalBackwardPackets","TotalLengthofFwdPackets","TotalLengthofBwdPackets","FwdPacketLengthMax","FwdPacketLengthMin","FwdPacketLengthMean","FwdPacketLengthStd","BwdPacketLengthMax","BwdPacketLengthMin","BwdPacketLengthMean","BwdPacketLengthStd","FlowBytes","FlowPackets","FlowIATMean","FlowIATStd","FlowIATMax","FlowIATMin","FwdIATTotal","FwdIATMean","FwdIATStd","FwdIATMax","FwdIATMin","BwdIATTotal","BwdIATMean","BwdIATStd","BwdIATMax","BwdIATMin","FwdPSHFlags","BwdPSHFlags","FwdURGFlags","BwdURGFlags","FwdHeaderLength","BwdHeaderLength","FwdPackets","BwdPackets","MinPacketLength","MaxPacketLength","PacketLengthMean","PacketLengthStd","PacketLengthVariance","FINFlagCount","SYNFlagCount","RSTFlagCount","PSHFlagCount","ACKFlagCount","URGFlagCount","CWEFlagCount","ECEFlagCount","DownUpRatio","AveragePacketSize","AvgFwdSegmentSize","AvgBwdSegmentSize","FwdHeaderLength1","FwdAvgBytesBulk","FwdAvgPacketsBulk","FwdAvgBulkRate","BwdAvgBytesBulk","BwdAvgPacketsBulk","BwdAvgBulkRate","SubflowFwdPackets","SubflowFwdBytes","SubflowBwdPackets","SubflowBwdBytes","Init_Win_bytes_forward","Init_Win_bytes_backward","act_data_pkt_fwd","min_seg_size_forward","ActiveMean","ActiveStd","ActiveMax","ActiveMin","IdleMean","IdleStd","IdleMax","IdleMin","SimillarHTTP","Inbound" };

            // Fit the pipeline to the data.
            var PFI_model = bestRun.Estimator.Fit(trainDataView);

            // Transform the dataset.
            var transformedData = PFI_model.Transform(trainDataView);

            // Extract the predictor.
            var linearPredictor = PFI_model.LastTransformer;

            // Compute the permutation metrics for the linear model using the
            // normalized data.
            var permutationMetrics = mlContext.MulticlassClassification
                .PermutationFeatureImportance(linearPredictor, transformedData,
                permutationCount: 30);

            // Now let's look at which features are most important to the model
            // overall. Get the feature indices sorted by their impact on
            // microaccuracy.
            var sortedIndices = permutationMetrics
                .Select((metrics, index) => new { index, metrics.MicroAccuracy })
                .OrderByDescending(feature => Math.Abs(feature.MicroAccuracy.Mean))
                .Select(feature => feature.index);

            Console.WriteLine("Feature\tChange in MicroAccuracy\t95% Confidence in "
                + "the Mean Change in MicroAccuracy");

            var microAccuracy = permutationMetrics.Select(x => x.MicroAccuracy)
                .ToArray();

            foreach (int i in sortedIndices)
            {
                Console.WriteLine("{0}\t{1:G4}\t{2:G4}",
                    featureColumns[i],
                    microAccuracy[i].Mean,
                    1.96 * microAccuracy[i].StandardError);
            }

Error Produced: Severity Code Description Project File Line Suppression State
Error CS1061 'ITransformer' does not contain a definition for 'LastTransformer' and no accessible extension method 'LastTransformer' accepting a first argument of type 'ITransformer' could be found (are you missing a using directive or an assembly reference?)

Example code also following Multiclassification PFI Implementation from ML.Net Documentation, using a newly created pipeline and single multiclassification algorithm:

//PFI code with new pipeline using single multiclassification algorithm
            var featureColumns =
                new string[] { "Unnamed0","FlowID","SourceIP","SourcePort","DestinationIP","DestinationPort","Protocol","Timestamp","FlowDuration","TotalFwdPackets","TotalBackwardPackets","TotalLengthofFwdPackets","TotalLengthofBwdPackets","FwdPacketLengthMax","FwdPacketLengthMin","FwdPacketLengthMean","FwdPacketLengthStd","BwdPacketLengthMax","BwdPacketLengthMin","BwdPacketLengthMean","BwdPacketLengthStd","FlowBytes","FlowPackets","FlowIATMean","FlowIATStd","FlowIATMax","FlowIATMin","FwdIATTotal","FwdIATMean","FwdIATStd","FwdIATMax","FwdIATMin","BwdIATTotal","BwdIATMean","BwdIATStd","BwdIATMax","BwdIATMin","FwdPSHFlags","BwdPSHFlags","FwdURGFlags","BwdURGFlags","FwdHeaderLength","BwdHeaderLength","FwdPackets","BwdPackets","MinPacketLength","MaxPacketLength","PacketLengthMean","PacketLengthStd","PacketLengthVariance","FINFlagCount","SYNFlagCount","RSTFlagCount","PSHFlagCount","ACKFlagCount","URGFlagCount","CWEFlagCount","ECEFlagCount","DownUpRatio","AveragePacketSize","AvgFwdSegmentSize","AvgBwdSegmentSize","FwdHeaderLength1","FwdAvgBytesBulk","FwdAvgPacketsBulk","FwdAvgBulkRate","BwdAvgBytesBulk","BwdAvgPacketsBulk","BwdAvgBulkRate","SubflowFwdPackets","SubflowFwdBytes","SubflowBwdPackets","SubflowBwdBytes","Init_Win_bytes_forward","Init_Win_bytes_backward","act_data_pkt_fwd","min_seg_size_forward","ActiveMean","ActiveStd","ActiveMax","ActiveMin","IdleMean","IdleStd","IdleMax","IdleMin","SimillarHTTP","Inbound" };

            var pipeline = mlContext.Transforms
                .Concatenate("Features", featureColumns)
                .Append(mlContext.Transforms.Conversion.MapValueToKey("Label"))
                .Append(mlContext.Transforms.NormalizeMinMax("Features"))
                .Append(mlContext.MulticlassClassification.Trainers
                .SdcaMaximumEntropy());

            // Fit the pipeline to the data.
            var PFI_model = pipeline.Fit(trainDataView);

            // Transform the dataset.
            var transformedData = PFI_model.Transform(trainDataView);

            // Extract the predictor.
            var linearPredictor = PFI_model.LastTransformer;

            // Compute the permutation metrics for the linear model using the
            // normalized data.
            var permutationMetrics = mlContext.MulticlassClassification
                .PermutationFeatureImportance(linearPredictor, transformedData,
                permutationCount: 30);

            // Now let's look at which features are most important to the model
            // overall. Get the feature indices sorted by their impact on
            // microaccuracy.
            var sortedIndices = permutationMetrics
                .Select((metrics, index) => new { index, metrics.MicroAccuracy })
                .OrderByDescending(feature => Math.Abs(feature.MicroAccuracy.Mean))
                .Select(feature => feature.index);

            Console.WriteLine("Feature\tChange in MicroAccuracy\t95% Confidence in "
                + "the Mean Change in MicroAccuracy");

            var microAccuracy = permutationMetrics.Select(x => x.MicroAccuracy)
                .ToArray();

            foreach (int i in sortedIndices)
            {
                Console.WriteLine("{0}\t{1:G4}\t{2:G4}",
                    featureColumns[i],
                    microAccuracy[i].Mean,
                    1.96 * microAccuracy[i].StandardError);
            }

Code builds but also fails at PFI_model definition:

System.ArgumentOutOfRangeException
HResult=0x80131502
Message=Schema mismatch for input column 'Features': expected vector or scalar of Single or Double, got Vector
Source=Microsoft.ML.Data
StackTrace:
at Microsoft.ML.Transforms.NormalizingEstimator.GetOutputSchema(SchemaShape inputSchema)
at Microsoft.ML.Data.EstimatorChain`1.GetOutputSchema(SchemaShape inputSchema)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)

@antoniovs1029
Member

antoniovs1029 commented Jun 18, 2020

Hi, @jacobthamblett .

So, I think I have the solution for your first attempt, although I'm not sure of how to fix your second attempt. Still, I think that if my suggestion for your first attempt works, it will be enough to fix your problem and close this issue. 😄

About your first attempt (using AutoML with PFI)
To have access to the .LastTransformer of your trained model, you'll need to cast it to TransformerChain, and then, to use the .LastTransformer as your linearPredictor for PFI you'll also need to make another cast to ISingleFeaturePredictionTransformer<object> as explained in this other issue. For your particular case, it'll look something like this:

            // Extract the predictor.
            var lastTransformer = ((TransformerChain<ITransformer>)PFI_model).LastTransformer;
            var linearPredictor = (ISingleFeaturePredictionTransformer<object>)lastTransformer;

            // Compute the permutation metrics for the linear model using the
            // normalized data.
            var permutationMetrics = mlContext.MulticlassClassification
                .PermutationFeatureImportance(linearPredictor, transformedData,
                permutationCount: 30);

I think that should solve your issue. (NOTE: In some rare cases you might not need to extract the LastTransformer, i.e. var linearPredictor = (ISingleFeaturePredictionTransformer<object>)PFI_model would be enough. This only happens if your PFI_model consists of a single transformer, which is highly unlikely, and I'm not even sure AutoML is capable of producing such a model.)

Also, notice that you don't need to retrain your bestRun.Estimator in order to get your model, since bestRun already has a member called bestRun.Model which has the trained model already. So you could simply use this:

var PFI_model = bestRun.Model; // No need to retrain the bestRun.Estimator.

However, you can train your bestRun.Estimator, just as you did in your sample, and it should work just as well, with the castings I mentioned.

About your second attempt (without AutoML)
I think it's not possible for us to find out what was wrong with this, unless you provide us with a full solution reproducing the error and a sample dataset for us to run.

But just by looking at the stacktrace, it looks like a problem with the NormalizingEstimator you're using (i.e. it's not a problem with AutoML or PFI). In particular, I think the problem is that some of the columns you're trying to concatenate are non-numerical, and/or that you need to add more estimators between the Concatenate and the Normalizing (or even before the Concatenate) to correctly preprocess your data before normalizing it.

Since you tried out this attempt only because your first attempt didn't work, I think it's ok not to look much into this other attempt, if my suggestion for the first attempt works for you 😄

@jacobthamblett
Author

Thanks @antoniovs1029 ,

Extracting the already trained model was my initial hope, however the example above was just following the documentation - perhaps another argument correlating with your documentation issue.

I'm not sure whether my solution falls into your highly unlikely category, but the above suggestion throws an InvalidCastException error:

System.InvalidCastException
  HResult=0x80004002
  Message=Unable to cast object of type 'Microsoft.ML.Transforms.KeyToValueMappingTransformer' to type 'Microsoft.ML.ISingleFeaturePredictionTransformer`1[System.Object]'.
  Source=ML-Net v1-0
  StackTrace:
   at ML_Net_v1_0.AutoBuildProgram.Build(Int32 trainingBatchID, Int32 testingBatchID) in ...

@antoniovs1029
Member

antoniovs1029 commented Jun 18, 2020

Judging by the exception message, it seems that AutoML is adding a KeyToValueMappingEstimator on top of your pipeline, so the .LastTransformer is a KeyToValueMappingTransformer and we can't extract the linearPredictor from that transformer... we need to find the "PredictionTransformer" inside the TransformerChain, to extract the linearPredictor from it (judging by your sample code, I think you're working with a Multiclassification model, so then you'll need to find the MulticlassPredictionTransformer<> inside your TransformerChain).

To do this, you'll need to do the following:

using System.Linq; // Add this to the file where you're trying to use PFI

....
// Cast your model into a TransformerChain:
TransformerChain<ITransformer> modelAsChain = (TransformerChain<ITransformer>)bestRun.Model;

// Convert your TransformerChain into an array of ITransformers
ITransformer[] transformersArray = modelAsChain.ToArray();

// Get the prediction transformer from the array
// I suspect it will be in the second-to-last position, so that's why I've used "transformersArray.Length - 2" as index
// But it might not be there, and you might need to use the debugger to see where in the Array your Prediction Transformer is
var predictionTransformer = transformersArray[transformersArray.Length - 2];

// Finally cast it to ISingleFeaturePredictionTransformer to use it for PFI
var linearPredictor = (ISingleFeaturePredictionTransformer<object>)predictionTransformer;

Please, let us know if this works for you. If it doesn't, please share the model you're trying to use with PFI. Thanks!

@jacobthamblett
Author

So I can find the MulticlassPredictionTransformer - it's inside a TransformerChain within the transformersArray at index 0:
image

Extracting the MulticlassPredictionTransformer, which appears to be the LastTransformer element within transformersArray[0], isn't easy, as there is no method to go further into the transformersArray past its first dimension.

Of course, setting var predictionTransformer = transformersArray[0]; produces the following error, as at that index location is another TransformerChain... Error included below:

System.InvalidCastException
  HResult=0x80004002
  Message=Unable to cast object of type 'Microsoft.ML.Data.TransformerChain`1[Microsoft.ML.ITransformer]' to type 'Microsoft.ML.ISingleFeaturePredictionTransformer`1[System.Object]'.
  Source=ML-Net v1-0
  StackTrace:
   at ...

@antoniovs1029
Member

antoniovs1029 commented Jun 19, 2020

I see. Sorry, I wasn't aware that AutoML was capable of returning this kind of model, with the prediction transformer so hidden inside the TransformerChain.

So, the goal is still the same: to get the linearPredictor from the prediction transformer. Now that you've identified that the prediction transformer is inside a transformer chain inside the transformersArray[0], then I think this will work:

using System.Linq; // Add this to the file where you're trying to use PFI

....

TransformerChain<ITransformer> modelAsChain = (TransformerChain<ITransformer>)bestRun.Model;
ITransformer[] transformersArray = modelAsChain.ToArray();

// Get the prediction transformer, since it is inside a Transformer Chain inside the Transformer Chain
// as you've shown in your capture,
// then you can get the predictionTransformer like this:
var predictionTransformer = ((TransformerChain<ITransformer>)transformersArray[0]).LastTransformer;

// Finally cast it to ISingleFeaturePredictionTransformer to use it for PFI
var linearPredictor = (ISingleFeaturePredictionTransformer<object>)predictionTransformer;
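
Since the nesting depth may differ from one AutoML run to another, a more general option is to search the chain recursively instead of hard-coding an index. The following is only a sketch under some assumptions: PredictorFinder and FindPredictor are hypothetical names (they are not part of ML.NET), nested chains are assumed to be enumerable as IEnumerable<ITransformer> (which TransformerChain<T> implements), and the model parameters are assumed to be a reference type so that the pattern match against ISingleFeaturePredictionTransformer<object> succeeds.

```csharp
using System.Collections.Generic;
using Microsoft.ML;

// Hypothetical helper, not part of ML.NET.
public static class PredictorFinder
{
    // Walks a (possibly nested) chain of transformers depth-first and returns
    // the first prediction transformer it finds, or null if there is none.
    public static ISingleFeaturePredictionTransformer<object> FindPredictor(ITransformer transformer)
    {
        // Case 1: this transformer is itself the prediction transformer.
        if (transformer is ISingleFeaturePredictionTransformer<object> predictor)
        {
            return predictor;
        }

        // Case 2: this transformer is a chain, so search each element in order.
        if (transformer is IEnumerable<ITransformer> chain)
        {
            foreach (ITransformer inner in chain)
            {
                var found = FindPredictor(inner);
                if (found != null)
                {
                    return found;
                }
            }
        }

        // Case 3: a plain transformer (e.g. KeyToValueMappingTransformer).
        return null;
    }
}
```

With this in place, var linearPredictor = PredictorFinder.FindPredictor(bestRun.Model); would replace the casts above; a null result means no prediction transformer was found and the model should be inspected in the debugger.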

@jacobthamblett
Author

Thanks @antoniovs1029 , that's great - it appears that your solution has extracted the linear prediction transformer from the transformer chain inside the array of transformer chains. However...

Still following the documentation on PFI multiclassification:

            var linearPredictor = (ISingleFeaturePredictionTransformer<object>)predictionTransformer;

            var permutationMetrics = mlContext.MulticlassClassification
                .PermutationFeatureImportance(linearPredictor, testDataView, permutationCount: 30);

            var sortedIndices = permutationMetrics
                .Select((metrics, index) => new { index, metrics.MicroAccuracy })
                .OrderByDescending(feature => Math.Abs(feature.MicroAccuracy.Mean))
                .Select(feature => feature.index);

            Console.WriteLine("Feature\tChange in MicroAccuracy\t95% Confidence in "
                + "the Mean Change in MicroAccuracy");

            var microAccuracy = permutationMetrics.Select(x => x.MicroAccuracy)
                .ToArray();

            var featureColumns =
                new string[] { "Unnamed0", "FlowID", "SourceIP", "SourcePort", "DestinationIP", "DestinationPort", "Protocol", "Timestamp", "FlowDuration", "TotalFwdPackets", "TotalBackwardPackets", "TotalLengthofFwdPackets", "TotalLengthofBwdPackets", "FwdPacketLengthMax", "FwdPacketLengthMin", "FwdPacketLengthMean", "FwdPacketLengthStd", "BwdPacketLengthMax", "BwdPacketLengthMin", "BwdPacketLengthMean", "BwdPacketLengthStd", "FlowBytes", "FlowPackets", "FlowIATMean", "FlowIATStd", "FlowIATMax", "FlowIATMin", "FwdIATTotal", "FwdIATMean", "FwdIATStd", "FwdIATMax", "FwdIATMin", "BwdIATTotal", "BwdIATMean", "BwdIATStd", "BwdIATMax", "BwdIATMin", "FwdPSHFlags", "BwdPSHFlags", "FwdURGFlags", "BwdURGFlags", "FwdHeaderLength", "BwdHeaderLength", "FwdPackets", "BwdPackets", "MinPacketLength", "MaxPacketLength", "PacketLengthMean", "PacketLengthStd", "PacketLengthVariance", "FINFlagCount", "SYNFlagCount", "RSTFlagCount", "PSHFlagCount", "ACKFlagCount", "URGFlagCount", "CWEFlagCount", "ECEFlagCount", "DownUpRatio", "AveragePacketSize", "AvgFwdSegmentSize", "AvgBwdSegmentSize", "FwdHeaderLength1", "FwdAvgBytesBulk", "FwdAvgPacketsBulk", "FwdAvgBulkRate", "BwdAvgBytesBulk", "BwdAvgPacketsBulk", "BwdAvgBulkRate", "SubflowFwdPackets", "SubflowFwdBytes", "SubflowBwdPackets", "SubflowBwdBytes", "Init_Win_bytes_forward", "Init_Win_bytes_backward", "act_data_pkt_fwd", "min_seg_size_forward", "ActiveMean", "ActiveStd", "ActiveMax", "ActiveMin", "IdleMean", "IdleStd", "IdleMax", "IdleMin", "SimillarHTTP", "Inbound" };

            foreach (int i in sortedIndices)
            {
                Console.WriteLine("{0}\t{1:G4}\t{2:G4}",
                    featureColumns[i],
                    microAccuracy[i].Mean,
                    1.96 * microAccuracy[i].StandardError);
            }

...I could be wrong, but it sounds like the definition of permutationMetrics variable with multiclassClassification requires a single feature from within the linearPredictor, whereas the linearPredictor extracted from the AutoML's bestRun model has a vector of features:

image

This is throwing the following error:

System.ArgumentOutOfRangeException
  HResult=0x80131502
  Message=Features column 'Feature' not found 
  Source=Microsoft.ML.Core
  StackTrace:
   at Microsoft.ML.Data.RoleMappedSchema.MapFromNames(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.RoleMappedSchema..ctor(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.PredictedLabelScorerBase.BindingsImpl.ApplyToSchema(DataViewSchema input, ISchemaBindableMapper bindable, IHostEnvironment env)
   at Microsoft.ML.Data.PredictedLabelScorerBase..ctor(IHostEnvironment env, PredictedLabelScorerBase transform, IDataView newSource, String registrationName)
   at Microsoft.ML.Data.MulticlassClassificationScorer..ctor(IHostEnvironment env, MulticlassClassificationScorer transform, IDataView newSource)
   at Microsoft.ML.Data.MulticlassClassificationScorer.ApplyToDataCore(IHostEnvironment env, IDataView newSource)
   at Microsoft.ML.Data.RowToRowScorerBase.ApplyToData(IHostEnvironment env, IDataView newSource)
   at Microsoft.ML.Data.PredictionTransformerBase`1.Transform(IDataView input)
   at Microsoft.ML.Transforms.PermutationFeatureImportance`3.GetImportanceMetricsMatrix(IHostEnvironment env, IPredictionTransformer`1 model, IDataView data, Func`1 resultInitializer, Func`2 evaluationFunc, Func`3 deltaFunc, String features, Int32 permutationCount, Boolean useFeatureWeightFilter, Nullable`1 topExamples)
   at Microsoft.ML.PermutationFeatureImportanceExtensions.PermutationFeatureImportance[TModel](MulticlassClassificationCatalog catalog, ISingleFeaturePredictionTransformer`1 predictionTransformer, IDataView data, String labelColumnName, Boolean useFeatureWeightFilter, Nullable`1 numberOfExamplesToUse, Int32 permutationCount)
   at ML_Net_v1_0.AutoBuildProgram.Build(Int32 trainingBatchID, Int32 testingBatchID) in ...

@antoniovs1029
Member

antoniovs1029 commented Jun 19, 2020

Hi, @jacobthamblett . Looking at the code you shared, the local variables, and the stacktrace, I think the problem might be that you need to transform the testDataView with the model you're trying to evaluate before passing it to PFI. As seen in the sample you've linked to, what is passed to the PFI method is transformedData. In your case it would be something like this:

            // Transform the dataset.
            var transformedData = PFI_model.Transform(testDataView);

            // Compute the permutation metrics for the linear model using the
            // normalized data.
            var permutationMetrics = mlContext.MulticlassClassification
                .PermutationFeatureImportance(linearPredictor, transformedData,
                permutationCount: 30);

This applies to Multiclass classification transformers, but also to the other prediction transformers that use PFI.

The error you got was because the "Features" column doesn't exist in your testDataView (which simply contains the columns read from your input data, and nothing more). In order for the Features column to exist, you actually need to apply the transformer that creates that column (which in your case is PFI_model).
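
A quick way to catch this before starting a long PFI run is to check that the data view actually exposes the column the predictor reads its features from. The snippet below is a minimal sketch: FeatureColumnCheck, HasColumn and EnsureFeatureColumn are hypothetical helper names, and it relies only on DataViewSchema.GetColumnOrNull and the FeatureColumnName property of the prediction transformer.

```csharp
using System;
using Microsoft.ML;

// Hypothetical helper, not part of ML.NET.
public static class FeatureColumnCheck
{
    // Returns true when the schema contains a column with the given name.
    public static bool HasColumn(DataViewSchema schema, string columnName)
    {
        return schema.GetColumnOrNull(columnName) != null;
    }

    // Throws early, with a readable message, if the data passed to PFI lacks
    // the column the predictor expects its features in.
    public static void EnsureFeatureColumn(
        ISingleFeaturePredictionTransformer<object> predictor, IDataView data)
    {
        string featureColumn = predictor.FeatureColumnName;
        if (!HasColumn(data.Schema, featureColumn))
        {
            throw new InvalidOperationException(
                $"Column '{featureColumn}' is missing; transform the data with the full model first.");
        }
    }
}
```

Calling FeatureColumnCheck.EnsureFeatureColumn(linearPredictor, transformedData) right before PermutationFeatureImportance should pass for the transformed data and throw for the raw testDataView.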

@antoniovs1029 antoniovs1029 added P3 Doc bugs, questions, minor issues, etc. question Further information is requested labels Jun 19, 2020
@jacobthamblett
Author

Thank-you @antoniovs1029 , that appears to all now be working.

The PFI code is now running using the AutoML's BestRun model. It is taking a long time for the PFI metrics to be produced, however, but I guess that's normal?

@antoniovs1029
Member

Glad to hear that you've solved your issue trying to get PFI running! 😄 I will close this issue now.

It's normal for PFI to take some time, as the algorithm permutes every column in the features vector, one at a time, and re-scores and re-evaluates the model on the dataset after each permutation. Since you're using many columns as features, I wouldn't be surprised if it takes a couple of hours, as it is a resource-intensive algorithm.

On the other hand, you could explore lowering the permutationCount parameter and/or the numberOfExamplesToUse parameter of PFI to get quicker (or not-so-slow) performance.

@jacobthamblett
Author

jacobthamblett commented Jun 26, 2020

Thanks @antoniovs1029 for the suggestions around permutationCount and numberOfExamplesToUse - I've set the PFI code to perform only 2 permutations, and use 10 examples from the test data, yet this never seems to complete execution.

The code compiles and runs, but seems to take forever to execute the following line:
var permutationMetrics = mlContext.MulticlassClassification.PermutationFeatureImportance(linearPredictor, testDataViewWithBestScore, labelColumnName: label_column, permutationCount: permutations_limit, numberOfExamplesToUse: examples_limit);

So far, it's taken just over 24 hours to run 2 permutations with a limit of 10 examples, and it has still not finished executing. I know PFI is resource intensive and can take a while, but should it really be this long?

UPDATE: The PFI code has now finished executing. It's taken roughly 34 hours to complete 2 permutations using 10 examples (this may have taken less time, but not sure what time it finished running over night). The results seem to output mostly, however there's an out of bounds error accessing the featureColumns array in the final for loop:

foreach (int i in sortedIndices)
{
    Console.WriteLine("\t{0}\t{1:G4}\t{2:G4}",
        featureColumns[i],
        microAccuracy[i].Mean,
        1.96 * microAccuracy[i].StandardError);
}

This should hopefully be a simple fix, I may just have to wait another 34 hours to get to that point in execution again.

@antoniovs1029
Member

antoniovs1029 commented Jun 26, 2020

Hi, @jacobthamblett . Sorry to hear that running PFI is taking so long. I'd recommend opening a new issue about how long PFI is taking, as that wouldn't be related to the problems you had with extracting the predictor from an AutoML model (which is the topic of this issue). This way others might be able to give their opinions if they've had a similar experience, and we can take a closer look at the problem.

If possible, on that other issue, please include the model you're trying to use with PFI. Simply save it to disk with:

mlContext.Model.Save(PFI_model, testDataView.Schema, @"C:\path\to\file\model.zip");

And share the model.zip with us. Describing the types of your columns in featureColumns would also be helpful.

But it would be even more helpful if you were able to provide a full repro of your scenario (including the code on how you're loading data, and your parameters for the AutoML experiment), and also including a sample dataset to actually run your code. This way we can have more information to decide if PFI taking so long is a problem with PFI, or with something else. But simply sharing your model might be good enough for us to start looking into the problem.

Now, I'm assuming you're using the same columns you showed us in your code (i.e. the var featureColumns = new string[]{...}), and I've counted 86 columns there. I've never personally worked with so many input columns and PFI, so I wouldn't know for sure... but taking 30+ hours to run PFI with only 2 permutations and 10 examples sounds like something is wrong.

I've had a similar issue only once, and it was because my pipeline included a TextFeaturizingEstimator, which created a Features vector with thousands of columns, which in turn caused PFI to take a long time to run. I cancelled it after the first hour, when I realized the TextFeaturizingEstimator was the cause, especially because with such an estimator the results of PFI wouldn't be useful anyway (there's no use in knowing which text n-grams had more "feature importance").

Given the "out of bounds" error you're getting after PFI is completed, my guess is that your model is actually using more features than the ones you provided on your input dataset (i.e. that the model actually adds more columns into what is then used as the Features vector). Given that it's taking so long for your PFI to run, I'd expect that there are many such columns that are being added (it could be a TextFeaturizingEstimator adding thousands of columns, but I think other transformers could also cause this situation).

One way to confirm that this is the problem is to take a "preview" of your model's output and inspect it (you can do this before calling PFI, so that you don't need to wait hours to confirm it):

var preview = PFI_model.Transform(testData).Preview();

Then inspect the preview object, look for the "Features" column in the Schema, and note the vector size:
image
On this capture, my Features vector is of size 9145 (it has that number of columns, and it's actually an example of the output produced by a simple TextFeaturizingEstimator).

Depending on your model, you'll also be able to see a descriptive name for each column inside your Features vector by looking into the "Annotations": search for the "SlotNames" annotation and then .Value._values (this might take a while to show if you have thousands of columns):

image

The capture tells me that Features[1] corresponds to the feature of the 3-gram "==r" as featurized by the TextFeaturizingEstimator.

Notice that the sample found in the official docs is somewhat of a toy example. Typically a model will add many more features than the ones present in your input, and the correct way to associate names with the output of PFI isn't to hardcode the column names in a string array, but to retrieve them from the SlotNames annotation.
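
As a rough illustration of retrieving the names from the SlotNames annotation, something along these lines could be used in place of the hardcoded featureColumns array. This is a sketch, not a definitive implementation: FeatureNames and GetFeatureNames are hypothetical names, the "Features" default is an assumption about the column name, and the "Slot{i}" fallback is just a placeholder convention for vectors that carry no slot names.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical helper, not part of ML.NET.
public static class FeatureNames
{
    // Returns one name per slot of the features vector, read from the
    // SlotNames annotation when it is present on the column.
    public static string[] GetFeatureNames(DataViewSchema schema, string featureColumnName = "Features")
    {
        DataViewSchema.Column column = schema[featureColumnName];

        if (column.HasSlotNames())
        {
            VBuffer<ReadOnlyMemory<char>> slotNames = default;
            column.GetSlotNames(ref slotNames);
            return slotNames.DenseValues().Select(name => name.ToString()).ToArray();
        }

        // No slot names available: fall back to index-based placeholder names.
        int size = (column.Type as VectorDataViewType)?.Size ?? 1;
        return Enumerable.Range(0, size).Select(i => $"Slot{i}").ToArray();
    }
}
```

Called as FeatureNames.GetFeatureNames(transformedData.Schema), the returned array should have as many entries as the Features vector has slots, so indexing it with the indices produced by PFI should not go out of bounds the way the 86-element hardcoded array did.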

My immediate advice, if this turns out to be the problem, would be to remove any non-numerical columns from your input columns (if there's a string column, then it's likely that AutoML is adding a TextFeaturizingEstimator to your pipeline). Working only with numerical columns would make it less likely that AutoML adds many more columns, I think.

If this doesn't work, and/or you're still having issues with how much time PFI takes, then please open a new issue with the requested information.

Thanks! 😄
