
What is "Slot" (PFI documentation suggestions) #5954

Open
torronen opened this issue Oct 3, 2021 · 6 comments
Labels
documentation Related to documentation of ML.NET enhancement New feature or request question Further information is requested
Comments

@torronen
Contributor

torronen commented Oct 3, 2021

I am running the new PFI API (main branch with #5934) for a FastTreeBinary loaded model created by AutoML API.

Main question:
I receive items like "Slot 48416" from MLContext.BinaryClassification.PermutationFeatureImportanceNonCalibrated().
I could not find documentation on how to interpret these items. What do they mean? I am stuck on this issue.

As I understand it, these come from the features vector, for slots without a name.
I am confused about why my features vector has these additional items, and how I can backtrack which original feature they belong to. I have about 2000 features in my dataset.

Side items / suggestions for documentation:
I noticed there is some logging code in PFI which seems to report the progress of PFI via a ProgressHeader, but I could not find documentation on how to read that progress:
pch.SetHeader(new ProgressHeader("processed slots"), e => e.SetProgress(0, processedCnt));

There is also another GitHub issue about the recommended values for permutationCount and numberOfExamplesToUse, and about estimating the running time. The running time seems to grow faster than O(n) in the number of examples, but I still have not adequately understood the source or the concept of PFI. It would also be useful to know whether increasing the number of examples or the number of permutations gives more accurate results. Do I understand correctly that accuracy increases until the number of permutations reaches the number of features? And that increasing the number of examples increases the chance that the dataset is adequately represented?

@michaelgsharp michaelgsharp added documentation Related to documentation of ML.NET question Further information is requested labels Oct 4, 2021
@michaelgsharp
Member

So the name "Slot 48416" just appears if there isn't a name for that slot/index in the feature vector column. That can happen for various reasons, such as the original column not having a name, but it's also very possible we aren't adding the names correctly.

I am interested in the fact that you only have about 2000 features, yet the feature column seems to end up with far more slots than that. Can you check the schema of that column in your pipeline and let me know what it says? We may be able to use this to track down whether there is a bug, or something we are missing when naming the slots. It's also possible everything is working as intended; we will just need more information to tell.

Running time certainly seems to be longer than O(n), though honestly I am not sure what it is. @justinormont may have a better understanding of the time required.

For your other questions at the end I will need to ask a few people. I am not the most familiar with how PFI itself actually works under the hood.

@justinormont
Contributor

justinormont commented Oct 5, 2021

Unnamed slots
There's a variety of transforms which fail to set their slot names.

Fixing on the ML․NET dev side -- This is an issue that should be fixed. Ideally, each transform would provide good names for each feature it creates. Alternatively, instead of fixing individual transforms, a less clean but easier fix is naming all slots in the concat transform alone, naming its output slots as {inputColumnName} Slot {i} (e.g. WeeklyLags Slot 23) for any previously unnamed slots. Ideally, the unnamed-slot names would only be calculated lazily (to reduce memory and model size).

Before a fix is in, you can backtrack the slot's purpose by looking at your concat transform. If it takes in columns {a, b, c} and produces column d, you can get the final size of each of {a, b, c} and calculate which original column and slot Slot 48416 in column d maps to. Slots in your output column d are simply the array concatenation of each of your input columns {a, b, c}, in the same order.
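That backtracking is simple offset arithmetic. A minimal sketch in Python (the column names and widths below are hypothetical; substitute the actual slot counts from your own pipeline's schema):

```python
# Map a slot index in a concatenated feature column back to its source column.
# Illustrative only -- not ML.NET code; the sizes are made-up placeholders.
def backtrack_slot(global_index, column_sizes):
    """column_sizes: ordered (column_name, slot_count) pairs, in the same
    order the concat transform consumed them."""
    offset = 0
    for name, size in column_sizes:
        if global_index < offset + size:
            return name, global_index - offset  # local index within that column
        offset += size
    raise IndexError("slot index exceeds total concatenated width")

# Hypothetical widths for input columns a, b, c of the concat transform:
sizes = [("a", 10_000), ("b", 40_000), ("c", 2_000)]
print(backtrack_slot(48416, sizes))  # -> ('b', 38416)
```

With these example widths, "Slot 48416" of the output column would be slot 38416 of input column b.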

Slow PFI
Runtime for PFI should be: O(numFeatures * permutationCount * (numberOfExamplesToUse * modelPredictionTimePerRow + metricCalculationTime)).

The modelPredictionTimePerRow generally grows with more features (numFeatures) and more rows of data. The metricCalculationTime is O(numberOfExamplesToUse * log(numberOfExamplesToUse)) for binary classification, due to a sort in the AUC; and O(numberOfExamplesToUse) for other tasks.

Linear model modelPredictionTimePerRow is O(numFeatures), which would make PFI O(numFeatures²).

Trees are a bit more complex for runtime; their modelPredictionTimePerRow is O(numTrees * log(numLeavesPerTree)), which are settable hyperparameters, though both in-turn tend to optimally grow with more features and rows.
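To make that growth concrete, here is a toy cost model following the big-O expression above (purely illustrative; the constants and per-row times are made up, not measured from ML.NET):

```python
# Toy PFI cost model: numFeatures * permutationCount *
#   (rows * modelPredictionTimePerRow + metricCalculationTime).
# Metric cost is O(n log n) for binary classification (AUC sort), O(n) otherwise.
import math

def pfi_cost(num_features, permutation_count, rows,
             predict_time_per_row, task="binary"):
    metric_time = rows * math.log2(rows) if task == "binary" else rows
    return num_features * permutation_count * (
        rows * predict_time_per_row + metric_time)

# For a linear model, predict_time_per_row ~ num_features, so doubling the
# feature count roughly quadruples the cost -- the O(numFeatures^2) behavior.
c1 = pfi_cost(1000, 1, 10_000, predict_time_per_row=1000)
c2 = pfi_cost(2000, 1, 10_000, predict_time_per_row=2000)
print(c2 / c1)  # close to 4
```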

Speeding up PFI
For trainers which report their model feature weights, like FastTree binary, I'd recommend using that instead of PFI to get the global feature importance. Example code: https://github.com/justinormont/ImgurClassifier/blob/975973cde0f2ed6c7290f718f4052334bd925e22/ImgurClassifier.ConsoleApp/Explainability.cs#L101-L126. PFI has the benefit of being available for all trainers.

To speed up PFI, you can use UseFeatureWeightFilter, which uses the above-mentioned model feature weights as a pre-filter (it's a NOOP for models not supporting feature weights). You can also limit numberOfExamplesToUse to use fewer rows (it defaults to all rows; note that if limited, only the first N rows are used, not a random sample, so you may want to pre-shuffle before PFI), and set permutationCount to one or a small number of rounds (defaults to 1).

@torronen
Contributor Author

torronen commented Jan 1, 2022

Global Feature Index works perfectly, thank you.

The features in PFI without a label seem to be categorical string values. GFI reports the names in the format "CityCode.HEL", i.e. columnName.Value.

@torronen
Contributor Author

torronen commented Mar 22, 2022

@michaelgsharp It seems this is what creates the high count of weights without names:

yield return OneHotHashEncodingExtension.CreateSuggestedTransform(Context, catHashColumnsNewArr, catHashColumnsNewArr);

I commented out the part below, and started getting problems with GFI. In debug inspection I noted the count of weights is much higher than the count of slot names. I am using the sample, with slight modifications, from @justinormont's link above.

if (column.Dimensions.Cardinality != null && column.Dimensions.Cardinality < 100)
{
foundCat = true;
catColumnsNew.Add(column.ColumnName);
}
else

lastTransformer.Model.SubModel.GetFeatureWeights(ref weights); gives a very high count of items (in the last dataset, something like 200k).

output.Schema["Features"].GetSlotNames(ref slotNames); still gives the expected count, in this case 7000.

I have not looked into this further yet; I need to complete the main task first. So it is possible I am misunderstanding something here.

@justinormont
Contributor

@torronen One-hot hashing transform has the option of creating slot names:

[Argument(ArgumentType.AtMostOnce,
    HelpText = "Limit the number of keys used to generate the slot name to this many. 0 means no invert hashing, -1 means no limit.",
    ShortName = "ih")]
public int MaximumNumberOfInverts = OneHotHashEncodingEstimator.Defaults.MaximumNumberOfInverts;

When AutoML creates a one-hot hashing transform, it is not using the MaximumNumberOfInverts parameter:

return context.Transforms.Categorical.OneHotHashEncoding(cols);

The default of MaximumNumberOfInverts is 0, which disables the creation of slot names for one-hot hashing. This default is useful as it otherwise increases the model size.

One-hot hashing is used when the cardinality of the column is large; standard one-hot is used for lower cardinalities:

if (column.Dimensions.Cardinality != null && column.Dimensions.Cardinality < 100)
{
foundCat = true;
catColumnsNew.Add(column.ColumnName);
}
else
{
foundCatHash = true;
catHashColumnsNew.Add(column.ColumnName);
}

The slot names are created as: slotNames[HASH(str) % hashBucketLength] += (str + "|") (pseudocode)

Multiple strings can map to the same hash bucket, giving a slot name like cat|dog|fish. Importantly, many buckets will never have a value hashed into them from the training dataset. Therefore, even with MaximumNumberOfInverts set, many slots will not have a corresponding slot name.
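A small Python sketch of that slot-naming scheme, showing both collisions and the empty buckets (the hash function here is a toy stand-in, not ML.NET's actual hash):

```python
# Sketch of invert-hashing slot names:
#   slotNames[HASH(str) % hashBucketLength] += str + "|"
# Illustrative only -- the hash function is a placeholder.
def build_slot_names(values, hash_bucket_length):
    toy_hash = lambda s: sum(s.encode())  # stand-in hash, not ML.NET's
    slot_names = [""] * hash_bucket_length
    for v in values:
        slot_names[toy_hash(v) % hash_bucket_length] += v + "|"
    return slot_names

names = build_slot_names(["cat", "dog", "fish"], 8)
# One bucket collects colliding values ("dog|fish|"); most buckets stay
# empty -- those are the slots that end up with no name at all.
print(names)
```

With this toy hash, "dog" and "fish" collide into one bucket, and six of the eight buckets receive no name.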

Ideally, any empty slot names would be auto-created lazily (as mentioned above) and filled in. This would require a fix to ML․NET.

Instead of using one-hot hashing, if you use the standard one-hot transform, it will produce a slot name for each slot.

@michaelgsharp
Member

When you say created lazily, do you mean we would figure out which column the slot originally came from? If not, how would it work, since we won't know what was hashed to reach that slot originally? Right now, for PFI (the new APIs), if the slot name isn't known it just fills in "Slot X".

@michaelgsharp michaelgsharp added this to the ML.NET Future milestone Apr 11, 2022
@michaelgsharp michaelgsharp added the enhancement New feature or request label Apr 11, 2022