
FEA Add variable importance to linear models #21170

Open
lorentzenchr opened this issue Sep 27, 2021 · 8 comments

Comments

@lorentzenchr
Member

lorentzenchr commented Sep 27, 2021

Describe the workflow you want to enable

I'd like to have a feature importance method native to linear models (without L1 penalty) that is calculated on the training set:

clf = LogisticRegression(with_importance=True)
clf.fit(X, y)
clf.feature_importances_  # or some nice plot thereof

Describe your proposed solution

New proposal

Evaluate whether the LMG measure (Lindeman, Merenda and Gold, see [1, 2]) is applicable and feasible for L2 penalized regression and for GLMs. Otherwise, consider the other measures of [1, 2].

In short, LMG is a Shapley value decomposition of the R² among the features.

References:

1. Grömping, U. (2007). "Estimators of Relative Importance in Linear Regression Based on Variance Decomposition." The American Statistician, 61(2), 139-147.
2. Grömping, U. (2015). "Variable Importance in Regression Models." WIREs Computational Statistics, 7(2), 137-152.
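
A brute-force sketch of what the LMG computation could look like for unpenalized OLS; lmg_importance and r2_of_subset are hypothetical names, not an existing scikit-learn API, and the cost grows exponentially with the number of features:

from itertools import combinations
from math import factorial

import numpy as np
from sklearn.linear_model import LinearRegression

def r2_of_subset(X, y, subset):
    """R² of an OLS fit restricted to a feature subset (empty subset -> 0)."""
    if not subset:
        return 0.0
    cols = list(subset)
    return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)

def lmg_importance(X, y):
    """Shapley value decomposition of the full-model R² over the features."""
    n_features = X.shape[1]
    importances = np.zeros(n_features)
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        for size in range(n_features):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n_features - 1 - size) / factorial(n_features)
            for subset in combinations(others, size):
                gain = r2_of_subset(X, y, subset + (j,)) - r2_of_subset(X, y, subset)
                importances[j] += weight * gain
    return importances  # entries sum to the full-model R²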

Original proposal

Compute the t-statistic of the coefficients

t[j] = coef[j] / std(coef[j])

and use its absolute value, i.e. |t|, as a measure of (in-sample) importance. For GLMs like logistic regression, see section 5.3 of https://arxiv.org/pdf/1509.09169.pdf for a formula for Var[coef].
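
A minimal sketch of this computation for unpenalized OLS with an intercept, assuming homoscedastic Gaussian noise; abs_t_statistics is a hypothetical helper name:

import numpy as np
from sklearn.linear_model import LinearRegression

def abs_t_statistics(X, y):
    """Return |t[j]| = |coef[j] / std(coef[j])| for an OLS fit with intercept."""
    n_samples, n_features = X.shape
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    # Unbiased noise variance estimate; the extra -1 accounts for the intercept.
    sigma2 = residuals @ residuals / (n_samples - n_features - 1)
    # Cov[coef] = sigma² * (Xc' Xc)^{-1}, with X centered to absorb the intercept.
    Xc = X - X.mean(axis=0)
    var_coef = sigma2 * np.diag(np.linalg.pinv(Xc.T @ Xc))
    return np.abs(model.coef_ / np.sqrt(var_coef))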

Describe alternatives you've considered, if relevant

Any general importance measure (permutation importance, SHAP values, ...) also works.
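
For reference, permutation importance already works with linear models through the existing scikit-learn API, e.g. on a held-out set:

from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge().fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)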

Additional context

Given the great and legitimate need for interpretability, I would favor having a native importance measure for linear models. Random Forests have their own native feature_importances_ with the warning

impurity-based feature importances can be misleading for high cardinality features (many unique values).

We could add a similar warning for collinear features like

feature importances can be misleading for collinear or high-dimensional features.

I guess, in the end, this is true for all feature importance measures, even for SHAP (see also our multicollinear example).

Prior discussions like #16802, #6773 and #13048 focused on p-values, which seem out of scope for scikit-learn for different reasons. I hope we can circumvent these reasons by focusing on feature importance only and not considering p-values.

@lorentzenchr
Member Author

@GaelVaroquaux @rth @NicolasHug @TomDLT friendly ping in case of interest as you've been involved in earlier issues.

@GaelVaroquaux
Member

I think that this is a very slippery slope: t-statistics are not well controlled outside of maximum-likelihood estimates.

Either people know what they are doing, and it's trivial to compute the above, or they don't, and they will misinterpret it (that's true of much of the model interpretation literature, that's been going around in circles for years because it is trying to give simple answers to problems that do not have a good solution in statistics).

I'm -1 on this line

@lorentzenchr
Member Author

We don't have to call it "t-statistic", just "native linear model feature importance".

@GaelVaroquaux Your arguments could be used against random forest feature importance, or even any feature importance measure. What do you propose instead for answering: "How important is feature X in your model? Could we drop it (for whatever good reasons, maybe it costs money)?"

I think we should have answers for the most simple, most taught and most trusted model class: linear models. I also think that the recent focus on model interpretation was very important for building trust in ML and for showing that predictive performance is not necessarily the most important thing. Admittedly, though, model interpretation might be a hard nut to crack on its own.

@glemaitre
Member

From the analysis that we did with the tree-based models, it seems we came to the understanding that there is no single good feature importance, but rather several feature importance methods, each with its pros and cons. I assume that the same can be said about linear models, e.g. permutation importance vs. weight-based importance.

Adding a default feature_importances_ to the linear models means that, by default, we legitimize a single method. I am not sure that this is the right thing to do, since it is somehow what we would like to move away from in the tree-based models.

So there is probably a choice of API to think about: native importance vs. helper functions to compute the importance. If we want to avoid legitimizing a particular feature importance, each model should provide a parameter or a method to compute a specific type of importance; however, we would still have some default feature importance. If we instead use helper functions, the choice of importance is user-specified. The issue, in this case, will be the integration with estimators that rely on the coef_ and feature_importances_ attributes, e.g. feature selector estimators. I think that we can build some machinery such that these methods take a model and a feature importance function.
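
For what it's worth, SelectFromModel already accepts a user-supplied importance_getter callable, which is close to the machinery sketched above (the lambda below is just an illustrative choice of importance):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)
selector = SelectFromModel(
    Ridge(),
    importance_getter=lambda est: np.abs(est.coef_),  # user-chosen importance
)
selector.fit(X, y)
print(selector.get_support())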

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 28, 2021 via email

@adrinjalali
Member

I think adding something specific to linear models to the inspection module would be a good way to not have it as easily accessible as .feature_importances_ and yet easy enough for people who want to use it.

@lorentzenchr lorentzenchr changed the title FEA Add variable importance to linear models via t-statistic FEA Add variable importance to linear models Oct 24, 2021
@ogrisel
Member

ogrisel commented Oct 28, 2021

I think part of the problem is providing a utility with a generic name such as "feature importance", which could imply that what we propose is "The Way" to assess the contributions of input features to a model.

Some of this problem would go away if we provide more specific names for different methods to compute local (per sample) and global (per dataset) "explanations" of model decisions.

For instance, we could provide a utility function to compute "feature effects" for linear models, decomposing the decision function of individual predictions as follows:

intercept                # baseline
+ coef_0 * X[i, 0]       # feature effect of feature 0 in the context of sample X[i]
+ coef_1 * X[i, 1]       # feature effect of feature 1 in the context of sample X[i]

This same function could then be aggregated across a dataset to compute a feature effect plot such as:

https://christophm.github.io/interpretable-ml-book/limo.html#effect-plot
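
A minimal sketch of such a utility for a fitted linear regressor; feature_effects is a hypothetical name, not an existing API:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

def feature_effects(model, X):
    """Per-sample effects with prediction(X[i]) == model.intercept_ + effects[i].sum()."""
    return np.asarray(X) * model.coef_  # shape (n_samples, n_features)

X, y = load_diabetes(return_X_y=True)
effects = feature_effects(Ridge().fit(X, y), X)
# A global effect plot would then summarize each column, e.g. one boxplot per feature.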

This would be similar to the request in #19294 to implement decision_path for (H)GBRT, which would allow us to compute individual and aggregate feature effects for those models and then present the results using plots such as:

feature effects / impacts for a decision on an individual sample (local explanation)


from: https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211

feature effect of a given feature computed on a test set (global explanation)


from: https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211

If we want this utility to reflect the uncertainty caused by the sampling of the training set and by the training procedure, we could cross_validate() the model with return_estimator=True and use the resulting set of models and their predictions on the respective validation sets to compute the above plots, using a dedicated from_cv_results method as is currently being drafted in #21211 in the context of calibration curves.
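
For instance, something along these lines (plotting omitted):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

X, y = load_diabetes(return_X_y=True)
cv_results = cross_validate(Ridge(), X, y, cv=5, return_estimator=True)
# The spread of the coefficients across folds reflects the sampling uncertainty.
coefs = np.array([est.coef_ for est in cv_results["estimator"]])
print(coefs.mean(axis=0), coefs.std(axis=0))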

@lorentzenchr
Member Author

lorentzenchr commented Dec 6, 2022

As someone said elsewhere on a different topic

... is very classic and used in many communities. People understand the meaning at a glance (even if the understanding is limited). I think that it is important that we support it.

I think the same applies here. 😏
