
Enhancement for partial dependence plot #14969

Closed · 3 tasks done · glemaitre opened this issue Sep 12, 2019 · 14 comments · Fixed by #18298

@glemaitre (Member) commented Sep 12, 2019

The partial dependence plot function could be improved.

glemaitre added this to TO DO in Guillaume's pet on Sep 12, 2019
@thomasjpfan (Member)

Is there literature on using bootstrap samples to calculate partial dependence?

This is kind of similar to https://github.com/AustinRochford/PyCEbox, which shows all curves (before the mean).

@glemaitre (Member, Author)

> Is there literature on using bootstrap samples to calculate partial dependence?

No, it was just inspired by what seaborn exposes when you make plots.

@glemaitre (Member, Author)

> This is kind of similar to https://github.com/AustinRochford/PyCEbox, which shows all curves (before the mean).

This one is ICE (individual conditional expectation), which plots one curve per sample in the dataset. It is something we could implement as well, but it should be a separate function.
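
For context, a minimal sketch of what ICE computes, assuming a fitted estimator and a NumPy array X (ice_curves is a hypothetical helper, not scikit-learn API):

import numpy as np

def ice_curves(estimator, X, feature, grid):
    # One ICE curve per sample: fix the target feature to each grid value
    # in turn and record the prediction for every sample.
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value
        curves[:, j] = estimator.predict(X_mod)
    return curves  # averaging over the rows recovers the PD curve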

@cmarmo (Member) commented Oct 24, 2019

For the ICE discussion, see #14126.

@madhuracj (Contributor)

@glemaitre I would like to contribute to "Add support for categorical columns (bar plot)".
However, I have some doubts about the API. Assuming that a categorical feature is one-hot encoded, it will be scattered across multiple columns, and I wonder how we can adapt the features parameter of plot_partial_dependence() to specify this list of columns without cluttering it.
As of now, the features parameter is:

features : list of {int, str, pair of int, pair of str}
    The target features for which to create the PDPs.
    If `features[i]` is an integer or a string, a one-way PDP is created;
    if `features[i]` is a tuple, a two-way PDP is created (only supported
    with `kind='average'`). Each tuple must be of size 2.
    If any entry is a string, then it must be in ``feature_names``.
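
For reference, a minimal sketch of that parameter in use, with the plot_partial_dependence API as it existed at the time:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import plot_partial_dependence

X, y = load_diabetes(return_X_y=True)
est = GradientBoostingRegressor().fit(X, y)
# One-way PDPs for features 0 and 2, plus a two-way PDP for the pair (0, 2)
plot_partial_dependence(est, X, features=[0, 2, (0, 2)])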

@glemaitre (Member, Author)

I was recently thinking about this feature and I don't have a clear picture yet. I would like to assume that someone is passing a pipeline containing a ColumnTransformer, and therefore I would not assume that the categories are already one-hot encoded. In this case, features would not be an issue. However, we need to know which columns are categorical, and I think we have a couple of approaches:

  1. with a dataframe, we could infer the categorical columns from their dtype (sketched below);
  2. inspect the pipeline, looking for known encoders (e.g. one-hot or ordinal), to find the corresponding columns;
  3. add a new parameter categories where one can provide the column indices/names to be considered categorical.

To be honest, I have little faith in 1. and 2. (even if I would like 1. to work :)).
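
As an illustration of approach 1 (infer_categorical_columns is a hypothetical helper, not existing scikit-learn API):

import pandas as pd

def infer_categorical_columns(X):
    # Hypothetical helper sketching approach 1: read the categorical
    # columns off a pandas DataFrame's dtypes; plain NumPy arrays carry
    # no such information, so we return nothing for them.
    if not isinstance(X, pd.DataFrame):
        return []
    return [col for col in X.columns
            if isinstance(X.dtypes[col], pd.CategoricalDtype)]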

> Assuming that a categorical feature is one-hot-encoded

In the case where the data have already been preprocessed, it becomes even trickier. Maybe we could use the categories parameter to group columns, but then I am not sure what to provide in features.

@jnothman @NicolasHug @thomasjpfan would you have any wise advice?

@NicolasHug (Member)

Here are my thoughts, following the notation from our user guide:

  • Categorical features that aren't target features (i.e. those in X_C) don't need special treatment; we can treat them like any other feature.
  • We only need to worry about categorical features that are in X_S. Let's keep it simple at first and only consider one-way PDPs for now, i.e. X_S = one single target feature.
  • The main difference between a categorical target feature and a continuous one is that the plot will be a bar plot instead of a continuous curve. I think it's just a matter of changing how we build the grid? To that end, we just need a simple is_categorical parameter of the same size as features: it indicates whether the corresponding target feature is categorical.
  • If a feature is marked as categorical, we just compute its categories with np.unique and use those for the grid.
  • I would say we don't need to worry about whether categories are one-hot encoded, or whether the data has been preprocessed: we should assume that X as passed to plot_partial_dependence() is the raw data that users would feed to their pipelines. The pipeline will internally use a OneHotEncoder (or whatever it needs), but we should treat all of this as a black box. In other words, we should assume the user is passing a non-one-hot-encoded X and that estimator is a pipeline with the OneHotEncoder inside. I think this is what Guillaume was saying as well.

For example:

# Note: is_categorical is the proposed parameter; it does not exist yet.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = ...  # features 0 and 3 are categorical, features 1, 2, 4 are continuous
ct = ColumnTransformer([('ohe', OneHotEncoder(), [0, 3])], remainder='passthrough')
lr = LinearRegression()
pipe = make_pipeline(ct, lr)
pipe.fit(X, y)
plot_partial_dependence(pipe, X, features=(0, 2), is_categorical=(True, False))
# result: creates one bar plot for feature 0 and one continuous plot for feature 2

# Note: we don't want to support this use case, i.e. passing one-hot-encoded data:
plot_partial_dependence(lr, ct.transform(X), features=???, is_categorical=???)

@glemaitre (Member, Author)

OK, this looks neat as well. Having features and is_categorical be the same size seems the most intuitive.

@madhuracj (Contributor)

I agree, this specification looks neat. I suppose the default value for is_categorical can be None, in which case we keep the old behaviour, i.e. all features are continuous.

> # Note: we don't want to support this use case, i.e. passing one-hot-encoded data:
> plot_partial_dependence(lr, ct.transform(X), features=???, is_categorical=???)

Even when the dataset is one-hot encoded, we can still calculate partial dependence for the individual binary columns resulting from the encoding (see the sketch below).

I will send a PR soon if everyone is happy with that.
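
A minimal sketch of that point, assuming the fitted ct and lr from the snippet above and densifying in case the encoder produced a sparse matrix:

from sklearn.inspection import plot_partial_dependence

Xt = ct.transform(X)        # one-hot-encoded feature matrix
if hasattr(Xt, "toarray"):  # densify sparse encoder output
    Xt = Xt.toarray()
# Column 0 is a single binary indicator: its grid is just {0, 1}, so the
# plot degenerates to two points rather than a proper bar plot.
plot_partial_dependence(lr, Xt, features=[0])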

@glemaitre (Member, Author) commented Aug 26, 2020 via email

@madhuracj (Contributor)

@glemaitre Pull request (still a WIP): #18298

@glemaitre (Member, Author) commented Aug 31, 2020 via email

@amueller (Member)

Is there any literature about this? This feels more like an intervention and less like partial dependence.
It would certainly be useful to have something like it, but we should also make sure to stay in line with the standard phrasing and literature on the topic.

@discdiver (Contributor)

Inferring feature_names, supporting categorical columns (bar plot), and handling a pipeline with make_column_transformer will be awesome additions! Thank you!

glemaitre moved this from TO DO to IN PROGRESS in Guillaume's pet on Jul 16, 2021
glemaitre self-assigned this on Jan 26, 2022
jjerphan pushed a commit that referenced this issue Nov 25, 2022
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Closes #14969