
Enhancement for partial dependence plot #14969

Closed · 3 tasks done · glemaitre opened this issue Sep 12, 2019 · 14 comments · Fixed by #18298

@glemaitre (Member) commented Sep 12, 2019

The partial dependence plot function could be improved.

glemaitre added this to TO DO in Guillaume's pet on Sep 12, 2019
@thomasjpfan (Member)

Is there literature on using bootstrap samples to calculate partial dependence?

This is kind of similar to https://github.com/AustinRochford/PyCEbox, which shows all curves (before the mean).

@glemaitre (Member, Author)

> Is there literature on using bootstrap samples to calculate partial dependence?

No, it was just inspired by what seaborn exposes when you make plots.

@glemaitre (Member, Author)

> This is kind of similar to https://github.com/AustinRochford/PyCEbox, which shows all curves (before the mean).

This one is ICE (individual conditional expectation), which plots one curve per sample in the dataset. It is something we could implement as well, but it should be a separate function.
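
For context, a minimal sketch of what ICE computes, assuming a fitted estimator and a NumPy array X (ice_curves is a hypothetical helper, not scikit-learn API):

import numpy as np

def ice_curves(estimator, X, feature, grid):
    # One ICE curve per sample: fix the target feature to each grid value
    # in turn and record the prediction for every sample.
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value
        curves[:, j] = estimator.predict(X_mod)
    return curves  # averaging over the rows recovers the PD curve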

@cmarmo (Member) commented Oct 24, 2019

For the ICE discussion, see #14126.

@madhuracj (Contributor)

@glemaitre I would like to contribute to "Add support for categorical columns (bar plot)".
However, I have some doubts about the API. Assuming that a categorical feature is one-hot encoded, it will be scattered across multiple columns, and I wonder how we can adapt the features parameter of plot_partial_dependence() to specify this list of columns without cluttering it.
As of now, the features parameter is:

features : list of {int, str, pair of int, pair of str}
    The target features for which to create the PDPs.
    If `features[i]` is an integer or a string, a one-way PDP is created;
    if `features[i]` is a tuple, a two-way PDP is created (only supported
    with `kind='average'`). Each tuple must be of size 2.
    If any entry is a string, then it must be in ``feature_names``.
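
For reference, a minimal sketch of that parameter in use, with the plot_partial_dependence API as it existed at the time:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import plot_partial_dependence

X, y = load_diabetes(return_X_y=True)
est = GradientBoostingRegressor().fit(X, y)
# One-way PDPs for features 0 and 2, plus a two-way PDP for the pair (0, 2)
plot_partial_dependence(est, X, features=[0, 2, (0, 2)])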

@glemaitre (Member, Author)

I was recently thinking about this feature and I don't have a clear picture yet. I would like to assume that someone is passing a pipeline containing a ColumnTransformer, and therefore I would not assume that the categories are already one-hot encoded. In this case, features would not be an issue. However, we need to know which columns are categorical, and I think we have a couple of approaches:

  1. with a dataframe, we could infer the categorical columns from their dtype (sketched below);
  2. inspect the pipeline, looking for known encoders (e.g. one-hot or ordinal), to find the corresponding columns;
  3. add a new parameter categories where one can provide the column indices/names to be considered categorical.

To be honest, I have little faith in 1. and 2. (even if I would like 1. to work :)).
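
As an illustration of approach 1 (infer_categorical_columns is a hypothetical helper, not existing scikit-learn API):

import pandas as pd

def infer_categorical_columns(X):
    # Hypothetical helper sketching approach 1: read the categorical
    # columns off a pandas DataFrame's dtypes; plain NumPy arrays carry
    # no such information, so we return nothing for them.
    if not isinstance(X, pd.DataFrame):
        return []
    return [col for col in X.columns
            if isinstance(X.dtypes[col], pd.CategoricalDtype)]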

> Assuming that a categorical feature is one-hot-encoded

In the case where the data have already been preprocessed, it becomes even trickier. Maybe we could use the categories parameter to group columns, but then I am not sure what to provide in features.

@jnothman @NicolasHug @thomasjpfan would you have any wise advice?

@NicolasHug (Member)

Here are my thoughts, following the notation from our user guide:

  • Categorical features that aren't target features (i.e. those in X_C) don't need special treatment; we can treat them like any other feature.
  • We only need to worry about categorical features that are in X_S. Let's keep it simple at first and only consider one-way PDPs for now, i.e. X_S = one single target feature.
  • The main difference between a categorical target feature and a continuous one is that the plot will be a bar plot instead of a continuous curve. I think it's just a matter of changing how we build the grid? To that end, we just need a simple is_categorical parameter of the same size as features: it indicates whether the corresponding target feature is categorical.
  • If a feature is marked as categorical, we just compute its categories with np.unique and use those for the grid.
  • I would say we don't need to worry about whether categories are one-hot encoded, or whether the data has been preprocessed: we should assume that X as passed to plot_partial_dependence() is the raw data that users would feed to their pipelines. The pipeline will internally use a OneHotEncoder (or whatever it needs), but we should treat all of this as a black box. In other words, we should assume the user is passing a non-one-hot-encoded X and that estimator is a pipeline with the OneHotEncoder inside. I think this is what Guillaume was saying as well.

For example:

# Note: is_categorical is the proposed parameter; it does not exist yet.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = ...  # features 0 and 3 are categorical, features 1, 2, 4 are continuous
ct = ColumnTransformer([('ohe', OneHotEncoder(), [0, 3])], remainder='passthrough')
lr = LinearRegression()
pipe = make_pipeline(ct, lr)
pipe.fit(X, y)
plot_partial_dependence(pipe, X, features=(0, 2), is_categorical=(True, False))
# result: creates one bar plot for feature 0 and one continuous plot for feature 2

# Note: we don't want to support this use case, i.e. passing one-hot-encoded data:
plot_partial_dependence(lr, ct.transform(X), features=???, is_categorical=???)

@glemaitre (Member, Author)

OK, this looks neat as well. Having features and is_categorical be the same size seems the most intuitive.

@madhuracj (Contributor)

I agree, this specification looks neat. I suppose the default value for is_categorical can be None, in which case we keep the old behaviour, i.e. all features are continuous.

> # Note: we don't want to support this use case, i.e. passing one-hot-encoded data:
> plot_partial_dependence(lr, ct.transform(X), features=???, is_categorical=???)

Even when the dataset is one-hot encoded, we can still calculate partial dependence for the individual binary columns resulting from the encoding (see the sketch below).

I will send a PR soon if everyone is happy with that.
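
A minimal sketch of that point, assuming the fitted ct and lr from the snippet above and densifying in case the encoder produced a sparse matrix:

from sklearn.inspection import plot_partial_dependence

Xt = ct.transform(X)        # one-hot-encoded feature matrix
if hasattr(Xt, "toarray"):  # densify sparse encoder output
    Xt = Xt.toarray()
# Column 0 is a single binary indicator: its grid is just {0, 1}, so the
# plot degenerates to two points rather than a proper bar plot.
plot_partial_dependence(lr, Xt, features=[0])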

@glemaitre (Member, Author) commented Aug 26, 2020 via email

@madhuracj (Contributor)

@glemaitre Pull request (still a WIP): #18298

@glemaitre (Member, Author) commented Aug 31, 2020 via email

@amueller (Member)

Is there any literature about this? This feels more like an intervention and less like partial dependence.
It would certainly be useful to have something like it, but we should also make sure to stay in line with the standard phrasing and literature on the topic.

@discdiver (Contributor)

Inferring feature_names, supporting categorical columns (bar plot), and handling a pipeline with make_column_transformer will be awesome additions! Thank you!

glemaitre moved this from TO DO to IN PROGRESS in Guillaume's pet on Jul 16, 2021
glemaitre self-assigned this on Jan 26, 2022
jjerphan pushed a commit that referenced this issue Nov 25, 2022
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Closes #14969